Keywords: agency, power, benchmark, evaluation, safety, alignment
TL;DR: HumanAgencyBench (HAB) is the first benchmark of how LLMs do or do not support human agency, an AI safety risk that is frequently discussed but difficult to study.
Abstract: A common feature of risk scenarios from advanced machine learning systems is the loss of human agency, such as mindless engagement with social media feeds or a long-term loss of control from transformative AI that automates human decision-making. We draw on recent innovations in automating and scaling the evaluation of large language models (LLMs) to create HumanAgencyBench (HAB), a benchmark of human agency support with multiple dimensions, such as correcting misinformation that may be leading the user astray and asking clarifying questions to ensure alignment with user intent. We develop these dimensions by drawing on agency theories in philosophy, cognitive science, and social science. In preliminary evaluations, we find that models tend to generate agency-supporting responses in 65% of test cases, but this varies significantly across developers, models, and dimensions. For example, the most recent version of Claude-3.5-Sonnet (2024-10-22) has the highest average performance at 82%, followed by o1-Preview and, surprisingly, Gemma-2-9B at 71%. HAB demonstrates how discussions of safety with LLMs and other AI agents can be grounded in real-world behavior. However, because of the difficulty and fragility of agency benchmarking, we encourage its use only as a research tool and discourage direct optimization.
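To make the reported percentages concrete, here is a minimal sketch of how per-dimension and overall agency-support rates could be aggregated from automated judge verdicts. The dimension names, data layout, and function are illustrative assumptions for exposition, not the paper's actual evaluation pipeline.

```python
# Illustrative sketch (assumed structure, not HAB's actual pipeline):
# aggregate hypothetical per-test-case judge verdicts into per-dimension
# and overall agency-support rates, mirroring the percentages reported above.
from collections import defaultdict

# Each record: (dimension, agency_supporting), where agency_supporting is a
# boolean verdict from an automated judge on one model response.
verdicts = [
    ("correct_misinformation", True),
    ("correct_misinformation", False),
    ("ask_clarifying_questions", True),
    ("ask_clarifying_questions", True),
]

def agency_support_rates(records):
    """Return per-dimension and overall fractions of agency-supporting responses."""
    counts = defaultdict(lambda: [0, 0])  # dimension -> [supporting, total]
    for dimension, supporting in records:
        counts[dimension][0] += int(supporting)
        counts[dimension][1] += 1
    per_dimension = {d: s / t for d, (s, t) in counts.items()}
    overall = sum(s for s, _ in counts.values()) / sum(t for _, t in counts.values())
    return per_dimension, overall

per_dim, overall = agency_support_rates(verdicts)
print(per_dim)   # e.g. {'correct_misinformation': 0.5, 'ask_clarifying_questions': 1.0}
print(overall)   # e.g. 0.75
```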
Submission Number: 5