Issue Brief: Preliminary Taxonomy of Pre-Deployment Frontier AI Safety Evaluations

As frontier AI systems continue to advance, rigorous and scientifically grounded safety evaluations will be increasingly essential. Although frontier AI holds immense promise for society, the growing capabilities of advanced AI systems may also introduce risks to public safety and security. Ensuring such systems benefit society without compromising safety will depend on the development of robust mechanisms for identifying and mitigating potential harms. Safety evaluations, which aim to measure one or more safety-relevant capabilities or behaviors of a model or system, are a key mechanism by which model or system risks are assessed more broadly. 

A cohesive evaluation ecosystem for frontier AI systems will be critical to their safe and responsible development. Yet current evaluations for frontier AI models and systems differ substantially in their methods, purposes, and terminology. Establishing a shared understanding of the functions and types of evaluations is a key first step toward building a more effective ecosystem. This is especially true for safety evaluations carried out before a model or system is released, which face different constraints than post-deployment evaluations focused on user impacts.

This issue brief offers an initial high-level taxonomy of pre-deployment safety evaluations for frontier AI models and systems. Based on the public literature as well as input from safety experts across the Frontier Model Forum, the brief is part of a broader workstream that aims to inform public discussion of best practices for AI safety evaluations. 

Recommended Taxonomy 

As opposed to more commercially focused evaluations, which typically center on performance metrics, safety evaluations aim to assess the potential risks of a given frontier AI model or system whose capabilities could be misused to cause harm or could lead to unintended harm. Risks refer to outcomes that are not considered desirable (or even intended), that can lead to negative impacts on users, groups, entities, systems, or societies, and that may arise from the behaviors or capabilities of an AI model or system.1

Methodology

Safety evaluations can be distinguished in terms of methodology. For evaluations of AI models or systems themselves, two common methods are:

  • Benchmark evaluations. Benchmark evaluations focus on quantifying the capabilities of a model against standardized criteria, in such a way that results can be compared at scale, over time, and across models and systems. Benchmark evaluations are designed to be replicable and are typically automated, although some benchmarks may involve manual scoring or grading (a simplified scoring sketch follows this list). 
  • Red-teaming exercises. Red-teaming for frontier AI may refer to “a structured testing effort to find flaws and vulnerabilities in an AI model or system, often in a controlled environment and in collaboration with developers of AI.”2 These exercises are usually distinguished from capability evaluations by how adversarial they are (i.e., the extent to which they simulate the actions or behaviors of a particular threat actor). More generally, red-teaming can help elicit specific information about the harmful capabilities of a particular model or system, often by simulating a potential attack or form of deliberate misuse and measuring residual risk. Red-teaming exercises can be effective at identifying and prioritizing potential risks, which can then be measured more systematically to assess their prevalence and the effectiveness of mitigations, and are also effective for exploring novel risks. Although some companies use automated red-teaming, these exercises are often carried out by human experts. 
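
To make the benchmark approach concrete, the sketch below (in Python) shows one way an automated benchmark evaluation might be scored. The model call, the (prompt, reference answer) item format, and the exact-match grader are hypothetical placeholders used for illustration, not any organization’s actual evaluation harness.

    # Minimal, illustrative sketch of an automated benchmark evaluation.
    # `query_model` and the (prompt, reference_answer) item format are
    # hypothetical placeholders, not a real benchmark's implementation.

    def query_model(prompt: str) -> str:
        """Placeholder for a call to the model or system under evaluation."""
        raise NotImplementedError

    def exact_match(response: str, reference: str) -> bool:
        # A simple automated grading criterion; real benchmarks often use more
        # robust graders (multiple-choice parsing, model-based grading, etc.).
        return response.strip().lower() == reference.strip().lower()

    def run_benchmark(items: list[tuple[str, str]]) -> float:
        """Score every item with the same criterion so that results are
        replicable and comparable across models and over time."""
        correct = sum(
            exact_match(query_model(prompt), reference)
            for prompt, reference in items
        )
        return correct / len(items)

Because the item set and grading criterion are fixed, two models scored this way can be compared directly, which is the property that distinguishes benchmark evaluations from more exploratory methods such as red-teaming.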

For evaluations of the impact of AI models or systems on human capabilities, and on actors’ abilities to achieve specific, real-world outcomes, another common method is:

  • Controlled trials. Controlled trials are studies in which human subjects are divided into treatment and control groups, typically with the treatment group having access or exposure to an advanced AI model or system. Controlled trials are often used in “uplift studies” designed to assess the impact of an AI model or system on the performance of human users with respect to tasks that may be risky or harmful. 

Objective

Safety evaluations can also be distinguished in terms of their objective. For example, safety evaluations may include: 

  • Maximal capability evaluations. Maximal capability evaluations seek to assess the nature and scale of capabilities before safety mitigations and guardrails have been put into place. Typically, they seek to assess the maximum attainable limit of a potentially harmful capability for the purpose of identifying needed safety and security mitigations.3 They help to establish a model or system’s upper bound and are generally conducted when models are at their most powerful, typically after instruction fine-tuning is complete and before safety guardrails have been implemented. Maximal capability evaluations can utilize all available tools, including fine-tuning, prompt engineering, and scaffolding, and are more likely to keep humans in the loop to provide oversight. These evaluations should be grounded in a specific threat model (i.e., a particular type of actor’s ability to achieve specific, real-world outcomes in a certain context), such as someone abusing an API or stealing the weights but lacking significant fine-tuning compute or data, rather than assuming an adversary with unlimited resources.
  • Safeguard evaluations. Safeguard evaluations seek to assess the nature and scale of potentially harmful capabilities after safety mitigations and guardrails have been put into place. Assessments of the efficacy of applied safety mitigations are best conducted after a model has been fine-tuned, and then again if the model datamix or mitigations change, the model is further fine-tuned, a new tool is added, or the model is integrated as part of a broader system. A main tool used in safeguard evaluations is prompt engineering. Such evaluations are more likely than maximal capability evaluations to be automated for the purposes of reproducibility and model comparison, though they may also keep humans in the loop (a simplified sketch of such an automated check follows Table 1). 

    Safeguard evaluations may also be categorized as domain-agnostic or domain-specific. Domain-agnostic safeguard evaluations aim to test for general vulnerabilities in the overall model or system; for example, they may test whether the model or system can be jailbroken using adversarial prompting techniques that bypass all safeguards and mitigations. Conversely, domain-specific safeguard evaluations seek to evaluate whether the safeguards and mitigations for a particular domain are effective at preventing some form of misuse, such as the dissemination of sensitive and hazardous information. In the bio domain, for example, a domain-specific safeguard evaluation may test how effectively safeguards and mitigations prevent a model from providing information that is helpful for the creation of novel pathogens.
  • Uplift studies. A common concern about frontier AI is that it may “uplift” the ability of human users to carry out a particular type of attack or harmful behavior compared with what they would achieve without access to a frontier AI model or system. Uplift studies are designed to evaluate that risk, typically by using a controlled trial with a treatment group that has access to frontier AI and a baseline or control group that is limited to alternate resources. For instance, an uplift study in biology may grant a treatment group access to a frontier AI model and a control group access to web search only, and ask each group to complete specific steps of a protocol required for the synthesis of a pathogen. The difference in each group’s ability to complete particular tasks would provide an indication of the marginal risk introduced by frontier AI (a simplified illustration of this comparison follows this list).
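
As a simplified illustration of the comparison an uplift study makes, the sketch below computes per-group task completion rates and their difference. All outcome data are hypothetical, and a real uplift study would also involve careful experimental design, ethics and biosecurity review where relevant, and statistical analysis beyond a single point estimate.

    # Illustrative uplift comparison between a treatment group (access to a
    # frontier AI model) and a control group (alternate resources only, such
    # as web search). All outcome data below are hypothetical.

    def completion_rate(outcomes: list[bool]) -> float:
        """Fraction of participants in a group who completed the task."""
        return sum(outcomes) / len(outcomes)

    # Hypothetical per-participant outcomes: True = task completed.
    treatment_outcomes = [True, True, False, True, False, True]    # AI-assisted
    control_outcomes = [False, True, False, False, False, True]    # web search only

    treatment_rate = completion_rate(treatment_outcomes)
    control_rate = completion_rate(control_outcomes)

    # Absolute uplift: the marginal increase in completion rate associated
    # with access to the frontier AI model in this simplified setup.
    absolute_uplift = treatment_rate - control_rate
    print(f"treatment={treatment_rate:.2f} control={control_rate:.2f} "
          f"uplift={absolute_uplift:+.2f}")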

Table 1: Safety Evaluations

Maximal Capability Evaluations
  • Measurement objective: Assess the upper bound of capabilities that could be misused to cause harm or lead to unintended harm.
  • Measurement purpose: Identify and understand the model’s upper bound of risk and any needed safety and security mitigations.
  • Timing: Typically when models are most performant, which is generally after instruction fine-tuning and prior to the implementation of safety guardrails.
  • Tools / methods: All available tools, including fine-tuning, prompt engineering, and any additional scaffolding relevant to a particular threat model; humans in the loop are more likely.

Safeguard Evaluations
  • Measurement objective: Assess the nature and scale of the risks posed by models or systems with safety and security guardrails in place.
  • Measurement purpose: Guide governance and deployment decisions and determine the need for additional safety mitigations.
  • Timing: After fine-tuning and safety mitigations; again if the model datamix or mitigations change, the model is further fine-tuned, a new tool is added, or the model is integrated as part of a broader system.
  • Tools / methods: Primarily prompt engineering.
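
To illustrate how a domain-agnostic safeguard evaluation might be automated for reproducibility, the sketch below runs a fixed set of adversarial prompts through a safeguarded system and reports the fraction that appear to bypass its refusals. The system call and the refusal heuristic are hypothetical placeholders; in practice, curated prompt sets and far more robust harm classifiers or human review would be used.

    # Illustrative sketch of an automated, domain-agnostic safeguard evaluation.
    # `query_safeguarded_system` and `is_refusal` are hypothetical placeholders.

    def query_safeguarded_system(prompt: str) -> str:
        """Placeholder for a call to the model or system with safeguards applied."""
        raise NotImplementedError

    def is_refusal(response: str) -> bool:
        # Crude heuristic; real evaluations rely on trained classifiers or
        # human reviewers to judge whether a response is actually harmful.
        markers = ("i can't", "i cannot", "i'm unable")
        return response.strip().lower().startswith(markers)

    def bypass_rate(adversarial_prompts: list[str]) -> float:
        """Fraction of adversarial prompts that elicit a non-refusal response,
        i.e., that appear to bypass the applied safeguards."""
        bypasses = sum(
            not is_refusal(query_safeguarded_system(prompt))
            for prompt in adversarial_prompts
        )
        return bypasses / len(adversarial_prompts)

Tracking a metric like this before and after a mitigation change, or after a model is integrated into a broader system, gives a repeatable signal of whether safeguards are holding, consistent with the timing guidance summarized in Table 1.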

Conclusion 

As noted earlier, this preliminary taxonomy aims to inform shared understandings of evaluations. By aligning on a common understanding of frontier AI evaluations, the ecosystem can more easily learn from and build on early efforts.

A robust and effective evaluation ecosystem, however, will require shared understandings of other best practices too. To that end, the Frontier Model Forum aims to publish further issue briefs soon on related topics, such as the depth and frequency of testing.

Footnotes

  1. It is important to note that safety is also a critical dimension of holistic model and system performance, and that the limitations of capabilities can also pose risks to safety. It is also worth noting that the distinction between desirable and dangerous capabilities is sometimes not clear cut, as evidenced by the dual-use nature of capabilities such as the ability to detect cyber vulnerabilities.
  2. White House Executive Order 14110, “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,” October 2023.
  3. See, for example, Phuong et al., “Evaluating Frontier Models for Dangerous Capabilities” (2024).