Frontier AI holds enormous promise for society. From renewable energy to personalized medicine, the most advanced AI models and systems have the potential to power breakthroughs that benefit everyone. Yet they also have the potential to exacerbate societal harms, and introduce or elevate threats to public safety. Evaluating the safety of frontier AI is thus essential for its responsible development and deployment.
Designing and implementing frontier AI safety evaluations can be challenging. Key questions about what to evaluate, how to evaluate it, and how to analyze the results are rarely straightforward. Further, since the metrology of AI safety is still relatively immature, there is little scientific consensus for researchers to draw on when considering how best to evaluate particular safety concerns. Despite those challenges, AI safety researchers and practitioners have started to align on some early best practices for frontier AI safety evaluations.
This issue brief is the first in a series of publications that will aim to document those best practices across the member firms of the Frontier Model Forum (FMF). Based on interviews and workshops with safety experts from across those firms, the series will focus on key practices that are common to the design, implementation, interpretation, and disclosure of frontier AI safety evaluations, regardless of risk domain. Where possible, the series will also reflect input and feedback from the external AI safety research community.
As a starting point, we outline several high-level best practices below. Drawn from different stages in the evaluation lifecycle, the practices are not meant to be exhaustive, but instead to offer preliminary thinking across the design, implementation, and disclosure of frontier AI safety evaluations. We hope they serve as a useful resource for broader public discussion about frontier AI safety evaluations. Future briefs and reports will go into greater depth and detail on specific practices and issue areas.
Early best practices
We recommend the following general practices related to the design and analysis of AI safety evaluations:
- Draw on domain expertise. The design and interpretation of a given AI safety evaluation should be grounded in domain-specific expertise. Evaluations based on mis-specified or under-specified understandings of a particular risk will be less effective than those rooted in detailed threat models and/or deep domain knowledge and scientific understanding of the risk domain. For risks that lie outside their own areas of expertise, AI evaluation practitioners should seek the advice of subject matter experts throughout the entire life cycle of a safety evaluation, including its development, assessment, and improvement. The same applies to safety training and mitigation measures, both of which likely require subject matter expertise to implement with a high degree of accuracy and efficacy.
In cases where there is not yet scientific consensus about the nature or extent of risk, the design and interpretation of an evaluation should endeavor to incorporate a range of expertise and perspectives. Ideally, there should be transparent discussion among a wide range of relevant scientific experts and stakeholders about the pros and cons of key aspects of an evaluation, including the appropriate baseline to use to evaluate marginal risk.
- Evaluate systems as well as models. Many deployed systems include safety interventions or safeguards layered on top of an underlying model or models. In those instances, it is essential to evaluate the overarching system as well as any underlying models, since the system will exhibit different behaviors. Evaluating both the system and the model not only sheds light on the effectiveness of the implemented safeguards, but also provides a more comprehensive understanding of the system’s overall safety profile.
Evaluating the full system is also especially important for assessing the safety of the deployed product. Since frontier AI systems typically do not expose underlying models directly without safeguards, evaluating the system as a whole is the most faithful way to assess the product’s safety before, during, and after launch.
- Consider evaluating marginal risk. When evaluations are intended to directly measure the risk posed by a system, in many cases they should consider the marginal risk relative to other applications. For example, both general-purpose frontier AI systems and web search engines can perform information retrieval tasks. If a user asks either for potentially dangerous information, such as how to develop explosives, both may be capable of providing accurate information on how to do so. To isolate the novel risk posed by the frontier AI system, evaluations should focus on how capable the system is of supplying high-risk information beyond what is already accessible through web search.
Absolute risks may be more appropriate to evaluate in other cases. For example, it can be more important to understand whether a system exhibits harmful biases at all than to measure how much more or less biased it is compared to alternatives. But evaluating absolute risk should not be the default in every case; evaluators should consider whether to evaluate marginal or absolute risk depending on context.
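To make the distinction concrete, the sketch below shows one way a marginal-risk comparison could be structured in Python. It is a minimal illustration: the helper functions it takes as arguments (query_ai_system, query_web_search_baseline, judge_provides_uplift) are hypothetical placeholders for whatever querying and grading infrastructure an evaluator actually uses, and a real evaluation would also require adequate sample sizes and appropriate statistical analysis.

```python
# Minimal sketch of a marginal-risk comparison. The helper callables
# (query_ai_system, query_web_search_baseline, judge_provides_uplift) are
# hypothetical placeholders for an evaluator's own querying and grading tooling.

from statistics import mean

def estimate_marginal_risk(prompts, query_ai_system, query_web_search_baseline,
                           judge_provides_uplift):
    """Estimate how much more often the AI system supplies actionable
    high-risk information than an already-available baseline (e.g., web search)."""
    system_rate = mean(
        judge_provides_uplift(query_ai_system(p)) for p in prompts
    )
    baseline_rate = mean(
        judge_provides_uplift(query_web_search_baseline(p)) for p in prompts
    )
    return {
        "system_rate": system_rate,      # absolute rate for the AI system
        "baseline_rate": baseline_rate,  # absolute rate for the baseline
        "marginal_risk": system_rate - baseline_rate,  # estimated uplift over the baseline
    }
```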
We recommend the following general practices for the implementation of AI safety evaluations:
- Account for prompt sensitivity. Small differences in how evaluators prompt frontier AI models can yield large differences in evaluation metrics and benchmark scores. The specific wording of a prompt matters greatly: subtle variations can lead to different outputs, both from a single model and when comparing across models. For instance, asking “How can I make an explosive device?” may trigger different responses from a model or system than “How do I make homemade fireworks?”, even though both are designed to elicit and test for high-risk information. For prompts that aim to evaluate culturally or contextually specific risks and harms, subtle differences in wording can matter even more.
To provide a more robust understanding of the risks posed by an AI model or system, evaluations should therefore incorporate multiple kinds of prompts for a given task. For example, practitioners may consider using automated prompt generation or other techniques to increase the scale and diversity of prompts used to evaluate the safety risks of a model or system (see the sketch following this list).
- Evaluate against both intended use and adversarial exploitation. Evaluating a model or system against either expected user behavior or adversarial attacks alone is not enough to fully understand its safety risks. Evaluating the risks posed by a system solely under intended use conditions will miss critical risks posed by a wide array of threat actors, from non-expert individuals to sophisticated, well-resourced groups, who seek to exploit novel capabilities for malicious purposes. Likewise, evaluating solely against adversarial exploitation will miss potential unintended risks that emerge through normal or non-malicious user behavior.
The methods and expertise required to evaluate for both kinds of risks vary significantly, but each form of evaluation is essential. Evaluating how a frontier AI system behaves under both intended and adversarial conditions provides a far more robust understanding of the risks it may pose.
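The sketch below illustrates how these two implementation practices might fit together in a single evaluation loop: each task is evaluated across multiple prompt phrasings, under both intended-use and adversarial conditions. It is a simplified illustration, and the helper functions are hypothetical placeholders rather than references to any particular tooling.

```python
# Minimal sketch of an evaluation loop that accounts for prompt sensitivity and
# covers both intended-use and adversarial conditions. The helpers
# (generate_prompt_variants, query_system, judge_unsafe) are hypothetical
# placeholders for an evaluator's own paraphrasing, querying, and grading tools.

from statistics import mean

def run_safety_eval(tasks_by_condition, generate_prompt_variants, query_system,
                    judge_unsafe, variants_per_task=10):
    """Return the unsafe-response rate per condition, aggregated over
    multiple phrasings of each task."""
    results = {}
    for condition, tasks in tasks_by_condition.items():  # e.g., "intended_use", "adversarial"
        unsafe_flags = []
        for task in tasks:
            # Evaluate several phrasings of the same task, since small wording
            # changes can produce large differences in model behavior.
            for prompt in generate_prompt_variants(task, n=variants_per_task):
                unsafe_flags.append(judge_unsafe(query_system(prompt)))
        results[condition] = mean(unsafe_flags)  # unsafe-response rate for this condition
    return results
```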
We recommend the following general practices for the disclosure of AI safety evaluations:
- Take a nuanced approach to evaluation transparency. Transparency is a key dimension of AI safety evaluations, but an important balance needs to be struck for evaluations to remain effective. Increased transparency can help developers and researchers learn about and advance safety evaluations. The more transparency there is around the dataset, methodology, and analysis of an evaluation, the easier the evaluation is to reproduce and understand. Greater transparency in these aspects also makes it easier for independent experts to assess the validity of, and come to consensus on, the implications of an evaluation. Conversely, highly opaque safety evaluations may make it more difficult to align on the necessity of certain mitigation measures.
At the same time, greater transparency can create information hazards in high-risk domains. It can also degrade evaluation efficacy: the more information about an evaluation’s design that is made available, the easier it is for developers with malicious intent to game it. Further, if the full test set of an evaluation is disclosed publicly, the test questions may leak into future models’ training data, making the evaluation’s results more difficult to trust. There may also be legal protections around certain types of information.
A promising way to balance these concerns is to provide transparency into a subset of prompts and/or data while keeping another subset hidden. This enables external experts to assess the validity of the evaluation while also preventing overfitting and memorization.
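As a simplified illustration of this approach, the sketch below partitions an evaluation set into a disclosed subset and a held-out subset. The split fraction and seeding are illustrative assumptions; in practice the partition would be designed so that both subsets remain representative of the risks being evaluated.

```python
# Minimal sketch of splitting an evaluation set into a disclosed subset and a
# held-out subset. The disclosure fraction and the fixed seed (for an internally
# reproducible split) are illustrative assumptions.

import random

def split_for_disclosure(eval_items, disclosed_fraction=0.2, seed=0):
    """Return (disclosed, held_out) partitions of an evaluation set."""
    items = list(eval_items)
    random.Random(seed).shuffle(items)
    cutoff = int(len(items) * disclosed_fraction)
    disclosed = items[:cutoff]  # published so external experts can assess validity
    held_out = items[cutoff:]   # kept private to limit gaming and training-data leakage
    return disclosed, held_out
```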
As noted above, we hope these practices serve as a helpful resource for public understanding and discussion of AI safety evaluations. We aim to update and elaborate on these and related practices in more depth in future publications.