Safety frameworks have recently emerged as an important tool for frontier AI safety. By specifying capability and/or risk thresholds, safety evaluations, and mitigation strategies for frontier AI models in advance of their development, safety frameworks position frontier AI developers to address potential safety challenges in a principled and coherent way. Both government and industry recognized the importance of safety frameworks through the Frontier AI Safety Commitments announced at the AI Seoul Summit in May 2024.
Yet safety frameworks remain a relatively nascent concept. Only a handful of firms have published safety frameworks to date, and until recently few research organizations had published on the topic. While there is an emerging consensus about the function and core components of safety frameworks, there is still a clear need for further research,1 related norms, and established guidance to enable implementation.
This brief proposes a set of core components for inclusion in safety frameworks. Drawing on the Frontier AI Safety Commitments, published member firm frameworks, and expert input, this piece reflects a preliminary consensus among member firms about how to structure safety frameworks. However, we note that other approaches may emerge in the future that also meet the Frontier AI Safety Commitments. We hope this brief will serve as a useful resource for broader discussion about how to develop frontier AI safety frameworks. Future briefs will explore key elements of safety frameworks in greater depth.
Components of Safety Frameworks
Frontier AI safety frameworks are designed to enable developers to take a robust, principled, and coherent approach to anticipating and addressing the potential safety challenges posed by frontier AI.
At a high level, safety frameworks include the following components:
- Risk Identification
Frontier AI safety frameworks are intended to manage potential severe threats to public safety and security.2 To effectively manage these threats, safety frameworks should identify and analyze model risks stemming from advanced capabilities in chemical, biological, radiological, and nuclear (CBRN) weapons development and cyber attacks.3 As the technology evolves and there is more research conducted on frontier AI risks, additional risk domains may emerge, such as systems with increasingly autonomous capabilities. However, there is not broad consensus yet about what these risk domains might or should be.
Firms should also explain how they plan to develop an accurate picture of model risks and concerning capabilities, such as by carrying out threat modeling exercises.4
- Capability and Risk Thresholds
To date, capability thresholds have largely been used in safety frameworks as a proxy for risk.5 These are specific capabilities at which, absent mitigation measures, models may pose unacceptable or intolerable levels of risk to society. While using capability thresholds is current best practice, this is an area of active research, and risk thresholds may also be defined using likelihood or severity estimates of specific risks or other relevant risk factors, as appropriate. Future issue briefs will explore this and related issues, such as the differences between capability, compute, and quantitative risk thresholds, as well as their associated benefits and shortcomings.
Safety frameworks should define clear capability or risk thresholds at which the risks posed by their models would merit heightened safeguards and at which such risks (if unmitigated) would be deemed unacceptable.6 The framework should provide a rationale for why these thresholds were chosen and, as appropriate, disclose how firms received input from external experts or other stakeholders, such as home governments, when defining their thresholds.7
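To make this concrete, the following is a minimal sketch of how a firm might record capability thresholds together with the safeguards each one triggers. The class, field names, and example entries are all hypothetical and do not reflect any published framework.

```python
from dataclasses import dataclass

@dataclass
class CapabilityThreshold:
    """A hypothetical capability threshold and the response it triggers."""
    risk_domain: str                # e.g., "cbrn" or "cyber"
    capability: str                 # capability that, unmitigated, poses unacceptable risk
    rationale: str                  # why this threshold was chosen
    required_safeguards: list[str]  # mitigations required once the threshold is reached

# Purely illustrative entries; real thresholds would be defined with input
# from domain experts and, as appropriate, home governments.
EXAMPLE_THRESHOLDS = [
    CapabilityThreshold(
        risk_domain="cbrn",
        capability="Provides meaningful uplift to a novice seeking to produce a biological agent",
        rationale="Derived from threat modeling with domain experts",
        required_safeguards=["enhanced weight security", "deployment misuse filters"],
    ),
    CapabilityThreshold(
        risk_domain="cyber",
        capability="Autonomously discovers and exploits critical vulnerabilities",
        rationale="Derived from cyber offense threat models",
        required_safeguards=["staged access controls", "pre-release expert red-teaming"],
    ),
]
```

Pairing each threshold with an explicit rationale and required safeguards, as in this sketch, makes the framework easier for external stakeholders to scrutinize.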
- Capability and Risk Assessments
Safety frameworks should outline the process the firm will follow to assess risks related to capability and risk thresholds.8 More specifically, frameworks should describe how the firm plans to assess whether a given threshold has been met or surpassed. This should include a description of the planned approach to frontier AI safety evaluations.9
Safety frameworks should also outline the procedural components for risk assessments, including when to conduct risk assessments (e.g. before deployment, and before and during training) and the frequency of such assessments (e.g. assessing risks every 2x, 4x, or 6x increase in capabilities on relevant dimensions). They may also include a rationale for this approach.
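As a minimal sketch of how such a cadence might be operationalized, the function below flags when a new assessment is due. It assumes, hypothetically, that capability growth is tracked through a single scalar proxy (such as effective training compute); the function name, parameters, and the 4x default are all invented for illustration.

```python
def assessment_due(current_scale: float, last_assessed_scale: float,
                   trigger_multiplier: float = 4.0) -> bool:
    """Return True once capability scale has grown past the trigger multiple.

    Both scale arguments are hypothetical scalar proxies for capability
    (e.g., effective training compute); the 4x default mirrors the
    illustrative cadences mentioned above.
    """
    if last_assessed_scale <= 0:
        raise ValueError("last_assessed_scale must be positive")
    return current_scale / last_assessed_scale >= trigger_multiplier

# Example: 5x the proxy value of the last assessed model triggers
# a fresh risk assessment under a 4x policy.
assert assessment_due(current_scale=5.0, last_assessed_scale=1.0)
```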
Firms should also consider committing to sharing information about their risk assessments and evaluations publicly as appropriate, while being mindful of safety and security risks (as well as other potential issues, such as creating information hazards or contaminating evaluations). For example, firms might consider sharing a report confirming that a safeguard has caused a model to meet the predefined standard for safety.
- Risk Mitigation
Safety frameworks should, to the extent possible while maintaining the effectiveness and integrity of such measures, describe the mitigation measures firms plan to apply at each threshold to reduce the risk posed to an acceptable level.10 These may include, for example, security or containment measures to ensure a model with dangerous capabilities is not stolen, as well as deployment mitigations or safety measures to manage the risk of misuse once a model is deployed.
If a model or system’s capabilities exceed a pre-stated capability threshold, or its risks exceed a pre-stated risk threshold, and the firm has not developed effective safeguards for that threshold, the firm commits to taking appropriate action, such as restricting further development and public deployment, until effective safeguards have been developed. This commitment is present in all existing safety frameworks and is a key requirement of the Frontier AI Safety Commitments. Safety frameworks may also outline the criteria for determining when models no longer create unacceptable risks and are safe to continue training and/or deploying.
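The conditional logic of this commitment can be restated in a short sketch. The function and names below are invented for illustration; in practice, determining whether safeguards are effective is a matter of expert judgment rather than a boolean check.

```python
from enum import Enum, auto

class Action(Enum):
    PROCEED = auto()
    RESTRICT_DEVELOPMENT_AND_DEPLOYMENT = auto()

def required_action(threshold_exceeded: bool,
                    effective_safeguards_in_place: bool) -> Action:
    """Illustrative restatement of the commitment: if a pre-stated threshold
    is exceeded before effective safeguards exist, further development and
    public deployment should be restricted until they do."""
    if threshold_exceeded and not effective_safeguards_in_place:
        return Action.RESTRICT_DEVELOPMENT_AND_DEPLOYMENT
    return Action.PROCEED
```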
Finally, safety frameworks should outline the process a firm intends to follow to continually assess and monitor the adequacy of mitigation measures.11
- Risk Governance
To the extent feasible and not covered by other governance materials, safety frameworks may list the commitments a firm has in place to ensure internal compliance with the framework.12 Safety frameworks should also specify the internal and external oversight mechanisms in place to ensure proper implementation of the framework, as appropriate.
Firms should also specify a process and approach for updating the framework over time, including any predefined trigger points for updates. This may include detailing, if appropriate, any internal processes and special approvals required to make changes. Firms should also consider noting when material updates to or revisions of a safety framework will be made public.
Finally, the framework should indicate how external actors were involved in the process of developing the framework, if applicable.13 Future briefs will elaborate on specific governance and transparency mechanisms in more depth. These briefs may describe internal processes for ensuring adherence to safety frameworks, as well as external processes by which third parties may assess adherence. They may also elaborate on the development and publication, as appropriate, of model cards, safety evaluations, and other safety-specific documentation.
Table 1: Components of Safety Frameworks
Component | Description | Sub-Components
---|---|---
Risk Identification | Identify potential safety and security risks stemming from future frontier models, especially those based on advanced capabilities. | Risk domains (e.g., CBRN, cyber); threat modeling exercises.
Capability and Risk Thresholds | Define and explain the rationale behind critical capability and risk thresholds. | Threshold definitions; rationale; input from external experts and home governments, as appropriate.
Capability and Risk Assessment | Outline the process for conducting pre-mitigation capability and risk assessments, including the planned approach to frontier AI safety evaluations. | Safety evaluations; timing and frequency of assessments; information sharing, as appropriate.
Risk Mitigation | Explain the risk mitigation measures in place to ensure critical thresholds are not passed and risk remains within tolerable levels. | Security and containment measures; deployment mitigations; criteria for restricting development or deployment; monitoring the adequacy of mitigations.
Risk Governance | Indicate the internal accountability and governance frameworks for implementing the safety framework, including the process for updating it, and outline the approach to transparency and accountability in implementation. | Internal compliance and oversight mechanisms; framework update processes; external involvement and transparency.
Conclusion
Based on existing safety frameworks and the recent Frontier AI Safety Commitments, the preliminary list of core components above offers a strong foundation for the development of frontier AI safety frameworks. Though this guidance is non-exhaustive and preliminary, we hope it serves as a useful starting point for firms seeking to create their own safety frameworks and a resource for broader discussion about frontier AI safety frameworks. Future briefs in this workstream will explore elements of safety frameworks in greater depth, from the tradeoffs associated with different thresholds to the timing and frequency of capability assessments. Our ultimate aim with this and future briefs is to inform public debate about how best to design and implement safety frameworks.
Footnotes
1. See, for example, Google DeepMind’s Frontier Safety Framework, OpenAI’s Preparedness Framework, Anthropic’s Responsible Scaling Policy, and Magic’s Readiness Policy. For early thinking on safety frameworks, see METR’s (formerly ARC Evals) post on the topic.
2. Each component in this brief corresponds to one or more sections in the Frontier AI Safety Commitments. Component 1 corresponds to Commitment I.
3. For examples of how safety frameworks describe risk identification, as well as the other categories in the table, see METR’s recent publication on “common elements of frontier AI safety frameworks.”
4. A threat modeling exercise is a structured, repeatable process through which actors can determine the security implications of a model or system by investigating the pathways through which the model or system may cause harm. Domain expertise for each threat area is paramount when conducting these exercises. Threat modeling exercises can also be used to inform subsequent capability assessments and mitigation measures.
5. See Koessler et al. 2024 for further reading on capability and risk thresholds.
6. Corresponds to Commitment II.
7. Commitment I states that “These thresholds should be defined with input from trusted actors, including organisations’ respective home governments as appropriate.”
8. Corresponds to Commitments I and II.
9. Frontier AI safety evaluations aim to identify potential risks or safety concerns associated with frontier AI models and systems. For more, see our issue brief on early best practices for frontier AI safety evaluations.
10. Corresponds to Commitments III, IV, and V.
11. Commitment V calls on companies to “Continually invest in advancing their ability to implement commitments…[including] processes to assess and monitor the adequacy of mitigations, and identify additional mitigations as needed to ensure risks remain below the pre-defined thresholds.”
12. Corresponds to Commitments VI, VII, and VIII.
13. According to Commitment VIII, firms commit to explaining within their framework how external actors, if at all, “are involved in the process of assessing the risks of their AI models and systems, the adequacy of their safety framework … and their adherence to that framework.”