Frontier AI-bio safety evaluations aim to test the biological capabilities and, by extension, the potential biosafety implications of frontier AI. Because the science of AI safety evaluations is still nascent, the evaluations themselves vary widely in both purpose and methodology. A key first step in building an effective safety evaluation ecosystem for the AI-bio space is therefore developing a shared understanding of both the function and types of safety evaluations.
This issue brief offers an initial taxonomy and definitions for frontier AI safety evaluations specific to the biological domain, categorized across two dimensions: methodology and domain. Based on input from FMF member firm experts, in addition to a diverse group of external experts from the advanced AI and biological research fields, this brief aims to document and build a preliminary consensus around the current understanding of frontier AI-bio safety evaluations.
Evaluation Methods
The first dimension by which AI-bio safety evaluations are categorized is methodology. Evaluation methodology describes how the frontier AI model or system is evaluated, i.e., the study design.
While evaluation studies may combine several of these methods, most existing evaluation tasks rely on one of three main methods. For evaluations of AI models or systems themselves, two common methods are:
- Benchmark Evaluations: Sets of safety-relevant questions or tasks designed to test model capabilities and assess how answers differ across models. These evaluations aim to provide baseline indications of general or domain-specific capabilities that are comparable across models. Benchmarks are designed to be easily repeatable and are typically automated, though grading can also incorporate expert human review. In the biological domain, benchmarks may include knowledge benchmarks (e.g., multiple-choice QA, open-ended questions), capability benchmarks (e.g., agentic tests), or safeguard evaluations (e.g., refusals testing for harmful queries). A minimal illustrative sketch of benchmark-style scoring follows this list.
- Red-Team Exercises: Dynamic, adversarial, and interactive evaluations meant to elicit specific information about the harmful capabilities of a particular model or system, often by simulating a potential attack or form of deliberate misuse and then measuring residual risk. Although automated red-teaming approaches are under development, these exercises are generally carried out by human actors, including red-teaming experts; a key element is the dynamic interaction between the human experts and the model. Red-team exercises can further be distinguished from benchmark evaluations by their emphasis on assessing the effectiveness of existing safeguards in an adversarial setting and their use in novel risk exploration.
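To make the benchmark method concrete, the sketch below shows how a simple multiple-choice benchmark might be scored. It is an illustrative example only: `query_model` is a hypothetical stand-in for whatever model interface an evaluator uses, and the items are toy placeholders rather than content from any real biosecurity benchmark.

```python
"""Minimal sketch of a multiple-choice benchmark evaluation harness.

`query_model` is a hypothetical stand-in for an evaluator's model
interface; the items below are benign toy placeholders, not a real
benchmark.
"""
from typing import Callable, Dict, List


def score_benchmark(
    query_model: Callable[[str], str],
    items: List[Dict[str, str]],
) -> float:
    """Return the fraction of multiple-choice items answered correctly."""
    correct = 0
    for item in items:
        prompt = (
            f"{item['question']}\n"
            f"Options: {item['options']}\n"
            "Answer with the letter of the correct option."
        )
        answer = query_model(prompt).strip().upper()
        if answer.startswith(item["answer"]):
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    # Toy item with benign content; real biosecurity benchmarks use
    # expert-written, safety-relevant questions.
    toy_items = [
        {"question": "Which molecule carries genetic information?",
         "options": "A) ATP  B) DNA  C) Lipid",
         "answer": "B"},
    ]

    # A trivial mock "model" so the sketch runs end to end.
    def mock_model(prompt: str) -> str:
        return "B"

    print(f"Accuracy: {score_benchmark(mock_model, toy_items):.2f}")
```

Real benchmark harnesses layer prompt templating, answer parsing, refusal handling, and statistical reporting on top of this basic scoring loop.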
For evaluations of the impact of AI models or systems on human capabilities and actors’ abilities to achieve specific, real-world outcomes, a common approach is to use the following methodology:
- Controlled Studies: Often used in uplift studies, controlled trials are designed to assess the extent to which advanced AI models/systems improve the capability of human actors to achieve a particular task relative to human actors using alternative resources or existing tools, such as internet search. These studies can be in-silico or wet lab-based, and typically use a randomized controlled trial (RCT) or similar treatment-vs.-control design to form a grounded assessment of the counterfactual impact of a model on human actor capabilities. A sketch of the core treatment-vs.-control comparison follows below.
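The sketch below illustrates the core statistical comparison behind such a study: task success rates in a model-assisted (treatment) group versus a control group using existing resources. The counts are invented for illustration; real uplift studies use pre-registered designs, expert grading rubrics, and more careful statistics than this simple two-proportion z-test.

```python
"""Sketch of the analysis behind a simple uplift (controlled) study:
compare task success rates for a model-assisted group against a control
group using alternative resources. Counts are illustrative only."""
import math


def uplift_z_test(success_treat: int, n_treat: int,
                  success_ctrl: int, n_ctrl: int) -> tuple[float, float]:
    """Return (uplift, z): the difference in success rates and the
    two-proportion z-statistic using a pooled variance estimate."""
    p_t = success_treat / n_treat
    p_c = success_ctrl / n_ctrl
    pooled = (success_treat + success_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    return p_t - p_c, (p_t - p_c) / se


if __name__ == "__main__":
    # Hypothetical counts: 18/50 treatment successes vs. 11/50 control.
    uplift, z = uplift_z_test(18, 50, 11, 50)
    print(f"Estimated uplift: {uplift:.2%}, z = {z:.2f}")
```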
Evaluation Domains
The second dimension is the domain of evaluations. The evaluation domain covers the particular capability, modality, or outcome under evaluation, and can inform risk assessments for a particular threat model. Just as AI-bio safety evaluations can leverage multiple methods, they may also seek to assess more than one of these domains at once. To best assess a model's capabilities, many biosafety evaluations apart from safeguard evaluations are carried out after a base model has been post-trained, but before mitigations have been put in place.
We identify four domains of evaluations:
- Scientific Knowledge Evaluations: Evaluate whether an AI model or system possesses general scientific knowledge, including biological facts and concepts.
- Scientific Reasoning Evaluations: Evaluate whether an AI model or system can perform the complex, multi-step research and reasoning tasks needed to advance scientific knowledge, especially knowledge relevant to biological research. These evaluations may assess an AI model or system’s ability to produce a literature review, interpret graphical information, or perform other research-essential skills.
- Hazardous Biological Knowledge Evaluations: Evaluate whether an AI model or system is able to provide the detailed, domain-specific knowledge that is necessary for a particular step in the end-to-end process of biological threat creation. These evaluations may test both the direct knowledge needed to carry out a particular step in that process and the tacit knowledge needed to troubleshoot a given step. Drawing on previous research and current practice,1 we separate the biological threat creation process into six operational steps:
- Ideation: Evaluate whether the model provides knowledge that helps actors generate or assess ideas for bioweapons development. This includes knowledge of historical bioweapons and bioterrorism use, research on enhanced potential pandemic pathogens, etc.
- Design: Evaluate whether a model or system can provide sensitive knowledge that can assist with the design of novel or enhanced biological threat agents, such as by helping with the use of biological design tools or with troubleshooting an in-silico experiment.
- Acquisition: Evaluate whether a model or system can provide knowledge that helps actors acquire the materials and equipment required for the creation of a biological threat or weapon. This includes knowledge relevant to contracting from cloud labs, helping to obscure orders for DNA synthesis, evading export controls, retrieving and analyzing hazardous DNA sequences, etc.
- Build: Evaluate whether a model or system can provide knowledge that helps actors build or develop biological weapons. This could include knowledge relevant for assisting with or troubleshooting culturing an agent to produce weaponizable quantities (i.e. magnification), formulating and stabilizing an agent for intended release (i.e. formulation), or producing and synthesizing a novel pathogen.
- Release: Evaluate whether a model or system can provide knowledge that helps actors plan the release of an agent against targeted populations. This would include knowledge relevant for aerosolization of a virus, for example, or knowledge about targeting another transmission mechanism.
- Amplify: Evaluate how the harmful results of a particular attack could be amplified by the use of a model or system. For example, a model could be used to facilitate a complementary social engineering campaign that increases the societal impact of a biological attack without altering its physical impact.
- Automated Processes Evaluations: Evaluate whether a model or system is capable of directly automating or outsourcing processes for biological research or weapons development. These evaluations can occur purely in a virtual environment, but may also evaluate capabilities in the physical world. The key distinction is that they evaluate a model or system’s ability to independently carry out steps necessary for biological threat creation, and not just whether it has knowledge of how to do so.
In addition, some domain evaluations may also be classified as safeguard evaluations, which aim to assess the nature and scale of harmful knowledge or capabilities after safety mitigations and guardrails have been put into place. For example, safeguard evaluations may be used to evaluate whether a system continues to output hazardous biological information even after guardrails have been added to prevent such outputs (i.e. in specific deployment scenarios where system-level safety features are in place).
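As an illustration of the safeguard evaluations described above, the sketch below measures how often a deployed system refuses a set of disallowed prompts. Both `query_system` and the keyword-based `looks_like_refusal` heuristic are hypothetical simplifications; production safeguard evaluations rely on curated prompt sets and calibrated graders rather than keyword matching.

```python
"""Sketch of a safeguard (refusal) evaluation: estimate how often a
deployed system declines clearly disallowed requests. `query_system`
and the keyword heuristic below are illustrative stand-ins only."""
from typing import Callable, List

# Crude, illustrative markers of a refusal; real graders are far richer.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def looks_like_refusal(response: str) -> bool:
    """Heuristic check: treat known refusal phrasings as refusals."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def refusal_rate(query_system: Callable[[str], str],
                 disallowed_prompts: List[str]) -> float:
    """Fraction of disallowed prompts the system refuses to answer."""
    refusals = sum(
        looks_like_refusal(query_system(p)) for p in disallowed_prompts
    )
    return refusals / len(disallowed_prompts)
```

In practice, the same prompt set is typically run against the system both before and after mitigations are applied, so the change in refusal rate (and in residual harmful output) can be attributed to the safeguards themselves.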
Conclusion
Developing a shared understanding of the design and implementation of AI-bio safety evaluations is an important first step in managing the risks associated with frontier AI models. This brief has outlined the various methods used and domains targeted by the early suite of AI-bio safety evaluations.
The taxonomy presented in this brief reflects the current state of AI-bio safety evaluations for LLMs. As the technology develops, actors in the space will likely need to develop other types of evaluations that do not fit into any of these categories. The Frontier Model Forum is committed to advancing this field of study. Our next steps involve mapping specific threat models onto the taxonomy we’ve presented, which will help identify gaps in current evaluation methodologies.
Footnotes
- See for example OpenAI’s report separating the process into a similar five steps, the figure on pg. 3 in the Centre for Long-Term Resilience report on biological weapons development, and a similar figure on pg. 6 in the WMDP paper. ↩︎
Appendix: Table of Publicly Disclosed AI-Bio Safety Evaluations
Below is a non-exhaustive list of public resources that document or reference AI safety evaluations of bio risks. The resources include research papers that document a specific safety evaluation in detail, as well as blog posts and model cards that may refer to a safety evaluation that occurred without describing it in depth.
| Year | Author | Name | Method | URL |
|---|---|---|---|---|
| 2024 | Anthropic | Claude 3.5 Sonnet Model Card | Benchmark Evals, Controlled Study | Model Card |
| 2024 | Future House | LAB-Bench: Measuring Capabilities of Language Models for Biology Research | Benchmark Evals | Paper |
| 2024 | GDM | Gemini 1.5 Pro Model Card | Benchmark Evals | Model Card |
| 2024 | GDM | Evaluating Frontier Models for Dangerous Capabilities | Benchmark Evals, Red-team Evals | Paper |
| 2024 | Ivanov | BioLP-bench: Measuring understanding of biological lab protocols by large language models | Benchmark Evals | Paper |
| 2024 | Li et al. | The WMDP Benchmark | Benchmark Evals | Paper |
| 2024 | Meta | Llama 3.1 Research Paper | Controlled Study | Paper |
| 2024 | Meta | Llama 3 Model Card | Controlled Study | Model Card |
| 2024 | OpenAI | Building an early warning system for LLM-aided biological threat creation | Controlled Study | Post |
| 2024 | OpenAI | o1 System Card Bio Threat Creation Evaluations | Benchmark Evals, Red-team Evals | System Card |
| 2024 | OpenAI | GPT-4o System Card | Benchmark Evals, Controlled Study | System Card |
| 2024 | RAND | The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study | Red-team Evals, Controlled Study | Paper |
| 2024 | SecureBio | Lab Assistance Benchmark – Multimodal | Benchmark Evals | Post |
| 2024 | UK AISI | Advanced AI evaluations at AISI | Benchmark Evals | Post |
| 2024 | US AISI & UK AISI | US AISI and UK AISI Joint Pre-Deployment Test – OpenAI o1 | Benchmark Evals, Red-team Evals | Report |
| 2023 | Gopal et al. | Will releasing the weights of future large language models grant widespread access to pandemic agents? | Red-team Evals | Paper |
| 2023 | OpenAI | GPT-4 System Card | Red-team Evals | System Card |
| 2023 | Sarwal et al. | BioLLM-Bench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics | Benchmark Evals | Paper |