Key Topics

Explore critical areas where cybersecurity expertise is essential to securing powerful AI systems

Technical governance

Ensuring AI is secure and beneficial to society requires new technology and research that can enable better AI policy and regulation. Some examples include:

The Trump administration recently released their AI Action Plan, which includes many areas where technology and policy need to interact for security and safety, and was informed by many external experts.

Researchers from GovAI published a live list of Open Problems Problems in Technical AI Governance, subdivided into skill sets, including cybersecurity, cryptography, hardware, software, ML and more; and covers the entire AI development lifecycle and relevant governance capacities.

The United Kingdom founded the UK AI Security Institute, which is the leading government agency in researching and building solutions for a secure AI future. They operate as a tech startup with the backing of a government, and have found significant vulnerabilities in leading models.

Advanced AI chips and GPUs play a huge role in shaping the AI future, with America trying to impose export restrictions. A documentary outlines how Nvidia chips are being smuggled into China, and there are new technologies being built to ensure secure, governable chips, such as location verification and other Hardware Enabled Mechanisms for verifying responsible AI deployment.

AI infrastructure security

Protecting frontier models from theft and misuse is a critical mission. Although the leading AI companies have incentives to prevent corporate espionage, the risks from a rogue government or terrorist group accessing top models may outweigh the companies incentive and skill level in defending against capable actors. Important developments include:

RAND authored a playbook for Securing AI Model Weights where they define necessary security levels (including Security Level 5 - defending against highly-resourced nation states), and map the current state of frontier AI company security, which they estimate at SL2 - secured against individual professional hackers. RAND’s playbook was instrumental in shaping frontier companies security policies.

Anthropic, Google DeepMind, OpenAI and Meta AI have published detailed security, responsibility and preparedness frameworks outlining their model protection strategies and implementation plans. xAI has also published a less detailed framework. Anthropic triggered their risk framework when they released their latest model, OpenAI detailed how they secure research infrastructure, and Meta is working on building guardrails for open-source - which is not easy.

Situational Awareness has argued for increased securitization of leading AI companies in light of the growing China-USA AI race.

A critical vulnerability was found in Anthropic's MCP enabling adversaries to read or write anywhere on disk and trigger code execution.

Unique AI attack vectors

AI systems open up entirely new attack surfaces that extend well beyond traditional cybersecurity. These vectors can be exploited to bypass safety guardrails, compromise data integrity, or weaponize advanced models in ways conventional defenses were never designed to anticipate.

Prompt-based attacks include direct prompt injections (malicious instructions embedded in user queries) and indirect ones, where hidden instructions are buried in external content like documents, websites, or RAG sources. Jailbreaking remains the most visible example, with challenges like Gandalf and HackAPrompt demonstrating how easily safety layers can be bypassed.

Adversarial ML attacks exploit carefully crafted inputs to make models misclassify or misbehave, and can be escalated by chaining weaknesses across multiple models or agent systems. Researchers have mapped out these threats in detail in recent papers.

Data poisoning injects malicious or misleading data into training or fine-tuning sets, embedding stealthy backdoors that are easy to trigger but hard to detect once deployed. Wiz and Fuzzy Labs both highlight the risks of membership inference and training-set contamination.

Data and model extraction threats range from weight exfiltration (theft of proprietary model parameters) to training data reconstruction and membership inference. These attacks are not hypothetical: multiple studies have shown that sensitive information can be recovered from large models, with new techniques emerging rapidly.

Fine-tuning for misuse allows adversaries to subtly strip away guardrails while preserving fluency, transforming general-purpose models into specialized offensive tools for cyber exploitation or bio/chem guidance. The UK AI Security Institute has warned that defending fine-tuning APIs may have fundamental limitations.

Infrastructure and pipeline risks expose the foundations of AI systems. Vulnerabilities in the Model Context Protocol (MCP) have already enabled arbitrary code execution and disk access. Retrieval-Augmented Generation (RAG) pipelines are vulnerable to database manipulation or injection. And supply chain attacks such as Escaperoute or Gemini calendar hijacks show how easily upstream components can be compromised.

Preventing loss of control with security techniques

AI control is a security-first complement to AI safety: the goal is to design and verify protocols so that even a misaligned but powerful model cannot cause unacceptable outcomes. This means treating advanced systems as untrusted by default, stress-testing them under realistic attack conditions, and counting “detect and shut down” as success rather than failure.

RAND Europe examined scenarios where AI systems slip beyond intended oversight, proposing a security posture that focuses on rapid detection and containment of loss-of-control incidents, rather than assuming perfect alignment will always hold.

IAPS has argued for managing internal AI risks by building resilient controls into organizational pipelines, so that failure of one model or subsystem doesn’t cascade into systemic loss of control.

The UK AI Security Institute has set control research as a central technical direction, noting that current methods cannot reliably stop misuse of cutting-edge models, but that black-box control experiments may provide tractable near-term defenses.

Redwood Research has focused on scalable oversight strategies, experimenting with automated monitoring and shutdown protocols to test whether model outputs can be kept within safe boundaries even when systems are adversarially optimized.

Apollo Research has emphasized the governance dimension, showing how “closed doors” deployments of advanced models may reduce external misuse risks, but raise new challenges in transparency and accountability.

Dangerous capability demonstrations and evaluations

AI companies and external research organizations are actively assessing AI models to understand potential security implications, both pre-deployment and in-the-wild:

Frontier AI Companies have committed to assessing the risks posed by their models pre-release , and detail multiple risk vectors in the model cards, including jailbreaking, cyber risks, biological and chemical, deception, and more. Anthropic’s Frontier Red Team and threat Intelligence report and GPT-5’s model card have more detail.

Pattern Labs built the SOLVE benchmark to answer the question of how capable frontier AI models are at vulnerability discovery & exploit development challenge, and used it to assess the latest models of Claude and ChatGPT.

Pre-release cyber testing may not be comprehensive enough. Xbow found that GPT-5 is much better at hacking than OpenAI claims.

Palisade Research found that ChatGPT actively sabotages when being shut down, which is increasingly concerning as agents become widespread, and have also released an LLM Honeypot as an early warning system for autonomous hacking.

The UK AI Security Institute released the Inspect Sandboxing Toolkit to test agentic AI systems for dangerous capabilities without risking real-world harm.

Want to build AI Security?

Next steps