top of page

Key Topics

Explore critical areas where cybersecurity expertise is essential to securing powerful AI systems

Technical governance

Ensuring AI is secure and beneficial to society requires new technology and research that can enable better AI policy and regulation. Some examples include:

​​

​

​

AI infrastructure security

Protecting frontier models from theft and misuse is a critical mission. Although the leading AI companies have incentives to prevent corporate espionage, the risks from a rogue government or terrorist group accessing top models may outweigh the companies incentive and skill level in defending against capable actors. Important developments include:

  • RAND authored a playbook for Securing AI Model Weights where they define necessary security levels (including Security Level 5 - defending against highly-resourced nation states), and map the current state of frontier AI company security, which they estimate at SL2 - secured against individual professional hackers. RAND’s playbook was instrumental in shaping frontier companies security policies.​​

​

​

  • Situational Awareness has argued for increased securitization of leading AI companies in light of the growing China-USA AI race.

​​

Unique AI attack vectors

AI systems open up entirely new attack surfaces that extend well beyond traditional cybersecurity. These vectors can be exploited to bypass safety guardrails, compromise data integrity, or weaponize advanced models in ways conventional defenses were never designed to anticipate.

  • Prompt-based attacks include direct prompt injections (malicious instructions embedded in user queries) and indirect ones, where hidden instructions are buried in external content like documents, websites, or RAG sources. Jailbreaking remains the most visible example, with challenges like Gandalf and HackAPrompt demonstrating how easily safety layers can be bypassed.

​

  • Adversarial ML attacks exploit carefully crafted inputs to make models misclassify or misbehave, and can be escalated by chaining weaknesses across multiple models or agent systems. Researchers have mapped out these threats in detail in recent papers.

​

  • Data poisoning injects malicious or misleading data into training or fine-tuning sets, embedding stealthy backdoors that are easy to trigger but hard to detect once deployed. Wiz and Fuzzy Labs both highlight the risks of membership inference and training-set contamination.

​

  • Data and model extraction threats range from weight exfiltration (theft of proprietary model parameters) to training data reconstruction and membership inference. These attacks are not hypothetical: multiple studies have shown that sensitive information can be recovered from large models, with new techniques emerging rapidly.

​

  • Fine-tuning for misuse allows adversaries to subtly strip away guardrails while preserving fluency, transforming general-purpose models into specialized offensive tools for cyber exploitation or bio/chem guidance. The UK AI Security Institute has warned that defending fine-tuning APIs may have fundamental limitations.

​

  • Infrastructure and pipeline risks expose the foundations of AI systems. Vulnerabilities in the Model Context Protocol (MCP) have already enabled arbitrary code execution and disk access. Retrieval-Augmented Generation (RAG) pipelines are vulnerable to database manipulation or injection. And supply chain attacks such as Escaperoute or Gemini calendar hijacks show how easily upstream components can be compromised.

Preventing loss of control with security techniques

AI control is a security-first complement to AI safety: the goal is to design and verify protocols so that even a misaligned but powerful model cannot cause unacceptable outcomes. This means treating advanced systems as untrusted by default, stress-testing them under realistic attack conditions, and counting “detect and shut down” as success rather than failure.

  • RAND Europe examined scenarios where AI systems slip beyond intended oversight, proposing a security posture that focuses on rapid detection and containment of loss-of-control incidents, rather than assuming perfect alignment will always hold.

​

  • IAPS has argued for managing internal AI risks by building resilient controls into organizational pipelines, so that failure of one model or subsystem doesn’t cascade into systemic loss of control.

​

  • The UK AI Security Institute has set control research as a central technical direction, noting that current methods cannot reliably stop misuse of cutting-edge models, but that black-box control experiments may provide tractable near-term defenses.

​

  • Redwood Research has focused on scalable oversight strategies, experimenting with automated monitoring and shutdown protocols to test whether model outputs can be kept within safe boundaries even when systems are adversarially optimized.

​

  • Apollo Research has emphasized the governance dimension, showing how “closed doors” deployments of advanced models may reduce external misuse risks, but raise new challenges in transparency and accountability.

Dangerous capability demonstrations and evaluations

AI companies and external research organizations are actively assessing AI models to understand potential security implications, both pre-deployment and in-the-wild:

​

  • Pattern Labs built the SOLVE benchmark to answer the question of how capable frontier AI models are at vulnerability discovery & exploit development challenge, and used it to assess the latest models of Claude and ChatGPT.

​

​

​

  • The UK AI Security Institute released the Inspect Sandboxing Toolkit to test agentic AI systems for dangerous capabilities without risking real-world harm.

Want to build AI Security?

bottom of page