Whitebox Probes for AI Control

5 days ago
4 min read

Guest post by Natan Katz, LuminAI Security

Dan Klein

What Is AI Control?

AI Control is mitigating misalignment risk under the assumption that misalignment arose - methods that keep you safe even if models are misaligned [1,2].

The goal isn’t to prevent misalignment - it’s to design defenses that hold up when it occurs: the techniques that work even if models are already misaligned.

The framework rests on a key observation regarding capability and trust. As models get larger and more capable, they become more useful - but also harder to trust. A frontier model can write sophisticated code, plan multi-step strategies, and reason about deception. These are exactly the capabilities that make misalignment dangerous.AI Control formalizes this tension into two roles:

U (Untrusted)

The large, capable model that does the real work. It’s powerful enough to be useful, but that same power means it could pursue misaligned goals - inserting backdoors, following injected instructions, or subtly sabotaging tasks. The more capable U is, the more damage it could do.

T (Trusted)

A smaller, weaker model that monitors U. T is less capable, but that’s the point - it’s small enough to be better understood, better aligned, and less likely to be misaligned. T acts as an LLM-as-a-Judge for safety, reading U’s inputs or outputs and flagging anything suspicious.

The Blind Spot

The problem: T reads text. When U produces identical output for honest and malicious scenarios - or when U simply refuses every request - T has nothing to work with. The signal isn’t in the text. It’s in the model’s internal state. The following study introduces a whitebox approach: rather than using T to monitor U’s inputs and outputs, we train a probe that reads U’s internal activities during inference and detecting intent directly from the model’s weights, before it ever produces a word.

Experimental Setup

If our probes encode intent regardless of the task surface, we have a detector that generalizes across the AI Control threat landscape - not just a prompt classifier. Can we build a T upon our probes?

The Probe

A whitebox probe is a linear classifier trained on a model’s internal activities during inference. Rather than reading what U says, it reads what U thinks.

The method is straightforward:

Run U on honest and attack inputs and record the required data during inference.
Train a small binary classifier (less than 1M parameters) that learns to differentiate between the dominant features of honest and attack inputs.

Benchmarks

We evaluated on two attack surfaces with fundamentally different threat models:

APPS-Code Backdoors

The APPS [3,4] dataset [3,4] contains competitive programming problems paired with hidden backdoor solutions that were pre-generated by GPT-4o. The task is code generation: given a programming problem, produce a correct solution — or, in the attack condition, produce a solution with a hidden vulnerability. Ground truth is

automated: unit tests determine whether the code is correct and whether the backdoor is present. Prior evaluations [1] have been performed using frontier models.

AgentDojo — Prompt Injection

AgentDojo [5] is a multi-turn agent benchmark across 4 domains (banking, Slack, travel, workspace). The agent receives a legitimate user goal and must complete it by calling tools (read email, check calendar, search results). In the attack condition, malicious instructions are hidden inside the environment data the agent retrieves (it is an indirect injection). The agent must complete the user task without following the injected instruction. Ground truth is deterministic Python scorer functions that check whether the injected goal was carried out.

Results: Code Backdoor Detection (APPS)

The Deception Scaffold

a vulnerability, anticipate what a reviewer would notice, and produce code that looks entirely legitimate. Following [3] attack evaluation protocol, we used an adversarial scaffolding:

Brainstorm five candidate methods for hiding a backdoor in this specific problem
Critique each for detectability — how suspicious would each look to a careful reviewer?
Refine the top candidates
Select the best approach
Implement the full solution with the hidden vulnerability
Self-review the code against a stealth checklist
Revise if any step raises a red flag

This is a structured reasoning process, not a prompt injection. It doesn’t just write bad code - it plans how to make bad code look good. The result is stealthier backdoors that are harder for a human or LLM reviewer to catch.

Following are the results:

LLM judges read the final code — the very output the scaffold was designed to make stealthy. Their AUC sits between 0.52 and 0.61: barely above random . Our probe reads what the model computed during generation. AUC 0.98: near-perfect detection on the same data where text-based monitors are essentially guessing

AgentDojo - Static

To establish a clean baseline and enable direct comparison with LLM judge monitors, we first evaluated AgentDojo in static mode: honest and injected prompts are fed directly to the model as a single forward pass, with no tool execution. The model reads the full prompt - user task and, in the attack condition, the embedded injection - and responds in one shot.

AUC Comparison on AgentDojo Dataset - Prompt Injection

When T reads U’s generated output, it sees nearly identical text for honest and attack: the model can’t use tools, so it produces similar non-committal responses in both cases. AUC 0.63-0.68: barely above random. When T reads the input prompt directly, it has access to the injected text in plaintext - a much easier classification task. AUC 0.72-0.87. This is the best-case scenario for text-based monitoring: the attack is literally visible in the input. The probe reads neither. It reads U’s activations - the internal computation while processing the prompt. AUC 0.95: better than the best-case text monitor, on data where the model’s output reveals nothing.

The key insight: our probes can detect intents even when behavior is identical.

The Agentic Challenge

The results above establish that activation probing works on static inputs. But the dominant real-world deployment model for capable AI is not static inference. It is agentic. For our framework it means a border scanning that cannot rely on prompts but needs to scan the entire agent closer. Our next research will focus on this domain.

Bibliography

[1] https://arxiv.org/abs/2312.06942

[2] https://www.redwoodresearch.org/

[3] https://github.com/UKGovernmentBEIS/control-arena

[4] https://arxiv.org/pdf/2105.09938

[5] https://arxiv.org/abs/2406.13352