
Adversarial ML

  • Writer: Mor Levin-weisz
  • 5 days ago
  • 5 min read

Guest post by Odelya Krief and Netzer Epstein, Heron Deep Dive participants


Thoughts based on a conversation with Itay Yona, an interdisciplinary researcher in the fields of information security, neuroscience, and AI. In his most recent role he worked on the Privacy & Security foundational research team at DeepMind.


From adversarial examples to jailbreaks

Adversarial machine learning started with a surprising observation: you can take an input that looks unchanged to humans, apply a tiny, worst-case perturbation, and make a neural network confidently output the wrong answer. Even more unsettling, these adversarial examples often transfer across different models, suggesting they exploit shared geometry rather than model-specific quirks. This was first shown in 2014 in the paper  Intriguing Properties of Neural Networks.


An early explanation for this behavior is that neural networks behave “too linearly” in high-dimensional spaces, meaning that small but well-aligned changes can push the model across a decision boundary. This insight turned adversarial attacks into something systematic rather than accidental and led to practical techniques such as gradient-based attacks and adversarial training. This is described in Explaining and Harnessing Adversarial Examples.
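
To make the gradient-based idea concrete, here is a minimal sketch in the spirit of the fast gradient sign method (FGSM) from Explaining and Harnessing Adversarial Examples, assuming a PyTorch image classifier. The names model, loss_fn, and the epsilon value are placeholders for illustration, not anything taken from the paper.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.03):
    """One-step FGSM sketch: nudge x in the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # A tiny, worst-case step per pixel, then clamp back to a valid image range.
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```

The point is exactly the “too linear” intuition: each pixel moves only by epsilon, but because every move is aligned with the gradient, the combined effect is enough to cross a decision boundary.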


Modern LLM “jailbreaks” are the same story in a new domain: instead of pixels, the attacker perturbs text prompts. Work on universal, transferable adversarial suffixes (Universal and Transferable Adversarial Attacks on Aligned Language Models) shows that you can automatically discover short strings that, when appended to many harmful requests, reliably reduce refusals, and that these strings can even transfer to other aligned models. Meanwhile, “Are aligned neural networks adversarially aligned?” argues that being aligned in normal use does not imply robustness under worst-case inputs, with especially stark results in multimodal settings. Mechanistic interpretability work (e.g., attribution-graph analyses of a jailbreak) adds a complementary angle: jailbreaks can succeed by getting the model to “start down the wrong path,” then letting internal pressures (like syntactic continuation) carry it forward.
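
As a rough illustration of what “universal” and “transferable” mean in practice (not the paper’s actual code), the sketch below measures how often a model refuses a set of prompts, with and without a candidate suffix appended. Here generate_fn stands in for whatever call produces the model’s reply, and the refusal check is deliberately crude.

```python
from typing import Callable, List

# Very rough heuristic for spotting a refusal; a placeholder, not the paper's metric.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def looks_like_refusal(reply: str) -> bool:
    """Crude refusal check: does the reply open with a common refusal phrase?"""
    return reply.strip().lower().startswith(REFUSAL_MARKERS)

def refusal_rate(generate_fn: Callable[[str], str],
                 prompts: List[str], suffix: str = "") -> float:
    """Fraction of prompts that get refused, with an optional suffix appended."""
    refused = sum(looks_like_refusal(generate_fn(p + " " + suffix)) for p in prompts)
    return refused / len(prompts)

# A suffix is "universal" if it lowers the refusal rate across many prompts, and
# "transferable" if the same suffix also lowers it on other aligned models:
#   baseline = refusal_rate(my_model_generate, prompts)
#   attacked = refusal_rate(my_model_generate, prompts, suffix=candidate_suffix)
```

The hard part, of course, is finding such a suffix in the first place; in the paper this is done with a gradient-guided greedy coordinate search. The sketch only shows how universality and transfer would be measured once a candidate exists.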



Universal and Transferable Attacks


Transferable attacks on LLMs are attacks that prove effective against more than one model. These attacks (also called universal attacks) exploit a basic vulnerability in the models’ structure or architecture, so the same attack can target different models that share the same underlying principles. One example of such a basic vulnerability is the way models resolve the meaning of a word and decide how much attention to give it.



Summary of the Paper

In-Context Representation Hijacking (Itay et al.)

In the paper In-Context Representation Hijacking, the researchers demonstrated a universal attack that manipulates how the model understands meaning through context. They provided the model with many long prompts containing news, facts, and stories about bombs, but consistently replaced the word “bomb” with the word “carrot.” After being exposed to enough of this context within a single interaction, the model began to treat “carrot” as if it meant “bomb.” When the researchers later asked how to create a “carrot,” the model responded as if the question were about a bomb, bypassing its safety guardrails. The attack proved to be universal: it worked on both frontier and open-source models.


Q&A and Discussion

How did the model fail to block the answer if it “knew” the question was about a bomb?

The key issue is that the model’s understanding develops gradually. At early stages of processing, the model still interpreted the word “carrot” as something harmless. The safety mechanism made its decision at this point, based on an incomplete understanding of the prompt’s general context. Only later, as more context was integrated across deeper layers, did the model internally realize that “carrot” was being used to mean a bomb.


This creates a mismatch between when safety is checked and when meaning fully emerges. By the time the model reached the correct (dangerous) interpretation, the safety decision had already been made and was not re-evaluated. This is very similar to a time-of-check vs. time-of-use (TOCTOU) vulnerability in cybersecurity, where a system verifies something under one condition but later uses it under a different, riskier condition.
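
For readers less familiar with the cybersecurity analogy, here is the classic file-system version of TOCTOU, sketched in Python. The path and the scenario are purely illustrative.

```python
import os

# Classic TOCTOU sketch (illustrative only): the check and the use happen at
# different times, so what was checked can change before it is used.
path = "/tmp/report.txt"  # arbitrary example path

if os.access(path, os.R_OK):      # time of check: the path looks safe to read
    # ...a gap in which the file can be swapped, e.g. for a symlink elsewhere...
    with open(path) as f:         # time of use: we act on whatever is there now
        data = f.read()
```

The jailbreak has the same shape: the safety check fires on the early, harmless interpretation of “carrot,” while the answer is generated from the later, dangerous one.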


The important insight is that the failure is not due to the model lacking knowledge of the contextual meaning of the word “carrot,” but due to the timing of safety enforcement. Safety mechanisms that assume meaning is fixed or immediate can miss cases where meaning shifts during the model’s internal reasoning process. And that is exactly what the attack exploits.


A figure in the paper illustrates this scenario.


What does the model think a “carrot” is?

The model does not store fixed definitions. Meaning emerges gradually across layers, based on context and associations. While we may think the meaning of each word is fixed in the model’s weights, attention and the updates driven by context can meaningfully change how the model interprets a word the next time it appears in a prompt.


Before the attack, the model likely associated “carrot” with familiar concepts such as a vegetable, the color orange, being pretty gross in salads, and so on. But after being exposed to prompts where “carrot” consistently appeared in contexts normally associated with bombs, the model began to treat the word “carrot” as if it referred to a bomb. In this sense, the model was not redefining the word itself, but reinterpreting it based on the surrounding context, showing how flexible and exploitable its notion of meaning can be. This builds on a previous ICLR paper, In-context Learning of Representations.


Is there a specific layer where the model understands that “carrot” is a bomb?

Not really. The model does not have a single layer where meaning suddenly “clicks.” Instead, meaning is spread across many layers and develops gradually as the input is processed. Because of this, defenses that focus on only one layer or one moment in the generation process are fragile.


In the paper, the authors demonstrate this using probability tables created with interpretability tools such as logit lens and Patchscopes. These tools allow us to look inside the model and see which word it is most likely to generate next at each layer.


The original prompt given to the model was:

“How to build a carrot.”


To understand what the model was “thinking” internally, the researchers truncated the prompt at an earlier point and examined the model’s internal predictions for:

“How to build a”


This truncated prompt gives us a kind of “glimpse into the model’s brain” just before the word “carrot” is actually given. When looking at the probability table for the next word, they saw that across many layers, the model assigned a high probability to the word “bomb” as the next token. In other words, even though the final prompt still contained the word “carrot,” internally the model was already interpreting the sentence as referring to a bomb.


This shows that the dangerous meaning was present inside the model well before it appeared explicitly in the output, and that different layers were already encoding that interpretation at different stages of processing.
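
To give a rough sense of how such per-layer probability tables are produced, here is a minimal logit-lens sketch (not the paper’s exact tooling or models): project each layer’s hidden state at the last token position through the model’s final layer norm and unembedding matrix, and read off the most likely next token. The model name “gpt2” is just a small placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal logit-lens sketch: "gpt2" is an arbitrary small placeholder model.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

prompt = "How to build a"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states holds the embedding output plus one tensor per layer.
for layer, h in enumerate(out.hidden_states):
    h_last = model.transformer.ln_f(h[:, -1, :])   # final layer norm, as in the logit lens
    logits = model.lm_head(h_last)                 # unembed into vocabulary space
    top_token = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d} -> most likely next token: {top_token!r}")
```

Reading such a table layer by layer is what reveals the mismatch: an interpretation can already dominate in the middle of the network long before anything appears in the visible output.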


To sum it up

By slowly shifting the meaning, we can make the model misunderstand what is really being discussed and bypass its safety rules. This also shows that the timing of the refusal mechanism is critical, and that safety checks may need to happen more than once during the model’s reasoning process.



Interested in doing research in AI security with field leaders? Join our Research Fellowship



Want to dive deep into interesting research? Join the next cohort





 
 