
Defense on a Deadline: Hardening Models Before Hackers Exploit Them

  • Writer: Mor Levin-weisz
  • Jan 25
  • 5 min read
Staying one training cycle ahead of attackers.

Guest post by Eran Segal and Tal Yerushalmy, Heron Deep Dive participants


Thoughts following a conversation with Omer Nevo, CTO of Irregular — the first frontier security lab designing high-fidelity simulations and evaluations that frontier labs, governments, and Fortune-500 enterprises rely on to keep increasingly capable AI systems safe. We thank Omer for sharing his insights.


Over the past year, reinforcement learning has rapidly become the dominant paradigm for improving AI models. Irregular’s recent research on frontier model performance in offensive security tasks reveals a measurable capability shift—one most organizations haven’t noticed yet, but are about to feel.


The danger is not an abstract “AGI doomsday.” It is concrete: the tools to exploit existing models with dangerous capabilities, or to fine-tune such capabilities into them, are already publicly available.

The RL turning point

Supervised learning—the traditional backbone of machine learning—requires humans to meticulously label data. This process is slow, expensive, and fundamentally limited by the ceiling of human knowledge; a model can generally only become as good as the examples it is fed. Reinforcement Learning (RL) changed that equation.


In its purest form, RL lets a model learn by interacting with an environment, receiving rewards for good outcomes and penalties for bad ones. It can discover strategies humans never thought of; this is how AlphaZero invented chess moves that grandmasters had never seen.
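The interaction loop described above can be sketched with tabular Q-learning on a toy task. This is a minimal illustration of "act, get reward, update", not any lab's actual training setup; the environment and all constants are invented for the example.

```python
# Minimal sketch of the RL loop: the agent acts, the environment returns
# a reward, and the value estimates update. Toy 1-D "walk to the goal"
# task; all names and numbers here are illustrative.
import random

random.seed(0)
N_STATES = 5          # positions 0..4; reaching state 4 is the goal
ACTIONS = [-1, +1]    # step left or right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    # Reward the good outcome (reaching the goal), penalize wasted moves.
    reward = 1.0 if nxt == N_STATES - 1 else -0.01
    return nxt, reward, nxt == N_STATES - 1

alpha, gamma, eps = 0.5, 0.9, 0.1
for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current estimates, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: q[(s, x)])
        nxt, r, done = step(s, a)
        best_next = max(q[(nxt, x)] for x in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # temporal-difference update
        s = nxt

# After training, the greedy policy steps right (+1) from every non-goal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Nothing in the code encodes "go right"; the agent discovers it purely from the reward signal, which is the property that let AlphaZero find strategies no human had labeled.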


Usually the hard part is building a realistic environment. In most fields, this is a massive engineering hurdle because simulations often suffer from the "reality gap"—where a model excels in a simplified digital world but fails in the messy physical one.


For years, robotics hit this wall; it was easy to train a hand to pick up a ball in a vacuum, but nearly impossible to simulate the infinite friction and lighting variables of a real warehouse. Once high-fidelity physics engines allowed robots to train in "Sim2Real" environments, we saw a sudden explosion in hardware capability.


However, creating these environments introduces a new risk: goal misgeneralization. This happens when a model learns to achieve the "reward" in a way the designer didn't intend—like a game-playing AI that learns to exploit a glitch to get points rather than actually playing the game. In a security context, a model might find a way to "solve" a task by crashing the evaluation server rather than finding a legitimate patch.
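The "crash the evaluation server" failure mode can be made concrete with a toy reward function. Everything below is hypothetical: the outcomes, probabilities, and the scoring bug are invented to show how a reward-maximizing agent exploits a loophole in the designer's intent.

```python
# Toy illustration of goal misgeneralization: the reward was meant to pay
# out only for a working patch, but a scoring bug also pays out when the
# evaluation service stops responding. All names here are hypothetical.

def buggy_reward(outcome):
    # Intended: reward only "patched". Bug: an unreachable server is
    # scored as "no vulnerabilities found" and rewarded too.
    if outcome in ("patched", "server_down"):
        return 1.0
    return 0.0

ACTIONS = {
    # action: (outcome it produces, probability of pulling it off)
    "write_patch": ("patched", 0.3),
    "crash_server": ("server_down", 0.9),  # easier than a real fix
}

def expected_reward(action):
    outcome, p_success = ACTIONS[action]
    return p_success * buggy_reward(outcome)

best = max(ACTIONS, key=expected_reward)
print(best)  # the agent "solves" the task by crashing the evaluator
```

The fix is never in the agent; it is in the reward design, which is why building these environments is itself safety-critical engineering.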


Cybersecurity already has a ready-made environment.



Why cyber is the highest-risk domain


Decades of CTF challenges, exploit write-ups, bug-bounty reports, vulnerable codebases, and labeled PCAPs form a rich, high-fidelity playground. Unlike the closed-door worlds of high-energy physics or biological research, parts of the security community have a long history of radical transparency and sharing. Since the early days of "underground" publications like Phrack magazine—most notably Aleph One’s seminal 1996 article, "Smashing the Stack for Fun and Profit"—researchers have treated the public disclosure of vulnerabilities as a badge of honor.


This culture of openness, intended to help defenders, has inadvertently created a massive, structured dataset for anyone to use. Consequently, offensive training is almost plug-and-play.


The "Text-to-Damage" Pipeline

Our internal team discussions highlighted a critical distinction between cyber and other high-risk fields like bio-pharma:


  • Cyber is decentralized

    While the pharma market is huge, it is centralized and gated. In cyber, the "weapon" is pure text. There are no physical supply chains or specialized labs to monitor.

  • The "Reality Gap" is zero

    In robotics or biology, there is a gap between a digital simulation and a physical result. In cyber, the simulation is the reality. If a model generates the correct code sequence, the damage is immediate.

  • The market is liquid

    There is immense value in training models for defensive hardening, but because that knowledge is machine-readable and dual-use, it is instantly available for offensive inversion by ransomware crews or nation-states, directly amplifying their capabilities.


The Structural Asymmetry

This risk is compounded by a fundamental structural asymmetry: it is mathematically and operationally easier to find a flaw than to fix one. An attacker only needs to find a single unpatched vulnerability or an outdated library to compromise a system. Conversely, a defender must secure every possible entry point perfectly, every single day. For example, updating a library to patch a bug can inadvertently break existing functionality, forcing defenders into a slow cycle of testing and deployment. An attacker, however, can weaponize a newly discovered zero-day the moment it is published.


Until we build high-fidelity defensive training environments that specifically reward patching and resilience over exploitation, offense will maintain this structural head-start.



Regulation’s blind spot

The EU AI Act is a massive step forward, but its definitions are still fuzzy, leaving doors open for actors to bypass its restrictions.


Governments do receive early-access testing rights, but with no clear penalties the process is closer to a courtesy than a constraint.


Until regulation tracks capability rather than compute, the only brake on deployment is self-restraint: frontier labs that voluntarily follow frameworks like Anthropic’s Responsible Scaling Policy. That safety net is only as strong as the labs’ goodwill, and nothing prevents these companies from lowering the bar of their own responsible scaling policies (RSPs). In fact, Anthropic has already lowered the bar.


Evaluations: today’s gatekeepers

In the absence of binding regulation, third-party model evaluations—run by specialist firms like Irregular—have become the de facto safety layer. Before a model ships, it now faces confidential test suites probing its ability to detect and exploit vulnerabilities, crack difficult cryptographic problems, or even demonstrate the sophisticated planning required to map and compromise private networks.


Model trainers never see the exact tasks (they are contractually blocked from doing so), because if you publish the test, someone will Goodhart it into uselessness. The opacity is a feature: it forces broad safety rather than narrow score-chasing.


These tests are designed to measure what might happen if a model is used with bad intentions and to understand the real-world impact of its skills. One way to make these models safer is to ensure they are better at spotting and blocking harmful actions than they are at performing them.
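One rough way to operationalize "better at blocking than performing" is to score a model on paired tasks, detecting a flaw versus exploiting it, and gate release on the gap. The task names and results below are made-up numbers for illustration, not Irregular's methodology.

```python
# Hedged sketch: compare a model's detection rate against its exploitation
# rate over a shared task set, and require defense to outpace offense.
# The evaluation data below is invented for illustration.

eval_results = [
    # (task, detected_flaw, exploited_flaw)
    ("sql_injection",   True,  True),
    ("buffer_overflow", True,  True),
    ("auth_bypass",     True,  True),
    ("path_traversal",  False, True),
]

detect_rate = sum(d for _, d, _ in eval_results) / len(eval_results)
exploit_rate = sum(e for _, _, e in eval_results) / len(eval_results)
defensive_delta = detect_rate - exploit_rate

print(f"detect={detect_rate:.2f} exploit={exploit_rate:.2f} delta={defensive_delta:+.2f}")
ship = defensive_delta > 0  # release gate: defense must exceed offense
```

In this invented run the model exploits more reliably than it detects, so the gate fails, which is exactly the situation a pre-release evaluation exists to catch.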


The attacker-defender asymmetry

Attackers need one overlooked flaw; defenders must cover the entire surface continuously. Offense can probe, iterate, and wait for the perfect moment, while defense operates under constant pressure with no margin for error. Even if both sides wield equally powerful models, the structural advantage still leans red.


Hence the urgent project: build defensive training environments—high-fidelity sims that reward:


  • Patching vulnerabilities before attackers exploit them

  • Spotting and stopping intrusions in real time

  • Building robust systems rather than breaking fragile ones

  • Designing defenses that still work when tactics change
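One could imagine such an environment scoring an episode by weighting the four behaviors above. The field names, weights, and penalty values below are illustrative assumptions, not a description of any real benchmark.

```python
# Sketch of a defensive reward that pays for patching, live detection,
# resilience, and robustness to tactic changes. Weights are illustrative.
from dataclasses import dataclass

@dataclass
class EpisodeOutcome:
    patched_before_exploit: bool   # fixed the flaw before the red team used it
    intrusions_stopped: int        # live intrusions detected and blocked
    intrusions_total: int
    system_still_functional: bool  # hardening did not break the service
    survived_tactic_shift: bool    # defenses held when attack tactics rotated

def defensive_reward(o: EpisodeOutcome) -> float:
    r = 2.0 if o.patched_before_exploit else 0.0
    if o.intrusions_total:
        r += o.intrusions_stopped / o.intrusions_total   # real-time detection rate
    r += 1.0 if o.system_still_functional else -2.0      # punish "defense by breakage"
    r += 1.0 if o.survived_tactic_shift else 0.0         # reward adaptable defenses
    return r

episode = EpisodeOutcome(True, 3, 4, True, False)
print(defensive_reward(episode))  # 2.0 + 0.75 + 1.0 + 0.0 = 3.75
```

The penalty for breaking the system matters most: without it, "shut everything down" maximizes reward, which is the same reward-hacking trap described earlier.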


Frontier labs run some of this work internally, but defenders still need shared environments and independent scoring—ideally with rotating, partly secret tests—to provide checks labs can’t perform on themselves.



Key takeaways

AI + Cyber = immediate, concrete risk
Models have already shown the capacity to perform at an expert level in various cybersecurity scenarios.


Policy lags; voluntary frameworks fill the gap—for now
Current frameworks like the EU AI Act are meaningful first steps, but enforcement lags behind capability growth. The most effective safety measures today come from voluntary frameworks and evaluations, not mandates.


Secret evaluations are the current guardrail
They’re imperfect, but they’re what keep unreleased zero-day generators off the public internet.


Winning the "Defensive Delta"
The race isn’t about raw power—it’s about intentionality.

To shift the balance, we must prioritize an engineering focus on defensive training environments. We win when our models learn to harden systems faster than they learn to break them.



Interested in doing research in AI security with field leaders? Join our Research Fellowship



Want to dive deep into interesting research? Join the next cohort




