
Model Weights Theft and the Security Debt of AI Infrastructure

  • Writer: Mor Levin-weisz
  • Dec 24, 2025
  • 7 min read

Guest post by Zohar Venturero and Nitzan Pomerantz, Heron Deep Dive participants


Thoughts based on a conversation with Yogev Bar-On, the CEO of Attestable, an early-stage startup developing integrity solutions for AI. Previously, Yogev worked on AI security policy at the RAND Corporation and is a co-author of RAND's "Securing AI Model Weights" report.



Why Model Weights Matter (Even If You're Not a Doomer)


Let's set aside rogue AIs, paperclip maximizers, and Skynet scenarios for a moment. The case for taking model weight security seriously doesn't require sci-fi premises; it's far more mundane and pressing.


Training a frontier model requires the convergence of three things:

  • Data to train your model on

  • Algorithms to train your model with

  • Compute to actually run the training


Data can be scraped or bought, and ML architectures get published or reverse-engineered; still, collecting enough data in the right mixture and landing on the "perfect algorithmic formula" are not easy.


Above all of that sits compute: hardware, energy, capital, and months of training. The product of all this expensive, intensive effort is the model weights: a single artifact representing billions of dollars of distilled intellectual property.


From an adversary's perspective, the math is simple: why replicate the journey when you can steal the destination? Better yet, unlike a properly deployed model, stolen weights come stripped of much of the safety work: fewer guardrails, fewer refusals, pure capability. Model-internal guardrails have been shown to be dismantled by relatively simple techniques such as fine-tuning.


In adversary hands, this enables high-risk misuse: military applications, bioweapons (e.g., mirror life), offensive cyber operations, critical-infrastructure disruption, and more. Below, we make the case for prioritizing weight security now and present some of the approaches being explored.



The Recipe: What Model Weights Actually Are

A neural network is, at its core, a massive mathematical function. During training, the model processes enormous amounts of data, gradually adjusting billions of numerical parameters to minimize prediction errors. These parameters, the weights, encode everything the model has "learned" during training.


This knowledge is not human-readable: the weights are dense matrices of floating-point numbers that, combined with the model architecture, produce the behavior we observe.

Think of the architecture as the scaffold, the training data as the curriculum, and the weights as the resulting expertise. That configuration of billions of parameters is what makes GPT-5 behave like GPT-5, and it is the artifact worth protecting.
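

To make the abstraction concrete, here is a minimal, purely illustrative sketch (plain NumPy, nothing like a real training stack): a two-parameter linear model whose weights are nudged by gradient descent to reduce prediction error. Frontier models run the same loop with billions of parameters and vastly more data.

```python
# Illustrative only: a tiny linear model whose "weights" are adjusted by
# gradient descent to reduce prediction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))                  # training data ("the curriculum")
true_w = np.array([3.0, -1.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(2)                                # the weights: the artifact worth protecting
lr = 0.1
for _ in range(200):
    pred = X @ w
    grad = X.T @ (pred - y) / len(y)           # gradient of the mean squared error
    w -= lr * grad                             # adjust weights to reduce error

print(w)   # close to [3.0, -1.5]: "learned" parameters, not human-readable at scale
```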



Where Do These Weights Live?


  • Limited encryption: During inference, weights must exist in plaintext in GPU memory. For large models distributed across GPU clusters, this exposure multiplies: weights and activations flow unencrypted between nodes. Encryption helps with storage and transfer, but it doesn't address the exposure window during actual use (see the sketch after this list).

  • Minimal hardware security: TPMs and HSMs are designed to protect small, high-value secrets like cryptographic keys, not multi-gigabyte model files that need to stream into GPU memory at high throughput. Even if we adapted the concept, there's an ecosystem gap: the AI accelerator world (NVIDIA, AMD, custom TPUs) evolved separately from traditional hardware security. There's no standard way to say "load these weights into the GPU, but only if the secure enclave validates them first." NVIDIA's confidential-computing efforts are starting to bridge this gap, but it's early days.

  • No airgap: Inference clusters are internet-facing by design; that's the whole product. The weights answering chatbot queries sit on machines with network exposure.
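

As a concrete illustration of the first point, here is a minimal sketch of the exposure window, assuming a hypothetical setup where weights are encrypted at rest with a symmetric key: the protection holds for storage and transfer, but the weights must be decrypted into plaintext memory before inference can run.

```python
# Minimal sketch of "limited encryption". File handling and key management are
# illustrative assumptions, not any lab's actual setup.
from cryptography.fernet import Fernet
import numpy as np

key = Fernet.generate_key()                        # in practice: managed by a KMS/HSM
vault = Fernet(key)

weights = np.random.rand(1024).astype(np.float32)  # stand-in for a multi-GB checkpoint
encrypted_blob = vault.encrypt(weights.tobytes())  # protected at rest and in transit

# The exposure window: to serve a request, the weights must be decrypted back
# into plaintext and loaded into (GPU) memory. Encryption at rest does not
# cover this phase.
plaintext = vault.decrypt(encrypted_blob)
live_weights = np.frombuffer(plaintext, dtype=np.float32)
# live_weights now sit unencrypted in memory for the duration of inference.
```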


None of this is negligence, exactly. It's the natural outcome of a field that moved fast and prioritized capability. But it means we're now retrofitting security onto systems that were never architected for it.


The internal models of AI labs, particularly those deemed too powerful or dangerous for public release, represent an equally high-value target for adversaries. Critically, the weights are vulnerable not just in the final deployment environment, but from the moment they are created. Securing model weights throughout the entire lifecycle, from the earliest checkpoint in the training cluster onward, is a critical requirement for managing the immediate risks associated with powerful, ungoverned AI capability.


A Surprising Aside: Distillation Attacks


So far, we've focused on protecting weights from direct exfiltration, but there's an uncomfortable wrinkle worth mentioning:


Unlike cryptographic algorithms, which are specifically constructed so you can't reverse-engineer the key from the ciphertext, LLMs come with no such guarantee. There's no mathematical proof that you can't reconstruct meaningful capability from outputs alone.

And indeed, you can.


Distillation attacks, where a smaller model is trained to mimic a larger one using only its input-output behavior, have proven surprisingly effective. The DeepSeek-on-OpenAI-outputs controversy is an example of this in practice. It is also unclear to what extent a distilled model retains or bypasses the original model's safety guardrails, presenting another major unknown.
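

To make the mechanism concrete, here is a stripped-down sketch of the distillation idea, with a local toy function standing in for a model behind an API (real distillation targets an LLM's token distributions, not a toy regression): the student never sees the teacher's weights, only its answers.

```python
# Sketch of distillation: a "student" learns to mimic a "teacher" purely from
# the teacher's input-output behavior.
import numpy as np

rng = np.random.default_rng(1)

def teacher(x):                       # the adversary can only query this, not inspect its weights
    return np.tanh(x @ np.array([2.0, -1.0]))

# 1. Query the teacher to build a synthetic training set.
queries = rng.normal(size=(2000, 2))
teacher_outputs = teacher(queries)

# 2. Fit a student model on the teacher's answers alone.
student_w = np.zeros(2)
for _ in range(500):
    pred = np.tanh(queries @ student_w)
    grad = queries.T @ ((pred - teacher_outputs) * (1 - pred**2)) / len(queries)
    student_w -= 0.5 * grad

# 3. The student now approximates the teacher without ever touching its weights.
test = rng.normal(size=(5, 2))
print(np.round(teacher(test), 3))
print(np.round(np.tanh(test @ student_w), 3))
```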


Sadly, we don't have a clean explanation for how much capability transfers through distillation, or where the limits are.


The implication for weight security is uncomfortable: even perfect protection against exfiltration doesn't fully solve the problem, because the model's capabilities are projected into the world with every API call.


Distillation is a real threat, but it's a different kind of problem, one that demands its own solutions and whose long-term impact remains unclear. For now, direct weight theft remains the most acute and concretely addressable threat.


The Threat Landscape

To think about securing model weights systematically, it helps to categorize the threat landscape by the level of access the adversary has. 


  • Distillation, as discussed, requires no internal access; it relies solely on external API outputs to mimic the model.

  • Classical Cyber Defense covers the familiar territory: preventing traditional network breaches and lateral movement, protecting weights at rest on storage or employee machines, and so on.

  • Inference Leakage Prevention is the novel problem: weights must be loaded into memory for use, which opens a window of exposure.


These categories form a natural hierarchy. Attackers will always take the easiest route, so strong classical defenses are a prerequisite before inference security even matters. But they are not sufficient: once the perimeter is hardened, the inference endpoint becomes the next critical target. A robust security strategy must address all three frontiers simultaneously; leaving any single vector unguarded guarantees that it will be the adversary's entry point.


The next section focuses on inference security specifically, since it's both the least mature and the most unique to AI systems.


Securing the Inference Layer: Three Emerging Approaches

To address the specific threat of inference leakage, the industry is exploring three competing architectures. Each encodes a fundamentally different assumption about where "trust" should reside.


Hardware: Trust the Silicon

This approach leverages Trusted Execution Environments (TEEs) and secure enclaves. The concept is to have the hardware itself generate cryptographic attestations that the code running on it is legitimate and untampered with. This is the most mature option today, since TEEs are already shipping in production hardware, but it forces reliance on hardware vendors like NVIDIA or Intel to manage private keys properly. Furthermore, TEEs have a history of vulnerabilities (e.g., sgx.fail), and this approach inevitably creates hardware lock-in.
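

As a rough illustration of the flow (not NVIDIA's or Intel's actual attestation API), here is a sketch of the verifier-side policy: weights are released to the enclave only if a signed measurement of the running code matches what the weight owner expects. The key pair below stands in for a vendor-rooted attestation key.

```python
# Conceptual sketch of attestation: the enclave signs a hash ("measurement") of
# the code it is running; the weight owner checks the signature and the
# measurement before releasing weights. All names here are illustrative.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

vendor_key = Ed25519PrivateKey.generate()          # stands in for a hardware vendor's key
vendor_pub = vendor_key.public_key()

inference_code = b"serve_model_v1.2 binary ..."    # what is actually running in the TEE
measurement = hashlib.sha256(inference_code).digest()
quote = vendor_key.sign(measurement)               # "attestation quote" produced in hardware

# Weight owner's policy: release weights only to a known-good measurement.
EXPECTED = hashlib.sha256(b"serve_model_v1.2 binary ...").hexdigest()

def release_weights(measurement: bytes, quote: bytes) -> bool:
    vendor_pub.verify(quote, measurement)          # raises if the quote is forged
    return measurement.hex() == EXPECTED           # and the code is the code we audited

print(release_weights(measurement, quote))         # True -> ship weights to the enclave
```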


Verification: Trust but Verify

This method sidesteps hardware trust by asking a statistical question: does the output match what we expect? By rerunning a sample of inference requests in an independent, trusted environment, defenders can probabilistically detect manipulation. This avoids vendor lock-in and performance penalties on every request, but implementation is tricky. Non-determinism in floating-point calculations means outputs rarely match perfectly, requiring complex verification logic.
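

A minimal sketch of what such a spot-check might look like, assuming hypothetical `production_model` and `trusted_replica` callables that return numeric outputs; choosing the tolerance parameters well is where the real difficulty hides.

```python
# Sketch of "trust but verify": re-run a random sample of requests on an
# independent, trusted replica and flag outputs that diverge beyond a
# floating-point tolerance. `production_model` and `trusted_replica` are
# hypothetical callables returning logits/embeddings as arrays.
import numpy as np

rng = np.random.default_rng(42)

def audit(requests, production_model, trusted_replica,
          sample_rate=0.02, rtol=1e-3, atol=1e-5):
    """Probabilistically detect tampered inference by spot-checking outputs."""
    suspicious = []
    for req in requests:
        if rng.random() > sample_rate:            # audit only a small sample
            continue
        observed = production_model(req)
        expected = trusted_replica(req)
        # Exact equality is hopeless under GPU non-determinism; compare within
        # a tolerance instead.
        if not np.allclose(observed, expected, rtol=rtol, atol=atol):
            suspicious.append(req)
    return suspicious

# Toy usage: identical models, so nothing should be flagged.
model = lambda x: np.tanh(np.asarray(x))
print(audit([rng.normal(size=8) for _ in range(500)], model, model))   # []
```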


Zero-Knowledge Proofs: Trust Nothing, Prove Everything

The most mathematically robust approach uses Zero-Knowledge (ZK) proofs to cryptographically certify that every inference computation was executed correctly. This relies solely on math rather than manufacturer keys or private infrastructure. The historical barrier has been overhead: conventional ZK proofs can be 1000x slower than the computation itself. However, startups like Attestable are now developing approximation methods that aim to reduce this tax to roughly 1.5x, potentially moving ZK verification from theory to production.


Closing Thoughts

Securing model weights is not a single problem. There is little value in deploying advanced Zero-Knowledge proofs to secure the runtime if the model weights are sitting in an unsecured S3 bucket reachable via a phished employee credential. Conversely, as classical defenses harden, sophisticated actors will inevitably turn their attention to the inference layer, the one moment where weights must be exposed to be useful.


The industry is at a crossroads on how to handle this runtime exposure.

Leaving the inference window unguarded is a bet that the adversary cannot breach the perimeter, a bet that history suggests is unwise.


The frameworks we’re using (FedRAMP, SOC 2) were designed for a world where data breaches were about credit cards and personal information. They weren’t designed for facilities manufacturing capabilities that could shift global power balances.


If you work in AI security, here’s what matters:

For Researchers: The evals gap is real. We need comprehensive testing frameworks that include cyber capabilities, misuse potential, and catastrophic failure modes.


For Practitioners: Start thinking about security constraints that limit capability by design (maybe even bandwidth throttling!) rather than just adding defensive layers.


For Policymakers: The incentive structures are broken. Private companies won’t invest adequately in security without regulatory pressure or economic incentives that make it worthwhile.


Questions Worth Entertaining

  • If AI datacenters are manufacturing “state-level super weapons,” what does adequate security actually look like?

  • How do we build trust for federated learning or distributed training when every node is a potential adversary?

  • The algorithms are created by relatively few people. Should we focus disproportionately on protecting those individuals?

  • Who actually pays for defense research? What incentive structures work when offense resources dwarf defense?


Asher opened by suggesting some problems are worth quitting your job to work on. After this conversation, I understand why. These aren’t just technical challenges; they’re questions about how we govern transformative technology in an adversarial world.

The issues raised in this summary could, in the mid to long term, change global power and economic dynamics. We had better address the burning questions, allocate resources toward a strategic vision, and give companies incentives that make investing in security worthwhile.



Interested in doing research in AI security with field leaders?



Want to dive deep into interesting research?



