
Model Weights Theft and the Security Debt of AI Infrastructure

  • Writer: Mor Levin-weisz
  • Dec 24, 2025
  • 7 min read

Guest post by Zohar Venturero and Nitzan Pomerantz, Heron Deep Dive participants


Thoughts based on a conversation with Yogev Bar-On, the CEO of Attestable, an early-stage startup developing integrity solutions for AI. Previously, Yogev worked on AI security policy at the RAND Corporation and is a co-author of RAND's "Securing AI Model Weights" report.



Why Model Weights Matter (Even If You're Not a Doomer)


Let's set aside rogue AIs, paperclip maximizers, and Skynet scenarios for a moment. The case for taking model weight security seriously doesn't require sci-fi premises; it's far more mundane and pressing.


Training a frontier model requires the convergence of three things:

  • Data to train your model on

  • Algorithms to train your model with

  • Compute to actually run the training


Data can be scraped or bought, and ML architectures get published or reverse-engineered; still, collecting enough data in the right mixture and landing on the "perfect algorithmic formula" are not easy.


Above all of that sits compute: hardware, energy, capital, and months of training. The product of all this expensive, intensive effort is the model weights: a single artifact representing billions of dollars of distilled intellectual property.


From an adversary's perspective, the math is simple: why replicate the journey when you can steal the destination? Better yet, unlike a properly deployed model, stolen weights come stripped of much of the safety work: fewer guardrails, fewer refusals, pure capability. Model-internal guardrails have been shown to be dismantled by relatively simple techniques such as fine-tuning.


In adversary hands, this enables high-risk misuse: military applications, bioweapons (e.g., mirror life), offensive cyber operations, critical-infrastructure disruption, and more. Below, we make the case for prioritizing weight security now and present some of the approaches being explored.



The Recipe: What Model Weights Actually Are

A neural network is, at its core, a massive mathematical function. During training, the model processes enormous amounts of data, gradually adjusting billions of numerical parameters to minimize prediction errors. These parameters, the weights, encode everything the model has "learned" during training.


This knowledge is not human-readable: the weights are dense matrices of floating-point numbers that, combined with the model architecture, produce the behavior we observe.

Think of the architecture as the scaffold, the training data as the curriculum, and the weights as the resulting expertise. That configuration of billions of parameters is what makes GPT-5 behave like GPT-5, and it is the artifact worth protecting.
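

To make the abstraction concrete, here is a minimal, purely illustrative sketch (plain NumPy, nothing like a real training stack): a two-parameter linear model whose weights are nudged by gradient descent to reduce prediction error. Frontier models run the same loop with billions of parameters and vastly more data.

```python
# Illustrative only: a tiny linear model whose "weights" are adjusted by
# gradient descent to reduce prediction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))                  # training data ("the curriculum")
true_w = np.array([3.0, -1.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(2)                                # the weights: the artifact worth protecting
lr = 0.1
for _ in range(200):
    pred = X @ w
    grad = X.T @ (pred - y) / len(y)           # gradient of the mean squared error
    w -= lr * grad                             # adjust weights to reduce error

print(w)   # close to [3.0, -1.5]: "learned" parameters, not human-readable at scale
```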



Where Do These Weights Live?


  • Limited encryption: During inference, weights must exist in plaintext in GPU memory. For large models distributed across GPU clusters, this exposure multiplies: weights and activations flow unencrypted between nodes. Encryption helps with storage and transfer, but it doesn't address the exposure window during actual use (see the sketch after this list).

  • Minimal hardware security: TPMs and HSMs are designed to protect small, high-value secrets like cryptographic keys, not multi-gigabyte model files that need to stream into GPU memory at high throughput. Even if we adapted the concept, there's an ecosystem gap: the AI accelerator world (NVIDIA, AMD, custom TPUs) evolved separately from traditional hardware security. There's no standard way to say "load these weights into the GPU, but only if the secure enclave validates them first." NVIDIA's confidential-computing efforts are starting to bridge this gap, but it's early days.

  • No airgap: Inference clusters are internet-facing by design; that's the whole product. The weights answering chatbot queries sit on machines with network exposure.
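

As a concrete illustration of the first point, here is a minimal sketch of the exposure window, assuming a hypothetical setup where weights are encrypted at rest with a symmetric key: the protection holds for storage and transfer, but the weights must be decrypted into plaintext memory before inference can run.

```python
# Minimal sketch of "limited encryption". File handling and key management are
# illustrative assumptions, not any lab's actual setup.
from cryptography.fernet import Fernet
import numpy as np

key = Fernet.generate_key()                        # in practice: managed by a KMS/HSM
vault = Fernet(key)

weights = np.random.rand(1024).astype(np.float32)  # stand-in for a multi-GB checkpoint
encrypted_blob = vault.encrypt(weights.tobytes())  # protected at rest and in transit

# The exposure window: to serve a request, the weights must be decrypted back
# into plaintext and loaded into (GPU) memory. Encryption at rest does not
# cover this phase.
plaintext = vault.decrypt(encrypted_blob)
live_weights = np.frombuffer(plaintext, dtype=np.float32)
# live_weights now sit unencrypted in memory for the duration of inference.
```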


None of this is negligence, exactly. It's the natural outcome of a field that moved fast and prioritized capability. But it means we're now retrofitting security onto systems that were never architected for it.


The internal models of AI labs, particularly those deemed too powerful or dangerous for public release, represent an equally high-value target for adversaries. Critically, the weights are vulnerable not just in the final deployment environment, but from the moment they are created. Securing model weights throughout the entire lifecycle, from the earliest checkpoint in the training cluster onward, is a critical requirement for managing the immediate risks associated with powerful, ungoverned AI capability.


A Surprising Aside: Distillation Attacks


So far, we've focused on protecting weights from direct exfiltration, but there's an uncomfortable wrinkle worth mentioning:


Unlike cryptographic algorithms, which are specifically constructed so you can't reverse-engineer the key from the ciphertext, LLMs come with no such guarantee. There's no mathematical proof that you can't reconstruct meaningful capability from outputs alone.

And indeed, you can.


Distillation attacks, where a smaller model is trained to mimic a larger one using only its input-output behavior, have proven surprisingly effective. The DeepSeek-on-OpenAI-outputs controversy is an example of this in practice. It is also unclear to what extent a distilled model retains or bypasses the original model's safety guardrails, presenting another major unknown.
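

To make the mechanism concrete, here is a stripped-down sketch of the distillation idea, with a local toy function standing in for a model behind an API (real distillation targets an LLM's token distributions, not a toy regression): the student never sees the teacher's weights, only its answers.

```python
# Sketch of distillation: a "student" learns to mimic a "teacher" purely from
# the teacher's input-output behavior.
import numpy as np

rng = np.random.default_rng(1)

def teacher(x):                       # the adversary can only query this, not inspect its weights
    return np.tanh(x @ np.array([2.0, -1.0]))

# 1. Query the teacher to build a synthetic training set.
queries = rng.normal(size=(2000, 2))
teacher_outputs = teacher(queries)

# 2. Fit a student model on the teacher's answers alone.
student_w = np.zeros(2)
for _ in range(500):
    pred = np.tanh(queries @ student_w)
    grad = queries.T @ ((pred - teacher_outputs) * (1 - pred**2)) / len(queries)
    student_w -= 0.5 * grad

# 3. The student now approximates the teacher without ever touching its weights.
test = rng.normal(size=(5, 2))
print(np.round(teacher(test), 3))
print(np.round(np.tanh(test @ student_w), 3))
```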


Sadly, we don't have a clean explanation for how much capability transfers through distillation, or where the limits are.


The implication for weight security is uncomfortable: even perfect protection against exfiltration doesn't fully solve the problem, because the model's capabilities are projected into the world with every API call.


Distillation is a real threat, but it's a different kind of problem, one that demands its own solutions and whose long-term impact remains unclear. For now, direct weight theft remains the most acute and concretely addressable threat.


The Threat Landscape

To think about securing model weights systematically, it helps to categorize the threat landscape by the level of access the adversary has. 


  • Distillation, as discussed, requires no internal access; it relies solely on external API outputs to mimic the model.

  • Classical Cyber Defense covers the familiar territory: preventing traditional network breaches and lateral movement, protecting weights at rest on storage or employee machines, and so on.

  • Inference Leakage Prevention is the novel problem: weights must be loaded into memory for use, which opens a window of exposure.


These categories form a natural hierarchy. Attackers will always take the easiest route, so strong classical defenses are a prerequisite before inference security even matters. But they are not sufficient: once the perimeter is hardened, the inference endpoint becomes the next critical target. A robust security strategy must address all three frontiers simultaneously; leaving any single vector unguarded guarantees that it will be the adversary's entry point.


The next section focuses on inference security specifically, since it's both the least mature and the most unique to AI systems.


Securing the Inference Layer: Three Emerging Approaches

To address the specific threat of inference leakage, the industry is exploring three competing architectures. Each encodes a fundamentally different assumption about where "trust" should reside.


Hardware: Trust the Silicon

This approach leverages Trusted Execution Environments (TEEs) and secure enclaves. The concept is to have the hardware itself generate cryptographic attestations that the code running on it is legitimate and untampered with. This is the most mature option today, since TEEs are already shipping in production hardware, but it forces reliance on hardware vendors like NVIDIA or Intel to manage private keys properly. Furthermore, TEEs have a history of vulnerabilities (e.g., sgx.fail), and this approach inevitably creates hardware lock-in.
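

As a rough illustration of the flow (not NVIDIA's or Intel's actual attestation API), here is a sketch of the verifier-side policy: weights are released to the enclave only if a signed measurement of the running code matches what the weight owner expects. The key pair below stands in for a vendor-rooted attestation key.

```python
# Conceptual sketch of attestation: the enclave signs a hash ("measurement") of
# the code it is running; the weight owner checks the signature and the
# measurement before releasing weights. All names here are illustrative.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

vendor_key = Ed25519PrivateKey.generate()          # stands in for a hardware vendor's key
vendor_pub = vendor_key.public_key()

inference_code = b"serve_model_v1.2 binary ..."    # what is actually running in the TEE
measurement = hashlib.sha256(inference_code).digest()
quote = vendor_key.sign(measurement)               # "attestation quote" produced in hardware

# Weight owner's policy: release weights only to a known-good measurement.
EXPECTED = hashlib.sha256(b"serve_model_v1.2 binary ...").hexdigest()

def release_weights(measurement: bytes, quote: bytes) -> bool:
    vendor_pub.verify(quote, measurement)          # raises if the quote is forged
    return measurement.hex() == EXPECTED           # and the code is the code we audited

print(release_weights(measurement, quote))         # True -> ship weights to the enclave
```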


Verification: Trust but Verify

This method sidesteps hardware trust by asking a statistical question: does the output match what we expect? By rerunning a sample of inference requests in an independent, trusted environment, defenders can probabilistically detect manipulation. This avoids vendor lock-in and performance penalties on every request, but implementation is tricky. Non-determinism in floating-point calculations means outputs rarely match perfectly, requiring complex verification logic.
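

A minimal sketch of what such a spot-check might look like, assuming hypothetical `production_model` and `trusted_replica` callables that return numeric outputs; choosing the tolerance parameters well is where the real difficulty hides.

```python
# Sketch of "trust but verify": re-run a random sample of requests on an
# independent, trusted replica and flag outputs that diverge beyond a
# floating-point tolerance. `production_model` and `trusted_replica` are
# hypothetical callables returning logits/embeddings as arrays.
import numpy as np

rng = np.random.default_rng(42)

def audit(requests, production_model, trusted_replica,
          sample_rate=0.02, rtol=1e-3, atol=1e-5):
    """Probabilistically detect tampered inference by spot-checking outputs."""
    suspicious = []
    for req in requests:
        if rng.random() > sample_rate:            # audit only a small sample
            continue
        observed = production_model(req)
        expected = trusted_replica(req)
        # Exact equality is hopeless under GPU non-determinism; compare within
        # a tolerance instead.
        if not np.allclose(observed, expected, rtol=rtol, atol=atol):
            suspicious.append(req)
    return suspicious

# Toy usage: identical models, so nothing should be flagged.
model = lambda x: np.tanh(np.asarray(x))
print(audit([rng.normal(size=8) for _ in range(500)], model, model))   # []
```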


Zero-Knowledge Proofs: Trust Nothing, Prove Everything

The most mathematically robust approach uses Zero-Knowledge (ZK) proofs to cryptographically certify that every inference computation was executed correctly. This relies solely on math rather than manufacturer keys or private infrastructure. The historical barrier has been overhead: conventional ZK proofs can be 1000x slower than the computation itself. However, startups like Attestable are now developing approximation methods that aim to reduce this tax to roughly 1.5x, potentially moving ZK verification from theory to production.


Closing Thoughts

Securing model weights is not a single problem. There is little value in deploying advanced Zero-Knowledge proofs to secure the runtime if the model weights are sitting in an unsecured S3 bucket reachable via a phished employee credential. Conversely, as classical defenses harden, sophisticated actors will inevitably turn their attention to the inference layer, the one moment where weights must be exposed to be useful.


The industry is at a crossroads on how to handle this runtime exposure.

Leaving the inference window unguarded is a bet that the adversary cannot breach the perimeter, a bet that history suggests is unwise.


The frameworks we’re using (FedRAMP, SOC 2) were designed for a world where data breaches were about credit cards and personal information. They weren’t designed for facilities manufacturing capabilities that could shift global power balances.


If you work in AI security, here’s what matters:

For Researchers: The evals gap is real. We need comprehensive testing frameworks that include cyber capabilities, misuse potential, and catastrophic failure modes.


For Practitioners: Start thinking about security constraints that limit capability by design (maybe even bandwidth throttling!) rather than just adding defensive layers.


For Policymakers: The incentive structures are broken. Private companies won’t invest adequately in security without regulatory pressure or economic incentives that make it worthwhile.


Questions Worth Entertaining

  • If AI datacenters are manufacturing “state-level super weapons,” what does adequate security actually look like?

  • How do we build trust for federated learning or distributed training when every node is a potential adversary?

  • The algorithms are created by relatively few people. Should we focus disproportionately on protecting those individuals?

  • Who actually pays for defense research? What incentive structures work when offense resources dwarf defense?


Asher opened by suggesting some problems are worth quitting your job to work on. After this conversation, I understand why. These aren’t just technical challenges; they’re questions about how we govern transformative technology in an adversarial world.

The issues raised in this summary could, in the mid to long term, change global power and economic dynamics. We had better address the burning questions, allocate resources toward a strategic vision, and give companies incentives that make investing in security worthwhile.



Interested in doing research in AI security with field leaders?



Want to dive deep into interesting research?



