NIST Is Now Testing Big Tech’s AI Before It Ships

Last Updated on May 10, 2026 by Arnav Sharma

The U.S. government just got early access to AI models from Google DeepMind, Microsoft, and xAI before those models go public. That’s not a small thing.

NIST’s Center for AI Standards and Innovation (CAISI) announced agreements with all three companies this week to conduct pre-deployment evaluations and targeted research into their frontier AI capabilities. The move builds on similar deals signed with OpenAI and Anthropic back in 2024, effectively pulling every major AI lab into a formal government security testing programme.

If you work in AI security or AI governance, this is worth paying close attention to.


What CAISI Actually Does

CAISI sits within the National Institute of Standards and Technology (NIST), the same agency that gave us the NIST Cybersecurity Framework, which in turn falls under the U.S. Department of Commerce. Think of it less like a regulator and more like a testing lab with national security clearance.

Its core function is AI measurement science: building the technical tools and methodologies needed to evaluate what AI systems can actually do, what risks they carry, and where their strengths and limitations lie. The NIST AI Risk Management Framework (AI RMF), one of the parent agency’s better-known outputs, is a voluntary framework that gives organisations a structure for managing AI risks across the entire AI lifecycle, from development through deployment.

CAISI has completed more than 40 evaluations of AI systems to date, including assessments of unreleased models. Developers frequently provide those models with reduced or removed safeguards to support evaluations focused on national security-related capabilities. Getting access to a model with its guardrails partially stripped is a very different exercise from testing a polished production system. It tells evaluators what the AI is genuinely capable of, not just what it’s been trained to say it won’t do.


Who’s In and What They’ve Agreed To

The new agreements with Google DeepMind, Microsoft, and xAI build on previously announced partnerships with OpenAI and Anthropic. Terms have been updated to include directives from the Secretary of Commerce and President Trump’s AI Action Plan.

Under these agreements, CAISI can:

  • Conduct pre-deployment evaluations of frontier AI models before public release
  • Continue testing after AI systems are deployed, tracking AI-related risks that emerge post-launch
  • Run evaluations in classified environments where national security implications can be assessed directly
  • Draw in evaluators from across government through the TRAINS Taskforce, an interagency group focused on AI-related national security issues

The evaluation scope covers testing, collaborative research, and best-practice development for commercial AI systems. CAISI evaluates “demonstrable risks” associated with artificial intelligence systems: cybersecurity, biosecurity, and chemical weapons risks all sit on that list.

Microsoft was candid about the limits of self-assessment. Chief responsible AI officer Natasha Crampton said that evaluations tied to national security and public safety require close collaboration between industry and governments with deep technical and security expertise, and that Microsoft would feed findings directly into how it designs, tests, and deploys AI, and share best practices broadly.


Why This Is Happening Now

The timing matters. The arrangement represents a significant reversal for the Trump administration, which had previously scrapped AI security review measures it considered overly burdensome. What changed the calculation was Anthropic’s disclosure that its latest model, Claude Mythos, was too dangerous to publicly release because of its alarming ability to find software vulnerabilities.

When an AI lab tells the government it has built something it doesn’t feel comfortable releasing, policymakers pay attention. CAISI Director Chris Fall put it plainly: “Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications. These expanded industry collaborations help us scale our work in the public interest at a critical moment.”

Separately, OpenAI has been working with CAISI on GPT-5.5-Cyber, a model built specifically for cyber defence. OpenAI is also developing a responsible AI deployment strategy for that model, including a playbook for distributing it across public services. The use of AI for automated security tasks is advancing fast, and the wider AI community is clearly trying to get governance structures in place before capability runs too far ahead of oversight.


What CAISI’s Testing Actually Covers

For practitioners thinking about what “evaluation of AI” means at this level, it goes well beyond functional testing. CAISI uses both quantitative and qualitative approaches to assess AI systems, measuring things like robustness under adversarial conditions, explainability of AI decisions, and the security risks associated with AI systems that handle sensitive information.

Identified risks in frontier models range from data security gaps and access-control weaknesses to cyber threats that emerge from AI-generated attack tooling. High-risk AI systems with national security implications get tested against threat models that wouldn’t appear in standard commercial evaluations, including scenarios where the AI itself becomes a vulnerability in the hands of an adversary.

I’ve seen a similar pattern in architecture reviews for AI development and deployment in enterprise settings. The risks associated with artificial intelligence in production are rarely the ones that appeared in pre-launch testing. Machine learning models behave differently under real-world load, adversarial inputs, and dataset drift than they do in controlled conditions. That’s exactly why the post-deployment component matters here, and why metrics that only measure capability at launch give an incomplete picture.


What This Means for AI Governance and Procurement

The Shift Toward Security-by-Design

Fritz Jean-Louis of Info-Tech Research Group described the CAISI agreements as a shift toward proactive security for agentic AI, enabling government-led security testing before and after organisations deploy AI systems, strengthening visibility into autonomous behaviours, and accelerating AI standards development to manage AI risks more systematically. That framing is right. Historically, AI development and deployment have operated on a ship-then-patch model. This programme inserts a checkpoint earlier.

Vendor Status Is Now a Risk Factor

Analysts describe choosing a vendor without CAISI partnership status as a “massive contagion risk” for enterprises with federal contracts. One was blunt: “We have entered an era where a model’s utility to the state is a key predictor of its long-term viability in the enterprise stack.”

For anyone implementing AI in environments with government adjacency, vendor risk assessment now includes whether a vendor has participated in these evaluations and what identified risks, if any, were flagged in the process. Whether to adopt AI technologies without that visibility is a risk-tolerance question worth asking explicitly.
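
One way to make that question explicit is to record it in the vendor review itself. The following is a minimal sketch, not an official CAISI or NIST schema: the class name, field names, and question wording are all hypothetical, chosen only to illustrate tracking evaluation status alongside any flagged risks.

```python
from dataclasses import dataclass, field


@dataclass
class VendorEvaluationRecord:
    """Hypothetical record of what a reviewer knows about a vendor's evaluation status."""
    vendor: str
    has_evaluation_agreement: bool  # has the vendor announced an evaluation agreement?
    findings_disclosed: bool        # has the vendor shared any findings with us?
    flagged_risks: list[str] = field(default_factory=list)


def open_questions(record: VendorEvaluationRecord) -> list[str]:
    """Return the questions a reviewer still needs answered for this vendor."""
    questions = []
    if not record.has_evaluation_agreement:
        questions.append("No known evaluation agreement: is that within our risk tolerance?")
    elif not record.findings_disclosed:
        questions.append("Agreement exists but no findings shared: what did the evaluation cover?")
    # Every flagged risk needs someone responsible for mitigating it.
    for risk in record.flagged_risks:
        questions.append(f"Flagged risk needs a mitigation owner: {risk}")
    return questions


if __name__ == "__main__":
    record = VendorEvaluationRecord(
        vendor="ExampleVendor",  # placeholder name, not a real vendor
        has_evaluation_agreement=True,
        findings_disclosed=False,
    )
    for question in open_questions(record):
        print(question)
```

The point of the sketch is that "was the model tested?" is only the first field. The unanswered questions it surfaces are what a procurement review would actually need to act on.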

The Voluntary Framework Has Real Limits

These are voluntary agreements. Nothing here legally requires a vendor to disclose AI risks or pause a release based on CAISI findings. The NIST AI RMF itself is also a voluntary framework: organisations choose whether to adopt it. A voluntary approach works when incentives align, and right now there’s enough reputational and political benefit to cooperating. But one researcher’s challenge on LinkedIn is worth remembering: “CAISI will need to define, and publish, what it’s testing for not just who it’s testing with.” Evaluation without published metrics and methodology is difficult to act on.

Comprehensive risk management for AI requires more than knowing a model was tested. It requires knowing what the testing covered, what the tools used in that evaluation actually measure, and how the organisations involved plan to mitigate risks and address identified gaps. The ability to evaluate AI and manage AI risks systematically depends on that transparency, and that piece is still missing. Robust security measures can’t be scoped without knowing what the assessment found.


What to Watch Next

  • An executive order may be close. Reports suggest the White House is preparing a formal vetting system for AI vendors, which would give these evaluations more regulatory weight.
  • The Anthropic situation. In March, the Department of Defense formally designated Anthropic a security risk despite its CAISI partnership. Cooperation with NIST AI evaluation processes doesn’t guarantee protection from other government actions.
  • IP protections remain unresolved. Giving government evaluators access to unreleased AI models raises legitimate questions about how intellectual property gets protected, a potential hurdle the current agreements don’t fully address.
  • The advancement of AI technologies is only going to accelerate. The AI innovation cycle runs faster than most governance frameworks can track. The value of trustworthy AI as a concept depends on whether the evaluation infrastructure behind it can keep pace.

The CAISI programme isn’t a complete answer to the risks associated with AI systems at the frontier. But it is the most serious attempt the U.S. government has made to get ahead of those AI-related risks before they become incidents. Managing risks related to frontier AI at the national level is genuinely new territory, and it requires exactly this kind of structured, government-led engagement. For security practitioners, watching how this evaluation methodology matures and what standards emerge from it is now part of the job.

Arnav Sharma · Microsoft MVP · MCT
Microsoft Certified Trainer · Cloud · Cybersecurity · AI

I help organisations secure their cloud infrastructure and stay ahead of evolving cyber threats. Microsoft MVP and Certified Trainer, author of Mastering Azure Security, and founder of arnav.au — a platform for practical Cloud, Cybersecurity, DevOps and AI content.
