Last Updated on June 4, 2026 by Arnav Sharma
Prompt injection is the strange new cousin of SQL injection. The shape of the problem is familiar (untrusted input getting mixed with trusted instructions) but the fix we’d usually reach for doesn’t quite exist yet. You can’t parameterize a natural language prompt. You can’t escape a sentence the way you escape a quote. And when the “parser” is a 400-billion-parameter model that was trained to be helpful, every defense you build is probabilistic by nature.
Microsoft has spent the last two and a half years assembling a response to this, and the result is not a single product. It’s a stack. Prompt Shields gets the headlines, but the real story is how Azure AI Content Safety, Spotlighting, Defender for Cloud, Entra Global Secure Access, and PyRIT fit together (and where they don’t).
I’ve been watching this space closely because the same questions keep coming up in architecture reviews: Is Prompt Shields enough? Do we need Spotlighting on top? Where does Defender actually sit in the flow? And what does a real bypass look like? So here’s the walkthrough I wish existed when I first started mapping this out.
The umbrella term nobody calls it
“AI Prompt Defense” isn’t a SKU you can buy. Microsoft uses it as shorthand for a collection of overlapping controls that each tackle one slice of the problem:
- Prompt Shields is the classifier. It reads user prompts and grounding documents and returns a binary verdict on whether an attack is present.
- Spotlighting is a prompt-engineering technique that marks untrusted content so the model treats it as data rather than instructions.
- Entra Global Secure Access Prompt Injection Protection enforces guardrails at the network layer, which means it can cover third-party SaaS LLMs like ChatGPT, Claude, or Gemini without any application changes.
- Microsoft Defender for Cloud AI Threat Protection turns Prompt Shields detections (plus threat intelligence) into XDR alerts your SOC can triage.
- PyRIT is the open-source red team framework for stress-testing all of the above.
The framing Microsoft uses in its July 2025 MSRC post is worth sitting with: indirect prompt injection is called “an inherent risk that arises from the probabilistic language modelling, stochastic generation, and linguistic flexibility of modern LLMs,” and Microsoft openly states that deterministic detection is still an open research problem. That’s the honest starting point. Everything downstream is defense-in-depth on top of a probabilistic base.
Prompt Shields, up close
From Jailbreak Risk Detection to something bigger
Prompt Shields started life in November 2023 as a preview feature called Jailbreak Risk Detection. In March 2024 it was renamed, and Microsoft added detection for indirect prompt attacks (the ones that come in through documents rather than the user’s own typing). Spotlighting landed at Build 2025, and Defender for Cloud’s AI Threat Protection went GA on 1 May 2025. Today Prompt Shields is baked into Azure AI Foundry, Azure OpenAI content filters, Defender for Cloud, and Entra Global Secure Access.
Under the hood, it’s a binary classifier trained on known prompt injection techniques across eight languages. You call it through a single REST endpoint and get back an attackDetected flag for the user prompt and for each supplied document:
Request:
POST <endpoint>/contentsafety/text:shieldPrompt?api-version=2024-09-01
{
  "userPrompt": "<user text>",
  "documents": ["<grounding doc 1>", "<grounding doc 2>"]
}
Response (one documentsAnalysis entry per supplied document):
{
  "userPromptAnalysis": { "attackDetected": true },
  "documentsAnalysis": [ { "attackDetected": false }, { "attackDetected": false } ]
}
Two shields, two threat models. The user prompt shield (the old Jailbreak Risk Detection) is aimed at direct attacks, where the person typing is the adversary. The document shield is aimed at indirect cross-prompt injection attacks, where a third party has smuggled instructions into a grounding source that your RAG pipeline will later pull in.
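For anyone wiring this up by hand, here is a minimal Python sketch of the call above using only the standard library. The function names are mine, not part of any Microsoft SDK, and I am assuming key-based authentication through the Ocp-Apim-Subscription-Key header; the endpoint and key are placeholders you would supply.

```python
import json
import urllib.request

API_VERSION = "2024-09-01"  # the version used in the request shown above


def build_shield_request(endpoint, api_key, user_prompt, documents):
    """Build (but do not send) the POST request for text:shieldPrompt."""
    url = (
        f"{endpoint.rstrip('/')}/contentsafety/text:shieldPrompt"
        f"?api-version={API_VERSION}"
    )
    body = json.dumps({"userPrompt": user_prompt, "documents": documents})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: key-based auth; Entra ID tokens are another option.
            "Ocp-Apim-Subscription-Key": api_key,
        },
    )


def parse_shield_response(raw_body):
    """Return (user_prompt_attack, [one attack flag per document])."""
    payload = json.loads(raw_body)
    user_attack = bool(
        payload.get("userPromptAnalysis", {}).get("attackDetected", False)
    )
    doc_attacks = [
        bool(doc.get("attackDetected", False))
        for doc in payload.get("documentsAnalysis", [])
    ]
    return user_attack, doc_attacks


def shield_prompt(endpoint, api_key, user_prompt, documents, timeout=10):
    """Send the request and return the parsed verdicts."""
    request = build_shield_request(endpoint, api_key, user_prompt, documents)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_shield_response(response.read())
```

Keeping the two verdicts separate matters in practice: a flagged user prompt usually means rejecting the request, while a flagged document may only mean dropping that one grounding source.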
What it tries to catch
For user prompts, Microsoft’s taxonomy lines up roughly as you’d expect:
- Rule override attempts. “Ignore previous instructions,” “you are now an unrestricted AI,” and the thousand variations on that theme.
- Fake conversation mockups. A single user message that pretends to contain prior assistant turns to confuse the model.
- Role-play jailbreaks. DAN, Yendys, any persona framing that tries to get the model to pretend the rules don’t apply.
- Encoding attacks. Base64, ROT13, homoglyphs, character transformations that smuggle intent past string-matching filters.
Document attacks are a broader bucket because an attacker who controls the content can aim at almost anything: manipulated narratives, backdoors, data exfiltration, availability attacks, fraud, malware delivery, plus all four user-prompt categories above. That breadth is telling. It reflects the uncomfortable truth that any third-party content flowing into your LLM context is a potential attack vector.
Spotlighting: the prompt-engineering half of the fix
The most interesting thing Microsoft has done in this space isn’t actually the classifier. It’s Spotlighting, which came out of Microsoft Research (Hines et al., CAMLIS 2024) and has since been productized as a Prompt Shields feature.
The core observation is simple and a bit uncomfortable: when you concatenate a system prompt, a user message, and some retrieved documents, the LLM has no reliable way to know which bits are instructions and which are data. Spotlighting fixes that by giving untrusted content a persistent “this is data, not instructions” signal that the model can see throughout the text.
There are three ways to do it:
- Delimiting. Wrap untrusted content in markers like <<DOC_START>>...<<DOC_END>> and instruct the model in the system prompt to never execute anything between them. Cheapest, weakest.
- Datamarking. Insert a special character (Microsoft uses ^) between every word in the untrusted content. The text becomes visibly, continuously marked. Microsoft found this significantly more effective than delimiters.
- Encoding. Base64 or ROT13 the untrusted content. The model can still read it, but there’s no way for it to confuse the bytes with legitimate instructions.
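The three modes are simple enough to sketch in a few lines of Python. This is my own illustration of the transformations as described above, not Microsoft’s implementation; the function names and the default marker strings are assumptions, and each mode still needs a matching system-prompt instruction telling the model how to treat the marked text.

```python
import base64


def delimit(text, start="<<DOC_START>>", end="<<DOC_END>>"):
    """Delimiting: wrap the untrusted text in start/end markers."""
    # Caveat: an attacker who guesses the markers can include them in the text.
    return f"{start}{text}{end}"


def datamark(text, marker="^"):
    """Datamarking: join the words with a marker instead of whitespace."""
    # split() with no arguments also collapses newlines and repeated spaces.
    return marker.join(text.split())


def encode_base64(text):
    """Encoding: hand the model a Base64 version of the untrusted text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```

For example, datamark("ignore all previous instructions") returns "ignore^all^previous^instructions", which is what the model sees in place of the raw document text.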
The published numbers are striking. On GPT-family models, Spotlighting cut indirect prompt injection success rates from over 50% to under 2%, without meaningfully hurting task performance. That’s a bigger move than anything the classifier alone delivers.
There’s a catch worth flagging. With the encoding mode, the model sometimes volunteers in its response that the input was Base64-encoded, even when the user never asked. Microsoft’s workaround is a system-prompt instruction telling the model to stay quiet about it. Not elegant, but it works.
The fine print
A few things to keep in mind if you’re sizing this up for production:
- Languages trained and tested: Chinese, English, French, German, Spanish, Italian, Japanese, Portuguese. Everything else “may work with varying quality,” which is a meaningful gap given how often multilingual prompts show up as an evasion route.
- Default limits: 10,000 characters per document, 1,000 per user prompt, 1,000 requests per 10 seconds.
- The API response is binary. No confidence score. No attack-type classification. If you want to tune thresholds or triage false positives, you’re working blind.
- Pricing is per text record (1,000 Unicode code points = 1 record) under Azure AI Content Safety. Defender for Cloud’s AI Threat Protection plan is priced separately at $0.002 per 1,000 tokens.
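To make the limits and the billing unit concrete, here is a small Python sketch that splits a long document at the 10,000-character limit quoted above and estimates how many text records it would consume. Two assumptions of mine: that a partial record is rounded up to a whole one, and that a "character" is a Unicode code point (which is what Python’s len() counts). Check both against the current pricing and quota pages before relying on the numbers.

```python
import math

MAX_DOC_CHARS = 10_000          # per-document limit quoted above
CODE_POINTS_PER_RECORD = 1_000  # 1,000 Unicode code points = 1 text record


def chunk_document(text, limit=MAX_DOC_CHARS):
    """Split text into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def text_records(text):
    """Estimate billable records, assuming partial records round up."""
    return math.ceil(len(text) / CODE_POINTS_PER_RECORD)
```

Fixed-size splitting is the crudest option: an injected instruction that straddles a chunk boundary gets cut in two, so overlapping or sentence-aware chunks may be worth the extra records.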
Microsoft’s defense-in-depth playbook
Microsoft’s Zero Trust guidance splits the defense strategy into three layers, and this is the part I find myself quoting most often in design reviews because it reframes the question away from “will the classifier catch this?” toward “what happens if it doesn’t?”
Prevention
Hardened system prompts. Spotlighting on untrusted content. Information Flow Control to isolate untrusted content in quarantined inference environments. Least privilege for agents, with short-lived credentials that get revoked after each use.
Detection
Prompt Shields. Plan drift detection to catch multi-step agents veering off their intended task flow. Critic agents that audit inputs and outputs in real time. Tool chain analysis that blocks risky sequences of tool calls before they execute.
Impact mitigation
Data governance that limits what the LLM can even see. Deterministic blocking of known exfiltration techniques (more on this in a moment). Human-in-the-loop confirmation for risky actions. User consent workflows.
Microsoft’s own framing is the bit that matters: “We design systems such that even if some prompt injections are successful, this will not lead to security impacts for customers.” That’s the mindset shift. Treat the classifier like a stack canary or ASLR — a probabilistic layer you assume will eventually be bypassed, with deterministic mitigations behind it to limit the blast radius.
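As an illustration of what a deterministic layer behind the classifier can look like, here is a minimal Python sketch of a tool-call gate that combines least privilege with human-in-the-loop confirmation. Everything in it is hypothetical: the tool names, the ToolCall type, and the three outcomes are mine, not part of any Microsoft agent framework.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; a real agent would list its own side-effecting tools.
RISKY_TOOLS = {"send_email", "delete_file", "http_post"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def decide(call, allowed_tools, human_approved=False):
    """Return 'deny', 'needs_confirmation', or 'allow' for a tool call."""
    # Least privilege: anything outside this agent's allowlist is refused.
    if call.name not in allowed_tools:
        return "deny"
    # Human-in-the-loop: side-effecting tools wait for explicit approval.
    if call.name in RISKY_TOOLS and not human_approved:
        return "needs_confirmation"
    return "allow"
```

The point of the design is that none of these branches depends on whether the prompt looked malicious. A successful injection can still ask for send_email, but it cannot approve itself.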
Where Defender for Cloud fits
Once Prompt Shields is firing, those signals need somewhere to go. That’s where the Defender for Cloud AI Threat Protection plan comes in. It sits in front of the model API, so coverage applies regardless of how the request arrived (Foundry playground, a custom app, a Python script on someone’s laptop). It covers Azure OpenAI and the Azure AI Model Inference catalog, which means Meta, Mistral, and other hosted models are in scope too.
The alert catalog is where this starts to feel like proper XDR rather than a content filter:
- Jailbreak attempt blocked and Jailbreak attempt detected (not blocked) — the Prompt Shields signals surfaced as alerts
- ASCII Smuggling — invisible Unicode tag characters hiding instructions
- Sensitive Data Exposure in AI Model
- Credential Theft via AI Model
- Suspicious User Agent accessing AI resource — tied to Microsoft’s threat intelligence
- Tor-originated access to AI resources
- System-Prompt Extraction Attempt — reconnaissance to leak metaprompts
- Phishing URL sent to AI agent
- Malware in uploaded ML model
Each alert carries MITRE ATT&CK tactic tags (Privilege Escalation, Defense Evasion, Reconnaissance, Collection, Exfiltration), and they flow into Defender XDR alongside endpoint, identity, and cloud signals.
There’s a per-tenant toggle called Suspicious prompt evidence that includes snippets of the flagged prompt in the alert for SOC triage. If you leave it off, Microsoft still analyzes the content but masks it in alert output. Worth discussing with your privacy team before you flip it on.
For policy-based enforcement there’s a built-in Azure Policy (Enable threat protection for AI workloads, definition ID 7e92882a-2f8a-4991-9bc4-d3147d40abb0) that’ll make sure the plan is turned on across subscriptions.
The network-layer twist: Entra Global Secure Access
This one changes the audience. Prompt Shields the API is something a developer bolts into an application. Prompt Injection Protection in Entra Global Secure Access is something a CISO deploys as a corporate network control.
It runs as part of Microsoft’s Security Service Edge offering, which means it can enforce prompt injection guardrails at the network layer for any SaaS LLM: no code changes, no SDK integration, no cooperation from the vendor. There are preconfigured extractors for eleven commercial LLMs (ChatGPT, Claude, Cohere, Deepseek, Gemini, Grok, Meta AI, Mistral, Perplexity, Pi, Qwen), with custom JSON-path support for anything else. Enforcement is delivered through the Global Secure Access Client on managed endpoints plus TLS inspection, and the policy object is called “Prompt policy,” which links to a Security profile.
Actions are Block or Allow-with-logging. Limits are text only (no file attachments), JSON-based apps only, 10,000 character max per prompt.
The mental model I find useful: this is a DLP policy for LLMs. Same deployment story, same operational fit, same limitations around encrypted traffic and unmanaged devices.
What breaks: real bypasses in the wild
Here’s where any honest writeup has to slow down. Microsoft ships the controls, publishes the research, and acknowledges the gaps. The research community and production attackers have already shown where the gaps are. If you’re making risk decisions on the back of Prompt Shields, you need to know how it’s been beaten.
EchoLeak: the zero-click that changed the conversation
EchoLeak (CVE-2025-32711, CVSS 9.3) was disclosed by Aim Labs in June 2025 and is the first publicly documented zero-click prompt injection to achieve practical data exfiltration from a production LLM system. Target: Microsoft 365 Copilot.
The attack chain is worth walking through because every step teaches something about why classifier-only defenses struggle.
- An attacker sends a benign-looking email to the victim’s Outlook inbox. The email is worded for a human reader. It never mentions “AI,” “Copilot,” or “ignore previous instructions.”
- Microsoft’s Cross Prompt Injection Attempt classifier (the Copilot-integrated cousin of Prompt Shields) is trained on content directed at an AI. Phrasing aimed at a human reader doesn’t trip it.
- Later, the victim asks Copilot something completely unrelated. Copilot’s retrieval engine dutifully pulls the attacker’s email into context.
- The hidden instructions take effect. Copilot mixes untrusted attacker input with the user’s sensitive context (what Aim Labs called an LLM Scope Violation) and produces a response that exfiltrates data.
- To bypass Copilot’s link redaction, the attackers used reference-style Markdown ([text][ref] with the reference defined separately) instead of inline URLs.
- For the actual exfil, they embedded stolen data into image URLs pointing at a Microsoft Teams proxy domain allowed by Copilot’s Content Security Policy. The browser auto-fetched the image, the URL carried the data, and the attacker received it.
Aim Labs also demonstrated RAG spraying, where malicious prompts get sprayed across many documents to maximize the odds of one landing in context.
Exposed data: anything in Copilot’s scope. Chat logs, OneDrive, SharePoint, Teams messages, preloaded org data.
Microsoft patched it server-side (no customer action required) as part of June 2025 Patch Tuesday, with no evidence of exploitation in the wild before disclosure. But EchoLeak bypassed multiple of Microsoft’s own defenses: the XPIA classifier, the link redactor, and the CSP. It’s the strongest validation I’ve seen of the “assume bypass” philosophy Microsoft articulates — and the clearest illustration of why probabilistic detection alone isn’t enough.
Hackett et al. on guardrail evasion
The paper you want to read if you want to pressure-test your assumptions about classifier-based defenses is “Bypassing LLM Guardrails” by Hackett, Birch, Trawicki, Suri, and Garraghan (LLMSEC 2025 workshop at ACL). They tested six guardrail systems including Azure Prompt Shield as a black box.
Headline findings:
- Up to 100% evasion success against Azure Prompt Shield using simple techniques.
- Character injection (zero-width Unicode, homoglyphs, Unicode tag smuggling, emoji smuggling) bypassed most guardrails while leaving the prompt fully readable to the downstream LLM.
- Emoji smuggling hit 100% evasion on Azure Prompt Shield and ProtectAI v2.
- Adversarial ML evasion via TextAttack (Bert-Attack, PWWS, TextBugger, DeepWordBug) reached 73.11% ASR for prompt injection.
- Transferability: the authors used a white-box model (Protect AI v2) to compute which words to perturb, then applied those perturbations to evade the black-box Azure Prompt Shield.
The line from their conclusion that I keep thinking about: relying on a classifier as a standalone defense is “much like relying solely on a web application firewall to secure a modern web application.” Which is to say, it’s a layer. Not a solution.
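To see why character injection is so cheap for the attacker, here is a toy Python demonstration. The naive_filter function is a stand-in substring check of my own, not Prompt Shields or any guardrail from the paper, and the normalization step is one commonly suggested pre-processing idea rather than something the authors or Microsoft prescribe.

```python
import unicodedata


def inject_zero_width(text, char="\u200b"):
    """Insert a zero-width space between every character of the text."""
    return char.join(text)


def naive_filter(text, blocked=("ignore previous instructions",)):
    """Toy stand-in for a string-matching guardrail (not a real classifier)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked)


def normalize(text):
    """Apply NFKC, then drop format characters (Unicode category Cf)."""
    # Cf covers zero-width spaces/joiners and the Unicode tag characters.
    # Caveats: this also strips joiners that legitimate emoji and some
    # scripts rely on, and it does nothing about cross-script homoglyphs.
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

An obfuscated payload built with inject_zero_width slips past naive_filter but is flagged again after normalize. A trained classifier is not a substring match, so treat this as an intuition pump for the paper’s findings rather than a reproduction of them.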
Crescendo: Microsoft’s own research showing multi-turn bypasses
This one is interesting because it came out of Microsoft (Mark Russinovich, Ahmed Salem, Ronen Eldan — USENIX Security 2025). Crescendo is a multi-turn jailbreak where the attacker starts with a benign question, then repeatedly references the model’s own prior answers to escalate gradually toward the target. It exploits the model’s tendency to pay attention to recent text, particularly text the model itself generated. Usually succeeds in under ten turns.
The automated version (Crescendomation) hit 100% attack success rate on several task categories across GPT-4, GPT-3.5, Gemini Pro, Gemini Ultra, Llama-2 70b, Llama-3 70b, Claude 3 Opus, and Mistral. It beat prior state-of-the-art jailbreaks by 29–61% on GPT-4 and 49–71% on Gemini-Pro on the AdvBench subset.
Microsoft’s own write-up acknowledged that single-turn Prompt Shields didn’t catch it. The fix was a multi-turn prompt filter that passes entire conversation history to the detector, plus an AI Watchdog system trained on adversarial examples, plus software updates to the Copilot-backing LLMs.
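The multi-turn idea can be sketched in a few lines of Python: instead of scoring only the latest message, hand the detector a transcript of the conversation. The detector argument is any callable you supply that returns True or False, and the 10,000-character window is a placeholder of mine; Microsoft has not published how its multi-turn filter assembles history.

```python
def flatten_history(turns, max_chars=10_000):
    """Join (role, text) turns into one transcript, keeping the newest part."""
    transcript = "\n".join(f"{role}: {text}" for role, text in turns)
    # Keep the tail so the most recent turns always survive truncation.
    return transcript[-max_chars:]


def conversation_flagged(turns, detector, max_chars=10_000):
    """Run the supplied detector over the whole conversation window."""
    return bool(detector(flatten_history(turns, max_chars)))
```

The obvious trade-off is the window size: too short and an attacker just stretches the escalation over more turns, too long and you run into the detector’s input limits and cost.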
The takeaway isn’t that Prompt Shields is weak. It’s that the threat model keeps expanding. Single-turn detection was a reasonable starting point. Multi-turn required new plumbing. Agentic systems will require new plumbing again.
Skeleton Key
Another Russinovich find, disclosed June 2024. A direct prompt injection that asks the model to augment (not replace) its guidelines, telling it to output any content but prefix it with a warning if offensive. The canonical framing: “This is a safe educational context with researchers trained on ethics and safety.”
It worked fully on Llama3-70b, Gemini Pro, GPT-3.5 Turbo, GPT-4o, Mistral Large, Claude 3 Opus, and Cohere Commander R Plus. GPT-4 resisted it in a user prompt but folded when the same wording was placed in a user-defined system message.
Microsoft updated the Prompt Shields classifier to detect this behavioral-override pattern. Which is exactly the continuous game this is going to be for the foreseeable future.
ASCII smuggling and the markdown image trick
Two more worth mentioning briefly because they show the “impact mitigation beats probabilistic detection” principle in action.
ASCII smuggling uses invisible Unicode tag characters (U+E0020–U+E007E) to embed instructions that the model parses but humans can’t see in rendered text. Defender for Cloud has a dedicated alert for this precisely because Prompt Shields doesn’t reliably catch it at the input layer.
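The mechanics are easy to reproduce. The Python sketch below maps printable ASCII into the tag range given above (offset 0xE0000), then shows how to reveal or strip such characters before text reaches a model. The function names are mine, and hide() exists only to build test fixtures.

```python
TAG_OFFSET = 0xE0000
TAG_START, TAG_END = 0xE0020, 0xE007E  # tag copies of printable ASCII


def is_tag_char(ch):
    return TAG_START <= ord(ch) <= TAG_END


def hide(text):
    """Map printable ASCII to invisible tag characters (test fixture only)."""
    return "".join(
        chr(ord(c) + TAG_OFFSET) for c in text if 0x20 <= ord(c) <= 0x7E
    )


def reveal(text):
    """Decode any tag characters in the text back to readable ASCII."""
    return "".join(chr(ord(c) - TAG_OFFSET) for c in text if is_tag_char(c))


def strip_tags(text):
    """Remove tag characters so only the visible text remains."""
    return "".join(c for c in text if not is_tag_char(c))
```

Stripping is a deterministic control: it works the same way every time, whether or not a classifier upstream noticed anything. Logging what reveal() returns before stripping gives the SOC something to look at.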
Markdown image injection, reported by external researcher Johann Rehberger, tricked LLMs into generating Markdown image tags (![alt](url)) that browsers auto-fetched, exfiltrating context data into the attacker’s logs through the image URL. Microsoft’s fix was deterministic blocking of the technique rather than probabilistic detection. Then they generalized the fix to related untrusted link generation patterns.
Those are exactly the kind of mitigations that survive when the classifier misses.
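Here is a rough Python sketch of what deterministic blocking can look like on the output side: strip inline Markdown links and images, plus reference-style link definitions, unless the target host is on an allowlist. The allowlist entry and the regular expressions are my own simplifications (they do not cover every Markdown edge case, such as URLs containing parentheses), and this is not Microsoft’s actual filter.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"contoso.sharepoint.com"}  # hypothetical allowlist

# Inline links and images: [text](url "title") or ![alt](url)
INLINE = re.compile(r"(!?)\[([^\]]*)\]\(\s*<?([^)\s>]+)>?[^)]*\)")
# Reference definitions on their own line: [ref]: url "title"
REF_DEF = re.compile(
    r"^[ \t]{0,3}\[([^\]]+)\]:[ \t]*<?(\S+?)>?(?:[ \t]+.*)?$", re.MULTILINE
)


def is_allowed(url):
    """Allow only https URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def sanitize(markdown):
    """Replace untrusted inline links/images with their text; drop bad refs."""
    cleaned = INLINE.sub(
        lambda m: m.group(0) if is_allowed(m.group(3)) else m.group(2),
        markdown,
    )
    # With its definition removed, [text][ref] renders as plain text.
    return REF_DEF.sub(
        lambda m: m.group(0) if is_allowed(m.group(2)) else "", cleaned
    )
```

One caveat that EchoLeak makes painfully clear: a host allowlist is only as good as the hosts on it. The exfiltration there went through a domain the Content Security Policy already trusted, so allowlisted domains that proxy or redirect arbitrary URLs need the same scrutiny.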
PyRIT: the offensive half of the stack
Prompt Shields is the defense. PyRIT is what Microsoft’s own AI Red Team uses to attack it (and everything else). Released 22 February 2024 under an MIT license, it’s the framework Microsoft has used on 100+ Copilot, Phi, and other red teaming engagements.
The architecture has six components: orchestrators (attack workflows), targets (the LLMs under test), datasets (prompt templates), converters (encoding, language, tone transforms), scorers (classifying responses), and memory. Built-in strategies cover single-turn, multi-turn, Crescendo, PAIR, and tree-of-attacks. Skeleton Key testing was added after that disclosure.
The feature I find most practically useful is comparative testing: run the same attack set against the before and after versions of a metaprompt or guardrail, measure the delta, and decide whether the defense actually improved. Microsoft uses this internally to iterate Copilot metaprompts before shipping updates.
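The before/after comparison boils down to computing attack success rate (ASR) twice over the same attack set. The Python sketch below shows that arithmetic with plain callables; it deliberately does not use PyRIT’s API, and the target and is_success parameters are hypothetical stand-ins for whatever sends a prompt to your system and scores the response.

```python
def attack_success_rate(target, attacks, is_success):
    """Fraction of attack prompts whose response is scored as a success."""
    if not attacks:
        raise ValueError("attack set must not be empty")
    hits = sum(1 for prompt in attacks if is_success(target(prompt)))
    return hits / len(attacks)


def compare_defenses(before, after, attacks, is_success):
    """Run the same attacks against two configurations and report the delta."""
    asr_before = attack_success_rate(before, attacks, is_success)
    asr_after = attack_success_rate(after, attacks, is_success)
    return {
        "before": asr_before,
        "after": asr_after,
        "delta": asr_after - asr_before,  # negative means the defense helped
    }
```

The discipline that matters is holding the attack set and the scorer fixed between runs. If either changes, the delta stops telling you anything about the metaprompt or guardrail you actually modified.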
The mental model: WAF paired with DAST. Defensive product, offensive tooling, continuous feedback loop. If you’re deploying Prompt Shields in production and you’re not running something like PyRIT against it on a schedule, you’re only using half the stack.
Mapping it to the frameworks your auditors care about
Because someone always asks.
- OWASP LLM Top 10 (2025). LLM01 (Prompt Injection) is what Prompt Shields exists for. Coverage is partial by Microsoft’s own admission — you layer it with metaprompt hardening, Spotlighting, and impact mitigation. LLM02 (Sensitive Information Disclosure) gets help from Defender’s “Sensitive Data Exposure” alerts and Purview integration. LLM06 (Excessive Agency) is handled at the impact-mitigation layer through least privilege and human-in-the-loop.
- MITRE ATLAS. AML.T0051 (LLM Prompt Injection) maps to Prompt Shields plus Spotlighting. AML.T0054 (LLM Jailbreak) maps to the user-prompt shield. Defender alerts carry standard ATT&CK tactic tags.
- NIST AI RMF. PyRIT for MAP 2.1 and 2.3 (risk identification). Prompt Shields plus Content Safety evaluations for MEASURE 2.7. Defender for Cloud for MANAGE 4.1.
- EU AI Act. Article 15 (accuracy, robustness, cybersecurity of high-risk AI systems) is where Microsoft markets this stack most directly.
- ISO/IEC 42001. Prompt Shields plus Defender alerts feed the operational controls and monitoring clauses of an AI management system.
Gaps worth raising in your next architecture review
If I had to put a frank list on the table before committing to this as a primary control, here’s what I’d want the room to discuss.
It’s probabilistic. Not deterministic. Microsoft says so explicitly. Prompt Shields will miss attacks. Plan accordingly.
The API is a black box. Binary output. No confidence score, no attack category. Tuning and false-positive triage are harder than they should be.
Eight languages trained and tested. A sophisticated attacker targeting a multinational has an obvious incentive to switch to an underrepresented language.
Character-level evasion is still open. Emoji smuggling and Unicode tag injection achieve up to 100% ASR in published research. Defender added an ASCII Smuggling alert, which is good, but detection at the alert layer isn’t the same as prevention at the classifier.
XPIA isn’t exactly Prompt Shields. The Copilot XPIA classifier that was bypassed in EchoLeak is described separately from the public Prompt Shields API. Microsoft hasn’t fully documented the relationship, which means customers deploying Prompt Shields in their own apps shouldn’t assume they’re getting the identical capability that ships with Copilot.
Agents amplify the threat model. Tool use, memory, multi-step planning — every one of those makes prompt injection worse. Microsoft’s Agent 365 and Defender-for-Agents work addresses this but it’s still early.
Spotlighting has side effects. The Base64 mode occasionally leaks encoding details into user-facing responses. Datamarking is more robust but still probabilistic.
No published production FPR/FNR. The 50%→2% number from Spotlighting is a controlled GPT-family experiment. There’s no public data on how these controls behave on live customer traffic at scale, which is a real limit for anyone doing risk quantification.
Shared responsibility is fuzzy. If Prompt Shields misses an attack that exfiltrates customer data through Copilot, whose fault is it? Microsoft’s defense-in-depth framing implicitly pushes responsibility toward the application developer and their impact-mitigation layers. That’s defensible, and probably correct, but it’s worth being clear-eyed about where the line sits.
How I’d actually deploy this
A rough sequence for a team starting from scratch:
- Start with the metaprompt. Harden the system prompt before you do anything else. Most of the cheap wins live here.
- Add Spotlighting for any RAG or document-ingestion path. This is the single biggest bang-for-buck defense in the stack, and the research numbers back it up.
- Turn on Prompt Shields: block mode on user inputs, annotate mode on documents to begin with. Measure false positives on the document path, then graduate it to block once the signal is clean.
- Enable Defender for Cloud AI Threat Protection. Route the alerts to your SOC. Decide the suspicious-prompt-evidence question with your privacy team.
- Implement impact-mitigation controls. Deterministic blocking of known exfil patterns (markdown image URLs, reference-style links to untrusted domains, untrusted outbound requests). Least privilege for any agent with tool access. Human-in-the-loop on anything that writes or acts.
- Red team it continuously. PyRIT in your CI/CD pipeline. Azure AI Foundry’s adversarial simulator for lighter-weight runs. Measure ASR over time against your own attack corpus and don’t let it drift upward.
- For shadow AI (ChatGPT, Claude, Gemini used outside your apps), layer Entra Global Secure Access Prompt Injection Protection. Treat it like DLP for LLMs.
The takeaway
Microsoft’s AI prompt defense stack is among the most mature on the market right now, and it still isn’t enough on its own. That’s not a criticism. It’s the state of the field. The research community, Microsoft’s own red team, and disclosed CVEs all tell the same story: classifier-based defenses are a necessary layer, not a sufficient one.
The most useful reframe I’ve taken from working through all of this is Microsoft’s own. Stop asking whether the classifier will catch every attack. Start asking what happens when it doesn’t. That’s where the interesting engineering lives, and it’s where the difference between a deployment that survives EchoLeak-class attacks and one that doesn’t actually gets made.
Prompt injection isn’t going to be “solved” the way SQL injection was solved. There’s no parameterized query waiting in the wings. What we get instead is a stack of overlapping probabilistic and deterministic controls, a continuous red team loop, and honest acknowledgment that the base layer is fallible. Build for that, and Microsoft’s tooling gives you a credible starting point. Pretend the classifier is a wall, and you’ll find out the hard way that it’s a sieve.
I help organisations secure their cloud infrastructure and stay ahead of evolving cyber threats. Microsoft MVP and Certified Trainer, author of Mastering Azure Security, and founder of arnav.au — a platform for practical Cloud, Cybersecurity, DevOps and AI content.
Frequently Asked Questions
What is the difference between Prompt Shields and Spotlighting?
Prompt Shields is a binary classifier that detects whether a prompt attack is present by analyzing user prompts and documents. Spotlighting, on the other hand, is a prompt-engineering technique that marks untrusted content with persistent signals so the model treats it as data rather than executable instructions. Together they form complementary layers of defense: detection and prevention.
Is Prompt Shields enough on its own?
No, Prompt Shields alone is not sufficient, because Microsoft acknowledges that deterministic detection of prompt injection is still an open research problem. Since detection against a probabilistic language model is itself probabilistic, Microsoft recommends a defense-in-depth approach using multiple overlapping controls like Spotlighting, Entra Global Secure Access, and Defender for Cloud working together.
What is the difference between direct and indirect prompt injection?
Direct prompt injection occurs when the person typing the prompt is the adversary and tries to override system instructions. Indirect prompt injection happens when a third party has smuggled malicious instructions into a grounding source (like documents in a RAG pipeline) that the LLM will later process. Prompt Shields targets both through separate user prompt and document shields.
Can Microsoft’s controls cover third-party LLMs like ChatGPT, Claude, or Gemini?
Yes, Entra Global Secure Access Prompt Injection Protection enforces guardrails at the network layer, which allows it to cover third-party SaaS LLMs like ChatGPT, Claude, and Gemini without requiring any changes to your application code. This makes it useful for organizations using multiple LLM providers.
What is PyRIT and where does it fit?
PyRIT is Microsoft's open-source red team framework designed to stress-test the other components of the stack. It lets security teams probe their prompt injection defenses and find weaknesses before attackers do, which makes it the testing half of the overall defense strategy.