Field Guards
Privacy by construction: run deterministic checks and cross-vendor LLM judges on every agent response before it ships.
Overview
Field guards enforce response policies on the agent itself. Every reply is checked before it ships.
Deterministic checks (ContainsString, ContainsAny, RegexMatch) catch obvious leaks in milliseconds. An LLMJudge guard, running on a different model from the agent, catches the subtle ones. Stack multiple judges from different vendors, and a prompt injection that exploits one model's quirks still has to defeat the others. When any guard fires with reject, the message doesn't leave the agent.
This is privacy by construction: the policy is part of the response schema, runs every time, and can't be bypassed by prompt engineering alone. There's no wrapper agent to maintain, no post-hoc scanner that flags violations after the reader already saw them, no relay thread that might forget to enforce the rule. One config block on the agent's AgentMessageSchema replaces the fragile "sanitizer agent wrapping another agent" pattern.
What field guards defend against
| Risk | How field guards help |
|---|---|
| Accidental data leak: agent pastes internal runbook content, ticket IDs, or source code into a customer-facing response | Regex and substring guards catch structural patterns; an LLM judge catches nuanced leaks the agent phrases creatively |
| Prompt injection: a hostile user tricks the agent into ignoring its instructions | Guards run after the LLM and cannot be disabled by anything the LLM produces. Cross-vendor judges compound the defense: an injection must defeat every judge, not just the first one |
| Model regression: a model upgrade changes the agent's tone or sharing behavior | Guards run every time regardless of model version; regressions that drift past prompt-level guidance still get caught by the same policy |
| Subtle content policy drift: the agent used to follow the "summarize only, don't paste" rule; today's conversation pushed it to paste anyway | A schema-level policy is durable and reproducible across deploys |
When to reach for field guards
Use field guards when:
- the agent's output will be read by a customer, partner, or external system and certain things must never appear
- prompt-level guidance is not a sufficient guarantee ("don't leak internal ticket numbers" works only until it doesn't)
- you need a durable, reproducible policy that survives prompt tweaks, model upgrades, and new agent versions
- you want a second opinion from a different LLM vendor on every sensitive response
Field guards are not a replacement for careful prompt design. They're an additional layer that runs every time, regardless of how the LLM behaves. See Cross-Company Privacy for how field guards fit into the broader defense-in-depth model.
Before and after
The leak scenario
A customer asks a support agent for help with a webhook failure. The agent's knowledge sources include both customer-facing docs and an internal runbook.
Without field guards, the agent's response schema constrains the shape of the reply (a summary and a next_action string) but says nothing about the content. The agent, being helpful, reaches into the internal runbook, finds the escalation path, and writes:
{
  "summary": "This looks like the INC-48219 retry issue. Ping @sarah.k on the #webhooks-internal channel and tell her to run the runbook/internal/webhook-retry-fix steps 3-7.",
  "next_action": "Escalate to Sarah"
}
The response matches the schema, so it ships. The customer now has an internal ticket ID, an employee handle, an internal channel name, and a runbook path, none of which they should have seen.
The same schema with field guards
kind: AgentMessageSchema
id: support-reply
schema:
  type: object
  properties:
    summary: { type: string }
    next_action: { type: string }
  required: [summary, next_action]
field_guards:
  # 1. Fast: block known internal markers
  - kind: ContainsAny
    fields: ["*"]
    values:
      - "runbook/internal"
      - "#webhooks-internal"
      - "[INTERNAL]"
    on_match: reject
    message: "Response contains an internal-only marker"
  # 2. Fast: redact internal ticket IDs
  - kind: RegexMatch
    fields: ["*"]
    pattern: "\\b(INC|TKT|BUG)-\\d{4,}\\b"
    on_match: redact
    message: "Redacted internal ticket ID"
  # 3. Nuanced: LLM judge catches subtle internal detail
  - kind: LLMJudge
    fields: ["summary", "next_action"]
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, internal chat channels, or specific
      escalation steps that a customer should not see?
    on_match: reject
    message: "LLM judge flagged internal detail"
The same agent producing the same output now hits ContainsAny on runbook/internal and #webhooks-internal. The response is rejected before it reaches the regex. If the agent had phrased it more subtly (no literal marker, no ticket ID pattern), the LLM judge would have caught it at the third layer. The customer never sees any of it.
That's three guards in a single YAML block, replacing a wrapper-agent pattern that would otherwise take dozens of files and still be bypassable.
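To make the walkthrough concrete, here is a minimal Python sketch that applies the matching logic of the first two guards to the leaked response. It uses only the standard library and is not the platform's implementation; the names `leaked`, `markers`, and `ticket_pattern` are illustrative, and the substring check here is case-sensitive for simplicity.

```python
import re

# The leaked response from the scenario above.
leaked = {
    "summary": (
        "This looks like the INC-48219 retry issue. Ping @sarah.k on the "
        "#webhooks-internal channel and tell her to run the "
        "runbook/internal/webhook-retry-fix steps 3-7."
    ),
    "next_action": "Escalate to Sarah",
}

# Guard 1 (ContainsAny): literal internal markers from the schema.
markers = ["runbook/internal", "#webhooks-internal", "[INTERNAL]"]
# Guard 2 (RegexMatch): the ticket ID pattern from the schema.
ticket_pattern = re.compile(r"\b(INC|TKT|BUG)-\d{4,}\b")

# For each field, collect which markers and which ticket IDs appear.
marker_hits = {
    field: [m for m in markers if m in value]
    for field, value in leaked.items()
}
ticket_hits = {
    field: [m.group(0) for m in ticket_pattern.finditer(value)]
    for field, value in leaked.items()
}

print(marker_hits)
print(ticket_hits)
```

Both deterministic guards match the `summary` field, so the response never needs to reach the judge. Note that neither of them would notice `@sarah.k` or "Escalate to Sarah"; that is the gap the third guard is there to cover.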
How they run
- The agent generates a structured response that conforms to its AgentMessageSchema.
- The platform extracts the resolved field values.
- Synchronous guards run first: the deterministic checks (ContainsString, ContainsAny, RegexMatch).
- If all synchronous guards pass, asynchronous guards run: the LLM-based judge (LLMJudge).
- Each guard that fires returns a violation with one of three actions: reject, redact, or warn.
- The platform applies the strictest action: a reject blocks the response, redact rewrites the offending field, warn lets the response through but records the violation.
Synchronous guards run before the judge so deterministic rules can short-circuit the response before any LLM evaluation is needed.
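The ordering above can be sketched in a few lines of Python. This is a hedged illustration, not the platform's code: the `Guard` class, `run_guards`, and the `SEVERITY` ranking are hypothetical names, and the sketch assumes that judges are skipped as soon as any synchronous guard fires.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed ranking: reject is strictest, then redact, then warn.
SEVERITY = {"warn": 1, "redact": 2, "reject": 3}

@dataclass
class Guard:
    name: str
    on_match: str                  # "reject", "redact", or "warn"
    check: Callable[[str], bool]   # returns True when the guard fires
    is_async: bool = False         # True for LLM-based judges

def run_guards(value: str, guards: list[Guard]) -> Optional[str]:
    """Return the strictest action triggered for one field value, or None."""
    sync = [g for g in guards if not g.is_async]
    judges = [g for g in guards if g.is_async]

    fired = [g.on_match for g in sync if g.check(value)]
    # Assumption: judges run only when no synchronous guard has fired.
    if not fired:
        fired = [g.on_match for g in judges if g.check(value)]

    return max(fired, key=SEVERITY.__getitem__) if fired else None

guards = [
    Guard("marker", "reject", lambda v: "[INTERNAL]" in v),
    Guard("ticket", "redact", lambda v: "INC-" in v),
    # Stand-in for a model call; a real judge would query an LLM.
    Guard("judge", "reject", lambda v: "sarah" in v.lower(), is_async=True),
]

print(run_guards("See INC-48219 [INTERNAL]", guards))
print(run_guards("See INC-48219", guards))
print(run_guards("Ask Sarah", guards))
```

In the first call two synchronous guards fire and the stricter `reject` wins; in the third, nothing deterministic matches, so the stand-in judge is the only guard that gets a say.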
On-match actions
Every guard has an on_match setting that controls what happens when the guard fires.
| Action | Behavior |
|---|---|
| reject | Block the response. The agent's output is discarded and the violation is surfaced. |
| redact | Replace the offending field value with [REDACTED] and let the rest of the response through. |
| warn | Allow the response unchanged but record the violation for review. |
Default is reject. Use redact when the rest of the output is still useful without the sensitive piece, and warn when you want the activity feed entry but not a behavior change.
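As a rough illustration of the three behaviors, the sketch below applies one action to a flat response dictionary. The function `apply_action` and the shape of the `log` entries are hypothetical; the only details taken from this page are that `reject` discards the output, `redact` replaces the field value with `[REDACTED]`, and `warn` leaves the response unchanged while recording the violation.

```python
from typing import Optional

def apply_action(response: dict, field: str, action: str, log: list) -> Optional[dict]:
    """Apply one on_match action to a flat response; None means blocked."""
    log.append({"field": field, "action": action})  # every violation is recorded
    if action == "reject":
        return None                                 # discard the whole response
    if action == "redact":
        return {**response, field: "[REDACTED]"}    # replace only the offending field
    if action == "warn":
        return dict(response)                       # ship unchanged
    raise ValueError(f"unknown action: {action}")

log = []
reply = {"summary": "Ticket INC-48219 is resolved.", "next_action": "None needed"}
redacted = apply_action(reply, "summary", "redact", log)
print(redacted)
```

The redacted reply keeps `next_action` intact, which is the case `redact` is meant for: the rest of the output is still useful without the sensitive piece.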
Field paths
Every guard targets one or more fields in the structured response. Field paths support:
- Plain field names: summary, email, notes
- Dot notation for nested objects: customer.address.zip
- Array wildcards: contacts[*].email, attachments[*].url
- Match-everything wildcard: "*" for all string fields in the response
fields: ["summary"]
fields: ["customer.address.zip"]
fields: ["contacts[*].email", "contacts[*].phone"]
fields: ["*"]
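One way to picture how these path forms select values is the small resolver below. It is a sketch under assumptions, not the platform's resolver: the functions `resolve`, `_walk`, and `_all_strings` are hypothetical, and it assumes that only string values are ever handed to a guard.

```python
def resolve(path: str, data):
    """Yield (concrete_path, value) for every string a field path selects."""
    if path == "*":
        yield from _all_strings(data, "")
    else:
        yield from _walk(path.split("."), data, "")

def _join(prefix: str, key: str) -> str:
    return f"{prefix}.{key}" if prefix else key

def _walk(parts, node, prefix):
    if not parts:
        if isinstance(node, str):
            yield prefix, node
        return
    if not isinstance(node, dict):
        return
    head, rest = parts[0], parts[1:]
    if head.endswith("[*]"):
        key = head[:-3]
        items = node.get(key)
        if isinstance(items, list):
            # Fan out over every array element.
            for i, item in enumerate(items):
                yield from _walk(rest, item, _join(prefix, f"{key}[{i}]"))
    elif head in node:
        yield from _walk(rest, node[head], _join(prefix, head))

def _all_strings(node, prefix):
    if isinstance(node, str):
        yield prefix, node
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from _all_strings(child, _join(prefix, key))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from _all_strings(child, f"{prefix}[{i}]")

reply = {
    "summary": "Call back tomorrow",
    "customer": {"address": {"zip": "94107"}},
    "contacts": [{"email": "a@example.com"}, {"email": "b@example.com"}],
}
print(list(resolve("customer.address.zip", reply)))
print(list(resolve("contacts[*].email", reply)))
print([path for path, _ in resolve("*", reply)])
```

A path that matches nothing simply yields no values, so a guard on a missing optional field has nothing to check.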
"*" is the most permissive. Use it when you want a guard that applies to every string the agent might emit, without enumerating field names.
Synchronous guards
These run first and are deterministic. They catch the obvious cases without invoking another model.
ContainsString
Block, redact, or warn when a field contains a specific substring.
- kind: ContainsString
  fields: ["*"]
  value: "CONFIDENTIAL"
  case_sensitive: false
  on_match: redact
  message: "Response contains a confidential marker"
| Field | Purpose |
|---|---|
| value | The substring to search for (required) |
| case_sensitive | Default false |
| on_match | reject, redact, or warn (default reject) |
| message | Human-readable explanation surfaced in violations |
Use this for clear keyword bans: internal classification markers, forbidden product names, single-string detection of sensitive terms.
ContainsAny
Same as ContainsString but checks against a list of terms. The first match wins.
- kind: ContainsAny
  fields: ["summary", "details"]
  values:
    - "internal-only"
    - "do not share"
    - "draft - not for partners"
  case_sensitive: false
  on_match: reject
  message: "Response contains text marked as not for sharing"
Use this for compact deny-lists of sensitive terms or markers. Easier to maintain than several ContainsString guards with the same action.
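The substring guards amount to a case-folded containment check. The sketch below is an assumption about the semantics rather than the platform's code: `contains_any` is a hypothetical helper, and "first match wins" is read here as "first term in list order".

```python
from typing import Optional

def contains_any(text: str, values: list[str], case_sensitive: bool = False) -> Optional[str]:
    """Return the first listed term found in text, or None."""
    haystack = text if case_sensitive else text.casefold()
    for term in values:
        needle = term if case_sensitive else term.casefold()
        if needle in haystack:
            return term  # first match wins, in list order
    return None

deny_list = ["internal-only", "do not share", "draft - not for partners"]
print(contains_any("Please DO NOT SHARE this draft.", deny_list))
```

A ContainsString guard is the same check with a single-item list. With `case_sensitive` left at its default of false, "DO NOT SHARE" matches the lowercase entry; setting it to true would let that text through.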
RegexMatch
For patterns that aren't fixed substrings: credit cards, social security numbers, internal ticket formats, email addresses you want to redact.
- kind: RegexMatch
  fields: ["*"]
  pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
  on_match: redact
  message: "Response contains SSN-like pattern"
Use regex when the policy is shape-based, not term-based. Combine with redact for content-cleaning patterns and reject for hard fails like API keys.
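Before deploying a pattern, it can help to try it against a few positive and negative samples. The sketch below does that for the SSN-like pattern using Python's `re` module, on the assumption that the platform's regex dialect treats `\b` and `\d` the same way. In a double-quoted YAML string the backslashes are doubled (`"\\b"`); in a Python raw string they are single.

```python
import re

ssn_like = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Each sample maps to whether the guard is expected to fire.
samples = {
    "SSN on file: 123-45-6789.": True,
    "Order 1234-56-7890 shipped.": False,  # four leading digits: \b prevents a partial match
    "Call 555-867-5309 for help.": False,  # phone-shaped, wrong group lengths
}

results = {text: bool(ssn_like.search(text)) for text in samples}
for text, fired in results.items():
    print(f"{fired!s:5}  {text}")
```

The word boundaries matter: without them, the order number in the second sample would match on its last eleven characters.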
LLM judge (asynchronous)
Deterministic guards can't catch nuanced policy violations: tone, sensitivity, factual claims, brand voice, "is this PII?" against context. For those, use an LLM judge.
- kind: LLMJudge
  fields: ["summary", "details"]
  prompt: >
    Does this text contain personally identifiable information such as
    social security numbers, credit card numbers, or full home addresses?
  on_match: reject
  message: "LLM judge flagged PII in response"
| Field | Purpose |
|---|---|
| prompt | The rule the judge evaluates the field value against (required) |
| model | Override the default judge model when you need a specific one |
| on_match | Default reject |
| message | Human-readable explanation surfaced in violations |
The judge receives the field value and your prompt, then returns a fixed {pass, reason} structured output. The structured shape is enforced; you control the prompt, not the response format.
The judge uses a sensible default model out of the box. Override the model field with any string from the supported model list when you need a specific provider or version. See Models & Providers for the full set of providers and how to discover the current model catalogue.
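To illustrate what a fixed `{pass, reason}` shape buys you, here is a hedged sketch of validating a judge reply. The helpers `parse_verdict` and `guard_fires` are hypothetical, the reply is assumed to arrive as JSON text, and the sketch assumes that `pass: false` is what makes the guard fire.

```python
import json

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Validate a judge reply against the fixed {pass, reason} shape."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"pass", "reason"}:
        raise ValueError("verdict must be an object with exactly pass and reason")
    if not isinstance(data["pass"], bool) or not isinstance(data["reason"], str):
        raise ValueError("pass must be a boolean and reason a string")
    return data["pass"], data["reason"]

def guard_fires(raw: str) -> bool:
    """Assumption: the guard's on_match applies when the judge did not pass the text."""
    passed, _ = parse_verdict(raw)
    return not passed

raw = '{"pass": false, "reason": "Mentions a full home address."}'
print(parse_verdict(raw))
print(guard_fires(raw))
```

Because the shape is enforced, downstream logic only ever branches on a boolean; the free-text `reason` is for humans reviewing the violation.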
When to use the judge
| Situation | Guard |
|---|---|
| "Is the literal string CONFIDENTIAL here?" | ContainsString |
| "Is there an email address?" | RegexMatch |
| "Is this text condescending toward the customer?" | LLMJudge |
| "Does this leak any PII for any reasonable definition of PII?" | LLMJudge |
| "Does this match our brand voice?" | LLMJudge |
| "Is this factually consistent with the source documents?" | LLMJudge |
Reach for the judge when the policy needs human-style understanding. Use synchronous guards for everything that fits a substring or regex.
Prompt design
The judge prompt should ask one yes/no question. The judge returns {pass: true | false, reason: "..."}, and the platform decides what to do based on pass.
Good judge prompts:
- ask exactly one thing
- describe what should fail, not what should pass
- give one or two short examples of the failure case
- avoid open-ended judgment ("is this good?"); be specific
Bad judge prompts:
- ask multiple questions in one
- mix policy goals (PII, tone, factual accuracy) into one prompt
- depend on context the judge doesn't have
If you have multiple policies, use multiple LLMJudge guards rather than one omnibus prompt.
When to reach for a judge vs. a synchronous guard
Judges add latency: the agent's reply waits for the judge before reaching the user. Synchronous guards run first so deterministic rules can short-circuit before a judge fires. Use a synchronous guard when the policy expresses cleanly as a string, list, or regex; reach for a judge when nuance is required.
Stacking judges across vendors
The single most defensible field-guard pattern is stacking two or more LLMJudge guards that use different model vendors on the same field. Every judge runs; every judge must pass. A prompt injection or jailbreak that exploits one model's quirks still has to defeat a completely different model to get the response through.
field_guards:
  # Judge A. Anthropic
  - kind: LLMJudge
    fields: ["summary"]
    model: anthropic/<model-name>
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, or specific escalation paths that only
      the company's own support team should know?
    on_match: reject
    message: "Anthropic judge flagged internal detail"
  # Judge B. OpenAI, same question
  - kind: LLMJudge
    fields: ["summary"]
    model: openai/<model-name>
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, or specific escalation paths that only
      the company's own support team should know?
    on_match: reject
    message: "OpenAI judge flagged internal detail"
Why this compounds defense:
- Different training data, different blind spots. A jailbreak that relies on a specific Claude quirk doesn't automatically work on GPT-4o, and vice versa. An attacker has to find an exploit that works on both.
- Independent failure modes. If one vendor has a regression that makes the judge too lenient on a particular phrasing, the other vendor's judge is unaffected.
- One config block, not two sanitizer agents. The alternative (running the response through a second agent before sending it) is more code, more latency, more failure modes, and more prompt engineering to maintain. Stacking judges is a few extra lines of YAML.
You can stack more than two vendors if the policy warrants it. You can also use the same vendor with different prompts (one judge for PII, one for tone), but the cross-vendor case is the one that defeats model-specific attacks.
Trade-off: every additional judge is another LLM call on every response, which adds latency. Use synchronous guards to catch obvious violations first, then cross-vendor judges for the small set of responses where the policy actually needs nuanced evaluation.
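The "every judge runs; every judge must pass" rule can be sketched as a simple conjunction. The code below is illustrative only: `all_judges_pass` is a hypothetical helper, and the two lambdas are stand-ins with deliberately different blind spots where real judges would be calls to different vendors' models.

```python
from typing import Callable

Judge = Callable[[str], bool]  # returns True when the text passes

def all_judges_pass(text: str, judges: dict[str, Judge]) -> tuple[bool, list[str]]:
    """Run every judge; the text passes only if none of them objects."""
    failed = [name for name, judge in judges.items() if not judge(text)]
    return (not failed, failed)

# Stand-ins for model calls, each blind to what the other catches.
judges = {
    "vendor_a": lambda t: "escalation" not in t.lower(),
    "vendor_b": lambda t: "@" not in t,
}

print(all_judges_pass("Ping @sarah.k about this.", judges))
print(all_judges_pass("We are looking into it.", judges))
```

The first text slips past `vendor_a` but is still blocked, because a single objection is enough. That is the compounding effect in miniature: an attacker has to satisfy the intersection of what every judge accepts.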
A complete example
Four layers of defense, fastest first, most nuanced last, with a cross-vendor judge at the top of the judgment layer:
kind: AgentMessageSchema
id: customer-summary
schema:
  type: object
  properties:
    summary:
      type: string
    sentiment:
      type: string
      enum: [positive, neutral, negative]
    next_action:
      type: string
  required: [summary, sentiment]
field_guards:
  # Layer 1. Fast: block obvious markers
  - kind: ContainsAny
    fields: ["summary", "next_action"]
    values: ["INTERNAL", "DO NOT SHARE", "DRAFT"]
    on_match: reject
    message: "Response contains internal-only marker"
  # Layer 2. Fast: redact PII patterns
  - kind: RegexMatch
    fields: ["*"]
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
    on_match: redact
    message: "Redacted SSN-like pattern"
  # Layer 3. Nuanced: Anthropic judge on PII
  - kind: LLMJudge
    fields: ["summary"]
    model: anthropic/<model-name>
    prompt: >
      Does this summary mention any specific person's full name,
      home address, or financial account number?
    on_match: reject
    message: "Anthropic judge flagged personal information"
  # Layer 4. Cross-vendor second opinion
  - kind: LLMJudge
    fields: ["summary"]
    model: openai/<model-name>
    prompt: >
      Does this summary mention any specific person's full name,
      home address, or financial account number?
    on_match: reject
    message: "OpenAI judge flagged personal information"
This stack enforces four layers, in order: hard markers, regex patterns, a nuanced Anthropic judgment, and a cross-vendor second opinion. Each layer catches what the previous one missed. The synchronous guards need no model call, so they cost almost nothing once the response is produced. The judges run only if the synchronous guards pass. No response ships unchanged if any of the four layers flags it.
The practical effect: a prompt injection that successfully bypasses the agent's instructions still has to produce output that (a) contains no obvious markers, (b) matches no PII regex, (c) fools the Anthropic judge, and (d) fools the OpenAI judge, which has different training data and different blind spots. That's a much harder target than "convince one LLM to ignore its system prompt."
Where to attach guards
Field guards live on an AgentMessageSchema config. The schema is referenced from an agent routine that produces structured output:
routines:
  - name: customer-triage
    handler_type: preset
    preset_name: do_task
    structured_message_template_refs:
      - "#/schemas/customer-summary"
    status: active
When the routine runs, the agent generates output matching the schema, and every field guard in the schema runs against the result. See Structured Output for the schema model.
Reviewing violations
Field guard violations show up in the activity feed attached to the routine run that produced them. Each violation entry includes:
- which guard fired
- which field path matched
- the human-readable message you set on the guard
- the action that was applied (reject, redact, or warn)
When you're triaging a warn-level guard or tuning a new policy, the activity feed is the place to start.
Best practices
- Start with synchronous guards. They catch the obvious cases without invoking another model. Add the LLM judge only when the policy needs interpretation.
- Use redact for content cleanup, reject for hard policy fails. Reserve warn for tuning a new guard before you trust it.
- One judge prompt, one policy. Don't pile multiple goals into a single LLM judge; split them into separate guards.
- Stack judges across vendors for the policies that matter most. For high-stakes fields, configure two LLMJudge guards with the same prompt but different model vendors (e.g. anthropic/... and openai/...). An attack that exploits one model's quirks still has to defeat the other. See Stacking judges across vendors.
- Pin a specific judge model when your policy must be reproducible. Override the default when you need the same judge model on every run.
- Test guards with real outputs, not synthetic inputs. Run the agent in a sandbox with realistic prompts and inspect the resulting activity feed entries. Tune until violations match your intent.
- Layer with other safety patterns. Field guards are one layer. See Cross-Company Privacy for the full stack.
Where to go next
- Structured Output: the schema field guards run on.
- Cross-Company Privacy: defense for sensitive collaborations.
- Activity Feed: where guard violations are recorded.
- Agents: how routines connect to message schemas.