Field Guards

Privacy by construction: run deterministic checks and cross-vendor LLM judges on every agent response before it ships.

Overview

Field guards enforce response policies on the agent itself. Every reply is checked before it ships.

Deterministic checks (ContainsString, ContainsAny, RegexMatch) catch obvious leaks in milliseconds. An LLMJudge guard, running on a different model from the agent, catches the subtle ones. Stack multiple judges from different vendors, and a prompt injection that exploits one model's quirks still has to defeat the others. When any guard fires with reject, the message doesn't leave the agent.

This is privacy by construction: the policy is part of the response schema, runs every time, and can't be bypassed by prompt engineering alone. There's no wrapper agent to maintain, no post-hoc scanner that flags violations after the reader already saw them, no relay thread that might forget to enforce the rule. One config block on the agent's AgentMessageSchema replaces the fragile "sanitizer agent wrapping another agent" pattern.

What field guards defend against

Risk	How field guards help
Accidental data leak: agent pastes internal runbook content, ticket IDs, or source code into a customer-facing response	Regex and substring guards catch structural patterns; an LLM judge catches nuanced leaks the agent phrases creatively
Prompt injection: a hostile user tricks the agent into ignoring its instructions	Guards run after the LLM and cannot be disabled by anything the LLM produces. Cross-vendor judges compound the defense: an injection must defeat every judge, not just the first one
Model regression: a model upgrade changes the agent's tone or sharing behavior	Guards run every time regardless of model version; regressions that drift past prompt-level guidance still get caught by the same policy
Subtle content policy drift: the agent used to follow the "summarize only, don't paste" rule; today's conversation pushed it to paste anyway	A schema-level policy is durable and reproducible across deploys

When to reach for field guards

Use field guards when:

the agent's output will be read by a customer, partner, or external system and certain things must never appear
prompt-level guidance is not a sufficient guarantee ("don't leak internal ticket numbers" works only until it doesn't)
you need a durable, reproducible policy that survives prompt tweaks, model upgrades, and new agent versions
you want a second opinion from a different LLM vendor on every sensitive response

Field guards are not a replacement for careful prompt design. They're an additional layer that runs every time, regardless of how the LLM behaves. See Cross-Company Privacy for how field guards fit into the broader defense-in-depth model.

Before and after

The leak scenario

A customer asks a support agent for help with a webhook failure. The agent's knowledge sources include both customer-facing docs and an internal runbook.

Without field guards, the agent's response schema constrains the shape of the reply (a summary and a next_action string) but says nothing about the content. The agent, being helpful, reaches into the internal runbook, finds the escalation path, and writes:

{
 "summary": "This looks like the INC-48219 retry issue. Ping @sarah.k on the #webhooks-internal channel and tell her to run the runbook/internal/webhook-retry-fix steps 3-7.",
 "next_action": "Escalate to Sarah"
}

The response matches the schema, so the response ships. The customer now has an internal ticket ID, an employee handle, an internal channel name, and a runbook path, none of which they should have seen.

The same schema with field guards

kind: AgentMessageSchema
id: support-reply
schema:
  type: object
  properties:
    summary: { type: string }
    next_action: { type: string }
  required: [summary, next_action]
field_guards:
  # 1. Fast: block known internal markers
  - kind: ContainsAny
    fields: ["*"]
    values:
      - "runbook/internal"
      - "#webhooks-internal"
      - "[INTERNAL]"
    on_match: reject
    message: "Response contains an internal-only marker"

  # 2. Fast: redact internal ticket IDs
  - kind: RegexMatch
    fields: ["*"]
    pattern: "\\b(INC|TKT|BUG)-\\d{4,}\\b"
    on_match: redact
    message: "Redacted internal ticket ID"

  # 3. Nuanced: LLM judge catches subtle internal detail
  - kind: LLMJudge
    fields: ["summary", "next_action"]
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, internal chat channels, or specific
      escalation steps that a customer should not see?
    on_match: reject
    message: "LLM judge flagged internal detail"

The same agent producing the same output now hits ContainsAny on runbook/internal and #webhooks-internal. The response is rejected before it reaches the regex. If the agent had phrased it more subtly (no literal marker, no ticket ID pattern), the LLM judge at the third layer is there to catch it. The customer never sees any of it.

That's three guards in one YAML block replacing a wrapper-agent pattern that would otherwise take dozens of files and still be bypassable.
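
To see which parts of the leaked response trip the two deterministic guards, the checks can be reproduced locally. This is a minimal sketch in Python, assuming the platform's substring and regex matching behave like Python's `in` operator and `re` module and that ContainsAny is case-insensitive by default like ContainsString; neither is confirmed by this page, and the variable names are illustrative only.

```python
import re

# The leaked response from the scenario above.
response = {
    "summary": (
        "This looks like the INC-48219 retry issue. Ping @sarah.k on the "
        "#webhooks-internal channel and tell her to run the "
        "runbook/internal/webhook-retry-fix steps 3-7."
    ),
    "next_action": "Escalate to Sarah",
}

# Values from the ContainsAny guard.
markers = ["runbook/internal", "#webhooks-internal", "[INTERNAL]"]
# The YAML string "\\b(INC|TKT|BUG)-\\d{4,}\\b" unescapes to this regex.
ticket_id = re.compile(r"\b(INC|TKT|BUG)-\d{4,}\b")

# Which markers appear in each field (case-insensitive: an assumption).
marker_hits = {
    field: [marker for marker in markers if marker.lower() in text.lower()]
    for field, text in response.items()
}
# Which ticket IDs appear in each field.
ticket_hits = {
    field: [match.group(0) for match in ticket_id.finditer(text)]
    for field, text in response.items()
}

print(marker_hits)
print(ticket_hits)
```

In this sketch only the summary field matches: two internal markers and one ticket ID. The employee handle and the "Escalate to Sarah" action match neither deterministic guard, which is the gap the LLMJudge layer is meant to cover.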

How they run

The agent generates a structured response that conforms to its AgentMessageSchema.
The platform extracts the resolved field values.
Synchronous guards run first: the deterministic checks (ContainsString, ContainsAny, RegexMatch).
If all synchronous guards pass, asynchronous guards run: the LLM-based judge (LLMJudge).
Each guard that fires returns a violation with one of three actions: reject, redact, or warn.
The platform applies the strictest action: a reject blocks the response, redact rewrites the offending field, warn lets the response through but records the violation.

Synchronous guards run before the judge so deterministic rules can short-circuit the response before any LLM evaluation is needed.
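
The ordering and the "strictest action wins" rule can be modeled in a few lines. The following Python sketch is an illustration under stated assumptions, not the platform's implementation: `Guard`, `evaluate`, and `SEVERITY` are hypothetical names, and the sketch assumes that a synchronous reject is what skips the judges (whether a redact or warn also skips them is not stated here).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Strictness ranking: reject beats redact, redact beats warn.
SEVERITY = {"warn": 1, "redact": 2, "reject": 3}


@dataclass
class Guard:
    name: str
    fires: Callable[[str], bool]  # True when the text violates the policy
    on_match: str = "reject"


def evaluate(text: str, sync_guards: list[Guard], judges: list[Guard]) -> Optional[str]:
    """Return the strictest action triggered, or None if no guard fired."""
    actions = [g.on_match for g in sync_guards if g.fires(text)]
    # Assumption: a synchronous reject short-circuits before any judge runs.
    if "reject" not in actions:
        actions += [j.on_match for j in judges if j.fires(text)]
    return max(actions, key=SEVERITY.__getitem__, default=None)


marker = Guard("marker", lambda t: "[INTERNAL]" in t)
digits = Guard("digits", lambda t: any(c.isdigit() for c in t), on_match="redact")
# Stand-in for an LLMJudge; a real judge would call a model.
judge = Guard("judge", lambda t: "sarah" in t.lower(), on_match="warn")

print(evaluate("All good", [marker, digits], [judge]))              # None
print(evaluate("Call 555", [marker, digits], [judge]))              # redact
print(evaluate("[INTERNAL] ask Sarah", [marker, digits], [judge]))  # reject
```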

On-match actions

Every guard has an on_match setting that controls what happens when the guard fires.

Action	Behavior
`reject`	Block the response. The agent's output is discarded and the violation is surfaced.
`redact`	Replace the offending field value with `[REDACTED]` and let the rest of the response through.
`warn`	Allow the response unchanged but record the violation for review.

Default is reject. Use redact when the rest of the output is still useful without the sensitive piece, and warn when you want the activity feed entry but not a behavior change.
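
The effect of each action on the shipped response can be sketched as follows. This is a simplified Python model that assumes top-level fields only; `apply_action` is a hypothetical helper, not a platform API.

```python
import copy
from typing import Optional


def apply_action(response: dict, field: str, action: str) -> Optional[dict]:
    """Return the response as it would ship, or None if it is blocked."""
    if action == "reject":
        return None  # the whole response is discarded
    if action == "redact":
        shipped = copy.deepcopy(response)
        shipped[field] = "[REDACTED]"  # the whole field value is replaced
        return shipped
    if action == "warn":
        return response  # unchanged; the violation is only recorded
    raise ValueError(f"unknown action: {action}")


reply = {"summary": "SSN is 123-45-6789", "next_action": "Call back"}
print(apply_action(reply, "summary", "redact"))
```

Note that redact, as described in the table, replaces the whole field value rather than only the matching substring, so the other fields are what remain useful to the reader.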

Field paths

Every guard targets one or more fields in the structured response. Field paths support:

Plain field names: summary, email, notes
Dot notation for nested objects: customer.address.zip
Array wildcards: contacts[*].email, attachments[*].url
Match-everything wildcard: "*" for all string fields in the response

fields: ["summary"]
fields: ["customer.address.zip"]
fields: ["contacts[*].email", "contacts[*].phone"]
fields: ["*"]

"*" is the broadest path. Use it when you want a guard that applies to every string the agent might emit, without enumerating field names.

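The path forms above can be understood through a small resolver. The Python sketch below is an illustration, not the platform's resolver: `resolve` and its helpers are hypothetical, and it assumes that "*" also reaches strings nested inside objects and arrays.

```python
from typing import Any, Iterator


def resolve(value: Any, path: str) -> Iterator[str]:
    """Yield every string that `path` selects from a decoded response."""
    if path == "*":
        yield from _all_strings(value)
    else:
        yield from _walk(value, path.split("."))


def _all_strings(value: Any) -> Iterator[str]:
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from _all_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from _all_strings(child)


def _walk(value: Any, parts: list[str]) -> Iterator[str]:
    if not parts:
        if isinstance(value, str):
            yield value
        return
    head, rest = parts[0], parts[1:]
    if not isinstance(value, dict):
        return
    if head.endswith("[*]"):
        items = value.get(head[:-3])
        if isinstance(items, list):
            for item in items:  # fan out over every array element
                yield from _walk(item, rest)
    elif head in value:
        yield from _walk(value[head], rest)


response = {
    "summary": "hi",
    "customer": {"address": {"zip": "94107"}},
    "contacts": [{"email": "a@x.io", "phone": "1"}, {"email": "b@x.io"}],
}
print(list(resolve(response, "customer.address.zip")))
print(list(resolve(response, "contacts[*].email")))
print(list(resolve(response, "*")))
```
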
Synchronous guards

These run first and are deterministic. They catch the obvious cases without invoking another model.

ContainsString

Block, redact, or warn when a field contains a specific substring.

- kind: ContainsString
  fields: ["*"]
  value: "CONFIDENTIAL"
  case_sensitive: false
  on_match: redact
  message: "Response contains a confidential marker"

Field	Purpose
`value`	The substring to search for (required)
`case_sensitive`	Default `false`
`on_match`	`reject`, `redact`, or `warn` (default `reject`)
`message`	Human-readable explanation surfaced in violations

Use this for clear keyword bans: internal classification markers, forbidden product names, single-string detection of sensitive terms.

ContainsAny

Same as ContainsString but checks against a list of terms. The first match wins.

- kind: ContainsAny
  fields: ["summary", "details"]
  values:
    - "internal-only"
    - "do not share"
    - "draft - not for partners"
  case_sensitive: false
  on_match: reject
  message: "Response contains text marked as not for sharing"

Use this for compact deny-lists of sensitive terms or markers. Easier to maintain than several ContainsString guards with the same action.
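
A deny-list check of this kind reduces to a short loop. The sketch below is a Python approximation, assuming "first match" means the first entry in list order and that case-insensitive comparison is a simple lowercase comparison; `contains_any` is a hypothetical name.

```python
from typing import Optional


def contains_any(text: str, values: list[str], case_sensitive: bool = False) -> Optional[str]:
    """Return the first listed value found in `text`, or None."""
    haystack = text if case_sensitive else text.lower()
    for value in values:
        needle = value if case_sensitive else value.lower()
        if needle in haystack:
            return value  # first match wins; later entries are not checked
    return None


deny_list = ["internal-only", "do not share", "draft - not for partners"]
memo = "Attached: DRAFT - Not for Partners pricing memo"

print(contains_any(memo, deny_list))
print(contains_any(memo, deny_list, case_sensitive=True))
```

With case_sensitive left at false the memo matches the third entry despite its capitalization; with case_sensitive set to true it passes, which is a reason to leave the default alone for marker-style deny-lists.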

RegexMatch

For patterns that aren't fixed substrings: credit cards, social security numbers, internal ticket formats, email addresses you want to redact.

- kind: RegexMatch
  fields: ["*"]
  pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
  on_match: redact
  message: "Response contains SSN-like pattern"

Use regex when the policy is shape-based, not term-based. Combine with redact for content-cleaning patterns and reject for hard fails like API keys.
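
Before deploying a pattern it helps to run it against strings that should and should not match. The following Python sketch does that for the SSN-like pattern above, assuming the platform's regex engine treats `\b` and `\d` the way Python's `re` module does, which this page does not specify.

```python
import re

# The YAML string "\\b\\d{3}-\\d{2}-\\d{4}\\b" unescapes to this regex.
ssn_like = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "SSN on file: 123-45-6789.",   # plain SSN shape
    "Part no. 987-65-4321",        # same shape, not an SSN
    "Order 1234-56-7890 shipped",  # four leading digits
    "Call 555-12-34567",           # five trailing digits
    "Ref:123-45-6789x",            # letter glued to the end
]

results = {text: bool(ssn_like.search(text)) for text in samples}
for text, matched in results.items():
    print(f"{matched!s:5}  {text}")
```

Only the first two samples match. The second is a false positive: a shape-based guard cannot tell a part number from an SSN, which is why the example message says "SSN-like" and why redact is often a safer action than reject for patterns like this.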

LLM judge (asynchronous)

Deterministic guards can't catch nuanced policy violations: tone, sensitivity, factual claims, brand voice, or whether something counts as PII in context. For those, use an LLM judge.

- kind: LLMJudge
  fields: ["summary", "details"]
  prompt: >
    Does this text contain personally identifiable information such as
    social security numbers, credit card numbers, or full home addresses?
  on_match: reject
  message: "LLM judge flagged PII in response"

Field	Purpose
`prompt`	The rule the judge evaluates the field value against (required)
`model`	Override the default judge model when you need a specific one
`on_match`	Default `reject`
`message`	Human-readable explanation surfaced in violations

The judge receives the field value and your prompt, then returns a fixed {pass, reason} structured output. The structured shape is enforced; you control the prompt, not the response format.

The judge uses a sensible default model out of the box. Override the model field with any string from the supported model list when you need a specific provider or version. See Models & Providers for the full set of providers and how to discover the current model catalogue.

When to use the judge

Situation	Guard
"Is the literal string `CONFIDENTIAL` here?"	`ContainsString`
"Is there an email address?"	`RegexMatch`
"Is this text condescending toward the customer?"	`LLMJudge`
"Does this leak any PII for any reasonable definition of PII?"	`LLMJudge`
"Does this match our brand voice?"	`LLMJudge`
"Is this factually consistent with the source documents?"	`LLMJudge`

Reach for the judge when the policy needs human-style understanding. Use synchronous guards for everything that fits a substring or regex.

Prompt design

The judge prompt should ask one yes/no question. The judge returns {pass: true | false, reason: "..."}; when pass is false the guard fires, and the platform applies the guard's on_match action.

Good judge prompts:

ask exactly one thing
describe what should fail, not what should pass
give one or two short examples of the failure case
avoid open-ended judgment ("is this good?"); be specific

Bad judge prompts:

ask multiple questions in one
mix policy goals (PII, tone, factual accuracy) into one prompt
depend on context the judge doesn't have

If you have multiple policies, use multiple LLMJudge guards rather than one omnibus prompt.
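
One way to picture how a {pass, reason} verdict becomes a violation is the Python sketch below. It is a hypothetical model, not the platform's code: it assumes that pass set to false means the guard fires, and the shape of the returned violation dictionary is invented for illustration.

```python
import json
from typing import Optional


def violation_from_verdict(raw: str, on_match: str = "reject", message: str = "") -> Optional[dict]:
    """Turn a judge's JSON verdict into a violation, or None if it passed."""
    verdict = json.loads(raw)
    if (
        not isinstance(verdict, dict)
        or not isinstance(verdict.get("pass"), bool)
        or not isinstance(verdict.get("reason"), str)
    ):
        raise ValueError("verdict must have a boolean 'pass' and a string 'reason'")
    if verdict["pass"]:
        return None
    # Hypothetical violation shape: the guard's action and message plus the judge's reason.
    return {"action": on_match, "message": message, "reason": verdict["reason"]}


ok = violation_from_verdict('{"pass": true, "reason": "no PII found"}')
flagged = violation_from_verdict(
    '{"pass": false, "reason": "contains a home address"}',
    message="LLM judge flagged PII in response",
)
print(ok)
print(flagged)
```

Because the verdict carries a single boolean, a prompt that asks two questions at once gives the judge no way to say which one failed; that is the practical reason to keep one policy per judge.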

When to reach for a judge vs. a synchronous guard

Judges add latency: the agent's reply waits for the judge before reaching the user. Synchronous guards run first so deterministic rules can short-circuit before a judge fires. Use a synchronous guard when the policy expresses cleanly as a string, list, or regex; reach for a judge when nuance is required.

Stacking judges across vendors

The single most defensible field-guard pattern is stacking two or more LLMJudge guards that use different model vendors on the same field. Every judge runs; every judge must pass. A prompt injection or jailbreak that exploits one model's quirks still has to defeat a completely different model to get the response through.

field_guards:
  # Judge A. Anthropic
  - kind: LLMJudge
    fields: ["summary"]
    model: anthropic/<model-name>
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, or specific escalation paths that only
      the company's own support team should know?
    on_match: reject
    message: "Anthropic judge flagged internal detail"

  # Judge B. OpenAI, same question
  - kind: LLMJudge
    fields: ["summary"]
    model: openai/<model-name>
    prompt: >
      Does this text reveal internal infrastructure details,
      employee names, or specific escalation paths that only
      the company's own support team should know?
    on_match: reject
    message: "OpenAI judge flagged internal detail"

Why this compounds defense:

Different training data, different blind spots. A jailbreak that relies on a specific Claude quirk doesn't automatically work on GPT-4o, and vice versa. An attacker has to find an exploit that works on both.
Independent failure modes. If one vendor has a regression that makes the judge too lenient on a particular phrasing, the other vendor's judge is unaffected.
One config block, not two sanitizer agents. The alternative (running the response through a second agent before sending it) is more code, more latency, more failure modes, and more prompt engineering to maintain. Stacking judges is a few extra lines of YAML.

You can stack more than two vendors if the policy warrants it. You can also use the same vendor with different prompts (one judge for PII, one for tone), but the cross-vendor case is the one that defeats model-specific attacks.

Trade-off: every additional judge is another LLM call on every response, which adds latency. Use synchronous guards to catch obvious violations first, then cross-vendor judges for the small set of responses where the policy actually needs nuanced evaluation.
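
The "every judge runs; every judge must pass" rule can be modeled with stub judges. In the Python sketch below the two judge functions are deliberately crude stand-ins for real model calls, and `run_stack` is a hypothetical name; the point is only that the response ships when the list of failing judges is empty.

```python
from typing import Callable

# A judge takes the field text and returns {"pass": bool, "reason": str}.
Judge = Callable[[str], dict]


def run_stack(text: str, judges: dict[str, Judge]) -> list[str]:
    """Run every judge and return the names of those that did not pass."""
    return [name for name, judge in judges.items() if not judge(text)["pass"]]


def judge_a(text: str) -> dict:
    # Stand-in for one vendor's model: only notices escalation wording.
    flagged = "escalat" in text.lower()
    return {"pass": not flagged, "reason": "mentions escalation" if flagged else "ok"}


def judge_b(text: str) -> dict:
    # Stand-in for another vendor's model: only notices handles.
    flagged = "@" in text
    return {"pass": not flagged, "reason": "mentions a handle" if flagged else "ok"}


judges = {"anthropic": judge_a, "openai": judge_b}
failed = run_stack("Ping @sarah.k for help", judges)
ships = not failed
print(failed, ships)
```

Here the first stub misses the employee handle and the second catches it, so the response does not ship. Real judges have overlapping rather than disjoint blind spots, but the conjunction works the same way.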

A complete example

Four layers of defense, fastest first, most nuanced last, with a cross-vendor second opinion as the final layer:

kind: AgentMessageSchema
id: customer-summary
schema:
  type: object
  properties:
    summary:
      type: string
    sentiment:
      type: string
      enum: [positive, neutral, negative]
    next_action:
      type: string
  required: [summary, sentiment]
field_guards:
  # Layer 1. Fast: block obvious markers
  - kind: ContainsAny
    fields: ["summary", "next_action"]
    values: ["INTERNAL", "DO NOT SHARE", "DRAFT"]
    on_match: reject
    message: "Response contains internal-only marker"

  # Layer 2. Fast: redact PII patterns
  - kind: RegexMatch
    fields: ["*"]
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
    on_match: redact
    message: "Redacted SSN-like pattern"

  # Layer 3. Nuanced: Anthropic judge on PII
  - kind: LLMJudge
    fields: ["summary"]
    model: anthropic/<model-name>
    prompt: >
      Does this summary mention any specific person's full name,
      home address, or financial account number?
    on_match: reject
    message: "Anthropic judge flagged personal information"

  # Layer 4. Cross-vendor second opinion
  - kind: LLMJudge
    fields: ["summary"]
    model: openai/<model-name>
    prompt: >
      Does this summary mention any specific person's full name,
      home address, or financial account number?
    on_match: reject
    message: "OpenAI judge flagged personal information"

This stack enforces four layers, in order: hard markers, regex patterns, a nuanced Anthropic judgment, and a cross-vendor second opinion. Each layer catches what the previous one missed. The synchronous guards cost only milliseconds once the response is produced, and the judges run only after the synchronous guards pass. The agent cannot ship, unmodified, a response that any of the four layers flags.

The practical effect: a prompt injection that successfully bypasses the agent's instructions still has to produce output that (a) contains no obvious markers, (b) matches no PII regex, (c) fools the Anthropic judge, and (d) fools the OpenAI judge, which has different training data and different blind spots. That's a much harder target than "convince one LLM to ignore its system prompt."

Where to attach guards

Field guards live on an AgentMessageSchema config. The schema is referenced from an agent routine that produces structured output:

routines:
  - name: customer-triage
    handler_type: preset
    preset_name: do_task
    structured_message_template_refs:
      - "#/schemas/customer-summary"
    status: active

When the routine runs, the agent generates output matching the schema, and every field guard in the schema runs against the result. See Structured Output for the schema model.

Reviewing violations

Field guard violations show up in the activity feed attached to the routine run that produced them. Each violation entry includes:

which guard fired
which field path matched
the human-readable message you set on the guard
the action that was applied (reject, redact, or warn)

When you're triaging a warn-level guard or tuning a new policy, the activity feed is the place to start.

Best practices

Start with synchronous guards. They catch the obvious cases without invoking another model. Add the LLM judge only when the policy needs interpretation.
Use redact for content cleanup, reject for hard policy fails. Reserve warn for tuning a new guard before you trust it.
One judge prompt, one policy. Don't pile multiple goals into a single LLM judge; split them.
Stack judges across vendors for the policies that matter most. For high-stakes fields, configure two LLMJudge guards with the same prompt but different model vendors (e.g. anthropic/... and openai/...). An attack that exploits one model's quirks still has to defeat the other. See Stacking judges across vendors.
Pin a specific judge model when your policy must be reproducible. Override the default so the judge's behavior doesn't change when the platform's default model does.
Test guards with real outputs, not synthetic inputs. Run the agent in a sandbox with realistic prompts and inspect the resulting activity feed entries. Tune until violations match your intent.
Layer with other safety patterns. Field guards are one layer. See Cross-Company Privacy for the full stack.

Where to go next

Structured Output: the schema field guards run on.
Cross-Company Privacy: defense for sensitive collaborations.
Activity Feed: where guard violations are recorded.
Agents: how routines connect to message schemas.
