Guardrails and Responsible AI

Guardrails are your safety net when everything else fails. Implement them at every stage of the agent lifecycle to catch harmful inputs, dangerous behaviors, and problematic outputs before they reach users or downstream systems.

9.1 Three-Phase Guardrail Model

Guardrails are not a single checkpoint. You need them across the full lifecycle of an agent task: before input reaches the model, during reasoning and tool loops, and before output leaves the system.

1. Pre-Input Guardrails

These run before user input reaches the LLM:

  • PII detection and optional redaction - Strip or mask personal data before it enters the model context
  • Toxicity and abuse filters - Block overtly harmful or abusive input early
  • Basic prompt injection pattern detection - Catch known injection patterns before they reach the model
  • Domain scoping - Restrict the agent to its designated topic area (e.g., "answer only about product X")
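The pre-input checks above can be sketched in a few lines. This is a minimal illustration, not a production filter: the regexes and injection phrases are hypothetical placeholders, and a real deployment would use dedicated PII and injection-detection services.

```python
import re

# Illustrative patterns only -- production systems use dedicated
# PII/injection detection services, not hand-rolled regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
]

def pre_input_check(text: str) -> dict:
    """Run pre-input guardrails; block, redact, or allow the input."""
    # Block known injection phrasings before they reach the model.
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            return {"action": "block", "reason": "possible prompt injection"}
    # Mask personal data before it enters the model context.
    redacted = text
    for label, pat in PII_PATTERNS.items():
        redacted = pat.sub(f"[{label.upper()} REDACTED]", redacted)
    action = "redact" if redacted != text else "allow"
    return {"action": action, "text": redacted}
```

Domain scoping would typically be a separate classifier step on top of this, since "is this on topic?" is not a pattern-matching problem.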

2. Mid-Execution Guardrails

These run during reasoning and tool loops:

  • Max iterations and tool calls - Hard caps to prevent runaway loops
  • Privilege escalation detection - Flag attempts to access resources or tools outside the agent's scope
  • Resource limits - Enforce caps on tokens, wall-clock time, and API calls to prevent denial-of-service
  • Suspicious pattern detection - Watch for repeated attempts to bypass policies or probe for weaknesses
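The hard caps and scope checks above fit naturally into a small budget object that the agent loop consults on every step. A sketch, with illustrative limits you would tune per agent:

```python
import time

class ExecutionBudget:
    """Hard caps on iterations, tool calls, and wall-clock time.
    Limit values here are illustrative defaults, not recommendations."""

    def __init__(self, max_iterations=10, max_tool_calls=25, max_seconds=60.0):
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.iterations = 0
        self.tool_calls = 0

    def start_iteration(self):
        # Called at the top of every reasoning loop iteration.
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration cap exceeded")
        if time.monotonic() > self.deadline:
            raise RuntimeError("time budget exceeded")

    def record_tool_call(self, tool_name: str, allowed_tools: set):
        # Tools outside the agent's declared scope are treated as
        # privilege escalation attempts, not soft failures.
        if tool_name not in allowed_tools:
            raise PermissionError(f"tool {tool_name!r} not in agent scope")
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call cap exceeded")
```

Raising an exception (rather than returning a flag the loop might ignore) makes the cap a genuine hard stop.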

3. Post-Output Guardrails

These run before output is shown to users or sent downstream:

  • PII and secret leakage detection - Catch credentials, API keys, or personal data in model output
  • Content moderation - Filter for toxicity, hate speech, self-harm content, sexual content, and child safety violations
  • Grounding and hallucination checks - For RAG workflows, verify output against source material. If confidence is low, respond conservatively or ask for clarification
  • Bias detection - Check output on sensitive axes where relevant, especially in high-stakes domains
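The secret-leakage check above is the most mechanical of these and easy to sketch. The patterns below are illustrative; production scanners typically combine regex heuristics with entropy analysis and a dedicated secret-scanning service:

```python
import re

# Illustrative secret patterns; real scanners cover many more formats
# and add entropy checks to catch opaque tokens.
SECRET_PATTERNS = [
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("bearer_token", re.compile(r"\bBearer\s+[A-Za-z0-9._-]{20,}")),
    ("private_key", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")),
]

def post_output_check(text: str) -> list[str]:
    """Return the names of any secret patterns found in model output.
    A non-empty result should block or redact the response."""
    return [name for name, pat in SECRET_PATTERNS if pat.search(text)]
```

Content moderation, grounding verification, and bias checks are classifier problems rather than pattern matches, so they usually run as separate model-backed services behind the same post-output gate.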

Important: Implement guardrails as separate services or modules, not just extra words in prompts. A guardrail that lives inside the system prompt is a suggestion the model can ignore; a guardrail that runs as an independent service is an enforceable control. The difference matters.

9.2 Core RAI Harm Categories

For each application, assess at minimum these categories of potential harm:

  • Toxicity, hate, and harassment
  • Violence, self-harm, and extremist content
  • Sexual content and child safety
  • Misinformation and hallucinations
  • Bias and fairness - especially in hiring, lending, or other high-stakes domains
  • Privacy violations - unexpected personal data handling
  • IP and copyright infringement

For each category, define four things:

  • Risk level for the use case - How likely is this harm, and how severe would it be?
  • Mitigation strategy - Policy constraints, model choice or fine-tuning, guardrail thresholds
  • Escalation path - Human review where necessary
  • Monitoring approach - How will you detect this harm in production?
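One lightweight way to make this assessment auditable is to record the four items as structured data rather than prose scattered across documents. A sketch with a hypothetical record type and illustrative values:

```python
from dataclasses import dataclass

@dataclass
class HarmAssessment:
    """One record per harm category, per application.
    Fields mirror the four items above; all values here are illustrative."""
    category: str
    risk_level: str        # e.g. "low" / "medium" / "high", with severity notes
    mitigation: str
    escalation_path: str
    monitoring: str

toxicity = HarmAssessment(
    category="toxicity_hate_harassment",
    risk_level="medium",
    mitigation="moderation filter on input and output",
    escalation_path="flagged transcripts routed to a human review queue",
    monitoring="daily sample audit plus automated flag-rate dashboard",
)
```

Keeping one record per category, including the ones judged low-risk, gives you the documented reasoning the next paragraph asks for.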

Not every category is equally relevant to every application. A customer support agent has different risk priorities than a code generation agent. But you need to explicitly assess each one and document your reasoning, not just skip the ones that seem unlikely.

9.3 Domain-Specific Constraints

Certain domains require explicit, stricter rules beyond general guardrails. If your agents operate in any of these areas, build domain-specific constraints into the system from day one.

Financial Services

  • No unsupervised fund movements - all financial transactions require human approval breakpoints and multi-party sign-off
  • Strong disclaimers on investment or tax advice
  • Conservative language that emphasizes risks, not just potential returns
  • Audit trails for every financial interaction the agent handles
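The multi-party approval rule can be enforced as a hard gate in code: the agent proposes a transaction, but execution requires sign-off from every required human approver. A minimal sketch with hypothetical names:

```python
def approve_transaction(amount: float, approvals: set[str],
                        required_approvers: set[str]) -> bool:
    """Gate for agent-initiated fund movements.
    The agent's own output never counts as an approval; only the listed
    human approvers do. Approver sets here are illustrative."""
    if amount <= 0:
        return False
    # Every required approver must have explicitly signed off.
    return required_approvers.issubset(approvals)
```

In practice the same gate would also write an audit-trail entry for every decision, approved or not, to satisfy the last bullet above.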

Healthcare

  • Agents provide education and triage, never diagnosis or treatment
  • Always encourage consultation with a healthcare professional
  • Detect and handle crisis or emergency cues with appropriate escalation - do not let the agent try to handle a mental health crisis on its own
  • Strict handling of PHI in compliance with applicable regulations
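The crisis-escalation rule above is the one constraint that must short-circuit everything else. A sketch of the routing logic; the keyword list is purely illustrative, since real systems use trained classifiers rather than keyword matching:

```python
# Illustrative cues only -- production systems use trained crisis
# classifiers, not keyword lists, and tune them with clinical input.
CRISIS_CUES = ("suicide", "want to hurt myself", "overdose", "can't go on")

def triage_message(text: str) -> str:
    """Route a healthcare-agent message: crisis cues escalate to a human
    immediately; everything else continues as education/triage with the
    standard see-a-professional disclaimer."""
    lowered = text.lower()
    if any(cue in lowered for cue in CRISIS_CUES):
        return "escalate_to_human"
    return "continue_with_disclaimer"
```

The key design point is that escalation happens before the model generates any substantive reply, so the agent never attempts to handle the crisis itself.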

Legal

  • Do not present agent output as legal advice
  • Do not autonomously file legal documents or make binding commitments
  • Include disclaimers and route complex issues to human lawyers
  • Be explicit about jurisdiction limitations

Security and Cyber

  • Restrict exploit generation and attack planning scenarios
  • Focus on defensive guidance and best practices
  • Monitor for malicious-intent prompts and block or escalate them
  • Be careful about providing specific vulnerability details that could enable attacks
