Guardrails and Responsible AI

Guardrails are your safety net when everything else fails. Implement them at every stage of the agent lifecycle to catch harmful inputs, dangerous behaviors, and problematic outputs before they reach users or downstream systems.

9.1 Three-Phase Guardrail Model

Guardrails are not a single checkpoint. You need them across the full lifecycle of an agent task: before input reaches the model, during reasoning and tool loops, and before output leaves the system.

1. Pre-Input Guardrails

These run before user input reaches the LLM:

  • PII detection and optional redaction - Strip or mask personal data before it enters the model context
  • Toxicity and abuse filters - Block overtly harmful or abusive input early
  • Basic prompt injection pattern detection - Catch known injection patterns before they reach the model
  • Domain scoping - Restrict the agent to its designated topic area (e.g., "answer only about product X")
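The pre-input checks above can be sketched in a few lines. This is a minimal illustration, not a production filter: the regexes and injection phrases are hypothetical placeholders, and a real deployment would use dedicated PII and injection-detection services.

```python
import re

# Illustrative patterns only -- production systems use dedicated
# PII/injection detection services, not hand-rolled regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
]

def pre_input_check(text: str) -> dict:
    """Run pre-input guardrails; block, redact, or allow the input."""
    # Block known injection phrasings before they reach the model.
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            return {"action": "block", "reason": "possible prompt injection"}
    # Mask personal data before it enters the model context.
    redacted = text
    for label, pat in PII_PATTERNS.items():
        redacted = pat.sub(f"[{label.upper()} REDACTED]", redacted)
    action = "redact" if redacted != text else "allow"
    return {"action": action, "text": redacted}
```

Domain scoping would typically be a separate classifier step on top of this, since "is this on topic?" is not a pattern-matching problem.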

2. Mid-Execution Guardrails

These run during reasoning and tool loops:

  • Max iterations and tool calls - Hard caps to prevent runaway loops
  • Privilege escalation detection - Flag attempts to access resources or tools outside the agent's scope
  • Resource limits - Enforce caps on tokens, wall-clock time, and API calls to prevent denial-of-service
  • Suspicious pattern detection - Watch for repeated attempts to bypass policies or probe for weaknesses
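The hard caps and scope checks above fit naturally into a small budget object that the agent loop consults on every step. A sketch, with illustrative limits you would tune per agent:

```python
import time

class ExecutionBudget:
    """Hard caps on iterations, tool calls, and wall-clock time.
    Limit values here are illustrative defaults, not recommendations."""

    def __init__(self, max_iterations=10, max_tool_calls=25, max_seconds=60.0):
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.iterations = 0
        self.tool_calls = 0

    def start_iteration(self):
        # Called at the top of every reasoning loop iteration.
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration cap exceeded")
        if time.monotonic() > self.deadline:
            raise RuntimeError("time budget exceeded")

    def record_tool_call(self, tool_name: str, allowed_tools: set):
        # Tools outside the agent's declared scope are treated as
        # privilege escalation attempts, not soft failures.
        if tool_name not in allowed_tools:
            raise PermissionError(f"tool {tool_name!r} not in agent scope")
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call cap exceeded")
```

Raising an exception (rather than returning a flag the loop might ignore) makes the cap a genuine hard stop.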

3. Post-Output Guardrails

These run before output is shown to users or sent downstream:

  • PII and secret leakage detection - Catch credentials, API keys, or personal data in model output
  • Content moderation - Filter for toxicity, hate speech, self-harm content, sexual content, and child safety violations
  • Grounding and hallucination checks - For RAG workflows, verify output against source material. If confidence is low, respond conservatively or ask for clarification
  • Bias detection - Check output on sensitive axes where relevant, especially in high-stakes domains
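The secret-leakage check above is the most mechanical of these and easy to sketch. The patterns below are illustrative; production scanners typically combine regex heuristics with entropy analysis and a dedicated secret-scanning service:

```python
import re

# Illustrative secret patterns; real scanners cover many more formats
# and add entropy checks to catch opaque tokens.
SECRET_PATTERNS = [
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("bearer_token", re.compile(r"\bBearer\s+[A-Za-z0-9._-]{20,}")),
    ("private_key", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")),
]

def post_output_check(text: str) -> list[str]:
    """Return the names of any secret patterns found in model output.
    A non-empty result should block or redact the response."""
    return [name for name, pat in SECRET_PATTERNS if pat.search(text)]
```

Content moderation, grounding verification, and bias checks are classifier problems rather than pattern matches, so they usually run as separate model-backed services behind the same post-output gate.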

Important: Implement guardrails as separate services or modules, not just extra words in prompts. A guardrail that lives inside the system prompt is a suggestion the model can ignore; a guardrail that runs as an independent service is an enforceable control. The difference matters.

9.2 Core RAI Harm Categories

For each application, assess at minimum these categories of potential harm:

  • Toxicity, hate, and harassment
  • Violence, self-harm, and extremist content
  • Sexual content and child safety
  • Misinformation and hallucinations
  • Bias and fairness - especially in hiring, lending, or other high-stakes domains
  • Privacy violations - unexpected personal data handling
  • IP and copyright infringement

For each category, define four things:

  • Risk level for the use case - How likely is this harm, and how severe would it be?
  • Mitigation strategy - Policy constraints, model choice or fine-tuning, guardrail thresholds
  • Escalation path - Human review where necessary
  • Monitoring approach - How will you detect this harm in production?
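One lightweight way to make this assessment auditable is to record the four items as structured data rather than prose scattered across documents. A sketch with a hypothetical record type and illustrative values:

```python
from dataclasses import dataclass

@dataclass
class HarmAssessment:
    """One record per harm category, per application.
    Fields mirror the four items above; all values here are illustrative."""
    category: str
    risk_level: str        # e.g. "low" / "medium" / "high", with severity notes
    mitigation: str
    escalation_path: str
    monitoring: str

toxicity = HarmAssessment(
    category="toxicity_hate_harassment",
    risk_level="medium",
    mitigation="moderation filter on input and output",
    escalation_path="flagged transcripts routed to a human review queue",
    monitoring="daily sample audit plus automated flag-rate dashboard",
)
```

Keeping one record per category, including the ones judged low-risk, gives you the documented reasoning the next paragraph asks for.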

Not every category is equally relevant to every application. A customer support agent has different risk priorities than a code generation agent. But you need to explicitly assess each one and document your reasoning, not just skip the ones that seem unlikely.

9.3 Domain-Specific Constraints

Certain domains require explicit, stricter rules beyond general guardrails. If your agents operate in any of these areas, build domain-specific constraints into the system from day one.

Financial Services

  • No unsupervised fund movements - all financial transactions require human approval breakpoints and multi-party sign-off
  • Strong disclaimers on investment or tax advice
  • Conservative language that emphasizes risks, not just potential returns
  • Audit trails for every financial interaction the agent handles
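The multi-party approval rule can be enforced as a hard gate in code: the agent proposes a transaction, but execution requires sign-off from every required human approver. A minimal sketch with hypothetical names:

```python
def approve_transaction(amount: float, approvals: set[str],
                        required_approvers: set[str]) -> bool:
    """Gate for agent-initiated fund movements.
    The agent's own output never counts as an approval; only the listed
    human approvers do. Approver sets here are illustrative."""
    if amount <= 0:
        return False
    # Every required approver must have explicitly signed off.
    return required_approvers.issubset(approvals)
```

In practice the same gate would also write an audit-trail entry for every decision, approved or not, to satisfy the last bullet above.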

Healthcare

  • Agents provide education and triage, never diagnosis or treatment
  • Always encourage consultation with a healthcare professional
  • Detect and handle crisis or emergency cues with appropriate escalation - do not let the agent try to handle a mental health crisis on its own
  • Strict handling of PHI in compliance with applicable regulations
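The crisis-escalation rule above is the one constraint that must short-circuit everything else. A sketch of the routing logic; the keyword list is purely illustrative, since real systems use trained classifiers rather than keyword matching:

```python
# Illustrative cues only -- production systems use trained crisis
# classifiers, not keyword lists, and tune them with clinical input.
CRISIS_CUES = ("suicide", "want to hurt myself", "overdose", "can't go on")

def triage_message(text: str) -> str:
    """Route a healthcare-agent message: crisis cues escalate to a human
    immediately; everything else continues as education/triage with the
    standard see-a-professional disclaimer."""
    lowered = text.lower()
    if any(cue in lowered for cue in CRISIS_CUES):
        return "escalate_to_human"
    return "continue_with_disclaimer"
```

The key design point is that escalation happens before the model generates any substantive reply, so the agent never attempts to handle the crisis itself.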

Legal

  • Do not present agent output as legal advice
  • Do not autonomously file legal documents or make binding commitments
  • Include disclaimers and route complex issues to human lawyers
  • Be explicit about jurisdiction limitations

Security and Cyber

  • Restrict exploit generation and attack planning scenarios
  • Focus on defensive guidance and best practices
  • Monitor for malicious-intent prompts and block or escalate them
  • Be careful about providing specific vulnerability details that could enable attacks
