What are AI Guardrails?

Last reviewed by Moderation API

AI guardrails are the controls that keep an AI system inside defined limits, whether those limits come from company policy, product requirements, or law. In practice the term covers three overlapping things: input filters that catch problematic prompts, output filters that catch problematic model responses, and runtime rules that decide what the model is allowed to do with tools, data, or users. Most of the interest in guardrails today is driven by generative models, where the space of possible outputs is open-ended and the failure modes are harder to anticipate than in a classifier with fixed labels.

What guardrails are for

The job of a guardrail is to enforce rules the underlying model cannot be trusted to enforce on its own.

A well-aligned model will refuse most obvious abuse, but even aligned models hallucinate, leak training data, produce biased outputs, and fall for prompt injection. Guardrails sit around the model to catch these cases, log them, and either block, rewrite, or escalate. The point is not to make the model smarter. It is to make its behavior predictable enough that a product team can put it in front of real users without crossing their fingers.
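
The shape of that wrapper can be sketched in a few lines. Everything here is illustrative: `Verdict`, `contains_secret`, and `guarded_call` are hypothetical names, and a real input check would be a trained classifier rather than a substring match.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical verdict a check can return: "allow" passes through,
# "block" stops the request, "escalate" routes it to a human queue.
@dataclass
class Verdict:
    action: str          # "allow" | "block" | "escalate"
    reason: str = ""

def contains_secret(text: str) -> Verdict:
    # Toy stand-in for a real detector.
    if "api_key" in text.lower():
        return Verdict("block", "possible credential")
    return Verdict("allow")

def guarded_call(prompt: str, model: Callable[[str], str],
                 checks: list[Callable[[str], Verdict]]) -> str:
    # Input filters run before the model ever sees the prompt.
    for check in checks:
        v = check(prompt)
        if v.action == "block":
            return f"Request blocked: {v.reason}"
        if v.action == "escalate":
            return "Request sent for human review."
    answer = model(prompt)
    # The same checks run again on the model's output.
    for check in checks:
        if check(answer).action != "allow":
            return "Response withheld by output filter."
    return answer
```

Note that the output pass matters even when the prompt looks clean: a benign question can still produce a response that leaks a secret, and only the second pass catches that.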

Common categories

Most production guardrail stacks combine several of the following:

  • Content filtering for abuse, hate, sexual content, self-harm, and other policy-defined categories, on both prompts and responses.
  • PII and secret detection to stop the model from echoing personal data, API keys, or internal documents back to the user.
  • Fact grounding and hallucination checks, usually implemented by comparing generated claims against a retrieval source or a secondary model.
  • Copyright and attribution checks to flag verbatim reproduction of training data or licensed content.
  • Rate limiting and abuse detection on the API layer to prevent scraping, spam, and automated jailbreak campaigns.
  • Topic and scope control that keeps a customer-support bot from answering medical questions, or a coding assistant from roleplaying.
  • Tool-use constraints for agents, including allowlists of APIs, size limits on tool outputs, and human approval for destructive actions.
  • Deepfake and synthetic-media labeling for image and video models, including C2PA provenance metadata where available.
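
To make one of these concrete, here is a toy version of the PII and secret detection layer. The two patterns are illustrative only (production systems use trained detectors and far larger pattern sets), and `redact` is a hypothetical helper name.

```python
import re

# Two example patterns: email addresses and AWS-style access key IDs.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace detected spans with a label and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text, found
```

Redaction, rather than outright blocking, is often the right action for this category: the user still gets an answer, minus the data that should not have been echoed.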

How they get implemented

Guardrails are usually a mix of code and policy. On the technical side, open-source frameworks like Guardrails AI and NVIDIA's NeMo Guardrails let teams define rules in a DSL (RAIL or Colang respectively) and chain them around a model call. Meta's Llama Guard and Google's ShieldGemma are model-based alternatives: small safety-tuned LLMs that classify inputs and outputs against a taxonomy of harms. Many teams end up running several of these in parallel because no single layer catches everything.
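
None of those frameworks' APIs are shown here, but the "several layers in parallel" point can be sketched generically. The check functions below are hypothetical stand-ins; in practice each might be a safety-tuned model, a regex layer, or a hosted classifier.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy checks: each returns True when the text passes.
def toxicity_check(text: str) -> bool:
    return "hate" not in text.lower()

def secret_check(text: str) -> bool:
    return "akia" not in text.lower()

def run_checks(text: str, checks) -> bool:
    # Fan the checks out in parallel so added latency is roughly the
    # slowest single check, not the sum; fail closed if any rejects.
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        return all(pool.map(lambda c: c(text), checks))
```

Running layers concurrently is one of the standard ways to keep the redundancy without paying for it linearly in latency.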

For teams that do not want to build all of this in-house, Moderation API provides pre-built classifiers and configurable rules that can wrap around an existing model or API. The advantage of an external provider is that the taxonomy, the thresholds, and the escalation logic are maintained as a product rather than as someone's side quest.

The tradeoffs

Guardrails are not free.

Every extra check adds latency, and latency matters for chat and voice products where users notice anything over a second. Every extra filter also produces false positives, and false positives are visible to users in a way that false negatives often are not. A model that refuses a legitimate medical question looks broken; a model that quietly answers a borderline one looks fine until it does not.

The honest version of guardrail design is continuous tuning: measuring refusals and escapes against a known test set, loosening what is too tight, tightening what leaks, and accepting that the ground keeps moving as new jailbreak techniques appear.
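
That measurement step can be as simple as the harness below. `tune_report` is a hypothetical name, and the test set format (prompt, should-it-be-blocked pairs) is an assumption; the point is that refusals and escapes are counted separately, because they fail in different directions.

```python
def tune_report(test_set, guardrail):
    """test_set: (prompt, should_block) pairs; guardrail(prompt)
    returns True when it blocks."""
    refusals = escapes = 0
    for prompt, should_block in test_set:
        blocked = guardrail(prompt)
        if blocked and not should_block:
            refusals += 1   # false positive: visible to users
        elif not blocked and should_block:
            escapes += 1    # false negative: invisible until it isn't
    n = len(test_set)
    return {"refusal_rate": refusals / n, "escape_rate": escapes / n}
```

Tracking both rates over time, against a test set that keeps growing as new jailbreaks appear, is what turns guardrail tuning from guesswork into an actual loop.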

Find out what we'd flag on your platform