Steven's Knowledge

Constitutional AI

Anthropic's principle-driven alignment approach — self-critique, RLAIF mechanics, and writing constitutions for your own use case

Constitutional AI (CAI) is Anthropic's approach to alignment that replaces much of the human feedback loop with a set of written principles — a "constitution" — that the model uses to critique and revise its own outputs. The core insight: instead of collecting thousands of human preference labels for every behavioral rule, you write the rules down and let the model enforce them on itself.

How Constitutional AI Works

The process has two phases:

Phase 1: Supervised Self-Critique (SL-CAI)

  1. Start with a helpful-only model (an SFT model without safety training).
  2. Prompt it with red-team-style inputs that elicit harmful responses.
  3. Ask the model to critique its own response according to a specific principle from the constitution.
  4. Ask it to revise its response based on that critique.
  5. Fine-tune on the (prompt, revised response) pairs.

Each revision step references a specific principle — "Choose the response that is least likely to encourage illegal activity" or "Choose the response that is most respectful of personal autonomy." This makes the process transparent and auditable.

Phase 2: RLAIF

  1. Take the SL-CAI model and generate pairs of responses to prompts.
  2. Ask the model (with the constitution) to judge which response better follows the principles.
  3. Use these AI-generated preference labels to train a reward model.
  4. Run standard RL (PPO) against that reward model.

The result is a model aligned to an explicit set of principles, with minimal direct human preference labeling.

Why Write a Constitution

The alternative — collecting human labels for every behavioral nuance — hits several walls:

  • Implicit rules. Human annotators encode their biases, cultural assumptions, and unstated preferences. You get alignment to something, but you can't easily inspect or change what.
  • Inconsistency at scale. Different annotators have different values. You're averaging over a crowd, not implementing a policy.
  • Iteration speed. Changing a behavior requires recollecting data. Changing a principle is one line in a document.
  • Transparency. A constitution is readable by stakeholders, regulators, and users. A preference dataset is not.

Anatomy of a Good Constitution

A constitution is a list of principles that guide model behavior. The effective ones share patterns:

  • Specific and actionable. "Be helpful" is too vague. "If the user asks for medical advice, recommend consulting a healthcare professional rather than providing a diagnosis" is usable.
  • Prioritized. What happens when principles conflict? A good constitution has a hierarchy. Safety over helpfulness over engagement, for example.
  • Grounded in concrete cases. Each principle should correspond to scenarios you've actually encountered or can anticipate.
  • Finite. Ten to twenty well-chosen principles beat a hundred vague ones. Annotators (and AI judges) can't hold too many constraints in working memory.

Example principles from Anthropic's published work:

  • "Choose the response that is most supportive and encouraging of life, liberty, and personal security."
  • "Choose the response that is least likely to be used to commit, plan, or facilitate violence."
  • "Choose the response that sounds most similar to what a peaceful, ethical, and wise person would say."

Writing Constitutions for Your Use Case

You don't have to be Anthropic to use constitutional AI patterns. The principle-based self-critique loop adapts well to domain-specific alignment:

  1. Start with failure cases. Look at outputs your model currently gets wrong — off-brand, too aggressive, factually wrong, legally risky. Group them by failure type.
  2. Write a principle for each failure type. Be concrete. "The response should not make claims about drug efficacy without citing a published clinical trial" is better than "be accurate about medicine."
  3. Test the principles as prompts. Before training anything, ask a strong model to critique and revise using your principles. If the critiques are useful and the revisions improve quality, the principles are working.
  4. Iterate on the principles, not the data. When you see new failure modes, add a principle. When a principle is too broad, narrow it. This is faster and cheaper than relabeling.
  5. Keep a changelog. Principles evolve. Track why each was added and when it changed.

Self-Critique as a Runtime Pattern

You don't need to retrain a model to use constitutional AI ideas. The critique-revise loop works at inference time too:

  1. Generate a response.
  2. Ask the model: "Does this response violate any of these principles? [list principles]"
  3. If yes, ask: "Revise the response to fix the violation."
  4. Return the revised response.

This is slower (two or three LLM calls instead of one) and more expensive, but it's deployable today with any instruction-following model. Many production systems use this pattern for high-stakes outputs.

Limitations

Constitutional AI isn't a silver bullet:

  • The constitution is only as good as its authors. If the principles have blind spots, the model has blind spots.
  • Self-critique has limits. The model can't reliably catch errors it would also make in generation. The critique model needs to be at least as capable as the generation model.
  • Gaming the constitution. A sufficiently capable model might learn to satisfy the letter of a principle while violating its spirit. This is the constitutional version of reward hacking.
  • Principle conflicts. Real-world queries regularly trigger multiple principles that point in different directions. The resolution logic matters and is hard to get right.
  • Not a replacement for human oversight. CAI reduces the need for per-example human labels, but humans still need to write the constitution, evaluate the results, and update the principles.

The Practical Takeaway

Constitutional AI's real contribution isn't the specific training procedure — it's the idea that alignment objectives should be explicit, readable, and modifiable. Whether you're training with RLAIF or just building a prompt template with behavioral guidelines, the discipline of writing down exactly what you want the model to do (and not do), in clear language, with priorities — that's the pattern worth adopting.

On this page