Audit Trails

Logging for compliance, decision trails, explainability, and reproducibility in AI systems

When a regulator asks "why did your system make this decision?", you need an answer. Not a vague one — a specific, traceable, reproducible answer. Audit trails are the infrastructure that makes this possible, and bolting them on after the fact is painful. Build them in from the start.

Why Audit Trails Matter for AI

Traditional software audit logs track who did what. AI audit trails are harder because the system itself makes decisions. You need to capture:

What the system decided — the output, recommendation, or action taken.
What inputs drove the decision — the data the model saw at inference time.
Which model version produced it — not just "v2" but the exact checkpoint, config, and prompt template.
What the confidence was — scores, probabilities, or uncertainty estimates.
Whether a human was in the loop — and what they did (approved, overrode, ignored).

Input/Output Logging at Scale

The simplest and most valuable audit measure: log every input and output of your AI system.

What to log

For each inference call:

Timestamp (UTC, millisecond precision)
Request ID (unique, for tracing through the system)
User/session identifier (pseudonymized if needed for privacy)
Full input — the prompt, context, retrieved documents, system instructions
Full output — the model's raw response before any post-processing
Post-processed output — what the user actually saw
Model identifier — model name, version, checkpoint hash
Latency and token counts
Any filters triggered — safety filters, content moderation, guardrails
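
A minimal sketch of how the fields above could be captured as one JSON line per inference call. The class name InferenceLogRecord and all field names are illustrative assumptions, not a standard schema; adapt them to your own system.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def utc_now_ms() -> str:
    """Current UTC time as ISO 8601 with millisecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


@dataclass
class InferenceLogRecord:
    # Hypothetical field names mirroring the list above.
    user_id: str            # pseudonymize upstream if privacy rules require it
    full_input: dict        # prompt, context, retrieved documents, system instructions
    raw_output: str         # model response before any post-processing
    final_output: str       # what the user actually saw
    model_name: str
    model_version: str
    checkpoint_hash: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    filters_triggered: list = field(default_factory=list)
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=utc_now_ms)

    def to_json_line(self) -> str:
        # json.dumps escapes newlines inside strings, so the result is one line.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
```

Generating the request ID and timestamp inside the record keeps callers from forgetting them, though in a real service the request ID would more likely be propagated from the incoming request so it can be traced across components.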

Practical concerns

Storage cost — full I/O logging for a busy system generates terabytes. Use tiered storage: hot (30 days), warm (1 year), cold (retention period). Compress aggressively.
PII in logs — inputs often contain user PII. You need a strategy: log-and-redact, log-and-encrypt, or log with access controls. Don't skip logging because of PII — that's worse for compliance.
Sampling — for low-risk systems, you may log a statistical sample rather than every call. For high-risk systems, log everything.
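
Two of the concerns above can be sketched with the standard library alone: pseudonymizing user identifiers with a keyed hash, and sampling deterministically so the same request is always either logged or not. The function names, the key handling, and the sample rate are assumptions for illustration; key storage and rotation are out of scope here.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Keyed hash: stable per user, not reversible without the key."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def should_log(request_id: str, sample_rate: float, high_risk: bool) -> bool:
    """Log every call for high-risk systems; otherwise sample by request ID."""
    if high_risk:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against a threshold
    # proportional to the sample rate. Same request ID, same answer.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)
```

Hashing the request ID rather than calling a random number generator means every component that sees the request makes the same logging decision, so sampled traces stay complete.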

Decision Audit Trails

For systems that make consequential decisions (credit, hiring, medical), you need more than I/O logs. You need a decision trace that explains the path from input to output:

Feature attribution — which input features most influenced the decision? SHAP values, attention weights, or integrated gradients.
Retrieval provenance — if the system used RAG, which documents were retrieved and what were their relevance scores?
Chain-of-thought — if the model reasoned step by step, capture the reasoning trace.
Rule/policy application — which business rules, thresholds, or policies were applied post-model?
Alternative outcomes — what would the decision have been with slightly different inputs? (Store this for high-stakes decisions, not every call.)
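
As an illustration of the retrieval provenance and rule/policy items, the sketch below records the model's own decision, the retrieved documents, and every post-model rule that was evaluated, in order. The names DecisionTrace, RuleResult, and decide, the 0.7 threshold, and the decision labels are all hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RuleResult:
    rule_id: str
    fired: bool
    detail: str


@dataclass
class DecisionTrace:
    request_id: str
    model_score: float
    model_decision: str
    retrieved: list = field(default_factory=list)  # e.g. [{"doc_id": ..., "score": ...}]
    rules: list = field(default_factory=list)      # RuleResult entries, in order applied
    final_decision: str = ""


def decide(request_id, model_score, features, retrieved, rules, approve_threshold=0.7):
    """Apply post-model rules and record each one, whether or not it fired.

    `rules` is a list of (rule_id, predicate, forced_decision) tuples, where
    predicate takes the feature dict and returns a bool.
    """
    model_decision = "approve" if model_score >= approve_threshold else "refer"
    trace = DecisionTrace(request_id, model_score, model_decision, list(retrieved))
    decision = model_decision
    for rule_id, predicate, forced in rules:
        fired = bool(predicate(features))
        detail = f"forced decision to {forced!r}" if fired else "not triggered"
        trace.rules.append(RuleResult(rule_id, fired, detail))
        if fired:
            decision = forced  # later rules can override earlier ones
    trace.final_decision = decision
    return trace
```

Recording rules that did not fire matters as much as recording those that did: a reviewer can then see that a policy was checked and passed, rather than guessing whether it ran at all. Calling asdict(trace) yields a plain dict ready for JSON serialization.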

Explainability Requirements

Different stakeholders need different explanations:

End users — "Why was my application denied?" Plain language, actionable. Required by ECOA (credit), GDPR Article 22 (automated decisions).
Internal reviewers — "Is this model behaving as intended?" Technical metrics, distribution shifts, edge case analysis.
Regulators — "Demonstrate that this system is fair and accurate." Statistical evidence, methodology documentation, test results.
Auditors — "Show me the complete chain from data to decision." Full provenance, reproducible from logs.

Build your logging to serve all four audiences. The underlying data is the same; the presentation layer differs.

Reproducibility for Regulatory Review

A regulator may ask you to reproduce a specific decision months or years later. This requires:

Model versioning — every deployed model version is archived and can be re-loaded.
Data snapshots — the retrieval corpus, feature store, and reference data as they existed at decision time.
Config versioning — prompt templates, system instructions, hyperparameters, and post-processing rules.
Deterministic inference — set random seeds where possible. Document where non-determinism exists (e.g., model sampling temperature).
Environment pinning — the inference stack (library versions, hardware) should be reproducible.
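
One way to tie these items together is a small reproducibility manifest stored alongside each decision, or referenced from it by hash. The sketch below is an assumption about what such a manifest might contain; build_manifest and its parameters are hypothetical, and a real checkpoint would be hashed by streaming the file rather than passing bytes in memory.

```python
import hashlib
import json
import platform
import sys


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical_hash(obj) -> str:
    """Hash a JSON-serializable object so that key order does not matter."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha256_bytes(payload)


def build_manifest(checkpoint_bytes, config, corpus_snapshot_id, seed, package_versions):
    """Collect what is needed to re-run a decision later."""
    return {
        "checkpoint_sha256": sha256_bytes(checkpoint_bytes),
        "config_sha256": canonical_hash(config),  # prompt templates, hyperparameters, rules
        "corpus_snapshot_id": corpus_snapshot_id,
        "seed": seed,
        "temperature": config.get("temperature"),  # nonzero means sampled outputs may differ
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "package_versions": dict(sorted(package_versions.items())),
    }
```

Because the manifest is the same for every call served by one deployment, it can be written once per deployment and each log record can carry only its hash.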

Full reproducibility is expensive. A pragmatic approach: ensure exact reproducibility for Tier 3 (high-risk) systems, and "substantially similar" reproducibility for others.

Implementation Patterns

Structured logging

Use structured formats (JSON lines, Protobuf) rather than free-text logs. Schema-enforce your audit events so nothing critical is missing:

Define a canonical AuditEvent schema.
Validate every log entry against it at write time.
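Version the schema and record the schema version in every event. When the schema changes, translate old versions at read time or write migrated copies alongside the originals; rewriting stored logs in place conflicts with the tamper-evidence requirement.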
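
A minimal sketch of write-time validation using only the standard library. The field set in AUDIT_EVENT_FIELDS is a deliberately small, hypothetical AuditEvent schema; a production system would more likely use a schema library or Protobuf definitions.

```python
import json

SCHEMA_VERSION = 1

# Hypothetical canonical schema: field name -> required Python type.
AUDIT_EVENT_FIELDS = {
    "schema_version": int,
    "timestamp": str,
    "request_id": str,
    "model_id": str,
    "input": str,
    "output": str,
}


def validate_event(event: dict) -> None:
    """Raise ValueError if a required field is missing or has the wrong type."""
    missing = [k for k in AUDIT_EVENT_FIELDS if k not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    wrong = [k for k, t in AUDIT_EVENT_FIELDS.items() if not isinstance(event[k], t)]
    if wrong:
        raise ValueError(f"wrong types for: {wrong}")
    if event["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {event['schema_version']}")


def write_event(event: dict, sink) -> None:
    """Validate first, then append one JSON line to any object with .write()."""
    validate_event(event)
    sink.write(json.dumps(event, sort_keys=True) + "\n")
```

Rejecting an invalid event at write time is a design choice with a cost: the caller must decide what to do when logging fails. For an audit log, failing loudly is usually preferable to silently storing an incomplete record.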

Immutable storage

Audit logs must be tamper-evident. Options:

Append-only storage with write-once policies (S3 Object Lock, WORM storage).
Hash chains — each log entry includes a hash of the previous entry, making tampering detectable.
Third-party attestation — for highest assurance, use a third-party service to timestamp and sign log batches.
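
The hash-chain option can be sketched in a few lines. This is an in-memory illustration under simple assumptions (SHA-256, JSON payloads, an all-zero genesis hash); the function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"prev_hash": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash from the start; any edited entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A hash chain on its own detects edits to existing entries, but not removal of the most recent entries or a wholesale rewrite of the entire chain by someone with write access. Periodically publishing the latest hash somewhere the log writer cannot modify closes that gap, which is where write-once storage and third-party attestation complement it.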

Retention policies

Align retention with regulatory requirements:

The EU AI Act requires logs for high-risk systems to be kept for a period appropriate to the system's intended purpose, and for at least six months unless other applicable law says otherwise.
Financial regulations often require 5-7 years.
Healthcare may require retention for the lifetime of the patient record.

Document your retention policy, enforce it automatically, and make sure deletion is also logged.
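
Automatic enforcement might look like the sketch below, which reuses the hot/warm/cold tiers from the storage discussion (30 days, 1 year, then cold until the retention period ends) and writes a deletion event for every record it removes. The function names, record layout, and event fields are assumptions; the retention period itself must come from your own regulatory analysis.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30     # tier boundaries taken from the tiered-storage guidance
WARM_DAYS = 365


def storage_action(created_at: datetime, now: datetime, retention_days: int) -> str:
    """Return the tier a record belongs in, or "delete" once retention has passed."""
    age = now - created_at
    if age >= timedelta(days=retention_days):
        return "delete"
    if age >= timedelta(days=WARM_DAYS):
        return "cold"
    if age >= timedelta(days=HOT_DAYS):
        return "warm"
    return "hot"


def enforce_retention(records: list, now: datetime, retention_days: int, deletion_log: list) -> list:
    """Drop expired records and append one deletion event per dropped record."""
    kept = []
    for rec in records:
        if storage_action(rec["created_at"], now, retention_days) == "delete":
            deletion_log.append({
                "event": "audit_log_deleted",
                "request_id": rec["request_id"],
                "deleted_at": now.isoformat(timespec="milliseconds"),
                "policy_retention_days": retention_days,
            })
        else:
            kept.append(rec)
    return kept
```

The deletion event keeps the request ID and the policy that justified the deletion but none of the deleted content, so the deletion log can outlive the records it describes.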
