Model Cards & Documentation

Model cards, datasheets, system cards, and what to document for transparency and accountability

Documentation is the least glamorous part of AI engineering and one of the most consequential. When something goes wrong — and it will — the question is not "did you test it?" but "can you show me the documentation?" Model cards, datasheets, and system cards are the standard formats. Knowing what to put in them separates professional AI engineering from hacking.

Model Cards (Mitchell et al., 2019)

A model card is a short document that accompanies a trained model. The original paper by Mitchell et al. proposed it as a standardized way to communicate what a model does, how it was evaluated, and where it falls short.

A model card should include:

Model details — architecture, version, training date, developer, license.
Intended use — what the model is designed for and explicitly what it's not.
Factors — demographic groups, environmental conditions, or instrumentation that affect performance.
Metrics — which evaluation metrics were used and why.
Evaluation data — what datasets were used for evaluation and how they were chosen.
Training data — high-level description of the training data (detailed description goes in the datasheet).
Quantitative analyses — disaggregated performance across relevant subgroups. This is the heart of the card.
Ethical considerations — known risks, potential for misuse, sensitive use cases.
Caveats and recommendations — known limitations and guidance for downstream users.
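The outline above maps naturally onto a small data structure. What follows is a minimal sketch, not a standard schema: the class name `ModelCard`, its field names, and the example values are assumptions chosen to mirror the nine sections listed here.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelCard:
    """One field per section of the outline above (field names are illustrative)."""
    model_details: str = ""
    intended_use: str = ""
    factors: str = ""
    metrics: str = ""
    evaluation_data: str = ""
    training_data: str = ""
    quantitative_analyses: str = ""
    ethical_considerations: str = ""
    caveats_and_recommendations: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are empty or only whitespace."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the card as plain text, one titled block per section."""
        blocks = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name).strip() or "(not yet documented)"
            blocks.append(f"{title}\n{body}")
        return "\n\n".join(blocks)


# Hypothetical, partially filled card.
card = ModelCard(
    model_details="Hypothetical sentiment classifier, v0.1",
    intended_use="Triage of English product reviews. Not for moderation decisions.",
)
print(card.missing_sections())
print(card.render())
```

Keeping the sections as explicit fields makes gaps visible: an empty section shows up in `missing_sections()` instead of silently disappearing from the rendered card.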

What makes a model card useful vs performative

A bad model card reads like marketing copy with a few tables. A good model card:

Disaggregates performance by subgroup. Not just "92% accuracy" but accuracy broken down by gender, age, language, geography — whatever factors matter for the use case.
Documents failures explicitly. Where does the model break? Under what conditions?
States intended and out-of-scope uses clearly. "This model is designed for X. It should not be used for Y."
Updates over time as the model evolves. A model card is a living document.
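To make the disaggregation point concrete, here is a small sketch that breaks accuracy down by one factor. The record layout (`label`, `prediction`, `language` keys) and the evaluation records themselves are made up for illustration; a real pipeline would read them from its own evaluation output.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: (accuracy, n)} for records with 'label' and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        g = r[group_key]
        total[g] += 1
        correct[g] += int(r["label"] == r["prediction"])
    return {g: (correct[g] / total[g], total[g]) for g in total}


# Made-up evaluation records, for illustration only.
records = [
    {"language": "en", "label": 1, "prediction": 1},
    {"language": "en", "label": 0, "prediction": 0},
    {"language": "en", "label": 1, "prediction": 1},
    {"language": "en", "label": 0, "prediction": 1},
    {"language": "de", "label": 1, "prediction": 0},
    {"language": "de", "label": 0, "prediction": 0},
]

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
print(f"overall: accuracy={overall:.2f} n={len(records)}")
for group, (acc, n) in sorted(accuracy_by_group(records, "language").items()):
    print(f"{group}: accuracy={acc:.2f} n={n}")
```

Reporting the group size next to each accuracy matters: a number computed on a handful of examples should be read very differently from one computed on thousands.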

Datasheets for Datasets (Gebru et al., 2021)

A datasheet accompanies a dataset, inspired by datasheets in electronics. It answers: where did this data come from, how was it collected, what's in it, and what are its limitations?

Key sections:

Motivation — why was this dataset created? Who funded it?
Composition — what's in the dataset? How many instances? What do they represent? Any PII?
Collection process — how was the data gathered? Who collected it? Over what timeframe?
Preprocessing — what cleaning, filtering, or labeling was done?
Uses — what tasks is the dataset appropriate for? What should it not be used for?
Distribution — how is the dataset shared? Under what license?
Maintenance — who maintains the dataset? How can errors be reported?

For AI engineers, datasheets matter because "garbage in, garbage out" applies to organizations as well as to individual models. If your training data has undocumented biases, your model inherits them, and nobody downstream has a way to know. The datasheet is where you force yourself to confront that.
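Some datasheet facts, such as instance counts, label balance, collection timeframe, and which PII fields are populated, can be computed from the data rather than recalled from memory. The sketch below assumes a non-empty list of dict rows; the function name `summarize_composition`, the field names, and the sample rows are all hypothetical.

```python
from collections import Counter
from datetime import date


def summarize_composition(rows, label_field, date_field, pii_fields):
    """Compute basic datasheet facts from a non-empty list of dict rows."""
    dates = [r[date_field] for r in rows]
    # A PII field counts as present if any row has a non-empty value for it.
    present_pii = sorted(f for f in pii_fields if any(r.get(f) for r in rows))
    return {
        "instances": len(rows),
        "label_counts": dict(Counter(r[label_field] for r in rows)),
        "collected_from": min(dates),
        "collected_to": max(dates),
        "pii_fields_present": present_pii,
    }


# Made-up rows, for illustration only.
rows = [
    {"text": "great", "label": "pos", "collected": date(2023, 1, 5), "email": "a@example.com"},
    {"text": "bad", "label": "neg", "collected": date(2023, 3, 9), "email": ""},
    {"text": "fine", "label": "pos", "collected": date(2023, 2, 1)},
]

summary = summarize_composition(rows, "label", "collected", ["email", "phone"])
for key, value in summary.items():
    print(f"{key}: {value}")
```

Numbers like these cover only part of the Composition and Collection process sections. Motivation, consent, and appropriate uses still have to be written by people who know how the data was gathered.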

System Cards for AI Products

A system card documents the entire AI product, not just the model. OpenAI popularized the format with its GPT-4 system card. It covers the full pipeline: model, fine-tuning, safety mitigations, deployment guardrails, and human oversight.

A system card typically includes:

System overview — what the product does, who it's for, how it's deployed.
Model details — the underlying model(s), with references to their model cards.
Safety evaluations — red teaming results, bias evaluations, adversarial testing.
Mitigations — what safety measures are in place (content filters, rate limits, human review).
Deployment constraints — usage policies, geographic restrictions, prohibited uses.
Monitoring — how the system is monitored in production and how issues are escalated.
Known limitations — what the system gets wrong, edge cases, failure modes.

System cards are the right abstraction for customer-facing AI products because users interact with the system, not the raw model.

What to Document (Minimum Viable Documentation)

If you can't do everything, prioritize these for every model or AI system you put into production:

What it does and doesn't do — intended use, out-of-scope use. Two paragraphs.
Training data summary — source, size, date range, known gaps. One page.
Evaluation results, disaggregated — performance across relevant subgroups. One table with commentary.
Known failure modes — specific inputs or conditions where the system fails. Bullet list.
Who to contact — the team that owns the model, how to report issues.

This takes half a day for a model you know well. Skip it and you're storing up pain for when someone asks — and they will.

Transparency Reporting

Beyond per-model documentation, organizations should publish periodic transparency reports covering:

What AI systems are deployed and what they do.
Aggregate performance and safety metrics across systems.
Incidents — what went wrong, what was done about it, what changed.
User feedback and complaints — volume, categories, resolution rates.
Governance actions — how many use cases were reviewed, approved, rejected, or modified.

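The EU AI Act imposes related duties on providers of high-risk systems, including technical documentation, post-market monitoring, and reporting of serious incidents to authorities. Even without a legal obligation, transparency reports build trust with users, customers, and regulators.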
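Much of a transparency report is aggregation over records the organization already keeps. As a rough sketch, assuming complaints are logged with a `category` and a `resolved` flag and governance reviews with an `outcome` (all hypothetical field names and made-up data), the counts for one reporting period could be produced like this:

```python
from collections import Counter


def summarize_period(complaints, reviews):
    """Aggregate complaint and governance records into report-level counts."""
    resolved = sum(1 for c in complaints if c["resolved"])
    return {
        "complaint_volume": len(complaints),
        "complaints_by_category": dict(Counter(c["category"] for c in complaints)),
        # Undefined when there were no complaints in the period.
        "resolution_rate": resolved / len(complaints) if complaints else None,
        "governance_outcomes": dict(Counter(r["outcome"] for r in reviews)),
    }


# Made-up records for one reporting period.
complaints = [
    {"category": "harmful output", "resolved": True},
    {"category": "harmful output", "resolved": False},
    {"category": "outage", "resolved": True},
    {"category": "privacy", "resolved": True},
]
reviews = [
    {"outcome": "approved"},
    {"outcome": "approved"},
    {"outcome": "rejected"},
    {"outcome": "modified"},
]

report = summarize_period(complaints, reviews)
for key, value in report.items():
    print(f"{key}: {value}")
```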

Bias Documentation

Bias documentation deserves special attention. It's not enough to say "we tested for bias." Document:

Which biases you tested for — demographic parity, equalized odds, calibration, others? Why those?
Which groups you evaluated — and why those groups? Who was excluded and why?
What you found — specific disparities, with numbers. Don't hide unfavorable results.
What you did about it — mitigation steps taken, their effectiveness, remaining gaps.
What you couldn't test — data limitations, groups too small for statistical power, intersectional analyses not performed.

Honest bias documentation protects you more than sanitized documentation. Regulators and auditors can tell the difference. A company that says "we found these disparities and here's what we did" is in a far better position than one that says "we found no bias" when it clearly didn't look hard enough.
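Two of the criteria named above reduce to simple per-group rates for a binary classifier: demographic parity compares selection rates, and equalized odds compares true positive and false positive rates. The sketch below computes those gaps and flags groups that are too small to support conclusions. The record layout, the data, and the `MIN_GROUP_SIZE` threshold are assumptions for illustration, not recommended values.

```python
MIN_GROUP_SIZE = 30  # assumed reporting threshold, not a statistical rule


def group_rates(records, group_key):
    """Per group: size, selection rate, true positive rate, false positive rate."""
    out = {}
    for g in sorted({r[group_key] for r in records}):
        rows = [r for r in records if r[group_key] == g]
        pos = [r for r in rows if r["label"] == 1]
        neg = [r for r in rows if r["label"] == 0]
        out[g] = {
            "n": len(rows),
            "selection_rate": sum(r["prediction"] for r in rows) / len(rows),
            "tpr": sum(r["prediction"] for r in pos) / len(pos) if pos else None,
            "fpr": sum(r["prediction"] for r in neg) / len(neg) if neg else None,
        }
    return out


def gap(rates, metric):
    """Largest minus smallest value of a metric across groups, skipping undefined ones."""
    values = [v[metric] for v in rates.values() if v[metric] is not None]
    return max(values) - min(values)


# Made-up binary predictions for two hypothetical groups.
records = [
    {"group": "A", "label": 1, "prediction": 1},
    {"group": "A", "label": 1, "prediction": 1},
    {"group": "A", "label": 0, "prediction": 1},
    {"group": "A", "label": 0, "prediction": 0},
    {"group": "B", "label": 1, "prediction": 1},
    {"group": "B", "label": 1, "prediction": 0},
    {"group": "B", "label": 0, "prediction": 0},
    {"group": "B", "label": 0, "prediction": 0},
]

rates = group_rates(records, "group")
print("demographic parity gap:", gap(rates, "selection_rate"))
print("equalized odds gaps (TPR, FPR):", gap(rates, "tpr"), gap(rates, "fpr"))
too_small = [g for g, v in rates.items() if v["n"] < MIN_GROUP_SIZE]
print("groups below reporting threshold:", too_small)
```

The list of undersized groups belongs in the "what you couldn't test" part of the documentation, alongside the gaps themselves.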

Keeping Documentation Alive

Documentation rots faster than code. Practices that help:

Automate what you can — generate model cards from evaluation pipelines, not manually.
Tie documentation to the release process — no release without updated docs.
Version documentation alongside the model — same repo, same commit, same review.
Assign an owner — documentation without an owner is documentation that decays.
Review quarterly — even if the model hasn't changed, the world around it has.
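Several of these practices can be enforced mechanically at release time. The following is a sketch of such a gate under stated assumptions: the card is a dict with `owner`, `model_version`, and `last_reviewed` keys, and `REVIEW_INTERVAL` approximates a quarter. None of these names come from a particular tool.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=92)  # roughly one quarter; an assumed policy value


def release_blockers(model_version, card, today):
    """Return reasons to block the release; an empty list means the docs are current."""
    problems = []
    if not card.get("owner"):
        problems.append("model card has no owner")
    if card.get("model_version") != model_version:
        problems.append(
            f"card documents {card.get('model_version')}, release is {model_version}"
        )
    last_reviewed = card.get("last_reviewed")
    if last_reviewed is None or today - last_reviewed > REVIEW_INTERVAL:
        problems.append("card has not been reviewed in the last quarter")
    return problems


# Hypothetical card metadata that lags behind the model being released.
card = {
    "owner": "ml-platform-team",
    "model_version": "1.3.0",
    "last_reviewed": date(2024, 1, 10),
}

blockers = release_blockers("1.4.0", card, today=date(2024, 6, 1))
for reason in blockers:
    print("BLOCKED:", reason)
```

Run in continuous integration, a check like this turns "no release without updated docs" from a policy statement into a failing build.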
