Data Privacy

PII in training and inference, GDPR challenges, data residency, anonymization, and federated learning

AI systems are data-hungry by nature, and that data is often about people. The collision between "train on everything" and "protect personal data" is one of the hardest practical problems in AI engineering. Get it wrong and you face fines, lawsuits, and lost trust. Get it right and privacy becomes a competitive advantage.

PII in Training Data

Large models can memorize parts of their training data. If that data contains personally identifiable information (PII), the model may reproduce it verbatim: names, emails, phone numbers, medical details. This is not theoretical; training-data extraction attacks have recovered such details from deployed language models.

What to do:

Audit training data for PII before training. Use NER-based PII detection tools, regex patterns, and manual spot checks.
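Scrub or replace PII with synthetic equivalents. Full removal is the safer choice; replacement is the better choice when the model needs to see the structure of the data (for example, that an email address follows "Contact:").
Differential privacy during training adds noise that bounds how much any single data point can influence the model. The bound is provable, but it usually costs some model quality, so tune the privacy budget (epsilon) carefully.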
Canary testing — insert known PII canaries into training data and check if the trained model can reproduce them. A practical memorization detector.
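The canary test can be sketched in a few lines. This is a minimal illustration, not a standard tool: `make_canaries`, `split_canary`, and `leaked_canaries` are hypothetical helper names, and `generate` stands in for whatever function calls your trained model. The idea is to prompt the model with the start of each canary and check whether the secret remainder comes back.

```python
import secrets
from typing import Callable


def make_canaries(n: int) -> list[str]:
    """Create n unique fake records that cannot occur in real data by chance."""
    return [
        f"canary-{i} contact: user{secrets.token_hex(4)}@example.invalid "
        f"code {secrets.token_hex(8)}"
        for i in range(n)
    ]


def split_canary(canary: str, prefix_words: int = 2) -> tuple[str, str]:
    """Split a canary into a prompt prefix and the secret remainder."""
    words = canary.split()
    return " ".join(words[:prefix_words]), " ".join(words[prefix_words:])


def leaked_canaries(canaries: list[str], generate: Callable[[str], str]) -> list[str]:
    """Return the canaries whose secret part appears in the model's completion."""
    leaked = []
    for canary in canaries:
        prompt, secret = split_canary(canary)
        if secret in generate(prompt):
            leaked.append(canary)
    return leaked
```

Insert the output of `make_canaries` into the training set, train, then pass your model's completion function as `generate`. Any non-empty result means the model memorized at least one inserted record. An empty result is weaker evidence, since exact-match checks miss partial or paraphrased leaks.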

PII in Prompts

Even if your model is clean, users will paste PII into prompts. Customer support agents paste full customer records. Developers paste stack traces with usernames. This PII then flows through your API providers, logging infrastructure, and analytics.

Mitigations:

Client-side PII detection and redaction — strip PII before it reaches your API. Replace with placeholders, restore in the response.
Prompt anonymization proxies — middleware that detects and masks PII in transit, de-masks on the way back.
Contractual protections — your data processing agreement (DPA) with the API provider should explicitly prohibit training on user inputs.
Minimize logging of raw prompts — or encrypt them with per-user keys.
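A minimal sketch of the redact-then-restore pattern described above, assuming only two PII types detected by regex. The patterns and the `redact` and `restore` names are illustrative assumptions; real deployments typically add NER-based detection, because regexes will not catch names or free-text addresses.

```python
import re

# Deliberately simple patterns; they are illustrative, not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and phone numbers with placeholders.

    Returns the redacted text and a placeholder -> original mapping,
    which must stay on the client side.
    """
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}

    def replacer(label: str):
        def _sub(match: re.Match) -> str:
            value = match.group(0)
            # Reuse the placeholder if the same value was already seen.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            mapping[placeholder] = value
            return placeholder
        return _sub

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = PHONE.sub(replacer("PHONE"), text)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into a model response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Only the redacted text leaves the client. The mapping is kept locally for the duration of the request and discarded afterwards, so the provider and any downstream logs see placeholders only.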

GDPR and the Right to Erasure

GDPR Article 17 gives individuals the right to have their personal data deleted. For traditional databases, this is straightforward. For trained models, it's an open problem.

The core tension: you can't selectively "delete" a person's data from a trained model's weights, because the influence of any one record is spread across many parameters. Current approaches:

Don't train on personal data — the cleanest solution. Use only consented, anonymized, or synthetic data.
Machine unlearning — retrain or fine-tune the model to "forget" specific data points. Active research area; no production-ready solution works reliably at scale yet.
Data deletion from the training pipeline — delete the source data and document that the model was trained before deletion was requested. Accept the risk that memorized data persists.
Guardrails at inference — even if the model "knows" the data, prevent it from outputting it via content filters.

The pragmatic position: prevent PII from entering training data in the first place. For data already trained on, maintain a deletion log, apply inference guardrails, and retrain periodically on cleaned data.
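One way to combine a deletion log with an inference guardrail is sketched below. It is an assumption-laden illustration: the `DeletionLog` class and `filter_output` function are hypothetical names, and only email addresses are checked. The log stores keyed hashes rather than raw identifiers, so the log itself does not become a new copy of the data that was supposed to be deleted.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class DeletionLog:
    """Remembers which identifiers were erased without storing them in the clear."""

    def __init__(self, key: bytes):
        self._key = key
        self._digests: set[str] = set()

    def _digest(self, identifier: str) -> str:
        # Normalize so "Jane@Example.com" and "jane@example.com" match.
        normalized = identifier.strip().lower().encode("utf-8")
        return hmac.new(self._key, normalized, hashlib.sha256).hexdigest()

    def record(self, identifier: str) -> None:
        self._digests.add(self._digest(identifier))

    def contains(self, identifier: str) -> bool:
        return self._digest(identifier) in self._digests


def filter_output(text: str, log: DeletionLog) -> str:
    """Mask any email in model output that appears in the deletion log."""
    def _mask(match: re.Match) -> str:
        return "[removed]" if log.contains(match.group(0)) else match.group(0)
    return EMAIL.sub(_mask, text)
```

The same log can drive the periodic retraining step: before each retrain, drop every source record whose identifier is in the log. Exact-match filtering will not catch paraphrased or partial leaks, so treat it as one layer rather than a guarantee.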

Data Residency

Where does your data live? Many jurisdictions require that personal data about people in their territory stay within their borders, or restrict how it can leave:

EU — GDPR restricts transfers outside the EU/EEA unless adequacy decisions or appropriate safeguards apply.
China — data localization requirements for critical information infrastructure operators.
India — evolving requirements under the DPDP Act (DPDPA), which allows the government to restrict transfers of personal data to specific countries.
Sector-specific — healthcare data (for example, data covered by HIPAA), financial data, and government data often carry residency requirements, set by regulation or by contract.

For AI systems, data residency applies to:

Training data — where it's stored and processed during training.
Inference inputs — where user prompts are sent and processed.
Logs and audit trails — where inference records are stored.
Model weights — some regulations may restrict where models trained on local data can be deployed.

If you use cloud AI APIs, know which region your requests route to. Most providers offer region-specific endpoints — use them.
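Region pinning is easiest to enforce in code rather than by convention. The sketch below assumes a hypothetical provider with per-region base URLs; the endpoint URLs, the `RESIDENCY_POLICY` table, and the `endpoint_for` function are all made up for illustration, so substitute your provider's documented regional endpoints.

```python
# Hypothetical regional base URLs; replace with your provider's real ones.
REGION_ENDPOINTS = {
    "eu": "https://eu.api.example.invalid/v1",
    "us": "https://us.api.example.invalid/v1",
}

# Which processing region each user country is allowed to use (illustrative).
RESIDENCY_POLICY = {
    "DE": "eu",
    "FR": "eu",
    "US": "us",
}


class ResidencyError(Exception):
    """Raised when no approved processing region exists for a user."""


def endpoint_for(country_code: str) -> str:
    """Return the API base URL approved for this user's country.

    Fails closed: an unknown country raises instead of silently
    falling back to a default region.
    """
    region = RESIDENCY_POLICY.get(country_code.upper())
    if region is None:
        raise ResidencyError(f"no approved region for country {country_code!r}")
    return REGION_ENDPOINTS[region]
```

Failing closed is the important design choice here. A default region is convenient, but it turns every unmapped country into a silent cross-border transfer.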

Anonymization Techniques

True anonymization in the GDPR sense means individuals can no longer be identified by any means reasonably likely to be used, including linkage with auxiliary datasets. This is a high bar.

Practical techniques:

k-anonymity — ensure every record is indistinguishable from at least k-1 others on quasi-identifiers.
Differential privacy — add calibrated noise so individual contributions are mathematically bounded.
Synthetic data generation — train a model on real data, then generate synthetic records that preserve statistical properties without containing real individuals.
Aggregation — use only aggregate statistics rather than individual records.
Pseudonymization — replace identifiers with tokens. Useful but legally distinct from anonymization under GDPR — pseudonymized data is still personal data.
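To make k-anonymity concrete, here is a small sketch that measures it for a list of records. The function names and the sample field names are assumptions for illustration. A dataset's k is the size of its smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def violating_groups(records: list[dict], quasi_identifiers: list[str], k: int) -> list[tuple]:
    """List quasi-identifier combinations shared by fewer than k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return [combo for combo, count in groups.items() if count < k]


# Example: ZIP codes and ages are already generalized into coarse buckets.
records = [
    {"zip": "941**", "age": "30-39", "diagnosis": "A"},
    {"zip": "941**", "age": "30-39", "diagnosis": "B"},
    {"zip": "941**", "age": "40-49", "diagnosis": "A"},
]
```

With these records, grouping on `zip` alone gives k = 3, but adding `age` drops k to 1 because one person is alone in the 40-49 bucket. Fixing a violation means generalizing further (wider age buckets) or suppressing the outlying records.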

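For AI training, synthetic data generation is increasingly the preferred option, because synthetic data quality is now good enough for many tasks. Check the generator for memorization first, since a generative model trained on real records can reproduce them.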

Federated Learning as a Privacy Tool

Federated learning trains models across distributed data sources without centralizing the data. Each participant trains locally and shares only model updates (gradients or weight deltas), not raw data.
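The training loop can be illustrated with federated averaging on a toy linear model. This is a sketch under simplifying assumptions: plain Python lists instead of tensors, every client participating in every round, and no privacy protection on the updates. The function names are illustrative.

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one client's private data.

    data is a list of (features, target) pairs; the model is a dot product.
    """
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * error * xj / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def train(clients, dim, rounds=20):
    """Server loop: broadcast the global model, collect local models, average."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_update(global_w, data) for data in clients]
        global_w = federated_average(local_models, [len(d) for d in clients])
    return global_w
```

Only `local_models` crosses the network; each client's `data` never leaves `local_update`. For example, two clients holding points from y = 2x, such as `[((1.0,), 2.0), ((2.0,), 4.0)]` and `[((3.0,), 6.0)]`, converge to a shared weight close to 2 without either one seeing the other's points.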

Where it helps:

Healthcare — hospitals can collaboratively train models without sharing patient records.
Mobile/edge — user data stays on device; only aggregated model updates go to the server.
Cross-organization — competitors can build shared models without exposing proprietary data.

Limitations engineers should know:

Gradient leakage — model updates can leak information about training data. Secure aggregation and differential privacy help but add complexity.
Heterogeneous data — data across participants is rarely independent and identically distributed (IID), which can slow convergence and degrade model quality.
Communication overhead — syncing model updates across many participants is expensive.
Not a silver bullet for compliance — federated learning reduces data exposure but doesn't automatically satisfy GDPR. You still need a legal basis for processing.
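The core idea behind secure aggregation, as a defense against gradient leakage, is that clients add masks that cancel out in the sum, so the server learns the aggregate but not any individual update. The sketch below shows only that cancellation. It is not a secure protocol: a single seeded generator stands in for the pairwise key agreement a real system would use, updates are assumed to be quantized to integers, and client dropouts are not handled.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def pairwise_masks(n_clients: int, dim: int, seed: int) -> list[list[int]]:
    """Build one mask per client such that all masks sum to zero (mod MODULUS).

    For each pair (i, j), client i adds a shared random vector and client j
    subtracts it. In a real protocol each pair derives this vector privately.
    """
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] = (masks[i][d] + shared[d]) % MODULUS
                masks[j][d] = (masks[j][d] - shared[d]) % MODULUS
    return masks


def mask_update(update: list[int], mask: list[int]) -> list[int]:
    """What a client actually sends: its update plus its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates: list[list[int]]) -> list[int]:
    """Server-side sum; the pairwise masks cancel, leaving the true total."""
    dim = len(masked_updates[0])
    return [sum(u[d] for u in masked_updates) % MODULUS for d in range(dim)]
```

Each masked update looks random on its own, yet summing three clients' masked versions of `[1, 2]`, `[3, 4]`, and `[5, 6]` still yields `[9, 12]`. The extra complexity the text mentions comes from everything this sketch leaves out: key exchange, dropout recovery, and quantization of real-valued gradients.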

DPAs with AI Providers

If you use third-party AI APIs, your Data Processing Agreement (DPA) is a critical compliance document. Key provisions to negotiate:

No training on inputs/outputs — the provider must not use your data to train or improve their models.
Data retention limits — how long does the provider retain your data? Push for zero retention or the minimum possible.
Sub-processors — who else handles your data? Where are they located?
Data residency guarantees — which regions process and store your data?
Incident notification — how quickly must the provider notify you of a breach?
Audit rights — can you (or your auditor) inspect the provider's practices?
Deletion on termination — what happens to your data when the contract ends?

Don't treat the DPA as a legal-only document. Engineers should read it to understand what technical guarantees actually exist versus what's just contractual language.