Push a user’s e-mail or SSN into a prompt and you’ve just minted a privacy time-bomb. This deep dive shows how our Micro-GCC squads stop that from happening with a three-layer defence:
➊ open-source PII redaction before vector storage,
➋ AES-GCM–encrypted embeddings, and
➌ placeholder “rehydration” at generation time.
We benchmarked spaCy + Presidio, Amazon Comprehend, and GPT-4o for precision/recall, then wired the winner into a RAG stack. Copy-paste Docker Compose and Lambda snippets included.
LLMs store prompts in logs for retraining or debugging. If a user’s phone number lands there, every copy of the model—or your service provider’s logs—now holds regulated data. Regulators don’t care that “it’s just context.” They care that regulated data now lives in systems you don’t control.
Objective: ensure raw PII never crosses the “prompt boundary”; only anonymised tokens do.
Think of it like a data diode: user data only flows one way—into your secure DB, never into the LLM.
| Engine | Approach | Precision | Recall | Latency (ms/chunk) | Cost per 1 M words |
| --- | --- | --- | --- | --- | --- |
| spaCy + Presidio | NER + regex | 0.91 | 0.93 | 28 | $0 (self-host) |
| Amazon Comprehend | API NER | 0.94 | 0.90 | 140 | $1.00 |
| GPT-4o | LLM zero-shot | 0.89 | 0.92 | 500 | $3.40 |
Dataset: HIPAA-mini + Enron-PII blend, 50 K sentences.
Pick: spaCy + Presidio, the best recall within the latency and cost budget.
```yaml
services:
  redact:
    image: mcr.microsoft.com/presidio-analyzer:latest
    ports: ["3005:3000"]
  anonym:
    image: mcr.microsoft.com/presidio-anonymizer:latest
    ports: ["3006:3000"]
```
We use the Iceberg token scheme:
```text
<EMAIL::sha256(email@example.com)::1>
<PHONE::sha256(+15551234567)::1>
```
Why deterministic?
- Same user string → same token → better vector clustering.
- The hash acts as a salted pseudo-ID; an attacker can’t reverse it without the raw value.
- The `::1` suffix denotes the scheme version, for future re-hash migrations.
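The properties above can be sketched in a few lines; a minimal token generator assuming a per-tenant salt (the `make_token` helper name and the 12-character digest truncation are our illustration, not part of the stack described here):

```python
import hashlib


def make_token(entity_type: str, raw_value: str, tenant_salt: bytes, version: int = 1) -> str:
    """Deterministic placeholder: same input + same salt always yields the same token."""
    digest = hashlib.sha256(tenant_salt + raw_value.encode("utf-8")).hexdigest()[:12]
    return f"<{entity_type}::{digest}::{version}>"


# Same e-mail under the same tenant salt -> identical token, so embeddings cluster.
t1 = make_token("EMAIL", "email@example.com", b"tenant123")
t2 = make_token("EMAIL", "email@example.com", b"tenant123")
assert t1 == t2
```

Because the salt is per-tenant, identical values from different tenants still produce different tokens, which blocks cross-tenant correlation.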
Implementation (Python):
```python
import hashlib

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


def email_token(raw: str) -> str:
    """Build the deterministic placeholder from the detected e-mail."""
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    return f"<EMAIL::{digest}::1>"


def redact(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    redacted = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={"EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": email_token})},
    )
    return redacted.text
```
Metadata JSON stored alongside the vector:
```json
{
  "token": "<EMAIL::ab12cd::1>",
  "enc_payload": "Base64:AESGCM(plaintext=email@example.com)",
  "iv": "random_iv",
  "aad": "tenant123"
}
```
Terraform sample for pgvector + KMS IAM role in repo.
Latency hit: 7 ms average for decrypt + replace.
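The rehydrate step itself is a single regex pass over the model output, swapping each placeholder for its decrypted value; a sketch in which the `VAULT` dict stands in for the real KMS-backed decrypt of `enc_payload`:

```python
import re

# Stub: in the real pipeline this lookup is an AES-GCM decrypt of the stored
# enc_payload, keyed through KMS and scoped to the requesting tenant.
VAULT = {"<EMAIL::ab12cd::1>": "email@example.com"}

TOKEN_RE = re.compile(r"<[A-Z_]+::[0-9a-f]+::\d+>")


def rehydrate(text: str) -> str:
    """Replace every placeholder token with its decrypted original value."""
    return TOKEN_RE.sub(lambda m: VAULT.get(m.group(0), m.group(0)), text)


print(rehydrate("Please contact <EMAIL::ab12cd::1> for details."))
# -> Please contact email@example.com for details.
```

Unknown tokens are left in place rather than dropped, which keeps a failed decrypt visible instead of silently corrupting the response.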
| Stage | Baseline p95 | With Redaction | Delta |
| --- | --- | --- | --- |
| Embed + Search | 220 ms | 235 ms | +15 ms |
| LLM Generate | 420 ms | 420 ms | 0 |
| Rehydrate | N/A | 7 ms | +7 ms |
| Total | 640 ms | 662 ms | +22 ms (3.4 %) |
Extra cost: $0 (self-hosted), plus minor KMS charges (~$0.05 per 1 M decrypts).
Problem: EU users submit mood journals containing e-mails and phone numbers; the prototype leaked PII into OpenAI logs.
Fix: spaCy/Presidio redaction + encrypted tokens.
Outcome: raw PII no longer crosses the prompt boundary; only anonymised tokens reach the LLM and its logs.
| Pitfall | Fix |
| --- | --- |
| False positives (John Major → PERSON) | Add domain dictionaries; raise confidence threshold for names. |
| Hash collision risk | Use SHA-256 plus a tenant salt held in KMS. |
| Placeholder leakage in UI | Run a Vue/React sanitiser so the decrypted value is shown only to the authorised user. |
| Cost spike in KMS decrypt | Batch decrypt per response; cache for 5 min. |
| NER drift with medical jargon | Fine-tune the spaCy model on 1 K annotated domain sentences (takes < 2 hrs). |
| Sprint | Milestone |
| --- | --- |
| 1 | Dockerised Presidio, redaction in the ingest Lambda |
| 2 | Implement placeholder scheme & encrypted metadata |
| 3 | Wrap LLM calls with the rehydrate step |
| 4 | Add k6 latency budget + ComplyLogs PII scanner in CI |
| 5 | Update the DPIA and refresh privacy policy copy |