December 20, 2025 / admin

Push a user’s e-mail or SSN into a prompt and you’ve just minted a privacy time-bomb. This deep dive shows how our Micro-GCC squads stop that from happening with a three-layer defence:

➊ open-source PII redaction before vector storage,
➋ AES-GCM–encrypted embeddings, and
➌ placeholder “rehydration” at generation time.

We benchmarked spaCy + Presidio, Amazon Comprehend, and GPT-4o for precision/recall, then wired the winner into a RAG stack. Copy-paste Docker Compose and Lambda snippets included.

Why Prompts Leak PII 

LLM providers store prompts in logs for retraining or debugging. If a user’s phone number lands there, every copy of the model—or your service provider’s logs—now holds regulated data. Regulators don’t care that “it’s just context.” They care that:

  • GDPR Art. 4(1) defines personal data as any information relating to an identified or identifiable person.
  • CCPA/DPDP fines apply to sharing without consent, and logs count as sharing.
  • Deleting a prompt from a vector DB is easy; deleting it from a vendor’s pretraining buckets is near-impossible.

Objective: ensure raw PII never crosses the “prompt boundary”; only anonymised tokens do.

Three-Layer Defence Overview 

  1. Ingest Redaction – Replace PII with deterministic placeholders before embedding.
  2. Encrypted Embeddings – Store vectors in pgvector/Pinecone with AES-GCM-sealed metadata.
  3. Generation Rehydration – When retrieval returns a placeholder, re-insert the real value after the LLM call.

Think of it like a data diode: user data only flows one way—into your secure DB, never into the LLM.
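The one-way flow can be sketched end to end. This is a minimal illustration, not the production code: `ingest`, `generate`, and `rehydrate` are stand-ins for the three layers described above, and the vault is a plain dict where the real system uses encrypted metadata.

```python
def ingest(raw_doc: str, vault: dict) -> str:
    """Layer 1: swap PII for a deterministic token; stash the real value."""
    token = "<EMAIL::ab12cd::1>"                    # placeholder token
    vault[token] = "user@example.com"               # encrypted at rest in practice
    return raw_doc.replace("user@example.com", token)

def generate(redacted_doc: str) -> str:
    """Layers 2–3 boundary: the LLM only ever sees tokens."""
    return f"Reply to {redacted_doc}"               # stand-in for the LLM call

def rehydrate(llm_output: str, vault: dict) -> str:
    """Layer 3: re-insert the real values after the model call."""
    for token, value in vault.items():
        llm_output = llm_output.replace(token, value)
    return llm_output

vault: dict = {}
doc = ingest("contact user@example.com", vault)     # PII never leaves here raw
answer = rehydrate(generate(doc), vault)
```

The diode property falls out of the structure: only `ingest` ever touches the raw value, and it writes it to the vault rather than to the prompt.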

PII Redaction Engines Benchmarked 

Engine            | Approach      | Precision | Recall | Latency (ms/chunk) | Cost per 1 M words
spaCy + Presidio  | NER + regex   | 0.91      | 0.93   | 28                 | $0 (self-host)
Amazon Comprehend | API NER       | 0.94      | 0.90   | 140                | $1.00
GPT-4o            | LLM zero-shot | 0.89      | 0.92   | 500                | $3.40

Dataset: HIPAA-mini + Enron-PII blend, 50 K sentences.
Pick: spaCy + Presidio—best recall under the latency and cost budget.
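For reference, the precision/recall figures come down to span-level counting. The scoring sketch below is illustrative (the post doesn’t specify its exact span-matching rules); each span is an (entity-type, start, end) triple:

```python
def score(predicted: set, gold: set) -> tuple:
    """Span-level precision/recall over (entity_type, start, end) triples."""
    tp = len(predicted & gold)      # spans found and correct
    fp = len(predicted - gold)      # spans found but not in the gold set
    fn = len(gold - predicted)      # gold spans the engine missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. an engine finds 2 of 3 gold spans plus one false alarm:
p, r = score({("EMAIL", 0, 16), ("PHONE", 20, 32), ("NAME", 40, 50)},
             {("EMAIL", 0, 16), ("PHONE", 20, 32), ("SSN", 60, 71)})
# precision = recall = 2/3 in this toy case
```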

Docker Compose Snippet

```yaml
services:
  redact:
    image: mcr.microsoft.com/presidio/analyzer
    ports: ["3005:3000"]
  anonym:
    image: mcr.microsoft.com/presidio/anonymizer
    ports: ["3006:3000"]
```

Deterministic Placeholder Strategy 

We use the Iceberg token scheme:

```
<EMAIL::sha256(email@example.com)::1>
<PHONE::sha256(+15551234567)::1>
```

Why deterministic?
– Same user string → same token → better vector clustering.
– The hash acts as a salted pseudo-ID; an attacker can’t reverse it without the raw value and salt.
– The ::1 suffix is a version marker for future re-hash migrations.
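A minimal sketch of the token scheme, assuming a per-tenant salt (the salt is the collision/reversal mitigation noted in the pitfalls table below; the 12-hex-char truncation is an illustrative choice, not part of the spec):

```python
import hashlib

def make_token(entity_type: str, raw: str, tenant_salt: bytes, version: int = 1) -> str:
    # Same (salt, raw) pair always yields the same digest, so repeated
    # mentions of one e-mail cluster to a single token.
    digest = hashlib.sha256(tenant_salt + raw.encode("utf-8")).hexdigest()[:12]
    return f"<{entity_type}::{digest}::{version}>"

tok = make_token("EMAIL", "email@example.com", b"tenant123-salt")
```

Because the salt lives server-side (in KMS, per the storage section), two tenants hashing the same e-mail get unrelated tokens.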

Implementation (Python):

```python
import hashlib

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def email_token(value: str) -> str:
    # Deterministic placeholder: sha256 of the matched value, ::1 version suffix.
    return "<EMAIL::" + hashlib.sha256(value.encode()).hexdigest()[:12] + "::1>"

def redact(text: str) -> str:
    results = analyzer.analyze(text=text, language="en")
    redacted = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        # The "custom" operator derives the token from the matched value;
        # a static "replace" operator cannot hash per match.
        operators={"EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": email_token})},
    )
    return redacted.text
```

Encrypted Embedding Storage 

Metadata JSON stored alongside vector:

```json
{
  "token": "<EMAIL::ab12cd::1>",
  "enc_payload": "Base64:AESGCM(plaintext=email@example.com)",
  "iv": "random_iv",
  "aad": "tenant123"
}
```

  • AES-GCM keys live in KMS; client retrieves plaintext only after user consent.
  • Vector store sees only tokens and ciphertext—useless to attackers.
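The seal/unseal round trip can be sketched with the `cryptography` library’s AES-GCM primitive. This is a sketch under stated assumptions: the key here is generated locally, where the real system would fetch a data key from KMS, and field names mirror the metadata JSON above.

```python
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def seal(plaintext: str, key: bytes, tenant_id: str) -> dict:
    """Seal a raw PII value for vector-store metadata.
    Binding tenant_id as AAD means ciphertext lifted into another
    tenant's context fails authentication on decrypt."""
    iv = os.urandom(12)  # 96-bit nonce, the recommended size for AES-GCM
    ct = AESGCM(key).encrypt(iv, plaintext.encode(), tenant_id.encode())
    return {
        "enc_payload": base64.b64encode(ct).decode(),
        "iv": base64.b64encode(iv).decode(),
        "aad": tenant_id,
    }

def unseal(rec: dict, key: bytes) -> str:
    ct = base64.b64decode(rec["enc_payload"])
    iv = base64.b64decode(rec["iv"])
    return AESGCM(key).decrypt(iv, ct, rec["aad"].encode()).decode()

key = AESGCM.generate_key(bit_length=256)  # in production: a KMS data key
rec = seal("email@example.com", key, "tenant123")
```

Note the nonce is stored alongside the ciphertext, never reused with the same key.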

A Terraform sample for pgvector plus the KMS IAM role is in the repo.

Retrieval & Rehydration Flow 

  1. User query → embedding model → vector search (tokens only).
  2. Retrieve top-k docs with placeholders.
  3. LLM prompt gets redacted docs → no PII leaks.
  4. Post-LLM: regex replace <EMAIL::hash::1> with decrypted email from KMS.

Latency hit: 7 ms average for decrypt + replace.
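Step 4 can be sketched as a single regex pass. The `lookup` callable here is a stand-in for the KMS-backed decrypt; the pattern follows the `<TYPE::hash::version>` token format defined earlier:

```python
import re

TOKEN_RE = re.compile(r"<([A-Z_]+)::([0-9a-f]+)::(\d+)>")

def rehydrate(llm_output: str, lookup) -> str:
    """Post-LLM pass: swap each placeholder back for its decrypted value.
    Unknown tokens are left untouched rather than guessed at."""
    return TOKEN_RE.sub(lambda m: lookup(m.group(0)), llm_output)

vault = {"<EMAIL::ab12cd::1>": "email@example.com"}
out = rehydrate("Send the report to <EMAIL::ab12cd::1> today.",
                lambda tok: vault.get(tok, tok))
```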

k6 Latency & Cost Benchmark 

Stage          | Baseline p95 | With Redaction | Delta
Embed + Search | 220 ms       | 235 ms         | +15 ms
LLM Generate   | 420 ms       | 420 ms         | 0
Rehydrate      | N/A          | 7 ms           | +7 ms
Total          | 640 ms       | 662 ms         | +22 ms (3.4 %)

Extra cost: $0 for the self-hosted stack; KMS calls add roughly $0.05 per 1 M decrypts.

Case Study — Emotion-Analysis Wellness App 

Problem: EU users submit mood journals containing e-mails and phone numbers. The prototype leaked PII into OpenAI logs.
Fix: spaCy/Presidio redaction + encrypted tokens.
Outcome:

  • OpenAI logs 0 PII (verified by ComplyLogs scanner).
  • GDPR DPIA score improved from “Medium” → “Low.”
  • Latency added 19 ms p95; DAU retention +5 % (users trust privacy banner).

Pitfalls & Pro Tips 

Pitfall                                   | Fix
False positives (“John Major” → PERSON)   | Add domain dictionaries; raise the confidence threshold for names.
Hash collision risk                       | Use SHA-256 plus a tenant salt held in KMS.
Placeholder leakage in UI                 | Run a Vue/React sanitiser so the decrypted value renders only for the authorised user.
Cost spike in KMS decrypt                 | Batch decrypts per response; cache for 5 min.
NER drift with medical jargon             | Fine-tune the spaCy model on 1 K annotated domain sentences (takes < 2 hrs).
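The decrypt-caching fix can be sketched as a small TTL cache in front of KMS. The class and `decrypt_fn` are illustrative stand-ins for the real KMS client; the 5-minute window matches the tip above:

```python
import time

class DecryptCache:
    """Short-TTL cache over a decrypt call to absorb repeated tokens
    within a burst of responses."""

    def __init__(self, decrypt_fn, ttl_seconds: float = 300.0):
        self._decrypt = decrypt_fn
        self._ttl = ttl_seconds
        self._store = {}  # token -> (plaintext, expiry)

    def get(self, token: str) -> str:
        hit = self._store.get(token)
        if hit and hit[1] > time.monotonic():
            return hit[0]                       # fresh: skip the KMS round trip
        value = self._decrypt(token)            # miss or expired: decrypt anew
        self._store[token] = (value, time.monotonic() + self._ttl)
        return value

calls = []
cache = DecryptCache(lambda tok: calls.append(tok) or f"plain:{tok}")
a = cache.get("<EMAIL::ab12cd::1>")
b = cache.get("<EMAIL::ab12cd::1>")  # second hit served from cache
```

Keep the TTL short: a cached plaintext is a widened attack surface, so 5 minutes trades a small risk window for most of the cost saving.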

Adoption Roadmap 

Sprint | Milestone
1      | Dockerised Presidio; redaction in the ingest Lambda
2      | Implement placeholder scheme & encrypted metadata
3      | Wrap LLM calls with the rehydrate step
4      | Add k6 latency budget + ComplyLogs PII scanner in CI
5      | Run the DPIA update & refresh the privacy-policy copy

Take-Home Checklist 

  1. Self-host Presidio; benchmark precision/recall.
  2. Replace PII with deterministic hashed tokens.
  3. Encrypt raw values in metadata with KMS.
  4. Rehydrate after LLM step.
  5. Monitor PII scan on logs weekly.