John

Senior Cloud Engineer & Technical Lead

Protecting Sensitive Data When Using AI Tools: Microsoft Presidio and Alternatives

I wanted to ask ChatGPT to help me review my monthly budget. Nothing complicated, just “here’s what I made, here’s what I spent, where can I cut back?” But the moment I started typing out the prompt, I realized I was about to paste in my bank name, account transactions, and enough detail for anyone to piece together my financial life. I closed the tab.

That interaction stuck with me. AI tools are genuinely useful for personal finance, tax prep questions, insurance comparisons, and even just making sense of a confusing medical bill. But all of those use cases involve exactly the kind of data you don’t want sitting on someone else’s servers: names, account numbers, SSNs, addresses, dollar amounts tied to real people.

So I started looking into ways to scrub that data before it ever leaves my machine. Strip out the sensitive stuff, let the LLM do its thing on sanitized input, then map the real values back into the response on my end. Turns out there’s a well-established pattern for this, and Microsoft Presidio is the open-source tool that keeps coming up. After setting it up, I get why.

The Proxy Pattern: Keep Your Data Local

The idea is simple. Instead of sending raw text straight to an LLM, you run it through a detection and anonymization layer first. The AI only ever sees sanitized data. Once you get the response back, you optionally swap the real values back in.

flowchart LR
    subgraph Input["Your Prompt"]
        A["Raw Text with PII<br/>(names, accounts, SSN)"]
    end
    subgraph Presidio["Presidio Layer (Local)"]
        B[Analyzer - Detect PII]
        C[Anonymizer - Mask/Replace]
        B --> C
    end
    subgraph LLM["AI Provider"]
        D[LLM Processing]
    end
    subgraph Output["Your Machine"]
        E[Deanonymizer - Restore]
        F[Clean Response]
        E --> F
    end
    A --> B
    C --> D
    D --> E

The important part: Presidio runs locally. Your real data never leaves your machine. The LLM only sees placeholder values or realistic fakes.
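The whole round trip fits in a few lines of plain Python. Here's a toy sketch with a single regex standing in for Presidio's detection and a stub standing in for the LLM call (all names here are illustrative, not Presidio's API):

```python
import re

# Toy stand-in for Presidio: detect SSNs only, with a plain regex
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def protect(text: str) -> tuple[str, dict[str, str]]:
    """Replace each SSN with a placeholder and remember the mapping locally."""
    mapping: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        token = f"<US_SSN_{len(mapping)}>"
        mapping[token] = match.group(0)
        return token

    return SSN_RE.sub(_swap, text), mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap the real values back into the LLM's response."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

safe, mapping = protect("My SSN is 123-45-6789, please double-check it.")
print(safe)                    # what the LLM would see
print(restore(safe, mapping))  # the original, recovered locally
```

The mapping dict never leaves your machine, which is the entire point of the pattern.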

Microsoft Presidio: The Open-Source Foundation

Presidio is an open-source framework from Microsoft built for detecting and anonymizing PII. It has two main components:

  • presidio-analyzer: Detects PII using NLP, regex patterns, Named Entity Recognition, and context-aware detection
  • presidio-anonymizer: Transforms detected PII through various anonymization strategies

It supports over 50 entity types out of the box: credit card numbers, social security numbers, names, phone numbers, email addresses, bank account numbers, and more. It uses spaCy’s NLP models for entity recognition combined with pattern matching for structured data.
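The split matters: structured entities like SSNs and phone numbers are mostly caught by patterns, while names need the NLP model. A rough sketch of the pattern side, using simplified regexes of my own rather than Presidio's actual recognizers:

```python
import re

# Simplified stand-ins for Presidio's pattern recognizers (not the real ones)
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs - patterns only, no NLP."""
    hits = []
    for entity, pattern in PATTERNS.items():
        hits.extend((entity, m.group(0)) for m in pattern.finditer(text))
    return hits

print(scan("Call (555) 987-6543 or mail jane@example.com; SSN 123-45-6789."))
```

Patterns alone can't catch "John Smith", which is why Presidio layers spaCy's NER on top of this kind of matching.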

Basic Example: Scrubbing a Finance Prompt

Here’s what it looks like in practice. Say you want to ask an LLM to help you analyze spending:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Initialize the engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Your prompt before sending to an LLM
text = """
Here's my situation. I'm John Smith, and I bank with Chase.
My checking account ends in 4521 and I have a credit card
ending in 8834. Last month I earned $8,200 after taxes.
My SSN is 123-45-6789. Can you help me build a monthly
budget and figure out where I'm overspending?
"""

# Detect PII entities
results = analyzer.analyze(
    text=text,
    entities=[
        "PERSON", "US_SSN", "CREDIT_CARD",
        "PHONE_NUMBER", "EMAIL_ADDRESS", "US_BANK_NUMBER"
    ],
    language="en"
)

# Print what was found
print("Detected PII:")
for result in results:
    print(f"  {result.entity_type}: {text[result.start:result.end]} "
          f"(confidence: {result.score:.2f})")

# Anonymize before sending anywhere
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results
)

print(f"\nSafe to send:\n{anonymized.text}")

Output:

Detected PII:
  PERSON: John Smith (confidence: 0.85)
  US_SSN: 123-45-6789 (confidence: 0.85)

Safe to send:
Here's my situation. I'm <PERSON>, and I bank with Chase.
My checking account ends in 4521 and I have a credit card
ending in 8834. Last month I earned $8,200 after taxes.
My SSN is <US_SSN>. Can you help me build a monthly
budget and figure out where I'm overspending?

The LLM can still give you solid budgeting advice. It doesn’t need your real name or SSN to do that.

Anonymization Strategies

Presidio gives you options for how to handle detected PII:

from presidio_anonymizer.entities import OperatorConfig

# AES key for the reversible "encrypt" operator (128/192/256-bit)
encryption_key = "WmZq4t7w!z%C&F)J"  # example only - generate and store your own

operators = {
    # Replace with a generic label
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),

    # Mask with asterisks
    "EMAIL_ADDRESS": OperatorConfig("mask", {
        "masking_char": "*",
        "chars_to_mask": 12,
        "from_end": False
    }),

    # Hash the value (one-way, irreversible)
    "US_SSN": OperatorConfig("hash", {"hash_type": "sha256"}),

    # Remove entirely
    "PHONE_NUMBER": OperatorConfig("redact", {}),

    # Encrypt (reversible with key)
    "CREDIT_CARD": OperatorConfig("encrypt", {"key": encryption_key}),
}

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators=operators
)

For personal use, replace and redact are the simplest options. If you want to be able to restore values in the response, encrypt keeps that door open.
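The reversible/irreversible distinction is worth internalizing before picking an operator. A quick pure-Python illustration (hashlib stands in for Presidio's hash operator; a locally held mapping plays the role that encrypt's key does):

```python
import hashlib

value = "123-45-6789"

# Irreversible: same input always yields the same digest, but it can't be undone
digest = hashlib.sha256(value.encode()).hexdigest()
print(f"hashed SSN: {digest[:12]}...")

# Reversible: a locally held mapping (or encryption key) lets you restore later
mapping = {"<US_SSN>": value}
llm_response = "Never share your SSN <US_SSN> over email."
restored = llm_response.replace("<US_SSN>", mapping["<US_SSN>"])
print(restored)
```

Hashing still lets you correlate repeated occurrences of the same value across a conversation, which replace and redact do not.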

LangChain Integration: Reversible Anonymization

The real power comes when you integrate Presidio with LangChain for reversible anonymization. This uses the Faker library to replace PII with realistic fake values. Instead of <PERSON>, the LLM sees “Michael Brown” - a fake name that produces more natural responses. After you get the response, the original values get swapped back in.

from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Initialize the reversible anonymizer
anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    faker_seed=42
)

# What you actually want to ask
original_text = """
I need help reviewing my finances.
Name: Sarah Johnson
Monthly income: $7,500 after taxes
Rent: $2,100/month
Car payment: $485/month to Capital One Auto (account ending 7823)
Credit card balance: $3,200 on a Chase Sapphire (card ending 9941)
Email for statements: sarah.j@gmail.com
Phone: (555) 987-6543

Am I spending too much on housing relative to income?
What should my emergency fund target be?
"""

# Anonymize - Faker swaps in realistic fake values
anonymized_text = anonymizer.anonymize(original_text)
print(f"What the LLM sees:\n{anonymized_text}")

# Send sanitized version to LLM
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_template(
    "Provide personal finance advice based on this summary:\n\n{details}"
)

chain = prompt | llm
response = chain.invoke({"details": anonymized_text})

# Restore original values in the response
final_response = anonymizer.deanonymize(response.content)
print(f"\nAdvice with your real details restored:\n{final_response}")

The Faker-based approach matters. When the LLM sees a real-looking name instead of a <PERSON> token, the response reads naturally. “Sarah, based on your income…” instead of “PERSON, based on your income…”
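The mechanics are simple to sketch without the real libraries. Here a seeded picker stands in for Faker (mirroring the reproducibility that faker_seed=42 gives you); the name pool and helper are my own, not LangChain's API:

```python
import random

# Stand-in for Faker: a seeded picker, so runs are reproducible
FAKE_NAMES = ["Michael Brown", "Emily Davis", "David Wilson", "Laura Martin"]

def fake_out(real_names: list[str], seed: int = 42) -> dict[str, str]:
    """Map each real name to a stable, realistic-looking fake one."""
    rng = random.Random(seed)
    pool = FAKE_NAMES.copy()
    rng.shuffle(pool)
    return {real: pool[i % len(pool)] for i, real in enumerate(real_names)}

mapping = fake_out(["Sarah Johnson"])
prompt = "Sarah Johnson earns $7,500/month after taxes."
for real, fake in mapping.items():
    prompt = prompt.replace(real, fake)
print(prompt)  # the LLM sees a plausible name, not a <PERSON> token

# Deanonymize: invert the mapping on the (pretend) LLM response
response = f"{prompt} Aim for a 3-6 month emergency fund."
for real, fake in mapping.items():
    response = response.replace(fake, real)
print(response)  # advice reads naturally, with the real name restored
```

The real PresidioReversibleAnonymizer does the same bookkeeping for every supported entity type, not just names.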

flowchart TB
    subgraph Input["Your Finance Question"]
        A["Sarah Johnson<br/>sarah.j@gmail.com<br/>Chase Sapphire ending 9941"]
    end
    subgraph Anon["Anonymization (Local)"]
        B["Presidio Analyzer<br/>Detects: PERSON, EMAIL, etc."]
        C["Faker Generator<br/>Creates realistic fakes"]
        D["Mapping Store<br/>Sarah → Michael<br/>sarah.j@... → fake@..."]
    end
    subgraph LLM["AI Provider"]
        E["Sees: Michael Brown<br/>michael.b@example.com"]
        F["Generates finance advice<br/>using fake identities"]
    end
    subgraph Restore["Deanonymization (Local)"]
        G["Lookup mapping"]
        H["Replace fakes with originals"]
        I["Personalized advice with<br/>your real details restored"]
    end
    A --> B --> C --> D
    D --> E --> F
    F --> G --> H --> I

Installation

pip install langchain langchain-experimental langchain-openai \
    presidio-analyzer presidio-anonymizer spacy Faker

# Download spaCy model for NLP
python -m spacy download en_core_web_lg

Alternatives to Presidio

Presidio is solid, but it’s not the only option. Depending on what you need, these might be better fits.

Private AI

Private AI is a commercial solution with 99%+ claimed accuracy across 50+ entity types and 47+ languages. Their PrivateGPT product strips PII from prompts before they hit OpenAI, then “rehydrates” responses with the original data. Same concept as the LangChain reversible anonymizer, but packaged as a managed service.

Best for: People who want this to just work without writing Python.

Protecto

Protecto addresses something that trips up simpler tools: the “response alteration” problem. When an LLM rephrases your input in its response, a naive find-and-replace to restore values can fail. Protecto handles cases where the AI restructures the text.

Best for: Use cases where the LLM heavily modifies the input text in its response.
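The failure mode Protecto addresses is easy to reproduce. If the model inflects, splits, or reformats a name, exact-match restoration silently finds nothing - a toy demonstration:

```python
# Fake name "Michael Brown" stands in for the real "Sarah Johnson"
mapping = {"Michael Brown": "Sarah Johnson"}

# Case 1: the model echoes the name verbatim - restoration works
echoed = "Michael Brown should target a 6-month emergency fund."
print(echoed.replace("Michael Brown", mapping["Michael Brown"]))

# Case 2: the model inflects and splits the name - exact match finds nothing
altered = "Michael's budget looks tight, and Mr. Brown should cut dining out."
restored = altered
for fake, real in mapping.items():
    restored = restored.replace(fake, real)
print(restored == altered)  # True: nothing was restored
```

Handling this robustly requires fuzzy or semantic matching rather than string replacement, which is what the commercial tools sell.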

John Snow Labs (Healthcare)

If you’re trying to scrub medical records or clinical notes before asking an LLM about your health situation, John Snow Labs is purpose-built for that. They report a 0.986 F1 score on clinical note de-identification, compared to general-purpose tools that score around 0.41 F1 on the same medical datasets.

That gap is massive. General PII tools miss things like MRN numbers, physician names in clinical context, and facility-specific identifiers. If your use case involves medical data, general-purpose tools aren’t enough.

Best for: Medical records, health-related AI queries, HIPAA-sensitive data.

Strac

Strac provides a DLP (Data Loss Prevention) layer for ChatGPT with real-time monitoring. It watches what you’re about to send and flags or blocks sensitive data before it leaves.

Best for: A browser-level safety net if you don’t want to build a pipeline.

LLM-Based Approaches

Tools like GLiNER and OpenPipe’s PII-Redact use language models themselves for zero-shot PII detection. They’re flexible and can catch non-standard PII types, and GLiNER is compact enough to run locally. For hosted variants, though, the obvious downside applies: you’re sending data to an LLM to check if it’s sensitive enough to not send to an LLM.

Best for: Experimental, or when you need to detect unusual PII types that rule-based systems miss.

Comparison Matrix

| Solution | Type | Strengths | Best For |
|----------|------|-----------|----------|
| Presidio | Open Source | Free, flexible, runs locally | DIY pipeline, full control |
| Private AI | Commercial | High accuracy, managed | No-code, just works |
| Protecto | Commercial | Handles LLM text alteration | Complex LLM interactions |
| John Snow Labs | Commercial | Healthcare-specific, 0.986 F1 | Medical data, HIPAA |
| Strac | Commercial | Browser-level DLP | ChatGPT safety net |
| GLiNER/OpenPipe | Open Source | Zero-shot, flexible | Experimental, novel PII types |

Building a Reusable Pipeline

Here’s a more complete example you can drop into a personal project. A wrapper class that handles detection, anonymization, and restoration:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
import hashlib
from typing import Dict, List
from dataclasses import dataclass


@dataclass
class PIIProtectionConfig:
    """Configuration for PII protection behavior."""
    entities_to_detect: List[str]
    anonymization_strategy: str = "replace"  # replace, mask, hash, redact
    score_threshold: float = 0.5
    preserve_mapping: bool = True


class PIIProtector:
    """Wrapper for Presidio-based PII protection."""

    def __init__(self, config: PIIProtectionConfig):
        self.config = config
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        self.mapping: Dict[str, str] = {}

    def protect(self, text: str) -> str:
        """Detect and anonymize PII in text."""
        results = self.analyzer.analyze(
            text=text,
            entities=self.config.entities_to_detect,
            language="en",
            score_threshold=self.config.score_threshold
        )

        results = sorted(results, key=lambda x: x.start, reverse=True)

        anonymized_text = text
        for result in results:
            original = text[result.start:result.end]
            replacement = self._get_replacement(original, result.entity_type)

            if self.config.preserve_mapping:
                self.mapping[replacement] = original

            anonymized_text = (
                anonymized_text[:result.start] +
                replacement +
                anonymized_text[result.end:]
            )

        return anonymized_text

    def _get_replacement(self, original: str, entity_type: str) -> str:
        """Generate replacement value based on strategy.

        Note: "replace", "mask", and "redact" emit the same placeholder for
        every value of a given entity type, so the mapping only keeps the
        last one. Use "hash" when each distinct value needs its own
        restorable token.
        """
        if self.config.anonymization_strategy == "hash":
            hash_val = hashlib.sha256(original.encode()).hexdigest()[:8]
            return f"[{entity_type}_{hash_val}]"
        elif self.config.anonymization_strategy == "mask":
            return f"[{entity_type}]"
        elif self.config.anonymization_strategy == "redact":
            return "[REDACTED]"
        else:
            return f"<{entity_type}>"

    def restore(self, text: str) -> str:
        """Restore original values from mapping."""
        restored = text
        for replacement, original in self.mapping.items():
            restored = restored.replace(replacement, original)
        return restored


# Personal finance use case
config = PIIProtectionConfig(
    entities_to_detect=[
        "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
        "US_SSN", "CREDIT_CARD", "US_BANK_NUMBER",
        "IP_ADDRESS"
    ],
    anonymization_strategy="replace",
    score_threshold=0.6,
    preserve_mapping=True
)

protector = PIIProtector(config)

# Scrub before asking for financial advice
prompt = """My name is Jane Doe, SSN 987-65-4321.
I have $15k in a savings account at Wells Fargo.
Should I move it to a high-yield savings account?"""

safe_prompt = protector.protect(prompt)
print(f"Safe to send:\n{safe_prompt}")

# After getting the LLM response, restore your details
llm_response = "Based on your situation, <PERSON>, moving funds to a HYSA is advisable..."
restored = protector.restore(llm_response)
print(f"\nRestored:\n{restored}")

What Presidio Won’t Catch

No PII detection is perfect. A few things to keep in mind:

  • Context-dependent data: Dollar amounts, dates, and generic numbers aren’t PII on their own, but combined they can identify you. Presidio won’t flag “$8,200 monthly income” as sensitive.
  • Indirect identifiers: “I live in the blue house on the corner of Oak and Main” is identifying but won’t trigger any detector.
  • Domain-specific patterns: Internal account formats, loyalty program numbers, or crypto wallet addresses may not be recognized without custom recognizers.

The practical takeaway: Presidio catches the obvious stuff (names, SSNs, emails, phone numbers, credit cards). For everything else, do a manual pass on your prompt before sending. Think of it as a safety net, not a guarantee.
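For the domain-specific gap, Presidio lets you register custom recognizers, and the core of one is just a pattern. Here's the matching logic in plain Python; the loyalty-number format is made up for illustration, and in a real pipeline you'd wrap this pattern in a Presidio PatternRecognizer:

```python
import re

# Hypothetical loyalty-number format ("LP-" + 10 digits) - not a real scheme
LOYALTY_RE = re.compile(r"\bLP-\d{10}\b")

def find_loyalty_numbers(text: str) -> list[str]:
    """The detection core you would register as a custom recognizer."""
    return LOYALTY_RE.findall(text)

print(find_loyalty_numbers("My rewards ID is LP-0042137765 and I have 12,430 points."))
```

The same approach covers internal account formats and crypto wallet addresses: one regex plus a context word list per entity type.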

flowchart TB
    subgraph Layers["Practical Privacy Layers"]
        L1["Manual Review<br/>Skim your prompt before sending"]
        L2["Presidio / PII Detection<br/>Automated scrubbing"]
        L3["Data Minimization<br/>Only include what the LLM needs"]
        L4["Provider Choice<br/>Understand data retention policies"]
    end
    L1 --> L2 --> L3 --> L4

Key Learnings

  • The proxy pattern works well for personal use: Run Presidio locally, scrub your data, send sanitized text to the LLM, restore on your end
  • Presidio is free and runs entirely on your machine: No data leaves your environment during the detection and anonymization step
  • Faker-based replacement produces better LLM responses: Realistic fake names and values give more natural output than generic <PERSON> tokens
  • No tool catches everything: Dollar amounts, indirect identifiers, and context-dependent data slip through - always do a quick manual check
  • Healthcare data needs specialized tools: General PII detectors score roughly 0.41 F1 on clinical text vs. 0.986 for purpose-built tools like John Snow Labs
  • Commercial options exist if you don’t want to code: Private AI and Strac provide managed solutions for people who just want a safety net
  • The “response alteration” problem is real: LLMs rephrase text, which can break naive find-and-replace restoration - Faker-based reversible anonymization handles this better
  • Data minimization is half the battle: Before reaching for any tool, ask yourself if the LLM actually needs that detail to answer your question

I’ve been using this pattern for a few weeks now whenever I want AI help with anything financial or personal. The setup is straightforward - an afternoon to get Presidio and LangChain wired up - and after that it’s just part of the workflow. For quick one-off questions, even just manually swapping out names and account numbers before pasting works. But for anything recurring, having the automation in place means you don’t have to think about it.