If you've built conversational AI systems, you've hit the feedback problem. Users correct your AI all the time: "No, that's wrong," "Actually, I moved," "You got that mixed up." But most systems can't tell the difference between a correction and an update.
So they do one of two things:
- Ignore corrections → AI repeats the same mistakes
- Require explicit feedback → Users have to click buttons, breaking conversational flow
Neither works. Users want to correct AI systems naturally, in conversation. They don't want to stop and click a "this is wrong" button.
Sentinel solves this. It's a lightweight LLM classifier that sits between user input and your database, detecting corrections automatically—no buttons needed.
The Problem: Correction vs. Update
Consider these two user inputs:
Example 1: Correction
User: "No, I'm vegetarian now. You got that wrong."
System should: Decay trust in "user loves steak" memory
Example 2: Update
User: "I just switched to a vegetarian diet. Excited to try new recipes!"
System should: Add new memory, keep "user loves steak" at full trust (it was true in the past)
Both statements contain the same factual change: user is now vegetarian. But the intent is different:
- Example 1 is a correction—user is rebuking the system for holding wrong information
- Example 2 is an update—user is sharing a life change without implying prior knowledge was wrong
Pattern matching can't distinguish these. Regex can't detect intent. Keyword matching fails. You need semantic understanding.
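A quick sketch makes this concrete. The regex below is a stand-in for any surface-level rule, not part of Sentinel; both sentences mention the same fact, so both match, and the rule learns nothing about intent.

# Keyword matching: same match, different intent (illustrative only)
import re

diet_change = re.compile(r"\bvegetarian\b", re.IGNORECASE)

correction = "No, I'm vegetarian now. You got that wrong."
update = "I just switched to a vegetarian diet. Excited to try new recipes!"

print(bool(diet_change.search(correction)))  # True
print(bool(diet_change.search(update)))      # True, yet there is no rebuke here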
The Detection Problem
Current systems can't distinguish correctional intent (rebuke) from cooperative updates (informational). They either decay trust on everything or nothing. Neither is correct.
The Solution: Pre-Ingestion Classification
Sentinel is a stateless LLM binary classifier that runs before any database writes. It analyzes user input and retrieved context to determine intent.
Key Design Principles
- No persistent memory between calls. Each classification is independent, which keeps behavior deterministic and reproducible.
- Uses temperature=0.0 for consistent classification. Same input → same output, no randomness.
- Runs BEFORE database writes, so it can prevent wrong information from being stored, not just fix it after.
- Triggers action only when confidence >= 0.7. Ambiguous cases default to NO FLAG. Conservative by design.
- Flags only memories that were actually retrieved and shown to the user. It can't decay trust in memories the user never saw.
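As a rough sketch of how these principles fit together in code (the build_sentinel_prompt helper and the injected llm_call callable are assumptions for illustration, not Sentinel's actual API):

# Stateless classification sketch: threshold gating plus retrieved-only targeting
import json

CONFIDENCE_THRESHOLD = 0.7

def classify_turn(user_input, retrieved_memories, llm_call):
    """Classify one turn in isolation; nothing is carried over between calls."""
    candidate_ids = {m["id"] for m in retrieved_memories}
    prompt = build_sentinel_prompt(user_input, retrieved_memories)  # hypothetical helper
    raw = llm_call(prompt, temperature=0.0, max_tokens=200)         # deterministic, short output
    result = json.loads(raw)
    # Conservative by design: below the threshold, default to NO FLAG
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        result["action"] = "none"
        result["targets"] = []
        return result
    # Only flag memories that were actually retrieved and shown to the user
    result["targets"] = [t for t in result.get("targets", []) if t in candidate_ids]
    return result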
How It Works: Intent Classification
Sentinel distinguishes four categories of user input:
Category 1: Correctional Intent (FLAG → Trust Decay)
Definition: User is rebuking or correcting the system's implied knowledge, signaling that the system held or used wrong/outdated information.
Indicators:
- Direct negation: "No, that's not right..."
- Error assertion: "Wrong, actually it's..."
- Contrast with blame: "I don't do that, it's this instead..."
- Corrective tone: User is explicitly correcting system's belief
Correction Examples
"No, I'm vegetarian now. You got that wrong."
"That's incorrect, I moved to Dallas last month."
"You're wrong about my job. I work at Acme Corp now."
→ FLAG: Decay trust in contradicted memories
Category 2: Cooperative Update (NO FLAG → Preserve Trust)
Definition: User is sharing a state change over time without implying prior knowledge was wrong.
Indicators:
- Neutral announcement: "I moved to a new house"
- Positive transition: "I changed to..."
- Informational update: "Just switched to..."
Update Examples
"I just switched to a vegetarian diet. Excited to try new recipes!"
"Update: I moved to Dallas. Loving the new city!"
"I changed jobs recently. Now working at Acme Corp."
→ NO FLAG: Add new memory, preserve existing trust
Category 3: Elaboration (APPEND → No Trust Change)
Definition: User is adding detail or expanding on existing information.
Elaboration Examples
"I also like Python, not just JavaScript."
"In addition to Austin, I've lived in Seattle too."
→ APPEND: Add new memory, no trust modification
Category 4: Noise (IGNORE → No Action)
Definition: Casual chat or non-informational content.
Noise Examples
"Hey, how's it going?"
"Thanks for the help!"
→ IGNORE: No database write, no trust modification
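One way to picture the decision space is a small mapping from category to action; the names and shape below are illustrative, not Sentinel's internal schema:

# Four intent categories and what each one triggers (illustrative)
CATEGORY_ACTIONS = {
    "correction":  {"write_memory": True,  "trust_decay": True},   # FLAG
    "update":      {"write_memory": True,  "trust_decay": False},  # NO FLAG
    "elaboration": {"write_memory": True,  "trust_decay": False},  # APPEND
    "noise":       {"write_memory": False, "trust_decay": False},  # IGNORE
}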
Technical Implementation
The Classification Prompt
Sentinel uses a carefully calibrated prompt that teaches intent distinction, not phrase memorization:
# System Prompt (Sentinel v5 - Enhanced Semantic Disambiguation)
TASK: Detect trust-decay signals in conversation.
CORE MISSION:
Flag memories ONLY when the user is rebuking or correcting
the system's implied knowledge — i.e., signaling that the
system held or used a wrong/outdated fact in a blaming way.
DO NOT flag when the user is cooperatively sharing a life
change, preference shift, or new status without implying
the prior knowledge was erroneous.
Prioritize conversational intent and tone over pure
factual compatibility.
CONSTRAINTS:
- No persistent state or memory of previous turns
- Do not answer user queries or generate responses
- Output ONLY structured JSON decision signals
- When ambiguous or neutral tone: default to NO FLAG.
Trust decay requires clear rebuke signal.
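The classifier's context is then just this system prompt plus the current turn and the retrieved memories. A minimal way to assemble it might look like the sketch below; the message format, field names, and SENTINEL_SYSTEM_PROMPT constant are assumptions.

# Assembling the classifier input (sketch; SENTINEL_SYSTEM_PROMPT holds the prompt above)
def build_sentinel_messages(user_input, retrieved_memories):
    memory_lines = "\n".join(
        f"[{m['id']}] {m['text']}" for m in retrieved_memories
    )
    user_block = (
        f"RETRIEVED MEMORIES:\n{memory_lines}\n\n"
        f"USER INPUT:\n{user_input}"
    )
    return [
        {"role": "system", "content": SENTINEL_SYSTEM_PROMPT},
        {"role": "user", "content": user_block},
    ]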
Structured JSON Output
Sentinel outputs structured control signals, not natural language:
# Example Output
{
  "action": "flag",                    // "flag" | "none"
  "targets": ["mem_001", "mem_002"],   // Memory IDs to flag
  "confidence": 0.85,                  // 0.0-1.0; requires >= 0.7 to trigger
  "reasoning": "User is directly correcting system's belief about dietary preference"
}
Critical: This is a technical control mechanism, not a conversational response. The LLM is used as a binary classifier, not a text generator.
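Because the output is a control signal rather than prose, it pays to validate it before acting on it. A light sketch, with field names following the example above (the dataclass and fail-safe defaults are assumptions, not the shipped parser):

# Parsing the control signal defensively (sketch)
import json
from dataclasses import dataclass, field

@dataclass
class SentinelDecision:
    action: str                          # "flag" | "none"
    targets: list = field(default_factory=list)
    confidence: float = 0.0
    reasoning: str = ""

def parse_decision(raw_output: str) -> SentinelDecision:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Malformed output: fail safe to NO FLAG
        return SentinelDecision(action="none")
    if data.get("action") not in ("flag", "none"):
        return SentinelDecision(action="none")
    return SentinelDecision(
        action=data["action"],
        targets=list(data.get("targets", [])),
        confidence=float(data.get("confidence", 0.0)),
        reasoning=str(data.get("reasoning", "")),
    )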
Integration with SVTD
When Sentinel outputs action: "flag" with confidence >= 0.7, it triggers Surgical Vector Trust Decay (SVTD):
# Flow: Sentinel → SVTD
if (detection_result["action"] == "flag"
        and detection_result["confidence"] >= 0.7):
    # Apply trust decay to target memories
    for memory_id in detection_result["targets"]:
        apply_trust_decay(memory_id)

# Explicit corrections: aggressive decay (×0.01)
# User-flagged: escalating penalty (-0.10, -0.25, -0.35)
# 3rd strike: memory enters "ghost state" (suppressed)
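The decay rules themselves live on the SVTD side. A sketch of what they might look like, using the multipliers and penalties quoted above (the in-memory trust store, strike counter, and suppress_memory helper are assumptions):

# Sketch of SVTD-side decay rules (assumed in-memory trust store)
GHOST_STATE_STRIKES = 3

trust_scores = {}   # memory_id -> trust in [0.0, 1.0]
strike_counts = {}  # memory_id -> number of flags so far

def apply_trust_decay(memory_id, explicit_correction=True):
    trust = trust_scores.get(memory_id, 1.0)
    if explicit_correction:
        # Explicit corrections: aggressive decay (×0.01)
        trust *= 0.01
    else:
        # User-flagged: escalating penalty (-0.10, -0.25, -0.35)
        strikes = strike_counts.get(memory_id, 0)
        penalty = [0.10, 0.25, 0.35][min(strikes, 2)]
        trust = max(0.0, trust - penalty)
    strike_counts[memory_id] = strike_counts.get(memory_id, 0) + 1
    trust_scores[memory_id] = trust
    # 3rd strike: memory enters "ghost state" (suppressed from retrieval)
    if strike_counts[memory_id] >= GHOST_STATE_STRIKES:
        suppress_memory(memory_id)  # hypothetical helper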
Semantic Disambiguation (v5 Enhancement)
Sentinel v5 includes enhanced semantic disambiguation with 7 explicit rules:
- Question/Sequence Reference: Maps Q1, Q2, Q3 to Memory [1], [2], [3]
- Clause Reference: Interprets "this one", "that answer" based on context
- Domain Overlap: Distinguishes family vs work vs personal domains
- Entity Matching: Strong signal (e.g., "BB" → memories with BB in entities)
- Lexical/Semantic Matching: Keyword alignment with memory metadata
- Multiple Corrections Handling: Parses and maps multiple corrections in one message
- Ambiguity Handling: Returns empty list if ambiguous (safety > aggressiveness)
Core Rule: No decay without explicit, defensible targeting. If the system cannot confidently map a correction to specific memories, it stages the correction intent and waits for clarification on the next turn.
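In code, the safety-over-aggressiveness rule might reduce to something like this sketch (the staging store and resolver are assumptions about how the "wait for clarification" step could be wired up):

# Sketch: no decay without a defensible target
pending_corrections = {}  # session_id -> staged correction intent

def resolve_targets(session_id, decision, retrieved_memories):
    """Return memory IDs to decay, or stage the intent if the mapping is ambiguous."""
    retrieved_ids = {m["id"] for m in retrieved_memories}
    targets = [t for t in decision.get("targets", []) if t in retrieved_ids]
    if decision.get("action") == "flag" and not targets:
        # Correction detected but no defensible mapping:
        # stage it and ask for clarification on the next turn
        pending_corrections[session_id] = decision
        return []
    return targets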
Why This Matters
Most AI systems require explicit feedback:
| Approach | User Experience | Accuracy |
|---|---|---|
| Explicit Feedback Buttons | ❌ Breaks conversation flow | ✅ High accuracy |
| Pattern Matching | ✅ Natural conversation | ❌ Low accuracy (can't detect intent) |
| Sentinel (LLM Classifier) | ✅ Natural conversation | ✅ High accuracy (semantic understanding) |
Sentinel gives you the best of both worlds: natural conversation with accurate correction detection.
Cost and Performance
Sentinel is designed to be cheap and fast:
- Dedicated model: Uses a smaller, cheaper LLM optimized for classification (not your main conversational model)
- Small context: Only analyzes user input + retrieved memory IDs (not full conversation history)
- Short output: Just JSON control signals (max 200 tokens)
- Temperature 0.0: Deterministic, cacheable results
- Stateless: No memory overhead between calls
Typical cost: $0.0001-0.0005 per classification (depending on model). Negligible compared to main LLM calls.
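As a back-of-the-envelope check (the token counts and per-token prices below are placeholders; plug in your own model's pricing):

# Rough per-classification cost estimate (placeholder numbers)
input_tokens = 600       # system prompt + user input + retrieved memory IDs
output_tokens = 200      # JSON control signal cap
price_in_per_1m = 0.15   # $ per 1M input tokens (assumed small-model pricing)
price_out_per_1m = 0.60  # $ per 1M output tokens (assumed)

cost = (input_tokens * price_in_per_1m + output_tokens * price_out_per_1m) / 1_000_000
print(f"${cost:.6f} per classification")  # ~$0.0002 with these assumptions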
Real-World Example
Here's how Sentinel works in practice:
Scenario: User Corrects Location
Turn 1: System retrieves memory: "User lives in Austin"
Turn 2: User says: "No, I moved to Dallas last month. You got that wrong."
Sentinel Analysis:
{
  "action": "flag",
  "targets": ["mem_austin_001"],
  "confidence": 0.92,
  "reasoning": "User is directly rebuking system's belief about location"
}
Result: Trust decay applied to "mem_austin_001". New memory "User lives in Dallas" stored. Next query about location returns Dallas, not Austin.
Scenario: User Shares Update
Turn 1: System retrieves memory: "User lives in Austin"
Turn 2: User says: "Just moved to Dallas! Loving the new city so far."
Sentinel Analysis:
{
  "action": "none",
  "targets": [],
  "confidence": 0.15,
  "reasoning": "User is sharing a life change without rebuking prior knowledge"
}
Result: New memory "User lives in Dallas" stored. "User lives in Austin" remains at full trust (it was true in the past). Both memories can coexist.
The Architecture Layer
Sentinel sits as a pre-ingestion gate in your RAG pipeline:
# Standard RAG Pipeline
User Input → Vector Search → Rank → Return
# With Sentinel
User Input → Vector Search → Sentinel Classifier →
Decision: Flag or None → Trust Decay (if flagged) →
Database Write → Return
It doesn't replace your retrieval—it adds a classification layer that makes your system smarter about what to trust.
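Gluing the pieces together, the gate can be expressed as one function in front of the write path. Everything here beyond the decision shape is an assumption about your surrounding pipeline: vector_store, sentinel, and svtd stand in for your own components.

# Sketch: Sentinel as a pre-ingestion gate in a RAG write path
def ingest_turn(user_input, session_id, vector_store, sentinel, svtd):
    # 1. Retrieve candidate memories, exactly as the normal pipeline would
    retrieved = vector_store.search(user_input, top_k=5)

    # 2. Classify intent against what was actually retrieved
    decision = sentinel.classify(user_input, retrieved)  # stateless, temperature 0.0

    # 3. Apply trust decay only on a confident, targeted flag
    if decision["action"] == "flag" and decision["confidence"] >= 0.7:
        for memory_id in decision["targets"]:
            svtd.apply_trust_decay(memory_id)

    # 4. Write the new information (filtering out noise turns is omitted from this sketch)
    vector_store.write_memory(user_input, session_id)

    return retrieved, decision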
Sentinel Available in Production
Pre-ingestion correction detection is live. Drop-in integration with your existing RAG pipeline. No buttons, no flags—just natural conversation.
What You Get
With Sentinel, your AI systems:
✅ Learn from natural corrections. No explicit feedback buttons. Users correct the system in conversation, and it just works.
✅ Distinguish corrections from updates. Understands intent, not just facts. Preserves trust in historical information when appropriate.
✅ Target only relevant memories. Only flags memories that were actually retrieved and shown to the user. Can't decay trust in memories the user never saw.
✅ Conservative by default. Requires high confidence (>= 0.7) to trigger action. Ambiguous cases default to NO FLAG. Prevents false positives.
✅ Cheap and fast. Dedicated classification model. Small context, short output. Negligible cost compared to main LLM.
This is the missing piece. Not better retrieval—better understanding of what users mean.