From b7eca981d49baf15a95400268c41e81ac25f15e1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?PSBigBig=20=C3=97=20MiniPS?= <psbigbig@onestardao.com>
Date: Wed, 25 Feb 2026 19:35:15 +0800
Subject: [PATCH] docs: add RAG failure modes checklist guide (refs #13138)
 (#13204)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What problem does this PR solve?

This PR adds a new guide: **"RAG failure modes checklist"**.

RAG systems often fail in ways that are not immediately visible from a
single metric like accuracy or latency. In practice, debugging
production RAG applications requires identifying recurring failure
patterns across retrieval, routing, evaluation, and deployment stages.

This guide introduces a structured, pattern-based checklist (P01–P12) to
help users interpret traces, evaluation results, and dataset behavior
within RAGFlow. The goal is to provide a practical way to classify
incidents (e.g., retrieval hallucination, chunking issues, index
staleness, routing misalignment) and reason about minimal structural
fixes rather than ad-hoc prompt changes.

The change is documentation-only and does not modify any code or
configuration.

Refs #13138


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---
 docs/guides/rag_failure_modes_checklist.mdx | 138 ++++++++++++++++++++
 1 file changed, 138 insertions(+)
 create mode 100644 docs/guides/rag_failure_modes_checklist.mdx

diff --git a/docs/guides/rag_failure_modes_checklist.mdx b/docs/guides/rag_failure_modes_checklist.mdx
new file mode 100644
index 0000000000..d3560ae106
--- /dev/null
+++ b/docs/guides/rag_failure_modes_checklist.mdx
@@ -0,0 +1,138 @@
+---
+id: rag-failure-modes-checklist
+title: RAG failure modes checklist
+sidebar_label: RAG failure modes checklist
+---
+
+# RAG failure modes checklist
+
+Failures in Retrieval-Augmented Generation (RAG) systems rarely show up in a single metric such as accuracy or latency.  
+In practice, debugging a production RAG application means looking at **patterns of incidents** across the whole
+pipeline, not just one bad answer.
+
+This page gives a small, opinionated checklist of common RAG failure patterns and how they typically show up when
+you inspect runs and evaluations in RAGFlow.
+
+It is inspired by an **MIT-licensed open-source 16-problem RAG failure map** used in several evaluation projects;
+the 12 patterns here are adapted to match RAGFlow’s terminology and tooling.
+
+---
+
+## Why a failure-modes view?
+
+RAGFlow already gives you traces, evals, and dataset views.  
+The checklist below is about **how to read what you see**:
+
+- When an evaluation score is low, what kind of failure is it?
+- When a trace looks noisy, is it a retrieval issue, a prompt issue, or data quality?
+- When metrics are “good on average” but users still complain, what should you look at next?
+
+Thinking in explicit failure modes makes it easier to design better eval datasets, to triage incidents, and to
+communicate problems to the rest of your team.
+
+---
+
+## How to use this checklist with RAGFlow
+
+A simple way to use this page in day-to-day work:
+
+1. **Start from a symptom**  
+   - Low eval scores on a dataset.  
+   - A single bad run reported by a user.  
+   - A noisy trace or strange tool behavior.
+
+2. **Find the closest pattern**  
+   - Scan the table below for the pattern whose “Typical symptom in RAGFlow” matches what you see.  
+   - It does not need to be perfect; pick the closest one first.
+
+3. **Use the “Where to look in RAGFlow” column**  
+   - Go to the suggested views (traces, evals, datasets, knowledge base, or system logs).  
+   - Confirm or reject the hypothesis.
+
+4. **Record the failure mode**  
+   - Tag the run, add a comment, or track it in your own incident system using the pattern ID (P01–P12).  
+   - Over time you will see which patterns dominate your incidents.
+
+You can also adapt these IDs into your own internal taxonomy if you already have an incident process.
+
+---
+
+## Core failure patterns (P01–P12)
+
+The table below lists 12 reusable patterns.  
+Each describes **what goes wrong**, **how it tends to look in RAGFlow**, and **where to investigate**.
+
+| ID  | Pattern name                                       | Typical symptom in RAGFlow                                                                                  | Where to look in RAGFlow                           |
+| --- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
+| P01 | Retrieval hallucination / grounding drift          | Answers look fluent and confident, but contradict the retrieved documents or cite facts that are not there. | Trace view (answer vs. retrieved documents), evals with “groundedness” or “factuality” scores |
+| P02 | Chunk boundary or segmentation bug                 | Key sentences are cut in the middle, or relevant context is split across chunks so no single chunk is useful. | Knowledge base / document preview, chunk metadata, retrieval examples |
+| P03 | Embedding mismatch (semantic vs. vector distance)  | Top-k results look “close” by vector distance, but human reviewers judge them as off-topic or shallow matches. | Vector search results, embedding configuration, eval datasets that directly check retrieval relevance |
+| P04 | Index skew or staleness                            | Users see old or missing data even though the source of truth has been updated; evals pass on old snapshots. | Ingestion jobs, index timestamps, dataset versions, deployment timeline |
+| P05 | Query rewriting or router misalignment             | Similar user questions get routed to different tools or datasets; some flows never hit the right collection. | Router / tool selection traces, query rewrite logs, routing rules |
+| P06 | Long-chain reasoning drift                         | Multi-step tasks start correctly but violate earlier constraints in later steps (dates, prices, policies, etc.). | Multi-hop traces, intermediate tool outputs, step-by-step evals |
+| P07 | Tool-call misuse or ungrounded tools               | The LLM calls tools with wrong arguments, or calls tools when the answer is already in context, wasting latency and quota. | Tool call spans, arguments vs. retrieved context, cost / latency breakdown |
+| P08 | Session memory leak / missing context              | Follow-up questions ignore important details from earlier turns, or accidentally reuse stale context from another session. | Conversation history, session identifiers, memory storage / retrieval configuration |
+| P09 | Evaluation blind spots                             | Evals look “green” but users still report obvious failures; dataset examples are too easy or not representative. | Eval dataset definitions, label guidelines, score distributions vs. real incidents |
+| P10 | Startup ordering / dependency not ready            | Newly deployed versions show spikes of 5xx errors, empty retrievals, or missing models during the first minutes after a release. | Deployment logs, health checks, first-run traces after deploy |
+| P11 | Config or secrets drift across environments        | The same flow works locally but fails in staging or prod; model names, endpoints, or API keys differ silently. | Environment configs, secret management, environment-specific traces |
+| P12 | Multi-tenant / multi-agent interference            | Requests from different tenants or agents interfere with each other’s state, tools, or rate limits.         | Tenant IDs and agent IDs in traces, shared resources (indexes, caches, queues) |
+
+You do not need to use all 12.  
+It is completely fine to start with 3–5 that match your most common issues, then refine or extend the list.
+
+---
+
+## From symptom to pattern: a few examples
+
+Here are three concrete “reading patterns” you can apply inside RAGFlow.
+
+### Example A – Good retrieval, bad answer
+
+- Retrieval evals show high relevance.  
+- Traces confirm that the correct document is in the top-k results.  
+- The model still answers incorrectly or adds extra facts.
+
+This is usually **P01 – Retrieval hallucination / grounding drift**:
+
+- The retriever does its job, but the answer prompt does not strictly tie the response to the retrieved context.  
+- Fixes tend to involve tighter answer prompts, clearer instructions for quoting sources, or explicit groundedness evals.
+
+### Example B – Noisy trace, weak retrieval
+
+- User reports “it answers something, but not what I asked”.  
+- Trace shows multiple tool calls and retries.  
+- Retrieved chunks are partially related but miss the critical detail.
+
+This often indicates a mix of **P02 – Chunk boundary or segmentation bug** and **P03 – Embedding mismatch**:
+
+- Check how documents were split and whether important sentences are being cut.  
+- Check that queries and the index use the same embedding model, vector normalization, and distance metric.  
+- Add a small retrieval-only eval dataset to isolate the problem from answer generation.
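+
+The embedding and retrieval checks above rest on two standard definitions, sketched below. The notation is an
+assumption made for illustration and is not RAGFlow-specific: `q` is a query embedding, `d` a chunk embedding,
+`R_q` the set of chunks labeled relevant for `q`, and `T_k(q)` the top-k chunks the retriever returns.
+
+```latex
+% Unit-normalized embeddings: cosine similarity, dot product, and Euclidean
+% distance all yield the same ranking of chunks.
+\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert},
+\qquad
+\lVert q \rVert = \lVert d \rVert = 1
+\;\Rightarrow\;
+\cos(q, d) = q \cdot d,
+\quad
+\lVert q - d \rVert^{2} = 2 - 2\,(q \cdot d)
+
+% Retrieval-only metric for one labeled query q (hypothetical label set R_q):
+\mathrm{recall@}k(q) = \frac{\lvert R_q \cap T_k(q) \rvert}{\lvert R_q \rvert}
+```
+
+If stored vectors are not unit-normalized, a dot-product index can rank chunks differently from a cosine index
+built from the same embeddings, so it is worth confirming which combination is actually configured. Averaging
+recall@k over a small labeled query set gives a retrieval-only score that does not depend on answer generation.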
+
+### Example C – Everything is green except production
+
+- Automated evals are mostly high.  
+- Synthetic test questions pass.  
+- Real user traffic still contains surprising failures that evals never catch.
+
+This is a classic case of **P09 – Evaluation blind spots**:
+
+- Your eval set covers only a narrow slice of real queries.  
+- Real incidents fall into different patterns (P01–P08, P10–P12) that were never sampled.  
+- The fix is to feed real incidents back into new eval datasets and label them with the appropriate failure mode.
+
+---
+
+## Extending the checklist for your team
+
+This page is intentionally small. It is meant as a **starting vocabulary**, not a finished ontology.
+
+When you see repeated patterns in your own RAGFlow traces:
+
+- Copy the table and add **team-specific variants** (for example, splitting P03 into separate patterns for different indexes).  
+- Attach pattern IDs (P01–P12 or your own) to incidents, tickets, or run tags.  
+- Use the distribution of failure modes to prioritize engineering work:
+  - If most incidents are P02/P03, invest in ingestion and indexing.  
+  - If most incidents are P01/P06/P07, focus on prompts, tools, and chain design.  
+  - If many incidents are P09/P10/P11, improve eval datasets, deployment, and configs.
+
+Over time, this checklist should evolve into **your own RAG incident map**, built on top of RAGFlow’s traces and
+evaluations, and tailored to your stack and users.