From 74dc43406f4df8f30b0dbb56f9820c2ba404fdfb Mon Sep 17 00:00:00 2001
From: writinwaters <93570324+writinwaters@users.noreply.github.com>
Date: Thu, 26 Feb 2026 12:39:58 +0800
Subject: [PATCH] =?UTF-8?q?Docs:=20After=20careful=20consideration,=20the?=
 =?UTF-8?q?=20RAGFlow=20team=20decided=20to=20hold=20o=E2=80=A6=20(#13226)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

…ff publishing this guide.

### What problem does this PR solve?

Removed the failure modes checklist per your request. @JinHai-CN

### Type of change

- [x] Documentation Update
---
 docs/guides/rag_failure_modes_checklist.mdx | 138 --------------------
 1 file changed, 138 deletions(-)
 delete mode 100644 docs/guides/rag_failure_modes_checklist.mdx

diff --git a/docs/guides/rag_failure_modes_checklist.mdx b/docs/guides/rag_failure_modes_checklist.mdx
deleted file mode 100644
index d3560ae106..0000000000
--- a/docs/guides/rag_failure_modes_checklist.mdx
+++ /dev/null
@@ -1,138 +0,0 @@
-| id | title | sidebar_label |
-| ----------------------- | ---------------------------- | --------------------------- |
-| rag-failure-modes-checklist | RAG failure modes checklist | RAG failure modes checklist |
-
-# RAG failure modes checklist
-
-Retrieval-Augmented Generation (RAG) systems rarely “fail” because of a single metric like accuracy or latency.
-In practice, debugging a production RAG application means looking at **patterns of incidents** across the whole
-pipeline, not just one bad answer.
-
-This page gives a small, opinionated checklist of common RAG failure patterns and how they typically show up when
-you inspect runs and evaluations in RAGFlow.
-
-It is inspired by an **MIT-licensed open-source 16-problem RAG failure map** used in several evaluation projects,
-and adapted here to match RAGFlow’s terminology and tooling.
-
----
-
-## Why a failure-modes view?
-
-RAGFlow already gives you traces, evals, and dataset views.
-The checklist below is about **how to read what you see**:
-
-- When an evaluation score is low, what kind of failure is it?
-- When a trace looks noisy, is it a retrieval issue, a prompt issue, or data quality?
-- When metrics are “good on average” but users still complain, what should you look at next?
-
-Thinking in explicit failure modes makes it easier to design better eval datasets, to triage incidents, and to
-communicate problems to the rest of your team.
-
----
-
-## How to use this checklist with RAGFlow
-
-A simple way to use this page in day-to-day work:
-
-1. **Start from a symptom**
-   - Low eval scores on a dataset.
-   - A single bad run reported by a user.
-   - A noisy trace or strange tool behavior.
-
-2. **Find the closest pattern**
-   - Scan the table below for the pattern whose “Typical symptom in RAGFlow” matches what you see.
-   - It does not need to be perfect; pick the closest one first.
-
-3. **Use the “Where to look in RAGFlow” column**
-   - Go to the suggested views (traces, evals, datasets, knowledge base, or system logs).
-   - Confirm or reject the hypothesis.
-
-4. **Record the failure mode**
-   - Tag the run, add a comment, or track it in your own incident system using the pattern ID (P01–P12).
-   - Over time you will see which patterns dominate your incidents.
-
-You can also adapt these IDs into your own internal taxonomy if you already have an incident process.
-
----
-
-## Core failure patterns (P01–P12)
-
-The table below lists 12 reusable patterns.
-Each describes **what goes wrong**, **how it tends to look in RAGFlow**, and **where to investigate**.
-
-| ID | Pattern name | Typical symptom in RAGFlow | Where to look in RAGFlow |
-| --- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
-| P01 | Retrieval hallucination / grounding drift | Answers look fluent and confident, but contradict the retrieved documents or cite facts that are not there. | Trace view (answer vs. retrieved documents), evals with “groundedness” or “factuality” scores |
-| P02 | Chunk boundary or segmentation bug | Key sentences are cut in the middle, or relevant context is split across chunks so no single chunk is useful. | Knowledge base / document preview, chunk metadata, retrieval examples |
-| P03 | Embedding mismatch (semantic vs vector distance) | Top-k results look “close” by vector distance, but human reviewers judge them as off-topic or shallow matches. | Vector search results, embedding configuration, eval datasets that directly check retrieval relevance |
-| P04 | Index skew or staleness | Users see old or missing data even though the source of truth has been updated; evals pass on old snapshots. | Ingestion jobs, index timestamps, dataset versions, deployment timeline |
-| P05 | Query rewriting or router misalignment | Similar user questions get routed to different tools or datasets; some flows never hit the right collection. | Router / tool selection traces, query rewrite logs, routing rules |
-| P06 | Long-chain reasoning drift | Multi-step tasks start correctly but violate earlier constraints in later steps (dates, prices, policies, etc.). | Multi-hop traces, intermediate tool outputs, step-by-step evals |
-| P07 | Tool-call misuse or ungrounded tools | LLM calls tools with wrong arguments, or calls tools when the answer is already in context; wasted latency and quota. | Tool call spans, arguments vs. retrieved context, cost / latency breakdown |
-| P08 | Session memory leak / missing context | Follow-up questions ignore important details from earlier turns, or accidentally reuse stale context from another session. | Conversation history, session identifiers, memory storage / retrieval configuration |
-| P09 | Evaluation blind spots | Evals look “green” but users still report obvious failures; dataset examples are too easy or not representative. | Eval dataset definitions, label guidelines, score distributions vs. real incidents |
-| P10 | Startup ordering / dependency not ready | Newly deployed versions show spikes of 5xx, empty retrievals, or missing models during the first minutes after release. | Deployment logs, health checks, first-run traces after deploy |
-| P11 | Config or secrets drift across environments | The same flow works locally but fails in staging or prod; model names, endpoints, or API keys differ silently. | Environment configs, secret management, environment-specific traces |
-| P12 | Multi-tenant / multi-agent interference | Requests from different tenants or agents interfere with each other’s state, tools, or rate limits. | Tenant IDs and agent IDs in traces, shared resources (indexes, caches, queues) |
-
-You do not need to use all 12.
-It is completely fine to start with 3–5 that match your most common issues, then refine or extend the list.
-
----
-
-## From symptom to pattern: a few examples
-
-Here are three concrete “reading patterns” you can apply inside RAGFlow.
-
-### Example A – Good retrieval, bad answer
-
-- Retrieval evals show high relevance.
-- Traces confirm that the correct document is in the top-k results.
-- The model still answers incorrectly or adds extra facts.
-
-This is usually **P01 – Retrieval hallucination / grounding drift**:
-
-- The retriever does its job, but the answer prompt does not strictly tie the response to the retrieved context.
-- Fixes tend to involve tighter answer prompts, better instructions around quoting sources, or adding explicit groundedness evals.
-
-### Example B – Noisy trace, weak retrieval
-
-- User reports “it answers something, but not what I asked”.
-- Trace shows multiple tool calls and retries.
-- Retrieved chunks are partially related but miss the critical detail.
-
-This often indicates a mix of **P02 – Chunk boundary or segmentation bug** and **P03 – Embedding mismatch**:
-
-- Check how documents were split and whether important sentences are being cut.
-- Check embedding model, normalization, and distance metric.
-- Add a small retrieval-only eval dataset to isolate the problem from answer generation.
-
-### Example C – Everything is green except production
-
-- Automated evals are mostly high.
-- Synthetic test questions pass.
-- Real user traffic still contains surprising failures that evals never catch.
-
-This is classic **P09 – Evaluation blind spots**:
-
-- Your eval set covers only a narrow slice of real queries.
-- Real incidents fall into different patterns (P01–P08, P10–P12) that were never sampled.
-- The fix is to feed real incidents back into new eval datasets and label them with the appropriate failure mode.
-
----
-
-## Extending the checklist for your team
-
-This page is intentionally small. It is meant as a **starting vocabulary**, not a finished ontology.
-
-When you see repeated patterns in your own RAGFlow traces:
-
-- Copy the table and add **team-specific variants** (for example, splitting P03 into separate patterns for different indexes).
-- Attach pattern IDs (P01–P12 or your own) to incidents, tickets, or run tags.
-- Use the distribution of failure modes to prioritize engineering work:
-  - If most incidents are P02/P03, invest in ingestion and indexing.
-  - If most incidents are P01/P06/P07, focus on prompts, tools, and chain design.
-  - If many incidents are P09/P10/P11, improve deployment, configs, and eval datasets.
-
-Over time, this checklist should evolve into **your own RAG incident map**, built on top of RAGFlow’s traces and
-evaluations, and tailored to your stack and users.