fix(es): downgrade LLM-generated invalid SQL to WARNING in ES sql() (#15409) (#15709)

## Summary

Fixes #15409. The reporter sees alarming ERROR-level stack traces in `ragflow_server.log` on every chat turn against a knowledge base whose spreadsheet has many columns with embedded IDs (e.g. `id-wstc-bios fvt-322-wstc-bios fvt-323`). Simple queries work; complex ones return "No answer" with logs that look like a hard crash.

### What's actually happening

1. The user uploads a wide Excel/CSV. [rag/app/table.py:477-493](rag/app/table.py#L477-L493) turns each header into an ES field with a type suffix, e.g. `id-wstc-bios fvt-322-wstc-bios fvt-323_tks`. This is correct: the parser faithfully encodes the user's column names.
2. The user asks about test case `fvt-085`. The SQL chat path in [api/db/services/dialog_service.py:914 use_sql](api/db/services/dialog_service.py#L914) asks the LLM to write SQL using the field list. The LLM sees the `id-wstc-bios fvt-NNN-wstc-bios fvt-MMM_tks` pattern and pattern-completes a plausible but nonexistent column.
3. Elasticsearch rejects the query with `BadRequestError(400, 'verification_exception')`: `Unknown column [id-wstc-bios fvt-085-wstc-bios fvt-086_tks]`, and suggests the closest valid column.
4. **The recovery path already exists**: `use_sql` catches the exception and re-prompts the LLM with the error text (which contains ES's "did you mean" hint). On a second failure, the caller at [api/db/services/dialog_service.py:626](api/db/services/dialog_service.py#L626) falls back to vector search. The chat does produce an answer; it is just generated from the vector hits instead of SQL.

The only real bug is logging:

- [common/doc_store/es_conn_base.py:399](common/doc_store/es_conn_base.py#L399) catches every exception with `self.logger.exception(...)`, which writes a full traceback at **ERROR** level.
- For LLM-generated SQL this is the hot path, not an exceptional condition. It can fire twice per turn before the fallback runs.
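
The recovery flow in step 4 can be sketched roughly as follows. This is a simplified illustration, not the code in `dialog_service.py`: it folds the retry (which lives in `use_sql`) and the fallback (which lives in its caller) into one hypothetical function, and `generate_sql`, `run_sql`, and `vector_search` are assumed callables invented for the example.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_sql_then_fallback(
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],  # hypothetical LLM call
    run_sql: Callable[[str], List],                      # hypothetical ES SQL call
    vector_search: Callable[[str], List],                # hypothetical fallback
    max_attempts: int = 2,
) -> Tuple[str, List]:
    """Try LLM-generated SQL, feed the error back once, then fall back."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        # On the second attempt the previous error text (including any
        # "did you mean" hint) is passed back to the SQL generator.
        sql = generate_sql(question, last_error)
        try:
            return "sql", run_sql(sql)
        except Exception as exc:
            last_error = str(exc)
    # Both SQL attempts failed: answer from vector hits instead.
    return "vector", vector_search(question)
```

Under these assumptions, each failed SQL attempt passes through the `sql()` logging path once, which is why a single chat turn can log the error twice before the fallback runs.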

### Fix

Catch `elasticsearch.BadRequestError` (the exception the client raises for HTTP 400 responses, which covers `verification_exception`, `parsing_exception`, and similar SQL-validity errors) separately and log it at **WARNING** with the SQL plus the ES error message. The message still carries the unknown column name and ES's suggested alternative, so it remains actionable for anyone investigating "why is my LLM producing bad SQL?", just without the misleading stack trace.

Other exception types (`ConnectionTimeout`, generic `Exception`) keep their original `ERROR`-level traceback treatment; those represent real connectivity or library bugs.

This is a one-file change (9 additions, 1 deletion). The retry loop in `use_sql`, the `add_kb_filter` injection, and the vector-search fallback are all unchanged.

### What this PR does NOT change

- **The LLM prompts in `use_sql`**: they already specify `Use EXACT field names from the schema` and pass the field list explicitly. Strengthening them risks regressing well-behaved cases and is out of scope for #15409.
- **The single-retry policy**: extending it to multi-retry with extracted ES suggestions is a separate enhancement.
- **The parser at `rag/app/table.py`**: the field names match the user's actual column headers; the parser is doing its job.

## Files changed

- [common/doc_store/es_conn_base.py](common/doc_store/es_conn_base.py)
  - Add `BadRequestError` to the `elasticsearch` import.
  - In `ESConnectionBase.sql()`, add an `except BadRequestError` arm above the generic `except Exception` that logs at WARNING and raises the same wrapped `Exception` as the generic arm (so the `use_sql` retry/fallback still triggers).
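
The ordering of the two `except` arms can be illustrated in isolation. This is a minimal sketch, not the actual `ESConnectionBase.sql()` method: `BadRequestError` here is a local stand-in class so the snippet runs without the `elasticsearch` package, and `run_sql` and `execute` are hypothetical names.

```python
import logging


class BadRequestError(Exception):
    """Local stand-in for elasticsearch.BadRequestError (assumption)."""


def run_sql(execute, sql, logger):
    """Run `execute(sql)`, logging expected rejections at WARNING only."""
    try:
        return execute(sql)
    except BadRequestError as e:
        # Specific arm first: an expected rejection of invalid SQL.
        # logger.warning() records the message without a traceback.
        logger.warning("SQL rejected (likely invalid LLM-generated SQL). "
                       "SQL:\n%s\nError: %s", sql, e)
        raise Exception(f"SQL error: {e}\n\nSQL: {sql}")
    except Exception as e:
        # Generic arm: anything else is unexpected, so keep the
        # ERROR-level traceback that logger.exception() emits.
        logger.exception("SQL got exception. SQL:\n%s", sql)
        raise Exception(f"SQL error: {e}\n\nSQL: {sql}")
```

Because Python checks `except` clauses top to bottom, the specific arm must come before `except Exception`, otherwise the generic arm would handle the rejection first. Both arms raise the same wrapped message, so a caller that only inspects the error text sees no difference.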
2026-06-29 15:31:05 +08:00 · 2026-06-11 00:04:52 -07:00
parent a1dc2da7b4
commit 3f929e3904
1 changed files with 9 additions and 1 deletions
--- a/common/doc_store/es_conn_base.py
+++ b/common/doc_store/es_conn_base.py
@@ -21,7 +21,7 @@ import time
 import os
 from abc import abstractmethod

-from elasticsearch import NotFoundError
+from elasticsearch import BadRequestError, NotFoundError
 from elasticsearch_dsl import Index
 from elastic_transport import ConnectionTimeout
 from elasticsearch.client import IndicesClient
@@ -395,6 +395,14 @@ class ESConnectionBase(DocStoreConnection):
                time.sleep(3)
                self._connect()
                continue
+            except BadRequestError as e:
+                # LLM-generated SQL routinely references columns that don't exist
+                # (e.g. unknown_column / verification_exception). The caller in
+                # api/db/services/dialog_service.py:use_sql catches this and either
+                # re-prompts the LLM with the error or falls back to vector search,
+                # so a full ERROR-level traceback is misleading — see #15409.
+                self.logger.warning(f"ESConnection.sql rejected by ES (likely invalid LLM-generated SQL). SQL:\n{sql}\nError: {e}")
+                raise Exception(f"SQL error: {e}\n\nSQL: {sql}")
            except Exception as e:
                self.logger.exception(f"ESConnection.sql got exception. SQL:\n{sql}")
                raise Exception(f"SQL error: {e}\n\nSQL: {sql}")