fix(es): downgrade LLM-generated invalid SQL to WARNING in ES sql() (#15409) (#15709)

## Summary

Fixes #15409.

Reporter sees scary ERROR-level stack traces in `ragflow_server.log` on
every chat turn against a knowledge base whose spreadsheet has many
columns with embedded IDs (e.g. `id-wstc-bios fvt-322-wstc-bios
fvt-323`). Simple queries work; complex ones return "No answer" with
logs that look like a hard crash.

### What's actually happening

1. The user uploads a wide Excel/CSV.
[rag/app/table.py:477-493](rag/app/table.py#L477-L493) turns each header
into an ES field with a type suffix, e.g. `id-wstc-bios
fvt-322-wstc-bios fvt-323_tks`. This is correct — the parser faithfully
encodes the user's column names.
2. The user asks about test case `fvt-085`. The SQL chat path in
[api/db/services/dialog_service.py:914
use_sql](api/db/services/dialog_service.py#L914) asks the LLM to write
SQL using the field list. The LLM sees the `id-wstc-bios
fvt-NNN-wstc-bios fvt-MMM_tks` pattern and pattern-completes a
plausible-but-nonexistent column.
3. Elasticsearch rejects with `BadRequestError(400,
'verification_exception')`: `Unknown column [id-wstc-bios
fvt-085-wstc-bios fvt-086_tks]` and suggests the closest valid column.
4. **The recovery path already exists**: `use_sql` catches the
exception, re-prompts the LLM with the error text (which contains ES's
"did you mean" hint), and on second failure the caller at
[api/db/services/dialog_service.py:626](api/db/services/dialog_service.py#L626)
falls back to vector search. The chat does produce an answer — it's just
generated from the vector hits instead of SQL.
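
The recovery path in step 4 can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the code in `dialog_service.py`: `answer_with_sql_or_fallback` and its callable parameters (`generate_sql`, `run_sql`, `vector_search`) are hypothetical names, and the fake LLM and executor exist purely for the demo.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_sql_or_fallback(
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    run_sql: Callable[[str], List],
    vector_search: Callable[[str], List],
    max_attempts: int = 2,
) -> Tuple[str, List]:
    """Try LLM-written SQL up to max_attempts times, then fall back to vector search."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        # On a retry, the previous error text (with the "did you mean" hint)
        # is handed back to the SQL generator.
        sql = generate_sql(question, last_error)
        try:
            return "sql", run_sql(sql)
        except Exception as exc:
            last_error = str(exc)
    return "vector", vector_search(question)


# --- Demo with hypothetical stand-ins for the LLM, ES, and vector search ---
def stubborn_llm(question, error):
    # Always invents the same nonexistent column.
    return "SELECT bogus_tks FROM idx"


def correcting_llm(question, error):
    # Uses the error hint on the second attempt.
    return "SELECT id_tks FROM idx" if error else "SELECT bogus_tks FROM idx"


def strict_run_sql(sql):
    if "bogus_tks" in sql:
        raise ValueError("Unknown column [bogus_tks], did you mean [id_tks]?")
    return [["fvt-085"]]


def fake_vector_search(question):
    return ["chunk-1"]


fallback_result = answer_with_sql_or_fallback(
    "fvt-085?", stubborn_llm, strict_run_sql, fake_vector_search
)
retry_result = answer_with_sql_or_fallback(
    "fvt-085?", correcting_llm, strict_run_sql, fake_vector_search
)
```

In this sketch the rejected SQL is an ordinary branch of the loop, which is why it can be hit twice per turn before the fallback runs.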

The only real bug is logging:

- [common/doc_store/es_conn_base.py:399](common/doc_store/es_conn_base.py#L399)
  catches every exception with `self.logger.exception(...)`, which writes
  a full traceback at **ERROR** level.
- For LLM-generated SQL this is the hot path, not an exceptional
condition — it can fire twice per turn before the fallback runs.
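
To make the logging difference concrete, here is a small standard-library sketch. The logger name `es_sql_demo` and the `ListHandler` class are made up for the illustration, and a plain `ValueError` stands in for the ES rejection.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("es_sql_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

try:
    raise ValueError("Unknown column [bogus_tks]")
except ValueError as exc:
    # logger.exception logs at ERROR and attaches the active traceback.
    logger.exception("sql got exception")
    # logger.warning logs at WARNING with the message only.
    logger.warning("sql rejected by ES: %s", exc)

error_record, warning_record = handler.records
print(error_record.levelname, error_record.exc_info is not None)
print(warning_record.levelname, warning_record.exc_info is not None)
```

The first record carries `exc_info`, so a formatter prints the full stack trace; the second carries only the formatted message.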

### Fix

Catch `elasticsearch.BadRequestError` (raised for HTTP 400 responses,
which is how ES reports `verification_exception`, `parsing_exception`,
and similar SQL-validity errors) separately and log it at **WARNING**
with the SQL plus the ES error
message. The message still carries the unknown column name and ES's
suggested alternative, so it's actionable for anyone investigating "why
is my LLM producing bad SQL?" — just without the misleading stack trace.

Other exception types (`ConnectionTimeout`, generic `Exception`) keep
their original `ERROR`-level traceback treatment; those represent real
connectivity / library bugs.
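
The arm ordering can be tried in isolation with the sketch below. It assumes nothing about the rest of `ESConnectionBase`: the local `BadRequestError` class is a stand-in so the example runs without the Elasticsearch client installed, and `run_sql_logged`, `LevelRecorder`, `reject`, and `crash` are hypothetical helpers.

```python
import logging


class BadRequestError(Exception):
    """Stand-in for elasticsearch.BadRequestError (assumption for this demo)."""


def run_sql_logged(execute, sql, logger):
    try:
        return execute(sql)
    except BadRequestError as e:
        # Expected for invalid LLM-written SQL: one WARNING line, no traceback.
        logger.warning(f"sql rejected by ES. SQL:\n{sql}\nError: {e}")
        raise Exception(f"SQL error: {e}\n\nSQL: {sql}")
    except Exception as e:
        # Anything else is unexpected: keep the ERROR-level traceback.
        logger.exception(f"sql got exception. SQL:\n{sql}")
        raise Exception(f"SQL error: {e}\n\nSQL: {sql}")


class LevelRecorder(logging.Handler):
    """Remembers only the level name of each record."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


def reject(sql):
    raise BadRequestError("Unknown column [bogus_tks]")


def crash(sql):
    raise RuntimeError("library bug")


demo_logger = logging.getLogger("es_sql_arms_demo")
demo_logger.propagate = False
recorder = LevelRecorder()
demo_logger.addHandler(recorder)

raised = []
for execute in (reject, crash):
    try:
        run_sql_logged(execute, "SELECT bogus_tks FROM idx", demo_logger)
    except Exception as exc:
        # Both arms re-raise, so a caller's retry/fallback still triggers.
        raised.append(str(exc))
```

The specific arm has to come before `except Exception`; otherwise the generic arm would match first and the WARNING path would never run.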

This is a one-file change (one import edit plus one new `except` arm). The retry loop in `use_sql`,
the `add_kb_filter` injection, and the vector-search fallback are all
unchanged.

### What this PR does NOT change

- **The LLM prompts in `use_sql`** — they already specify `Use EXACT
field names from the schema` and pass the field list explicitly.
Strengthening them risks regressing well-behaved cases and is out of
scope for #15409.
- **The single-retry policy** — extending it to multi-retry with
extracted ES suggestions is a separate enhancement.
- **The parser at `rag/app/table.py`** — the field names match the
user's actual column headers; the parser is doing its job.

## Files changed

- [common/doc_store/es_conn_base.py](common/doc_store/es_conn_base.py)
  - Add `BadRequestError` to the `elasticsearch` import.
  - In `ESConnectionBase.sql()`, add an `except BadRequestError` arm above
    the generic `except Exception` that logs at WARNING and re-raises (so
    `use_sql` retry/fallback still triggers).
Rene Arredondo
2026-06-11 00:04:52 -07:00
committed by GitHub
parent a1dc2da7b4
commit 3f929e3904

@@ -21,7 +21,7 @@ import time
 import os
 from abc import abstractmethod
-from elasticsearch import NotFoundError
+from elasticsearch import BadRequestError, NotFoundError
 from elasticsearch_dsl import Index
 from elastic_transport import ConnectionTimeout
 from elasticsearch.client import IndicesClient
@@ -395,6 +395,14 @@ class ESConnectionBase(DocStoreConnection):
                 time.sleep(3)
                 self._connect()
                 continue
+            except BadRequestError as e:
+                # LLM-generated SQL routinely references columns that don't exist
+                # (e.g. unknown_column / verification_exception). The caller in
+                # api/db/services/dialog_service.py:use_sql catches this and either
+                # re-prompts the LLM with the error or falls back to vector search,
+                # so a full ERROR-level traceback is misleading — see #15409.
+                self.logger.warning(f"ESConnection.sql rejected by ES (likely invalid LLM-generated SQL). SQL:\n{sql}\nError: {e}")
+                raise Exception(f"SQL error: {e}\n\nSQL: {sql}")
             except Exception as e:
                 self.logger.exception(f"ESConnection.sql got exception. SQL:\n{sql}")
                 raise Exception(f"SQL error: {e}\n\nSQL: {sql}")