ragflow/deepdoc/parser/html_parser.py at main

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 15:31:05 +08:00

Files

Yash Raj Pandey 091417980e fix(html_parser): preserve original text when splitting oversized blocks (#16052 )

### Bug

`RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing
the **tokenized** string and storing the joined tokens:

```python
tks_str = rag_tokenizer.tokenize(block)
...
tokens = tks_str.split(" ")
while start < len(tokens):
    chunks.append(" ".join(tokens[start:start + chunk_token_num]))  # tokenized form, not source
```

On the default (Elasticsearch) backend `rag_tokenizer.tokenize`
transforms text: it lowercases/stems Latin words and inserts spaces
between CJK characters. So any text block longer than `chunk_token_num`
is stored as garbled, lowercased, space-segmented text instead of the
source content. The small-block branch correctly stores the original
`block`, so only oversized blocks are corrupted. Affects HTML and EPUB
ingestion (both go through `chunk_block`), degrading retrieved chunks
and the answers generated from them.

### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend)

```
tokenize("Hello World FOO Bar Baz Qux Jumps")  -> "hello world foo bar baz qux jump"   # lowercased + stemmed
tokenize("你好世界这是一个测试")                 -> "你好世界 这 是 一个 测试"            # spaces inserted
```

### Fix

Split the **original** text: break it into atoms (whitespace-delimited
runs for space-separated scripts, per-character for spaceless scripts
such as Chinese) and pack them into pieces of at most `chunk_token_num`
tokens. This preserves the source characters and still splits scripts
that have no whitespace — a plain whitespace split would leave CJK as
one un-splittable chunk.

### Proof (real tokenizer, before/after)

Running the old vs new split against the real `infinity.rag_tokenizer`:

```
ENGLISH "Hello World FOO Bar Baz Qux Lazy Dogs"  (chunk_token_num=4)
  OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog']          # lowercased + stemmed
  NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs']      # preserved; each <= 4 tokens
  NEW preserves text exactly: True

CHINESE "你好世界这是一个测试用例需要被切分成多个块"  (chunk_token_num=3)
  OLD: ['你好世界 这 是', '一个 测试用例 需要', ...]                      # spurious spaces
  NEW: ['你好世', '界这是', '一个测', ...]                               # preserved; each <= 3 tokens
  NEW preserves text exactly: True
```

### Tests

Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English +
Chinese oversized blocks, plus small-block merge). Before the fix the
two oversized tests fail (English shows lowercasing, Chinese shows
inserted spaces); after the fix all pass. `ruff check` clean.

2026-06-25 16:43:35 +08:00

11 KiB

Raw Permalink Blame History

View Raw

11 KiB Raw Permalink Blame History

11 KiB

Raw Permalink Blame History