### Bug
`RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing
the **tokenized** string and storing the joined tokens:
```python
tks_str = rag_tokenizer.tokenize(block)
...
tokens = tks_str.split(" ")
while start < len(tokens):
chunks.append(" ".join(tokens[start:start + chunk_token_num])) # tokenized form, not source
```
On the default (Elasticsearch) backend `rag_tokenizer.tokenize`
transforms text: it lowercases/stems Latin words and inserts spaces
between CJK characters. So any text block longer than `chunk_token_num`
is stored as garbled, lowercased, space-segmented text instead of the
source content. The small-block branch correctly stores the original
`block`, so only oversized blocks are corrupted. Affects HTML and EPUB
ingestion (both go through `chunk_block`), degrading retrieved chunks
and the answers generated from them.
### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend)
```
tokenize("Hello World FOO Bar Baz Qux Jumps") -> "hello world foo bar baz qux jump" # lowercased + stemmed
tokenize("你好世界这是一个测试") -> "你好世界 这 是 一个 测试" # spaces inserted
```
### Fix
Split the **original** text: break it into atoms (whitespace-delimited
runs for space-separated scripts, per-character for spaceless scripts
such as Chinese) and pack them into pieces of at most `chunk_token_num`
tokens. This preserves the source characters and still splits scripts
that have no whitespace — a plain whitespace split would leave CJK as
one un-splittable chunk.
### Proof (real tokenizer, before/after)
Running the old vs new split against the real `infinity.rag_tokenizer`:
```
ENGLISH "Hello World FOO Bar Baz Qux Lazy Dogs" (chunk_token_num=4)
OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog'] # lowercased + stemmed
NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs'] # preserved; each <= 4 tokens
NEW preserves text exactly: True
CHINESE "你好世界这是一个测试用例需要被切分成多个块" (chunk_token_num=3)
OLD: ['你好世界 这 是', '一个 测试用例 需要', ...] # spurious spaces
NEW: ['你好世', '界这是', '一个测', ...] # preserved; each <= 3 tokens
NEW preserves text exactly: True
```
### Tests
Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English +
Chinese oversized blocks, plus small-block merge). Before the fix the
two oversized tests fail (English shows lowercasing, Chinese shows
inserted spaces); after the fix all pass. `ruff check` clean.
### What problem does this PR solve?
Feat: add button for remove header & footer in pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
- Fix incorrect Markdown heading mapping for `h4` in `TITLE_TAGS`
dictionary
- `h4` was mapped to `"#####"` (h5 level) instead of `"####"` (correct
h4 level)
Closes#13819
## Details
In `deepdoc/parser/html_parser.py`, the `TITLE_TAGS` dictionary had a
typo where `h4` was assigned 5 `#` characters instead of 4, causing h4
headings to be converted to h5-level Markdown headings during HTML
parsing.
## Test plan
- [ ] Parse an HTML document containing `<h4>` tags and verify the
output uses `####` (4 hashes)
- [ ] Verify other heading levels remain correct
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
- typos
- IDE warnings
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
##### Problem Description
When parsing HTML files, some page content may be lost.
For example, text inside nested `<font>` tags within multiple `<div>`
elements (e.g.,
`<div><font>Text_1</font></div><div><font>Text_2</font></div>`) fails to
be preserved correctly.
###### Root Cause #1: Block ID propagation is interrupted
1. **Block ID generation**: When the parser encounters a `<div>`, it
generates a new `block_id` because `<div>` belongs to `BLOCK_TAGS`.
2. **Recursive processing**: This `block_id` is passed down recursively
to process the `<div>`’s child nodes.
3. **Interruption occurs**: When processing a child `<font>` tag, the
code enters the `else` branch of `read_text_recursively` (since `<font>`
is a Tag).
4. **Bug location**: The first line in this `else` branch explicitly
sets **`block_id = None`**.
- This discards the valid `block_id` inherited from the parent `<div>`.
- Since `<font>` is not in `BLOCK_TAGS`, it does not generate a new
`block_id`, so it passes `None` to its child text nodes.
5. **Consequence**: The extracted text nodes have an empty `block_id` in
their `metadata`. During the subsequent `merge_block_text` step, these
texts cannot be correctly associated with their original `<div>` block
due to the missing ID. As a result, all text from `<font>` tags gets
merged together, which then triggers a second issue during
concatenation.
6. **Solution:** Remove the forced reset of `block_id` to `None`. When
the current tag (e.g., `<font>`) is not a block-level element, it should
inherit the `block_id` passed down from its parent. This ensures
consistent ownership across the hierarchy: `div` → `font` → `text`.
###### Root Cause #2: Data loss during text concatenation
1. The line `current_content += (" " if current_content else "" +
content)` has a misplaced parenthesis. When `current_content` is
non-empty (`True`):
- The ternary expression evaluates to `" "` (a single space).
- The code executes `current_content += " "`.
- **Result**: Only a space is appended—**the new `content` string is
completely discarded**.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
issue:
[Bug]: Agent component (HTTP Request) "'>' not supported between
instances of 'int' and 'NoneType'"
[#10096](https://github.com/infiniflow/ragflow/issues/10096)
Change:
When the Invoke class instantiates HtmlParser without providing the
chunk_token_num parameter, the value defaults to None, leading to a
comparison error with block_token_count.
This change sets the default chunk_token_num to 512 to prevent such
errors.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: BadwomanCraZY <511528396@qq.com>
### What problem does this PR solve?
add support for eml file parser
#1363
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>