5 Commits

Author SHA1 Message Date
Harsh Kashyap
66d86154ab fix(deepdoc): accept GFM table separators with one or more dashes (#16319) 2026-06-25 19:25:57 +08:00
helloxjade
1b2da645c3 fix: deduplicate markdown table chunks (#16143) 2026-06-24 13:22:57 +08:00
jaso0n0818
a70c7e8cc7 fix(deepdoc): attach lone header lines to the following section when delimiter is set (#16109)
## Summary
Fixes #15487: lone Markdown headers are no longer isolated as empty
chunks when a custom `delimiter` is set.

- Merge consecutive lone headers before attaching to the following prose
body
- Skip code fences, tables, lists, and blockquotes via
`_is_attachable_body()`
- Unit tests include the `# Title / ## Intro / Body` regression from
CodeRabbit review
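
The merge-then-attach behaviour described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `is_lone_header`, `is_attachable_body`, and `attach_lone_headers` are hypothetical names, the body check is a simplified stand-in for `_is_attachable_body()`, and leaving headers as separate chunks when the next chunk is not prose is an assumption.

```python
import re

_HEADER = re.compile(r"^#{1,6}\s+\S")
# Simplified, assumed test for "not prose": fence, table row, blockquote, or list item.
_NON_PROSE = re.compile(r"(`{3}|~{3}|\||>|[-*+]\s|\d+[.)]\s)")


def is_lone_header(chunk):
    """True when the chunk is a single ATX header line with no body."""
    stripped = chunk.strip()
    return "\n" not in stripped and bool(_HEADER.match(stripped))


def is_attachable_body(chunk):
    """Hypothetical stand-in for `_is_attachable_body()`: accept prose only."""
    return not _NON_PROSE.match(chunk.lstrip())


def attach_lone_headers(chunks):
    """Merge runs of lone headers and prepend them to the following prose chunk."""
    out, pending = [], []
    for chunk in chunks:
        if is_lone_header(chunk):
            pending.append(chunk.strip())
            continue
        if pending and is_attachable_body(chunk):
            out.append("\n".join(pending + [chunk]))
        else:
            out.extend(pending)  # assumption: unattached headers stay as they were
            out.append(chunk)
        pending = []
    out.extend(pending)  # trailing headers with no following body
    return out
```

With the `# Title / ## Intro / Body` case mentioned above, the three input chunks collapse into a single chunk that starts with both headers, while a header followed by a code fence is left untouched.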

## Validation
- `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11 passed locally)

Closes #15487
2026-06-18 14:24:09 +08:00
bitloi
9f3e289b78 Fix: preserve markdown tables during delimiter extraction (#15632)
### What problem does this PR solve?

Markdown extraction can split tables row by row when delimiter-based
extraction uses a newline delimiter. That loses table structure during
chunking even though delimiters should still split normally outside
tables.

This PR keeps the follow-up to #15482 intentionally narrow:

- preserve Markdown pipe tables during delimiter-based extraction
- preserve borderless pipe tables during delimiter-based extraction
- preserve multiline HTML tables during delimiter-based extraction
- keep delimiter splitting unchanged outside protected table ranges
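
As a rough illustration of the list above, the sketch below splits on a newline delimiter but emits each pipe table (bordered or borderless) as one chunk. It is not the implementation in `deepdoc/parser/markdown_parser.py`: `is_separator` and `split_preserving_tables` are hypothetical names, the separator pattern is an assumption that accepts one or more dashes per cell, and multiline HTML tables are not handled here.

```python
import re

# Separator row such as "|---|:-:|" or "- | -": one or more dashes per cell,
# optional alignment colons, optional outer pipes.
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def is_separator(line):
    """True for a table separator row; requires a pipe so '---' alone is excluded."""
    return "|" in line and bool(_SEPARATOR.match(line))


def split_preserving_tables(text):
    """Split on newlines, keeping each pipe table together as a single chunk."""
    lines = text.split("\n")
    chunks, i = [], 0
    while i < len(lines):
        # A table starts at a header row that is followed by a separator row.
        if "|" in lines[i] and i + 1 < len(lines) and is_separator(lines[i + 1]):
            j = i + 2
            while j < len(lines) and "|" in lines[j]:
                j += 1  # consume body rows
            chunks.append("\n".join(lines[i:j]))
            i = j
        else:
            if lines[i].strip():  # drop empty pieces between delimiters
                chunks.append(lines[i])
            i += 1
    return chunks
```

Lines outside a table are still split one per chunk, which mirrors the last bullet: the delimiter behaves normally outside the protected range.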

Refs #15482

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

- `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `git diff --check`
2026-06-05 10:35:33 +08:00
bitloi
01a5598aa5 Fix: markdown fenced code block extraction (#15630)
### What problem does this PR solve?

Markdown extraction currently applies custom delimiters before
respecting fenced code blocks. When a delimiter such as a newline is
configured, fenced code can be split into separate chunks, and longer
outer fences can be closed incorrectly by shorter nested fences.

This PR keeps the fix intentionally narrow for the Markdown chunking
discussion in #15482:

- preserve fenced code blocks when delimiter-based extraction is used
- support both backtick and tilde fences
- respect fence length so longer outer fences can contain shorter inner
fences
- keep delimiter splitting unchanged outside fenced blocks
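
The fence rules listed above can be sketched as follows. This is a minimal example under stated assumptions rather than the PR's code: `fenced_ranges` and `split_outside_fences` are hypothetical names, the delimiter is fixed to a newline, and an unclosed fence is assumed to run to the end of the text.

```python
import re

# Opening fence: up to 3 spaces of indent, then 3+ backticks or 3+ tildes.
_FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def fenced_ranges(lines):
    """Return inclusive (start, end) line-index pairs for fenced code blocks."""
    ranges = []
    open_at, fence_char, fence_len = None, "", 0
    for i, line in enumerate(lines):
        m = _FENCE.match(line)
        if open_at is None:
            if m:
                open_at, fence_char, fence_len = i, m.group(1)[0], len(m.group(1))
            continue
        # Inside a block: only a bare fence of the same character, at least
        # as long as the opener, closes it. Shorter inner fences are content.
        if (m and m.group(1)[0] == fence_char
                and len(m.group(1)) >= fence_len
                and line.strip() == m.group(1)):
            ranges.append((open_at, i))
            open_at = None
    if open_at is not None:  # assumption: unclosed fence extends to the end
        ranges.append((open_at, len(lines) - 1))
    return ranges


def split_outside_fences(text):
    """Split on newlines, keeping each fenced block together as one chunk."""
    lines = text.split("\n")
    protected = dict(fenced_ranges(lines))  # start index -> end index
    chunks, i = [], 0
    while i < len(lines):
        if i in protected:
            end = protected[i]
            chunks.append("\n".join(lines[i:end + 1]))
            i = end + 1
        else:
            if lines[i].strip():
                chunks.append(lines[i])
            i += 1
    return chunks
```

Tracking both the fence character and its length is what lets a four-backtick outer fence contain a three-backtick inner fence without being closed early.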

Refs #15482

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

- `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py`
2026-06-04 13:33:46 +08:00