Commit Graph

11 Commits

Author SHA1 Message Date
bitloi
01a5598aa5 Fix: markdown fenced code block extraction (#15630)
### What problem does this PR solve?

Markdown extraction currently applies custom delimiters before
respecting fenced code blocks. When a delimiter such as a newline is
configured, fenced code can be split into separate chunks, and longer
outer fences can be closed incorrectly by shorter nested fences.

This PR keeps the fix intentionally narrow for the Markdown chunking
discussion in #15482:

- preserve fenced code blocks when delimiter-based extraction is used
- support both backtick and tilde fences
- respect fence length so longer outer fences can contain shorter inner
fences
- keep delimiter splitting unchanged outside fenced blocks

Refs #15482

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

- `ruff check deepdoc/parser/markdown_parser.py
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t
test/unit_test/deepdoc/parser/test_markdown_parser.py`
2026-06-04 13:33:46 +08:00
Idriss Sbaaoui
dd529137eb Fix: markdown table double extraction in parser (#13892)
### What problem does this PR solve?

Fixes markdown tables being parsed twice (once as markdown and again as
generated HTML), which caused duplicate table chunks in the chunk list
UI.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-02 13:31:56 +08:00
Yongteng Lei
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00
Billy Bao
121c51661d Fix: Markdown table extractor (#11018)
### What problem does this PR solve?

Now markdown table extractor supports <table ...>. #10966 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-05 16:10:21 +08:00
buua436
bb9504d1cc Fix:enhance delimiters in markdown parser (#10896)
### What problem does this PR solve?
issue:
[#10890](https://github.com/infiniflow/ragflow/issues/10890)
change:
enhance delimiters in markdown parser
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-30 17:36:51 +08:00
Yongteng Lei
5200711441 Feat: add support for multi-column PDF parsing (#10475)
### What problem does this PR solve?

Add support for multi-columns PDF parsing. #9878, #9919.

Two-column sample:
<img width="1885" height="1020" alt="image"
src="https://github.com/user-attachments/assets/0270c028-2db8-4ca6-a4b7-cd5830882d28"
/>

Three-column sample: 
<img width="1881" height="992" alt="image"
src="https://github.com/user-attachments/assets/9ee88844-d5b1-4927-9e4e-3bd810d6e03a"
/>

Single-column sample:
<img width="1883" height="1042" alt="image"
src="https://github.com/user-attachments/assets/e93d3d18-43c3-4067-b5fa-e454ed0ab093"
/>



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-10-11 18:46:09 +08:00
Yongteng Lei
382458ace7 Feat: advanced markdown parsing (#9607)
### What problem does this PR solve?

Using AST parsing to handle markdown more accurately, preventing
components from being cut off by chunking. #9564

<img width="1746" height="993" alt="image"
src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f"
/>

<img width="1739" height="982" alt="image"
src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf"
/>

<img width="559" height="100" alt="image"
src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-08-21 09:36:18 +08:00
Yongteng Lei
51a8604dcb Fix: fixed context loss caused by separating markdown tables from original text (#8844)
### What problem does this PR solve?

Fix context loss caused by separating markdown tables from original
text. #6871, #8804.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-15 13:03:01 +08:00
liwenju0
5b0e38060a Feat:Optimize the table extraction logic in the Markdown parser: (#5663)
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags. Improve performance by using conditional checks to
reduce unnecessary regular expression matching.

### What problem does this PR solve?

Optimize the table extraction logic in the Markdown parser:
Enhance the recognition of both borderless and bordered Markdown tables.
Add support for extracting HTML tables, including various scenarios with
nested HTML tags.
Improve performance by using conditional checks to reduce unnecessary
regular expression matching.

### Type of change

- [x] Performance Improvement

Co-authored-by: wenju.li <wenju.li@deepctr.cn>
2025-03-07 17:02:35 +08:00
Jin Hai
3894de895b Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
Zhedong Cen
a95c1d45f0 Support table for markdown file in general parser (#1278)
### What problem does this PR solve?

Support extracting table for markdown file in general parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-27 14:38:35 +08:00