ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00


Manan Bansal 70c0121b78 Fix: preserve tables when parsing DOCX with the laws parser (#16008) (#16155)

## What

Fixes #16008 — tables contained in a DOCX are silently dropped when the
document is parsed with the **laws** chunking method.

## Root cause

`Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`,
which only yields paragraph elements. Tables are separate `tbl` blocks
in the document body, so they were never visited and were lost from the
output. (The `naive` parser already handles tables by iterating the
document body.)
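
The difference between the two iteration strategies can be illustrated without
python-docx at all. The sketch below is only an illustration, not the code in
`rag/app/laws.py`: it parses a hand-written stand-in for `word/document.xml`
with the standard library, and the names `DOCUMENT_XML`, `text_of`, and
`local_name` are hypothetical helpers invented for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in DOCX files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written stand-in for a document body: paragraph, table, paragraph.
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Article 1</w:t></w:r></w:p>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Fee</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>100</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
    <w:p><w:r><w:t>Article 2</w:t></w:r></w:p>
  </w:body>
</w:document>
""".strip()


def text_of(element):
    """Concatenate the text of every w:t run below the element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W}}}t"))


def local_name(element):
    """Strip the namespace from a tag such as '{...}tbl'."""
    return element.tag.split("}", 1)[1]


body = ET.fromstring(DOCUMENT_XML).find(f"{{{W}}}body")

# Paragraph-only view: direct w:p children of the body, so the table is skipped.
paragraphs_only = [text_of(p) for p in body.findall(f"{{{W}}}p")]

# Body-order view: every direct child, so the table shows up where it belongs.
blocks_in_order = [(local_name(child), text_of(child)) for child in body]

print(paragraphs_only)
print(blocks_in_order)
```

The paragraph-only list contains the two article lines and nothing from the
table, while the body-order list carries a `tbl` entry between them.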

## Changes

- Iterate `self.doc._element.body` so tables are visited in document
order alongside paragraphs.
- Add a `__table_to_html` helper that renders each table to HTML,
including merged-cell `colspan` detection (mirrors the `naive` parser's
logic).
- Inject each table into the section tree with a sentinel level deeper
than any heading, so `Node.build_tree` merges it into its **enclosing
section** — keeping the chapter/article title path as retrieval context
rather than producing an orphaned chunk.
- Guard the `h2_level` computation against an empty heading set, so a
tables-only or empty DOCX no longer raises `IndexError`.
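
As a rough sketch of the `colspan` detection described above (not the actual
`__table_to_html` helper), the function below works on plain lists of cell
text. It assumes a horizontally merged cell shows up once per grid column it
spans, with identical text each time, so a run of equal adjacent values is
collapsed into one `<td>`. The name `table_to_html` and the list-of-lists input
are assumptions made for this example.

```python
from html import escape


def table_to_html(rows):
    """Render rows (lists of cell text) to HTML, merging equal neighbours.

    Assumption: a merged cell is repeated once per column it covers,
    so a run of identical adjacent texts is treated as one spanning cell.
    """
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        i = 0
        while i < len(row):
            # Count how many following cells repeat the current text.
            span = 1
            while i + span < len(row) and row[i + span] == row[i]:
                span += 1
            attr = f' colspan="{span}"' if span > 1 else ""
            parts.append(f"<td{attr}>{escape(row[i])}</td>")
            i += span
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


html_out = table_to_html([["Total", "Total", "30"], ["a", "b", "c"]])
print(html_out)
```

One trade-off of this heuristic is that two genuinely separate neighbouring
cells with the same text are also rendered as a single spanning cell.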

This keeps the laws parser's hierarchical chunking **and** adds table
extraction, so users no longer have to choose between losing structure
(naive) or losing tables (laws).
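
The sentinel-level idea from the change list can be illustrated with a toy
chunker. This is a simplified stand-in, not `Node.build_tree`: the function
`chunk_with_title_path`, the `(level, text)` input shape, and the use of
`max(..., default=0)` as the empty-heading guard are all assumptions made for
the example.

```python
def chunk_with_title_path(blocks):
    """Attach each non-heading block to the headings that enclose it.

    blocks is a list of (level, text) pairs. Headings carry an integer
    level (1 is the outermost); tables and body text carry level None.
    """
    heading_levels = [level for level, _ in blocks if level is not None]
    # Guard: with no headings at all, fall back to 0 instead of failing.
    sentinel = max(heading_levels, default=0) + 1

    stack = []   # currently open headings as (level, title)
    chunks = []
    for level, text in blocks:
        if level is None:
            # Tables and body text sit deeper than any heading.
            level = sentinel
        if level < sentinel:
            # A heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            path = [title for _, title in stack]
            chunks.append(" > ".join(path + [text]))
    return chunks


doc = [
    (1, "Chapter 1"),
    (2, "Article 1"),
    (None, "<table>fees</table>"),
    (2, "Article 2"),
    (None, "Body text"),
]
chunks = chunk_with_title_path(doc)
print(chunks)
```

Because the table's level is deeper than every heading, it never closes a
section and always lands under the most recent heading path. A tables-only
input yields the bare table, and an empty input yields `[]`.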

## Tests

Adds `test/unit_test/rag/test_laws_docx_tables.py` covering:
- table content is preserved and carries its section title path,
- merged adjacent cells collapse to `colspan`,
- tables-only document does not crash,
- empty document returns `[]`.

All four pass; `ruff check` / `ruff format` are clean.

2026-06-22 09:46:44 +08:00

__init__.py

Update comments (#4569)

2025-01-21 20:52:28 +08:00

audio.py

fix: remove duplicate .wav and .aac in audio supported extensions list (#14791)

2026-05-13 09:42:31 +08:00

book.py

Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)

2026-04-27 14:57:20 +08:00

email.py

Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)