ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-01 08:15:44 +08:00

Files

Manan Bansal 70c0121b78 Fix: preserve tables when parsing DOCX with the laws parser (#16008) (#16155)

## What

Fixes #16008 — tables contained in a DOCX are silently dropped when the
document is parsed with the **laws** chunking method.

## Root cause

`Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`,
which only yields paragraph elements. Tables are separate `tbl` blocks
in the document body, so they were never visited and were lost from the
output. (The `naive` parser already handles tables by iterating the
document body.)
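The difference between the two iteration strategies can be illustrated without the real parser. The sketch below is not the code in `rag/app/laws.py`: it uses only the standard library and a hand-written, hypothetical WordprocessingML body, and the helper names (`text_of`, `paragraphs_only`, `blocks_in_order`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical body: a paragraph, a one-cell table, then another paragraph.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Article 1</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Fee</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  <w:p><w:r><w:t>Article 2</w:t></w:r></w:p>
</w:body>
"""


def text_of(element):
    """Concatenate every w:t text run found under the element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W}}}t"))


def paragraphs_only(body):
    """Visit only the paragraphs that are direct children of the body."""
    return [text_of(p) for p in body.findall("w:p", NS)]


def blocks_in_order(body):
    """Visit paragraphs and tables in document order."""
    blocks = []
    for child in body:
        kind = child.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        if kind in ("p", "tbl"):
            blocks.append((kind, text_of(child)))
    return blocks


body = ET.fromstring(BODY_XML.strip())
print(paragraphs_only(body))  # the table text is missing
print(blocks_in_order(body))  # the table appears between the paragraphs
```

In this sketch the paragraph-only walk never sees the `tbl` element, which matches the behaviour described above.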

## Changes

- Iterate `self.doc._element.body` so tables are visited in document
order alongside paragraphs.
- Add a `__table_to_html` helper that renders each table to HTML,
including merged-cell `colspan` detection (mirrors the `naive` parser's
logic).
- Inject each table into the section tree with a sentinel level deeper
than any heading, so `Node.build_tree` merges it into its **enclosing
section** — keeping the chapter/article title path as retrieval context
rather than producing an orphaned chunk.
- Guard the `h2_level` computation against an empty heading set, so a
tables-only or empty DOCX no longer raises `IndexError`.
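The merged-cell handling can be sketched as follows. This is a simplified illustration, not the `__table_to_html` helper itself: it assumes each row has already been reduced to a list of cell texts and that a horizontally merged cell shows up as adjacent cells with identical text. The function name `table_to_html` is hypothetical.

```python
from html import escape


def table_to_html(rows):
    """Render rows of cell texts as an HTML table.

    Assumption: adjacent cells with identical text in a row represent one
    merged cell, so they collapse into a single <td> with a colspan.
    """
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        i = 0
        while i < len(row):
            span = 1
            # Count how many following cells repeat the current cell's text.
            while i + span < len(row) and row[i + span] == row[i]:
                span += 1
            attr = f' colspan="{span}"' if span > 1 else ""
            parts.append(f"<td{attr}>{escape(row[i])}</td>")
            i += span
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


print(table_to_html([["Fee", "Fee", "Due"], ["10", "20", "May"]]))
```

One consequence of this heuristic is that two neighbouring cells that genuinely hold the same value are also collapsed, which a real implementation may or may not want.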

This keeps the laws parser's hierarchical chunking **and** adds table
extraction, so users no longer have to choose between losing structure
(naive) or losing tables (laws).
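The sentinel-level idea and the empty-heading guard can be sketched in isolation. This is not `Node.build_tree` or the actual `h2_level` computation. `TABLE_LEVEL`, `attach_tables`, and `safe_h2_level` are hypothetical names, and the choice of the second-shallowest level in `safe_h2_level` is an assumption made only to show the guard.

```python
TABLE_LEVEL = 10**6  # hypothetical sentinel, deeper than any real heading level


def attach_tables(blocks):
    """Pair each table with the title path of its enclosing section.

    blocks is a list of (level, text) tuples in document order; headings
    carry small levels and tables carry TABLE_LEVEL.
    """
    path = []  # stack of (level, title) for the currently open headings
    attached = []
    for level, text in blocks:
        if level >= TABLE_LEVEL:
            # A table never opens a section; it inherits the current path.
            attached.append((" > ".join(title for _, title in path), text))
            continue
        # A new heading closes every open heading at the same or deeper level.
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, text))
    return attached


def safe_h2_level(heading_levels, default=None):
    """Pick the second-shallowest heading level without indexing an empty list."""
    ordered = sorted(set(heading_levels))
    if not ordered:
        return default  # tables-only or empty document
    return ordered[1] if len(ordered) > 1 else ordered[0]
```

With this shape a table placed under "Article 1" of "Chapter I" keeps the path `Chapter I > Article 1`, and a document without headings yields tables with an empty path instead of raising.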

## Tests

Adds `test/unit_test/rag/test_laws_docx_tables.py` covering:
- table content is preserved and carries its section title path,
- merged adjacent cells collapse to `colspan`,
- tables-only document does not crash,
- empty document returns `[]`.

All four pass; `ruff check` / `ruff format` are clean.

2026-06-22 09:46:44 +08:00

| Name | Last commit | Last updated |
| --- | --- | --- |
| benchmark | Fix: replace tenant_llm apis (#16131) | 2026-06-18 16:38:32 +08:00 |
| fixtures/mineru | fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387) | 2026-06-01 20:15:04 +08:00 |
| playwright | Feature: Allow page_size max value 100 (#15292) | 2026-05-28 11:13:01 +08:00 |
| testcases | Fix: replace tenant_llm apis (#16131) | 2026-06-18 16:38:32 +08:00 |
| unit_test | Fix: preserve tables when parsing DOCX with the laws parser (#16008) (#16155) | 2026-06-22 09:46:44 +08:00 |
| __init__.py | Feat: UI testing automation with playwright (#12749) | 2026-03-02 13:04:08 +08:00 |
| README.md | Docs: Update version references to v0.26.1 in READMEs and docs (#16158) | 2026-06-17 19:35:32 +08:00 |

README.md

(1). Deploy RAGFlow services and images

https://ragflow.io/docs/build_docker_image

(2). Configure the required environment for testing

Install Python dependencies (including test dependencies):

uv sync --python 3.13 --only-group test --no-default-groups --frozen

Activate the environment:

source .venv/bin/activate

Install SDK:

uv pip install sdk/python

Modify the .env file by adding the following lines:

COMPOSE_PROFILES=${COMPOSE_PROFILES},tei-cpu
TEI_MODEL=BAAI/bge-small-en-v1.5
RAGFLOW_IMAGE=infiniflow/ragflow:v0.26.1 # Replace with the image you are using

Start the containers (wait about two minutes for the services to come up):

docker compose -f docker/docker-compose.yml up -d
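Instead of waiting a fixed two minutes, the API port can be polled until it accepts connections. The following is a minimal sketch and not part of the repository: `wait_for_port` is a hypothetical helper. An open port only shows that something is listening, not that RAGFlow has finished initialising.

```python
import socket
import time
from urllib.parse import urlparse


def wait_for_port(address, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to the address succeeds, else False.

    address is a URL such as http://127.0.0.1:9380; timeout and interval
    are in seconds.
    """
    parsed = urlparse(address)
    host = parsed.hostname
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("http://127.0.0.1:9380")` could be called before running pytest, assuming 9380 is the API port mapped to your localhost.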

(3). Test Elasticsearch

a) Run the SDK tests against Elasticsearch:

export HTTP_API_TEST_LEVEL=p2
export HOST_ADDRESS=http://127.0.0.1:9380  # Ensure that this port is the API port mapped to your localhost
pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api

b) Run the HTTP API tests against Elasticsearch:

pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api

(4). Test Infinity

Modify the .env file:

DOC_ENGINE=${DOC_ENGINE:-infinity}

Restart the containers (down -v also removes the existing volumes, so data from the previous run is deleted):

docker compose -f docker/docker-compose.yml down -v
docker compose -f docker/docker-compose.yml up -d

a) Run the SDK tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api

b) Run the HTTP API tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api
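The four pytest invocations above differ only in the suite path and the `DOC_ENGINE` variable, so they can be generated from one helper. This is a sketch, not a script shipped with the repository: `SUITES` and `pytest_invocation` are hypothetical names, and the default host assumes the API is mapped to port 9380.

```python
import os

SUITES = ("test/testcases/test_sdk_api", "test/testcases/test_http_api")


def pytest_invocation(suite, level="p2", doc_engine=None,
                      host="http://127.0.0.1:9380"):
    """Build the command and environment for one test run.

    doc_engine=None leaves DOC_ENGINE as inherited (the Elasticsearch runs
    above do not set it); pass "infinity" for the Infinity runs.
    """
    env = dict(os.environ, HTTP_API_TEST_LEVEL=level, HOST_ADDRESS=host)
    if doc_engine:
        env["DOC_ENGINE"] = doc_engine
    cmd = ["pytest", "-s", "--tb=short", f"--level={level}", suite]
    return cmd, env
```

Each pair can then be passed to `subprocess.run(cmd, env=env, check=True)` from the repository root.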
