ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-03 17:21:59 +08:00

Author	SHA1	Message	Date
Ricardo-M-L	cb606e1c38	fix: correct attribute name typo model_speciess to model_species (#13929 ) ## Summary - Rename misspelled attribute `model_speciess` to `model_species` across 4 files - The extra `s` is a typo — `species` is already plural ## Test plan - [ ] Verify PDF parsing with laws/manual/paper parser types still works correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 14:19:41 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values | Layer | Files Changed | Old Values | New Value | |---|---|---|---| | Deepdoc parsers | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `109`, `100000000` | `MAXIMUM_PAGE_NUMBER` | | Chunk parsers | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` | | Task/DB layer | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` | ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Yongteng Lei	672958a192	Fix: model not authorized (#12001 ) ### What problem does this PR solve? Fix model not authorized. #11973. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-17 19:48:24 +08:00
Kevin Hu	09a3854ed8	Fix: chunk method error. (#11807 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 14:28:23 +08:00
coding	971c1bcba7	Fix: missing parameters in by_plaintext method for PDF naive mode (#11408 ) ### What problem does this PR solve? Fix: missing parameters in by_plaintext method for PDF naive mode ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: lih <dev_lih@139.com>	2025-11-21 09:33:36 +08:00
Billy Bao	4b8ce08050	Fix: fix pdf_parser ignored in rag/app/naive.py (#11065 ) ### What problem does this PR solve? Fix: fix pdf_parser ignored in rag/app/naive.py #11000 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-06 15:20:35 +08:00
Billy Bao	cf9611c96f	Feat: Support more chunking methods (#11000 ) ### What problem does this PR solve? Feat: Support more chunking methods #10772 This PR enables multiple chunking methods — including books, laws, naive, one, and presentation — to be used with all existing PDF parsers (DeepDOC, MinerU, Docling, TCADP, Plain Text, and Vision modes). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-05 13:00:42 +08:00
Jin Hai	bab3fce136	Move some constants to common (#11004 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 08:01:39 +08:00
Billy Bao	ab52ffc9c0	Fix: law parser (#10897 ) ### What problem does this PR solve? Fix: law parser #10888 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-30 19:00:11 +08:00
Billy Bao	ca9f30e1a1	Add tree_merge for law parsers, significantly outperforming hierarchical_merge (#10202 ) ### What problem does this PR solve? Add tree_merge for law parsers, significantly outperforming hierarchical_merge, solved: #8637 1. Add tree_merge for law parsers, include build_tree and get_tree by dfs. 2. add Copyright statement for helath_utils ### Type of change - [x] Documentation Update - [x] Performance Improvement	2025-09-22 16:33:21 +08:00
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113 ) ### What problem does this PR solve? #9082 #6365 WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 19:41:09 +08:00
Kevin Hu	dd0ebbea35	Light GraphRAG (#4585 ) ### What problem does this PR solve? #4543 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-01-22 19:43:14 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Kevin Hu	8fb18f37f6	Code refactor. (#4291 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-12-30 18:38:51 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Jin Hai	e079656473	Update progress info and start welcome info (#3768 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Refactoring --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-11-30 18:48:06 +08:00
Michal Masrna	c4f2464935	fix: laws.py added missing import logging (#3501 ) ### What problem does this PR solve? _Choosing Laws Chunk Method results in an error when parsing a document. The error is caused by a missing import in the `laws.py` file._ ``` Traceback (most recent call last): File "/ragflow/rag/svr/task_executor.py", line 445, in handle_task do_handle_task(task) File "/ragflow/rag/svr/task_executor.py", line 384, in do_handle_task cks = build(r) ^^^^^^^^ File "/ragflow/rag/svr/task_executor.py", line 196, in build cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"], ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/ragflow/rag/app/laws.py", line 161, in chunk for txt, poss in pdf_parser(filename if not binary else binary, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/ragflow/rag/app/laws.py", line 124, in __call__ logging.debug("layouts:".format( ^^^^^^^ NameError: name 'logging' is not defined. Did you forget to import 'logging' ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: Michal Masrna <m.marna1@gmail.com>	2024-11-20 20:52:05 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403 ) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2024-11-14 17:13:48 +08:00
Zhichang Yu	a2a5631da4	Rework logging (#3358 ) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	2024-11-12 17:35:13 +08:00
yqkcn	570ad420a8	remove unused import (#2679 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-09-30 16:59:39 +08:00
Kevin Hu	fc867cb959	rename get_txt to get_text (#2649 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 12:47:09 +08:00
yqkcn	aea553c3a8	Add get_txt function (#2639 ) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-09-29 10:29:56 +08:00
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949 ) ### What problem does this PR solve? The related source files were in Windows/DOS format; they are converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
KevinHuSh	92e9320657	upgrade laws parser of docx (#1332 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-07-01 15:50:24 +08:00
Zhedong Cen	fc7cc1d36c	Optimize docx handle method in laws parser (#1302 ) ### What problem does this PR solve? Optimize docx handle method in laws parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-28 17:42:59 +08:00
Zhedong Cen	8dd45459be	Add support for HTML file (#973 ) ### What problem does this PR solve? Add support for HTML file ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-05-30 09:12:55 +08:00
KevinHuSh	7013d7f620	refine text decode (#657 ) ### What problem does this PR solve? #651 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-07 12:25:47 +08:00
KevinHuSh	8c07992b6c	refine code (#595 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-28 19:13:33 +08:00
Jin Hai	f1c98aad6b	Update version info (#564 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Documentation Update - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-04-26 20:07:26 +08:00
chrysanthemum-boy	72384b191d	Add `.doc` file parser. (#497 ) ### What problem does this PR solve? Add `.doc` file parser, using tika. ``` pip install tika ``` ``` from tika import parser from io import BytesIO def extract_text_from_doc_bytes(doc_bytes): file_like_object = BytesIO(doc_bytes) parsed = parser.from_buffer(file_like_object) return parsed["content"] ``` ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chrysanthemum-boy <fannc@qq.com>	2024-04-23 15:31:43 +08:00
KevinHuSh	0dfc8ddc0f	enlarge docker memory usage (#501 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-23 14:41:10 +08:00
KevinHuSh	a38e163035	remove doc from supported processing types (#488 ) ### What problem does this PR solve? #474 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-22 15:46:09 +08:00
KevinHuSh	ed6081845a	Fit a lot of encodings for text file. (#458 ) ### What problem does this PR solve? #384 ### Type of change - [x] Performance Improvement	2024-04-19 18:02:53 +08:00
KevinHuSh	f6c7204002	refine log format (#312 ) ### What problem does this PR solve? Issue link:#264 ### Type of change - [x] Documentation Update - [x] Refactoring	2024-04-11 10:13:43 +08:00
KevinHuSh	a0a480b708	continue add layout model for 'laws' (#292 ) ### What problem does this PR solve? Issue link:#289 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-04-10 14:06:36 +08:00
KevinHuSh	243de6ac90	add a new model for 'Laws' (#290 ) ### What problem does this PR solve? Issue link:#289 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-04-10 11:59:00 +08:00
KevinHuSh	fd7fcb5baf	apply pep8 formalize (#155 )	2024-03-27 11:33:46 +08:00
KevinHuSh	da21320b88	fix plainPdf bugs (#152 )	2024-03-26 15:11:07 +08:00
KevinHuSh	f6aee7f230	add use layout or not option (#145 ) * add use layout or not option * trivial	2024-03-22 19:21:09 +08:00
KevinHuSh	d7c362f237	adjust hierarchical_merge strategy (#100 )	2024-03-06 09:09:16 +08:00
KevinHuSh	602038ac49	fix task canceling bug (#98 )	2024-03-05 16:33:47 +08:00
KevinHuSh	8a57f2afd5	change callback strategy, add timezone to docker (#96 )	2024-03-05 12:08:41 +08:00
KevinHuSh	685b4d8a95	fix table desc bugs, add positions to chunks (#91 )	2024-03-04 14:42:26 +08:00
KevinHuSh	7fd1eca582	init README of deepdoc, add picture processor. (#71 ) * init README of deepdoc, add picture processor. * add resume parsing	2024-02-23 18:28:12 +08:00
KevinHuSh	cacd36c5e1	use onnx models, new deepdoc (#68 )	2024-02-21 16:32:38 +08:00
KevinHuSh	a8294f2168	Refine resume parts and fix bugs in retrieval using sql (#66 )	2024-02-19 19:22:17 +08:00
KevinHuSh	407b2523b6	remove unused codes, separate layout detection out as a new api. Add new rag method 'table' (#55 )	2024-02-05 18:08:17 +08:00
KevinHuSh	51482f3e2a	Some document API refined. (#53 ) Add naive chunking method to RAG	2024-02-02 19:21:37 +08:00

54 Commits