ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-01 16:25:44 +08:00

Author	SHA1	Message	Date
Jack	c1941fd503	Refactor: deco doc-parse API that is not used any more (#14367 ) ### What problem does this PR solve? Delete un-used API "POST /v1/document/parse" ### Type of change - [x] Refactoring	2026-04-27 18:54:49 +08:00
Jack	61a24a2c14	Refactor: migrate doc upload info used in chat (#14359 ) ### What problem does this PR solve? Before migration: POST /v1/document/upload_info/ After migration: POST /api/v1/documentss/upload/ ### Type of change - [x] Refactoring	2026-04-27 16:58:42 +08:00
Wang Qi	d88f7ac8d2	Remove evaluation_app.py and kb_app.py (#14394 ) ### What problem does this PR solve? Delete not used APIs ### Type of change - [x] Refactoring	2026-04-27 16:08:54 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer** \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
buua436	0b46ab07c5	Refa: restore openai-compatible chat completions api (#14380 ) ### What problem does this PR solve? restore openai-compatible chat completions api ### Type of change - [x] Refactoring	2026-04-27 14:02:19 +08:00
buua436	a9e5724b46	Refa: unify document create flows under REST documents API (#14345 ) ### What problem does this PR solve? unify document create flows under REST documents API ### Type of change - [x] Refactoring	2026-04-27 10:18:16 +08:00
euvre	4dcc42e0e1	feat(api): add unified index API and dataset management endpoints (#14222 ) ### What problem does this PR solve? ## Summary Refactor the dataset API layer into a clean service/REST separation pattern, add a unified `/index` API for graph/raptor/mindmap operations, and introduce several new dataset management endpoints with full test coverage. ## Changes ### Service Layer (`dataset_api_service.py`) - Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace function for all index types - Added `run_index`, `delete_index` service functions - Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`, `get_ingestion_log` - Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`, `rename_tag` - Added `get_flattened_metadata`, `get_auto_metadata`, `update_auto_metadata` ### REST API Layer (`dataset_api.py`) New unified routes: \| Method \| Route \| Description \| \|--------\|-------\|-------------\| \| POST \| `/datasets/<id>/index?type=graph\\|raptor\\|mindmap` \| Run index task \| \| GET \| `/datasets/<id>/index?type=graph\\|raptor\\|mindmap` \| Trace index task \| \| DELETE \| `/datasets/<id>/<index_type>` \| Delete index \| \| GET \| `/datasets/<id>` \| Get dataset details \| \| GET \| `/datasets/<id>/ingestions/summary` \| Ingestion summary \| \| GET \| `/datasets/<id>/ingestions` \| List ingestion logs \| \| GET \| `/datasets/<id>/ingestions/<log_id>` \| Get single ingestion log \| \| POST \| `/datasets/<id>/embedding` \| Run embedding \| \| GET \| `/datasets/<id>/tags` \| List tags \| \| GET \| `/datasets/tags/aggregation` \| Aggregate tags across datasets \| \| DELETE \| `/datasets/<id>/tags` \| Delete tags \| \| PUT \| `/datasets/<id>/tags` \| Rename tag \| \| GET \| `/datasets/metadata/flattened` \| Get flattened metadata \| \| GET/PUT \| `/datasets/<id>/metadata/config` \| New metadata config path \| Removed routes (replaced by unified `/index`): - `POST /datasets/<id>/mindmap` - `GET /datasets/<id>/mindmap` Preserved legacy routes (backward compatibility): - `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor` - `/auto_metadata` GET/PUT ### Test Suite - Updated `common.py` helpers: added `trace_index`, removed `run_mindmap`/`trace_mindmap` - Added 7 new test files with 39 test cases total: \| Test File \| Cases \| \|-----------\|-------\| \| `test_get_dataset.py` \| 4 \| \| `test_ingestion_summary.py` \| 2 \| \| `test_ingestion_logs.py` \| 5 \| \| `test_index_api.py` \| 14 \| \| `test_embedding.py` \| 2 \| \| `test_tags.py` \| 8 \| \| `test_flattened_metadata.py` \| 4 \| - Deleted `test_mindmap_tasks.py` (covered by unified index tests) ## Design Decisions 1. Unified `/index?type=...` — single endpoint replaces 3 separate route pairs for graph/raptor/mindmap 2. Backward compatibility — old routes (`/run_graphrag`, `/run_raptor`, `/auto_metadata`) preserved alongside new paths 3. `_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}` — input validation via constant set 4. `_INDEX_TYPE_TO_TASK_ID_FIELD` — maps index type to KB model task ID field for clean dispatch ## Files Changed - `api/apps/restful_apis/dataset_api.py` - `api/apps/services/dataset_api_service.py` - `sdk/python/ragflow_sdk/modules/dataset.py` - `test/testcases/test_http_api/common.py` - `test/testcases/test_http_api/test_dataset_management/` (7 new files) ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 09:38:01 +08:00
Xing Hong	fb95136f39	Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090 ) ### What problem does this PR solve? The POST /upload_info?url=<url> endpoint accepted a user-supplied URL and passed it directly to AsyncWebCrawler without any validation. There were no restrictions on URL scheme, destination hostname, or resolved IP address. This allowed any authenticated user to instruct the server to make outbound HTTP requests to internal infrastructure — including RFC 1918 private networks, loopback addresses, and cloud metadata services such as http://169.254.169.254 — effectively using the server as a proxy for internal network reconnaissance or credential theft. This PR adds an SSRF guard (_validate_url_for_crawl) that runs before any crawl is initiated. It enforces an allowlist of safe schemes (http/https), resolves the hostname at validation time, and rejects any URL whose resolved IP falls within a private or reserved network range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-25 14:30:15 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. --- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
buua436	9ad752f497	Refa：migrate agent webhook routes to REST APIs (#14330 ) ### What problem does this PR solve? migrate agent webhook routes to REST APIs ### Type of change - [x] Refactoring	2026-04-24 17:55:53 +08:00
Wang Qi	199fbceb72	Refactor user REST API (#14334 ) ### What problem does this PR solve? Refactor user REST API ### Type of change - [x] Refactoring	2026-04-24 10:25:15 +08:00
Magicbook1108	c74aece63c	Feat: Agent api (#14157 ) ### What problem does this PR solve? 1. List agents Prev API: - `/v1/canvas/list GET` - `/api/v1/agents GET` Current API: `/api/v2/agents GET` 2. Get canvas template Prev API: `/v1/canvas/templates GET` Current API: `/api/v2/agents/templates GET` 3. Delete an agent Prev API: - `/v1/canvas/rm POST` - `/api/v1/agents/<agent_id> DELETE` Current API: `/api/v2/agents/<agent_id> DELETE` 4. Update an agent Prev API: - `/api/v1/agents/<agent_id> PUT` - `/v1/canvas/setting POST ` Current API: `/api/v2/agents/<agent_id> PATCH` 5. Create an agent Prev API: - `/v1/canvas/set POST` - `/api/v1/agents POST` Current API: `/api/v2/agents POST` 6. Get an agent Prev API: - `/v1/canvas/get/<canvas_id> GET ` Current API: `/api/v2/agents/<agent_id> GET` 7. Reset an agent Prev API: - `/v1/canvas/reset POST` Current API: `/api/v2/agents/<agent_id>/reset POST` 8. Upload a file to an agent Prev API: - `/v1/canvas/upload/<canvas_id> POST` Current API: `/api/v2/agents/<agent_id>/upload POST` 9. Input form Prev API: - `/v1/canvas/input_form GET` Current API: `/api/v2/agents/<agent_id>/components/<component_id>/input-form GET` 10. Debug an agent Prev API: - `/v1/canvas/debug POST` Current API: `/api/v2/agents/<agent_id>/components/<component_id>/debug POST` 11. Trace an agent Prev API: - `/v1/canvas/trace GET` Current API: `/api/v2/agents/<agent_id>/logs/<message_id> GET` 12. Get an agent version list Prev API: - `/v1/canvas/getlistversion/<canvas_id>` Current API: `/api/v2/agents/<agent_id>/versions GET` 13. Get a version of agent Prev API: - `/v1/canvas/getversion/<version_id>` Current API: `/api/v2/agents/<agent_id>/versions/<version_id> GET` 14. Test db connection Prev API: - `/v1/canvas/test_db_connect POST` Current API: `/api/v2/agents/test_db_connection` 15. Rerun the agent Prev API: - `/v1/canvas/rerun POST` Current API: `/api/v2/agents/rerun POST` 16. Get prompts Prev API: - `/v1/canvas/prompts GET` Current API: `/api/v2/agents/prompts GET` ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chanx <1243304602@qq.com>	2026-04-24 10:02:22 +08:00
Wang Qi	ba47c13eb5	Fix commit override from #14298 of api-key to api_key (#14328 ) ### What problem does this PR solve? Fix commit override from https://github.com/infiniflow/ragflow/pull/14298/ of `api-key` to `api_key` ### Type of change - [x] Refactoring	2026-04-23 17:16:32 +08:00
Wang Qi	4458763a93	API refactor: stats_api and plugin_api (#14324 ) ### What problem does this PR solve? API refactor: stats_api and plugin_api ### Type of change - [x] Refactoring	2026-04-23 17:16:04 +08:00
buua436	7817b0d779	Refa: migrate chunk APIs to RESTful routes (#14291 ) ### What problem does this PR solve? migrate chunk APIs to RESTful routes ### Type of change - [x] Refactoring	2026-04-23 14:17:23 +08:00
Magicbook1108	76b017ca32	Refact: system apis (#14298 ) ### What problem does this PR solve? Refact: system apis ### Type of change - [x] Refactoring	2026-04-23 14:09:42 +08:00
buua436	aa4526266f	Refa: migrate MCP APIs to RESTful api (#14317 ) ### What problem does this PR solve? migrate MCP APIs to RESTful api ### Type of change - [x] Refactoring	2026-04-23 12:51:27 +08:00
Jack	dbf8c6ed90	Refactor: Doc metadata update (#14289 ) ### What problem does this PR solve? Before migration Web API: POST /v1/document/metadata/update After migration, Restful API PATCH /api/v2/datasets/<dataset_id>/documents/metadatas ### Type of change - [x] Refactoring	2026-04-23 12:04:34 +08:00
Wang Qi	aae45b959b	Refactor: API file2document (#14306 ) Refactor: API file2document	2026-04-23 11:40:45 +08:00
Wang Qi	e79b896637	Refactor: REST API langfuse api-key (#14315 ) REST API langfuse api-key	2026-04-23 11:36:16 +08:00
Wang Qi	01753b8f31	Refactor: API connectors (#14228 ) ### What problem does this PR solve? Refactor /api/v1/connectors to be more RESTful. ### Type of change - [x] Refactoring	2026-04-22 20:42:41 +08:00
Jack	c08cd8e090	Refactor: Migrate document metadata config update API (#14286 ) ### What problem does this PR solve? Before migration Web API: POST /v1/document/update_metadata_setting After consolidation, Restful API PUT /api/v1/datasets/<dataset_id>/documents/<document_id>/metadata/config ### Type of change - [x] Refactoring	2026-04-22 20:01:31 +08:00
Magicbook1108	d1c62fc19d	Refact: Tenant api (#14288 ) ### What problem does this PR solve? Refact: Tenant api ### Type of change - [x] Refactoring	2026-04-22 20:00:32 +08:00
Jack	3d8a82c0aa	Refactor: Consolidation WEB API & HTTP API for document delete api (#14254 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/rm Http API - DELETE /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- DELETE /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-22 10:49:52 +08:00
buua436	6baf74afc1	Refa: align chat and search restful APIs (#14229 ) ### What problem does this PR solve? Refactor /api/v1/chats to be more RESTful. ### Type of change - [x] Refactoring --------- Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-22 10:49:11 +08:00
Jack	2d05475693	Refactor: Consolidation WEB API & HTTP API for document infos (#14239 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/infos Http API - GET /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents?ids=id1&ids=id2 ### Type of change - [ ] Refactoring	2026-04-21 19:35:11 +08:00
Jack	009e538a4e	Refactor: Consolidation WEB API & HTTP API for document get_filter (#14248 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/filter Http API - GET /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents?type=filter ### Type of change - [x] Refactoring	2026-04-21 18:55:30 +08:00
Liu An	a33d0737cd	Docs: Update version references to v0.25.0 in READMEs and docs (#14257 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.24.0 to v0.25.0 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-04-21 17:26:50 +08:00
Liu An	6e33d8722f	Revert "Fix: forwarding highlight param" (#14249 ) Reverts infiniflow/ragflow#14112	2026-04-21 15:23:18 +08:00
Jack	939933649a	Refactor: Consolidation WEB API & HTTP API for document list_docs (#14176 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/list Http API - GET /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-20 14:54:40 +08:00
Liu An	d5c306de30	Fix: remove unit test checkpoint resume (#14216 ) ### What problem does this PR solve? remove unit test checkpoint resume ### Type of change - [x] Performance Improvement	2026-04-20 11:27:40 +08:00
Daniil Sivak	22c6648348	Fix: forwarding highlight param (#14112 ) Closes #9078 ### What problem does this PR solve? The `retrieval_test` endpoint in `chunk_app.py` never forwarded the `highlight` request parameter to `retriever.retrieval()`, so the search engine never produced highlight snippets. Additionally, the frontend always rendered `content_with_weight` instead of preferring the `highlight` field, and the CSS rule color `var(--accent-primary)` didn't work because the variable stores an RGB triplet `(45,212,191)` requiring the `rgb()` wrapper. ### Before - Search page: displayed raw content_with_weight as a wall of plain white text with no term highlighting, including markdown headings rendered as literal text - Retrieval testing page: showed `content_with_weight` in a plain `<p>` tag, no `<em>` tags rendered, no highlight coloring - Children chunks: when child chunks were consolidated into a parent via `retrieval_by_children`, any highlight data from children was discarded - TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight` field, appearing as plain text while other chunks had highlights Retrieval testing: <img width="1449" height="1178" alt="before-retrieval-no-highlight-cropped" src="https://github.com/user-attachments/assets/5c6f5a5e-6c11-461a-bdb4-049d7dfb7a33" /> Search: <img width="1378" height="711" alt="before-search-no-highlight-cropped" src="https://github.com/user-attachments/assets/be7b5152-72ef-40da-a8fd-921e997ae7d3" /> ### After - Search page: displays the highlight field with search terms rendered in teal/cyan color (`rgb(var(--accent-primary))`) - Retrieval testing page: sends highlight: true in the request, uses `HighLightMarkdown` component to render `<em>` tags with proper coloring - Children chunks: highlights from child chunks are joined and preserved on the parent - TOC chunks: when other chunks have highlights, TOC-fetched chunks use `content_with_weight` as a highlight fallback Retrieval testing: <img width="1410" height="1015" alt="05-retrieval-testing-results" src="https://github.com/user-attachments/assets/f0cff8cf-0962-4320-b559-cd5037f622d2" /> Search: <img width="1294" height="455" alt="03-search-highlight-results" src="https://github.com/user-attachments/assets/a90e0e3e-3837-46be-8ddd-2412ff7cbc19" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-17 20:59:20 +08:00
Daniil Sivak	c93ec0a1f3	Fix: reject empty/space-only content in update_chunk API (#14082 ) Closes #6541 ### What problem does this PR solve? Add content validation to `update_chunk` (SDK and non-SDK) to reject empty or whitespace-only content before it reaches the embedding model. Before: Calling `update_chunk` with space-only content (like `" "`, `""`, `"\n"`) bypassed validation and was sent directly to the embedding model, which returned an error. This was the same bug previously fixed for `add_chunk` in #6390, but `update_chunk` was missed. After: Empty/whitespace-only content is caught by validation and returns an error: `` `content` is required `` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-15 18:43:53 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096 ) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested — the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch — subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes \| File \| Change \| \|---\|---\| \| `api/db/services/task_checkpoint.py` (new) \| `load_subgraph_from_store()` and `has_raptor_chunks()` — docEngine-based checkpoint queries \| \| `rag/graphrag/general/index.py` \| `build_one()` calls `load_subgraph_from_store()` before running LLM extraction \| \| `rag/svr/task_executor.py` \| RAPTOR per-doc loop calls `has_raptor_chunks()` before processing \| \| `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new) \| 10 unit tests covering subgraph loading, source_id filtering, edge cases \| ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips. ### Testing - 10 new unit tests — all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
Ea001	38cefd88e2	Fix tag_feas code injection in retrieval ranking (#13923 ) ## Summary - remove eval-based parsing from retrieval rank feature scoring - validate `tag_feas` at write time in chunk APIs and SDK routes - add regression tests for safe parsing and malicious payload rejection ## Details `tag_feas` is intended to be structured rank-feature data, but the retrieval ranking path was evaluating stored values as Python expressions. This change treats `tag_feas` strictly as data. ### What changed - replace `eval()` in `rag/nlp/search.py` with safe parsing via `json.loads()` and optional `ast.literal_eval()` compatibility for legacy Python-dict strings - strictly filter parsed values down to `dict[str, finite number]` - reject invalid `tag_feas` payloads at write time in web chunk routes and SDK document chunk routes - add focused regression tests to prove executable strings are ignored and invalid payloads are rejected ## Validation - `python -m pytest test/unit_test/common/test_tag_feature_utils.py test/unit_test/rag/test_rank_feature_scores.py -q` --------- Co-authored-by: unknown <zhenglinkai@CCN.Local> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-15 16:31:11 +08:00
Eden	1f33ca1099	fix(dialog): restore decorated answer in async_ask final SSE event (#13917 ) ## What's the problem Both `async_chat()` and `async_ask()` call `decorate_answer()` to build the final SSE payload — it inserts citation markers (`##N$$`) into the answer text and prunes `doc_aggs` to only the cited documents. Immediately after, both functions overwrite `final["answer"]` with `""`: ```python # async_chat(), line ~774 (issue #13828) final = decorate_answer(thought + full_answer) final["final"] = True final["audio_binary"] = None final["answer"] = "" # discards decorated text yield final # async_ask(), line ~1444 (same bug, different path) final = decorate_answer(full_answer) final["final"] = True final["answer"] = "" # discards decorated text yield final ``` The client receives filtered references (built for a citation-decorated answer it never sees) while displaying the raw, undecorated streaming text. Citations can never match. ## Root cause `final["answer"] = ""` was left over from an earlier design where clients were meant to reconstruct the full answer purely from delta events. Once `decorate_answer()` started placing citation markers, this blank-out broke the contract: the final event is where the decorated answer should land. ## Fix Remove the two blank-override lines — one in `async_chat()`, one in `async_ask()`: ```diff - final["answer"] = "" ``` `decorate_answer()` already sets `final["answer"]` to the correct decorated string; there is nothing to override. ## Relation to #13828 Issue #13828 and PR #13835 identify the bug in `async_chat()`. This PR absorbs that fix and also corrects the identical pattern in `async_ask()` (used by the `/retrieval` route in `chat_api.py`), which PR #13835 does not touch. ## Regression test Added `test/unit_test/api/db/services/test_dialog_service_final_answer.py` with three tests: \| Test \| Purpose \| \|------\|---------\| \| `test_buggy_pattern_drops_answer` \| Documents the old behaviour: blank-override empties the final answer \| \| `test_fixed_pattern_preserves_decorated_answer` \| Core invariant: final event carries the decorated text from `decorate_answer()` \| \| `test_final_event_reference_matches_decorated_result` \| Citation markers in the answer must match the pruned `doc_aggs` in the same event \| Local run result: ``` test_dialog_service_final_answer.py::test_buggy_pattern_drops_answer PASSED test_dialog_service_final_answer.py::test_fixed_pattern_preserves_decorated_answer PASSED test_dialog_service_final_answer.py::test_final_event_reference_matches_decorated_result PASSED 3 passed in 0.04s ``` `ruff check` passes with no issues on all changed files. --------- Co-authored-by: edenfunf <edenfunf@gmail.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-04-15 14:10:36 +08:00
Jack	bc5f78996b	Consolidateion of document upload API (#14106 ) ### What problem does this PR solve? Consolidation WEB API & HTTP API for document upload Before consolidation Web API: POST /v1/document/upload Http API - POST /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- POST /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-15 11:27:43 +08:00
Jin Hai	8e9cef3687	Remove unused API (#14046 ) ### What problem does this PR solve? 1. Remove unused token related API 2. Fix typo ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-04-14 19:32:16 +08:00
Jack	576431de99	Refactor: Change update doc from PUT to patch (#14067 ) ### What problem does this PR solve? Before change, update_document in api/apps/restful_apis/document_api.py is using "PUT". After change, it will use "PATCH" which is more suitable. ### Type of change - [x] Refactoring	2026-04-14 17:12:23 +08:00
Idriss Sbaaoui	d6987b4d8f	Fix p3 ci fails (#14069 ) ### What problem does this PR solve? fix issue with stale tests on p3 level ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 10:47:07 +08:00
bitloi	853021ff2a	feat: support multiple canvas_types for agent templates and remove duplicate files (#14030 ) ### What problem does this PR solve? Closes #13907 The template catalog had duplicate files (e.g. `*_r.json`) only to place the same template into multiple sidebar groups. This increases maintenance cost and makes template updates error-prone. This PR adds first-class support for multiple template categories in a single file via `canvas_types`, then removes duplicate template files. What changed: - Added `canvas_types` to `CanvasTemplate` model and DB migration. - Added normalization logic when loading templates: - accepts legacy `canvas_type` - accepts new `canvas_types` - merges/deduplicates values - preserves backward compatibility by keeping `canvas_type` as first normalized value. - Updated template import flow to load only `.json` files and in stable sorted order. - Updated frontend template filtering to match on `canvas_types` first, with fallback to legacy `canvas_type`. - Consolidated duplicated template pairs into single files and removed: - `deep_search_r.json` - `reflective_academic_paper_generator_r.json` - `seo_article_writer_r.json` - Added regression/edge-case tests for category normalization and route serialization expectations. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-13 20:26:30 +08:00
Tong Liu	6fdca2d212	[Security] Fix jinja2 SSTI vulnerability using SandboxedEnvironment (#14068 )	2026-04-13 19:24:13 +08:00
Jack	51ce6aab01	Consolidate set_meta into update_document (#14045 ) ### What problem does this PR solve? Consolidate "set_meta" API into "update_document" . Before consolidation Web API: POST /api/v1/document/set_meta Http API - PUT /v1/datasets/<dataset_id>/document/<document_id> After consolidation, Restful API -- PUT /v1/datasets/<dataset_id>/document/<document_id> ### Type of change - [x] Refactoring	2026-04-13 12:47:17 +08:00
Jack	4046a4cfb6	Consolidateion metadata summary API (#14031 ) ### What problem does this PR solve? Consolidation WEB API & HTTP API for document metadata summary Before consolidation Web API: POST /api/v1/document/metadata/summary Http API - GET /v1/datasets/<dataset_id>/metadata/summary After consolidation, Restful API -- GET /v1/datasets/<dataset_id>/metadata/summary ### Type of change - [x] Refactoring	2026-04-10 18:41:30 +08:00
Yongteng Lei	b33d2fdea5	Refa: GraphRAG to use async chat methods instead of thread pool execution (#14002 ) ### What problem does this PR solve? GraphRAG _async_chat. ### Type of change - [x] Refactoring - [x] Performance Improvement <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Refactor * Unified chat calls to an async invocation across extractors, improving timeout handling and ensuring task IDs propagate reliably. * Tests * Added and expanded unit tests and mocks to cover extractor behavior, timeout scenarios, and safe test-package imports, reducing regression risk. <!-- end of auto-generated comment: release notes by coderabbit.ai -->	2026-04-09 19:57:35 +08:00
Zhichang Yu	b7744e053e	fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972 ) fix: support dense_vector from ES fields response (ES 9.x compatibility) - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Configuration Chore (non-breaking change which updates configuration) ## Summary by CodeRabbit * Bug Fixes * More accurate handling and unwrapping of dense-vector fields so returned values have correct shapes. * Field selection reliably limits returned data and falls back to alternate result locations when needed. * Use of consistent result IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased build memory and adjusted build-time flags for the frontend build. * Simplified runtime model/GPU checks and removed an automated runtime GPU-install attempt. * Build Fixes * `web/vite.config.ts`: make `build.minify` and `build.sourcemap` respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from Dockerfile instead of hardcoding `terser` and `true`. * Environment * Allow stack version override and default the runtime image tag to "latest". <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Bug Fixes * Correct unwrapping of dense-vector fields and reliable field selection with fallback locations. * Consistent use of hit-level IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased frontend build memory and added build-time minify/sourcemap flags; build minification and sourcemap now configurable. * Removed runtime GPU detection for model initialization; force CPU initialization. * Environment * Allow stack version override and default runtime image tag to "latest". <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 17:44:13 +08:00
Jack	577c96bf2a	Refactor: Merge document update API (#13962 ) ### What problem does this PR solve? Refactor: merge document.rename into document.update_document ### Type of change - [x] Refactoring <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Added a unified document update API (PUT) supporting name, metadata, parser/chunk settings, and status changes. * Breaking Changes * Legacy single-parameter rename endpoint removed; renames now require dataset + document identifiers. * `/list` now reads dataset id from a different query parameter. * Validation / Bug Fixes * Stricter meta_fields and parser-config validation; unauthenticated requests return 401. * Frontend * UI now sends dataset id when saving document names. * Tests * Numerous unit and HTTP tests adjusted or removed to match new API and validations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: MkDev11 <94194147+MkDev11@users.noreply.github.com> Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com> Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com> Co-authored-by: Qi Wang <wangq8@outlook.com> Co-authored-by: dataCenter430 <161712630+dataCenter430@users.noreply.github.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com>	2026-04-09 11:17:38 +08:00
Magicbook1108	c5871c1078	Fix: dsl import/export (#13992 ) ### What problem does this PR solve? Fix: dsl import/export ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Enhanced JSON import functionality for agents to automatically populate components from imported graph structures. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 10:55:22 +08:00
Jin Hai	fa75aee3b9	Refactor system API (#13958 ) ### What problem does this PR solve? - ping - token - log level ### Type of change - [x] Refactoring <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Refactor * System endpoints consolidated under /api/v1/system: ping, health check, and token management moved to the centralized API surface. * Token management unified at /api/v1/system/tokens with list/create/delete behavior. * Documentation * API reference updated to reflect the new /api/v1/system paths. * Tests * Client fixtures and test utilities updated to use /api/v1/system/tokens; one unit test for health/oceanbase status removed. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-04-08 15:26:18 +08:00
Jin Hai	ad789f5c43	Fix list files (#13960 ) ### What problem does this PR solve? As title. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Bug Fixes * Standardized the query parameter used when listing documents so listings behave consistently across the web and client interfaces. * Clarified the error message shown when a required dataset ID is missing to give clearer guidance to users. * Tests * Updated test coverage to reflect the standardized dataset identifier usage. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-04-08 13:38:30 +08:00

1 2 3 4 5

208 Commits