ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
rhinoceros.xn	4e992de91f	Add tongyi gte-rerank-v2 (#14215 ) https://bailian.console.aliyun.com/cn-beijing?tab=api#/api/?type=model&url=2780056 ### What problem does this PR solve? Adds the Tongyi rerank models gte-rerank-v2 and qwen3-rerank. ### Type of change - [x] Other (please describe): add gte-rerank-v2, qwen3-rerank	2026-04-20 11:39:17 +08:00
Daniil Sivak	22c6648348	Fix: forwarding highlight param (#14112 ) Closes #9078 ### What problem does this PR solve? The `retrieval_test` endpoint in `chunk_app.py` never forwarded the `highlight` request parameter to `retriever.retrieval()`, so the search engine never produced highlight snippets. Additionally, the frontend always rendered `content_with_weight` instead of preferring the `highlight` field, and the CSS rule color `var(--accent-primary)` didn't work because the variable stores an RGB triplet `(45,212,191)` requiring the `rgb()` wrapper. ### Before - Search page: displayed raw `content_with_weight` as a wall of plain white text with no term highlighting, including markdown headings rendered as literal text - Retrieval testing page: showed `content_with_weight` in a plain `<p>` tag, no `<em>` tags rendered, no highlight coloring - Children chunks: when child chunks were consolidated into a parent via `retrieval_by_children`, any highlight data from children was discarded - TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight` field, appearing as plain text while other chunks had highlights [Screenshots: retrieval testing and search, before the fix] ### After - Search page: displays the `highlight` field with search terms rendered in teal/cyan color (`rgb(var(--accent-primary))`) - Retrieval testing page: sends `highlight: true` in the request, uses `HighLightMarkdown` component to render `<em>` tags with proper coloring - Children chunks: highlights from child chunks are joined and preserved on the parent - TOC chunks: when other chunks have highlights, TOC-fetched chunks use `content_with_weight` as a highlight fallback [Screenshots: retrieval testing and search, after the fix] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-17 20:59:20 +08:00
Yongteng Lei	fac46ef67f	Refa: change Minimax base url to mainland by default to align with UI (#14195 ) ### What problem does this PR solve? Change Minimax base url to mainland by default to align with UI. ### Type of change - [x] Refactoring	2026-04-17 19:08:57 +08:00
euvre	0cd49e14dd	fix: make Infinity connection pool size configurable and add retry logic for GraphRAG write bursts (#14143 ) ### What problem does this PR solve? Resolve #14137 . ### Problem Graph resolution succeeds (nodes/edges merged, pagerank updated), but the subsequent burst of Infinity write operations in `set_graph` exhausts the connection pool with `TOO_MANY_CONNECTIONS` errors. Root causes: 1. Hardcoded pool size — `infinity_conn_pool.py` hardcoded `ConnectionPool(max_size=4)` on initial creation and `max_size=32` on refresh. Operators cannot tune this without patching code. 2. No retry on transient failures — a single `TOO_MANY_CONNECTIONS` on edge deletes or chunk inserts kills the entire resolution+community pipeline with no retry. ### Changes #### `common/doc_store/infinity_conn_pool.py` - Read `ConnectionPool` `max_size` from the `INFINITY_POOL_MAX_SIZE` environment variable (default: `4`), applied consistently to both initial creation and refresh paths. - Log the actual pool size on startup for easier debugging. #### `rag/graphrag/utils.py` — `set_graph()` - Edge deletes: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) so transient `TOO_MANY_CONNECTIONS` errors are retried instead of failing the entire job. Concurrency continues to be gated by the existing `chat_limiter`. - Batch inserts: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) for the same reason. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-16 15:40:54 +08:00
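The two changes described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `pool_max_size` and `retry_with_backoff` are hypothetical, and reading "3 attempts, 1s/2s/4s delays" as three retries after the first failure is an assumption.

```python
import os
import time


def pool_max_size(env=os.environ, default=4):
    """Read the pool size from INFINITY_POOL_MAX_SIZE, falling back to the default of 4."""
    raw = env.get("INFINITY_POOL_MAX_SIZE", "")
    try:
        size = int(raw)
    except ValueError:
        return default
    # Assumption: non-positive values are treated as "not configured".
    return size if size > 0 else default


def retry_with_backoff(op, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call op(); on failure wait 1s, 2s, 4s (doubling) before each retry."""
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception:
            # Out of retries: let the last error propagate to the caller.
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

A real implementation would probably retry only on transient errors such as `TOO_MANY_CONNECTIONS` instead of on every exception.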
Qi Wang	969ce3a79f	[Bug fix #14133 ] fix graph rag, raptor, mindmap log cannot show correctly in UI (#14136 ) ### What problem does this PR solve? Fixes #14133: the knowledge graph, RAPTOR, and mind map logs were not displayed correctly in the UI. [Screenshot: before the fix] After the fix: [two screenshots] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-16 13:08:36 +08:00
Magicbook1108	944a90d645	Feat: add button to turn off vlm parsing (#14125 ) ### What problem does this PR solve? Feat: add button to turn off vlm parsing ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chanx <1243304602@qq.com>	2026-04-15 19:06:00 +08:00
Magicbook1108	d51789e2be	Feat: update templates && add resume template (#14124 ) ### What problem does this PR solve? Feat: update templates && add resume template ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-15 18:42:29 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096 ) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested: the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch: subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes - `api/db/services/task_checkpoint.py` (new): `load_subgraph_from_store()` and `has_raptor_chunks()` (docEngine-based checkpoint queries) - `rag/graphrag/general/index.py`: `build_one()` calls `load_subgraph_from_store()` before running LLM extraction - `rag/svr/task_executor.py`: RAPTOR per-doc loop calls `has_raptor_chunks()` before processing - `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new): 10 unit tests covering subgraph loading, source_id filtering, edge cases ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips. ### Testing - 10 new unit tests, all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
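The "match `source_id` in Python" step from this commit could look something like the sketch below. The function name `find_subgraph_for_doc` and the shape of `hits` (a list of dicts already filtered by `knowledge_graph_kwd`) are assumptions for illustration, not the PR's actual code.

```python
def find_subgraph_for_doc(hits, doc_id):
    """Return the first stored subgraph whose source_id refers to doc_id, else None.

    `hits` is assumed to be a list of dicts already filtered by knowledge_graph_kwd.
    """
    for hit in hits:
        source_id = hit.get("source_id", [])
        # Tolerate both the stored list form and a bare string form.
        if isinstance(source_id, str):
            source_id = [source_id]
        if doc_id in source_id:
            return hit
    return None
```

Matching in Python avoids depending on how each backend compares a string against a list-valued field.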
Ea001	38cefd88e2	Fix tag_feas code injection in retrieval ranking (#13923 ) ## Summary - remove eval-based parsing from retrieval rank feature scoring - validate `tag_feas` at write time in chunk APIs and SDK routes - add regression tests for safe parsing and malicious payload rejection ## Details `tag_feas` is intended to be structured rank-feature data, but the retrieval ranking path was evaluating stored values as Python expressions. This change treats `tag_feas` strictly as data. ### What changed - replace `eval()` in `rag/nlp/search.py` with safe parsing via `json.loads()` and optional `ast.literal_eval()` compatibility for legacy Python-dict strings - strictly filter parsed values down to `dict[str, finite number]` - reject invalid `tag_feas` payloads at write time in web chunk routes and SDK document chunk routes - add focused regression tests to prove executable strings are ignored and invalid payloads are rejected ## Validation - `python -m pytest test/unit_test/common/test_tag_feature_utils.py test/unit_test/rag/test_rank_feature_scores.py -q` --------- Co-authored-by: unknown <zhenglinkai@CCN.Local> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-15 16:31:11 +08:00
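As a rough illustration of the parsing strategy described above (JSON first, `ast.literal_eval()` as a legacy fallback, then filtering to `dict[str, finite number]`), here is a minimal sketch. The function name `parse_tag_feas` and the choice to return an empty dict on any failure are assumptions, not the PR's code.

```python
import ast
import json
import math


def _is_finite_number(value):
    """True for int/float values that are finite; booleans are rejected."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    try:
        return math.isfinite(value)
    except OverflowError:  # integers too large to represent as a float
        return False


def parse_tag_feas(raw):
    """Parse tag_feas strictly as data; never evaluate it as code."""
    if isinstance(raw, dict):
        parsed = raw
    elif isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except ValueError:
            # Legacy compatibility: Python-dict strings such as "{'a': 1}".
            try:
                parsed = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
                return {}
    else:
        return {}
    if not isinstance(parsed, dict):
        return {}
    return {
        key: value
        for key, value in parsed.items()
        if isinstance(key, str) and _is_finite_number(value)
    }
```

Because `ast.literal_eval()` accepts only literals, a payload containing a function call is rejected and never executed.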
NeedmeFordev	1a1b5aa53e	Fix: respect the internet toggle before running Tavily web search (#14051 ) (#14052 ) ### What problem does this PR solve? Fixes #14051. The chat UI already sends an `internet` flag with each request, but the backend previously triggered Tavily web retrieval whenever `prompt_config.tavily_api_key` was configured. As a result, web search could still run even when the internet toggle was off. This PR makes web search an explicit opt-in at request time: - `tavily_api_key` only indicates that web search is available - Tavily retrieval runs only when `internet` is explicitly enabled - the same behavior now applies to both the normal retrieval path and the deep-research / reasoning path This also fixes the no-KB fallback case so chats without KBs fall back to normal solo chat when `internet` is off. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 19:55:20 +08:00
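The opt-in rule described above can be summarized in a few lines. This is a sketch with a hypothetical helper name (`should_run_web_search`); only the `tavily_api_key` and `internet` fields come from the PR description.

```python
def should_run_web_search(prompt_config, internet):
    """Web search runs only if a key is configured AND the request opts in."""
    has_key = bool(prompt_config.get("tavily_api_key"))
    # A missing or falsy `internet` flag means the toggle is off.
    return has_key and internet is True
```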
Idriss Sbaaoui	de6a8e789a	Fix: rerank overflow by enforcing top_k and 64 cap (#14084 ) ### What problem does this PR solve? This fixes rerank overflow where retrieval could send more documents than allowed (for example 66 when `page_size=6`), causing provider 400 errors and bypassing the user’s `top_k` intent in rerank-enabled paths. Fixes #14081. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 10:47:25 +08:00
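A minimal sketch of the capping rule named in the title, assuming the candidate list is simply truncated before it is sent to the rerank provider. The constant and function names are hypothetical, and treating a negative `top_k` as zero is an assumption.

```python
RERANK_MAX_DOCS = 64  # hard cap taken from the PR title


def limit_rerank_candidates(candidates, top_k):
    """Never send more than min(top_k, 64) documents to the reranker."""
    limit = max(0, min(top_k, RERANK_MAX_DOCS))
    return candidates[:limit]
```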
Tong Liu	6fdca2d212	[Security] Fix jinja2 SSTI vulnerability using SandboxedEnvironment (#14068 )	2026-04-13 19:24:13 +08:00
Zhichang Yu	a9ca4ea1a1	Disable flask and quart debug (#14042 ) ### What problem does this PR solve? Visiting `http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js` exposes the Flask debugger code: ``` docReady(() => { if (!EVALEX_TRUSTED) { initPinBox(); } // if we are in console mode, show the console. if (CONSOLE_MODE && EVALEX) { createInteractiveConsole(); } const frames = document.querySelectorAll("div.traceback div.frame"); if (EVALEX) { addConsoleIconToFrames(frames); } addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () => document.querySelector("div.traceback").scrollIntoView(false) ); addToggleFrameTraceback(frames); addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback")); addInfoPrompt(document.querySelectorAll("span.nojavascript")); wrapPlainTraceback(); }); function addToggleFrameTraceback(frames) { frames.forEach((frame) => { frame.addEventListener("click", () => { frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded"); }); }) } ``` ### Type of change - [x] Other (please describe): Fix security risk	2026-04-10 18:01:49 +08:00
Magicbook1108	18cafff790	Fix: markdown parser in pipeline (#14032 ) ### What problem does this PR solve? Fix: markdown parser in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-10 14:11:14 +08:00
Magicbook1108	87a87a7122	Feat: pipeline support ONE chunking method (#14024 ) ### What problem does this PR solve? Feat: pipeline support ONE chunking method ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-04-10 13:11:22 +08:00
Magicbook1108	27329b40ed	Refact: refact on parser structure (#14012 ) ### What problem does this PR solve? Refact: refact on parser structure ### Type of change - [x] Refactoring	2026-04-10 10:03:44 +08:00
Magicbook1108	52f5880d21	Fix: support vlm fall back in pipeline (#14007 ) ### What problem does this PR solve? Fix: support vlm fall back in pipeline for img/table parsing ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-09 20:20:11 +08:00
Yongteng Lei	b33d2fdea5	Refa: GraphRAG to use async chat methods instead of thread pool execution (#14002 ) ### What problem does this PR solve? Makes GraphRAG use async chat methods (`_async_chat`) instead of thread pool execution. ### Type of change - [x] Refactoring - [x] Performance Improvement ## Summary by CodeRabbit * Refactor * Unified chat calls to an async invocation across extractors, improving timeout handling and ensuring task IDs propagate reliably. * Tests * Added and expanded unit tests and mocks to cover extractor behavior, timeout scenarios, and safe test-package imports, reducing regression risk.	2026-04-09 19:57:35 +08:00
Octopus	c2ce49e037	fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969 ) Fixes #13823 ## Problem When querying with words like `cat`, RAGFlow's query expansion system looks up synonyms via WordNet, which can return terms containing single quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document store, these unescaped single quotes in the query string cause a `TokenError` because Infinity's lexer treats `'` as a string delimiter. ``` TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531 ``` ## Solution Strip single quotes from synonym terms before they are inserted into query expressions, consistent with how single quotes are already stripped from the input query text (line 51 of `query.py`): - `common/query_base.py`: In `sub_special_char()`, strip `'` before escaping other special characters. This fixes the Chinese text processing path and the `paragraph()` method. - `rag/nlp/query.py`: In the English text path, strip `'` from tokenized synonym terms. - `memory/services/query.py`: Same fix for the memory query English text path. ## Testing The fix can be verified by: 1. Using Infinity as the document store (`DOC_ENGINE=infinity`) 2. Creating a dataset and running a retrieval test with the keyword `cat` 3. Confirming no `TokenError` is raised and results are returned normally ## Summary by CodeRabbit * Bug Fixes * Enhanced special character handling in query processing and synonym expansion by properly sanitizing single quotes before text processing. * Simplified OCR detection output by removing timing metadata while preserving core detection accuracy. --------- Co-authored-by: ximi <octo-patch@github.com>	2026-04-09 19:10:34 +08:00
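The core of the fix, stripping `'` from synonym terms before they reach the query expression, might be sketched like this. The helper name `strip_single_quotes` is hypothetical, and dropping terms that become empty is an assumption.

```python
def strip_single_quotes(terms):
    """Remove single quotes from synonym terms; drop terms left empty."""
    cleaned = (term.replace("'", "") for term in terms)
    return [term for term in cleaned if term]
```

With this in place, a WordNet synonym such as `cat-o'-nine-tails` is emitted as `cat-o-nine-tails`, so the lexer never sees an unbalanced quote.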
Zhichang Yu	b7744e053e	fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972 ) fix: support dense_vector from ES fields response (ES 9.x compatibility) - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Configuration Chore (non-breaking change which updates configuration) ## Summary by CodeRabbit * Bug Fixes * More accurate handling and unwrapping of dense-vector fields so returned values have correct shapes. * Field selection reliably limits returned data and falls back to alternate result locations when needed. * Use of consistent result IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased build memory and adjusted build-time flags for the frontend build. * Simplified runtime model/GPU checks and removed an automated runtime GPU-install attempt. * Build Fixes * `web/vite.config.ts`: make `build.minify` and `build.sourcemap` respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from Dockerfile instead of hardcoding `terser` and `true`. * Environment * Allow stack version override and default the runtime image tag to "latest". ## Summary by CodeRabbit * Bug Fixes * Correct unwrapping of dense-vector fields and reliable field selection with fallback locations. * Consistent use of hit-level IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased frontend build memory and added build-time minify/sourcemap flags; build minification and sourcemap now configurable. * Removed runtime GPU detection for model initialization; force CPU initialization. * Environment * Allow stack version override and default runtime image tag to "latest". --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 17:44:13 +08:00
Magicbook1108	107fe6cf90	Feat: support doc for pipeline parser in word (#14005 ) ### What problem does this PR solve? Supports legacy `.doc` files in the Word pipeline parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Added support for processing legacy Word `.doc` file formats, extending document compatibility. * Bug Fixes * Enhanced error handling during document parsing to improve reliability and prevent processing failures.	2026-04-09 16:40:42 +08:00
Magicbook1108	8d52ef2893	Feat: enable sync deleted files for connector (#14000 ) ### What problem does this PR solve? Enables syncing of deleted files for connectors, starting with GitHub. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Added "sync deleted files" feature for data sources, enabling automatic removal of files deleted from the source system. * Added multilingual support for the new sync deleted files setting across multiple languages. * UI Improvements * Improved checkbox form field rendering and layout. * Enhanced full-width display for authentication token input fields.	2026-04-09 16:40:14 +08:00
MkDev11	cfee2bc9db	feat: Auto-adjust chunk recall weights based on user feedback (#12689 ) ### What problem does this PR solve? Implements automatic adjustment of knowledge base chunk recall weights based on user feedback (upvotes/downvotes). When users upvote or downvote a response, the system locates the corresponding knowledge snippets and adjusts their recall weight to improve future retrieval quality. Closes #12670 How it works: 1. User upvotes/downvotes a response via `POST /thumbup` 2. System extracts chunk IDs from the conversation reference 3. For each referenced chunk: - Reads current `pagerank_fea` value from document store - Increments (+1) for upvote or decrements (-1) for downvote - Clamps weight to [0, 100] range - Updates chunk in ES/Infinity/OceanBase 4. Future retrievals score these chunks higher/lower based on accumulated feedback Files changed: - `api/db/services/chunk_feedback_service.py` - New service for updating chunk pagerank weights - `api/apps/conversation_app.py` - Integrated feedback service into thumbup endpoint - `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Chat message feedback now updates per-chunk relevance weights (feature-flag gated), with configurable weighting and atomic updates across storage backends. * Bug Fixes * Stricter validation for message feedback inputs and more robust handling of feedback transitions. * Tests * Expanded test coverage for chunk-feedback behavior, weighting strategies, storage backends, and thumb-flip scenarios. * Chores * CI workflow extended to run the new chunk-feedback web API tests. --------- Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com> Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>	2026-04-08 09:52:18 +08:00
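Step 3 of "How it works" above (increment or decrement, then clamp) reduces to a small pure function. The sketch below uses a hypothetical name, `adjust_pagerank_weight`, and leaves out the document-store read and write.

```python
def adjust_pagerank_weight(current, upvote, low=0, high=100):
    """Apply +1 for an upvote or -1 for a downvote, clamped to [low, high]."""
    delta = 1 if upvote else -1
    return max(low, min(high, current + delta))
```

Clamping keeps repeated feedback on one chunk from pushing its weight outside the [0, 100] range.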
Yang_Ming	bc8d67ce78	feat: add region parameter support to MinIO connection (#13954 ) ## Summary - Add optional `region` parameter to `Minio()` client constructor in `rag/utils/minio_conn.py` - Reads from `MINIO.region` in settings, defaults to `None` when not configured - Required by some S3-compatible storage services (e.g., AWS S3, Tencent COS) for proper bucket access ## Motivation When using RAGFlow with S3-compatible storage that requires a region (such as AWS S3 or Tencent Cloud COS), the MinIO client fails to access buckets because the `region` parameter is not passed through. The `Minio()` Python client already supports the `region` parameter natively; this PR simply wires it up from the RAGFlow configuration. ## Changes - `rag/utils/minio_conn.py`: Pass `region=settings.MINIO.get("region", None) or None` to `Minio()` constructor ## Backward Compatibility - No breaking changes. When `region` is not configured, it defaults to `None`, preserving the existing behavior exactly. ## Test Plan - [ ] Verified with MinIO (no region set): works as before - [x] Verified with S3-compatible storage requiring region: bucket access succeeds ## Summary by CodeRabbit * Bug Fixes * Enhanced MinIO client initialization with regional configuration support for improved compatibility with region-specific deployments. Co-authored-by: Jarry Wang <code-better-life@users.noreply.github.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-07 16:38:23 +08:00
Ricardo-M-L	424aee5bec	fix: correct typos in code comments, docstrings and docs (#13931 ) ## Summary - Fix `a image` → `an image` in README and log message - Fix `colomn` → `column` in table structure recognizer comment - Fix `formated` → `formatted` in confluence connector docstring - Fix `tabel of content` → `table of contents` in TOC prompt ## Test plan - [ ] Documentation and comment changes, no functional impact 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-07 13:05:39 +08:00
Jack	c4b0aaa874	Fix: #6098 - Add validation logic for parser_config when update document (#13911 ) ### What problem does this PR solve? Adds validation logic for `parser_config` and refactors the processing flow. Before this change, validation and update logic were interleaved: some validation ran, then some updates, then another round of validate-then-update, which is not good. After this change, all validation logic runs first, and update logic runs only after all validation has completed. Parameters that come from the front end are validated using Pydantic; validation that depends on data from the DB lives in separate methods. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-04-07 11:33:05 +08:00
Idriss Sbaaoui	ff27ce86d6	fix: gpt-5 name-based config clearing from base chat path (#13949 ) ### What problem does this PR solve? Fixes #13944, where OpenAI-compatible custom endpoints failed verification when model names contained `gpt-5` because of incorrect name-based handling in the Base/backend=`base` path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-07 11:24:47 +08:00
buildearth	a0be7c7ca7	Fix(connector): expose id_column, timestamp_column, metadata_columns for MySQL/PostgreSQL incremental sync (#13849 ) ### What problem does this PR solve? The MySQL and PostgreSQL sync classes in `sync_data_source.py` were not passing `id_column`, `timestamp_column`, and `metadata_columns` to `RDBMSConnector`, making incremental sync and document update impossible even when configured. - Without `id_column`: updated records generate new documents instead of overwriting existing ones (doc ID is derived from content hash, so any change produces a new ID). - Without `timestamp_column`: `poll_source` always falls back to full sync, ignoring the configured time range. - The three fields existed in the frontend default values but had no form inputs, so users had no way to fill them in. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) ### Changes - Backend (`rag/svr/sync_data_source.py`): pass `id_column`, `timestamp_column`, and `metadata_columns` from `self.conf` to `RDBMSConnector` for both `MySQL` and `PostgreSQL` sync classes. - Frontend (`web/src/pages/user-setting/data-source/constant/index.tsx`): add `ID Column`, `Timestamp Column`, and `Metadata Columns` form fields to MySQL and PostgreSQL data source configuration UI with tooltips. Signed-off-by: lixintao <lixintao@uniontech.com> Co-authored-by: lixintao <lixintao@uniontech.com>	2026-04-07 10:24:30 +08:00
qinling0210	49386bc1b5	Implement UpdateDataset and UpdateMetadata in GO (#13928 ) ### What problem does this PR solve? Implement UpdateDataset and UpdateMetadata in Go. Adds CLI commands: - `UPDATE CHUNK <chunk_id> OF DATASET <dataset_name> SET <update_fields>` - `REMOVE TAGS 'tag1', 'tag2' from DATASET 'dataset_name';` - `SET METADATA OF DOCUMENT <doc_id> TO <meta>` ### Type of change - [ ] Refactoring	2026-04-07 09:44:51 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
Zhichang Yu	ab358fe949	feat: make Azure cloud authority configurable for SPN auth (#13898 ) ## Summary - The Azure SPN storage handler hardcoded `AzureAuthorityHosts.AZURE_CHINA`, preventing users in Azure Public Cloud regions (UK-South, EU, US, etc.) from authenticating - Add a `cloud` config option (env: `AZURE_CLOUD`) supporting all four Azure sovereignties: `public`, `china`, `government`, `germany` - Defaults to `public` (global Azure) — the most common international use case Closes #13259 ## Test plan - [ ] Verify default (`cloud: public`) connects to Azure Public Cloud endpoints - [ ] Verify `cloud: china` retains existing behavior for Azure China users - [ ] Verify `AZURE_CLOUD` env var overrides the config file value 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 12:51:26 +08:00
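One way the `cloud` option could be resolved to an authority host is sketched below. It uses plain strings rather than `azure.identity.AzureAuthorityHosts`. The host values are believed to match that class's constants but should be verified. The function name, the dict-shaped config, and raising `ValueError` for unknown names are assumptions.

```python
import os

# Assumed to mirror azure.identity.AzureAuthorityHosts; verify before relying on it.
AUTHORITY_HOSTS = {
    "public": "login.microsoftonline.com",
    "china": "login.chinacloudapi.cn",
    "government": "login.microsoftonline.us",
    "germany": "login.microsoftonline.de",
}


def resolve_authority(config, env=os.environ):
    """Pick the authority host: AZURE_CLOUD env var, then config, then 'public'."""
    cloud = (env.get("AZURE_CLOUD") or config.get("cloud") or "public").strip().lower()
    try:
        return AUTHORITY_HOSTS[cloud]
    except KeyError:
        raise ValueError(f"unsupported Azure cloud: {cloud!r}") from None
```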
qinling0210	f02f5fa435	Get ROW_ID from search() in Infinity (#13901 ) ### What problem does this PR solve? 1. `search()` in Infinity can now return `row_id`. 2. To get ROW_ID from `search()`, refer to the handling in `retrieval_test`. Example: ``` $ curl -s -X POST "http://localhost:$PORT/v1/chunk/retrieval_test" -H "Authorization: $TOKEN" -H "Content-Type: application/json" -d '{"kb_id": "4fcd01582ca911f1954184ba59049aa3", "question": "曹操"}' ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-02 18:56:43 +08:00
NeedmeFordev	6b7989b4b4	Add file type validation (#13802 ) ### What problem does this PR solve? This PR fixes WebDAV sync behavior for unsupported file types ([#13795](https://github.com/infiniflow/ragflow/issues/13795)). Previously, the WebDAV connector selected files primarily by modified time (and size threshold) and could still pass unsupported extensions into the download/document-generation path. This caused unnecessary processing and inconsistent behavior compared with connectors that validate file type earlier. This change adds extension validation in two places: 1. Early filter during recursive listing to skip unsupported files before they enter the download flow. 2. Defensive filter before download/document creation to prevent unsupported files from being processed if any listing edge case slips through. It also wires `allow_images` into the WebDAV sync path so image extension handling follows connector policy. Scope is intentionally limited to WebDAV for a focused bug-fix PR. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### How was this tested? - Manual verification with mixed file types under the configured WebDAV path: - supported: `.pdf`, `.txt`, `.md` - unsupported: `.exe`, `.bin`, `.dat` - Triggered full sync and polling sync. - Confirmed unsupported files are skipped before download. - Confirmed supported files are still indexed normally. - Confirmed image handling follows `allow_images` setting. Fixes: #13795	2026-04-02 14:12:27 +08:00
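The extension check applied at both filter points might look roughly like this. The extension sets are illustrative only: the supported ones are taken from the test notes above, the image ones are hypothetical, and the real connector's lists are not shown in this log.

```python
from pathlib import PurePosixPath

SUPPORTED_EXTS = {".pdf", ".txt", ".md"}  # examples from the PR's test notes
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}    # hypothetical image extensions


def is_supported(path, allow_images=False):
    """Decide by file extension whether a remote file should be synced."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in SUPPORTED_EXTS:
        return True
    return allow_images and ext in IMAGE_EXTS


def filter_listing(paths, allow_images=False):
    """Early filter: drop unsupported files before they enter the download flow."""
    return [p for p in paths if is_supported(p, allow_images)]
```

The same `is_supported` check can be called again just before download as the defensive second filter.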
Ricardo-M-L	09a09a5b20	fix: correct typo in IterationItem name check and incomplete error message (#13890 ) Two small fixes: 1. iterationitem.py line 72: Typo "interationitem" → "iterationitem" (missing 't'). The component name check never matched IterationItem components. 2. raptor.py line 94: Error message "Embedding error: " had a trailing colon with no details. Changed to "Embedding error: empty embeddings returned".	2026-04-02 10:35:28 +08:00
qinling0210	bb4a06f759	Implement InsertDataset and InsertMetadata in GO (#13883 ) ### What problem does this PR solve? Implement InsertDataset and InsertMetadata in Go. New internal CLI commands for Go: - `INSERT DATASET FROM FILE "file_name"` - `INSERT METADATA FROM FILE "file_name"` ### Type of change - [x] Refactoring	2026-04-01 16:16:25 +08:00
qinling0210	620fe215a4	Fix python metadata search (#13727 ) ### What problem does this PR solve? Fix python metadata search ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 19:37:19 +08:00
qinling0210	0462c20113	Fix special characters in matching text of search() (#13852 ) ### What problem does this PR solve? Fix special characters in the matching text of `search()`. Some special characters (such as `?`, `*`, `:`) should be escaped before being passed to `matching_text` of `search()`. Fixes https://github.com/infiniflow/ragflow/issues/13729 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 18:47:10 +08:00
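A minimal sketch of such escaping, assuming backslash is the escape character and covering only the three characters named above. The full character set and the escape syntax used by the real fix are not stated in this log, and the helper name is hypothetical.

```python
import re

# Only the characters named in the commit message; the real list may be longer.
_SPECIAL_CHARS = re.compile(r"([?*:])")


def escape_matching_text(text):
    """Prefix each special character with a backslash (assumed escape syntax)."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)
```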
Heyang Wang	641b319647	feat: support reading tags via API (#12891 ) (#13732 ) ### What problem does this PR solve? Enable reading Tag Set tags via the API (expose the `tag_kwd` field). [Screenshot: result of the list chunks query] ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-29 20:17:01 +08:00
KeJun	cb78ce0a7b	feat: support rss datasource (#13721 ) ### What problem does this PR solve? Supports public RSS/Atom feed URLs as data sources for RAGFlow. Related issue: https://github.com/infiniflow/ragflow/issues/12313 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-27 22:58:44 +08:00
Jin Hai	24fcd6bbc7	Update CI (#13774 ) ### What problem does this PR solve? CI isn't stable, try to fix it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-25 18:17:52 +08:00
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
Magicbook1108	f991cd362e	Fix: type check in resume parsing method (#13740 ) ### What problem does this PR solve? Fix: type check in resume parsing method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-23 21:19:09 +08:00
Yongteng Lei	dd839f30e8	Fix: code supports matplotlib (#13724 ) ### What problem does this PR solve? Code as "final" node: ![img_v3_02vs_aece4caf-8403-4939-9e68-9845a22c2cfg](https://github.com/user-attachments/assets/9d87b8df-da6b-401c-bf6d-8b807fe92c22) Code as "mid" node: ![img_v3_02vv_f74f331f-d755-44ab-a18c-96fff8cbd34g](https://github.com/user-attachments/assets/c94ef3f9-2a6c-47cb-9d2b-19703d2752e4) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-20 20:32:00 +08:00
tmimmanuel	13d0df1562	feat: add Perplexity contextualized embeddings API as a new model provider (#13709 ) ### What problem does this PR solve? Adds Perplexity contextualized embeddings API as a new model provider, as requested in #13610. - `PerplexityEmbed` provider in `rag/llm/embedding_model.py` supporting both standard (`/v1/embeddings`) and contextualized (`/v1/contextualizedembeddings`) endpoints - All 4 Perplexity embedding models registered in `conf/llm_factories.json`: `pplx-embed-v1-0.6b`, `pplx-embed-v1-4b`, `pplx-embed-context-v1-0.6b`, `pplx-embed-context-v1-4b` - Frontend entries (enum, icon mapping, API key URL) in `web/src/constants/llm.ts` - Updated `docs/guides/models/supported_models.mdx` - 22 unit tests in `test/unit_test/rag/llm/test_perplexity_embed.py` Perplexity's API returns `base64_int8` encoded embeddings (not OpenAI-compatible), so this uses a custom `requests`-based implementation. Contextualized vs standard model is auto-detected from the model name. Closes #13610 ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-03-20 10:47:48 +08:00
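To illustrate the two details mentioned in the entry above (decoding `base64_int8` embeddings and picking the endpoint from the model name), here is a minimal sketch. The function names and the `"-context-"` substring rule are assumptions made for illustration; only the endpoint paths and model names come from the commit message.

```python
import base64
from array import array


def decode_base64_int8(payload: str) -> list[float]:
    """Decode a base64 string of signed 8-bit integers into a list of floats."""
    raw = base64.b64decode(payload)
    return [float(v) for v in array("b", raw)]  # "b" = signed char (int8)


def endpoint_for(model_name: str) -> str:
    """Assumed rule: contextualized models carry '-context-' in their name."""
    if "-context-" in model_name:
        return "/v1/contextualizedembeddings"
    return "/v1/embeddings"


if __name__ == "__main__":
    sample = base64.b64encode(bytes([1, 255, 128, 127])).decode()
    print(decode_base64_int8(sample))
    print(endpoint_for("pplx-embed-context-v1-4b"))
```

Whether the real provider rescales or normalizes the int8 values afterwards is not stated in the message, so the sketch leaves them as-is.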
yH	757d8d42dd	Fix: use configured OrderByExpr in _community_retrieval_ (#13683 ) The `odr` variable was configured with `desc("weight_flt")` but a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, causing the descending sort to have no effect. ### What problem does this PR solve? In `_community_retrieval_`, the configured `OrderByExpr` with `desc("weight_flt")` was discarded — a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, so community reports were never sorted by weight. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 17:55:40 +08:00
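The bug pattern in the entry above can be reproduced with a toy model. The `OrderByExpr` class and `search` function below are simplified stand-ins written for this sketch, not the project's real implementations.

```python
class OrderByExpr:
    """Simplified stand-in: records (field, descending) pairs."""

    def __init__(self):
        self.fields = []

    def desc(self, field):
        self.fields.append((field, True))
        return self


def search(rows, order_by):
    """Toy search: sort by the configured fields, or return rows unchanged."""
    result = list(rows)
    # Apply sort keys from last to first so the first key has highest priority.
    for field, descending in reversed(order_by.fields):
        result.sort(key=lambda row: row[field], reverse=descending)
    return result


reports = [{"weight_flt": 0.2}, {"weight_flt": 0.9}, {"weight_flt": 0.5}]

odr = OrderByExpr()
odr.desc("weight_flt")

buggy = search(reports, OrderByExpr())  # configured `odr` is discarded
fixed = search(reports, odr)            # configured `odr` is actually used

if __name__ == "__main__":
    print([r["weight_flt"] for r in buggy])
    print([r["weight_flt"] for r in fixed])
```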
Idriss Sbaaoui	7827f0fce5	fix : empty mind map (#13693 ) ### What problem does this PR solve? Fix graphrag extractor chat response parsing and skip truncated cache values ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 13:53:06 +08:00
NeedmeFordev	c3f79dbcb0	fix(jira): prevent missed incremental updates after issue edits (#13674 ) ### What problem does this PR solve? Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira incremental sync could miss updated issues after initial sync, especially near time boundaries. Root cause: - Jira JQL uses minute-level precision for `updated` filters. - Incremental windows had no overlap buffer, so boundary updates could be skipped. - Sync log cursor tracking used a backward-facing update for `poll_range_start`. - Existing-doc updates in `upload_document` lacked a KB ownership guard for doc-id collisions. What changed: - Added Jira incremental overlap buffer (`time_buffer_seconds`, defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL lower-bound time. - Preserved second-level post-filtering to avoid duplicate reprocessing while still catching boundary updates. - Improved Jira sync logging to include start/end window and overlap configuration. - Updated sync cursor tracking in `increase_docs` to keep `poll_range_start` moving forward with max update time. - Added KB ID safety check before updating existing document records in `upload_document`. Verification performed: - Python syntax compile checks passed for modified files. - Manual verification flow: 1. Run full Jira sync. 2. Edit an already-indexed Jira issue. 3. Run next incremental sync. 4. Confirm updated content is re-ingested into KB. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 23:31:05 +08:00
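The overlap-buffer idea in the entry above can be sketched as follows: widen the minute-precision JQL lower bound by a buffer, then filter at second precision so that issues outside the true window are not reprocessed. The function names and the sample buffer value are hypothetical; only `time_buffer_seconds` is named in the commit message.

```python
from datetime import datetime, timedelta, timezone


def jql_lower_bound(window_start: datetime, time_buffer_seconds: int) -> str:
    """Format the JQL lower bound at minute precision, moved back by the buffer."""
    buffered = window_start - timedelta(seconds=time_buffer_seconds)
    return buffered.strftime("%Y-%m-%d %H:%M")


def post_filter(issues: list[dict], window_start: datetime) -> list[dict]:
    """Keep only issues updated at or after the true, second-level window start."""
    return [issue for issue in issues if issue["updated"] >= window_start]


if __name__ == "__main__":
    start = datetime(2026, 3, 18, 12, 0, 30, tzinfo=timezone.utc)
    # 120 is an illustrative value, not the project's default.
    print(f'updated >= "{jql_lower_bound(start, 120)}"')
```

The buffer makes the server-side query deliberately over-inclusive; the client-side filter restores the exact boundary.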
Idriss Sbaaoui	9070408b04	Fix : model-specific handling (#13675 ) ### What problem does this PR solve? Add a handler for GPT-5 models that drops the parameters they do not accept, and centralize all model-specific parameter handling into a single helper. Solves issue #13639 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-03-18 17:28:20 +08:00
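A centralized helper of the kind described in the entry above might be shaped like this. Everything here is hypothetical: the helper name, the prefix table, and the list of dropped parameters are assumptions, since the message does not say which parameters are affected.

```python
# Hypothetical table: model-name prefix -> generation parameters to drop.
_UNSUPPORTED_PARAMS = {
    "gpt-5": {"temperature", "top_p", "presence_penalty", "frequency_penalty"},
}


def apply_model_specific_params(model_name: str, gen_conf: dict) -> dict:
    """Return a copy of gen_conf without parameters the model does not accept."""
    name = model_name.lower()
    cleaned = dict(gen_conf)
    for prefix, params in _UNSUPPORTED_PARAMS.items():
        if name.startswith(prefix):
            for param in params:
                cleaned.pop(param, None)
    return cleaned


if __name__ == "__main__":
    conf = {"temperature": 0.2, "max_tokens": 256}
    print(apply_model_specific_params("gpt-5-mini", conf))
```

Keeping the rules in one table means a new model quirk is a one-line change rather than another branch at each call site.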
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually:
```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser
with open('$HOME/some_epub_book.epub', 'rb') as f:
    data = f.read()
sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
    print(f'\n--- Section {i} ---')
    print(s[:200])
"
```
	2026-03-17 20:14:06 +08:00
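For readers unfamiliar with "spine order", the sketch below shows how reading order can be recovered from an EPUB using only the standard library. It follows the standard EPUB container layout (`META-INF/container.xml` pointing at an OPF package file); the function name `spine_documents` is hypothetical and this is not the `EpubParser` implementation.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_bytes: bytes) -> list[str]:
    """Return archive paths of content documents in spine (reading) order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml names the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory of the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    # The spine lists manifest ids in reading order.
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

Each returned path can then be read from the archive and handed to an HTML parser, which matches the approach the commit message describes.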
Stephen Hu	77483b1e58	refactor: remove useless variable in raptor (#13648 ) ### What problem does this PR solve? remove useless variable in raptor ### Type of change - [x] Refactoring	2026-03-17 15:56:51 +08:00

1365 Commits