ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Tim Wang	ca96d61e73	Feat: Add New API model provider for OpenAI-compatible gateways (#15991 ) ## Summary Add support for "New API" as a model provider, enabling connection to [New API](https://github.com/QuantumNous/new-api) / [one-api](https://github.com/songquanpeng/one-api) compatible gateways that aggregate multiple LLM backends behind a unified OpenAI-compatible `/v1` endpoint. ### Features - All model types: Chat, Embedding, Rerank, Image2Text, TTS, Speech2Text - List Models discovery: `NewAPI(OpenAIAPICompatible)` class in `model_meta.py` queries the gateway's `/v1/models` to auto-discover available models via the native `GET /api/v1/providers/<name>/models` endpoint - Model parameter editing: Pencil icon on each discovered model row to edit `model_type`, `max_tokens`, and `features` (e.g. tool call support) before submitting - Custom model addition: "Add Custom Model" button at the bottom of the List Models dropdown for models not returned by the API - Gear icon settings: Enabled the Settings gear button on provider instances to manage models on existing instances (viewMode) - viewMode credential passthrough: Fixed List Models in viewMode — merges `initialValues` credentials when `api_key`/`base_url` fields are hidden by `hideWhenInstanceExists` ### Changes Backend (8 files): - `rag/llm/chat_model.py` — `NewAPIChat(Base)` class - `rag/llm/embedding_model.py` — `NewAPIEmbed(OpenAIEmbed)` class (no auto `/v1` append) - `rag/llm/rerank_model.py` — `NewAPIRerank(Base)` class (uses `/rerank` endpoint) - `rag/llm/cv_model.py` — `NewAPICv(GptV4)` class - `rag/llm/tts_model.py` — `NewAPITTS(OpenAITTS)` class - `rag/llm/sequence2txt_model.py` — `NewAPISeq2txt(GPTSeq2txt)` class - `rag/llm/model_meta.py` — `NewAPI(OpenAIAPICompatible)` class for List Models discovery - `conf/llm_factories.json` — New API factory entry with all model type tags Frontend (8 files + 1 new SVG): - `web/src/assets/svg/llm/new-api.svg` — New API logo icon - `web/src/constants/llm.ts` — 
`LLMFactory.NewAPI` enum + `IconMap` entry - `web/src/components/svg-icon.tsx` — `NewAPI` added to `svgIcons` - `web/src/pages/user-setting/setting-model/modal/provider-modal/field-config/local-llm-configs.ts` — New API `buildLocalConfig` - `web/src/pages/user-setting/setting-model/modal/provider-modal/constants.ts` — `LIST_MODEL_PROVIDERS` includes NewAPI - `web/src/pages/user-setting/setting-model/components/used-model.tsx` — Enable Settings gear button - `web/src/pages/user-setting/setting-model/modal/provider-modal/hooks/use-list-models-picker.ts` — viewMode credential merge + model editing state/handlers - `web/src/pages/user-setting/setting-model/modal/provider-modal/hooks/use-list-models-options.tsx` — Pencil edit icon per model row - `web/src/pages/user-setting/setting-model/modal/provider-modal/index.tsx` — `AddCustomModelDialog` import + edit dialog rendering Note on Go implementation: A Go model driver (`NewAPIModel` delegating to `OpenAIModel`) has been prepared but is deferred until the Go runtime is enabled in a future release (current v0.26.0 images use `API_PROXY_SCHEME=python` and do not compile Go binaries). Will submit as a follow-up PR. 
## Related - Depends on: #15996 (provider instance API improvements — server-side credential lookup, idempotent `add_model`, security fixes — required for viewMode gear icon and batch model submission) ## Test plan - [ ] Add New API provider with api_key and base_url pointing to an OpenAI-compatible gateway - [ ] Click "List Models" — should discover and display available models from `/v1/models` - [ ] Click pencil icon on a model — should open edit dialog to change model_type, max_tokens, features - [ ] Select multiple models and click OK — should add all selected models - [ ] Click gear icon on the added instance — should open viewMode with List Models working - [ ] In viewMode, select new models including pre-existing ones, click OK — should succeed (requires #15996) - [ ] Verify all model types work: create a Chat assistant, Embedding KB, Rerank setting 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Tim Wang <wanghualoong@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-26 18:47:20 +08:00
Öndery	8081a77c7c	Fix missing move and copy methods in Python RAGFlowS3 storage implementation (#16350 )	2026-06-26 15:51:24 +08:00
Harsh Kashyap	c7052f4dd1	fix(rag/nlp): treat string input as one phrase in is_english (#16308 )	2026-06-25 20:07:09 +08:00
cleanjunc	e8bb534b90	fix: naive_merge splits oversized sections and counts overlap tokens correctly (#15802 )	2026-06-25 19:19:38 +08:00
Lynn	ede46e0bb8	Fix: guess volc embedding model (#16298 )	2026-06-24 14:11:55 +08:00
helloxjade	1b2da645c3	fix: deduplicate markdown table chunks (#16143 )	2026-06-24 13:22:57 +08:00
minion1227	14565b289a	Fix: docx parsing raises ValueError on 'Heading' styles (#16284 )	2026-06-24 13:16:16 +08:00
Günter Lukas	398f488b1b	fix: support Google Cloud Gemini eu/us multipoint endpoints (#15990 )	2026-06-24 11:07:05 +08:00
Rander	017adf841f	fix(paddleocr): support PP-OCRv6 ocrResults fallback and integrate image parsing (#16150 ) ## Summary This PR fixes two issues discovered during testing of the PaddleOCR async API refactoring: ### 1. PP-OCRv6 returns `ocrResults` instead of `layoutParsingResults` Models like PP-OCRv6 are pure text recognition models that return results in `ocrResults.prunedResult.rec_texts` format rather than the `layoutParsingResults.prunedResult.parsing_res_list` format used by layout-aware models (PaddleOCR-VL series). Changes: - `deepdoc/parser/paddleocr_parser.py`: Extract `ocrResults` alongside `layoutParsingResults` in `_send_request()`, add fallback logic in `_transfer_to_sections()` and `parse_image()` - `internal/entity/models/paddleocr.go`: Add `ocrResults` struct and fallback extraction in Go OCR handler ### 2. Image parsing not integrated into picture chunker The `parse_image()` method existed in PaddleOCRParser but was never called from `rag/app/picture.py` (the module that handles image file uploads). Users configuring PaddleOCR as their layout recognizer would still get local deepdoc OCR for images. Changes: - `rag/app/picture.py`: When `layout_recognize` is set to PaddleOCR, use `PaddleOCROcrModel.parse_image()` instead of local OCR. Falls back gracefully to local OCR on failure. ## Testing Verified end-to-end in Docker: - PaddleOCR-VL-1.6 PDF parsing: ✅ (10 text blocks with bbox) - PaddleOCR-VL-1.6 image parsing: ✅ (219 chars) - PP-OCRv6 PDF parsing with ocrResults fallback: ✅ (10 text blocks) - PP-OCRv6 image parsing with ocrResults fallback: ✅ (136 chars) ## Related PRs - #15967 (merged) - PaddleOCR async Job API refactoring + new models - #16086 (merged) - PaddleOCR image parsing support	2026-06-23 22:02:54 +08:00
Harsh Kashyap	b4a8a90c73	fix(rag/raptor): handle max_cluster edge case in GMM cluster selection (#16199 ) ### What problem does this PR solve? `_get_optimal_clusters` in `rag/raptor.py` had two edge-case issues in GMM cluster-count selection: 1. It used `np.arange(1, max_clusters)`, which never evaluates the upper-bound candidate (`max_clusters`). 2. When effective `max_clusters` becomes `1`, the candidate list was empty and `argmin` crashed. This PR makes candidate evaluation inclusive (`1..max_clusters`) and guards the single-cluster case by returning `1` directly. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Validation - `pytest test/unit_test/rag/test_raptor_psi_tree_builder.py --config-file pyproject.toml -q` - `ruff check rag/raptor.py test/unit_test/rag/test_raptor_psi_tree_builder.py` ### Tests added - Regression test for `max_cluster == 1` path (no crash, returns 1) - Regression test verifying upper-bound candidate is evaluated and can be selected _AI-assistance disclosure: parts of this change (bug triage and test scaffolding) were drafted with AI assistance and fully reviewed and verified by me._ --------- Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 21:07:26 +08:00
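A minimal sketch of the two edge cases described above, not the actual `_get_optimal_clusters` code. It uses plain `range`, which is half-open in the same way as `np.arange`; the helper names and the `score` callback (standing in for a per-candidate criterion to minimise) are assumptions for illustration.

```python
def candidate_cluster_counts(max_clusters: int) -> list[int]:
    """Inclusive candidates 1..max_clusters (hypothetical helper)."""
    # range(1, max_clusters) would stop at max_clusters - 1, so add 1.
    return list(range(1, max_clusters + 1))


def pick_cluster_count(max_clusters: int, score) -> int:
    """Return the candidate with the lowest score."""
    if max_clusters <= 1:
        # Guard: with a single possible cluster there is nothing to compare.
        return 1
    return min(candidate_cluster_counts(max_clusters), key=score)


# The half-open range never reaches the upper bound:
print(list(range(1, 4)))            # [1, 2, 3]
print(candidate_cluster_counts(4))  # [1, 2, 3, 4]
```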
Manan Bansal	70c0121b78	Fix: preserve tables when parsing DOCX with the laws parser (#16008 ) (#16155 ) ## What Fixes #16008 — tables contained in a DOCX are silently dropped when the document is parsed with the laws chunking method. ## Root cause `Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`, which only yields paragraph elements. Tables are separate `tbl` blocks in the document body, so they were never visited and were lost from the output. (The `naive` parser already handles tables by iterating the document body.) ## Changes - Iterate `self.doc._element.body` so tables are visited in document order alongside paragraphs. - Add a `__table_to_html` helper that renders each table to HTML, including merged-cell `colspan` detection (mirrors the `naive` parser's logic). - Inject each table into the section tree with a sentinel level deeper than any heading, so `Node.build_tree` merges it into its enclosing section — keeping the chapter/article title path as retrieval context rather than producing an orphaned chunk. - Guard the `h2_level` computation against an empty heading set, so a tables-only or empty DOCX no longer raises `IndexError`. This keeps the laws parser's hierarchical chunking and adds table extraction, so users no longer have to choose between losing structure (naive) or losing tables (laws). ## Tests Adds `test/unit_test/rag/test_laws_docx_tables.py` covering: - table content is preserved and carries its section title path, - merged adjacent cells collapse to `colspan`, - tables-only document does not crash, - empty document returns `[]`. All four pass; `ruff check` / `ruff format` are clean.	2026-06-22 09:46:44 +08:00
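A rough sketch of the merged-cell `colspan` idea mentioned above, not the PR's `__table_to_html` helper. It assumes a merged cell shows up as adjacent entries with identical text in the row, and it works on plain strings so it runs without python-docx; `row_to_html` is a hypothetical name.

```python
from html import escape


def row_to_html(cells: list[str]) -> str:
    """Render one table row, collapsing runs of identical adjacent cells."""
    parts = ["<tr>"]
    i = 0
    while i < len(cells):
        span = 1
        # Count how many following cells repeat the current one.
        while i + span < len(cells) and cells[i + span] == cells[i]:
            span += 1
        attr = f' colspan="{span}"' if span > 1 else ""
        parts.append(f"<td{attr}>{escape(cells[i])}</td>")
        i += span
    parts.append("</tr>")
    return "".join(parts)


print(row_to_html(["Article 1", "Article 1", "Fee"]))
```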
qinling0210	563d855780	Implement OpenAI chat completions in GO (#16177 ) ### What problem does this PR solve? Implement OpenAI chat completions in Go: `POST /api/v1/openai/<chat_id>/chat/completions`. OpenAI chat CLI: see `internal/development.md`. ### Type of change - [x] Refactoring	2026-06-18 18:07:27 +08:00
Lynn	a5cce29f22	Fix: add mimo (#16136 ) ### What problem does this PR solve? Add chat model factory for Xiaomi model. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-17 19:02:33 +08:00
Rander	62698725ca	feat(paddleocr): add image parsing support with async Job API (#16086 ) ## Summary Add image parsing capability to PaddleOCR integration, building on top of #15967 (async Job API migration). ## Changes ### `deepdoc/parser/paddleocr_parser.py` - Add `parse_image()` method that uses the same async Job API flow as `parse_pdf()` - Extracts text from `layoutParsingResults` → `prunedResult` → `parsing_res_list` - Returns concatenated block content as a single string ### `rag/llm/ocr_model.py` - Add `parse_image()` wrapper to `PaddleOCROcrModel` with availability check and logging ## Relationship to other PRs - Depends on: #15967 (async Job API migration) — this PR is based on that branch - Replaces: #14826 (original image processing PR based on old sync API) ## Notes This PR uses `base_url` and the async Job API (submit → poll → fetch) consistent with #15967, rather than the old `api_url` + sync POST pattern from #14826.	2026-06-16 19:34:38 +08:00
Rander	1235da7093	refactor(paddleocr): migrate from sync API to async Job API (#15967 ) ## Summary Migrate PaddleOCR integration from the deprecated synchronous HTTP API to the new asynchronous Job API (`submit → poll → fetch`), aligning with PaddleOCR 3.6.0+ architecture. ## Changes ### Python (`deepdoc/parser/paddleocr_parser.py`) - Replace synchronous `requests.post()` with async Job API flow (submit → poll → fetch) - Authentication: `token {token}` → `Bearer {token}` - File transfer: base64 JSON body → multipart file upload - Polling: exponential backoff (initial 3s, ×1.5, max 15s, timeout controlled by `request_timeout`) - Result: fetch full JSONL from result URL, preserving `prunedResult` with bbox info for crop functionality - Rename `api_url` → `base_url` (backward compatible: `api_url` still accepted as fallback) ### Python (`rag/llm/ocr_model.py`) - Prefer `paddleocr_base_url` / `PADDLEOCR_BASE_URL`, fallback to `paddleocr_api_url` / `PADDLEOCR_API_URL` ### Go (`internal/entity/models/paddleocr.go`) - Add `Client-Platform: ragflow` header to submit and poll requests - Change polling from fixed 3s to exponential backoff (initial 3s, ×1.5, max 15s) ### Python (`common/constants.py`) - Add `PADDLEOCR_BASE_URL` to env keys and default config ## Backward Compatibility - Old env var `PADDLEOCR_API_URL` still works (used as fallback) - Frontend field `paddleocr_api_url` still works (backend reads it as fallback) - No user-facing configuration changes required for existing setups ## Why not use the `paddleocr` SDK package directly? RAGFlow's `_transfer_to_sections()` relies on `prunedResult` (containing `block_bbox`, `block_label`, `parsing_res_list`) from the raw API response for PDF crop functionality. The SDK's public `parse_document()` API only returns `DocParsingResult` with `markdown_text`, discarding the bbox data. Therefore we implement the async Job API flow directly via HTTP, following the same logic as the SDK internally.	2026-06-16 19:34:21 +08:00
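The polling schedule above (initial 3s, ×1.5, max 15s) can be sketched as a generator. This is an illustration of the stated numbers only; `poll_delays` is a hypothetical name and the real parser's timeout handling is not reproduced here.

```python
from itertools import islice


def poll_delays(initial: float = 3.0, factor: float = 1.5, cap: float = 15.0):
    """Yield successive sleep intervals, growing by `factor` up to `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


print(list(islice(poll_delays(), 6)))
# [3.0, 4.5, 6.75, 10.125, 15.0, 15.0]
```

A caller would sleep for each yielded interval between poll requests and stop once the job finishes or the overall `request_timeout` budget is used up.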
galuis116	6bfaa3f21e	Fix: SSRF in markdown parser remote image fetch (#15438 ) ### What problem does this PR solve? `rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs parsed straight out of an untrusted uploaded markdown document via a raw `requests.get`, with no SSRF validation. Markdown chunking always reaches this path (`return_section_images=True`), so any authenticated user who uploads a `.md`/`.markdown`/`.mdx` file to a knowledge base could make the server issue requests to internal services or cloud-metadata endpoints, e.g. `![x](http://169.254.169.254/latest/meta-data/...)`. The `image/` Content-Type check only gates decoding — the outbound request (the SSRF) always fires. This was the one user-controlled fetch site missed by the project's existing SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler, SearXNG, RSS connector, MCP/document APIs, and OAuth avatar download). The fix validates and DNS-pins every hop with `common.ssrf_guard.assert_url_is_safe` before connecting, and follows redirects manually so each redirect target is re-validated (closing the DNS-rebinding / redirect-bypass window), mirroring `common/data_source/rss_connector.py`. Blocked URLs are skipped and logged like any other unreachable image, so legitimate public images are unaffected. Adds a regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`. Closes #15437 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local> Co-authored-by: galuis116 <galuis116@users.noreply.github.com>	2026-06-16 18:54:55 +08:00
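To illustrate why every redirect hop has to be re-validated, here is a simplified sketch using only the standard `ipaddress` module. It is not `common.ssrf_guard.assert_url_is_safe`; the function names are hypothetical, and DNS resolution is assumed to have happened already, so each hop is given as a `(url, resolved_ip)` pair.

```python
import ipaddress


def is_internal_address(ip_text: str) -> bool:
    """True for addresses a server should not fetch on a user's behalf."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local   # covers 169.254.169.254 (cloud metadata)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def first_unsafe_hop(resolved_hops):
    """Return the first URL in a redirect chain that resolves internally."""
    for url, ip_text in resolved_hops:
        if is_internal_address(ip_text):
            return url
    return None


chain = [
    ("https://example.com/a.png", "93.184.216.34"),
    ("http://169.254.169.254/latest/meta-data/", "169.254.169.254"),
]
print(first_unsafe_hop(chain))
```

Checking only the first URL would accept this chain, because the public host is the one that redirects to the metadata endpoint.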
buua436	5751a22444	fix: add toc field to extractor output (#16059 ) ### What problem does this PR solve? TOC chunks now include a toc field so the agent pipeline logs expose the data the frontend expects. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-16 13:27:45 +08:00
Wang Qi	f6a2075ad0	Fix one data source can be synced to multiple dataset (#16023 ) Test add/delete - worked.	2026-06-15 16:54:25 +08:00
Yufeng He	0d836afd34	fix: keep max pagerank for repeated n-hop edges (#15696 ) ## Summary Fixes #15695. The Python GraphRAG path already accumulates similarity when several N-hop paths produce the same edge, but PageRank was overwritten by the last path. That makes ranking depend on path order for repeated edges. This keeps the strongest PageRank seen for a repeated edge in the Python implementation: - `rag/graphrag/search.py` The similarity score still accumulates exactly as before. ## To verify - `python -m py_compile rag\graphrag\search.py` - `git diff --check` - `git diff --stat upstream/main` -> only `rag/graphrag/search.py` I originally included the Go implementation too, but removed it after maintainer feedback because the Go version is still under development and not released yet.	2026-06-11 20:53:11 +08:00
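A small sketch of the merge rule described above: similarity accumulates across repeated edges while PageRank keeps the maximum, so the result no longer depends on path order. The data shapes and the name `merge_nhop_edges` are assumptions, not the structures used in `rag/graphrag/search.py`.

```python
def merge_nhop_edges(paths):
    """Merge (edge, sim, pagerank) triples that refer to the same edge."""
    merged = {}
    for edge, sim, pagerank in paths:
        if edge in merged:
            merged[edge]["sim"] += sim  # accumulate, as before
            merged[edge]["pagerank"] = max(merged[edge]["pagerank"], pagerank)
        else:
            merged[edge] = {"sim": sim, "pagerank": pagerank}
    return merged


paths = [(("a", "b"), 0.5, 0.9), (("a", "b"), 0.25, 0.1)]
print(merge_nhop_edges(paths))
print(merge_nhop_edges(list(reversed(paths))))  # same result in either order
```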
Dexterity	bde2b1fc6d	fix(llm): correct error handling, token accounting, and truncation in embedding providers (#15424 ) ### Summary Closes #15423 `rag/llm/embedding_model.py` hosts about 40 embedding providers that shared several defects affecting indexing reliability, cost accounting, and error visibility. This PR fixes four concrete bugs. Masked, inconsistent errors (27 sites). Nearly every provider ran `log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`. Because `log_exception` always raises, the second line was dead code, and the surfaced exception varied with whether the SDK response exposed a `.text` attribute. Every failure path now raises a single `EmbeddingError` that includes the underlying response detail, so the cause of a failed embedding is consistent and visible. Fabricated token counts. `LocalAIEmbed` returned a hardcoded `1024` and `OllamaEmbed` added `128` per text. These values feed `used_tokens` and therefore billing and usage tracking. Both now report the real count from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back to a local token count only when the server omits it. Truncation overshoot. The `8196` limit used by Mistral and Bedrock exceeded the standard `8192` ceiling and could push boundary sized inputs past the model limit. Limits are corrected to `8192` and made intentional per provider, and providers that rely on server side truncation now request it explicitly (Ollama `truncate=True`, Cohere `truncate="END"`). Missing batching on Zhipu and Ollama. Both issued one request per text. They now batch like the other OpenAI compatible providers, turning N round trips into `ceil(N / batch_size)`. Batched results are realigned by response `index` so a chunk always keeps its own vector. A shared `Base._batched_encode` helper owns the batch loop, optional truncation, result accumulation, and the single error path. 
It is the mechanism that lets these fixes live in one place instead of across 27 duplicated sites. The public `encode()` and `encode_queries()` contract stays the same, so existing callers are unaffected. Tests covering all four fixes are added under `test/unit_test/rag/llm/test_embedding_model.py`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-11 19:29:46 +08:00
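The batching and realignment described above can be sketched as follows. This is not `Base._batched_encode`; the function name, the `embed_batch` callback, and the response shape (`index` / `embedding` keys) are assumptions chosen to mirror an OpenAI-style embeddings response.

```python
import math


def batched_encode(texts, embed_batch, batch_size=16):
    """Embed texts in batches and place each vector at its text's position."""
    vectors = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for item in embed_batch(batch):
            # `index` is relative to the batch, so offset it by `start`.
            vectors[start + item["index"]] = item["embedding"]
    return vectors


calls = []


def fake_backend(batch):
    """Stand-in provider that answers out of order to exercise realignment."""
    calls.append(len(batch))
    items = [{"index": i, "embedding": [float(len(t))]} for i, t in enumerate(batch)]
    return list(reversed(items))


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = batched_encode(texts, fake_backend, batch_size=2)
print(vectors, len(calls), math.ceil(len(texts) / 2))
```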
jaso0n0818	d4fbc013b9	fix: tolerate raw api_key string in AzureEmbed and AzureGptV4 __init__ (#15877 ) Fixes #15587 ## Problem `AzureEmbed.__init__` in `rag/llm/embedding_model.py` and `AzureGptV4.__init__` in `rag/llm/cv_model.py` both call `json.loads(key)` unconditionally: ```python api_key = json.loads(key).get("api_key", "") api_version = json.loads(key).get("api_version", "2024-02-01") ``` When a user stores a plain API key string (not a JSON object) in the model configuration — which is a valid and common way to configure Azure OpenAI — `json.loads` raises `JSONDecodeError`. This makes the model fail to initialize and causes document parsing/embedding to return a 500 error. ## Fix Wrap `json.loads` in `try/except (json.JSONDecodeError, TypeError)` and fall back to using the raw string as the `api_key` with the default `api_version`. This is the same pattern already applied to the Azure chat model in PR #15604. ## Files changed - `rag/llm/embedding_model.py` — `AzureEmbed.__init__` - `rag/llm/cv_model.py` — `AzureGptV4.__init__` Fixes #15857 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-11 16:28:29 +08:00
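A standalone sketch of the fallback described above. `parse_azure_key` is a hypothetical helper rather than the code in `AzureEmbed.__init__`, and the `isinstance` check is an extra guard not mentioned in the PR.

```python
import json


def parse_azure_key(key, default_version="2024-02-01"):
    """Return (api_key, api_version) from a JSON blob or a plain key string."""
    try:
        cfg = json.loads(key)
    except (json.JSONDecodeError, TypeError):
        # Plain string (or None): use it as the key with the default version.
        return key, default_version
    if not isinstance(cfg, dict):
        # Extra guard (assumption): a key like "12345" parses as a JSON number.
        return key, default_version
    return cfg.get("api_key", ""), cfg.get("api_version", default_version)


print(parse_azure_key("sk-plain"))
print(parse_azure_key('{"api_key": "k", "api_version": "2024-06-01"}'))
```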
Idriss Sbaaoui	9871a7e0b6	fix: replicate model provider (#15933 ) ### What problem does this PR solve? Fix the Replicate model provider failing with a valid API key. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-11 15:08:33 +08:00
Rene Arredondo	19104168a6	fix(sync): tolerate list inputs for Discord server_ids / channels (#15790 ) (#15809 ) ## Summary Fixes #15790. Every Discord sync launched from the current Web UI crashes immediately with: ``` 'list' object has no attribute 'split' ``` The error is raised in [rag/svr/sync_data_source.py:650-651](rag/svr/sync_data_source.py#L650-L651): ```python server_ids=server_ids.split(",") if server_ids else [], channel_names=channel_names.split(",") if channel_names else [], ``` ### Root cause Three independent bugs stack here, all in the Discord branch of `sync_data_source.py`: 1. Type mismatch (the user's exact error). The current form at [web/src/pages/user-setting/data-source/constant/index.tsx:833-843](web/src/pages/user-setting/data-source/constant/index.tsx#L833-L843) uses `FormFieldType.Tag` for both Server IDs and Channels: ```tsx { label: 'Server IDs', name: 'config.server_ids', type: FormFieldType.Tag, required: false }, { label: 'Channels', name: 'config.channels', type: FormFieldType.Tag, required: false }, ``` Tag inputs serialise to lists, not comma-separated strings. The backend `.split(",")` then explodes on the very first sync. 2. Field-name mismatch. The form writes `config.channels`. The backend reads `self.conf.get("channel_names", None)`. Even if `.split(",")` were fixed, channels would silently be empty for every UI-created source. 3. Int conversion missing. [common/data_source/discord_connector.py:82](common/data_source/discord_connector.py#L82) types `server_ids` as `list[int]` (Discord guild IDs are integers); the previous `.split(",")` produced strings, so the `channel.guild.id not in server_ids` filter at [discord_connector.py:92](common/data_source/discord_connector.py#L92) silently never matched. So even the configurations that didn't crash were also broken — there is no path through the current code that actually filtered by server id from a UI-created source. 
### Fix A 39-line patch in one function: - New `Discord._coerce_str_list` static method: accepts `None` / `""` / `list` / `tuple` / `set` / scalar / comma-separated str, returns a clean `list[str]` with whitespace trimmed and empty entries dropped. Smoke-tested against the 10 input shapes that can hit it (see Test plan). - `_generate` reads `config.channels` first (the form's actual key) and falls back to `config.channel_names`, so SDK callers and legacy configs that already shipped with the old key keep working. - `server_ids` is coerced to `list[int]`. Non-integer entries are logged and dropped instead of crashing the sync, so a single malformed tag from the form doesn't tank the rest of the run. ### What this PR does NOT change - Web form key (`config.channels`) — kept as-is. Renaming it to `channel_names` would force a UI migration and break in-flight configs; the backend fallback solves the same problem more safely. - `common/data_source/discord_connector.py` — its signature was already correct. - Other connectors (Slack, Gmail, Confluence, etc.) — they don't crash today and were not in the issue's scope. ## Test plan `Discord._coerce_str_list` has been exercised against all ten realistic input shapes — list, tuple, set, comma-separated string, str with extra whitespace, empty entries, integers from a Tag input, None, empty list, single trailing comma. All pass.	2026-06-11 13:27:42 +08:00
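The coercion rules listed above could look roughly like this. It is a sketch written from the description, not the PR's `Discord._coerce_str_list`; `coerce_int_list` is a hypothetical companion, and where the PR logs dropped entries this version silently skips them.

```python
def coerce_str_list(value):
    """Normalise None / "" / list / tuple / set / scalar / CSV into list[str]."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        items = value.split(",")
    elif isinstance(value, (list, tuple, set)):
        items = value
    else:
        items = [value]  # scalar such as a single int
    cleaned = [str(item).strip() for item in items]
    return [item for item in cleaned if item]


def coerce_int_list(value):
    """Keep only entries that parse as integers (e.g. Discord guild IDs)."""
    ids = []
    for item in coerce_str_list(value):
        try:
            ids.append(int(item))
        except ValueError:
            continue  # malformed tag: drop it instead of failing the sync
    return ids


print(coerce_str_list("a, b,,c,"))          # ['a', 'b', 'c']
print(coerce_int_list(["123", "abc", 456]))  # [123, 456]
```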
Jack	0d3e410826	fix: strip Ollama-style tag suffix from LocalAI model names (#15908 ) ## Summary LocalAI exposes two API surfaces with conflicting naming conventions: - `GET /api/tags` returns model names with `:latest` suffix (Ollama format) - `POST /v1/chat/completions` expects names without `:latest` (OpenAI format) RAGFlow discovered models via `/api/tags` and stored the tagged name, then used it with `/v1/chat/completions`, causing a 404 error because LocalAI didn't recognize `model:latest`. ## Fix In `LocalAI.get_model_list()`, strip the tag suffix from model names using `model["name"].rsplit(":", 1)[0]`, so stored names match what the OpenAI-compatible endpoints expect.	2026-06-10 19:05:05 +08:00
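The `rsplit` expression quoted above behaves as follows on a few sample names (the names themselves are made up for illustration):

```python
def strip_tag(name: str) -> str:
    """Drop a trailing ':tag' suffix, leaving untagged names unchanged."""
    return name.rsplit(":", 1)[0]


print(strip_tag("llama3:latest"))  # llama3
print(strip_tag("llama3"))         # llama3
print(strip_tag("org/model:q4"))   # org/model
```

Because only the last `:` is split on, a name without a tag passes through untouched. A name containing a colon for some other reason would lose everything after its last colon.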
Lynn	7355db183f	Fix: model list (#15905 ) ### What problem does this PR solve? Set OpenDataLoader and call in parser and naive ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 17:44:50 +08:00
Idriss Sbaaoui	357cb84cd4	Fix: cohere call failing (#15899 ) ### What problem does this PR solve? cohere api call failing because of missing prefix ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 15:57:10 +08:00
Lynn	478c9846a1	Fix: model list (#15860 ) ### What problem does this PR solve? Remove tenant_llm call in rag. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 14:59:57 +08:00
buua436	093eec3105	fix: handle qwen rerank error response (#15881 ) ### What problem does this PR solve? Fix QWen rerank error handling so DashScope error responses without a text attribute do not raise a secondary KeyError and hide the real provider error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 13:05:24 +08:00
Wang Qi	9aa81e7cad	Fix paddle ocr / minerU cannot add (#15858 )	2026-06-10 13:04:13 +08:00
cleanjunc	88e4d6bddb	Fix: restore GraphRAG entity ranking by indexing pagerank and n-hop paths (#15797 ) ### Summary Closes #15795 Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`, but the entity chunks written at index time stopped carrying the values that ranking depends on. `graph_node_to_chunk` only stored `entity_type`, `description`, and `source_id`, dropping the node `pagerank` and the n-hop neighbour paths, while `search.py` still read them back as `rank_flt` and `n_hop_with_weight`. The producer of these fields, `update_nodes_pagerank_nhop_neighbour`, was removed in #6513, but the read side in `KGSearch` was never updated. The result is that on every knowledge-graph query: - `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0` for every entity and selection falls back to arbitrary order. - Every displayed entity score is `0.00`. - The n-hop relation-enrichment block is dead code because `n_hop_ents` is always empty, leaving `merge_tuples` and `is_continuous_subsequence` orphaned. This PR restores the missing index-time fields so the documented `P(E\|Q) = pagerank * sim` ranking and the n-hop enrichment work again. What changed: - `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and `n_hop_with_weight` from the recomputed n-hop neighbour paths. - Reintroduced the n-hop path computation (`n_neighbor`) in `rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples` / `is_continuous_subsequence` helpers, with a direction-agnostic edge-weight lookup for undirected graphs. `set_graph` computes the paths per added or updated node and passes them through. - `KGSearch` now selects `n_hop_with_weight` in the entity keyword search so Infinity and OceanBase return it (Elasticsearch and OpenSearch already read it from `_source`), and the read is hardened against missing keys or empty strings before `json.loads`. 
- Added the `n_hop_with_weight` column to OceanBase, including the `EXTRA_COLUMNS` migration entry so existing tables get it. The other engines already map both fields via dynamic templates or the Infinity mapping. Scope note: pagerank and n-hop are re-indexed for the added or updated nodes in each pass, consistent with the existing incremental indexing design. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing Added unit tests in `test/unit_test/rag/graphrag/test_graphrag_utils.py`: - `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated nodes, missing weights, and direction-agnostic lookup. - `graph_node_to_chunk`: `rank_flt` populated from pagerank and defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an empty list. ``` uv run pytest test/unit_test/rag/graphrag/ # 106 passed uv run ruff check rag/graphrag/ rag/utils/ob_conn.py ```	2026-06-09 20:50:45 +08:00
Jack	3eff41361b	fix: prevent None values in auto-metadata from causing KeyError (#15842 ) ## Problem When users configure auto-metadata for a dataset, parsing crashes with: ``` KeyError: 'properties' in gen_metadata → schema["properties"] ``` ## Root Cause Pydantic `AutoMetadataField` defaults `enum` and `description` to `None` when the frontend omits these fields: ```python class AutoMetadataField(Base): enum: Annotated[list[str] \| None, Field(default=None)] description: Annotated[str \| None, Field(default=None)] ``` These `None` values propagate through the call chain and cause two crashes:	2026-06-09 19:10:48 +08:00
euvre	f97d6396b4	fix: BaiduYiyan API key validation fails in set_api_key (#15828 ) ### What problem does this PR solve? When setting the API key for the BaiduYiyan provider, all model validations fail with the error "Fail to access model using this api key. No valid response received". Root cause: 1. `BaiduYiyanChat` in `rag/llm/chat_model.py` does not override `async_chat_streamly()`. The `verify_api_key()` function uses `mdl.async_chat_streamly()` to validate, but `BaiduYiyanChat` inherits `Base.async_chat_streamly()` which uses the OpenAI client, not the Baidu Qianfan SDK (qianfan). Since BaiduYiyan has no OpenAI-compatible base_url, validation always fails. 2. `verify_api_key()` in `provider_api_service.py` does not format the raw API key string into the JSON format (`{"yiyan_ak": "...", "yiyan_sk": "..."}`) that `BaiduYiyanChat.__init__()` expects via `json.loads(key)`. Fix: 1. Add `async_chat_streamly()` method to `BaiduYiyanChat` using the qianfan SDK, consistent with the existing `chat_streamly()` method. 2. Add BaiduYiyan API key formatting in `provider_api_service.py` `verify_api_key()` to match the format expected by `BaiduYiyanChat.__init__()`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-06-09 19:05:58 +08:00
buua436	7b8d6f34b3	fix: force image parser json output (#15847 ) ### What problem does this PR solve? Force image parser runtime output format to JSON so downstream chunking reads OCR results from the JSON output and image parser chunks can be displayed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-09 19:02:37 +08:00
buua436	c1496ffd43	fix: propagate memory tenant id in task collect (#15837 ) ### What problem does this PR solve? Propagate `tenant_id` from memory task messages into task collection so refactored task execution can build a valid context. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-09 17:47:48 +08:00
Jonathan Chang	c586292993	feat: Implement checkpoint/resume support for GraphRAG community extraction and entity resolution (#15523 ) ## Summary This PR adds checkpoint/resume support for the GraphRAG `extract_community` and `resolve_entities` stages. The implementation stores successful intermediate results in the document store so interrupted ingestion can resume without repeating already-completed LLM work. Checkpoints are loaded before each stage, reused when available, saved after successful batch/community processing, and cleaned up after the stage completes successfully. ## Related Issue Closes: #15518 ## Change Type - [x] Feature - [x] Bug fix - [x] Test - [ ] Refactor - [ ] Documentation - [ ] Breaking change ## Real Behavior Proof Validation commands run locally: ```bash uv run python -m py_compile \ rag/graphrag/checkpoints.py \ rag/graphrag/general/community_reports_extractor.py \ rag/graphrag/entity_resolution.py \ rag/graphrag/general/index.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text Passed ``` ```bash uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 4 passed ``` ```bash uv run pytest \ test/unit_test/rag/graphrag/test_phase_markers.py \ test/unit_test/rag/graphrag/test_graphrag_utils.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 95 passed ``` ```bash git diff --check ``` Result: ```text Passed ``` ## Checklist - [x] Implemented checkpoint/resume support for `extract_community`. - [x] Implemented checkpoint/resume support for `resolve_entities`. - [x] Avoided touching unrelated API behavior. - [x] Added unit tests for the new checkpoint helper logic. - [x] Verified Python syntax compilation. - [x] Ran related GraphRAG unit tests successfully. - [x] Ran `git diff --check`. - [ ] Ran full project test suite. --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-09 15:34:47 +08:00
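The load / reuse / save / clean-up cycle described above can be sketched with an in-memory dict standing in for the document store. The function and variable names are hypothetical and do not reflect the API of `rag/graphrag/checkpoints.py`.

```python
def run_with_checkpoints(batches, process, store):
    """Process (batch_id, batch) pairs, skipping ids already checkpointed."""
    results = {}
    for batch_id, batch in batches:
        if batch_id in store:
            results[batch_id] = store[batch_id]  # reuse earlier work
            continue
        results[batch_id] = process(batch)
        store[batch_id] = results[batch_id]      # save only after success
    store.clear()                                # stage finished: clean up
    return results


calls = []


def flaky(batch):
    """Stand-in for an LLM call that fails the first time it sees 'b'."""
    calls.append(batch)
    if batch == "b" and calls.count("b") == 1:
        raise RuntimeError("interrupted")
    return batch.upper()


store = {}
work = [(1, "a"), (2, "b")]
try:
    run_with_checkpoints(work, flaky, store)
except RuntimeError:
    pass                                          # first run is interrupted
saved_after_crash = dict(store)
resumed = run_with_checkpoints(work, flaky, store)
print(saved_after_crash, resumed, calls)
```

On the second run batch `a` is not processed again, which is the point of keeping the checkpoint until the whole stage succeeds.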
Wang Qi	93e4f6bc09	Fix: Add bge as embedding (#15784 )	2026-06-09 09:31:24 +08:00
Yash Raj Pandey	f2aadd3871	Fix: is_english() returns False for any list argument (broken language detection) (#15489 ) ### What problem does this PR solve? `is_english()` in `rag/nlp/__init__.py` compiles a single-character regex class and `fullmatch`es it against each item: ```python pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]") # no quantifier ... eng = sum(1 for t in texts if pattern.fullmatch(t.strip())) ``` For a string argument the text is first split into single characters (`texts = list(texts)`), so each `fullmatch` sees one character and works. But for a list argument each item is a whole multi-character string, and `fullmatch` of a one-character pattern against a multi-character string always fails — so `is_english()` returns `False` for any list, regardless of content. ```python is_english("This is English") # True (ok) is_english(["The quick brown fox jumps.", "Hello world."]) # False (bug — should be True) is_english(["这是中文。"]) # False (right answer, wrong reason) ``` Many call sites pass lists and were therefore silently always-`False`, e.g.: - `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` — `is_english([ans])` when an answer is truncated at `max_tokens`, so an English reply gets the Chinese "······由于长度的原因，回答被截断了，要继续吗？" continuation suffix instead of the English one. - `rag/app/book.py` — `remove_contents_table(..., eng=is_english([...sections...]))`, so English books have their contents table stripped in Chinese mode. - `common/doc_store/es_conn_base.py:339`, `rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in highlight handling. - plus `rag/app/qa.py`, `rag/flow/parser/utils.py`, `common/doc_store/infinity_conn_base.py`. ### Fix Add a `+` quantifier so an all-English multi-character item matches: ```python pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+") ``` The string path is unchanged (single characters still match) and non-English lists still return `False`. 
Adds `test/unit_test/rag/test_is_english.py`; the two list cases fail before this change and pass after. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Used the Claude CLI while working on this.	2026-06-08 20:25:23 +08:00
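The quantifier bug described in f2aadd3871 can be reproduced in isolation. This is a minimal sketch: the character class is copied from the commit message, while `count_matches` is a hypothetical helper standing in for the counting step and is not the real `is_english()` body.

```python
import re

# Character class as quoted in the commit message.
CHARS = r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]"
old_pattern = re.compile(CHARS)        # no quantifier: matches exactly one character
new_pattern = re.compile(CHARS + "+")  # with "+": matches one or more characters


def count_matches(items, pattern):
    """Hypothetical helper: count items that fully match the pattern."""
    return sum(1 for t in items if pattern.fullmatch(t.strip()))


english_list = ["The quick brown fox jumps.", "Hello world."]
chinese_list = ["这是中文。"]
single_chars = list("abc")  # the string path splits text into characters

old_english = count_matches(english_list, old_pattern)  # 0: whole strings never match
new_english = count_matches(english_list, new_pattern)  # 2: both items match
```

Single characters match under both patterns and the Chinese item matches under neither, which is consistent with the commit's claim that the string path and non-English lists are unaffected.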
Lynn	b9f06e6095	Feat: model list (#15774 ) ### What problem does this PR solve? Support model list for VolcEngine. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-08 20:18:00 +08:00
Wang Qi	c5d0060e0b	Delete not supported model providers list (#15783 )	2026-06-08 20:06:03 +08:00
Wang Qi	8e4fba6cd2	Fix OpenRouter key JSONDecodeError (#15776 )	2026-06-08 19:19:10 +08:00
euvre	d9a04ef702	fix: support auto mode in table parser document metadata aggregation (#15780 ) ### What problem does this PR solve? Table parser metadata aggregation previously only ran when `table_column_mode` was set to `manual`. In auto mode (default), all columns default to `"both"` role, meaning they should also be aggregated into document-level metadata for UI/chat filters. Additionally, the task snapshot could be stale — `table_column_names` are written to KB `parser_config` during `chunk()` but the task may have been created before that. Changes: - Renames `aggregate_table_manual_doc_metadata` → `aggregate_table_doc_metadata` - Supports both `"manual"` and `"auto"` `table_column_mode` (defaults to `"auto"`) - Reloads `table_column_names` from KB DB when missing from task snapshot - Removes the manual-only guard in `task_executor` and refactored `post_processor` - Updates all tests with new function name and adds auto mode test cases ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 19:08:23 +08:00
euvre	2c64febc93	feat: add ModelMeta implementations for Xinference, LocalAI, BaiduYiyan, and Tencent Cloud (#15752 ) ### What problem does this PR solve? This PR adds `ModelMeta` implementations for four additional LLM/RAG ecosystem platforms, building on the ModelMeta infrastructure introduced in #15711. Currently, only `Ollama` and `VolcEngine` have `ModelMeta` classes that enable remote model list fetching. This PR extends that support to four more platforms. ### Changes Added four new `ModelMeta` subclasses in `rag/llm/model_meta.py`:

| Platform | `_FACTORY_NAME` | Has model list | Has full model info | Approach |
|----------|-----------------|----------------|---------------------|----------|
| Xinference | `"Xinference"` | ✅ | ✅ | Parses `model_type` and `context_length` from `/v1/models` response. Maps 6 model types (LLM/embedding/rerank/image/TTS/speech2text). |
| LocalAI | `"LocalAI"` | ✅ | ✅ | Uses Ollama-compatible `GET /api/tags` + `POST /api/show` endpoints. Returns capabilities (completion/embedding/vision/tools/thinking) and `general.context_length`. |
| BaiduYiyan | `"BaiduYiyan"` | ✅ | ✅ | Uses Qianfan SDK static model catalog + `get_model_info()` for `max_input_tokens`. Returns 60 models (56 chat + 4 embedding) with real context lengths. |
| Tencent Cloud | `"Tencent Cloud"` | ❌ | ❌ | `NotImplementedError`: uses SDK-based SID/SK HMAC signing, no model list REST API available. |

All classes are automatically discovered and registered via the existing `__init__.py` mechanism, so no additional configuration is needed. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-08 19:05:25 +08:00
天海蒼灆	17f27b9df2	fix(browser): show resolved variables in workflow run log input (#15325 ) ### What problem does this PR solve? Browser parsed sys.query from prompts but never called set_input_value, so node_finished inputs displayed null in the agent orchestration run log. Additionally, Browser’s tenant-model path could trigger unsupported structured-output modes (response_format/tool_choice) for some OpenAI-compatible providers (notably DeepSeek thinking models), causing step failures. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-08 18:12:56 +08:00
Rintaro	453ade288c	fix(opensearch): keep "id" in _source on insert so document metadata isn't empty (#15473 ) ### What problem does this PR solve? Follow-up to #15393. After #15393 fixed the OpenSearch `search()` signature and the doc-meta mapping, document metadata still renders as "0 fields" for every document on the OpenSearch backend (`DOC_ENGINE=opensearch`). Root cause. `OSConnection.insert()` pops `id` out of the document before indexing: meta_id = d_copy.pop("id", "") # id used as _id, then DROPPED from _source so the stored `_source` never contains an `id` field. But the doc-meta read path filters and sorts on that field: - `DocMetadataService.get_metadata_for_documents()` builds `condition = {"kb_id": kb_id, "id": doc_ids}` -> `OSConnection.search()` emits `Q("terms", id=doc_ids)` (a term query on the `id` field), and - `_search_metadata()` sorts with `order_by.asc("id")`. With `id` absent from `_source`, the terms filter matches nothing, so `get_metadata_for_documents()` returns an empty map and the UI shows "0 fields" -- even though the metadata was written correctly (it is visible via a kb_id-only query). `ESConnection.insert()` already keeps `id` (`d_copy.get("id", "")`) with the comment "also keep 'id' as a regular field for sorting". This is a plain OpenSearch-only divergence (`pop()` vs `get()`). ### Fix Mirror Elasticsearch: use `get("id")` instead of `pop("id")` so `id` survives in `_source`. The doc-meta mapping already declares `id` as `keyword`, so the field is searchable/sortable once populated. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. Elasticsearch already keeps `id`; Infinity / OceanBase unaffected. ### How to reproduce 1. `DOC_ENGINE=opensearch`, create a KB, upload/parse a document, set metadata. 2. Open the document list -> every document shows "0 fields" (the metadata exists in the `ragflow_doc_meta_` index but its `_source` has no `id` field). 
### Risk & backward compatibility `insert()` is shared with the main chunk index; keeping `id` in `_source` brings OpenSearch in line with Elasticsearch (which already does this), so it is parity, not new behavior. No default / ES / Infinity / OceanBase behavior change. Note: affects new inserts only. Existing `ragflow_doc_meta_` indices created before this change have no `id` in `_source`; re-sync metadata, or backfill once with `_update_by_query` (`ctx._source.id = ctx._id`). ### Test plan - [ ] OpenSearch: after the fix the document list shows correct metadata field counts (not "0 fields"); metadata filter/sort by id works. - [ ] Elasticsearch regression: unchanged.	2026-06-08 17:31:04 +08:00
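The `pop()` vs `get()` divergence described in 453ade288c comes down to whether `id` survives in the stored document. The sketch below uses plain dictionaries only; `split_old` and `split_new` are hypothetical names, and no OpenSearch client is involved.

```python
def split_old(doc):
    """Old OpenSearch behaviour as described: `id` is popped, so it is absent from _source."""
    d_copy = dict(doc)
    meta_id = d_copy.pop("id", "")
    return meta_id, d_copy


def split_new(doc):
    """Fixed behaviour: `id` is read with get(), so it stays in _source."""
    d_copy = dict(doc)
    meta_id = d_copy.get("id", "")
    return meta_id, d_copy


doc = {"id": "doc-1", "kb_id": "kb-1", "meta_fields": {"author": "x"}}

old_id, old_source = split_old(doc)
new_id, new_source = split_new(doc)

# Simulate a terms filter on the `id` field against the stored _source.
wanted = {"doc-1"}
old_hits = [s for s in [old_source] if s.get("id") in wanted]  # empty: field is missing
new_hits = [s for s in [new_source] if s.get("id") in wanted]  # one hit
```

Both variants return the same `_id`; only the stored source differs, which is why the metadata was written correctly yet could not be found by an `id` filter.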
seekmistar01	68b9360536	fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721 ) ## Summary Closes #15720 `FulltextQueryer.paragraph` normalized its `content_tks` token string with `[c.strip() for c in content_tks.strip() ...]`, which iterates the string character by character — `"machine learning model"` becomes 20 single characters instead of 3 tokens. Those single chars are fed to `tw.weights(..., preprocess=False)`, producing meaningless term weights and a garbage `MatchTextExpr`. `paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature), so tag retrieval/scoring is silently broken for tag-enabled knowledge bases. Every other method in this file tokenizes with `.split()` — this is a `.strip()`-vs-`.split()` typo. ## Change - `rag/nlp/query.py` — change `content_tks.strip()` to `content_tks.split()` in the `paragraph` token-normalization line. ## Why it's safe - The caller passes a space-separated token string; `.split()` recovers the real tokens, matching the contract of `tw.weights` and the `.split()` tokenization used by the sibling methods (`similarity`, `question`). - No behavior depends on the per-character expansion. ## Verification - `python -m py_compile rag/nlp/query.py` — OK. - Demonstrated: `"machine learning model"` → 20 single-character entries before, 3 real tokens after. No test references `paragraph`. Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:16:30 +08:00
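The `.strip()` vs `.split()` difference in 68b9360536 is easy to check directly. This is a sketch only: the commit elides the tail of the list comprehension, so the `if c.strip()` filter used here is an assumption, chosen because it reproduces the counts the commit reports.

```python
content_tks = "machine learning model"

# Iterating over a string yields characters, so .strip() gives per-character "tokens".
# The `if c.strip()` filter is an assumption; the original comprehension is elided.
per_char = [c.strip() for c in content_tks.strip() if c.strip()]

# .split() yields whitespace-separated tokens, which is what the term weighter expects.
per_token = [c.strip() for c in content_tks.split() if c.strip()]

char_count = len(per_char)    # 20 single-character entries
token_count = len(per_token)  # 3 real tokens
```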
Wang Qi	4bbd59823a	Add OpenRouter OpenAI API compatible list models (#15764 ) Add OpenRouter OpenAI API compatible list models 1. openrouter 2. OpenAI API compatible 3. VLLM 4. LM Studio [Screenshots: Open Router, VLLM]	2026-06-08 16:42:17 +08:00
Danut Matei	e2b0da9eea	fix(opensearch): keep the BM25 leg in hybrid search (#15760 ) ### What problem does this PR solve? Fixes the OpenSearch side of #10747: hybrid search drops the keyword (BM25) leg and ends up doing plain vector search. When a search has both a text and a vector leg, `OSConnection.search()` throws the text query away: del q["query"] q["query"] = {"knn": knn_query} The text clause only stays on as a filter inside the knn query, so it narrows the candidate set but doesn't count towards scoring. So hybrid search on OpenSearch behaves like plain vector search, unlike the Elasticsearch backend. What I changed: - when both legs are present, send a real hybrid query `{"hybrid": {"queries": [bm25, {"knn": ...}]}}` and let a normalization-processor search pipeline score and combine the two legs - only the actual filters (kb_id, available_int, ...) go in the knn filter, not the text must clause - create the pipeline on startup if it's missing, so there's no separate provisioning step. name and weights can be set under `os:` in service_conf.yaml, or via `OS_HYBRID_PIPELINE`; defaults are `ragflow_hybrid_pipeline` and `[0.5, 0.5]` - normalization-processor needs OpenSearch 2.10+. on older clusters, or when the pipeline can't be created, log a warning and fall back to vector-only instead of pointing at a pipeline that doesn't exist This is only the hybrid-search fix; `create_doc_meta_idx` is already on main. Testing (there's no OpenSearch path in CI): added a unit test (`test/unit_test/rag/utils/test_opensearch_hybrid_search.py`, no services needed) that checks the query built in each case — hybrid + pipeline param for text+vector, plain knn for vector-only, plain bool for text-only, the knn filter never carrying the text query_string, and the vector-only fallback when the pipeline isn't available. 
Also ran it against a real OpenSearch 2.19.1 container with a doc that matches the keyword but sits outside the knn top-k: pure knn returns `['D1','D2','D5']` (keyword doc missing), the hybrid query returns `['A','D1','D2','D5']` (keyword doc present). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Danut Matei <matei.danut.dm@gmail.com>	2026-06-08 16:17:47 +08:00
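The request shape described in e2b0da9eea can be sketched as plain dictionaries. Everything named here is illustrative: `build_hybrid_body`, `build_pipeline_body`, and the `embedding` field are hypothetical, and the `min_max` / `arithmetic_mean` techniques are assumptions, since the commit only states that a normalization-processor pipeline with weights `[0.5, 0.5]` is used.

```python
def build_hybrid_body(bm25_query, vector, k, filters, field="embedding"):
    """Hybrid query: BM25 leg plus knn leg; only real filters go inside the knn clause."""
    knn_query = {
        field: {
            "vector": vector,
            "k": k,
            "filter": {"bool": {"filter": filters}},
        }
    }
    return {"query": {"hybrid": {"queries": [bm25_query, {"knn": knn_query}]}}}


def build_pipeline_body(weights=(0.5, 0.5)):
    """Search pipeline body with a normalization-processor (techniques are assumed)."""
    return {
        "description": "Hybrid score normalization (illustrative)",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
    }


bm25 = {"query_string": {"query": "fox", "fields": ["content"]}}
filters = [{"term": {"kb_id": "kb-1"}}]
body = build_hybrid_body(bm25, [0.1, 0.2, 0.3], 3, filters)
pipeline = build_pipeline_body()
```

The point of the structure is that the text query appears as its own scored leg, while the knn filter carries only the `kb_id`-style restrictions.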
buua436	6bf7056422	feat: add placeholder model metas (#15753 ) ### What problem does this PR solve? add placeholder model metas ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 14:54:59 +08:00
cleanjunc	38f9ea5fec	fix(rerank): normalize reranker scores onto a single scale before hybrid blend (#15429 ) ### What problem does this PR solve? Closes #15428 The hybrid score in `rag/nlp/search.py` (`rerank_by_model`) blends reranker similarity with token similarity on a fixed `[0, 1]` scale: ```python return tkweight * np.array(tksim) + vtweight * vtsim + rank_fea # tkweight=0.3, vtweight=0.7 ``` The reranker implementations did not agree on that scale. Only three of roughly 17 providers normalized their output, and `NvidiaRerank` returned raw, unbounded logits. Weighted at `0.7`, a negative logit could push a genuinely relevant chunk below pure keyword matches, and its magnitude swamped `tksim`, which lives in `[0, 1]`. The practical effect was that the same query produced differently scaled scores depending on the configured reranker, and logit based providers degraded retrieval quality instead of improving it. This PR enforces a single scoring contract in one place: - `Base.similarity` is now the only public entry point. It short-circuits empty input and guarantees a normalized result. Each provider implements its raw scoring in `_compute_rank`, which removes sixteen duplicated empty input guards and the three scattered normalization calls. - Normalization is range aware. Providers that already return calibrated `[0, 1]` relevance scores (Cohere, Jina, Voyage, and others) keep their absolute magnitudes, so `similarity_threshold` filtering and the reported `vector_similarity` stay meaningful. Only out-of-range output such as NVIDIA logits is min-max rescaled into `[0, 1]`. - The twelve leftover `[DEBUG ...]` prints in `rerank_by_model`, introduced in #14231, are removed. They ran on every retrieval, added per chunk overhead, and leaked queries, keywords, and document content to stdout and logs. 
A new regression suite in `test/unit_test/rag/llm/test_rerank_normalization.py` covers logit rescaling (positive, negative, and flat batches), preservation of already calibrated scores, ordering, empty input handling, and the per provider HTTP path. It also asserts that no provider overrides `similarity()`, so the contract cannot silently drift. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 11:53:22 +08:00
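The range-aware normalization described in 38f9ea5fec can be sketched as follows. `normalize_scores` is a hypothetical stand-in, not the code in `Base.similarity`, and mapping a flat out-of-range batch to zeros is an assumption: the commit says flat batches are tested but does not state the resulting values.

```python
def normalize_scores(scores):
    """Keep calibrated [0, 1] scores as-is; min-max rescale anything out of range."""
    if not scores:
        return []  # short-circuit empty input
    lo, hi = min(scores), max(scores)
    if 0.0 <= lo and hi <= 1.0:
        return list(scores)  # already calibrated: preserve absolute magnitudes
    if hi == lo:
        return [0.0 for _ in scores]  # assumption: flat out-of-range batch maps to 0.0
    return [(s - lo) / (hi - lo) for s in scores]


calibrated = normalize_scores([0.2, 0.9])    # unchanged
logits = normalize_scores([-4.0, 0.0, 4.0])  # rescaled into [0, 1]


def hybrid_score(tksim, vtsim, tkweight=0.3, vtweight=0.7):
    """Blend from the commit message, without the rank feature term."""
    return tkweight * tksim + vtweight * vtsim


raw_blend = hybrid_score(0.5, -4.0)        # a negative logit drags the score below zero
norm_blend = hybrid_score(0.5, logits[0])  # after rescaling the blend stays in [0, 1]
```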
cleanjunc	91983106f2	fix(retrieval): keep rerank window aligned to page_size for deep pagination (#15434 ) ### What problem does this PR solve? Closes #15433 Reranked retrieval drops results and returns short pages once pagination crosses the first candidate block, for the common page sizes 10 and 30. In `rag/nlp/search.py`, the candidate window (`RERANK_LIMIT`) is rounded up to a multiple of `page_size` to keep block based pagination aligned, and then clamped back to 64: ```python RERANK_LIMIT = math.ceil(64 / page_size) * page_size if page_size > 1 else 1 # e.g. 70 for page_size=10 RERANK_LIMIT = max(30, RERANK_LIMIT) if rerank_mdl and top > 0: RERANK_LIMIT = min(RERANK_LIMIT, top, 64) # clamps back to 64, breaking the multiple ``` `RERANK_LIMIT` is used both as the backend block size (`page = global_offset // RERANK_LIMIT`) and as the modulus that slices a page out of a reranked block (`begin = global_offset % RERANK_LIMIT`). When it stops being a multiple of `page_size`, the block that gets fetched and the slice taken from it no longer agree. With `page_size=10` and `top=1024`, page 7 returns only 4 of 10 results and the head of the next block is never shown on any page. This happens whenever the result set spans more than one block, which is the default. Fix The window math is moved into a small reusable helper, `Dealer._rerank_window`, which: - targets a pool of about 64 candidates, - bounds it by `top` when a reranker is active, and - always rounds to a whole number of pages, so the window stays an exact multiple of `page_size`. The call site becomes a single line, and the alignment invariant now lives in one documented place. Behavior is unchanged on every path that was already aligned (the non reranked path and any `top` that already produced a page multiple). 
Verification A simulation of the full retrieval path (per block rerank, similarity threshold filter, and the exact `page // window` and `offset % window` math) confirms the fix loses nothing where the old code lost real results: ``` ps=10 top=1024: new window=70 dropped_valid=0 \| old window=64 dropped_valid=16 ps=30 top=1024: new window=90 dropped_valid=0 \| old window=64 dropped_valid=66 ``` New unit tests in `test/unit_test/rag/test_search_pagination.py` cover the alignment invariant, cross block pagination (every candidate surfaced once, in order, no gaps, no short interior pages), the reported regression, and parity with the old window on the previously correct paths. All 114 cases pass and `ruff check` is clean. Fixes the reranked deep pagination data loss described above. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 11:53:12 +08:00
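The alignment problem in 91983106f2 can be checked with a few lines of arithmetic. `old_window` mirrors the snippet quoted in the commit message; `aligned_window` is a hypothetical sketch of the stated rules (target about 64, bound by `top` when a reranker is active, round up to whole pages) and is not the actual `Dealer._rerank_window`.

```python
import math


def old_window(page_size, top, has_reranker):
    """Window math as quoted in the commit message."""
    limit = math.ceil(64 / page_size) * page_size if page_size > 1 else 1
    limit = max(30, limit)
    if has_reranker and top > 0:
        limit = min(limit, top, 64)  # clamp breaks the page_size multiple
    return limit


def aligned_window(page_size, top, has_reranker, target=64):
    """Hypothetical sketch: bound the pool first, then round up to whole pages."""
    page_size = max(1, page_size)
    pool = target
    if has_reranker and top > 0:
        pool = min(pool, top)
    pages = max(1, math.ceil(pool / page_size))
    return pages * page_size


old_10 = old_window(10, 1024, True)      # 64, not a multiple of 10
new_10 = aligned_window(10, 1024, True)  # 70
new_30 = aligned_window(30, 1024, True)  # 90
leftover = old_10 % 10                   # 4 results on the short page
```

With the old window, the last page sliced from a 64-item block holds only `64 % 10 = 4` items, matching the short page 7 reported in the commit; the aligned windows of 70 and 90 match the values in the verification output.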

1547 Commits