### What problem does this PR solve?
feat(file): Add file ancestor directory lookup feature by go
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
refactor: Remove knowledge base-related API handlers that are already
included in the dataset.
### Type of change
- [x] Refactoring
## Summary
- Replace `json.load(open(...))` with `with open(...) as f:
json.load(f)` in 2 resume parser files
- Fixes 4 leaked file descriptors in `corporations.py` (3) and
`schools.py` (1)
## Why
In a long-running server process like RAGFlow, leaked file handles can
accumulate and hit the OS file descriptor limit (`OSError: [Errno 24]
Too many open files`). The other instances mentioned in the issue
(`infinity_conn_base.py` and `init_data.py`) have already been fixed.
## Test plan
- [x] Verified affected files use `with` statement after fix
- [x] Grep confirms no remaining `json.load(open(` patterns in codebase
Fixes#13996🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What problem does this PR solve?
This fixes rerank overflow where retrieval could send more documents
than allowed (for example 66 when `page_size=6`), causing provider 400
errors and bypassing the user’s `top_k` intent in rerank-enabled paths.
this pr fixes#14081
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
fix issue with stale tests on p3 level
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: The indented tree text generated on the search page overlaps.
#14077
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Feat: Hide the download button embedded in the agent page.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Summary
When setting a default model for an OpenAI-API-Compatible provider,
ensure_tenant_model_id_for_params called get_api_key
without a model_type filter. If the same model name was registered under
multiple types (e.g., both chat and embedding),
it could return the wrong tenant_llm_id, leading to Model(@None) not
authorized errors during chat.
This applies the same type-scoped fix that PR #13569 introduced in
get_model_config_by_type_and_name — now consistently
in tenant_utils.py as well.
Changes
- Added _KEY_TO_MODEL_TYPE mapping in tenant_utils.py
- Each model key (llm_id, embd_id, etc.) now passes its correct LLMType
to get_api_key
Fixes#13775
### What problem does this PR solve?
- Implemented a helper function to convert markdown cell text to native
numeric types for Excel output.
- Ensured that leading zeros are preserved and handled various numeric
formats, including those with thousand separators and scientific
notation.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Closes#13907
The template catalog had duplicate files (e.g. `*_r.json`) only to place
the same template into multiple sidebar groups.
This increases maintenance cost and makes template updates error-prone.
This PR adds first-class support for multiple template categories in a
single file via `canvas_types`, then removes duplicate template files.
What changed:
- Added `canvas_types` to `CanvasTemplate` model and DB migration.
- Added normalization logic when loading templates:
- accepts legacy `canvas_type`
- accepts new `canvas_types`
- merges/deduplicates values
- preserves backward compatibility by keeping `canvas_type` as first
normalized value.
- Updated template import flow to load only `.json` files and in stable
sorted order.
- Updated frontend template filtering to match on `canvas_types` first,
with fallback to legacy `canvas_type`.
- Consolidated duplicated template pairs into single files and removed:
- `deep_search_r.json`
- `reflective_academic_paper_generator_r.json`
- `seo_article_writer_r.json`
- Added regression/edge-case tests for category normalization and route
serialization expectations.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Fix: The chat page is not displaying the meta tags.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Upgrades Apache Tika from 3.2.3 to 3.3.0 to address the security
vulnerability GHSA-72hv-8253-57qq (TIKA-4687).
Closes#13601
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Changes
- `Dockerfile`: Updated tika JAR filename and `TIKA_SERVER_JAR` env var
from 3.2.3 to 3.3.0
- `Dockerfile.deps`: Updated tika JAR filename in COPY instruction from
3.2.3 to 3.3.0
- `download_deps.py`: Updated both Maven Central and Huawei Cloud mirror
download URLs from 3.2.3 to 3.3.0
### References
- Apache Tika 3.3.0 release:
https://www.apache.org/dyn/closer.lua/tika/3.3.0/tika-app-3.3.0.jar
- TIKA-4687: https://issues.apache.org/jira/browse/TIKA-4687
- GHSA-72hv-8253-57qq
### What problem does this PR solve?
Update search
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Sandbox cannot accept large args list.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Consolidate "set_meta" API into "update_document" .
Before consolidation
Web API: POST /api/v1/document/set_meta
Http API - PUT /v1/datasets/<dataset_id>/document/<document_id>
After consolidation, Restful API -- PUT
/v1/datasets/<dataset_id>/document/<document_id>
### Type of change
- [x] Refactoring
Close#14018
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Problem
In Agent applications, even with the cite option enabled, only inline
[ID: x] citation markers are visible (showing chunk content on hover).
The Agent does not display the referenced file cards below the response,
unlike Chat applications.
### Root Cause
The Agent's Retrieval tool (agent/tools/retrieval.py) calls
retriever.retrieval() with aggs=False, which means the retrieval results
do not include doc_aggs (document aggregation) data. Without doc_aggs,
the frontend ReferenceDocumentList component has no data to render the
file cards.
In contrast, the Chat application (api/db/services/dialog_service.py)
calls the same retriever.retrieval() method with aggs=True.
### Fix
Changed aggs=False to aggs=True in agent/tools/retrieval.py so that
document aggregation data is returned along with the retrieved chunks.
### What problem does this PR solve?
Fix: When creating a dataset, if no `chunk_method` is selected, there is
no indication that this is a required field.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Consolidation WEB API & HTTP API for document metadata summary
Before consolidation
Web API: POST /api/v1/document/metadata/summary
Http API - GET /v1/datasets/<dataset_id>/metadata/summary
After consolidation, Restful API -- GET
/v1/datasets/<dataset_id>/metadata/summary
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fix: The dataset on the search page is not displaying the required field
error message.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Visit
`http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js`
will expose the flask code:
```
docReady(() => {
if (!EVALEX_TRUSTED) {
initPinBox();
}
// if we are in console mode, show the console.
if (CONSOLE_MODE && EVALEX) {
createInteractiveConsole();
}
const frames = document.querySelectorAll("div.traceback div.frame");
if (EVALEX) {
addConsoleIconToFrames(frames);
}
addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () =>
document.querySelector("div.traceback").scrollIntoView(false)
);
addToggleFrameTraceback(frames);
addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback"));
addInfoPrompt(document.querySelectorAll("span.nojavascript"));
wrapPlainTraceback();
});
function addToggleFrameTraceback(frames) {
frames.forEach((frame) => {
frame.addEventListener("click", () => {
frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded");
});
})
}
```
### Type of change
- [x] Other (please describe): Fix security risk
### What problem does this PR solve?
As title.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: pipeline support ONE chunking method
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
As title
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
## Summary
Fixes#13996
Replace `json.load(open(...))` with `with open(...) as f: json.load(f)`
in two files to ensure file descriptors are properly closed.
**Affected files:**
- `common/doc_store/infinity_conn_base.py` — schema loading for Infinity
doc store
- `api/db/init_data.py` — agent template loading at startup
## Why this matters
In a long-running server process like RAGFlow, leaked file descriptors
from `json.load(open(...))` can accumulate over time. While CPython's
refcounting usually cleans these up, it's not guaranteed (especially
under memory pressure or with alternative Python runtimes), and can lead
to `OSError: [Errno 24] Too many open files`.
## Test plan
- [ ] Verify Infinity doc store schema loading still works correctly
- [ ] Verify agent templates load correctly on startup
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Improved file handling in internal data processing to ensure proper
resource cleanup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: easonysliu <easonysliu@tencent.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What problem does this PR solve?
feat: Implement file-related functionality
- Implement file deletion API and business logic
- Add context support for file deletion operations and prevent root
folder deletion
- Implement file move functionality
- Add File Download API Endpoints and Utility Functions
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Closes https://github.com/infiniflow/ragflow/issues/13939
## What problem does this PR solve?
The Google Drive connector fails to detect new files after the initial
sync (#13939). The root cause is that `generate_time_range_filter()`
applies a strict `modifiedTime > poll_range_start` cutoff when querying
the Google Drive API. Files uploaded to Google Drive that retain their
original `modifiedTime` (common behavior) get silently excluded if their
timestamp predates the last sync's cutoff.
Unlike the Confluence and Jira connectors which use a configurable time
buffer (`CONFLUENCE_SYNC_TIME_BUFFER_SECONDS`) to offset
`poll_range_start` backward, the Google Drive connector had no such
mechanism — resulting in a razor-sharp timestamp boundary with zero
tolerance for overlap.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
* **New Features**
* Added a configurable time buffer for Google Drive synchronization to
address timing delays and improve sync reliability.
* Improved file detection logic to include recently created files
alongside modified ones, reducing missed synchronizations.
### What problem does this PR solve?
This PR fixes a mismatch between the MCP retrieval contract and the
backend retrieval API.
`ragflow_retrieval` already describes `dataset_ids` as optional, but
`/api/v1/retrieval` still rejected omitted or empty `dataset_ids` with
`` `dataset_ids` is required. ``. That made MCP retrieval fail even
though the tool schema promised that the request could search across all
available datasets.
This change updates `/api/v1/retrieval` to accept missing or empty
`dataset_ids`, resolve all accessible datasets for the authenticated
user, and keep the route schema aligned with the new runtime behavior.
It also adds focused unit coverage for the fallback resolution path and
the no-accessible-datasets case.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Fixes: #13981
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Improved dataset resolution to reliably discover all accessible
datasets through proper pagination, replacing the previous parsing
method.
* Enhanced error handling with clearer messaging when no datasets are
available for retrieval.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
### What problem does this PR solve?
As title.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix: The knowledge base selected by the retrieval node is not displayed.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: support vlm fall back in pipeline for img/table parsing
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
As title
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
GraphRAG _async_chat.
### Type of change
- [x] Refactoring
- [x] Performance Improvement
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Unified chat calls to an async invocation across extractors, improving
timeout handling and ensuring task IDs propagate reliably.
* **Tests**
* Added and expanded unit tests and mocks to cover extractor behavior,
timeout scenarios, and safe test-package imports, reducing regression
risk.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Fixes#13823
## Problem
When querying with words like `cat`, RAGFlow's query expansion system
looks up synonyms via WordNet, which can return terms containing single
quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document
store, these unescaped single quotes in the query string cause a
`TokenError` because Infinity's lexer treats `'` as a string delimiter.
```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```
## Solution
Strip single quotes from synonym terms before they are inserted into
query expressions, consistent with how single quotes are already
stripped from the input query text (line 51 of `query.py`):
- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before
escaping other special characters. This fixes the Chinese text
processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from
tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English
text path.
## Testing
The fix can be verified by:
1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword
`cat`
3. Confirming no `TokenError` is raised and results are returned
normally
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Enhanced special character handling in query processing and synonym
expansion by properly sanitizing single quotes before text processing.
* Simplified OCR detection output by removing timing metadata while
preserving core detection accuracy.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: ximi <octo-patch@github.com>
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
* Improved authentication error logging to better distinguish between
JWT and API token failures.
* Enhanced code documentation with clarifying comments for better
maintainability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: Integrate the name, avatar, and description of chat and search
into a single component.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Inline-editable avatar, name, and description fields
* Expandable content blocks in search results
* New RAGFlow heading/logo component
* **Refactor**
* Replaced scattered form fields with a composed Avatar/Name/Description
component
* Mindmap drawer converted to a sheet-based drawer and layout cleanup
* Simplified search page controls and layout; improved scroll viewport
handling
* **Chores**
* Added/updated English and Chinese localization keys (placeholders,
view more/less)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>