Commit Graph

1348 Commits

Author SHA1 Message Date
Yongteng Lei
b33d2fdea5 Refa: GraphRAG to use async chat methods instead of thread pool execution (#14002)
### What problem does this PR solve?

GraphRAG _async_chat.

### Type of change

- [x] Refactoring
- [x] Performance Improvement


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Unified chat calls to an async invocation across extractors, improving
timeout handling and ensuring task IDs propagate reliably.
* **Tests**
* Added and expanded unit tests and mocks to cover extractor behavior,
timeout scenarios, and safe test-package imports, reducing regression
risk.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-04-09 19:57:35 +08:00
Octopus
c2ce49e037 fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)
Fixes #13823

## Problem

When querying with words like `cat`, RAGFlow's query expansion system
looks up synonyms via WordNet, which can return terms containing single
quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document
store, these unescaped single quotes in the query string cause a
`TokenError` because Infinity's lexer treats `'` as a string delimiter.

```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```

## Solution

Strip single quotes from synonym terms before they are inserted into
query expressions, consistent with how single quotes are already
stripped from the input query text (line 51 of `query.py`):

- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before
escaping other special characters. This fixes the Chinese text
processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from
tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English
text path.

## Testing

The fix can be verified by:
1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword
`cat`
3. Confirming no `TokenError` is raised and results are returned
normally

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced special character handling in query processing and synonym
expansion by properly sanitizing single quotes before text processing.
* Simplified OCR detection output by removing timing metadata while
preserving core detection accuracy.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ximi <octo-patch@github.com>
2026-04-09 19:10:34 +08:00
Zhichang Yu
b7744e053e fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972)
fix: support dense_vector from ES fields response (ES 9.x compatibility)

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Configuration Chore (non-breaking change which updates
configuration)


## Summary by CodeRabbit

* **Bug Fixes**
* More accurate handling and unwrapping of dense-vector fields so
returned values have correct shapes.
* Field selection reliably limits returned data and falls back to
alternate result locations when needed.
* Use of consistent result IDs and tolerant handling when score values
are missing.

* **Chores / Configuration**
* Increased build memory and adjusted build-time flags for the frontend
build.
* Simplified runtime model/GPU checks and removed an automated runtime
GPU-install attempt.

* **Build Fixes**
* `web/vite.config.ts`: make `build.minify` and `build.sourcemap`
respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from
Dockerfile instead of hardcoding `terser` and `true`.

* **Environment**
* Allow stack version override and default the runtime image tag to
"latest".

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Correct unwrapping of dense-vector fields and reliable field selection
with fallback locations.
* Consistent use of hit-level IDs and tolerant handling when score
values are missing.

* **Chores / Configuration**
* Increased frontend build memory and added build-time minify/sourcemap
flags; build minification and sourcemap now configurable.
* Removed runtime GPU detection for model initialization; force CPU
initialization.

* **Environment**
* Allow stack version override and default runtime image tag to
"latest".

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 17:44:13 +08:00
Magicbook1108
107fe6cf90 Feat: support doc for pipeline parser in word (#14005)
### What problem does this PR solve?

Feat: support doc for pipeline parser in word

### Type of change

- [x] New Feature (non-breaking change which adds functionality)



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added support for processing legacy Word `.doc` file formats,
extending document compatibility.

* **Bug Fixes**
* Enhanced error handling during document parsing to improve reliability
and prevent processing failures.
2026-04-09 16:40:42 +08:00
Magicbook1108
8d52ef2893 Feat: enable sync deleted files for connector (#14000)
### What problem does this PR solve?

Feat: enable sync deleted files for connector
1. first comes with github

### Type of change

- [x] New Feature (non-breaking change which adds functionality)



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added "sync deleted files" feature for data sources, enabling
automatic removal of files deleted from the source system.
* Added multilingual support for the new sync deleted files setting
across multiple languages.

* **UI Improvements**
  * Improved checkbox form field rendering and layout.
  * Enhanced full-width display for authentication token input fields.
2026-04-09 16:40:14 +08:00
MkDev11
cfee2bc9db feat: Auto-adjust chunk recall weights based on user feedback (#12689)
### What problem does this PR solve?

Implements automatic adjustment of knowledge base chunk recall weights
based on user feedback (upvotes/downvotes). When users upvote or
downvote a response, the system locates the corresponding knowledge
snippets and adjusts their recall weight to improve future retrieval
quality.

**Closes #12670**

**How it works:**
1. User upvotes/downvotes a response via `POST /thumbup`
2. System extracts chunk IDs from the conversation reference
3. For each referenced chunk:
   - Reads current `pagerank_fea` value from document store
   - Increments (+1) for upvote or decrements (-1) for downvote
   - Clamps weight to [0, 100] range
   - Updates chunk in ES/Infinity/OceanBase
4. Future retrievals score these chunks higher/lower based on
accumulated feedback

**Files changed:**
- `api/db/services/chunk_feedback_service.py` - New service for updating
chunk pagerank weights
- `api/apps/conversation_app.py` - Integrated feedback service into
thumbup endpoint
- `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests

### Type of change

- [x] New Feature (non-breaking change which adds functionality)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Chat message feedback now updates per-chunk relevance weights
(feature-flag gated), with configurable weighting and atomic updates
across storage backends.

* **Bug Fixes**
* Stricter validation for message feedback inputs and more robust
handling of feedback transitions.

* **Tests**
* Expanded test coverage for chunk-feedback behavior, weighting
strategies, storage backends, and thumb-flip scenarios.

* **Chores**
  * CI workflow extended to run the new chunk-feedback web API tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
2026-04-08 09:52:18 +08:00
Yang_Ming
bc8d67ce78 feat: add region parameter support to MinIO connection (#13954)
## Summary
- Add optional `region` parameter to `Minio()` client constructor in
`rag/utils/minio_conn.py`
- Reads from `MINIO.region` in settings, defaults to `None` when not
configured
- Required by some S3-compatible storage services (e.g., AWS S3, Tencent
COS) for proper bucket access

## Motivation
When using RAGFlow with S3-compatible storage that requires a region
(such as AWS S3 or Tencent Cloud COS), the MinIO client fails to access
buckets because the `region` parameter is not passed through.

The `Minio()` Python client already supports the `region` parameter
natively — this PR simply wires it up from the RAGFlow configuration.

## Changes
- `rag/utils/minio_conn.py`: Pass `region=settings.MINIO.get("region",
None) or None` to `Minio()` constructor

## Backward Compatibility
- No breaking changes. When `region` is not configured, it defaults to
`None`, preserving the existing behavior exactly.

## Test Plan
- [ ] Verified with MinIO (no region set) — works as before
- [x] Verified with S3-compatible storage requiring region — bucket
access succeeds

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced MinIO client initialization with regional configuration
support for improved compatibility with region-specific deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Jarry Wang <code-better-life@users.noreply.github.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-04-07 16:38:23 +08:00
Ricardo-M-L
424aee5bec fix: correct typos in code comments, docstrings and docs (#13931)
## Summary
- Fix `a image` → `an image` in README and log message
- Fix `colomn` → `column` in table structure recognizer comment
- Fix `formated` → `formatted` in confluence connector docstring
- Fix `tabel of content` → `table of contents` in TOC prompt

## Test plan
- [ ] Documentation and comment changes, no functional impact

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-04-07 13:05:39 +08:00
Jack
c4b0aaa874 Fix: #6098 - Add validation logic for parser_config when update document (#13911)
### What problem does this PR solve?

Add validation logic for parser_config.
Refactor the processing flow. Before change, validation logics and
update logics are mixed up - some validation logis executes followed by
some update logic executes and then another such
"validation-and-then-update" which is not good. After change, all
validation logic executes firstly. Update logic will be executed after
ALL validation logic executed.
Validation logic for parameters (that come from front end) will be
checked using Pydantic. For validation logic that depends on data from
DB, they will be in separate methods.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2026-04-07 11:33:05 +08:00
Idriss Sbaaoui
ff27ce86d6 fix: gpt-5 name-based config clearing from base chat path (#13949)
### What problem does this PR solve?

fix #13944 where OpenAI-compatible custom endpoints failed verification
when model names contained `gpt-5` becauser of incorrect name-based
handling in the Base/backend=`base` path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-07 11:24:47 +08:00
buildearth
a0be7c7ca7 Fix(connector): expose id_column, timestamp_column, metadata_columns for MySQL/PostgreSQL incremental sync (#13849)
### What problem does this PR solve?
The MySQL and PostgreSQL sync classes in `sync_data_source.py` were not
passing `id_column`, `timestamp_column`, and `metadata_columns` to
`RDBMSConnector`,
making incremental sync and document update impossible even when
configured.
   
- Without `id_column`: updated records generate new documents instead of
overwriting existing ones (doc ID is derived from content hash, so any
change produces a new ID).
- Without `timestamp_column`: `poll_source` always falls back to full
sync,
ignoring the configured time range.
- The three fields existed in the frontend default values but had no
form
inputs, so users had no way to fill them in.
### Type of change
  - [x] Bug Fix (non-breaking change which fixes an issue)        
  - [x] New Feature (non-breaking change which adds functionality)

### Changes
   
- **Backend** (`rag/svr/sync_data_source.py`): pass `id_column`,
    `timestamp_column`, and `metadata_columns` from `self.conf` to
`RDBMSConnector` for both `MySQL` and `PostgreSQL` sync classes.
- **Frontend**
(`web/src/pages/user-setting/data-source/constant/index.tsx`):
add `ID Column`, `Timestamp Column`, and `Metadata Columns` form fields
    to MySQL and PostgreSQL data source configuration UI with tooltips.

Signed-off-by: lixintao <lixintao@uniontech.com>
Co-authored-by: lixintao <lixintao@uniontech.com>
2026-04-07 10:24:30 +08:00
qinling0210
49386bc1b5 Implement UpdateDataset and UpdateMetadata in GO (#13928)
### What problem does this PR solve?

Implement UpdateDataset and UpdateMetadata in GO

Add cli:
UPDATE CHUNK <chunk_id> OF DATASET <dataset_name> SET <update_fields>
REMOVE TAGS 'tag1', 'tag2' from DATASET 'dataset_name';
SET METADATA OF DOCUMENT <doc_id> TO <meta>


### Type of change

- [ ] Refactoring
2026-04-07 09:44:51 +08:00
Magicbook1108
69264b3a70 Feat: Refact pipeline (#13826)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 19:26:45 +08:00
Zhichang Yu
ab358fe949 feat: make Azure cloud authority configurable for SPN auth (#13898)
## Summary
- The Azure SPN storage handler hardcoded
`AzureAuthorityHosts.AZURE_CHINA`, preventing users in Azure Public
Cloud regions (UK-South, EU, US, etc.) from authenticating
- Add a `cloud` config option (env: `AZURE_CLOUD`) supporting all four
Azure sovereignties: `public`, `china`, `government`, `germany`
- Defaults to `public` (global Azure) — the most common international
use case

Closes #13259

## Test plan
- [ ] Verify default (`cloud: public`) connects to Azure Public Cloud
endpoints
- [ ] Verify `cloud: china` retains existing behavior for Azure China
users
- [ ] Verify `AZURE_CLOUD` env var overrides the config file value

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 12:51:26 +08:00
qinling0210
f02f5fa435 Get ROW_ID from search() in Infinity (#13901)
### What problem does this PR solve?

1. Search() in Infinity can return row_id now

2.  To Get ROW_ID from search(), refer to handling of retrieval_test.

example
```
$ curl -s -X POST "http://localhost:$PORT/v1/chunk/retrieval_test" -H "Authorization: $TOKEN" -H "Content-Type: application/json" -d '{"kb_id": "4fcd01582ca911f1954184ba59049aa3", "question": "曹操"}'
```


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-02 18:56:43 +08:00
NeedmeFordev
6b7989b4b4 Add file type validation (#13802)
### What problem does this PR solve?

This PR fixes WebDAV sync behavior for unsupported file types
([#13795](https://github.com/infiniflow/ragflow/issues/13795)).

Previously, the WebDAV connector selected files primarily by modified
time (and size threshold) and could still pass unsupported extensions
into the download/document-generation path. This caused unnecessary
processing and inconsistent behavior compared with connectors that
validate file type earlier.

This change adds extension validation in two places:

1. **Early filter during recursive listing** to skip unsupported files
before they enter the download flow.
2. **Defensive filter before download/document creation** to prevent
unsupported files from being processed if any listing edge case slips
through.

It also wires `allow_images` into the WebDAV sync path so image
extension handling follows connector policy.

Scope is intentionally limited to WebDAV for a focused bug-fix PR.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### How was this tested?

- Manual verification with mixed file types under the configured WebDAV
path:
  - supported: `.pdf`, `.txt`, `.md`
  - unsupported: `.exe`, `.bin`, `.dat`
- Triggered full sync and polling sync.
- Confirmed unsupported files are skipped before download.
- Confirmed supported files are still indexed normally.
- Confirmed image handling follows `allow_images` setting.

Fixes: #13795
2026-04-02 14:12:27 +08:00
Ricardo-M-L
09a09a5b20 fix: correct typo in IterationItem name check and incomplete error message (#13890)
Two small fixes:

1. **iterationitem.py line 72**: Typo "interationitem" → "iterationitem"
(missing 't'). The component name check never matched IterationItem
components.

2. **raptor.py line 94**: Error message "Embedding error: " had a
trailing colon with no details. Changed to "Embedding error: empty
embeddings returned".
2026-04-02 10:35:28 +08:00
qinling0210
bb4a06f759 Implement InsertDataset and InsertMetadata in GO (#13883)
### What problem does this PR solve?

Implement InsertDataset and InsertMetadata in GO

new internal cli for go:

INSERT DATASET FROM FILE "file_name"
INSERT METADATA FROM FILE "file_name"

### Type of change

- [x] Refactoring
2026-04-01 16:16:25 +08:00
qinling0210
620fe215a4 Fix python metadata search (#13727)
### What problem does this PR solve?

Fix python metadata search

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-30 19:37:19 +08:00
qinling0210
0462c20113 Fix special characters in matching text of search() (#13852)
### What problem does this PR solve?

Fix special characters in matching text of search(). We should escape
some special characters(such as ?, *,:) before passing to matching_text
of search()

Fix https://github.com/infiniflow/ragflow/issues/13729

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-30 18:47:10 +08:00
Heyang Wang
641b319647 feat: support reading tags via API (#12891) (#13732)
### What problem does this PR solve?

Enable reading Tag Set tags via API (expose tag_kwd field). The result
of the queried list chunks is as shown below:

<img width="1422" height="818" alt="image"
src="https://github.com/user-attachments/assets/abd1960a-fe34-489e-9d72-525f8e574938"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>
2026-03-29 20:17:01 +08:00
KeJun
cb78ce0a7b feat: support rss datasource (#13721)
### What problem does this PR solve?

Supporting public RSS/Atom feed URLs as data sources for RagFlow.

link https://github.com/infiniflow/ragflow/issues/12313

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-27 22:58:44 +08:00
Jin Hai
24fcd6bbc7 Update CI (#13774)
### What problem does this PR solve?

CI isn't stable, try to fix it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-03-25 18:17:52 +08:00
Stephen Hu
d32967eda8 refactor: let excel use lazy image loader (#13558)
### What problem does this PR solve?

let excel use lazy image loader

### Type of change

- [x] Refactoring

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-23 21:24:40 +08:00
Magicbook1108
f991cd362e Fix: type check in resume parsing method (#13740)
### What problem does this PR solve?

Fix: type check in resume parsing method
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-23 21:19:09 +08:00
Yongteng Lei
dd839f30e8 Fix: code supports matplotlib (#13724)
### What problem does this PR solve?

Code as "final" node: 

![img_v3_02vs_aece4caf-8403-4939-9e68-9845a22c2cfg](https://github.com/user-attachments/assets/9d87b8df-da6b-401c-bf6d-8b807fe92c22)

Code as "mid" node:

![img_v3_02vv_f74f331f-d755-44ab-a18c-96fff8cbd34g](https://github.com/user-attachments/assets/c94ef3f9-2a6c-47cb-9d2b-19703d2752e4)


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-20 20:32:00 +08:00
tmimmanuel
13d0df1562 feat: add Perplexity contextualized embeddings API as a new model provider (#13709)
### What problem does this PR solve?

Adds Perplexity contextualized embeddings API as a new model provider,
as requested in #13610.

- `PerplexityEmbed` provider in `rag/llm/embedding_model.py` supporting
both standard (`/v1/embeddings`) and contextualized
(`/v1/contextualizedembeddings`) endpoints
- All 4 Perplexity embedding models registered in
`conf/llm_factories.json`: `pplx-embed-v1-0.6b`, `pplx-embed-v1-4b`,
`pplx-embed-context-v1-0.6b`, `pplx-embed-context-v1-4b`
- Frontend entries (enum, icon mapping, API key URL) in
`web/src/constants/llm.ts`
- Updated `docs/guides/models/supported_models.mdx`
- 22 unit tests in `test/unit_test/rag/llm/test_perplexity_embed.py`

Perplexity's API returns `base64_int8` encoded embeddings (not
OpenAI-compatible), so this uses a custom `requests`-based
implementation. Contextualized vs standard model is auto-detected from
the model name.

Closes #13610

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2026-03-20 10:47:48 +08:00
yH
757d8d42dd Fix: use configured OrderByExpr in _community_retrieval_ (#13683)
The `odr` variable was configured with `desc("weight_flt")` but a new
empty `OrderByExpr()` was passed to `dataStore.search()` instead,
causing the descending sort to have no effect.

### What problem does this PR solve?

In `_community_retrieval_`, the configured `OrderByExpr` with
`desc("weight_flt")` was discarded — a new empty `OrderByExpr()` was
passed to `dataStore.search()` instead, so community reports were never
sorted by weight.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-19 17:55:40 +08:00
Idriss Sbaaoui
7827f0fce5 fix : empty mind map (#13693)
### What problem does this PR solve?

Fix graphrag extractor chat response parsing and skip truncated cache
values

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-19 13:53:06 +08:00
NeedmeFordev
c3f79dbcb0 fix(jira): prevent missed incremental updates after issue edits (#13674)
### What problem does this PR solve?

Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira
incremental sync could miss updated issues after initial sync,
especially near time boundaries.

Root cause:
- Jira JQL uses minute-level precision for `updated` filters.
- Incremental windows had no overlap buffer, so boundary updates could
be skipped.
- Sync log cursor tracking used a backward-facing update for
`poll_range_start`.
- Existing-doc updates in `upload_document` lacked a KB ownership guard
for doc-id collisions.

What changed:
- Added Jira incremental overlap buffer (`time_buffer_seconds`,
defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL
lower-bound time.
- Preserved second-level post-filtering to avoid duplicate reprocessing
while still catching boundary updates.
- Improved Jira sync logging to include start/end window and overlap
configuration.
- Updated sync cursor tracking in `increase_docs` to keep
`poll_range_start` moving forward with max update time.
- Added KB ID safety check before updating existing document records in
`upload_document`.

Verification performed:
- Python syntax compile checks passed for modified files.
- Manual verification flow:
  1. Run full Jira sync.
  2. Edit an already-indexed Jira issue.
  3. Run next incremental sync.
  4. Confirm updated content is re-ingested into KB.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-18 23:31:05 +08:00
Idriss Sbaaoui
9070408b04 Fix : model-specific handling (#13675)
### What problem does this PR solve?

add a handler for gpt 5 models that do not accept parameters by dropping
them, and centralize all models with specific paramter handling function
into a single helper.
solves issue #13639 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2026-03-18 17:28:20 +08:00
Daniil Sivak
60ad32a0c2 Feat: support epub parsing (#13650)
Closes #1398

### What problem does this PR solve?

Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

To check this parser manually:

```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
  data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
  print(f'\n--- Section {i} ---')
  print(s[:200])
"
```
2026-03-17 20:14:06 +08:00
Stephen Hu
77483b1e58 refactor: remove useless variable in raptor (#13648)
### What problem does this PR solve?

remove useless variable in raptor

### Type of change


- [x] Refactoring
2026-03-17 15:56:51 +08:00
Yingfeng
b686a60713 Switch from demo.ragflow.io to cloud.ragflow.io (#13624)
### What problem does this PR solve?

Switch from demo.ragflow.io to cloud.ragflow.io

### Type of change

- [x] Documentation Update
2026-03-16 14:44:39 +08:00
apps-lycusinc
8b984c9d5f Fixing WordNetCorpusReader object has no attribute _LazyCorpusLoader_… (#13600)
### What problem does this PR solve?

Forces NLTK to load the corpus synchronously once, preventing concurrent
tasks from triggering the lazy-loading race condition that cause Fixing
WordNetCorpusReader object has no attribute _LazyCorpusLoader_… #13590


### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: shakeel <shakeel@lollylaw.com>
2026-03-13 19:55:01 +08:00
Idriss Sbaaoui
810692dfa3 fix: restore cross_languages default chat-model fallback for retrieval (#13471)
### What problem does this PR solve?

issue #13465 
POST /api/v1/retrieval failed with
{"code":100,...,"message":"Exception('Model Name is required')"} when
cross_languages was provided and no explicit llm_id was passed.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-13 10:52:37 +08:00
Ethan Clarke
35cd56f990 feat: add MiniMax-M2.5 and M2.5-highspeed models (#13557)
## Summary

Add MiniMax's latest M2.5 model family to the model registry and update
the default API base URL to the international endpoint for broader
accessibility.

## Changes

- **Add MiniMax-M2.5 models** to `conf/llm_factories.json`:
- `MiniMax-M2.5` — Peak Performance. Ultimate Value. Master the Complex.
  - `MiniMax-M2.5-highspeed` — Same performance, faster and more agile.
- Both support 204,800 token context window and tool calling (`is_tools:
true`).
- **Update default MiniMax API base URL** in `rag/llm/__init__.py`:
- From `https://api.minimaxi.com/v1` (domestic) to
`https://api.minimax.io/v1` (international).
- Chinese users can still override via the Base URL field in the UI
settings (as documented in existing i18n strings).

## Supported Models

| Model | Context Window | Tool Calling | Description |
|-------|---------------|-------------|-------------|
| `MiniMax-M2.5` | 204,800 tokens | Yes | Peak Performance. Ultimate
Value. |
| `MiniMax-M2.5-highspeed` | 204,800 tokens | Yes | Same performance,
faster and more agile. |

## API Documentation

- OpenAI Compatible API:
https://platform.minimax.io/docs/api-reference/text-openai-api

## Testing

- [x] JSON validation passes
- [x] Python syntax validation passes
- [x] Ruff lint passes
- [x] MiniMax-M2.5 API call verified (returns valid response)
- [x] MiniMax-M2.5-highspeed API call verified (returns valid response)

Co-authored-by: PR Bot <pr-bot@minimaxi.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-12 20:41:46 +08:00
qinling0210
1be07a0a34 Fix "Result window is too large" during meta data search (#13521)
### What problem does this PR solve?

Fix
https://github.com/infiniflow/ragflow/issues/13210#issuecomment-3982878498

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-12 18:59:56 +08:00
Magicbook1108
eda7835d47 Fix: image pdf in ingestion pipeline (#13563)
### What problem does this PR solve?

Fix: image pdf in ingestion pipeline #13550


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-12 17:49:02 +08:00
NeedmeFordev
387b0b27c4 feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527)
### What problem does this PR solve?

This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.

It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### What is changed?

- Add external Docling server support in `DoclingParser`:
  - Use `DOCLING_SERVER_URL` to enable remote parsing mode.
- Try `POST /v1/convert/source` first, and fallback to
`/v1alpha/convert/source`.
- Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
set.
- Wire Docling env settings into parser invocation paths:
  - `rag/app/naive.py`
  - `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
  - `docs/guides/dataset/select_pdf_parser.md`
  - `docs/guides/agent/agent_component_reference/parser.md`
  - `docs/faq.mdx`

### Why this approach?

This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.

### Validation

- Static checks:
  - `python -m py_compile` on changed Python files
  - `python -m ruff check` on changed Python files
- Functional checks:
  - Remote v1 endpoint path works
  - v1alpha fallback works
  - Local Docling path remains available when server URL is unset

### Related links

- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)

##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
2026-03-12 17:09:03 +08:00
Yongteng Lei
e1b632a7bb Feat: add delete all support for delete operations (#13530)
### What problem does this PR solve?

Add delete all support for delete operations.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---------

Co-authored-by: writinwaters <cai.keith@gmail.com>
2026-03-12 09:47:42 +08:00
Ethan T.
1cee8b1a7b fix: use context managers for file handles to prevent resource leaks (#13514)
## Summary
- Convert bare `open()` calls to `with` context managers or
`Path.read_text()`
- File handles leak if not properly closed, especially on exceptions
- Fixes in crypt.py, sequence2txt_model.py, term_weight.py,
deepdoc/vision/__init__.py

## Test plan
- [x] File operations work correctly with context managers
- [x] Resources properly cleaned up on exceptions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 16:47:06 +08:00
eviaaaaa
d0ca388bec Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329)
## Summary
This PR is the direct successor to the previous `docx` lazy-loading
implementation. It addresses the technical debt intentionally left out
in the last PR by fully migrating the `qa` and `manual` parsing
strategies to the new lazy-loading model.

Additionally, this PR comprehensively refactors the underlying `docx`
parsing pipeline to eliminate significant code redundancy and introduces
robust fallback mechanisms to handle completely corrupted image streams
safely.


## What's Changed

* **Centralized Abstraction (`docx_parser.py`)**: Moved the
`get_picture` extraction logic up to the `RAGFlowDocxParser` base class.
Previously, `naive`, `qa`, and `manual` parsers maintained separate,
redundant copies of this method. All downstream strategies now natively
gather raw blobs and return `LazyDocxImage` objects automatically.
* **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge
cases where `python-docx` encounters critically malformed magic headers.
Implemented an explicit `try-except` structure that safely intercepts
`UnrecognizedImageError` (and similar exceptions) and seamlessly falls
back to retrieving the raw binary via `getattr(related_part, "blob",
None)`, preventing parser crashes on damaged documents.

* **Legacy Code & Redundancy Purge**:
* Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`,
and `manual.py`.
* Removed the standalone, immediate-decoding `concat_img` method in
`manual.py`. It has been completely replaced by the globally unified,
lazy-loading-compatible `rag.nlp.concat_img`.
* Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception
packages) across all updated strategy files.

## Scope
To keep this PR focused, I have restricted these changes strictly to the
unification of `docx` extraction logic and the lazy-load migration of
`qa` and `manual`.

## Validation & Testing
I've tested this to ensure no regressions and validated the fallback
logic:

* **Output Consistency**: Compared identical `.docx` inputs using `qa`
and `manual` strategies before and after this branch: chunk counts,
extracted text, table HTML, and attached images match perfectly.
* **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory
usage when processing image-dense documents through the `qa` and
`manual` pipelines, bringing them up to parity with the `naive`
strategy's performance gains.

## Breaking Changes
* None.
2026-03-11 10:00:07 +08:00
Yongteng Lei
3c80a0ae09 Fix: support vLLM's new reasoning field (#13493)
### What problem does this PR solve?

Support vLLM's new reasoning field

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-10 21:13:14 +08:00
Magicbook1108
675810e0cf Refact: optimize confluence performance (#13497)
### What problem does this PR solve?

Refact: optimize confluence performance #13494

### Type of change

- [x] Refactoring
2026-03-10 15:02:24 +08:00
Idriss Sbaaoui
249b78561b Fix missmatch docnm_kwd in raptor chunks (#13451)
### What problem does this PR solve?

issue #13393 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-10 14:24:33 +08:00
Magicbook1108
7143954b48 Fix: chats_openai in none stream condition (#13495)
### What problem does this PR solve?

Fix: chats_openai in none stream condition #13453

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-10 13:44:17 +08:00
qinling0210
7c92f51133 Fix retrieval function when metadata_condtion is specified in retrieval API (#13473)
### What problem does this PR solve?

Fix https://github.com/infiniflow/ragflow/issues/13388

The following command returns empty when there is doc with the meta data
```
curl --request POST \
     --url http://localhost:9222/api/v1/retrieval \
     --header 'Content-Type: application/json' \
     --header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \
     --data '{
          "question": "any question",
          "dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"],
           "metadata_condition": {
            "logic": "and",
            "conditions": [
              {
                "name": "character",
                "comparison_operator": "is",
                "value": "刘备"
              }
            ]
          }
     }'
```

When metadata_condtion is specified in the retrieval API, it is
converted to doc_ids and doc_ids is passed to retrieval function.
In retrieval funciton, when doc_ids is explicitly provided , we should
bypass threshold.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-10 11:57:32 +08:00
guptas6est
32d31284cc Fix: upgrade pypdf to 6.7.5 and migrate from deprecated pypdf2 to fix CVE-2026-28804 and CVE-2023-36464 (#13454)
### What problem does this PR solve?

This PR addresses security vulnerabilities in PDF processing
dependencies identified by Trivy security scan:

1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient
decoding of ASCIIHexDecode streams
2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop
when parsing malformed comments

Since pypdf2 is deprecated with no available fixes, this PR migrates all
pypdf2 usage to the actively maintained pypdf library (version 6.7.5),
which resolves
both vulnerabilities.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-09 12:06:00 +08:00
Heyang Wang
c217b8f3d8 Feat: add DingTalk AI Table connector and integration for data synch… (#13413)
### What problem does this PR solve?

Add DingTalk AI Table connector and integration for data synchronization

Issue #13400

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: wangheyang <wangheyang@corp.netease.com>
2026-03-06 21:13:23 +08:00