Commit Graph

5947 Commits

Author SHA1 Message Date
buua436
a7ce1b1677 Fix: prune deleted doc chunks from retrieval (#14454)
### What problem does this PR solve?

prune deleted doc chunks from retrieval

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 13:03:09 +08:00
Jin Hai
b493a33316 Go: update chat URL (#14453)
### What problem does this PR solve?

Update the URL to: /api/v1/chat/completions

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-29 11:45:06 +08:00
Magicbook1108
3b7a6eaa6c Feat: sync deleted files in Bitbucket (#14450)
### What problem does this PR solve?

Feat: sync deleted files in Bitbucket

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 11:29:17 +08:00
Paras Sondhi
74fa54f122 feat(google-drive): optimize memory payload and enable sync deletion (#14372)
**Addresses the Google Drive integration for #14362**

This PR completely overhauls the Google Drive sync logic to accurately
detect remote deletions, while drastically reducing the memory footprint
during the snapshot phase.

### What changed under the hood:

* **Killed the memory bloat:** Swapped out the massive document
dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc
= namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during
`retrieve_all_slim_docs_perm_sync` on massive enterprise drives.
* **Flawless downstream integration:** The `SlimDoc` object relies on
simple duck typing. It perfectly delivers the `.id` attribute required
by `ConnectorService.cleanup_stale_documents_for_task`, meaning your
core `hash128` vector cleanup logic runs natively without modification.
* **Fixed the Shared Drive blindspot:** The standard API query was
missing team folders. Injected the `corpora="allDrives"` and
`includeItemsFromAllDrives=True` override flags so the connector now
accurately maps state across both personal workspaces and organizational
Shared Drives.
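
The namedtuple approach described above can be sketched roughly as follows. This is an illustration only, not the connector's code: `to_slim_docs`, `stale_ids`, and the sample file dictionaries are hypothetical, and only the `SlimDoc` definition is taken from the description.

```python
from collections import namedtuple

# From the PR description: a lightweight record that keeps only the document ID.
SlimDoc = namedtuple("SlimDoc", ["id"])


def to_slim_docs(files):
    """Hypothetical helper: keep only the ID of each remote file dict."""
    return [SlimDoc(id=f["id"]) for f in files]


def stale_ids(indexed_ids, slim_docs):
    """Hypothetical helper: IDs indexed locally that no longer exist remotely."""
    remote_ids = {doc.id for doc in slim_docs}  # relies only on the .id attribute
    return sorted(set(indexed_ids) - remote_ids)


# Made-up sample data standing in for a Drive listing.
remote_files = [
    {"id": "a1", "name": "spec.pdf", "permissions": ["..."]},
    {"id": "b2", "name": "notes.docx", "permissions": ["..."]},
]
slim = to_slim_docs(remote_files)
removed = stale_ids(["a1", "b2", "c3"], slim)
```

Any object exposing `.id` would work in `stale_ids`, which is the duck-typing point the description makes.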

### Testing:
Isolated the Google API retrieval logic locally to prove the `SlimDoc`
mapping works and correctly registers state drops when a file is trashed
remotely.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Performance Improvement
2026-04-29 10:04:36 +08:00
Stephen Hu
345bec812d refactor: improve QwenRerank logic (#14388)
### What problem does this PR solve?

improve QwenRerank logic

### Type of change

- [x] Refactoring
2026-04-28 20:17:34 +08:00
Magicbook1108
0d18b293f5 Fix: enable sync deleted file in airtable (#14438)
### What problem does this PR solve?

Fix: enable sync deleted file in airtable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 20:09:08 +08:00
Magicbook1108
926efbd29b Fix: update based on #14436 (#14440)
### What problem does this PR solve?

Fix: update based on #14436

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 20:08:42 +08:00
euvre
35f6d81b73 Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402)
### What problem does this PR solve?

## Summary

Migrate two web API endpoints to REST-style HTTP API endpoints,
following the pattern established in #14222:

| Old Endpoint | New Endpoint |
|---|---|
| `POST /v1/chunk/retrieval_test` | `POST /api/v1/datasets/<dataset_id>/search` |
| `GET /v1/chunk/knowledge_graph` | `GET /api/v1/datasets/<dataset_id>/graph` |
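
For client code, the new routes in the table above could be built along these lines. This is a sketch under assumptions: the helper names and the base URL used in the usage note are hypothetical, and only the two path patterns come from the table.

```python
from urllib.parse import quote


def search_url(base_url, dataset_id):
    """Build the dataset search URL (replaces POST /v1/chunk/retrieval_test)."""
    return f"{base_url.rstrip('/')}/api/v1/datasets/{quote(dataset_id, safe='')}/search"


def graph_url(base_url, dataset_id):
    """Build the knowledge graph URL (replaces GET /v1/chunk/knowledge_graph)."""
    return f"{base_url.rstrip('/')}/api/v1/datasets/{quote(dataset_id, safe='')}/graph"
```

For example, `search_url("http://localhost:9380/", "abc")` yields `http://localhost:9380/api/v1/datasets/abc/search`; the host and port here are placeholders.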
2026-04-28 20:00:26 +08:00
Magicbook1108
85575259ac Fix: google authentication - gmail && google-drive (#14422)
### What problem does this PR solve?

Fix: google authentication - gmail && google-drive

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 18:09:02 +08:00
qinling0210
dcce864d4c Simplify Encode (#14437)
### What problem does this PR solve?

Simplify Encode

### Type of change

- [x] Refactoring
2026-04-28 18:07:42 +08:00
Magicbook1108
d532151be0 Feat: more model for paddle (#14436)
### What problem does this PR solve?

Feat: more model for paddle

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 18:07:00 +08:00
Haruko386
4e5a093ac5 Go: implement provider: Moonshot (#14433)
### What problem does this PR solve?

implement `Moonshot` provider

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 18:06:25 +08:00
Jack
c330005659 Fix: document level auto metadata config missing after save (#14421)
### What problem does this PR solve?

Steps to reproduce (a bug that existed before the API migration):

1. Create a new dataset.
2. Upload a file.
3. Click "General" in the "Parse" column, then click "switch or configure ingestion pipeline".
4. Click "Settings" (to the right of "Auto metadata").
5. Click "Add" to add new metadata.
6. Click "Save".
7. Re-open "Settings"; the newly added metadata is missing.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 17:09:23 +08:00
buua436
e6e80041f5 Fix: agent toolcall null response & schema validation & DeepSeek think history (#14425)
### What problem does this PR solve?
agent toolcall null response & schema validation & DeepSeek think
history

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 17:09:08 +08:00
Jin Hai
f670913bb4 Refactor model type to model class (#14426)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-28 16:05:15 +08:00
Jin Hai
7c25870923 Go: update db model (#14423)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-28 16:04:55 +08:00
Magicbook1108
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync deleted files for connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
NeedmeFordev
0df65d358a Fix case-insensitive matching for manual meta_data_filter in / not in list values (#14397)
## Summary

Fixes case-asymmetric matching for manual `meta_data_filter` when using
**`in`** / **`not in`** with a **list** `value`. Document metadata
strings were lowercased, but list elements were not, so values like
`"F2"` failed to match `["F2", "F11"]` even though **`=`** behaved
correctly.

Closes #14389

## Changes

- **`common/metadata_utils.py`**: For **`in`** / **`not in`**, normalize
string elements when `value` and/or `input` is a list, consistent with
scalar string lowercasing.
- **`test/unit_test/common/test_metadata_filter_operators.py`**:
Regression tests for list `value` case-insensitivity and **`not in`**.
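
The normalization described above might look roughly like this. It is a simplified sketch, not the code in `common/metadata_utils.py`: the function names `_normalize` and `matches` are hypothetical, and only the `=`, `in`, and `not in` operators are covered.

```python
def _normalize(value):
    """Lowercase strings, including strings inside lists; leave other types alone."""
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def matches(doc_value, operator, filter_value):
    """Hypothetical matcher for the '=', 'in', and 'not in' operators."""
    left, right = _normalize(doc_value), _normalize(filter_value)
    if operator == "=":
        return left == right
    if operator not in ("in", "not in"):
        raise ValueError(f"unsupported operator: {operator}")
    # Treat scalars as one-element lists so both sides are compared uniformly.
    lefts = left if isinstance(left, list) else [left]
    rights = right if isinstance(right, list) else [right]
    found = any(item in rights for item in lefts)
    return found if operator == "in" else not found
```

With both sides lowercased, `matches("F2", "in", ["F2", "F11"])` holds, which is the case reported in #14389.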

## Type of change

- [x] Bug fix (non-breaking)
2026-04-28 14:51:48 +08:00
Idriss Sbaaoui
2a37562791 Fix manual naive parser position extraction fallback (#14420)
### What problem does this PR solve?
This PR fixes a regression where Manual pipeline + Naive (Plain Text)
PDF parsing crashed with `AttributeError: 'PlainParser' object has no
attribute 'extract_positions'` in `rag/app/manual.py`.
fixes #14411 
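
One common way to guard against this kind of `AttributeError` is to check for the method before calling it. The sketch below uses toy stand-in classes and a hypothetical `positions_or_empty` helper; the actual fix in `rag/app/manual.py` may be structured differently.

```python
class PlainParser:
    """Toy stand-in for a text-only parser with no extract_positions method."""


class LayoutParser:
    """Toy stand-in for a layout-aware parser."""

    def extract_positions(self, text):
        # Made-up position tuple: (page, x0, x1, top, bottom).
        return [(1, 0.0, 10.0, 0.0, 5.0)]


def positions_or_empty(parser, text):
    """Return positions when the parser supports them, else an empty list."""
    extract = getattr(parser, "extract_positions", None)
    if callable(extract):
        return extract(text)
    return []
```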
### Type of change:
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 14:21:30 +08:00
Jin Hai
ae420f6358 Go: fix compilation (#14418)
### What problem does this PR solve?

Add methods to volcengine

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-28 13:21:05 +08:00
qinling0210
effc84a042 Refactor model in GO (#14398)
### What problem does this PR solve?

Refactor model in GO

### Type of change

- [x] Refactoring
2026-04-28 12:59:01 +08:00
Wang Qi
5885691c68 Always return success if no such task id (#14417)
### What problem does this PR solve?

Always return success when the given task ID does not exist, to match the existing code logic.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:55:24 +08:00
buua436
444e564329 Fix: align chat recommendation and thumbup APIs (#14413)
### What problem does this PR solve?
align chat recommendation and thumbup APIs
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:55:16 +08:00
buua436
7a70a0fd85 Fix: preserve infinity available_int zero filter (#14416)
### What problem does this PR solve?

preserve infinity available_int zero filter

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:54:32 +08:00
Jin Hai
819257f257 Go: add volcengine (#14409)
### What problem does this PR solve?

1. Refactor server_main
2. Add volcengine

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-28 12:12:58 +08:00
Jack
2d522ccb36 Fix: thumbnails issue in chat (#14415)
### What problem does this PR solve?

In chat, the thumbnails didn't display correctly

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)

Steps to reproduce:
1. create dataset and upload a file (see attached)
2. parse the document
3. once parsing completed, create a chat and associate it with the
dataset
4. ask a question (DAP VS DAPE comparison)
5. check result
2026-04-28 11:39:29 +08:00
writinwaters
0cf105da8d Doc: Added a database schema and migration guide. (#14404)
### What problem does this PR solve?

Added a database schema and migration guide.

### Type of change


- [x] Documentation Update
2026-04-28 09:54:33 +08:00
Jack
c81081f8ef Refactor: Doc change parser (#14327)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation, RESTful API
PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-27 23:42:57 +08:00
Jack
872ff08304 Fix: add executor.shutdown (#14403)
### What problem does this PR solve?

Add executor shutdown in finally clause to free resources.
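
The pattern named above, shutting an executor down in a `finally` clause, can be sketched as follows. The `run_all` helper and its jobs are hypothetical and stand in for whatever work the real code submits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(jobs, max_workers=4):
    """Run zero-argument callables concurrently and return results in order."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [executor.submit(job) for job in jobs]
        return [f.result() for f in futures]
    finally:
        # Runs on success and on error, so worker threads are always released.
        executor.shutdown(wait=True)


results = run_all([lambda: 1 + 1, lambda: "ok"])
```

Using `with ThreadPoolExecutor(...) as executor:` gives the same guarantee; the explicit `finally` form is shown because it matches the wording of the change.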

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 22:38:43 +08:00
Jack
c5116b90e5 Refactor: migrate document thumbnails API (#14344)
### What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration:  GET /api/v1/thumbnails

### Type of change

- [x] Refactoring
2026-04-27 21:29:09 +08:00
Jack
49912a156e Refactor: migrate document run api (#14351)
### What problem does this PR solve?

Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/

### Type of change

- [x] Refactoring
2026-04-27 21:25:58 +08:00
Jin Hai
965717c4fb Go: add new provider: google (#14395)
### What problem does this PR solve?

As title.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-27 20:35:47 +08:00
Jack
343bda1119 Refactor: deco document upload_and_parse API (#14366)
### What problem does this PR solve?

remove unused "POST /v1/document/upload_and_parse"

### Type of change

- [x] Refactoring
2026-04-27 20:35:00 +08:00
euvre
d78013964a tests: add missing HTTP API tests for dataset management endpoints removed in #14222 (#14390)
### What problem does this PR solve?

### Summary

PR #14222 consolidated KB (web) API endpoints into RESTful Dataset
(HTTP) API endpoints and deleted the web API test suite under
`test_web_api/test_kb_app/` and `test_web_api/test_document_app/`. While
most test coverage was migrated to the HTTP API test suite, some tests
were not ported over. This PR adds back the missing coverage.

### Route migration reference

| Old Web API | New HTTP API | Missing tests |
|---|---|---|
| `POST /v1/kb/update_metadata_setting` | `PUT /api/v1/datasets/<id>/metadata/config` | auth & error paths |
| `GET /api/v1/datasets/<id>/auto_metadata` | `GET /api/v1/datasets/<id>/metadata/config` | auth & CRUD |
| `PUT /api/v1/datasets/<id>/auto_metadata` | `PUT /api/v1/datasets/<id>/metadata/config` | auth & CRUD |
| `GET /v1/kb/<kb_id>/basic_info` | `GET /api/v1/datasets/<id>/ingestions/summary` | covered |
| `POST /v1/kb/list_pipeline_logs` | `GET /api/v1/datasets/<id>/ingestions` | edge cases missing |

### Changes

#### `test_file_management_within_dataset/test_metadata_config.py` (new, 10 tests)

Covers `GET/PUT /datasets/<id>/metadata/config` (migrated from
`test_kb_tags_meta.py`'s `test_update_metadata_setting` and
`test_document_metadata.py`'s negative tests):
- Authorization for dataset metadata config GET/PUT
- Authorization for document metadata config PUT
- Success, invalid dataset, missing payload, not found scenarios

#### `test_dataset_management/test_ingestion_logs.py` (extended, +2 tests)

Covers `GET /datasets/<id>/ingestions` edge cases (migrated from
`test_kb_pipeline_tasks.py`):
- Missing dataset ID
- Abnormal date filter

### Type of change

- [x] Other: Test coverage improvement

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 20:01:28 +08:00
Jack
a536980e22 Refactor: Doc batch change status (#14337)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_status

After consolidation, RESTful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status 

### Type of change

- [x] Refactoring
2026-04-27 20:00:23 +08:00
buua436
c949096db0 Refactor: optimize agent reset conversation variable defaults (#14401)
### What problem does this PR solve?
optimize agent reset conversation variable defaults
### Type of change
- [x] Refactoring
2026-04-27 19:57:56 +08:00
Wang Qi
488c3ef6a3 Add task API (#14393)
### What problem does this PR solve?

Add task API

### Type of change

- [x] Refactor
2026-04-27 19:16:37 +08:00
buua436
82313020c7 Refa: align list operations and strict mode (#14387)
### What problem does this PR solve?

align list operations and strict mode

### Type of change
- [x] Refactoring
2026-04-27 19:13:00 +08:00
Jack
c1941fd503 Refactor: deco doc-parse API that is not used any more (#14367)
### What problem does this PR solve?

Delete the unused API "POST /v1/document/parse"

### Type of change

- [x] Refactoring
2026-04-27 18:54:49 +08:00
buua436
4f6651968a Fix: prioritize explore session ID and reset default conversation variables (#14399)
### What problem does this PR solve?

 prioritize explore session ID and reset default conversation variables

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 18:52:40 +08:00
mginfn
10e28e5c5f Helm template ragflow.yaml: fix nginx-config-volume mountPath according to Dockerfile v0.25.0 (#14361)
### What problem does this PR solve?

Dockerfile v0.25.0 expects the nginx conf at
/etc/nginx/ragflow.conf.python (see Dockerfile#L200 at commit ca01c7a745).
However, the current Helm template mounts the conf at
/etc/nginx/ragflow.conf, causing a runtime error at startup.

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Mauro Gattari <mauro.gattari@infn.it>
2026-04-27 18:51:55 +08:00
buua436
0f2778efe7 Fix: support release in agent update api (#14396)
### What problem does this PR solve?

support release in agent update api

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 17:35:35 +08:00
Jack
61a24a2c14 Refactor: migrate doc upload info used in chat (#14359)
### What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documents/upload/

### Type of change

- [x] Refactoring
2026-04-27 16:58:42 +08:00
Zhichang Yu
c446c403de perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)
## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the
first OCR text extraction is actually needed, avoiding unnecessary
memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via
`PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's
boxes are normalized with `_to_global_boxes` to produce globally
consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model
loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy
`TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser
containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines =
extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped
during refactoring and is required by downstream `remove_toc` and
metadata handling in `rag/flow/parser/parser.py`.
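
The batching idea in the second bullet can be illustrated with a small sketch. The helpers `page_batches` and `to_global_page` are hypothetical and are not the `_to_global_boxes` implementation; the default of 50 mirrors the threshold quoted above.

```python
def page_batches(total_pages, batch_size=50):
    """Yield (start, end) page ranges, end exclusive, covering all pages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def to_global_page(batch_start, local_page):
    """Map a page index local to a batch back to a document-wide index."""
    return batch_start + local_page


batches = list(page_batches(120, batch_size=50))
```

Here a 120-page document splits into the ranges (0, 50), (50, 100), and (100, 120), and a box found on local page 3 of the second batch maps back to global page 53.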

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines`
is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally
consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local
model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 16:52:43 +08:00
Idriss Sbaaoui
4303be223f Fix metadata parsing regression for upgraded v0.24 datasets (#14383)
### What problem does this PR solve?

This PR fixes issue #14371 where file parsing failed after upgrading
from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema
object but was handled like a list and later caused `KeyError:
'properties'`.
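
A defensive way to accept both config shapes is sketched below. The function name `metadata_fields` and the list item shape (`{"key": ...}`) are assumptions for illustration; the PR description does not spell out the exact structures.

```python
def metadata_fields(config):
    """Return field names from a list-style or JSON-Schema-style config."""
    if not config:
        return []
    if isinstance(config, list):
        # Assumed list shape: [{"key": "author"}, ...]
        return [item["key"] for item in config if "key" in item]
    if isinstance(config, dict):
        # JSON Schema object: names live under "properties", which may be absent.
        return list(config.get("properties", {}).keys())
    raise TypeError(f"unsupported metadata config type: {type(config).__name__}")
```

Reading `properties` with `.get` avoids the `KeyError` when a schema object has no such key.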

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 16:18:06 +08:00
Wang Qi
d88f7ac8d2 Remove evaluation_app.py and kb_app.py (#14394)
### What problem does this PR solve?

Delete unused APIs

### Type of change

- [x] Refactoring
2026-04-27 16:08:54 +08:00
Jack
290f0294d6 Refactor: migrate artifact API (#14348)
### What problem does this PR solve?

Before migration: GET /v1/document/artifact/<filename>
After migration:  GET /api/v1/documents/artifact/<filename>

### Type of change

- [x] Refactoring
2026-04-27 15:19:41 +08:00
euvre
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
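
The effect of forwarding the page range can be shown with a toy model. `ToyParser` is hypothetical and renders nothing; it also treats `page_to` as an exclusive bound for simplicity, which may not match the real parser's convention.

```python
MAXIMUM_PAGE_NUMBER = 100000  # value quoted in the PR description


class ToyParser:
    """Toy stand-in for a PDF parser; it lists page indices instead of rendering."""

    def __init__(self, total_pages):
        self.total_pages = total_pages

    def images(self, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
        # Clamp the requested range to the pages that actually exist.
        return list(range(page_from, min(page_to, self.total_pages)))

    def parse_into_bboxes(self, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
        # Forward the range explicitly instead of relying on images() defaults.
        return self.images(page_from=from_page, page_to=to_page)
```

With the shared constant as the default and the range passed through, a 1200-page document is no longer cut off at the old limit.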

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
Jin Hai
c3eac4103a Go: aliyun model provider (#14379)
### What problem does this PR solve?

As title.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-27 14:53:33 +08:00
buua436
0b46ab07c5 Refa: restore openai-compatible chat completions api (#14380)
### What problem does this PR solve?
restore openai-compatible chat completions api
### Type of change

- [x] Refactoring
2026-04-27 14:02:19 +08:00