### What problem does this PR solve?
Syncs the /api/v1/chat/completions docs with the current behavior,
including the new legacy streaming mode.
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
The parser pods suffer from OOM kills when processing large PDF
documents. The root cause is in api/db/services/task_service.py: when
layout_recognize is not DeepDOC (e.g. Plain Text), page_size was set to
MAXIMUM_TASK_PAGE_NUMBER (100 million), causing the entire PDF to be
processed as a single task with all pages loaded into memory
simultaneously.
This PR fixes the issue by paginating non-DeepDOC PDF parsing tasks the
same way DeepDOC already does.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
## Summary
Add image parsing capability to PaddleOCR integration, building on top
of #15967 (async Job API migration).
## Changes
### `deepdoc/parser/paddleocr_parser.py`
- Add `parse_image()` method that uses the same async Job API flow as
`parse_pdf()`
- Extracts text from `layoutParsingResults` → `prunedResult` →
`parsing_res_list`
- Returns concatenated block content as a single string
### `rag/llm/ocr_model.py`
- Add `parse_image()` wrapper to `PaddleOCROcrModel` with availability
check and logging
## Relationship to other PRs
- **Depends on**: #15967 (async Job API migration) — this PR is based on
that branch
- **Replaces**: #14826 (original image processing PR based on old sync
API)
## Notes
This PR uses `base_url` and the async Job API (submit → poll → fetch)
consistent with #15967, rather than the old `api_url` + sync POST
pattern from #14826.
## Summary
Migrate PaddleOCR integration from the deprecated synchronous HTTP API
to the new asynchronous Job API (`submit → poll → fetch`), aligning with
PaddleOCR 3.6.0+ architecture.
## Changes
### Python (`deepdoc/parser/paddleocr_parser.py`)
- Replace synchronous `requests.post()` with async Job API flow (submit
→ poll → fetch)
- Authentication: `token {token}` → `Bearer {token}`
- File transfer: base64 JSON body → multipart file upload
- Polling: exponential backoff (initial 3s, ×1.5, max 15s, timeout
controlled by `request_timeout`)
- Result: fetch full JSONL from result URL, preserving `prunedResult`
with bbox info for crop functionality
- Rename `api_url` → `base_url` (backward compatible: `api_url` still
accepted as fallback)
### Python (`rag/llm/ocr_model.py`)
- Prefer `paddleocr_base_url` / `PADDLEOCR_BASE_URL`, fallback to
`paddleocr_api_url` / `PADDLEOCR_API_URL`
### Go (`internal/entity/models/paddleocr.go`)
- Add `Client-Platform: ragflow` header to submit and poll requests
- Change polling from fixed 3s to exponential backoff (initial 3s, ×1.5,
max 15s)
### Python (`common/constants.py`)
- Add `PADDLEOCR_BASE_URL` to env keys and default config
## Backward Compatibility
- Old env var `PADDLEOCR_API_URL` still works (used as fallback)
- Frontend field `paddleocr_api_url` still works (backend reads it as
fallback)
- No user-facing configuration changes required for existing setups
## Why not use the `paddleocr` SDK package directly?
RAGFlow's `_transfer_to_sections()` relies on `prunedResult` (containing
`block_bbox`, `block_label`, `parsing_res_list`) from the raw API
response for PDF crop functionality. The SDK's public `parse_document()`
API only returns `DocParsingResult` with `markdown_text`, discarding the
bbox data. Therefore we implement the async Job API flow directly via
HTTP, following the same logic as the SDK internally.
### What problem does this PR solve?
Merge password related functions
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
## Summary
- The `ChatChannel` DB column was renamed from `dialog_id` to `chat_id`
via a migration (added in a prior commit).
- Aligns the REST API layer (`chat_channel_api.py`,
`chat_channel_service.py`) to use `chat_id` consistently.
- Updates the frontend (`interface.ts`, `hooks.ts`,
`connect-dialog-modal.tsx`, `added-channel-card.tsx`) to read/write
`chat_id` instead of `dialog_id`.
- The joined `dialog_name` alias in the list query is unchanged (backend
still returns it under that name).
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
`rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs
parsed
straight out of an untrusted uploaded markdown document via a raw
`requests.get`,
with no SSRF validation. Markdown chunking always reaches this path
(`return_section_images=True`), so any authenticated user who uploads a
`.md`/`.markdown`/`.mdx` file to a knowledge base could make the server
issue
requests to internal services or cloud-metadata endpoints, e.g.
``. The `image/`
Content-Type
check only gates decoding — the outbound request (the SSRF) always
fires.
This was the one user-controlled fetch site missed by the project's
existing
SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler,
SearXNG,
RSS connector, MCP/document APIs, and OAuth avatar download).
The fix validates and DNS-pins every hop with
`common.ssrf_guard.assert_url_is_safe`
before connecting, and follows redirects manually so each redirect
target is
re-validated (closing the DNS-rebinding / redirect-bypass window),
mirroring
`common/data_source/rss_connector.py`. Blocked URLs are skipped and
logged like
any other unreachable image, so legitimate public images are unaffected.
Adds a
regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`.
Closes#15437
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local>
Co-authored-by: galuis116 <galuis116@users.noreply.github.com>
### What problem does this PR solve?
fix: remove unnecessary 'asChild' prop from FilterButton component
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
fix: remove unnecessary div in profile page layout
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
fix: update channelTemplates to filter for Discord and Lark only
- Fixed a display issue with chunks during pipeline parsing.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
fix: update localization keys for image2text and add ocr option
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix DB migration issue.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix:
- Pass session_id to langfuse.
- Get correct status for add model_type.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR fixes Go admin server startup failure caused by duplicate model
aliases in conf/all_models.json.
The model provider loader builds a global lookup table from both model
name and alias values. Some aliases duplicated another model's name or
another
alias, for example amazon.titan-embed-text-v1, which caused startup to
fail with a duplicate alias error. This PR removes conflicting duplicate
aliases
while keeping all model definitions intact.
### What problem does this PR solve?
As title
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
### What problem does this PR solve?
Fix register user
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
TOC chunks now include a toc field so the agent pipeline logs expose the
data the frontend expects.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
## What
- Add `group` class to wrapper div to enable hover state coordination
- Apply hover styles to checkbox via group-hover
- Make FormLabel clickable via onClick toggle + cursor-pointer
- Fix label color logic: disabled vs primary state
## Why
The "Remember me" label was not clickable and had no hover feedback,
making the UX inconsistent with standard checkbox behavior.
## How to test
0. Go to the demo video before/after attached below
1. Go to the login page
2. Click directly on the "Remember me" label → should toggle the
checkbox
3. Hover over the checkbox area → should show hover styles
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Before
https://github.com/user-attachments/assets/bd47d45c-09ea-437f-bd98-3397ce040c1e
## After
https://github.com/user-attachments/assets/45c65d1a-bec7-4ad6-8f1c-d149f7296f8f
### What problem does this PR solve?
This PR enhances the CLI parser to support dimension configurations for
custom embedding models. Users can now specify the maximum dimension and
other supported dimensions directly after the embedding keyword.
```
add model 'x1 x2 x3 x4 x5' to provider 'vllm' instance 'test' with
tokens 1024 chat think vision,
token 2048 chat,
token 1024 think vision,
token 0 embedding 2048 64 1024 2048,
token 0 embedding 2048;
```
- The first integer following embedding represents the max_dimension.
- Any subsequent integers represent specific alternative dimensions.
- If no subsequent integers are provided, dimensions defaults to empty,
indicating all sizes under max_dimension are supported.
### Description
Currently, when setting tenant default models (e.g., chat, embedding,
rerank), the API only accepts the composite name
(`model_name@model_instance@model_provider`). However, some integrations
and front-end features prefer using the database `model_id` (UUID)
directly.
This PR adds support for `model_id` in default model configuration:
1. **Request Binding**: Added `model_id` (optional field) to the request
body schema in the handler.
2. **Database Lookup**: If `model_id` is supplied, the service queries
the database to resolve the respective provider, instance, and model
names.
3. **Security Validation**: Verified that the provider associated with
the resolved `model_id` belongs to the requesting tenant.
4. **Unit Tests**: Added `TestSetTenantDefaultModels_WithModelID` to
verify DB ID resolution and tenant mapping.
### What problem does this PR solve?
core module for agent layer built on top of graph engine #16039
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes the sandbox config API method mismatch so the frontend and backend
use the same HTTP verb.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Adds a legacy mode for /chat/completions that restores v0.23.0-style
output by converting start_to_think/end_to_think back into raw
<think></think> markers and streaming cumulative answer text.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
go-version of Pregel-based BSP engine
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat:
- Allow upsert model_type for instance model
Fix:
- Allow create instance with duplicate api_key
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: Move less important chat settings into a collapsible panel.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
As title.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
```
RAGFlow(api/default)> parse file 'test.html';
Parsing HTML file: test.html
<html>
......
```
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Guard the agent-attachment download against a missing or empty storage blob so the caller gets a structured 4xx (`Document not found!`) instead of an HTTP 500. Same bug class as #15365 on document preview.
Resolve#15502
### What problem does this PR solve?
```
RAGFlow(api/default)> parse file 'README.md';
Parsing Markdown file: README.md
--- AST tree:
HTMLBlock '<div align="center">\n<a href="https:…'
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Allow S3-compatible data source region fields to accept custom values
while preserving search-and-select behavior.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)