11 Commits

Author SHA1 Message Date
sirj0k3r
b2b63600f1 Adds gpt-5.4-mini and gpt-5.4-nano (#14908)
### What problem does this PR solve?
Includes gpt-5.4-mini and gpt-5.4-nano to the OpenAI model list

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-14 10:16:24 +08:00
0xτensor
127aeac4aa fix: expose gpt-5.5 and gpt-5.4 in OpenAI model list (#14828)
### What problem does this PR solve?

OpenAI model catalogs used in provider selection flows were missing the
latest GPT models (`gpt-5.5` and `gpt-5.4`).
Because model availability is driven by seeded catalog data
(`conf/llm_factories.json` → DB seed → API response), these models were
not selectable in the UI or `/llm/list` responses.

This PR updates and synchronizes the OpenAI catalog definitions across
configuration sources and ensures the new models are correctly exposed
through the API layer and validated in tests.

---

### Type of change

* [x] New Feature (non-breaking change which adds functionality)

---

### Changes Made

* Added `gpt-5.5` and `gpt-5.4` to OpenAI catalog definitions in:

  * `conf/llm_factories.json`
  * `conf/models/openai.json` (chat + vision support)
* Ensured consistency between DB-seeded factory config and provider
model configuration
* Updated test coverage in:

  * `test_llm_list_unit.py`

    * seeded OpenAI catalog entries
* added response-level assertion validating `/llm/list` includes both
new model IDs under OpenAI grouping

---

### Root Cause

OpenAI model listings in selection flows are generated from catalog data
seeded via `conf/llm_factories.json`.
The catalog had not been updated to include the latest GPT models,
resulting in missing availability in UI and API responses.

---

### Testing

* Created isolated test environment:

  * `python -m venv .venv-review`
  * installed `pytest`
* Ran targeted and full test suite:

  * `test_list_app_grouping_availability_and_merge`:  passed
  * Full `test_llm_list_unit.py`:  10 passed

---

### Risks / Limitations

* Adding models to the catalog does not guarantee upstream provider
availability or account entitlement.
* Environments with pre-seeded DB catalogs may require reseed or refresh
to reflect updated configuration.

---

### Notes

* Changes are minimal and scoped strictly to catalog configuration and
related test coverage.
* Ensures `/llm/list` API remains aligned with expected latest OpenAI
model availability.
* Closes #14827
2026-05-12 18:03:47 +08:00
box4wangjing
292b0b8bce chore: fix some comments to improve readability (#14756)
### What problem does this PR solve?

fix some comments to improve readability

### Type of change

- [x] Documentation Update

---------

Signed-off-by: box4wangjing <box4wangjing@outlook.com>
2026-05-11 16:48:48 +08:00
wdeveloper16
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install
`opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g.
`==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g.
`docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` /
`external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp
dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate
files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection
patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
Zhichang Yu
b7744e053e fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972)
fix: support dense_vector from ES fields response (ES 9.x compatibility)

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Configuration Chore (non-breaking change which updates
configuration)


## Summary by CodeRabbit

* **Bug Fixes**
* More accurate handling and unwrapping of dense-vector fields so
returned values have correct shapes.
* Field selection reliably limits returned data and falls back to
alternate result locations when needed.
* Use of consistent result IDs and tolerant handling when score values
are missing.

* **Chores / Configuration**
* Increased build memory and adjusted build-time flags for the frontend
build.
* Simplified runtime model/GPU checks and removed an automated runtime
GPU-install attempt.

* **Build Fixes**
* `web/vite.config.ts`: make `build.minify` and `build.sourcemap`
respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from
Dockerfile instead of hardcoding `terser` and `true`.

* **Environment**
* Allow stack version override and default the runtime image tag to
"latest".

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Correct unwrapping of dense-vector fields and reliable field selection
with fallback locations.
* Consistent use of hit-level IDs and tolerant handling when score
values are missing.

* **Chores / Configuration**
* Increased frontend build memory and added build-time minify/sourcemap
flags; build minification and sourcemap now configurable.
* Removed runtime GPU detection for model initialization; force CPU
initialization.

* **Environment**
* Allow stack version override and default runtime image tag to
"latest".

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 17:44:13 +08:00
Lynn
62cb292635 Feat/tenant model (#13072)
### What problem does this PR solve?

Add id for table tenant_llm and apply in LLMBundle.

### Type of change

- [x] Refactoring

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
2026-03-05 17:27:17 +08:00
Yongteng Lei
f13a1fb007 Refa: improve model verification ux (#13392)
### What problem does this PR solve?

Improve model verification UX. #13395 

### Type of change

- [x] Refactoring

---------

Co-authored-by: Liu An <asiro@qq.com>
2026-03-05 17:23:47 +08:00
6ba3i
22c4d72891 tests: improve RAGFlow coverage based on Codecov report (#13219)
### What problem does this PR solve?

Codecov’s coverage report shows that several RAGFlow code paths are
currently untested or under-tested. This makes it easier for regressions
to slip in during refactors and feature work.
This PR adds targeted automated tests to cover the files and branches
highlighted by Codecov, improving confidence in core behavior while
keeping runtime functionality unchanged.

### Type of change

- [x] Other (please describe): Test coverage improvement (adds/extends
unit and integration tests to address Codecov-reported gaps)
2026-02-26 19:03:26 +08:00
6ba3i
38011f2c16 tests: improve RAGFlow coverage based on Codecov report (#13200)
### What problem does this PR solve?

Codecov’s coverage report shows that several RAGFlow code paths are
currently untested or under-tested. This makes it easier for regressions
to slip in during refactors and feature work.
This PR adds targeted automated tests to cover the files and branches
highlighted by Codecov, improving confidence in core behavior while
keeping runtime functionality unchanged.

### Type of change

- [x] Other (please describe): Test coverage improvement (adds/extends
unit and integration tests to address Codecov-reported gaps)
2026-02-25 19:12:11 +08:00
Liu An
c4f60b349d Fix(test): downgrade test priorities (#12913)
### What problem does this PR solve?

Changed test priorities in multiple test files, downgrading from p1 to
p2 and p2 to p3.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-30 20:02:56 +08:00
6ba3i
960ecd3158 Feat: update and add new tests for web api apps (#12714)
### What problem does this PR solve?

This PR adds missing web API tests (system, search, KB, LLM, plugin,
connector). It also addresses a contract mismatch that was causing test
failures: metadata updates did not persist new keys (update‑only
behavior).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): Test coverage expansion and test helper
instrumentation
2026-01-20 19:12:15 +08:00