### What problem does this PR solve?
Fix QWen rerank error handling so DashScope error responses without a
text attribute do not raise a secondary KeyError and hide the real
provider error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Closes#15428
The hybrid score in `rag/nlp/search.py` (`rerank_by_model`) blends
reranker similarity with token similarity on a fixed `[0, 1]` scale:
```python
return tkweight * np.array(tksim) + vtweight * vtsim + rank_fea # tkweight=0.3, vtweight=0.7
```
The reranker implementations did not agree on that scale. Only three of
roughly 17 providers normalized their output, and `NvidiaRerank`
returned raw, unbounded logits. Weighted at `0.7`, a negative logit
could push a genuinely relevant chunk below pure keyword matches, and
its magnitude swamped `tksim`, which lives in `[0, 1]`. The practical
effect was that the same query produced differently scaled scores
depending on the configured reranker, and logit based providers degraded
retrieval quality instead of improving it.
This PR enforces a single scoring contract in one place:
- `Base.similarity` is now the only public entry point. It
short-circuits empty input and guarantees a normalized result. Each
provider implements its raw scoring in `_compute_rank`, which removes
sixteen duplicated empty input guards and the three scattered
normalization calls.
- Normalization is range aware. Providers that already return calibrated
`[0, 1]` relevance scores (Cohere, Jina, Voyage, and others) keep their
absolute magnitudes, so `similarity_threshold` filtering and the
reported `vector_similarity` stay meaningful. Only out-of-range output
such as NVIDIA logits is min-max rescaled into `[0, 1]`.
- The twelve leftover `[DEBUG ...]` prints in `rerank_by_model`,
introduced in #14231, are removed. They ran on every retrieval, added
per chunk overhead, and leaked queries, keywords, and document content
to stdout and logs.
A new regression suite in
`test/unit_test/rag/llm/test_rerank_normalization.py` covers logit
rescaling (positive, negative, and flat batches), preservation of
already calibrated scores, ordering, empty input handling, and the per
provider HTTP path. It also asserts that no provider overrides
`similarity()`, so the contract cannot silently drift.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Description
This PR fixes critical bugs and improves the robustness of the RAG
reranking module while maintaining **100% backward compatibility** with
all existing functionality and providers.
## Key Changes
1. **Network Stability**: Added 30s timeout to all API requests to
prevent service blocking
2. **Boundary Protection**: Added empty query/text validation for all
rerank models
3. **Response Fault Tolerance**: Replaced hardcoded key access with
`.get()` to avoid KeyError crashes
4. **Bug Fixes**:
- Fixed `Ai302Rerank` (completely non-functional before)
- Fixed `GPUStackRerank` incorrect exception catching
- Fixed `_normalize_rank` empty array crash
5. **Code Specification**: Added type annotations, standardized
unimplemented class prompts
## Compatibility
- ✅ No changes to any class/method names
- ✅ All rerank providers (Jina/Cohere/NVIDIA/HuggingFace etc.) work as
before
- ✅ No breaking changes, zero impact on existing workflows
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
## Summary
This PR fully addresses all CodeRabbit review feedback and enhances the
robustness of the reranking module with 100% backward compatibility.
## Key Fixes
1. Fixed JinaRerank hardcoded base_url to support subclass endpoint
overrides
2. Corrected GPUStackRerank exception handling to use proper requests
exceptions and preserve stack traces
3. Added 30s timeout to all API calls to prevent service hanging
4. Added empty input validation for all rerank providers
5. Replaced direct dict key access with .get() to eliminate KeyError
crashes
6. Fixed _normalize_rank edge case for empty arrays
7. Implemented missing functionality for Ai302Rerank
8. Standardized type hints and fixed typo issues
## Compatibility
- No breaking changes to any existing functionality
- All rerank providers work as originally intended
- Fully compatible with existing configurations and workflows
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Multiple `requests.post()` calls across the LLM integration layer lack a
`timeout` parameter. Without a timeout, a single unresponsive upstream
service can block the calling thread **indefinitely**, eventually
exhausting the thread pool and degrading the entire system.
This is a well-known issue — Python's `requests` library defaults to
`timeout=None` (infinite wait), and [the library docs explicitly
recommend](https://requests.readthedocs.io/en/latest/user/advanced/#timeouts)
always setting a timeout.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Change
Added `timeout` to all `requests.post()` calls missing it:
| File | Calls fixed | Timeout |
|------|-------------|---------|
| `rag/llm/rerank_model.py` | 9 | 30s |
| `rag/llm/embedding_model.py` | 8 | 30s |
| `rag/llm/cv_model.py` | 3 | 60s |
| `rag/llm/tts_model.py` | 2 | 60s |
| `rag/llm/sequence2txt_model.py` | 2 | 60s |
Embedding/rerank calls use 30s (lightweight API calls). Vision, TTS, and
audio transcription use 60s (heavier workloads with file uploads).
Note: other files in the codebase (e.g. `check_minio_alive`,
`check_ragflow_server_alive`) already use `timeout=10`, so this PR
brings the LLM layer in line with existing practice.
Signed-off-by: Ricardo-M-L <Sibyl_Hartmanbnb@webname.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
HuggingfaceRerank.post() unconditionally prepends `http://` to base_url,
which already contains a protocol. This creates invalid URLs like
http://http://127.0.0.1:8080/rerank, breaking all requests. The fix
normalizes URL handling to match the rest of the codebase, removing
redunant `http://`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Related Issues
- #7318
- #7796
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
https://bailian.console.aliyun.com/cn-beijing?tab=api#/api/?type=model&url=2780056
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Other (please describe): add gte-rerank-v2、qwen3-rerank
### What problem does this PR solve?
This PR aims to extend the list of possible providers. Adds new Provider
"RAGcon" within the Ollama Modal. It provides all model types except OCR
via Openai-compatible endpoints.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Jakob <16180662+hauberj@users.noreply.github.com>
### What problem does this PR solve?
**Cohere rerank base_url default handling**
- Background: When no rerank base URL is configured, the settings
pipeline was passing an empty string through RERANK_CFG →
TenantLLMService → CoHereRerank, so the Cohere client received
base_url="" and produced “missing protocol” errors during rerank calls.
- What changed: The CoHereRerank constructor now only forwards base_url
to the Cohere client when it isn’t empty/whitespace, causing the client
to fall back to its default API endpoint otherwise.
- Why it matters: This prevents invalid URL construction in the rerank
workflow and keeps tests/sanity checks that rely on the default Cohere
endpoint from failing when no custom base URL is specified.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: Jason <ggbbddjm@gmail.com>
### What problem does this PR solve?
Introduced gpu profile in .env
Added Dockerfile_tei
fix datrie
Removed LIGHTEN flag
### Type of change
- [x] Documentation Update
- [x] Refactoring
### What problem does this PR solve?
fix the rerank_model condition logic by correcting the np.isclose check.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
for total_token_count method use if to check first, to improve the
performance when we need to handle exception cases
### Type of change
- [x] Refactoring
### What problem does this PR solve?
1. Use input length to prepare res
2. Adjust torch_empty_cache code location
### Type of change
- [x] Refactoring
- [x] Performance Improvement
Updated constructors for base and derived classes in chat, embedding,
rerank, sequence2txt, and tts models to accept **kwargs. This change
improves extensibility and allows passing additional parameters without
breaking existing interfaces.
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: IT: Sop.Son <sop.son@feavn.local>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### What problem does this PR solve?
#9082#6365
<u> **WARNING: it's not compatible with the older version of `Agent`
module, which means that `Agent` from older versions can not work
anymore.**</u>
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
vLLM provider with a reranking model does not work : as vLLM uses under
the hood the [CoHereRerank
provider](https://github.com/infiniflow/ragflow/blob/v0.17.0/rag/llm/__init__.py#L250)
with a `base_url`, if this URL [is not passed to the Cohere
client](https://github.com/infiniflow/ragflow/blob/v0.17.0/rag/llm/rerank_model.py#L379-L382)
any attempt will endup on the Cohere SaaS (sending your private api key
in the process) instead of your vLLM instance.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Add GPUStack as a new model provider.
[GPUStack](https://github.com/gpustack/gpustack) is an open-source GPU
cluster manager for running LLMs. Currently, locally deployed models in
GPUStack cannot integrate well with RAGFlow. GPUStack provides both
OpenAI compatible APIs (Models / Chat Completions / Embeddings /
Speech2Text / TTS) and other APIs like Rerank. We would like to use
GPUStack as a model provider in ragflow.
[GPUStack Docs](https://docs.gpustack.ai/latest/quickstart/)
Related issue: https://github.com/infiniflow/ragflow/issues/4064.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### Testing Instructions
1. Install GPUStack and deploy the `llama-3.2-1b-instruct` llm, `bge-m3`
text embedding model, `bge-reranker-v2-m3` rerank model,
`faster-whisper-medium` Speech-to-Text model, `cosyvoice-300m-sft` in
GPUStack.
2. Add provider in ragflow settings.
3. Testing in ragflow.