Commit Graph

19 Commits

Author SHA1 Message Date
cleanjunc
88e4d6bddb Fix: restore GraphRAG entity ranking by indexing pagerank and n-hop paths (#15797)
### Summary

Closes #15795 

Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`,
but the entity chunks written at index time stopped carrying the values
that ranking depends on. `graph_node_to_chunk` only stored
`entity_type`, `description`, and `source_id`, dropping the node
`pagerank` and the n-hop neighbour paths, while `search.py` still read
them back as `rank_flt` and `n_hop_with_weight`.

The producer of these fields, `update_nodes_pagerank_nhop_neighbour`,
was removed in #6513, but the read side in `KGSearch` was never updated.
The result is that on every knowledge-graph query:

- `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0`
for every entity and selection falls back to arbitrary order.
- Every displayed entity score is `0.00`.
- The n-hop relation-enrichment block is dead code because `n_hop_ents`
is always empty, leaving `merge_tuples` and `is_continuous_subsequence`
orphaned.

This PR restores the missing index-time fields so the documented `P(E|Q)
= pagerank * sim` ranking and the n-hop enrichment work again.

What changed:

- `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and
`n_hop_with_weight` from the recomputed n-hop neighbour paths.
- Reintroduced the n-hop path computation (`n_neighbor`) in
`rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples`
/ `is_continuous_subsequence` helpers, with a direction-agnostic
edge-weight lookup for undirected graphs. `set_graph` computes the paths
per added or updated node and passes them through.
- `KGSearch` now selects `n_hop_with_weight` in the entity keyword
search so Infinity and OceanBase return it (Elasticsearch and OpenSearch
already read it from `_source`), and the read is hardened against
missing keys or empty strings before `json.loads`.
- Added the `n_hop_with_weight` column to OceanBase, including the
`EXTRA_COLUMNS` migration entry so existing tables get it. The other
engines already map both fields via dynamic templates or the Infinity
mapping.

Scope note: pagerank and n-hop are re-indexed for the added or updated
nodes in each pass, consistent with the existing incremental indexing
design.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

Added unit tests in
`test/unit_test/rag/graphrag/test_graphrag_utils.py`:

- `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated
nodes, missing weights, and direction-agnostic lookup.
- `graph_node_to_chunk`: `rank_flt` populated from pagerank and
defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an
empty list.

```
uv run pytest test/unit_test/rag/graphrag/   # 106 passed
uv run ruff check rag/graphrag/ rag/utils/ob_conn.py
```
2026-06-09 20:50:45 +08:00
CaptainTimon
2717ee283f feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679)
### What problem does this PR solve?

Closes #14674.

This PR improves RAPTOR configuration and tree construction while
preserving the existing RAPTOR behavior as the default.

RAPTOR currently builds summary layers with the original UMAP + GMM
clustering path. This PR keeps that default path, and adds:

- A hidden backend tree-builder option:
  - `tree_builder="raptor"`: default, existing RAPTOR behavior.
- `tree_builder="psi"`: rank-aware Psi-style tree builder using original
embedding-space cosine ranking.
- A user-facing clustering method option for the default RAPTOR builder:
  - `clustering_method="gmm"`: existing default.
- `clustering_method="ahc"`: agglomerative hierarchical clustering path.
- A RAPTOR UI setting for `Clustering method` and `Max cluster`.

### What changed

#### Backend

- Added `tree_builder` support for RAPTOR/Psi.
- Added `clustering_method` support for GMM/AHC.
- Kept existing RAPTOR + GMM as the default.
- Added Psi tree building from original-space cosine similarity.
- Added bucketed Psi building controls for large inputs:
  - `raptor.ext.psi_exact_max_leaves`
  - `raptor.ext.psi_bucket_size`
- Added method-aware RAPTOR summary metadata using existing
`extra.raptor_method`.
- Avoided adding a dedicated DB schema field for experimental method
tracking.
- Added cleanup/migration logic to avoid mixing stale RAPTOR summary
trees.
- Added defensive checks for Psi tree construction and summary failures.

#### Frontend/UI

- Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`.
- Added/kept `Max cluster` in RAPTOR settings.
- Enlarged max cluster UI limit to `1024`, matching backend validation.
- Kept AHC editable even when a RAPTOR task has already finished.
- Fixed the UI save payload so `clustering_method` and `tree_builder`
are serialized through `parser_config.raptor.ext`, avoiding backend
validation errors for extra top-level RAPTOR fields.

Example saved RAPTOR config:

```json
{
  "raptor": {
    "max_cluster": 317,
    "ext": {
      "clustering_method": "ahc",
      "tree_builder": "raptor"
    }
  }
}

Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>
2026-05-12 09:42:31 +08:00
orbisai0security
e992fe39b2 fix: the oceanbase database connector constructs sql... in ob_conn.py (#14470)
## Summary
Fix critical severity security issue in `rag/utils/ob_conn.py`.

## Vulnerability
| Field | Value |
|-------|-------|
| **ID** | V-003 |
| **Severity** | CRITICAL |
| **Scanner** | multi_agent_ai |
| **Rule** | `V-003` |
| **File** | `rag/utils/ob_conn.py:691` |

**Description**: The OceanBase database connector constructs SQL WHERE
clauses by directly embedding user-controlled filter expressions using
Python f-strings at lines 726, 777, 781, 787, 793, 821, and 827. No
parameterization or allowlist validation is applied before the
expressions are incorporated into live SQL queries. This is the most
critical vulnerability in the codebase because it directly exposes the
RAG knowledge base — the platform's core business asset — to complete
compromise.

## Changes
- `rag/utils/ob_conn.py`

## Verification
- [x] Build passes
- [x] Scanner re-scan confirms fix
- [x] LLM code review passed

---
*Automated security fix by [OrbisAI Security](https://orbisappsec.com)*
2026-04-30 14:25:17 +08:00
MkDev11
cfee2bc9db feat: Auto-adjust chunk recall weights based on user feedback (#12689)
### What problem does this PR solve?

Implements automatic adjustment of knowledge base chunk recall weights
based on user feedback (upvotes/downvotes). When users upvote or
downvote a response, the system locates the corresponding knowledge
snippets and adjusts their recall weight to improve future retrieval
quality.

**Closes #12670**

**How it works:**
1. User upvotes/downvotes a response via `POST /thumbup`
2. System extracts chunk IDs from the conversation reference
3. For each referenced chunk:
   - Reads current `pagerank_fea` value from document store
   - Increments (+1) for upvote or decrements (-1) for downvote
   - Clamps weight to [0, 100] range
   - Updates chunk in ES/Infinity/OceanBase
4. Future retrievals score these chunks higher/lower based on
accumulated feedback

**Files changed:**
- `api/db/services/chunk_feedback_service.py` - New service for updating
chunk pagerank weights
- `api/apps/conversation_app.py` - Integrated feedback service into
thumbup endpoint
- `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests

### Type of change

- [x] New Feature (non-breaking change which adds functionality)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Chat message feedback now updates per-chunk relevance weights
(feature-flag gated), with configurable weighting and atomic updates
across storage backends.

* **Bug Fixes**
* Stricter validation for message feedback inputs and more robust
handling of feedback transitions.

* **Tests**
* Expanded test coverage for chunk-feedback behavior, weighting
strategies, storage backends, and thumb-flip scenarios.

* **Chores**
  * CI workflow extended to run the new chunk-feedback web API tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
2026-04-08 09:52:18 +08:00
huber
8a6b5ced6b fix: add missing chunk_data column to OceanBase schema migration (#13306)
### What problem does this PR solve?

When using OceanBase as the document storage engine, parsing and
inserting chunks with chunk_data (e.g., table parser row data) fails
with the following error:
```
[ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data']
This happens because the chunk_data column was recently introduced but was omitted from the EXTRA_COLUMNS list in 
rag/utils/ob_conn.py
```
As a result, the automatic schema migration for existing OceanBase
tables does not append the missing chunk_data column, causing the
underlying pyobvector or SQLAlchemy to raise an unconsumed column names
error during data insertion.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### What is the solution?
Added column_chunk_data to the EXTRA_COLUMNS list in 
```
rag/utils/ob_conn.py
```
This ensures that the OceanBase connection wrapper can correctly detect
the missing column and automatically alter existing chunk tables to
include the chunk_data field during initialization.
2026-03-02 13:25:11 +08:00
He Wang
394ff16b66 fix: OceanBase metadata not returned in document list API (#13209)
### What problem does this PR solve?

Fix #13144.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-25 15:29:17 +08:00
Yao Wei
cf6fd6f115 fix: When using OceanBase as storage, the list_chunk sorting is abnormal. #13198 (#13208)
Actual behavior
When using OceanBase as storage, the list_chunk sorting is abnormal. The
following is the SQL statement.

SELECT id, content_with_weight, important_kwd, question_kwd, img_id,
available_int, position_int, doc_type_kwd, create_timestamp_flt,
create_time, array_to_string(page_num_int, ',') AS page_num_int_sort,
array_to_string(top_int, ',') AS top_int_sort FROM
rag_store_284250730805059584 WHERE doc_id = '' AND kb_id IN ('') ORDER
BY page_num_int_sort ASC, top_int_sort ASC, create_timestamp_flt DESC
LIMIT 0, 20

<img width="1610" height="740" alt="image"
src="https://github.com/user-attachments/assets/84e14c30-a97f-4e8f-8c8c-6ccac915d97d"
/>

Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
2026-02-25 13:36:18 +08:00
He Wang
ff7afcbe5f feat: add OceanBase memory store (#12955)
### What problem does this PR solve?

Add OceanBase memory store and extracting base class `OBConnectionBase`.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-03 16:46:17 +08:00
Paul Y Hui
f028f74883 Fixed 12787 with syntax error in generated MySql json path expression (#12929)
### What problem does this PR solve?

Fixed 12787 with syntax error in generated MySql json path expression
https://github.com/infiniflow/ragflow/issues/12787 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

#### What was fixed:
- Changed line 237 in ob_conn.py from value_str = get_value_str(value)
if value else "" to value_str = get_value_str(value)
- This fixes the bug where falsy but valid values (0, False, "", [], {})
were being converted to empty strings, causing invalid SQL syntax
#### What was tested:
- Comprehensive unit tests covering all edge cases
- Regression tests specifically for the bug scenario

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-02-03 09:50:14 +08:00
Carve_
23bdf25a1f feature:Add OceanBase Storage Support for Table Parser (#12923)
### What problem does this PR solve?

close #12770 

This PR adds OceanBase as a storage backend for the Table Parser. It
enables dynamic table schema storage via JSON and implements OceanBase
SQL execution for text-to-SQL retrieval.


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Changes
- Table Parser stores row data into `chunk_data` when doc engine is
OceanBase. (table.py)
- OceanBase table schema adds `chunk_data` JSON column and migrates if
needed.
- Implemented OceanBase `sql()` to execute text-to-SQL results.
(ob_conn.py)
- Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py)

### Test
1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`)
<img width="1290" height="783" alt="doc_engine_ob"
src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1"
/>

2. Upload an Excel file to Knowledge Base.(for test, we use as below)
<img width="786" height="930" alt="excel"
src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a"
/>

3. Choose **Table** as parsing method.
<img width="2550" height="1134" alt="parse_excel"
src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0"
/>

4.Ask a natural language query in chat.
<img width="2550" height="1134" alt="query"
src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91"
/>
2026-01-31 15:11:54 +08:00
Angel98518
98b6a0e6d1 feat: Add OceanBase Performance Monitoring and Health Check Integration (#12886)
## Description

This PR implements comprehensive OceanBase performance monitoring and
health check functionality as requested in issue #12772. The
implementation follows the existing ES/Infinity health check patterns
and provides detailed metrics for operations teams.

## Problem

Currently, RAGFlow lacks detailed health monitoring for OceanBase when
used as the document engine. Operations teams need visibility into:
- Connection status and latency
- Storage space usage
- Query throughput (QPS)
- Slow query statistics
- Connection pool utilization

## Solution

### 1. Enhanced OBConnection Class (`rag/utils/ob_conn.py`)

Added comprehensive performance monitoring methods:
- `get_performance_metrics()` - Main method returning all performance
metrics
- `_get_storage_info()` - Retrieves database storage usage
- `_get_connection_pool_stats()` - Gets connection pool statistics
- `_get_slow_query_count()` - Counts queries exceeding threshold
- `_estimate_qps()` - Estimates queries per second
- Enhanced `health()` method with connection status

### 2. Health Check Utilities (`api/utils/health_utils.py`)

Added two new functions following ES/Infinity patterns:
- `get_oceanbase_status()` - Returns OceanBase status with health and
performance metrics
- `check_oceanbase_health()` - Comprehensive health check with detailed
metrics

### 3. API Endpoint (`api/apps/system_app.py`)

Added new endpoint:
- `GET /v1/system/oceanbase/status` - Returns OceanBase health status
and performance metrics

### 4. Comprehensive Unit Tests
(`test/unit_test/utils/test_oceanbase_health.py`)

Added 340+ lines of unit tests covering:
- Health check success/failure scenarios
- Performance metrics retrieval
- Error handling and edge cases
- Connection pool statistics
- Storage information retrieval
- QPS estimation
- Slow query detection

## Metrics Provided

- **Connection Status**: connected/disconnected
- **Latency**: Query latency in milliseconds
- **Storage**: Used and total storage space
- **QPS**: Estimated queries per second
- **Slow Queries**: Count of queries exceeding threshold
- **Connection Pool**: Active connections, max connections, pool size

## Testing

- All unit tests pass
- Error handling tested for connection failures
- Edge cases covered (missing tables, connection errors)
- Follows existing code patterns and conventions

## Code Statistics

- **Total Lines Changed**: 665+ lines
- **New Code**: ~600 lines
- **Test Coverage**: 340+ lines of comprehensive tests
- **Files Modified**: 3
- **Files Created**: 1 (test file)

## Acceptance Criteria Met

 `/system/oceanbase/status` API returns OceanBase health status
 Monitoring metrics accurately reflect OceanBase running status
 Clear error messages when health checks fail
 Response time optimized (metrics cached where possible)
 Follows existing ES/Infinity health check patterns
 Comprehensive test coverage

## Related Files

- `rag/utils/ob_conn.py` - OceanBase connection class
- `api/utils/health_utils.py` - Health check utilities
- `api/apps/system_app.py` - System API endpoints
- `test/unit_test/utils/test_oceanbase_health.py` - Unit tests

Fixes #12772

---------

Co-authored-by: Daniel <daniel@example.com>
2026-01-30 09:44:42 +08:00
dive2tech
15a534909f fix: avoid ZeroDivisionError when fulltext column weights sum to zero (#12856)
### What problem does this PR solve?

When all fulltext_search_columns use explicit weight 0 (e.g. "col^0"),
weight_sum is 0 and dividing by it raises ZeroDivisionError. Use equal
weights 1/n when weight_sum <= 0 and n > 0; otherwise normalize as
before.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [x] Refactoring
2026-01-28 14:38:03 +08:00
He Wang
bd9163904a fix(ob_conn): ignore duplicate errors when executing 'create_idx' (#12661)
### What problem does this PR solve?

Skip duplicate errors to avoid 'create_idx' failures caused by slow
metadata refresh or external modifications.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-16 20:46:37 +08:00
He Wang
360114ed42 fix(ob_conn): avoid reusing SQLAlchemy Column objects in DDL (#12588)
### What problem does this PR solve?

When there are multiple users, parsing a document for a new user can
trigger the reuse of column objects, leading to the error
`sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to
Table xxx`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-13 17:39:20 +08:00
He Wang
55c9fc0017 fix: add 'mom_id' column to OBConnection chunk table (#12444)
### What problem does this PR solve?

Fix #12428

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-05 19:31:44 +08:00
Jin Hai
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
Lynn
6e9691a419 Feat: message manage (#12196)
### What problem does this PR solve?

Manage message and use in agent.

Issue #4213 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-25 21:18:13 +08:00
He Wang
badf33e3b9 feat: enhance OBConnection.search (#11876)
### What problem does this PR solve?

Enhance OBConnection.search for better performance. Main changes:
1. Use string type of vector array in distance func for better parsing
performance.
2. Manually set max_connections as pool size instead of using default
value.
3. Set 'fulltext_search_columns' when starting.
4. Cache the results of the table existence check (we will never drop
the table).
5. Remove unused 'group_results' logic.
6. Add the `USE_FULLTEXT_FIRST_FUSION_SEARCH` flag, and the
corresponding fusion search SQL when it's false.

### Type of change
- [x] Performance Improvement
2025-12-10 19:13:37 +08:00
He Wang
38234aca53 feat: add OceanBase doc engine (#11228)
### What problem does this PR solve?

Add OceanBase doc engine. Close #5350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:00:14 +08:00