ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Files

nickmopen 5b02fe4841 fix(api): stop duplicating answer in openai-compatible chat completions stream (#15286 ) (#15443 )

### What problem does this PR solve?

Fixes #15286.

When calling `/api/v1/openai/<chat_id>/chat/completions` with `"stream":
true`, the response contains the answer **twice** — the final message
repeats everything that was already streamed.

#### Root cause

RAGFlow's `async_chat` streams the body as incremental `delta.content`
chunks, then emits a terminating `final` event whose `answer` is the
**complete** (decorated) message. The handler re-emitted that full
answer as one more `delta.content` chunk:

```python
if ans.get("final"):
    if ans.get("answer"):
        full_content = ans["answer"]
        response["choices"][0]["delta"]["content"] = full_content   # <-- whole answer again
        yield ...
```

So a client accumulating `delta.content` ends up with the message
duplicated.

#### Fix

Drop the re-emission. The complete answer from the `final` event is now
surfaced **only** through the trailing chunk's `final_content` and
`reference` fields, which matches OpenAI streaming semantics: deltas are
incremental, and the final chunk carries only `finish_reason` / `usage`
(plus RAGFlow's `reference` / `final_content` extensions).

This matches the expected behavior described in the issue: "The stream
should only yield content chunks once, and the final message should only
contain reference, usage, and finish_reason."

#### Testability refactor

The streaming SSE assembly was a closure inside the request handler, so
it could only be exercised against a live server + real LLM. I extracted
it into a module-level `_stream_chat_completion_sse` async generator
(behavior-preserving) so it can be unit-tested with a fake event stream.

#### Tests

Adds
`test/unit_test/api/apps/restful_apis/test_openai_stream_no_duplicate.py`
(same import-stub pattern as the existing `test_get_agent_session.py`):

- body is streamed exactly once (the regression);
- the complete answer is never re-emitted as a content chunk;
- the terminating chunk has `finish_reason="stop"`, `content=None`, and
correct `usage`;
- `final_content` / `reference` are present on the trailing chunk;
- reasoning (`think`) deltas stream separately and are not duplicated.

> Note: this is unrelated to #15442, which only changes the `stream`
default — it does not touch the duplication logic.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Added test cases

---------

Co-authored-by: Wang Qi <wangq8@outlook.com>

2026-06-02 13:20:40 +08:00

auth

Fix: JWT algorithm-confusion in OIDC ID token verification (#15181 )

2026-05-29 19:37:01 +08:00

restful_apis

fix(api): stop duplicating answer in openai-compatible chat completions stream (#15286 ) (#15443 )

2026-06-02 13:20:40 +08:00

services

Feat: tenant llm provider (#14595 )

2026-05-29 17:39:41 +08:00

__init__.py

Bug fix: Enhance embeding model to give better error message (#15346 )