Files
ragflow/test/unit_test/agent/component
Muhammad Furqan 828c5789f6 fix(agent/tools): GoogleScholar empty json output and ignored top_n (#16419)
### What problem does this PR solve?

Closes #16418.

`scholarly.search_pubs(...)` returns a **lazy generator**, but
`agent/tools/googlescholar.py` treated it as a re-iterable, bounded
list:

```python
scholar_client = scholarly.search_pubs(kwargs["query"], ...)   # lazy generator
self._retrieve_chunks(scholar_client, ...)                     # (1) iterates -> exhausts it
self.set_output("json", list(scholar_client))                  # (2) already empty -> []
```

1. **`json` output was always empty.** `_retrieve_chunks` iterates
`scholar_client`, exhausting the generator; `list(scholar_client)` then
returns `[]`.
2. **`top_n` was never applied.** Unlike `ArXiv`
(`max_results=self._param.top_n`), the unbounded generator was passed
straight to `_retrieve_chunks`, which has no internal limit — so the
tool kept paginating well past Top N (until an error, rate-limit/block,
or `COMPONENT_EXEC_TIMEOUT`).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes

- Materialize at most `top_n` results once with `itertools.islice`, and
reuse that list for both `_retrieve_chunks` and the `json` output.
- Add regression tests
(`test/unit_test/agent/component/test_googlescholar.py`, stubbing
`scholarly.search_pubs`) covering the `top_n` bound, the non-empty
`json` output, and the empty-query short-circuit.

Verified: against `main` the new tests fail with `assert 30 == 5` (top_n
ignored) and `assert 0 == 5` (empty json); with this fix all pass.
Backend-only.

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-07-01 10:47:39 +08:00
..