mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-07-02 00:35:46 +08:00
### What problem does this PR solve?

Closes #16418.

`scholarly.search_pubs(...)` returns a **lazy generator**, but `agent/tools/googlescholar.py` treated it as a re-iterable, bounded list:

```python
scholar_client = scholarly.search_pubs(kwargs["query"], ...)  # lazy generator
self._retrieve_chunks(scholar_client, ...)                    # (1) iterates -> exhausts it
self.set_output("json", list(scholar_client))                 # (2) already empty -> []
```

1. **`json` output was always empty.** `_retrieve_chunks` iterates `scholar_client`, exhausting the generator; `list(scholar_client)` then returns `[]`.
2. **`top_n` was never applied.** Unlike `ArXiv` (`max_results=self._param.top_n`), the unbounded generator was passed straight to `_retrieve_chunks`, which has no internal limit, so the tool kept paginating well past Top N (until an error, rate-limit/block, or `COMPONENT_EXEC_TIMEOUT`).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes

- Materialize at most `top_n` results once with `itertools.islice`, and reuse that list for both `_retrieve_chunks` and the `json` output.
- Add regression tests (`test/unit_test/agent/component/test_googlescholar.py`, stubbing `scholarly.search_pubs`) covering the `top_n` bound, the non-empty `json` output, and the empty-query short-circuit.

Verified: against `main` the new tests fail with `assert 30 == 5` (top_n ignored) and `assert 0 == 5` (empty json); with this fix all pass.

Backend-only.

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
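
### Illustration

For reviewers, here is a minimal, self-contained sketch of the failure mode and of the `itertools.islice` approach described above. It is not the code in `agent/tools/googlescholar.py`: `fake_search_pubs` and `consume` are hypothetical stand-ins for `scholarly.search_pubs` and `_retrieve_chunks`, and the 30-result total is an assumption chosen to mirror the failing assertions.

```python
from itertools import islice


def fake_search_pubs(query, total=30):
    """Hypothetical stand-in for scholarly.search_pubs: a lazy generator."""
    for i in range(total):
        yield {"title": f"{query} paper {i}"}


def consume(results):
    """Hypothetical stand-in for _retrieve_chunks: iterates all it is given."""
    return sum(1 for _ in results)


# Buggy pattern: the same generator is iterated twice.
gen = fake_search_pubs("rag")
seen_buggy = consume(gen)   # walks every result, ignoring any top_n
json_buggy = list(gen)      # generator is already exhausted

# Fixed pattern: materialize at most top_n results once, then reuse the list.
top_n = 5
results = list(islice(fake_search_pubs("rag"), top_n))
seen_fixed = consume(results)
json_fixed = list(results)

print(seen_buggy, len(json_buggy))  # 30 0
print(seen_fixed, len(json_fixed))  # 5 5
```

Because `islice` stops pulling from the generator after `top_n` items, the underlying search is never asked for further pages, and the resulting list can be iterated as many times as needed.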