mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 15:31:05 +08:00
Fix: preserve field boundaries in chunked documents from MySQL… (#13369)
### What problem does this PR solve?

When multiple columns are used as content columns in RDBMS connector, the generated document text gets chunked by TxtParser which strips newline delimiters during merge. This causes field names and values from different columns to be concatenated without any separator, making the content unreadable.

Changes:

- txt_parser.py: restore newline separator when merging adjacent text segments within a chunk, so that split sections are not directly concatenated
- rdbms_connector.py: use double newline between fields and place field value on a new line after the field name bracket, giving TxtParser clearer boundaries to work with

Closes #13001

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: tunsuytang <tunsuytang@tencent.com>
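The failure mode described above can be illustrated with a minimal sketch. It assumes a simplified stand-in for the merge step: `merge_segments` is a hypothetical helper written for this illustration and is not the actual TxtParser code.

```python
def merge_segments(segments, sep=""):
    """Join adjacent text segments into one chunk using `sep`.

    Hypothetical stand-in for a chunk-merge step; not the TxtParser API.
    """
    return sep.join(s for s in segments if s)


# Two fields that were split on "\n" before being merged back into one chunk.
segments = ["【title】: Hello", "【body】: World"]

# With an empty separator the field boundary disappears.
broken = merge_segments(segments)

# Restoring the newline keeps the two fields apart.
fixed = merge_segments(segments, "\n")

print(repr(broken))
print(repr(fixed))
```

With the empty separator the two fields run together, which is the unreadable output the PR description reports; passing `"\n"` keeps them on separate lines.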
@@ -204,11 +204,11 @@ class RDBMSConnector(LoadConnector, PollConnector):
         value = row_dict[col]
         if isinstance(value, (dict, list)):
             value = json.dumps(value, ensure_ascii=False)
-        # Use brackets around field name to ensure it's distinguishable
-        # after chunking (TxtParser strips \n delimiters during merge)
-        content_parts.append(f"【{col}】: {value}")
+        # Use brackets around field name and put value on a new line
+        # so that TxtParser preserves field boundaries after chunking.
+        content_parts.append(f"【{col}】:\n{value}")

-    content = "\n".join(content_parts)
+    content = "\n\n".join(content_parts)

     if self.id_column and self.id_column in row_dict:
         doc_id = f"{self.db_type}:{self.database}:{row_dict[self.id_column]}"
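
To see what the new format produces, here is a small self-contained sketch of the changed lines. The wrapper `format_row`, its `content_columns` parameter, and the sample row are assumptions made for this illustration; only the loop body and the `join` call mirror the diff above.

```python
import json


def format_row(row_dict, content_columns):
    """Build document text from selected columns (hypothetical wrapper)."""
    content_parts = []
    for col in content_columns:
        value = row_dict[col]
        if isinstance(value, (dict, list)):
            # Serialize structured values without escaping non-ASCII text.
            value = json.dumps(value, ensure_ascii=False)
        # Field name in brackets, value on its own line (as in the diff).
        content_parts.append(f"【{col}】:\n{value}")
    # A blank line between fields gives the chunker a clear boundary.
    return "\n\n".join(content_parts)


# Sample row, invented for illustration.
row = {"id": 7, "title": "Hello", "tags": ["a", "b"]}
content = format_row(row, ["title", "tags"])
print(content)
```

Each field name now sits on its own line and fields are separated by a blank line, so a field boundary stays visible even if a later merge step drops one of the newlines.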