Files
ragflow/rag/utils/lazy_image.py

133 lines
3.3 KiB
Python
Raw Normal View History

refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
import logging
from io import BytesIO
from PIL import Image
from rag.nlp import concat_img
class LazyImage:
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
def __init__(self, blobs, source=None):
self._blobs = [b for b in (blobs or []) if b]
self.source = source
self._pil = None
def __bool__(self):
return bool(self._blobs)
def to_pil(self):
if self._pil is not None:
try:
self._pil.load()
return self._pil
except Exception:
try:
self._pil.close()
except Exception:
pass
self._pil = None
res_img = None
for blob in self._blobs:
try:
image = Image.open(BytesIO(blob)).convert("RGB")
except Exception as e:
logging.info(f"LazyImage: skip bad image blob: {e}")
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
continue
if res_img is None:
res_img = image
continue
new_img = concat_img(res_img, image)
if new_img is not res_img:
try:
res_img.close()
except Exception:
pass
try:
image.close()
except Exception:
pass
res_img = new_img
self._pil = res_img
return self._pil
def to_pil_detached(self):
pil = self.to_pil()
self._pil = None
return pil
def close(self):
if self._pil is not None:
try:
self._pil.close()
except Exception:
pass
self._pil = None
return None
def __getattr__(self, name):
pil = self.to_pil()
if pil is None:
raise AttributeError(name)
return getattr(pil, name)
def __array__(self, dtype=None):
import numpy as np
pil = self.to_pil()
if pil is None:
return np.array([], dtype=dtype)
return np.array(pil, dtype=dtype)
def __enter__(self):
return self.to_pil()
def __exit__(self, exc_type, exc, tb):
self.close()
return False
Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * **Centralized Abstraction (`docx_parser.py`)**: Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * **Legacy Code & Redundancy Purge**: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * **Output Consistency**: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.
2026-03-11 10:00:07 +08:00
@staticmethod
def merge(a, b):
"""
Merge two LazyImage instances by combining their blob lists.
Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * **Centralized Abstraction (`docx_parser.py`)**: Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * **Legacy Code & Redundancy Purge**: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * **Output Consistency**: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.
2026-03-11 10:00:07 +08:00
"""
a_blobs = a._blobs if isinstance(a, LazyImage) else []
b_blobs = b._blobs if isinstance(b, LazyImage) else []
Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * **Centralized Abstraction (`docx_parser.py`)**: Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * **Legacy Code & Redundancy Purge**: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * **Output Consistency**: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.
2026-03-11 10:00:07 +08:00
combined = a_blobs + b_blobs
if not combined:
return None
merged = LazyImage(combined)
Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * **Centralized Abstraction (`docx_parser.py`)**: Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * **Legacy Code & Redundancy Purge**: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * **Output Consistency**: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.
2026-03-11 10:00:07 +08:00
return merged
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
LazyDocxImage = LazyImage
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
def ensure_pil_image(img):
if isinstance(img, Image.Image):
return img
if isinstance(img, LazyImage):
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
return img.to_pil()
return None
def is_image_like(img):
return isinstance(img, Image.Image) or isinstance(img, LazyImage)
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
def open_image_for_processing(img, *, allow_bytes=False):
if isinstance(img, Image.Image):
return img, False
if isinstance(img, LazyImage):
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) **Summary** This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a **lazy-loading approach**: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. **What's Changed** Instead of a dry file-by-file list, here is the logical breakdown of the updates: * **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**: Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. **Scope & What is Intentionally Left Out** To keep this PR focused, I have restricted these changes strictly to the **general Word pipeline** and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly **not modified** in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. **Design Considerations** I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. **Validation & Testing** I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * **Confirmed a noticeable drop in peak memory usage when processing image-dense documents.** For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. **Breaking Changes** None.
2026-02-28 11:22:31 +08:00
return img.to_pil_detached(), True
if allow_bytes and isinstance(img, (bytes, bytearray)):
try:
pil = Image.open(BytesIO(img)).convert("RGB")
return pil, True
except Exception as e:
logging.info(f"open_image_for_processing: bad bytes: {e}")
return None, False
return img, False