Commit Graph

51 Commits

Author SHA1 Message Date
Daniil Sivak
60ad32a0c2 Feat: support epub parsing (#13650)
Closes #1398

### What problem does this PR solve?

Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

To check this parser manually:

```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
  data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
  print(f'\n--- Section {i} ---')
  print(s[:200])
"
```
2026-03-17 20:14:06 +08:00
Phives
4ceb668d40 feat(api/utils): Harden file_utils for robustness and edge cases (#12915)
## Summary
Improves robustness and edge-case handling in `api.utils.file_utils` to
avoid crashes, DoS/OOM risks, and timeouts when processing user-provided
filenames, paths, and file blobs.

## Changes

### Resource limits & timeouts
- **`MAX_BLOB_SIZE_THUMBNAIL`** (50 MiB) and **`MAX_BLOB_SIZE_PDF`**
(100 MiB) to reject oversized inputs before thumbnail/PDF processing.
- **`GHOSTSCRIPT_TIMEOUT_SEC`** (120 s) for
`repair_pdf_with_ghostscript` subprocess to avoid hangs on malicious or
broken PDFs.

### `filename_type`
- Handles `None`, empty string, non-string (e.g. int/list), and
path-only input via new **`_normalize_filename_for_type()`**.
- Uses basename for type detection (e.g. `a/b/c.pdf` → PDF).
- Enforces **`FILE_NAME_LEN_LIMIT`**; invalid input returns
`FileType.OTHER`.

### `thumbnail_img`
- Rejects `None`/empty/oversized blob and invalid filename; returns
`None` instead of raising.
- Wraps PDF, image, and PPT handling in try/except so corrupt or
malformed files return `None`.
- Ensures PDF has pages and PPT has slides before use.
- Normalizes PIL image mode (RGBA/P/LA → RGB) for safe PNG export.

### `repair_pdf_with_ghostscript`
- Handles `None`/empty input; skips repair when input size exceeds
limit.
- Uses `subprocess.run(..., timeout=GHOSTSCRIPT_TIMEOUT_SEC)` and
catches `TimeoutExpired`.
- Returns original bytes when Ghostscript output is empty.

### `read_potential_broken_pdf`
- `None` → `b""`; non–sequence-like (no `len`) → `b""`; empty → return
as-is.
- Oversized blob returned as-is (no repair) to avoid DoS.

### `sanitize_path`
- Explicit `None` and non-string check; strips whitespace before
normalizing.

## Testing
- **`test/unit_test/utils/test_api_file_utils.py`** added with 36 unit
tests covering the above behavior (filename_type, sanitize_path,
read_potential_broken_pdf, thumbnail_img, thumbnail,
repair_pdf_with_ghostscript, constants).
- All tests pass.

---------

Co-authored-by: Gittensor Miner <miner@gittensor.io>
2026-02-25 14:34:47 +08:00
Kevin Hu
f1c2fac03e Refa: remove ppt image. (#12909)
### What problem does this PR solve?

remove `aspose`

### Type of change

- [x] Refactoring
2026-01-30 13:35:42 +08:00
lys1313013
37e4485415 feat: add MDX file support (#12261)
Feat: add MDX file support  #12057 
### What problem does this PR solve?

<img width="1055" height="270" alt="image"
src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b"
/>

The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"
<img width="381" height="110" alt="image"
src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
YngvarHuang
bd5dda6b10 Feature/doc upload api add parent path 20251112 (#11231)
### What problem does this PR solve?

Add the specified parent_path to the document upload api interface
(#11230)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: virgilwong <hyhvirgil@gmail.com>
2025-11-13 09:59:39 +08:00
Jin Hai
9a486e0f51 Move some funcs from api to rag module (#10972)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 19:26:09 +08:00
Jin Hai
1284647694 Refactor file utils (#10970)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 18:54:55 +08:00
Jin Hai
d008a4df9f Move base64_image related functions to common directory (#10957)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 15:20:46 +08:00
Billy Bao
fa210e7c58 Feat: parsing hyperlinks in docx and pdf & Fix: default parser config of toc extraction (#10877)
### What problem does this PR solve?

Feat: parsing hyperlinks in docx and pdf #10848
Fix: default parser config of toc extraction

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-03 09:34:12 +08:00
Jin Hai
44f2d6f5da Move 'get_project_base_directory' to common directory (#10940)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 21:05:28 +08:00
Jin Hai
57a83eca8a Remove unused code (#10938)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 16:25:16 +08:00
Kevin Hu
f24d464a53 Fix: video file suffix (#10740)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-23 11:13:09 +08:00
Billy Bao
d956a442ce Fix: Remove pdf embed support, update based on #10635 (#10663)
### What problem does this PR solve?

Fix: Remove pdf embed support, update based on  #10635

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-20 13:45:53 +08:00
Billy Bao
8ee0b6ea54 File: Now parsing support all types of embedded documents, solved #10059 (#10635)
### What problem does this PR solve?

File: Now parsing support all types of embedded documents, solved #10059
Fix: Incomplete words in chat #10530
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-17 18:46:47 +08:00
Kevin Hu
cbf04ee470 Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423)
### What problem does this PR solve?

#9869

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: jinhai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: chanx <1243304602@qq.com>
Co-authored-by: balibabu <cike8899@users.noreply.github.com>
Co-authored-by: Lynn <lynn_inf@hotmail.com>
Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com>
Co-authored-by: huangzl <huangzl@shinemo.com>
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
Co-authored-by: Wilmer <33392318@qq.com>
Co-authored-by: Adrian Weidig <adrianweidig@gmx.net>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yongteng Lei <yongtengrey@outlook.com>
Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com>
Co-authored-by: BadwomanCraZY <511528396@qq.com>
Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com>
Co-authored-by: Russell Valentine <russ@coldstonelabs.org>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Billy Bao <newyorkupperbay@gmail.com>
Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com>
Co-authored-by: TensorNull <tensor.null@gmail.com>
Co-authored-by: TeslaZY <TeslaZY@outlook.com>
Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com>
Co-authored-by: AB <aj@Ajays-MacBook-Air.local>
Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com>
Co-authored-by: He Wang <wanghechn@qq.com>
Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com>
Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box>
Co-authored-by: Stephen Hu <stephenhu@seismic.com>
Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com>
Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com>
Co-authored-by: mxc <mxc@example.com>
Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com>
Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com>
Co-authored-by: mcoder6425 <mcoder64@gmail.com>
Co-authored-by: lemsn <lemsn@msn.com>
Co-authored-by: lemsn <lemsn@126.com>
Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com>
Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com>
Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>
2025-10-09 12:36:19 +08:00
Yongteng Lei
39ef2ffba9 Feat: parsing supports jsonl or ldjson format (#9087)
### What problem does this PR solve?

Supports jsonl or ldjson format. Feature request from
[discussion](https://github.com/orgs/infiniflow/discussions/8774).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-07-30 09:48:20 +08:00
Jin Hai
4a2ff633e0 Fix typo in code (#8327)
### What problem does this PR solve?

Fix typo in code

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-06-18 09:41:09 +08:00
Yongteng Lei
0ebf05440e Feat: repair corrupted PDF files on upload automatically (#7693)
### What problem does this PR solve?

Try the best to repair corrupted PDF files on upload automatically.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-05-19 14:54:06 +08:00
Zhichang Yu
c813c1ff4c Made task_executor async to speedup parsing (#5530)
### What problem does this PR solve?

Made task_executor async to speedup parsing

### Type of change

- [x] Performance Improvement
2025-03-03 18:59:49 +08:00
yihong
8a2542157f Fix: possible memory leaks close #5277 (#5500)
### What problem does this PR solve?

close #5277 by make sure the file close

### Type of change

- [x] Performance Improvement

---------

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-03 10:26:45 +08:00
Zhichang Yu
112ef42a19 Ensure thumbnail be smaller than 64K (#3722)
### What problem does this PR solve?

Ensure thumbnail be smaller than 64K. Close #1443
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-11-28 19:15:31 +08:00
Zhichang Yu
2249d5d413 Always open text file for write with UTF-8 (#3688)
### What problem does this PR solve?

Always open text file for write with UTF-8. Close #932 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-27 16:24:16 +08:00
Zhichang Yu
7c486ee3f9 Fix typo (#3319)
### What problem does this PR solve?

Fix typo

### Type of change

- [x] Refactoring
2024-11-11 09:36:39 +08:00
Kevin Hu
b2524eec49 fix sequence2txt error and usage total token issue (#2961)
### What problem does this PR solve?

#1363

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-10-22 11:38:37 +08:00
wwwlll
6a4858a7ee Fix thumbnail_img NoneType error (#2941)
### What problem does this PR solve?

fix thumbnail_img NoneType error

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-10-22 09:21:05 +08:00
chongchuanbing
485bfd6c08 fix: Large document thumbnail display failed (#2763)
### What problem does this PR solve?

In MySQL, when the thumbnail base64 of a document is relatively large,
the display of the document's thumbnail fails.
Now, I put the document thumbnail into MiniIO storage.

### Type of change

- [✓] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
2024-10-10 09:09:29 +08:00
Jin Hai
6b3a40be5c Format file format from Windows/dos to Unix (#1949)
### What problem does this PR solve?

Related source file is in Windows/DOS format, they are format to Unix
format.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-08-15 09:17:36 +08:00
Kevin Hu
cafdee536f add sql to naive parser (#1908)
### What problem does this PR solve?


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-08-12 15:29:33 +08:00
黄腾
ede733e130 add support for eml file parser (#1768)
### What problem does this PR solve?

add support for eml file parser
#1363

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-08-06 16:42:14 +08:00
KevinHuSh
0171082cc5 fix create dialog bug (#982)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 09:25:05 +08:00
Zhedong Cen
8dd45459be Add support for HTML file (#973)
### What problem does this PR solve?

Add support for HTML file

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-30 09:12:55 +08:00
KevinHuSh
c3bc72dfd9 fix too large thumbnail issue (#817)
### What problem does this PR solve?

#709

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-17 14:04:21 +08:00
dashi6174
6ff63ee2ba Support for code files parse (#789)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-15 16:34:28 +08:00
KevinHuSh
cab274f560 remove PyMuPDF (#618)
### What problem does this PR solve?
#613 

### Type of change


- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
KevinHuSh
674b3aeafd fix disable and enable llm setting in dialog (#616)
### What problem does this PR solve?
#614 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:04:14 +08:00
KevinHuSh
2af74cc494 refine docker layers (#606)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-04-29 17:57:40 +08:00
KevinHuSh
f69ff39fa0 add file management feature (#560)
### What problem does this PR solve?

### Type of change

- [x] Documentation Update
2024-04-26 17:21:53 +08:00
chrysanthemum-boy
72384b191d Add .doc file parser. (#497)
### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from tika import parser
from io import BytesIO

def extract_text_from_doc_bytes(doc_bytes):
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    return parsed["content"]
```
### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chrysanthemum-boy <fannc@qq.com>
2024-04-23 15:31:43 +08:00
KevinHuSh
b8e58fe27a add redis to accelerate access of minio (#482)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-04-22 14:11:09 +08:00
KevinHuSh
fd7fcb5baf apply pep8 formalize (#155) 2024-03-27 11:33:46 +08:00
KevinHuSh
d1c600d5d3 add ocr and recognizer demo, update README (#74) 2024-02-26 19:51:35 +08:00
KevinHuSh
7fd1eca582 init README of deepdoc, add picture processer. (#71)
* init README of deepdoc, add picture processer.

* add resume parsing
2024-02-23 18:28:12 +08:00
KevinHuSh
a8294f2168 Refine resume parts and fix bugs in retrival using sql (#66) 2024-02-19 19:22:17 +08:00
KevinHuSh
c5ea37cd30 Add resume parser and fix bugs (#59)
* Update .gitignore

* Update .gitignore

* Add resume parser and fix bugs
2024-02-07 19:27:23 +08:00
KevinHuSh
e6acaf6738 Add Q&A and Book, fix task running bugs (#50) 2024-02-01 18:53:56 +08:00
KevinHuSh
072f9dd5bc Add app to rag module: presentaion & laws (#43) 2024-01-25 18:57:39 +08:00
KevinHuSh
e32ef75e99 Test chat API and refine ppt chunker (#42) 2024-01-23 19:45:36 +08:00
KevinHuSh
34b2ab3b2f Test APIs and fix bugs (#41) 2024-01-22 19:51:38 +08:00
KevinHuSh
484e5abc1f llm configuation refine and trievalTest API refine (#40) 2024-01-19 19:51:57 +08:00
KevinHuSh
9bf75d4511 add dialog api (#33) 2024-01-17 20:20:42 +08:00