mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 23:41:12 +08:00
Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255)
Core optimizations (see arXiv:2510.09722):
1. PDF text fusion: dual-path extraction from PDF metadata and OCR, with the two results fused.
2. Page-aware reconstruction: YOLOv10 page segmentation, hierarchical sorting, and line-number indexing.
3. Parallel task decomposition: basic information, work experience, and educational background are extracted by three parallel LLM calls.
4. Index pointer mechanism: the LLM returns a range of line numbers instead of regenerating the full text, which reduces hallucination in the extracted text.

---------

Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
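The index pointer mechanism can be illustrated with a minimal sketch. It assumes the `[n]: text` line format shown in the work-experience prompt example and, as a further assumption, 0-based numbering. The helper names `index_lines` and `resolve_span` are hypothetical and are not taken from `rag/app/resume.py`, whose diff is suppressed on this page.

```python
def index_lines(lines):
    """Prefix each line with a bracketed index, e.g. '[0]: John Smith'."""
    return "\n".join(f"[{i}]: {line}" for i, line in enumerate(lines))


def resolve_span(lines, desc_lines):
    """Map an inclusive [start, end] pointer back to the original text.

    An empty or malformed pointer yields "" instead of raising, because the
    prompts allow [] when no description exists.
    """
    if not (isinstance(desc_lines, list) and len(desc_lines) == 2):
        return ""
    start, end = desc_lines
    if not (isinstance(start, int) and isinstance(end, int)):
        return ""
    # Clamp to the valid range so an over-long pointer cannot raise.
    start, end = max(start, 0), min(end, len(lines) - 1)
    if start > end:
        return ""
    return "\n".join(lines[start:end + 1])


resume = [
    "Google Inc. 2021.11-2022.11 Senior Engineer",
    "Job description: Responsible for backend development",
    "Achieved 99.9% uptime for core services",
]
print(index_lines(resume))          # text that would be sent to the LLM
print(resolve_span(resume, [1, 2])) # text recovered from the pointer
```

Because the description is copied from the source lines rather than generated, the model cannot alter its wording; it can only point at the wrong range.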
2669
rag/app/resume.py
File diff suppressed because it is too large
39
rag/prompts/resume_basic_info.md
Normal file
@@ -0,0 +1,39 @@
Please extract basic information from the following line-indexed resume text.

{indexed_text}

Extract the following information into JSON. If a field does not exist, output "" (empty) or 0:
{{
  "name_kwd": "",
  "gender_kwd": "",
  "age_int": 0,
  "phone_kwd": "",
  "email_tks": "",
  "birth_dt": "",
  "work_exp_flt": 0,
  "current_location": "",
  "expect_city_names_tks": [],
  "expect_position_name_tks": [],
  "skill_tks": [],
  "language_tks": [],
  "certificate_tks": [],
  "self_evaluation_tks": ""
}}

Field descriptions:
- name_kwd: Name, e.g. "张三"
- gender_kwd: 男/女 (male/female); leave empty if not present
- age_int: Current age, integer
- phone_kwd: Phone/mobile number; keep the form used in the original text, including country code, area code, and brackets
- email_tks: Email, e.g. "xxx@qq.com"
- birth_dt: Year and month of birth, e.g. "1996-11"
- work_exp_flt: Years of work experience, float
- current_location: Current residence/current city; do not infer it from work experience, it must be explicitly stated as the current residence
- expect_city_names_tks: List of preferred work cities; the resume must explicitly state that they are preferred cities
- expect_position_name_tks: List of desired positions
- skill_tks: List of skills/tech stack
- language_tks: List of language proficiencies
- certificate_tks: List of certificates/qualifications
- self_evaluation_tks: Self-evaluation/personal strengths/personal summary; extract the complete original text

Return JSON only. /no_think
39
rag/prompts/resume_basic_info_en.md
Normal file
@@ -0,0 +1,39 @@
Please extract basic information from the following line-indexed resume text.

{indexed_text}

Extract the following information into JSON. If a field does not exist, output "" or 0:
{{
  "name_kwd": "",
  "gender_kwd": "",
  "age_int": 0,
  "phone_kwd": "",
  "email_tks": "",
  "birth_dt": "",
  "work_exp_flt": 0,
  "current_location": "",
  "expect_city_names_tks": [],
  "expect_position_name_tks": [],
  "skill_tks": [],
  "language_tks": [],
  "certificate_tks": [],
  "self_evaluation_tks": ""
}}

Field descriptions:
- name_kwd: Full name, e.g. "John Smith"
- gender_kwd: Male/Female, leave empty if not present
- age_int: Current age, integer
- phone_kwd: Phone number, keep original format including country code and brackets
- email_tks: Email address, e.g. "xxx@gmail.com"
- birth_dt: Date of birth, e.g. "1996-11"
- work_exp_flt: Years of work experience, float
- current_location: Current city/location, do not infer from work experience, must be explicitly stated
- expect_city_names_tks: List of preferred work cities, must be explicitly stated in the resume
- expect_position_name_tks: List of desired positions
- skill_tks: List of skills/tech stack
- language_tks: List of language proficiencies
- certificate_tks: List of certificates/qualifications
- self_evaluation_tks: Self-evaluation/personal strengths/summary, extract full original text

Return JSON only. /no_think
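The doubled braces in the basic-information prompts suggest the templates are rendered with Python's `str.format`, where `{{` and `}}` become literal braces and `{indexed_text}` is the only placeholder. The sketch below works under that assumption with a shortened stand-in template; `render_prompt`, `parse_basic_info`, and `DEFAULTS` are hypothetical names, not code from this commit.

```python
import json

# Shortened stand-in for rag/prompts/resume_basic_info_en.md (assumption:
# the real file is read from disk and rendered the same way).
TEMPLATE = """Please extract basic information from the following line-indexed resume text.

{indexed_text}

Extract the following information into JSON. If a field does not exist, output "" or 0:
{{
  "name_kwd": "",
  "age_int": 0,
  "skill_tks": []
}}

Return JSON only. /no_think"""

# Defaults mirror the prompt's rule: "" or 0 for missing scalars, [] for lists.
DEFAULTS = {"name_kwd": "", "age_int": 0, "skill_tks": []}


def render_prompt(indexed_text):
    # str.format fills {indexed_text} and collapses {{ / }} to literal braces.
    return TEMPLATE.format(indexed_text=indexed_text)


def parse_basic_info(raw):
    """Parse the model reply and fill in defaults for missing fields."""
    data = json.loads(raw)
    return {key: data.get(key, default) for key, default in DEFAULTS.items()}


prompt = render_prompt("[0]: John Smith\n[1]: Python, Go")
info = parse_basic_info('{"name_kwd": "John Smith", "skill_tks": ["Python", "Go"]}')
print(info)
```

Filling defaults after parsing keeps downstream code from having to check for absent keys when the model omits a field.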
31
rag/prompts/resume_education.md
Normal file
@@ -0,0 +1,31 @@
Please extract education background from the following line-indexed resume text.

{indexed_text}

Extract into JSON:
{{
  "education": [
    {{
      "school": "",
      "major": "",
      "degree": "",
      "department": "",
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- school: Full school name, e.g. "厦门大学" (Xiamen University); Chinese and English are both acceptable
- major: Major, e.g. "机械工程" (Mechanical Engineering)
- degree: Degree, one of 本科/硕士/博士/专科/高中/初中 (bachelor's/master's/doctorate/associate/high school/middle school); use "" if not available
- department: Department/college, e.g. "信息工程系" (Department of Information Engineering)
- start_date: Start date, in the format %Y.%m or %Y
- end_date: End date; use "至今" ("present") if ongoing, "" if not available
- desc_lines: [start line number, end line number], the line-number range of the education description (optional)
  - Includes course grades, research focus, GPA, honors and awards, etc.
  - Use [] if not available

Return JSON only. /no_think
31
rag/prompts/resume_education_en.md
Normal file
@@ -0,0 +1,31 @@
Please extract education background from the following line-indexed resume text.

{indexed_text}

Extract into JSON:
{{
  "education": [
    {{
      "school": "",
      "major": "",
      "degree": "",
      "department": "",
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- school: Full school name, e.g. "Stanford University", both Chinese and English are acceptable
- major: Major/field of study, e.g. "Computer Science"
- degree: Degree level - Bachelor/Master/PhD/Associate/High School/Middle School, leave "" if not available
- department: Department/College, e.g. "School of Engineering"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date, use "Present" if still enrolled, "" if not available
- desc_lines: [start_line, end_line], line number range for education description (optional)
  - Includes coursework, research focus, GPA, honors/awards, etc.
  - Use [] if not available

Return JSON only. /no_think
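The education prompts ask for dates in `%Y.%m` or `%Y` form and for "Present" (or "至今" in the Chinese prompt) when the entry is ongoing. A consumer of the reply might normalize those values roughly as follows; this is a sketch only, and `parse_resume_date`, `is_ongoing`, and `ONGOING` are hypothetical names rather than functions from the commit.

```python
from datetime import datetime

# Markers the two prompt variants ask the model to use for ongoing entries.
ONGOING = {"present", "至今"}


def is_ongoing(end_date):
    """True when end_date is one of the ongoing markers."""
    return (end_date or "").strip().lower() in ONGOING


def parse_resume_date(value):
    """Return (year, month) for '%Y.%m' input or (year, None) for '%Y'.

    Returns None for "", for the ongoing markers, and for unparseable input.
    """
    value = (value or "").strip()
    if not value or value.lower() in ONGOING:
        return None
    for fmt in ("%Y.%m", "%Y"):
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next accepted format
        return (parsed.year, parsed.month if fmt == "%Y.%m" else None)
    return None


print(parse_resume_date("2024.1"))
print(parse_resume_date("2019"))
print(is_ongoing("Present"))
```

Returning `None` for anything outside the two formats makes it easy to spot replies where the model ignored the requested date format.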
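31
rag/prompts/resume_project_exp.md
Normal file
@@ -0,0 +1,31 @@
Please extract project experience from the following line-indexed resume text.

{indexed_text}

Extract into JSON; each project experience entry contains:
{{
  "projectExperience": [
    {{
      "project_name": "",
      "role": "",
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- project_name: Project name
- role: Role/responsibility, e.g. "项目负责人" (project lead), "后端开发" (backend developer)
- start_date: Start date, in the format %Y.%m or %Y
- end_date: End date; use "至今" ("present") if ongoing, "" if not available
- desc_lines: [start line number, end line number], the line-number range of the project description (integer array)
  - Refers to the index range of the original-text passage describing the project, including project content, tech stack, results, etc.
  - Does not include the lines containing project_name, role, start_date, end_date
  - Be as complete as possible, continuing until the next project experience entry or another section heading
  - You must stop at the following section headings and must not include them in desc_lines:
    个人评价、自我评价、个人总结、个人优势、自我描述、技能特长、专业技能、教育背景、教育经历、工作经历、工作经验、证书资质、语言能力、兴趣爱好、求职意向 (personal evaluation, self-evaluation, personal summary, personal strengths, self-description, skills and specialties, professional skills, education background, education history, work history, work experience, certificates and qualifications, language ability, hobbies and interests, job objective)
  - Use [] if not available

Return JSON only. /no_think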
31
rag/prompts/resume_project_exp_en.md
Normal file
@@ -0,0 +1,31 @@
Please extract project experience from the following line-indexed resume text.

{indexed_text}

Extract into JSON, each project experience entry contains:
{{
  "projectExperience": [
    {{
      "project_name": "",
      "role": "",
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- project_name: Project name
- role: Role/responsibility, e.g. "Project Lead", "Backend Developer"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date, use "Present" if ongoing, "" if not available
- desc_lines: [start_line, end_line], line number range for project description (integer array)
  - Refers to the original text reference range for project description, including project content, tech stack, achievements, etc.
  - Does not include lines containing project_name, role, start_date, end_date
  - Include as much as possible until the next project experience entry or other section heading
  - STOP before these section headings (do not include them in desc_lines):
    Self-evaluation, Personal Summary, Skills, Technical Skills, Education, Work Experience, Certificates, Languages, Hobbies, Career Objective
  - Use [] if not available

Return JSON only. /no_think
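The project prompts tell the model to stop `desc_lines` before certain section headings. A caller might also enforce that rule after the fact, since a model can overshoot. The sketch below is one possible guard, not something this commit is known to contain; `clamp_desc_lines` and `STOP_HEADINGS` are hypothetical, and matching a heading by exact, case-insensitive line text is a simplifying assumption.

```python
# Headings copied from the English project-experience prompt, lowercased.
STOP_HEADINGS = {
    "self-evaluation", "personal summary", "skills", "technical skills",
    "education", "work experience", "certificates", "languages",
    "hobbies", "career objective",
}


def clamp_desc_lines(lines, desc_lines):
    """Shrink an inclusive [start, end] so it ends before the first stop heading.

    `lines` is the resume as a list of strings indexed from 0 (assumption).
    Returns [] when the pointer is empty or nothing is left after clamping.
    """
    if len(desc_lines) != 2:
        return []
    start, end = desc_lines
    start, end = max(start, 0), min(end, len(lines) - 1)
    for i in range(start, end + 1):
        heading = lines[i].strip().rstrip(":").lower()
        if heading in STOP_HEADINGS:
            end = i - 1  # cut the range just above the heading
            break
    return [start, end] if start <= end else []


sample = [
    "Resume Parser 2023.1-2023.6 Backend Developer",
    "Built a PDF text extraction pipeline",
    "Used Python and OCR",
    "Education",
    "Stanford University",
]
print(clamp_desc_lines(sample, [1, 4]))
```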
3
rag/prompts/resume_system.md
Normal file
@@ -0,0 +1,3 @@
You are a professional resume analysis assistant. Your task is to convert the given resume text into JSON output.
(If both Chinese and English resumes appear, focus only on the Chinese resume)
Return the result strictly in JSON format, without any other text.
3
rag/prompts/resume_system_en.md
Normal file
@@ -0,0 +1,3 @@
You are a professional resume analysis assistant. Your task is to convert the given resume text into JSON output.
(If both Chinese and English resumes appear, focus only on the English resume)
Strictly return results in JSON format without any other text.
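The commit message describes running the extraction tasks as parallel LLM calls that share one system prompt. A minimal sketch of that pattern using a thread pool follows. The task templates are shortened stand-ins, and `fake_llm` is a hypothetical placeholder for whatever chat-completion client the project actually uses; none of these names come from `rag/app/resume.py`.

```python
import json
from concurrent.futures import ThreadPoolExecutor

SYSTEM_PROMPT = "You are a professional resume analysis assistant."

# Shortened stand-ins for the per-task prompt files (assumption).
TASK_PROMPTS = {
    "basic_info": "Extract basic information:\n{indexed_text}",
    "work_exp": "Extract work experience:\n{indexed_text}",
    "education": "Extract education background:\n{indexed_text}",
}


def fake_llm(system, user):
    """Hypothetical stand-in for a chat-completion call; returns JSON text."""
    return json.dumps({"prompt_chars": len(user)})


def extract_all(indexed_text, llm=fake_llm):
    """Run one LLM call per task concurrently and collect the parsed replies."""
    def run(task):
        user = TASK_PROMPTS[task].format(indexed_text=indexed_text)
        return task, json.loads(llm(SYSTEM_PROMPT, user))

    # One worker per task; LLM calls are I/O-bound, so threads are sufficient.
    with ThreadPoolExecutor(max_workers=len(TASK_PROMPTS)) as pool:
        return dict(pool.map(run, TASK_PROMPTS))


results = extract_all("[0]: John Smith")
print(sorted(results))
```

Each task sees the same indexed text, so the line numbers returned by different tasks refer to the same source lines.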
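39
rag/prompts/resume_work_exp.md
Normal file
@@ -0,0 +1,39 @@
Please extract work experience from the following line-indexed resume text.

{indexed_text}

Extract into JSON; each work experience entry contains:
{{
  "workExperience": [
    {{
      "company": "",
      "position": "",
      "internship": 0,
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- company: Full company name (including region information in brackets), e.g. "阿里巴巴(中国)有限公司" (Alibaba (China) Co., Ltd.)
- position: Job title; follow the original text, do not fabricate or guess
- internship: Whether this entry is an internship: 1 if it is, 0 if it is not
- start_date: Start date of employment, in the format %Y.%m or %Y, e.g. "2024.1"
- end_date: End date of employment; use "至今" ("present") if still employed, "" if not available
- desc_lines: [start line number, end line number], the line-number range of the job description (integer array)
  - Refers to the index range of the original-text passage describing the job, including achievements, results, main duties, tech stack, etc.
  - Does not include the lines containing company, position, start_date, end_date
  - Be as complete as possible, continuing until the next work experience entry or another section heading
  - You must stop at the following section headings and must not include them in desc_lines:
    个人评价、自我评价、个人总结、个人优势、自我描述、技能特长、专业技能、教育背景、教育经历、项目经验、项目经历、证书资质、语言能力、兴趣爱好、求职意向 (personal evaluation, self-evaluation, personal summary, personal strengths, self-description, skills and specialties, professional skills, education background, education history, project experience, project history, certificates and qualifications, language ability, hobbies and interests, job objective)
  - Use [] if not available

Example:
[22]: Alibaba 2021.11-2022.11 Senior Engineer
[23]: Job description: Did field sales work and achieved xx in results
[24]: Rated A in the field sales assignment
Then desc_lines should be [23, 24]

Return JSON only. /no_think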
38
rag/prompts/resume_work_exp_en.md
Normal file
@@ -0,0 +1,38 @@
Please extract work experience from the following line-indexed resume text.

{indexed_text}

Extract into JSON, each work experience entry contains:
{{
  "workExperience": [
    {{
      "company": "",
      "position": "",
      "internship": 0,
      "start_date": "",
      "end_date": "",
      "desc_lines": [start_index, end_index]
    }}
  ]
}}

Field descriptions:
- company: Full company name (including region info in brackets), e.g. "Google Inc."
- position: Job title, follow original text, do not fabricate or guess
- internship: Whether this is an internship, 1 for yes, 0 for no
- start_date: Start date, format %Y.%m or %Y, e.g. "2024.1"
- end_date: End date, use "Present" if still employed, "" if not available
- desc_lines: [start_line, end_line], line number range for job description (integer array)
  - Refers to the original text reference range for job description, including achievements, responsibilities, tech stack, etc.
  - Include as much as possible until the next work experience entry or other section heading
  - STOP before these section headings (do not include them in desc_lines):
    Self-evaluation, Personal Summary, Skills, Technical Skills, Education, Project Experience, Certificates, Languages, Hobbies, Career Objective
  - Use [] if not available

Example:
[22]: Google Inc. 2021.11-2022.11 Senior Engineer
[23]: Job description: Responsible for backend development
[24]: Achieved 99.9% uptime for core services
Then desc_lines should be [23, 24]

Return JSON only. /no_think
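The worked example in the work-experience prompt can be followed through to the point where the pointer is turned back into text. The sketch below reuses the three example lines and a reply of the shape the prompt requests. The `REPLY` string is written by hand for illustration, not captured from a model, and `lines_by_index` and `attach_descriptions` are hypothetical helpers.

```python
import json
import re

INDEXED_TEXT = """[22]: Google Inc. 2021.11-2022.11 Senior Engineer
[23]: Job description: Responsible for backend development
[24]: Achieved 99.9% uptime for core services"""

# Hand-written reply in the shape the prompt asks for (illustrative only).
REPLY = """{"workExperience": [{"company": "Google Inc.", "position": "Senior Engineer",
"internship": 0, "start_date": "2021.11", "end_date": "2022.11", "desc_lines": [23, 24]}]}"""

LINE_RE = re.compile(r"^\[(\d+)\]: (.*)$")


def lines_by_index(indexed_text):
    """Build {line number: text} from '[n]: text' lines."""
    table = {}
    for raw in indexed_text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            table[int(match.group(1))] = match.group(2)
    return table


def attach_descriptions(reply, indexed_text):
    """Add a 'description' field holding the text each desc_lines range points to."""
    table = lines_by_index(indexed_text)
    entries = json.loads(reply)["workExperience"]
    for entry in entries:
        span = entry.get("desc_lines") or []
        if len(span) == 2:
            picked = [table[i] for i in range(span[0], span[1] + 1) if i in table]
        else:
            picked = []  # [] means the resume has no description for this job
        entry["description"] = "\n".join(picked)
    return entries


jobs = attach_descriptions(REPLY, INDEXED_TEXT)
print(jobs[0]["description"])
```

Looking lines up by their printed number, rather than by list position, keeps the result correct even when the indexed text passed to the model starts at a line other than zero.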