Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255)

Core optimizations (see arXiv:2510.09722):

1. PDF text fusion: Metadata + OCR dual-path extraction and fusion

2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical
sorting + line number indexing

3. Parallel task decomposition: basic information, work experience, and
educational background are extracted by three parallel LLM calls

4. Index pointer mechanism: LLM returns a range of line numbers instead
of generating the full text, reducing hallucination in long-form text.
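
A minimal sketch of items 3 and 4, not the code from this commit. The `[n]: ` prefix matches the examples in the prompt files; the 0-based numbering, the task names, and the `call_llm(task, text)` callable are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def index_lines(text: str) -> str:
    """Prefix every non-empty line with "[n]: " so the LLM can cite it by number.

    Numbering starts at 0 here; that starting point is an assumption.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"[{i}]: {ln}" for i, ln in enumerate(lines))


def extract_all(indexed_text: str, call_llm) -> dict:
    """Run the three extraction tasks concurrently.

    call_llm(task, text) is a hypothetical callable that sends the prompt
    for one task and returns the parsed result.
    """
    tasks = ("basic_info", "work_experience", "education")
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {t: pool.submit(call_llm, t, indexed_text) for t in tasks}
        return {t: f.result() for t, f in futures.items()}
```

Because each task only reads the shared indexed text, the three calls are independent and a thread pool is enough for I/O-bound LLM requests.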

---------

Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Yao Wei
2026-03-02 19:05:50 +08:00
committed by GitHub
parent 7d6f20585f
commit f8c91e8854
11 changed files with 2810 additions and 144 deletions

File diff suppressed because it is too large


@@ -0,0 +1,39 @@
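Please extract basic information from the following line-indexed resume text.
{indexed_text}
Extract the following information into JSON. If a field does not exist, output "" or 0:
{{
"name_kwd": "",
"gender_kwd": "",
"age_int": 0,
"phone_kwd": "",
"email_tks": "",
"birth_dt": "",
"work_exp_flt": 0,
"current_location": "",
"expect_city_names_tks": [],
"expect_position_name_tks": [],
"skill_tks": [],
"language_tks": [],
"certificate_tks": [],
"self_evaluation_tks": ""
}}
Field descriptions:
- name_kwd: Full name, e.g. "张三"
- gender_kwd: 男/女 (male/female); leave empty if not present
- age_int: Current age, integer
- phone_kwd: Phone/mobile number; keep the form used in the original text, including country code, area code, and brackets
- email_tks: Email address, e.g. "xxx@qq.com"
- birth_dt: Year and month of birth, e.g. "1996-11"
- work_exp_flt: Years of work experience, float
- current_location: Current residence/current city; do not infer it from work experience, it must be explicitly stated
- expect_city_names_tks: List of preferred work cities; the resume must state explicitly that these are preferred cities
- expect_position_name_tks: List of desired positions
- skill_tks: List of skills/tech stack
- language_tks: List of language proficiencies
- certificate_tks: List of certificates/qualifications
- self_evaluation_tks: Self-evaluation/personal strengths/personal summary; extract the full original text
Return JSON only. /no_think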


@@ -0,0 +1,39 @@
Please extract basic information from the following line-indexed resume text.
{indexed_text}
Extract the following information into JSON. If a field does not exist, output "" or 0:
{{
"name_kwd": "",
"gender_kwd": "",
"age_int": 0,
"phone_kwd": "",
"email_tks": "",
"birth_dt": "",
"work_exp_flt": 0,
"current_location": "",
"expect_city_names_tks": [],
"expect_position_name_tks": [],
"skill_tks": [],
"language_tks": [],
"certificate_tks": [],
"self_evaluation_tks": ""
}}
Field descriptions:
- name_kwd: Full name, e.g. "John Smith"
- gender_kwd: Male/Female, leave empty if not present
- age_int: Current age, integer
- phone_kwd: Phone number, keep original format including country code and brackets
- email_tks: Email address, e.g. "xxx@gmail.com"
- birth_dt: Date of birth, e.g. "1996-11"
- work_exp_flt: Years of work experience, float
- current_location: Current city/location, do not infer from work experience, must be explicitly stated
- expect_city_names_tks: List of preferred work cities, must be explicitly stated in the resume
- expect_position_name_tks: List of desired positions
- skill_tks: List of skills/tech stack
- language_tks: List of language proficiencies
- certificate_tks: List of certificates/qualifications
- self_evaluation_tks: Self-evaluation/personal strengths/summary, extract full original text
Return JSON only. /no_think
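
The doubled braces in this template suggest it is filled with Python's `str.format`, which replaces `{indexed_text}` and collapses `{{`/`}}` into literal braces. The sketch below illustrates that under this assumption; `TEMPLATE` is an abbreviated stand-in for the file above, and `build_prompt`, `parse_reply`, and `DEFAULTS` are hypothetical names, not functions from this commit.

```python
import json

# Abbreviated stand-in for the prompt file above; the real template has more fields.
TEMPLATE = (
    "Please extract basic information from the following line-indexed resume text.\n"
    "{indexed_text}\n"
    '{{\n"name_kwd": "",\n"age_int": 0,\n"skill_tks": []\n}}\n'
    "Return JSON only. /no_think"
)

# Fallback values mirroring the "output \"\" or 0" rule in the prompt.
DEFAULTS = {"name_kwd": "", "age_int": 0, "skill_tks": []}


def build_prompt(indexed_text: str) -> str:
    # str.format fills {indexed_text} and turns "{{" / "}}" into "{" / "}".
    return TEMPLATE.format(indexed_text=indexed_text)


def parse_reply(reply: str) -> dict:
    """Parse the outermost JSON object in the reply and back-fill missing fields."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        return dict(DEFAULTS)
    data = json.loads(reply[start:end + 1])
    known = {k: v for k, v in data.items() if k in DEFAULTS}
    return {**DEFAULTS, **known}
```

Slicing from the first `{` to the last `}` tolerates a model that adds a short preamble despite the "Return JSON only" instruction; malformed JSON inside that span still raises `json.JSONDecodeError`.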


@@ -0,0 +1,31 @@
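Please extract education background from the following line-indexed resume text.
{indexed_text}
Extract into JSON:
{{
"education": [
{{
"school": "",
"major": "",
"degree": "",
"department": "",
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- school: Full school name, e.g. "厦门大学"; either Chinese or English is acceptable
- major: Major, e.g. "机械工程"
- degree: Degree, one of 本科/硕士/博士/专科/高中/初中 (bachelor's/master's/doctorate/associate/high school/middle school); use "" if not present
- department: Department/college, e.g. "信息工程系"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date; use "至今" (present) if ongoing, "" if not present
- desc_lines: [start line number, end line number], the line number range of the education description (optional)
- Includes course grades, research focus, GPA, honors and awards, etc.
- Use [] if not present
Return JSON only. /no_think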


@@ -0,0 +1,31 @@
Please extract education background from the following line-indexed resume text.
{indexed_text}
Extract into JSON:
{{
"education": [
{{
"school": "",
"major": "",
"degree": "",
"department": "",
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- school: Full school name, e.g. "Stanford University", both Chinese and English are acceptable
- major: Major/field of study, e.g. "Computer Science"
- degree: Degree level - Bachelor/Master/PhD/Associate/High School/Middle School, leave "" if not available
- department: Department/College, e.g. "School of Engineering"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date, use "Present" if still enrolled, "" if not available
- desc_lines: [start_line, end_line], line number range for education description (optional)
- Includes coursework, research focus, GPA, honors/awards, etc.
- Use [] if not available
Return JSON only. /no_think
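
The `start_date` and `end_date` rules above (`%Y.%m` or `%Y`, with "Present" or "" as special values) can be normalized after extraction. This is a sketch only; `parse_resume_date` is a hypothetical helper, and mapping a bare year to January 1 is an assumption.

```python
from datetime import date, datetime
from typing import Optional


def parse_resume_date(value: str) -> Optional[date]:
    """Parse "%Y.%m" or "%Y"; return None for "", "Present", or unparseable input."""
    value = value.strip()
    if not value or value.lower() == "present":
        return None
    # Try the more specific format first, then fall back to year only.
    for fmt in ("%Y.%m", "%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` rather than raising keeps one badly formatted date from discarding an otherwise usable education entry.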


@@ -0,0 +1,31 @@
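Please extract project experience from the following line-indexed resume text.
{indexed_text}
Extract into JSON; each project experience entry contains:
{{
"projectExperience": [
{{
"project_name": "",
"role": "",
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- project_name: Project name
- role: Role/responsibility, e.g. "项目负责人", "后端开发"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date; use "至今" (present) if ongoing, "" if not present
- desc_lines: [start line number, end line number], the line number range of the project description (integer array)
- Refers to the index range of the passage quoted from the original text that describes the project, including project content, tech stack, results, etc.
- Does not include the lines containing project_name, role, start_date, end_date
- Make the range as complete as possible, up to the next project experience entry or another section heading
- Stop at the following section headings and do not include them in desc_lines:
个人评价、自我评价、个人总结、个人优势、自我描述、技能特长、专业技能、教育背景、教育经历、工作经历、工作经验、证书资质、语言能力、兴趣爱好、求职意向
- Use [] if not present
Return JSON only. /no_think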


@@ -0,0 +1,31 @@
Please extract project experience from the following line-indexed resume text.
{indexed_text}
Extract into JSON, each project experience entry contains:
{{
"projectExperience": [
{{
"project_name": "",
"role": "",
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- project_name: Project name
- role: Role/responsibility, e.g. "Project Lead", "Backend Developer"
- start_date: Start date, format %Y.%m or %Y
- end_date: End date, use "Present" if ongoing, "" if not available
- desc_lines: [start_line, end_line], line number range for project description (integer array)
- Refers to the original text reference range for project description, including project content, tech stack, achievements, etc.
- Does not include lines containing project_name, role, start_date, end_date
- Include as much as possible until the next project experience entry or other section heading
- STOP before these section headings (do not include them in desc_lines):
Self-evaluation, Personal Summary, Skills, Technical Skills, Education, Work Experience, Certificates, Languages, Hobbies, Career Objective
- Use [] if not available
Return JSON only. /no_think
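
The STOP rule above relies on the model to respect section boundaries. A post-check can enforce it as well; the sketch below is an assumption about how that could be done, not code from this commit. `clamp_desc_lines` and `STOP_HEADINGS` are hypothetical names, and `lines` is assumed to be a list where position `n` holds the text of line number `n`.

```python
# Lower-cased copies of the headings listed in the prompt above.
STOP_HEADINGS = {
    "self-evaluation", "personal summary", "skills", "technical skills",
    "education", "work experience", "certificates", "languages",
    "hobbies", "career objective",
}


def clamp_desc_lines(lines: list[str], desc_lines: list[int]) -> list[int]:
    """Shrink [start, end] so it ends before the first stop heading.

    Returns [] when the input is not a two-element range or nothing is left.
    """
    if len(desc_lines) != 2:
        return []
    start, end = desc_lines
    # Keep the range inside the document.
    start, end = max(start, 0), min(end, len(lines) - 1)
    for i in range(start, end + 1):
        if lines[i].strip().rstrip(":").lower() in STOP_HEADINGS:
            end = i - 1
            break
    return [start, end] if start <= end else []
```

For example, with the lines `["Project A", "Built an API gateway", "Cut latency by 30%", "Education", "Stanford University"]`, a returned range of `[1, 4]` is cut back to `[1, 2]`.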


@@ -0,0 +1,3 @@
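You are a professional resume analysis assistant. Your task is to convert the given resume text into JSON output.
(If both Chinese and English resumes appear, focus only on the Chinese resume)
Strictly return results in JSON format without any other text.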


@@ -0,0 +1,3 @@
You are a professional resume analysis assistant. Your task is to convert the given resume text into JSON output.
(If both Chinese and English resumes appear, focus only on the English resume)
Strictly return results in JSON format without any other text.


@@ -0,0 +1,39 @@
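Please extract work experience from the following line-indexed resume text.
{indexed_text}
Extract into JSON; each work experience entry contains:
{{
"workExperience": [
{{
"company": "",
"position": "",
"internship": 0,
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- company: Full company name (including region info in brackets), e.g. "阿里巴巴(中国)有限公司"
- position: Job title; follow the original text, do not fabricate or guess
- internship: Whether this entry is an internship; 1 if it is, 0 if it is not
- start_date: Start date of employment, format %Y.%m or %Y, e.g. "2024.1"
- end_date: End date of employment; use "至今" (present) if ongoing, "" if not present
- desc_lines: [start line number, end line number], the line number range of the job description (integer array)
- Refers to the index range of the passage quoted from the original text that describes the job, including results, performance, main duties, tech stack, etc.
- Does not include the lines containing company, position, start_date, end_date
- Make the range as complete as possible, up to the next work experience entry or another section heading
- Stop at the following section headings and do not include them in desc_lines:
个人评价、自我评价、个人总结、个人优势、自我描述、技能特长、专业技能、教育背景、教育经历、项目经验、项目经历、证书资质、语言能力、兴趣爱好、求职意向
- Use [] if not present
Example:
[22]: Alibaba 2021.11-2022.11 Senior Engineer
[23]: Job description: Did field sales work and achieved xx in sales
[24]: Rated A in the field sales assessment
Then desc_lines should be [23, 24]
Return JSON only. /no_think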


@@ -0,0 +1,38 @@
Please extract work experience from the following line-indexed resume text.
{indexed_text}
Extract into JSON, each work experience entry contains:
{{
"workExperience": [
{{
"company": "",
"position": "",
"internship": 0,
"start_date": "",
"end_date": "",
"desc_lines": [start_index, end_index]
}}
]
}}
Field descriptions:
- company: Full company name (including region info in brackets), e.g. "Google Inc."
- position: Job title, follow original text, do not fabricate or guess
- internship: Whether this is an internship, 1 for yes, 0 for no
- start_date: Start date, format %Y.%m or %Y, e.g. "2024.1"
- end_date: End date, use "Present" if still employed, "" if not available
- desc_lines: [start_line, end_line], line number range for job description (integer array)
- Refers to the original text reference range for job description, including achievements, responsibilities, tech stack, etc.
- Include as much as possible until the next work experience entry or other section heading
- STOP before these section headings (do not include them in desc_lines):
Self-evaluation, Personal Summary, Skills, Technical Skills, Education, Project Experience, Certificates, Languages, Hobbies, Career Objective
- Use [] if not available
Example:
[22]: Google Inc. 2021.11-2022.11 Senior Engineer
[23]: Job description: Responsible for backend development
[24]: Achieved 99.9% uptime for core services
Then desc_lines should be [23, 24]
Return JSON only. /no_think
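
Once the model returns `desc_lines`, the description text has to be recovered from the indexed input. The sketch below shows one way to do that for the `[n]: text` format used in the example above; `parse_indexed` and `resolve_desc` are hypothetical helpers, not functions from this commit.

```python
import re

# Matches the "[n]: text" line format shown in the prompt example.
INDEX_RE = re.compile(r"^\[(\d+)\]:\s?(.*)$")


def parse_indexed(indexed_text: str) -> dict[int, str]:
    """Map each line number to its text, skipping lines without an "[n]: " prefix."""
    by_number = {}
    for raw in indexed_text.splitlines():
        match = INDEX_RE.match(raw.strip())
        if match:
            by_number[int(match.group(1))] = match.group(2)
    return by_number


def resolve_desc(indexed_text: str, desc_lines: list[int]) -> str:
    """Join the original lines in the inclusive range [start, end]; "" for []."""
    if len(desc_lines) != 2:
        return ""
    start, end = desc_lines
    by_number = parse_indexed(indexed_text)
    return "\n".join(by_number[n] for n in range(start, end + 1) if n in by_number)
```

Applied to the three example lines with `desc_lines = [23, 24]`, `resolve_desc` returns the text of lines 23 and 24 copied from the source, so the description is never regenerated by the model.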