SmolDocling-256M: An Ultra-Small Vision-Language Model | Another Take on End-to-End Document Parsing


Background

Traditional all-in-one document parsing tools, which combine layout analysis, OCR, table recognition, and more, typically stitch together several independent models and invoke different ones depending on the task. This complicates the pipeline and makes it hard to generalize across document types. Large vision-language models (LVLMs) offer an end-to-end alternative, but at a high computational cost: in the Qwen2.5-VL series, for example, only models of 7B parameters or larger perform well, which is too heavy a burden for a relatively lightweight task like document parsing.

SmolDocling

SmolDocling is an ultra-small VLM developed jointly by IBM Research and Hugging Face. Using far less compute than large models (only a fraction of a conventional language model's parameter count), it delivers comparable performance and supports OCR, layout analysis and localization, code recognition, formula recognition, chart recognition, table recognition, image classification, caption matching, list grouping, full-page conversion, and more.
The model is trained from Hugging Face's SmolVLM-256M base: a vision encoder processes the document image, and a lightweight language model understands and generates structured text. It also introduces the DocTags markup format, which makes recognizing and processing document structure more efficient.

The DocTags Markup Format


DocTags is the format of the model's output: an XML-like language that defines a clear, structured system of tags and rules for separating a document's text from its structure. By reducing ambiguity, it makes the model's conversion task easier and more accurate, and compared with converting directly to HTML or Markdown it tends to preserve more useful detail.
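To make this concrete, here is a minimal sketch of reading DocTags with plain Python. The tag names follow SmolDocling's actual output (each `<text>` element is preceded by four `<loc_*>` tokens encoding its bounding box on a normalized page grid), but `extract_text_boxes` is a hypothetical helper for illustration only; in practice docling-core's `DocTagsDocument` does the real parsing.

```python
import re

def extract_text_boxes(doctags: str) -> list[tuple[int, int, int, int]]:
    """Illustrative helper (not docling-core API): pull the four <loc_*>
    coordinates attached to each <text> element out of a DocTags string."""
    boxes = []
    for match in re.finditer(r"<text>((?:<loc_\d+>){4})", doctags):
        coords = [int(n) for n in re.findall(r"\d+", match.group(1))]
        boxes.append(tuple(coords))
    return boxes

sample = (
    "<doctag>"
    "<text><loc_10><loc_26><loc_471><loc_58>First paragraph ...</text>"
    "<text><loc_10><loc_66><loc_471><loc_82>Second paragraph ...</text>"
    "</doctag>"
)
print(extract_text_boxes(sample))  # [(10, 26, 471, 58), (10, 66, 471, 82)]
```

Because layout (the loc tokens) and content (the element text) live in separate, well-delimited tags, a downstream consumer can recover either one without heuristics.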

Usage

1. Online demo

You can try the model online at https://www.smoldocling.net/

2. Local deployment

Import the required libraries:

from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse

import requests
from PIL import Image
from docling_core.types.doc import ImageRefMode
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config, stream_generate

Initialize the model:

model_path = "ds4sd/SmolDocling-256M-preview"
model, processor = load(model_path)
config = load_config(model_path)

Prepare the data:

# Set the prompt
prompt = "Convert this page to docling."

# Set the image URL (or a local file path)
image = "https://ibm.biz/docling-page-with-list"
if urlparse(image).scheme != "":  # it is a URL
    response = requests.get(image, stream=True, timeout=10)
    response.raise_for_status()
    pil_image = Image.open(BytesIO(response.content))
else:
    pil_image = Image.open(image)

formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

Run the conversion:

print("DocTags: \n\n")
output = ""
for token in stream_generate(
    model, processor, formatted_prompt, [image],
    max_tokens=4096, verbose=False
):
    output += token.text
    print(token.text, end="")
    if "</doctag>" in token.text:
        break
print("\n\n")

# Pair the generated DocTags with the source image
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([output], [pil_image])
# Convert to a DoclingDocument
doc = DoclingDocument(name="SampleDocument")
doc.load_from_doctags(doctags_doc)

Export to Markdown:

print("Markdown: \n\n")
print(doc.export_to_markdown())

Hands-On Impressions

I tested the model on a screenshot of an English paper containing a table and several paragraphs, using the prompt "Convert this page to docling."

The resulting DocTags output:

<doctag><text><loc_10><loc_26><loc_471><loc_58>et al., 2021), ChartQA (Masry et al., 2022), and OCRBench_v2 (Fu et al., 
2024b). Furthermore, we also evaluate the visual grounding capability of our model on the referring expression comprehension benchmarks (Kazemzadeh et al., 2014; Mao et al., 2016), object detection in the wild (Li et al., 2022) and a self-curated point grounding benchmark.</text>
<text><loc_10><loc_66><loc_471><loc_82>Video (w/o Audio) → Text We assess our model on several representative video understanding tasks like Video-MME (Fu et al., 2024a), MVBench (Li et al., 2024a), and EgoSchema (Mangalam et al., 2023).</text>
<text><loc_10><loc_91><loc_471><loc_107>Multimodality → Text We demonstrate the ability of our model for mixed-modality (image, audio and text) prompts on OmniBench (Li et al., 2024b).</text>
<section_header_level_1><loc_10><loc_116><loc_166><loc_124>5.1.1 Performance of Text → Text</section_header_level_1>
<text><loc_10><loc_130><loc_471><loc_168>We compare Qwen2.5-Omni with other leading large language model of similar size (7B). As shown in Table 1, the performance of Qwen2.5-Omni generally falls between Qwen2-7B and Qwen2.5-7B. Our model outperforms Qwen2-7B on most benchmarks, such as MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E and LiveCodeBench, which demonstrates the exceptional capabilities of our model for Text → Text .</text>
<otsl><loc_40><loc_190><loc_438><loc_312><ched>Datasets<ched>Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B<lcel><lcel><lcel><lcel><nl><ecel><ecel><ched>General Tasks<lcel><lcel><lcel><nl><rhed>MMLU-Pro<fcel>52.1<fcel>48.3<fcel>44.1<fcel>56.3<fcel>47.0<nl><rhed>MMLU-redux<fcel>72.8<fcel>67.2<fcel>67.3<fcel>75.4<fcel>71.0<nl><rhed>LiveBench0831<fcel>30.6<fcel>26.7<fcel>29.2<fcel>35.9<fcel>29.6<nl><ecel><ched>Mathematics & Science Tasks<lcel><lcel><lcel><lcel><nl><rhed>GPQA<fcel>32.8<fcel>32.4<fcel>34.3<fcel>36.4<fcel>30.8<nl><rhed>MATH<fcel>44.3<fcel>51.9<fcel>52.9<fcel>75.5<fcel>71.5<nl><rhed>GSM8K<fcel>76.7<fcel>84.5<fcel>85.7<fcel>91.6<fcel>88.7<nl><ecel><ched>Coding Tasks<lcel><lcel><lcel><lcel><nl><rhed>HumanEval<fcel>68.9<fcel>72.6<fcel>79.9<fcel>84.8<fcel>78.7<nl><rhed>MBPP<fcel>74.9<fcel>69.6<fcel>67.2<fcel>79.2<fcel>73.2<nl><rhed>MultiPL-E<fcel>53.4<fcel>50.7<fcel>59.1<fcel>70.4<fcel>65.8<nl><rhed>LiveCodeBench2305-2409<fcel>18.9<fcel>8.3<fcel>23.9<fcel>28.7<fcel>24.6<nl><caption><loc_65><loc_174><loc_420><loc_182>Table 1: Text → Text performance of 7B+ pure text models and Qwen2.5-Omni</caption></otsl>
<section_header_level_1><loc_10><loc_329><loc_173><loc_337>5.1.2 Performance of Audio → Text</section_header_level_1>
<text><loc_10><loc_343><loc_471><loc_420>We compare Qwen2.5-Omni with other leading specialist or generalist models on diverse audio understanding, audio reasoning, and voice-chatting benchmarks. As shown in Table 2 and 3, Qwen2.5-Omni delivers better or comparable performance with other state-of-the-art methods on audio understanding. For instance, it achieves superior ASR and S2T performance on Fleurs_zh, CommonVoice_en, CommonVoice_zh, CoVoSt1_en-de and CoVoSt1_zh-en test sets, surpassing previous state-of-the-art models like Whisper-large-v3, Qwen2Audio, MinMo and other Omni models. Qwen2.5-Omni also achieves state-of-the-art performance on general audio understanding tasks like music and VSC. Additionally, Qwen2.5-Omni achieves state-of-the-art results on audio reasoning with superior performance on sound, music and speech subsets of MMU benchmark. These results demonstrate the powerful capabilities of Qwen2.5-Omni in general audio understanding and reasoning.</text>
<text><loc_10><loc_424><loc_471><loc_494>Additionally, on VoiceBench, Qwen2.5-Omni achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size. This showcases our model's strong capabilities in speech interaction. To further explore the performance of diverse speech interaction, we convert text instructions from several pure-text benchmarks into speech and evaluate Qwen2.5-Omni, Qwen2-Audio and Qwen2-7B on the in-house voice-chat benchmark. About 90% of text-instructions are utilized. We use speech instruction for Qwen2.5-Omni and Qwen2-Audio, and text instruction for Qwen2-7B. As shown in Table 4, compared to Qwen2-Audio, Qwen2-5-Omni significantly narrows the gap with Qwen2-7B, which uses text instructions. This reflects our model's substantial progress in diversified end-to-end speech interaction.</text>
</doctag>
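The `<otsl>` block in the output above encodes the table as a flat cell stream: `<nl>` ends a row, while `<fcel>`, `<ched>`, and `<rhed>` introduce data, column-header, and row-header cells. As a rough sketch (ignoring span tokens like `<lcel>` and captions, which docling-core handles properly), the stream can be read back into rows like this; `otsl_to_rows` is a hypothetical helper, not part of any library:

```python
import re

def otsl_to_rows(otsl: str) -> list[list[str]]:
    """Simplified reading of an OTSL cell stream: rows split on <nl>;
    fcel/ched/rhed cells carry text, ecel/lcel are kept as empty strings.
    Illustration only; real parsing (spans, captions) lives in docling-core."""
    rows = []
    for raw_row in otsl.split("<nl>"):
        cells = re.findall(r"<(fcel|ched|rhed|ecel|lcel)>([^<]*)", raw_row)
        if cells:
            rows.append([text.strip() for _, text in cells])
    return rows

sample = "<ched>Datasets<ched>A<ched>B<nl><rhed>MMLU-Pro<fcel>52.1<fcel>48.3<nl>"
print(otsl_to_rows(sample))  # [['Datasets', 'A', 'B'], ['MMLU-Pro', '52.1', '48.3']]
```

This cell-stream design is why the Markdown export below can reconstruct a table even when header cells span multiple columns.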

The exported Markdown:

Markdown: et al., 2021), ChartQA (Masry et al., 2022), and OCRBench\_v2 (Fu et al., 2024b). Furthermore, we also evaluate the visual grounding capability of our model on the referring expression comprehension benchmarks (Kazemzadeh et al., 2014; Mao et al., 2016), object detection in the wild (Li et al., 2022) and a self-curated point grounding benchmark.Video (w/o Audio) → Text We assess our model on several representative video understanding tasks like Video-MME (Fu et al., 2024a), MVBench (Li et al., 2024a), and EgoSchema (Mangalam et al., 2023).Multimodality → Text We demonstrate the ability of our model for mixed-modality (image, audio and text) prompts on OmniBench (Li et al., 2024b).## 5.1.1 Performance of Text → TextWe compare Qwen2.5-Omni with other leading large language model of similar size (7B). As shown in Table 1, the performance of Qwen2.5-Omni generally falls between Qwen2-7B and Qwen2.5-7B. Our model outperforms Qwen2-7B on most benchmarks, such as MMLU-Pro, MMLU-redux, MATH, GSM8K, MBPP, MultiPL-E and LiveCodeBench, which demonstrates the exceptional capabilities of our model for Text → Text .Table 1: Text → Text performance of 7B+ pure text models and Qwen2.5-Omni| Datasets               | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   | Gemmaa2-9  Blama3-1-8B  Qwen2-7B  Qwen2.5-7B  Qwen2.5-Omni-7B   |
|------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
|                        |                                                                 | General Tasks                                                   | General Tasks                                                   | General Tasks                                                   | General Tasks                                                   |
| MMLU-Pro               | 52.1                                                            | 48.3                                                            | 44.1                                                            | 56.3                                                            | 47.0                                                            |
| MMLU-redux             | 72.8                                                            | 67.2                                                            | 67.3                                                            | 75.4                                                            | 71.0                                                            |
| LiveBench0831          | 30.6                                                            | 26.7                                                            | 29.2                                                            | 35.9                                                            | 29.6                                                            |
|                        | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     | Mathematics & Science Tasks                                     |
| GPQA                   | 32.8                                                            | 32.4                                                            | 34.3                                                            | 36.4                                                            | 30.8                                                            |
| MATH                   | 44.3                                                            | 51.9                                                            | 52.9                                                            | 75.5                                                            | 71.5                                                            |
| GSM8K                  | 76.7                                                            | 84.5                                                            | 85.7                                                            | 91.6                                                            | 88.7                                                            |
|                        | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    | Coding Tasks                                                    |
| HumanEval              | 68.9                                                            | 72.6                                                            | 79.9                                                            | 84.8                                                            | 78.7                                                            |
| MBPP                   | 74.9                                                            | 69.6                                                            | 67.2                                                            | 79.2                                                            | 73.2                                                            |
| MultiPL-E              | 53.4                                                            | 50.7                                                            | 59.1                                                            | 70.4                                                            | 65.8                                                            |
| LiveCodeBench2305-2409 | 18.9                                                            | 8.3                                                             | 23.9                                                            | 28.7                                                            | 24.6                                                            |
|                        |                                                                 |                                                                 |                                                                 |                                                                 |                                                                 |## 5.1.2 Performance of Audio → TextWe compare Qwen2.5-Omni with other leading specialist or generalist models on diverse audio understanding, audio reasoning, and voice-chatting benchmarks. As shown in Table 2 and 3, Qwen2.5-Omni delivers better or comparable performance with other state-of-the-art methods on audio understanding. For instance, it achieves superior ASR and S2T performance on Fleurs\_zh, CommonVoice\_en, CommonVoice\_zh, CoVoSt1\_en-de and CoVoSt1\_zh-en test sets, surpassing previous state-of-the-art models like Whisper-large-v3, Qwen2Audio, MinMo and other Omni models. Qwen2.5-Omni also achieves state-of-the-art performance on general audio understanding tasks like music and VSC. Additionally, Qwen2.5-Omni achieves state-of-the-art results on audio reasoning with superior performance on sound, music and speech subsets of MMU benchmark. These results demonstrate the powerful capabilities of Qwen2.5-Omni in general audio understanding and reasoning.Additionally, on VoiceBench, Qwen2.5-Omni achieves an impressive average score of 74.12, surpassing other audio language models and omni models of similar size. This showcases our model's strong capabilities in speech interaction. To further explore the performance of diverse speech interaction, we convert text instructions from several pure-text benchmarks into speech and evaluate Qwen2.5-Omni, Qwen2-Audio and Qwen2-7B on the in-house voice-chat benchmark. About 90% of text-instructions are utilized. We use speech instruction for Qwen2.5-Omni and Qwen2-Audio, and text instruction for Qwen2-7B. 
As shown in Table 4, compared to Qwen2-Audio, Qwen2-5-Omni significantly narrows the gap with Qwen2-7B, which uses text instructions. This reflects our model's substantial progress in diversified end-to-end speech interaction.

In this first trial, the results are roughly on par with Qwen2.5-VL-7B, and on some inputs arguably slightly better.

For images containing Chinese text, however, recognition is very poor; this appears to come down to the lack of corresponding Chinese training data.

For a small model of roughly 250M parameters to offer an end-to-end document parsing solution with decent accuracy is remarkable. I look forward to more models along these lines.


