XVERSE-V-13B

Update Information

[2024/04/28] Released the XVERSE-V-13B large multimodal model.

Model Introduction

XVERSE-V-13B is a large multimodal model (LMM) for image-text question answering, developed independently by Shenzhen Yuanxiang Technology. Its key features are as follows:

Model Structure: The visual encoder is openai/clip-vit-large-patch14-224, the text model is the in-house XVERSE-13B-Chat model, and the image-text bridging layer is a simple, efficient two-layer MLP.
Training Data: All image-text data comes from fully public datasets. Pre-training uses 2.1B image-text pairs, and fine-tuning uses 8.2M instruction samples. Because the training data is almost entirely English, the model's capabilities are strongest in English.
Image Resolution: Unlike models with a fixed input resolution, XVERSE-V-13B splits each image into multiple 224×224 blocks and sends each block to the visual module for encoding. It can therefore handle higher-resolution images and images with different aspect ratios without cropping, preserving as much detail as possible.
Training Schedule: XVERSE-V-13B is trained in two stages: large-scale pre-training on image-text pairs, followed by fine-tuning on a smaller instruction dataset. During pre-training, we freeze ❄️ the visual and LLM modules and train 🔥 only the bridging layer. During instruction fine-tuning, we still freeze ❄️ the visual and LLM modules, but fine-tune 🔥 the bridging layer together with LoRA parameters on all linear layers of the LLM. In this stage, the bridging layer and the LoRA parameters use different learning rates.
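The freeze/train split described in the training schedule can be sketched as a small lookup that decides, per parameter and per stage, whether it is trained and at which learning rate. This is only an illustration: the parameter-name prefixes (`vision.`, `bridge.`, `llm.`), the `.lora_` marker, and both learning-rate values are hypothetical placeholders, since the source does not give the actual names or rates.

```python
from typing import Optional

# Hypothetical placeholder learning rates; the source does not state the real values.
BRIDGE_LR = 1e-3
LORA_LR = 1e-4


def learning_rate_for(param_name: str, stage: str,
                      bridge_lr: float = BRIDGE_LR,
                      lora_lr: float = LORA_LR) -> Optional[float]:
    """Return the learning rate for a parameter, or None if it stays frozen.

    Assumed (hypothetical) naming: 'vision.*' is the visual encoder,
    'bridge.*' is the two-layer MLP, 'llm.*' is the language model, and
    LoRA adapter weights contain '.lora_' in their name.
    """
    if stage not in ("pretrain", "finetune"):
        raise ValueError(f"unknown stage: {stage!r}")

    # The bridging layer is trained in both stages.
    if param_name.startswith("bridge."):
        return bridge_lr

    # LoRA adapters on the LLM are trained only during instruction fine-tuning,
    # with their own learning rate (the "differential" rate).
    if stage == "finetune" and param_name.startswith("llm.") and ".lora_" in param_name:
        return lora_lr

    # Visual encoder weights and base LLM weights stay frozen in both stages.
    return None
```

In a real training loop, the parameters returned with a non-`None` rate would typically be collected into separate optimizer parameter groups, one per learning rate.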

Image Encoding Example

For a 448×448 image, a sliding window splits it into 4 local image blocks, and a resized copy of the whole image provides global information, as shown in the figure below.

For a higher-resolution 448×672 image, a sliding window splits it into 6 local image blocks, and a resized copy of the whole image provides global information, as shown in the figure below.

^{1: Concate* denotes row-wise concatenation of column vectors.}

^{2: Images with other resolutions and aspect ratios are split into blocks and encoded in the same way.}
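The block counts in the two examples above can be reproduced with a short sketch. It assumes a non-overlapping sliding window whose stride equals the 224-pixel block size, and it clips edge blocks when a dimension is not a multiple of 224; the source does not say how such sizes are actually handled, so that part is an assumption. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def local_tile_boxes(width: int, height: int, tile: int = 224) -> List[Box]:
    """List the local blocks produced by a non-overlapping sliding window.

    Assumption: stride == tile size. Edge blocks are clipped to the image
    bounds when a dimension is not a multiple of `tile`.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")

    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes


def encoder_input_count(width: int, height: int, tile: int = 224) -> int:
    """Local blocks plus one resized global view of the whole image."""
    return len(local_tile_boxes(width, height, tile)) + 1


print(len(local_tile_boxes(448, 448)))  # 4 local blocks
print(len(local_tile_boxes(448, 672)))  # 6 local blocks
print(encoder_input_count(448, 672))    # 7 = 6 local blocks + 1 global view
```

Each box, plus the resized global image, would then be passed through the visual encoder separately before the resulting features are concatenated.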

Evaluation Reports

To assess the model comprehensively, we tested it on a series of standard datasets, including MMBench, MMMU, SEEDBench_IMG, MMStar, LLaVABench, AI2D, ScienceQA, VizWiz, TextVQA, OKVQA, and GQA, among others. These evaluations cover OCR, logical reasoning, relational reasoning, coarse-grained perception, and fine-grained perception. The results are as follows:

OpenCompass Leaderboard

OpenCompass is a one-stop platform for evaluating large models. It is open source and reproducible, providing a fair, open, and reproducible evaluation scheme. We therefore report our model's results on this leaderboard.

Datasets	XVERSE-V-13B	GeminiProVision`*`	Qwen-VL-Plus`*`	Claude-3V Sonnet`*`	LLaVA-Next-Vicuna-13B	Monkey-Chat	OmniLMM-12B	DeepSeek-VL-7B	CogVLM-17B-Chat	TransCore-M	Yi-VL-34B
MMBench	75.6	73.6	67.0	67.8	70.0	72.4	71.7	73.8	65.8	82.3	72.4
MMBench-CN	74.7	74.3	70.7	64.2	68.5	67.5	62.0	71.4	55.9	80.7	70.7
MMStar	47.8	38.6	39.7	44.2	40.4	40.7	39.6	40.5	39.9	35.6	40.5
MMMU-Val	43.3	48.9	39.8	47.4	37.3	40.7	41.8	38.3	37.3	41.0	45.1
MathVistaMini-Test	44.1	46.5	37.6	45.0	34.1	35.9	34.7	36.9	35.0	32.3	31.5
HallusionBench	31.8	45.2	40.6	41.3	31.8	39.3	35.8	34.5	35.4	27.3	35.3
AI2D-Test	70.4	70.2	65.7	69.9	72.2	68.5	63.3	65.3	63.3	64.1	65.9
OCRBench	489	680	726	646	537	534	420	435	590	405	290
SEEDBench_IMG	72.4	70.7	65.7	65.0	71.4	68.9	71.5	70.1	68.8	72.0	68.1
LLaVABench	82.3	79.9	73.7	73.2	73.9	60.5	75.8	77.8	73.9	66.8	62.3

^{1: Models marked with an asterisk * are closed-source.}

For all compared models, we prioritize officially published results. Where official results are unavailable, we use the results reported on the OpenCompass leaderboard. If the OpenCompass leaderboard also lacks results for a dataset, we report numbers from our own evaluation runs, which use the VLMEvalKit evaluation framework.
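The source-priority rule above amounts to a simple fallback chain. A minimal sketch, with a hypothetical function name and `None` standing for "no result available":

```python
from typing import Optional, Tuple


def reported_score(official: Optional[float],
                   opencompass: Optional[float],
                   own_run: Optional[float]) -> Optional[Tuple[str, float]]:
    """Pick the score to report, following the stated priority order:
    official result, then OpenCompass leaderboard, then our own evaluation run.

    Returns (source_label, score), or None if no source has a result.
    """
    candidates = (
        ("official", official),
        ("opencompass", opencompass),
        ("own_run", own_run),
    )
    for label, score in candidates:
        if score is not None:
            return label, score
    return None
```

Returning the source label alongside the score makes it possible to record where each table cell came from.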

Traditional VQA Tasks

Traditional visual question answering (VQA) tasks are frequently cited in academic papers on multimodal visual question answering and therefore have significant reference value. We also report results on these datasets.

Datasets	XVERSE-V-13B	LLaVA-Next-Vicuna-13B	Monkey-Chat	OmniLMM-12B	DeepSeek-VL-7B	CogVLM-17B-Chat	TransCore-M	Yi-VL-34B
ScienceQA	86.4	73.9	82.8	80.8	81.0	70.3	74.9	75.4
OKVQA	59.2	60.0	54.7	40.8	55.1	54.4	56.7	51.4
GQA	62.2	65.5	65.4	61.1	61.8	60.5	63.6	58.3
VizWiz	81.9	54.6	75.6	64.0	50.1	44.0	41.4	70.8
TextVQA	74.2	64.3	53.7	62.4	63.8	69.6	63.1	54.0

As above, we prioritize officially published results for all compared models. Where official results are unavailable, we report numbers from our own evaluation runs, which use the VLMEvalKit evaluation framework.

Demo

Here we present examples of capabilities such as panoramic and detail recognition, chart analysis, encyclopedic Q&A, travel assistance, educational Q&A, content creation, and code generation.

Limitations and Disclaimer

Like all other large multimodal models (LMMs), XVERSE-V-13B may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Please use model-generated content with caution and do not disseminate harmful content. Before deploying any application of XVERSE-V-13B, developers should conduct safety testing and tuning of the model for their specific application.

We strongly warn against using the XVERSE-V-13B model to produce or spread harmful information, or to conduct any activity that might harm public, national, or social security, or violate regulations. We assume no responsibility for any problems arising from the use of the XVERSE-V-13B model, whether data security issues, public opinion risks, or any risks and issues caused by misunderstanding, misuse, dissemination, or non-compliant use of the model.

Open Source License

Use of the source code in this repository is subject to the Apache-2.0 open-source license, while use of the XVERSE-V-13B model weights is subject to the Model License Agreement.

The XVERSE-V-13B model weights are fully open to academic research and support free commercial use. To apply for a commercial license, please fill in the application form. For other questions or collaborations, please contact opensource@xverse.cn.