HomeModelsResearchJoin Us
EN
中文
HomeModelsResearchJoin Us

2024-05-23

Unveiling the Tapestry of Consistency in Large Vision-Language Models

Download PDF
PreviousNext

ABSTRACT

Large vision-language models (LVLMs) have recently achieved rapid progress, exhibiting great perception and reasoning abilities concerning visual information. However, when faced with prompts in different sizes of solution spaces, LVLMs fail to always give consistent answers regarding the same knowledge point. This inconsistency of answers between different solution spaces is prevalent in LVLMs and erodes trust. To this end, we provide a multi-modal benchmark ConBench, to intuitively analyze how LVLMs perform when the solution space of a prompt revolves around a knowledge point. Based on the ConBench tool, we are the first to reveal the tapestry and get the following findings: (1) In the discriminate realm, the larger the solution space of the prompt, the lower the accuracy of the answers. (2) Establish the relationship between the discriminative and generative realms: the accuracy of the discriminative question type exhibits a strong positive correlation with its Consistency with the caption. (3) Compared to open-source models, closed-source models exhibit a pronounced bias advantage in terms of Consistency. Eventually, we ameliorate the consistency of LVLMs by trigger-based diagnostic refinement, indirectly improving the performance of their caption. We hope this paper will accelerate the research community in better evaluating their models and encourage future advancements in the consistency domain.

AUTHORS

Yuan Zhang, Fei Xiao, Tao Huang, Chun-Kai Fan, Hongyuan Dong, Jiawen Li, Jiacong Wang, Kuan Cheng, Shanghang Zhang, Haoyuan Guo

Models
Seed1.8Seed1.5-VLSeedance 1.5 proSeedream 4.5Seed LiveInterpret 2.0Seed Realtime VoiceSeed Music
Teams
LLMInfrastructuresVisionSpeechMultimodal Interaction & World ModelAI for ScienceRoboticsResponsible AI
Learn More
ModelsResearchJoin UsTop SeedSeed Edge
Models
Seed1.8
Seed1.5-VL
Seedance 1.5 pro
Seedream 4.5
Seed LiveInterpret 2.0
Seed Realtime Voice
Seed Music
Teams
LLM
Infrastructures
Vision
Speech
Multimodal Interaction & World Model
AI for Science
Robotics
Responsible AI
Learn More
Models
Research
Join Us
Top Seed
Seed Edge
Advancing the frontier of intelligence, in service of humanity
Join ByteDance Seed
Copyright © 2026 Bytedance Seed
User AgreementPrivacy Policy
Contact us : seed.feedback@bytedance.com
Join ByteDance Seed
Copyright © 2026 Bytedance Seed
User AgreementPrivacy Policy