
2025-05-23

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling


Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to that of double-sized baselines at no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insights for tokenizer design, paving the way for more efficient and powerful LLMs.
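The page does not reproduce the paper's architecture details, but the central idea of decoupling input and output vocabularies can be sketched as follows. The snippet below is a minimal, hypothetical illustration in PyTorch: the class name `DecoupledInputEmbedding`, the use of hashed 2-gram ids, and the table sizes are assumptions made for illustration, not the paper's actual implementation. Only the input embedding is enlarged with multi-gram tokens; the output head would continue to predict over the base vocabulary.

```python
import torch
import torch.nn as nn


class DecoupledInputEmbedding(nn.Module):
    """Illustrative input embedding with an enlarged (multi-gram) vocabulary.

    The output softmax elsewhere in the model still operates over the base
    1-gram vocabulary; only the input side is "over-tokenized". The hashing
    of 2-gram ids into a fixed-size table is an assumption used here to keep
    the example self-contained.
    """

    def __init__(self, base_vocab_size: int, ngram_table_size: int, d_model: int):
        super().__init__()
        # Standard 1-gram embedding over the base vocabulary.
        self.unigram = nn.Embedding(base_vocab_size, d_model)
        # Large table for hashed 2-gram ids; its size is chosen independently
        # of the output vocabulary -- this is the decoupling.
        self.bigram = nn.Embedding(ngram_table_size, d_model)
        self.base_vocab_size = base_vocab_size
        self.ngram_table_size = ngram_table_size

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) ids produced by the base tokenizer.
        uni = self.unigram(token_ids)

        # Form a 2-gram id from each token and its predecessor, then hash it
        # into the bigram table so the table size stays bounded.
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = 0  # treat the first position as preceded by a pad id
        bigram_ids = (prev * self.base_vocab_size + token_ids) % self.ngram_table_size
        bi = self.bigram(bigram_ids)

        # Sum the n-gram embeddings; the rest of the Transformer is unchanged.
        return uni + bi


if __name__ == "__main__":
    emb = DecoupledInputEmbedding(base_vocab_size=32_000,
                                  ngram_table_size=1_000_000,
                                  d_model=512)
    x = torch.randint(0, 32_000, (2, 16))
    print(emb(x).shape)  # torch.Size([2, 16, 512])
```

In this sketch, growing `ngram_table_size` increases only embedding-lookup parameters, which is why a larger input vocabulary can be scaled with little additional compute per token.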

Authors

Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

Journal / Conference

ICML 2025
