Infrastructures
The Seed-Infrastructures team is responsible for distributed training of large models, reinforcement learning frameworks, high-performance inference, compilers for heterogeneous hardware, and related systems work.

Research Directions

Ultra-Large-Scale Distributed Training
Research on ultra-large-scale training clusters: improving training stability and MFU, and enabling cross-cluster, low-precision, fault-tolerant, and elastic training.
Large-scale
Stability

Reinforcement Learning Systems
Research on end-to-end reinforcement learning systems for large models, designing next-generation systems for dynamic workloads, complex agent/environment interactions, heterogeneous resources, and multimodal scenarios.
Reinforcement learning
Agent
Optimization

Parallel Inference Solutions
Research on resolving the compute and memory-access bottlenecks of inference, multi-node inference, and parallel inference schemes and scheduling optimization for heterogeneous hardware.
Inference
Parallel

Joint Optimization of Next-Generation Models and Hardware Systems
Research on more advanced model architectures, training paradigms, and inference paradigms by co-designing next-generation hardware systems with next-generation architectures for generation and understanding models.
Systems-algorithm co-design
Model architecture

Compiler Optimization for Heterogeneous Hardware
Research on compiler optimization of high-performance operators for new hardware architectures, and joint optimization of computation and communication.
Heterogeneous systems
Compiler
Selected Papers

2025.03.20
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate.
Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
Foundation
LLM
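To make the idea in the abstract concrete, here is a minimal sketch of a polynomial-composition activation, assuming a PolyReLU-style form y = a_0 + a_1·ReLU(x) + … + a_r·ReLU(x)^r with learnable coefficients. The class name and initialization are illustrative assumptions; the paper's exact parameterization (including other composition variants) may differ.

```python
# Hedged sketch of a polynomial-composition activation ("PolyReLU"-style).
# Assumption: y = sum_i a_i * ReLU(x)**i with learnable coefficients a_i.
import torch
import torch.nn as nn


class PolyReLU(nn.Module):
    def __init__(self, order: int = 3):
        super().__init__()
        # Learnable coefficients a_0..a_order, initialized so the activation
        # starts out close to a plain ReLU (a_1 = 1, all others 0).
        coeffs = torch.zeros(order + 1)
        coeffs[1] = 1.0
        self.coeffs = nn.Parameter(coeffs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = torch.relu(x)
        # Compose powers of ReLU(x): a_0 + a_1*r + a_2*r^2 + ...
        out = torch.zeros_like(x)
        for i, a in enumerate(self.coeffs):
            out = out + a * r.pow(i)
        return out


if __name__ == "__main__":
    act = PolyReLU(order=3)
    x = torch.randn(4, 8)
    print(act(x).shape)  # torch.Size([4, 8])
```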

2025.03.18
Hyper-Connections
We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.
Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou
Foundation
LLM
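As a rough illustration of the hyper-connection idea in the abstract, the sketch below expands the residual stream into n parallel streams and uses learnable weights to decide how the streams feed a layer and how the layer output is written back. This is a simplified static variant with hypothetical names, not the paper's exact (dynamic) formulation.

```python
# Hedged sketch of a hyper-connection-style block with n parallel streams.
# Assumptions: static learnable read/write/mix weights; shapes are illustrative.
import torch
import torch.nn as nn


class HyperConnection(nn.Module):
    def __init__(self, layer: nn.Module, n_streams: int = 4):
        super().__init__()
        self.layer = layer
        # How much each stream contributes to the layer input.
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # How strongly the layer output is written back to each stream.
        self.write = nn.Parameter(torch.ones(n_streams))
        # Stream-to-stream mixing, initialized to the identity (plain residual).
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, dim)
        layer_in = torch.einsum("n,nbsd->bsd", self.read, streams)
        layer_out = self.layer(layer_in)
        mixed = torch.einsum("mn,nbsd->mbsd", self.mix, streams)
        return mixed + self.write.view(-1, 1, 1, 1) * layer_out


if __name__ == "__main__":
    block = HyperConnection(nn.Linear(16, 16), n_streams=4)
    h = torch.randn(4, 2, 8, 16)  # n_streams x batch x seq x dim
    print(block(h).shape)         # torch.Size([4, 2, 8, 16])
```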

2025.03.01
TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice
The Mixture of Experts (MoE) architecture has emerged as a promising solution to reduce computational overhead by selectively activating subsets of model parameters. The effectiveness of MoE models depends primarily on their routing mechanisms, with the widely adopted Top-K routing scheme used for activating experts. However, the Top-K scheme has notable limitations, including unnecessary activations and underutilization of experts. In this work, rather than modifying the routing mechanism as done in previous studies, we propose the Ternary Choice MoE (TC-MoE), a novel approach that expands the expert space by applying the ternary set {-1, 0, 1} to each expert. This expansion allows more efficient and effective expert activations without incurring significant computational costs. Additionally, given the unique characteristics of the expanded expert space, we introduce a new load balance loss and reward loss to ensure workload balance and achieve a flexible trade-off between effectiveness and efficiency. Extensive experiments demonstrate that TC-MoE achieves an average improvement of over 1.1% compared with traditional approaches, while reducing the average number of activated experts by up to 9%. These results confirm that TC-MoE effectively addresses the inefficiencies of conventional routing schemes, offering a more efficient and scalable solution for MoE-based large language models. Code and models are available at https://github.com/stiger1000/TC-MoE.
Shen Yan, Xingyan Bin, Sijun Zhang, Yisen Wang, Zhouchen Lin
Foundation
LLM
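The sketch below illustrates the ternary expert choice described in the abstract: each expert E_i is expanded into virtual experts {-E_i, 0, +E_i}, the router scores the expanded space, and selecting the "0" copy costs no expert compute. The auxiliary load-balance and reward losses are omitted and all names are assumptions for illustration; see the released code at https://github.com/stiger1000/TC-MoE for the actual implementation.

```python
# Hedged sketch of ternary expert choice: route over 3*E virtual experts
# {-E_i, 0, +E_i}; the "0" choice skips expert computation entirely.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryChoiceMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        # Router scores 3 * num_experts virtual experts, ordered (-E_i, 0_i, +E_i).
        self.router = nn.Linear(dim, 3 * num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)       # (tokens, 3E)
        weights, idx = probs.topk(self.top_k, dim=-1)   # (tokens, k)
        expert_id = idx % self.num_experts               # which expert
        sign = idx.div(self.num_experts, rounding_mode="floor") - 1  # {-1, 0, +1}
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                # Only compute the expert when its sign is non-zero.
                mask = (expert_id[:, slot] == e) & (sign[:, slot] != 0)
                if mask.any():
                    w = weights[mask, slot] * sign[mask, slot].float()
                    out[mask] += w.unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    moe = TernaryChoiceMoE(dim=16)
    tokens = torch.randn(32, 16)
    print(moe(tokens).shape)  # torch.Size([32, 16])
```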
View more papers