Source | RUC AI Box

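Pre-training is the first training stage in developing a large language model, and also the most important one.

In a talk, Ilya Sutskever stated bluntly that "pre-training as we know it will end", suggesting that entirely new ideas are needed to push the boundaries of data. Shital Shah further pointed out on social media that the high-quality portion of real data is finite, that simply piling on more similar data cannot break through the "quality ceiling", and that the potential of synthetic data has yet to be fully explored.

So how do we build the next generation of pre-trained models?

We have been tracking resources in the open-source community that can be used for large-model pre-training, including model architectures, training strategies, open datasets, and data methods, to give back to the developers in the open-source community who are working to build more capable large language models.

Rather than a complete survey, our coverage is limited to commonly used resources and frontier attempts related to pre-training, so that readers can get started with LLM pre-training quickly.

We also welcome updates submitted by the open-source community, so that we can advance large models together.

Project page: https://github.com/RUCAIBox/awesome-llm-pretraining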

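1. Technical Reports

Technical reports are usually backed by compute resources numbering in the hundreds or thousands, so high-quality open technical reports are well worth reading carefully.

Due to space constraints, we list a few recent and classic technical reports here; more are available on the GitHub page.

1.1 Dense Models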

The Llama 3 Herd of Models.
Qwen2.5 Technical Report.
Gemma 3 Technical Report.
Nemotron-4 340B Technical Report.
Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs.
Baichuan 2: Open Large-scale Language Models.

1.2 MoE Models

DeepSeek-V3 Technical Report.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models.
Mixtral of Experts.
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models.
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs.
OLMoE: Open Mixture-of-Experts Language Models.
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent.

1.3 Models with Open Datasets

YuLan-Mini: An Open Data-efficient Language Model.
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.
LLM360: Towards Fully Transparent Open-Source LLMs.
Nemotron-4 15B Technical Report.

1.4 Training/Data Strategies

Phi-4 Technical Report.
OLMo: Accelerating the Science of Language Models.
2 OLMo 2 Furious.
Yi: Open Foundation Models by 01.AI.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies.

1.5 Hybrid/Linear Models

Falcon Mamba: The First Competitive Attention-free 7B Language Model.
MiniMax-01: Scaling Foundation Models with Lightning Attention.
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models.

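2. Training Strategies

We discuss training strategies from several angles: training frameworks, training strategies, interpretability, model architecture improvements, and learning rate annealing.

2.1 Training Frameworks

The most commonly used training framework is Megatron-LM, which provides a solid, efficient out-of-the-box baseline. Combining it with other libraries can yield better training speed.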

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
The most commonly used pre-training framework; harder to get started with, but more stable

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
Computation-communication overlapping for MoE

DeepEP: an efficient expert-parallel communication library
Expert-parallel acceleration

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Accelerates FP8 matrix multiplication using the asynchronous features of Hopper

Liger Kernel: Efficient Triton Kernels for LLM Training
A library of Triton-accelerated kernels

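2.2 Training Strategies

This part covers hyperparameter scaling laws, parallelism strategies, initialization strategies, optimizer selection, FP8 training, and related topics.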

Predictable Scale: Part I — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
A scaling law for hyperparameters

The Ultra-Scale Playbook: Training LLMs on GPU Clusters
Visualizes GPU memory usage under different parallelism strategies

A Spectral Condition for Feature Learning
A more advanced version of MuP

Muon is Scalable for LLM Training
An efficient optimizer

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
FP8 training in which optimizer states and activations are also kept in FP8

Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
A scaling law for MoE

2.3 Interpretability

We give a non-exhaustive list of interpretability work that offers insight for pre-training.

On the Biology of a Large Language Model
Physics of Language Models
In-context Learning and Induction Heads
Rethinking Reflection in Pre-Training

2.4 Model Architecture Improvements

We give a non-exhaustive list of recent improvements to model architectures.

Gated Delta Networks: Improving Mamba2 with Delta Rule
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Mixture of Hidden-Dimensions Transformer
Titans: Learning to Memorize at Test Time
Ultra-Sparse Memory Network
Large Language Diffusion Models
Better & Faster Large Language Models via Multi-token Prediction
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Stick-breaking Attention
Forgetting Transformer: Softmax Attention with a Forget Gate
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
MoBA: Mixture of Block Attention for Long-Context LLMs
KV Shifting Attention Enhances Language Modeling
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs
μnit Scaling: Simple and Scalable FP8 LLM Training
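
Two of the entries above concern load balancing for MoE routing. As a rough illustration of what such an auxiliary loss measures, here is a minimal pure-Python sketch of the widely used Switch-Transformer-style formulation with top-1 routing: the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). This is an assumption chosen for illustration, not the specific variant studied in either listed paper, and the function name `load_balancing_loss` is hypothetical.

```python
def load_balancing_loss(router_probs):
    """Switch-style auxiliary loss for top-1 routing (illustrative sketch).

    router_probs: list of per-token probability lists, one entry per expert.
    Returns num_experts * sum_i(f_i * P_i), where f_i is the fraction of
    tokens whose top-1 expert is i and P_i is the mean router probability
    assigned to expert i. The value is 1.0 for perfectly balanced routing.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        # Top-1 routing: each token is dispatched to its most probable expert.
        top1 = max(range(num_experts), key=lambda i: probs[i])
        counts[top1] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    fractions = [c / num_tokens for c in counts]
    mean_probs = [s / num_tokens for s in prob_sums]
    return num_experts * sum(f * p for f, p in zip(fractions, mean_probs))


balanced = [[0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1], [0.9, 0.1]]
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

In this toy input the collapsed router, which sends every token to the first expert, receives a higher loss than the balanced one, which is the behavior the auxiliary term is meant to penalize.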

2.5 Learning Rate Annealing

Learning rate annealing is often combined with data quality filtering.

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Scaling Law with Learning Rate Annealing
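
To make the pairing of annealing and data filtering concrete, here is a minimal sketch of a warmup-stable-decay (WSD) style schedule, the kind of schedule associated with the MiniCPM report listed above, together with a helper that switches to a higher-quality data pool once the decay phase begins. The linear decay shape, the function names `wsd_lr` and `data_phase`, and all numeric values are assumptions for illustration, not settings from any of the listed papers.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` under a warmup-stable-decay schedule (sketch)."""
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay (annealing) phase: linear here; other shapes are possible.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def data_phase(step, total_steps, decay_steps):
    """Pick a data pool: filtered high-quality data during annealing."""
    return "high_quality" if step >= total_steps - decay_steps else "general"


for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, 1000, 1e-3, 100, 200), data_phase(s, 1000, 200))
```

One reason this shape is convenient is that the stable phase can be extended and the decay phase restarted from any stable-phase checkpoint, which makes it natural to reserve the most carefully filtered data for the final annealing steps.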

3. Open Datasets

We discuss existing open datasets in four categories: web, math, code, and general.

3.1 Web

Web data forms the core corpus of pre-training.

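DataComp-LM: In search of the next generation of training sets for language models.
An open web dataset; a 3.8T dataset obtained after filtering with fastText and other methods

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb and FineWeb-Edu, a corpus scored for educational quality; it helps to some extent on knowledge-intensive questions

Nemotron-CC-HQ.
NVIDIA's high-quality web corpus

Chinese-FineWeb-Edu.
A Chinese corpus scored for educational quality, open-sourced by OpenCSG; filtered and scored from Map-CC, SkyPile, WuDao, Wanjuan, and others

FineWeb2: A sparkling update with 1000s of languages
A multilingual dataset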

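3.2 Math

Math pre-training corpora can significantly improve a base model's mathematical ability, as well as the ceiling reachable through post-training.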

MegaMath: Pushing the Limits of Open Math Corpora
The largest open-source high-quality math corpus from CC

JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
Synthetic math instruction data

mlfoundations-dev/stackoverflow_math
Math-related questions

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
A high-difficulty math dataset

YuLan-Mini: An Open Data-efficient Language Model
Collects open-source Lean theorem-proving datasets

3.3 Code

Code data not only strengthens a base model's ability to generate code, but can also improve abilities such as math and logic.

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Cleaned from The-Stack-V2

SmolLM-corpus.
Python data scored for educational quality

The-Stack-V2
The largest-scale uncleaned code data

YuLan-Mini: An Open Data-efficient Language Model
Jupyter Notebook and Python data cleaned by educational quality

HuggingFaceTB/issues-kaggle-notebooks
GitHub Issues and Kaggle Notebooks data

mlfoundations-dev/stackoverflow
A programming Q&A forum

Magicoder: Empowering Code Generation with OSS-Instruct
Uses open-source code to generate synthetic instruction data for training

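3.4 General (Books, Encyclopedias, Instructions, Long Context, etc.)

General data is often relatively scarce long-tail data, and it plays a crucial role in the usability of the post-trained model.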

YuLan: An Open-source Large Language Model
Long-tail knowledge enhancement and cleaning of multiple general data sources

MinerU: An Open-Source Solution for Precise Document Content Extraction
PDF-to-Markdown conversion with strong compatibility

The Pile: An 800GB Dataset of Diverse Text for Language Modeling.
arXiv, dialogue, DM Math, etc.

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
Encyclopedias, books, papers, Reddit, etc.

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
Law, exams, news, patents, encyclopedias, etc.

MAmmoTH2: Scaling Instructions from the Web
Q&A derived from web pages

togethercomputer/Long-Data-Collections
Books, papers, web pages, and instructions filtered from RedPajama, Pile, P3, and other datasets

Longattn: Selecting long-context training data via token-level attention
Q&A with long-range dependencies

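4. Data Methods

Datasets usually go hand in hand with high-quality data methods. We cover tokenizers, data mixing and curriculum, and data synthesis in detail.

4.1 Tokenizers

Tokenization is an important yet often overlooked part of a model, and it significantly affects the model's abilities in areas such as math and knowledge.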

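SuperBPE: Space Travel for Language Models
A tokenizer training approach whose tokens can span multiple words

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Predicting vocabulary size

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
A comparison of tokenization schemes for numbers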
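
To illustrate why the tokenization of numbers can matter for arithmetic, the sketch below splits a digit string into fixed-size chunks either from the left or from the right. With right-to-left chunking the chunk boundaries line up with place value (ones, thousands, millions), while left-to-right chunking shifts the boundaries depending on the length of the number. The function `chunk_digits` is a hypothetical helper written for this illustration and does not reproduce the behavior of any particular tokenizer.

```python
def chunk_digits(number: str, size: int = 3, right_to_left: bool = False) -> list[str]:
    """Split a digit string into chunks of at most `size` digits (sketch)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if right_to_left:
        # The leftover digits go in the first chunk, so that every later
        # chunk boundary aligns with place value.
        head = len(number) % size
        chunks = [number[:head]] if head else []
        chunks += [number[i:i + size] for i in range(head, len(number), size)]
        return chunks
    # Left-to-right: the leftover digits end up in the last chunk.
    return [number[i:i + size] for i in range(0, len(number), size)]


print(chunk_digits("1234567"))                      # left-to-right, 3 digits
print(chunk_digits("1234567", right_to_left=True))  # right-to-left, 3 digits
print(chunk_digits("1234567", size=1))              # one token per digit
```

Under the left-to-right scheme the final "7" becomes its own token, whereas under the right-to-left scheme "567" stays together as the units group, which is the kind of difference the listed comparison examines.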

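4.2 Data Mixing and Curriculum

Multi-stage pre-training often allows a model to learn thoroughly from small amounts of high-quality data. Introducing more math, code, CoT, and even long chain-of-thought data during the continued pre-training (CPT) stage will form the core capabilities of next-generation pre-trained models.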

Nemotron-4 15B Technical Report
Split into 8T of pre-training plus CPT on a smaller amount of data

YuLan-Mini: An Open Data-efficient Language Model
Uses educational scores to build a data curriculum

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Optimizing the mixture proportions of pre-training data

Efficient Online Data Mixing For Language Model Pre-Training
Online data mixing

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Data mixing laws

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
Uses the merge rules of BPE tokenizers to infer the data proportions of commercial models such as GPT

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
A clustering-based iterative data mixture bootstrapping framework

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Builds an index over large-scale pre-training datasets for inspecting data quality
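
As a small illustration of what a data mixture means in practice, the sketch below turns domain weights into per-domain token budgets and samples the source domain for each training document, using one mixture for the main stage and a second, math- and code-heavier mixture for a CPT stage. The weights in `PRETRAIN_MIX` and `CPT_MIX` and the helper names are hypothetical values chosen for illustration; they are not taken from any of the listed papers, which study how to choose such proportions.

```python
import random

# Hypothetical mixtures for illustration only.
PRETRAIN_MIX = {"web": 0.70, "code": 0.15, "math": 0.05, "general": 0.10}
CPT_MIX = {"web": 0.30, "code": 0.25, "math": 0.25, "general": 0.20}


def normalize(weights):
    """Rescale non-negative domain weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def token_budget(weights, total_tokens):
    """Number of tokens to draw from each domain for a given total budget."""
    return {name: round(p * total_tokens) for name, p in normalize(weights).items()}


def sample_domains(weights, num_samples, seed=0):
    """Sample the source domain of each document according to the mixture."""
    probs = normalize(weights)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.choices(list(probs), weights=list(probs.values()), k=num_samples)


print(token_budget(PRETRAIN_MIX, 1_000_000))
print(token_budget(CPT_MIX, 100_000))
print(sample_domains(CPT_MIX, 10))
```

Mixture optimization methods can be viewed as ways of choosing the weight dictionaries above, either once before training or repeatedly during it in the online setting.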

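4.3 Data Synthesis

Beyond the synthetic math and code data mentioned above, we summarize some general-purpose synthetic data methods and resources. In addition, using more long-thought data in the later stages of pre-training is gradually becoming a direction worth exploring.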

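Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
Imitation learning on synthetic long chain-of-thought data

Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
Generates information-dense synthetic instruction data to learn knowledge from a limited corpus

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Structured synthesis of long text

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Multi-step reasoning data synthesis that decomposes complex tasks into sub-trajectories and uses reinforcement learning to optimize data generation

WildChat: 1M ChatGPT Interaction Logs in the Wild
An open dataset of real user conversations

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Alignment data synthesis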