On the technical sophistication of Jensen Huang's shovel-selling business.




```bash
# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install libcudnn9-dev-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
# (you might also have to install NVIDIA NCCL if it doesn't come with your setup)
sudo apt -y install openmpi-bin openmpi-doc libopenmpi-dev

# download and enter llm.c repo
git clone https://github.com/karpathy/llm.c.git
cd llm.c

# download the "starter pack" (~1GB download)
# contains GPT2-124M weights (used in tests), tokenizer, eval data .bin s
./dev/download_starter_pack.sh

# download the training dataset (FineWeb-Edu 100B token) .bin data shards
# note: this is a total of 1001 data shards. If you only want to test things
# out and don't want to do an actual run, feel free to append the number of
# training shards to download (e.g. for just 10 shards: ./edu_fineweb.sh 10)
# the full dataset is ~200GB, we can store it here in dev/data directory.
cd dev/data
./edu_fineweb.sh
```
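As a quick sanity check on those numbers: GPT-2's vocabulary (50,257 entries) fits in a 16-bit unsigned integer, so each token costs about 2 bytes on disk. The 2-bytes-per-token figure is my assumption about the shard encoding; the shard count and token total come from the comments above.

```python
# back-of-the-envelope check of the FineWeb-Edu 100B download size
# assumption: tokens stored as uint16 (GPT-2 vocab 50257 < 2**16), i.e. 2 bytes/token
tokens_total = 100e9      # ~100B tokens
bytes_per_token = 2
num_shards = 1001         # .bin data shards

print(f"~{tokens_total * bytes_per_token / 1e9:.0f} GB total")      # ~200 GB
print(f"~{tokens_total / num_shards / 1e6:.0f}M tokens per shard")  # ~100M tokens/shard
```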
```bash
# compile (~1 min 1st time for cuDNN mostly, few sec from then on)
cd ../../
make train_gpt2cu USE_CUDNN=1

# and train! (wait 24 hours here)
mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/edu_fineweb100B/edu_fineweb_train_*.bin" \
    -j "dev/data/edu_fineweb100B/edu_fineweb_val_*.bin" \
    -o "log_gpt2_1558M" \
    -v 250 -s 300000 -g 384 \
    -h 1 \
    -b 16 -t 1024 \
    -d 1048576 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -k "cosine" \
    -l 0.0006 \
    -q 0.1 \
    -u 700 \
    -n 2000 \
    -x 32000 \
    -ge 1 \
    -y 1 \
    -e "d48"
```
```
num_parameters: 1557686400 => bytes: 3115372800
allocated 2971 MiB for model parameters
batch_size B=16 * seq_len T=1024 * num_processes=8 and total_batch_size=1048576 => setting grad_accum_steps=8
created directory: log_gpt2_1558M
allocating 40409 MiB for activations
val loss 11.129390
allocating 2971 MiB for parameter gradients
allocating 742 MiB for AdamW optimizer state m
allocating 742 MiB for AdamW optimizer state v
allocating 742 MiB for master copy of params
step    1/32000 | loss 11.133732 (+nanz)| norm 52.9732 (+nanz)| lr 8.57e-07 | 3056.36 ms | 42.6% bf16 MFU | 343080 tok/s
step    2/32000 | loss 10.539388 (+nanz)| norm 43.5996 (+nanz)| lr 1.71e-06 | 2747.19 ms | 47.4% bf16 MFU | 381690 tok/s
step    3/32000 | loss 9.894109 (+nanz)| norm 23.2229 (+nanz)| lr 2.57e-06 | 2753.25 ms | 47.3% bf16 MFU | 381259 tok/s
step    4/32000 | loss 9.566241 (+nanz)| norm 28.4920 (+nanz)| lr 3.43e-06 | 2741.47 ms | 47.5% bf16 MFU | 381690 tok/s
step    5/32000 | loss 9.482848 (+nanz)| norm 23.7817 (+nanz)| lr 4.29e-06 | 2752.07 ms | 47.3% bf16 MFU | 381507 tok/s
step    6/32000 | loss 9.332832 (+nanz)| norm 15.9113 (+nanz)| lr 5.14e-06 | 2751.01 ms | 47.3% bf16 MFU | 381431 tok/s
step    7/32000 | loss 9.165650 (+nanz)| norm 10.5941 (+nanz)| lr 6.00e-06 | 2753.03 ms | 47.3% bf16 MFU | 381327 tok/s
step    8/32000 | loss 9.132234 (+nanz)| norm 16.2733 (+nanz)| lr 6.86e-06 | 2748.91 ms | 47.3% bf16 MFU | 381348 tok/s
step    9/32000 | loss 9.097384 (+nanz)| norm 12.1342 (+nanz)| lr 7.71e-06 | 2748.73 ms | 47.3% bf16 MFU | 381367 tok/s
step   10/32000 | loss 9.072879 (+nanz)| norm 10.5923 (+nanz)| lr 8.57e-06 | 2749.40 ms | 47.3% bf16 MFU | 381369 tok/s
...
```
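Two of the per-step numbers can be roughly reconstructed from the others: throughput is the token batch size divided by the step time, and MFU compares achieved FLOP/s against the hardware's peak. The sketch below uses the common 6·N FLOPs-per-token approximation (which ignores attention FLOPs, so it lands a little under the logged 47.3%) and assumes the run is on H100 SXM GPUs with a ~989 TFLOP/s dense bf16 peak; both of those are my assumptions, not numbers from the log.

```python
# rough reconstruction of tok/s and MFU from the step-3 line of the log above
# assumptions: 6*N FLOPs/token (attention ignored), ~989e12 bf16 FLOP/s peak per GPU (H100 SXM)
N = 1_557_686_400            # num_parameters from the log
tokens_per_step = 1_048_576  # total_batch_size
step_time_s = 2.75325        # 2753.25 ms
num_gpus = 8
peak_flops_per_gpu = 989e12

tok_per_s = tokens_per_step / step_time_s
mfu = (6 * N * tok_per_s) / (num_gpus * peak_flops_per_gpu)
print(f"{tok_per_s:,.0f} tok/s")  # ~380,850 vs. the logged 381,259
print(f"{mfu:.1%} MFU")           # ~45%, a bit under the logged 47.3% (attention FLOPs excluded)
```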
- It runs with only basic CUDA dependencies.
- It is a direct, minimal, and readable C/CUDA implementation, about 5,000 lines of C/CUDA in total. It deliberately sticks to mostly C rather than C++ to keep things simple. Neural network training is just one big while loop of the same simple arithmetic operations (+, -, *, /) over flat float arrays, and it really shouldn't be more complicated than that (a toy illustration of this point follows this list).
- It compiles and runs very quickly (a few seconds), so you can iterate more and wait less.
- It allocates all of its GPU memory once up front and keeps a completely constant memory footprint during training from then on. So once the first steps go through, you know the rest of the run will behave and won't run out of memory (OOM).
- It is bitwise deterministic.
- It is quite efficient, at just under ~50% MFU.
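To make the "one while loop of simple arithmetic over flat float arrays" point concrete, here is a deliberately tiny caricature of a training loop (my own toy example, not code from llm.c): a linear model with a squared-error loss stands in for the transformer, but the shape of the loop is the same.

```python
# toy training loop: flat float arrays plus +, -, *, / and nothing else
import random

n = 4
w = [0.0] * n                 # "model parameters" as a flat float array
lr = 0.05

step = 0
while step < 1000:                                     # the while loop
    x = [random.uniform(-1, 1) for _ in range(n)]      # a fake data sample
    y = sum(x)                                         # target: the true weights are all 1.0
    y_hat = sum(wi * xi for wi, xi in zip(w, x))       # forward pass: multiply and add
    err = y_hat - y                                    # gradient of 0.5 * err**2 w.r.t. y_hat
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]   # backward + SGD update: more arithmetic
    step += 1

print([round(wi, 2) for wi in w])                      # ≈ [1.0, 1.0, 1.0, 1.0]
```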
```bash
torchrun --standalone --nproc_per_node=8 train_gpt2.py \
    --input_bin "dev/data/edu_fineweb100B/edu_fineweb_train_*.bin" \
    --input_val_bin "dev/data/edu_fineweb100B/edu_fineweb_val_*.bin" \
    --write_tensors 0 \
    --model d48 \
    --batch_size 8 --sequence_length 1024 --total_batch_size 1048576 \
    --dtype bfloat16 \
    --compile 1 \
    --tensorcores 1 \
    --flash 1 \
    --num_iterations 32000 \
    --warmup_iters 700 \
    --weight_decay 0.1 \
    --overfit_single_batch 0 \
    --learning_rate 0.0006 \
    --zero_stage 1
```
```
step   16/32000 | train loss 8.903997 | norm 8.3474 | lr 1.37e-05 | (3381.88 ms | 310057 tok/s)
step   17/32000 | train loss 8.870140 | norm 3.7936 | lr 1.46e-05 | (3381.95 ms | 310051 tok/s)
step   18/32000 | train loss 8.875732 | norm 9.4993 | lr 1.54e-05 | (3393.09 ms | 309033 tok/s)
step   19/32000 | train loss 8.817432 | norm 2.8345 | lr 1.63e-05 | (3379.75 ms | 310253 tok/s)
step   20/32000 | train loss 8.798056 | norm 4.1234 | lr 1.71e-05 | (3386.53 ms | 309631 tok/s)
step   21/32000 | train loss 8.777574 | norm 2.8010 | lr 1.80e-05 | (3386.05 ms | 309675 tok/s)
...
```
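Putting the two logs side by side, the llm.c run sustains roughly 381K tok/s against roughly 310K tok/s for the torchrun baseline (both with the same 1,048,576-token batch across 8 processes), i.e. a bit over 20% higher throughput. This is a quick calculation from single representative log lines, so the exact gap will wobble from step to step:

```python
# throughput gap between the two logs above (one representative step from each)
llmc_tok_s = 381_259    # llm.c, step 3
torch_tok_s = 310_057   # torchrun baseline, step 16

speedup = llmc_tok_s / torch_tok_s
print(f"{speedup:.2f}x (~{(speedup - 1) * 100:.0f}% faster)")  # ~1.23x, ~23% faster
```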
- the main.log file (http://llmc.s3-us-west-2.amazonaws.com/gpt2_1558M/main.log)
- the model_00032000.bin llm.c model file (http://llmc.s3-us-west-2.amazonaws.com/gpt2_1558M/model_00032000.bin)
- the model converted to a huggingface transformers GPT-2 model (https://huggingface.co/karpathy/gpt2_1558M_final2_hf)
```bash
python dev/eval/export_hf.py --input log_gpt2_1558M/model_00032000.bin --output gpt2_1558M_export
```
```python
# take model for spin
import torch
output = "./gpt2_1558M_final2_hf"

# set pytorch seeds
torch.manual_seed(42)
torch.cuda.manual_seed(42)

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(output)
model = AutoModelForCausalLM.from_pretrained(output, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map='cuda')
model.eval()
tokens = tokenizer.encode(prompt, return_tensors="pt")
tokens = tokens.to('cuda')

output = model.generate(tokens, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_k=50, num_return_sequences=4)
samples = tokenizer.batch_decode(output)
for sample in samples:
    print('-'*30)
    print(sample)
```

