
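Google recently published a paper [1] that caused a stir in the AI systems community. It describes how the company set new world records for deep learning training speed under the MLPerf benchmark: BERT trained in 23 seconds, and ImageNet trained in 28 seconds. The paper is wide-ranging, covering systems, algorithms, compilers, and applications to explain how Google built an AI supercomputer and a deep learning system on top of its own chips. It has 19 authors, including several experts from Google's deep learning systems team. This article gives a brief discussion of the paper and is split into two parts. Questions can be sent to the author by email at [email protected] (Yang You (尤洋), director of the High-Performance AI Lab at the National University of Singapore and advisor to Biren Technology). During his PhD at UC Berkeley, the author interned four times with the Google Brain team at Google headquarters.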




References
[1] Kumar, S., Bradbury, J., Young, C., Wang, Y. E., Levskaya, A., Hechtman, B., Chen, D., Lee, H., Deveci, M., Kumar, N., Kanwar, P., Wang, S., Wanderman-Milne, S., Lacy, S., Wang, T., Oguntebi, T., Zu, Y., Xu, Y., and Swing, A. Exploring the limits of concurrency in ML training on Google TPUs.
[2] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.
[3] Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67–78, 2020.
[4] Kumar, S., Bitorff, V., Chen, D., Chou, C., Hechtman, B., Lee, H., Kumar, N., Mattson, P., Wang, S., Wang, T., et al. Scale MLPerf-0.6 models on Google TPU-v3 pods. arXiv preprint arXiv:1909.09756, 2019.
[5] Langston, J. Microsoft announces new supercomputer, lays out vision for future AI work. Microsoft Blog, 2020.
[6] Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D., Tang, H., Wei, G.-Y., Bailis, P., Bittorf, V., et al. MLPerf training benchmark. arXiv preprint arXiv:1910.01500, 2019.
[7] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[8] Frostig, R., Johnson, M. J., and Leary, C. Compiling machine learning programs via high-level tracing. Systems for Machine Learning, 2018.


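Biren Research, the frontier research division of Biren Technology, studies the key technologies behind new intelligent computing systems, with a focus on novel architectures, advanced compiler technology, and design methodology. It plans to broaden its research directions over time and explore the possibilities of future intelligent systems. Committed to openness, Biren Research will take an active part in industry-academia-research collaborations and in building open-source communities, contributing to technical progress in these fields.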

