Source | NLP工作站

Author | 劉聰NLP

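Today I noticed that Kimi has open-sourced two MoE vision-understanding models, Kimi-VL-A3B-Instruct and Kimi-VL-A3B-Thinking, with 16.4B total parameters, only 2.8B activated parameters, and a 128K context length.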
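To put the two parameter counts in perspective, the short calculation below uses only the figures quoted above. The helper name `active_fraction` is made up for illustration, and the result is only as precise as the rounded numbers in the announcement.

```python
# Figures quoted in the article, in billions of parameters.
TOTAL_PARAMS_B = 16.4
ACTIVE_PARAMS_B = 2.8


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of an MoE model's parameters that is activated."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Activated share: {fraction:.1%} of all parameters")
```

In other words, roughly one sixth of the weights take part in any single forward step, which is where the inference savings of an MoE design come from.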

Github: https://github.com/MoonshotAI/Kimi-VL

Paper: https://github.com/MoonshotAI/Kimi-VL/blob/main/Kimi-VL.pdf

On the leaderboards, Kimi-VL outperforms Qwen2.5-7B on most benchmarks, as shown in the table below.

The architecture consists of an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as shown in the figure below.
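One practical consequence of a native-resolution encoder is that the number of vision tokens passed to the language model grows with the image size instead of being fixed. The sketch below illustrates that idea only. The patch size of 14 and the 2x2 token merge in the projector are assumptions chosen for illustration, not values stated in this article, and `vision_token_count` is a hypothetical helper.

```python
import math


def vision_token_count(height: int, width: int,
                       patch_size: int = 14, merge: int = 2) -> int:
    """Estimate how many vision tokens an image yields.

    patch_size and merge are illustrative assumptions:
    the image is cut into patch_size x patch_size patches, and the
    projector is assumed to merge each merge x merge group into one token.
    """
    if min(height, width, patch_size, merge) <= 0:
        raise ValueError("all arguments must be positive")
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return math.ceil(rows / merge) * math.ceil(cols / merge)


for size in [(448, 448), (896, 448), (1000, 700)]:
    print(size, "->", vision_token_count(*size), "tokens")
```

Under these assumptions a 448x448 image gives 256 tokens and doubling one side doubles the count, so large images and long videos consume context quickly.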

Pre-training consists of 4 stages and consumes 4.4T tokens in total.

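Standalone ViT training: trains MoonViT into a robust native-resolution vision encoder.
Joint pre-training: trains the full model on pure text data together with a variety of multimodal data.
Joint cooldown stage: continues training on high-quality language and multimodal datasets, with synthetic data added, to improve performance on mathematical reasoning, knowledge-based tasks, and code generation.
Joint long-context activation stage: extends the model's context length from 8192 to 131072 tokens so that it can handle long text and long videos.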
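The four pre-training stages can be written down as a small ordered structure, which also makes the size of the context extension explicit. This is only a summary of the description above: the `Stage` class and the field names are invented for illustration, and no per-stage token budgets are included because the article gives only the 4.4T total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str


# Order and goals follow the article's description.
PRETRAIN_STAGES = [
    Stage("standalone ViT training", "make MoonViT a robust native-resolution encoder"),
    Stage("joint pre-training", "train the full model on text and multimodal data"),
    Stage("joint cooldown", "add high-quality and synthetic data for math, knowledge, code"),
    Stage("joint long-context activation", "extend the context window"),
]

BASE_CONTEXT = 8192
EXTENDED_CONTEXT = 131072

# How much the last stage stretches the context window.
extension_factor = EXTENDED_CONTEXT // BASE_CONTEXT

for index, stage in enumerate(PRETRAIN_STAGES, start=1):
    print(f"{index}. {stage.name}: {stage.goal}")
print(f"Context grows {extension_factor}x, to {EXTENDED_CONTEXT // 1024}K tokens")
```

The final line confirms that 131072 tokens is the 128K context length quoted for the released models, a 16x increase over the 8192-token base.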

Post-training consists of 3 stages:

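SFT stage: fine-tunes the model on multimodal instruction data, first for 1 epoch at a 32K sequence length and then for 1 more epoch at a 128K sequence length. In the first stage (32K), the learning rate decays from 2e-5 to 2e-6; in the second stage (128K), it re-warms up to 1e-5 and finally decays to 1e-6.
CoT stage: builds a small but high-quality long-CoT dataset through carefully designed prompt engineering, so that the model learns the basic processes of planning, evaluation, reflection, and exploration.
RL stage: trains the model with reinforcement learning (RL) so that it can generate structured CoT reasoning paths on its own.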
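The SFT learning-rate schedule is easier to follow as a function of the training step. The sketch below reads the figures as two phases: 2e-5 decaying to 2e-6 during the 32K epoch, then a re-warm-up to 1e-5 and a decay to 1e-6 during the 128K epoch. The cosine shape, the linear warm-up, and all step counts are assumptions, since the article gives only the endpoint values.

```python
import math


def cosine_decay(step: int, total_steps: int, peak_lr: float, final_lr: float) -> float:
    """Cosine interpolation from peak_lr (step 0) to final_lr (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def sft_lr(step: int, steps_32k: int, steps_128k: int, warmup_steps: int) -> float:
    """Learning rate at a global step of the two-phase SFT schedule (assumed shape)."""
    if steps_32k <= 0 or not 0 <= warmup_steps < steps_128k:
        raise ValueError("invalid phase lengths")
    if step < steps_32k:
        # Phase 1 (32K sequences): 2e-5 -> 2e-6.
        return cosine_decay(step, steps_32k, 2e-5, 2e-6)
    local = step - steps_32k
    if local < warmup_steps:
        # Phase 2 (128K sequences): linear re-warm-up from 2e-6 to 1e-5.
        return 2e-6 + (1e-5 - 2e-6) * local / warmup_steps
    # Phase 2 after warm-up: 1e-5 -> 1e-6.
    return cosine_decay(local - warmup_steps, steps_128k - warmup_steps, 1e-5, 1e-6)


# Hypothetical step counts, chosen only to print a few sample points.
for s in (0, 500, 1000, 1100, 2000):
    print(s, f"{sft_lr(s, steps_32k=1000, steps_128k=1000, warmup_steps=100):.2e}")
```

Restarting at a lower peak for the long-sequence epoch is a common way to adapt to a new data distribution without undoing what the first epoch learned; whether that was the authors' motivation is not stated here.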

Finally, a quick-start example:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and its processor; the repository ships custom code,
# so trust_remote_code=True is required.
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-turn chat message containing one image and one question.
image_path = "demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]

# Render the chat template to text, then tokenize text and image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate, then drop the prompt tokens so only the new answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
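The trimming step in the example is easy to misread: `generate` returns the prompt tokens followed by the newly generated ones, so the list comprehension slices off the first `len(in_ids)` tokens of each sequence. The sketch below reproduces that logic on plain Python lists so it can be run without the model. The token IDs are made up, and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Remove each prompt from the front of its generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: a batch of two prompts with different lengths.
prompt_batch = [[101, 7, 8, 9], [101, 5]]
generated_batch = [[101, 7, 8, 9, 42, 43, 102], [101, 5, 77, 102]]

print(trim_prompt_tokens(prompt_batch, generated_batch))
# [[42, 43, 102], [77, 102]]
```

Without this step, `batch_decode` would return the prompt text followed by the answer rather than the answer alone.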