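Evaluation accuracy tops 96%! The Institute for Advanced Algorithms Research, Shanghai (IAAR) releases xVerify: an answer verifier for reasoning models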

Why build xVerify?

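Reasoning models have made notable progress in many domains, but extracting and verifying their answers remains difficult in complex settings such as long reasoning chains, hard mathematical expressions, and multilingual responses. The main challenges are:

Slow-thinking scenarios: long reasoning chains contain intermediate results and self-reflection, which makes it hard to locate the final answer and judge its correctness.
Diverse expression: judging answer equivalence is hard when answers use complex mathematical notation (LaTeX, fractions, natural language) or are written in different languages.
Limits of existing evaluators: rule-based tools such as Math-Verify lack flexibility, while LLM-based judge models lack targeted training.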

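To address these problems, a research team at the Institute for Advanced Algorithms Research, Shanghai released xVerify, the first efficient answer verification tool designed for long-chain reasoning responses, covering multiple question types and supporting both Chinese and English. Compared with Math-Verify, the official Hugging Face component, xVerify supports more question types, verifies answers more comprehensively, and is more accurate: in most scenarios its evaluation accuracy exceeds 96%.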

Project repository:

https://github.com/IAAR-Shanghai/xVerify

Paper on Hugging Face:

https://huggingface.co/papers/2504.10481

Open-source xVerify models:

https://huggingface.co/collections/IAAR-Shanghai/xverify-67e0f6f94c2dc334727da802

Paper on arXiv:

https://arxiv.org/abs/2504.10481

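xVerify has the following features and advantages:

Built for reasoning models and long reasoning chains. xVerify's training set samples long-chain responses generated by several reasoning models on high-difficulty datasets, which specifically strengthens its ability to handle such responses. As a result, intermediate results and self-reflection passages in a reasoning model's response do not mislead xVerify when it judges the final answer.

Broad applicability across question types. xVerify fits many evaluation scenarios and supports math problems, multiple-choice questions, classification tasks, and short-answer questions. It has strong equivalence judgment and accurately recognizes equivalent answers across question types. It mainly supports Chinese and English and is compatible with other languages.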

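Intelligent answer recognition with accurate matching of equivalent expressions. xVerify automatically handles basic variations such as letter case (a -> A) and Greek letters (alpha -> α), and it also recognizes equivalent forms of mathematical expressions, including LaTeX (\frac{4}{5} -> 4/5), scientific notation (1.34 × 10³ -> 1340), and natural language (one hundred and twenty-three -> 123).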

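It is also trained on high-difficulty math problems, so it copes well with equivalence judgments over complex mathematical expressions. xVerify parses LaTeX correctly even when an expression is incomplete or formatted differently. In short-answer settings, it judges whether the model's answer matches the reference answer in content.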
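To make these equivalence classes concrete, the sketch below normalizes a few surface forms with hand-written rules. It is not how xVerify works: xVerify is a fine-tuned model, not a rule set. The names `normalize_answer` and `equivalent` and the tiny lookup tables are assumptions made purely for illustration.

```python
import re
from fractions import Fraction

# Hypothetical lookup tables, deliberately tiny; illustration only.
GREEK = {"alpha": "α", "beta": "β", "gamma": "γ"}
NUMBER_WORDS = {"one hundred and twenty-three": "123"}

def normalize_answer(text: str) -> str:
    """Map a few equivalent surface forms to one canonical string."""
    s = text.strip().strip("$").strip()
    low = s.lower()
    if low in GREEK:
        return GREEK[low]
    if low in NUMBER_WORDS:
        return NUMBER_WORDS[low]
    # LaTeX fraction such as \frac{4}{5}
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if m:
        return str(Fraction(int(m.group(1)), int(m.group(2))))
    # Plain fraction such as 4/5 or 8/10
    m = re.fullmatch(r"(-?\d+)\s*/\s*(-?\d+)", s)
    if m:
        return str(Fraction(int(m.group(1)), int(m.group(2))))
    # Scientific notation written as "<mantissa> x 10^<exponent>"
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*[x×]\s*10\^(-?\d+)", s)
    if m:
        return str(Fraction(m.group(1)) * Fraction(10) ** int(m.group(2)))
    # Plain decimal or integer
    if re.fullmatch(r"-?\d+(?:\.\d+)?", s):
        return str(Fraction(s))
    return low  # fall back to a case-insensitive comparison

def equivalent(a: str, b: str) -> bool:
    """True when both answers normalize to the same canonical string."""
    return normalize_answer(a) == normalize_answer(b)
```

Rules like these break as soon as an answer is embedded in a sentence, uses incomplete LaTeX, or mixes languages, which is the gap a trained verifier is meant to close.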

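Multiple model variants to choose from. xVerify is released as a family of models built on different base models and parameter scales. The base models include Qwen 2.5, Gemma 2, LLaMA 3.1/3.2, GLM 4, and Phi-4, with sizes from 0.5B to 32B, which reduces the bias that any single base model would introduce. Users can pick the model that best fits their compute budget and evaluation needs, balancing efficiency and accuracy.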

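How xVerify was built

📌 Stage 1: Building the VAR dataset

The study systematically collected more than 20 authoritative benchmark datasets, including AIME 2024, LiveMathBench, GPQA, MATH, and GSM8K, covering a range of high-difficulty mathematical reasoning tasks and answer formats.

On top of these datasets, the team ran nearly 20 base models (several of them reasoning models) with diverse prompting strategies to generate question-answer data containing complex reasoning processes and varied answer forms.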

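To strengthen cross-scenario generalization, the team made sure during data splitting that the training set and the generalization set cover different benchmark sources and different LLMs. They also designed several data augmentation strategies to diversify the existing LLM question-answer samples (see the figure below). This makes it possible to test whether xVerify adapts to diverse real-world evaluation scenarios rather than relying on patterns specific to its training data.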
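One way to read this splitting rule is sketched below: a sample goes to the generalization set only when both its benchmark and its answering LLM are held out, and to the training pool only when neither is. The `Sample` fields, the function name, and the choice to drop mixed cases are assumptions for illustration; the article does not describe the team's actual procedure at this level of detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    benchmark: str  # source benchmark, e.g. "GSM8K"
    model: str      # LLM that produced the response (hypothetical field)
    response: str

def split_by_source(samples, heldout_benchmarks, heldout_models):
    """Return (train_pool, generalization) sharing no benchmark and no model."""
    train_pool, generalization = [], []
    for s in samples:
        bench_out = s.benchmark in heldout_benchmarks
        model_out = s.model in heldout_models
        if bench_out and model_out:
            generalization.append(s)
        elif not bench_out and not model_out:
            train_pool.append(s)
        # Mixed cases are dropped so the two sides stay fully disjoint
        # (an assumption of this sketch, not a documented step).
    return train_pool, generalization
```

Splitting on the source rather than on individual samples is what keeps the generalization set from leaking benchmark-specific or model-specific answer styles into training.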

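For annotation, the team combined GPT-4o with a human annotation team, labeling and re-checking the training and test sets over multiple rounds to keep labels accurate and consistent. Specifically, they first ran two rounds of automatic labeling with GPT-4o using different prompts, then had humans review the samples where the two rounds disagreed or that involved complex mathematical expressions.

For the test and generalization sets, quality control was stricter: every sample was labeled again by human annotators, so that these sets serve as a high-quality benchmark that accurately measures a model's effectiveness and generalization.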
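The triage rule described in the two paragraphs above can be sketched as a single predicate. The function name, argument names, and split labels are hypothetical, and how a sample is flagged as containing complex math is left to the caller.

```python
# Hypothetical names throughout; a sketch of the triage rule, not the team's tooling.
HUMAN_ONLY_SPLITS = {"test", "generalization"}

def needs_human_review(round1: str, round2: str,
                       has_complex_math: bool, split: str) -> bool:
    """Decide whether a sample is sent to human annotators."""
    if split in HUMAN_ONLY_SPLITS:
        return True          # these splits are fully re-labeled by humans
    if round1 != round2:
        return True          # the two automatic labeling rounds disagree
    return has_complex_math  # complex expressions are re-checked regardless
```

Under this rule, a training sample skips human review only when both automatic rounds agree and it contains no complex mathematical expression.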

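With this carefully designed pipeline for data collection, response generation, data splitting, and annotation, the team built the Verify Answer for Reasoning (VAR) dataset, a diverse, high-quality dataset of long-chain reasoning responses.

The VAR dataset consists of a training set (43,204 samples), a test set (6,122), and a generalization set (6,468). The training and test sets are used to train and evaluate the xVerify models, while the generalization set complements the test set by checking how well xVerify generalizes to more diverse evaluation scenarios.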

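📌 Stage 2: Model training

Using the LLaMA-Factory framework, the study fine-tuned multiple models on the VAR training set. The base models span different architectures (such as the Qwen2.5, Gemma2, and Phi-4 series) and parameter scales (0.5B to 32B).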

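All of the resulting xVerify models reach high accuracy, and they clearly outperform traditional methods on core tasks such as handling long reasoning chains and judging the equivalence of mathematical expressions.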

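Experimental comparison

The study compares xVerify with mainstream evaluation methods (evaluation frameworks and judge models) on four question types: multiple-choice, math, short-answer, and classification. The results show that xVerify has a significant accuracy advantage on every question type and generalizes well.

Even the smallest model, xVerify-0.5B-I, outperforms every method except GPT-4o across the board, while xVerify-3B-Ib outperforms all other methods.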
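As a rough illustration of how accuracy per question type can be computed for any evaluator, the sketch below compares an evaluator's verdicts with human labels. The record keys `question_type`, `human_label`, and `judge_label` are hypothetical and do not come from the xVerify codebase.

```python
from collections import defaultdict

def accuracy_by_type(records):
    """Share of samples, per question type, where the judge matches the human label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        if r["judge_label"] == r["human_label"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

Reporting the figure per question type, rather than pooled, keeps a large category such as math from hiding weak performance on a smaller one such as short-answer questions.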

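Evaluation results on the test set

Evaluation results on the generalization set

The paper's appendix also provides full evaluation results for all 14 xVerify models and compares the evaluation efficiency of all judge models and xVerify models. The experiments show that the xVerify models run inference significantly more efficiently than the other models.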

xVerify evaluation examples

● Math problems

{"index": 1,"question": "$已知函式f(x)=|x-a|+2|x+1|(a>0)。$\n$若a=3，求不等式 f(x)>5 的解集；$","correct_answer": "$(-\\infty ,-\\frac{4}{3})\\cup (0,+\\infty )$","llm_output": "\n首先，我們考慮函式 \\( f(x) = |x - 3| + 2|x + 1| \\) 在 \\( a = 3 \\) 時的形式。\n\n接下來，我們需要確定函式 \\( f(x) \\) 的分段形式。由於 \\( |x - 3| \\) 和 \\( |x + 1| \\) 都是絕對值函式，我們需要找到 \\( x \\) 的值，使得這些絕對值函式內部的表示式為零，即 \\( x = 3 \\) 和 \\( x = -1 \\)。\n\n因此，函式 \\( f(x) \\) 可以分為三個部分：\n\n1. 當 \\( x < -1 \\) 時，\\( f(x) = -(x - 3) + 2(-(x + 1)) = -x + 3 - 2x - 2 = -3x + 1 \\)。\n2. 當 \\( -1 \\leq x < 3 \\) 時，\\( f(x) = -(x - 3) + 2(x + 1) = -x + 3 + 2x + 2 = x + 5 \\)。\n3. 當 \\( x \\geq 3 \\) 時，\\( f(x) = (x - 3) + 2(x + 1) = x - 3 + 2x + 2 = 3x - 1 \\)。\n\n現在，我們要求解不等式 \\( f(x) > 5 \\)。\n\n- 對於 \\( x < -1 \\)，不等式變為 \\( -3x + 1 > 5 \\)，解得 \\( x < -\\frac{4}{3} \\)。\n- 對於 \\( -1 \\leq x < 3 \\)，不等式變為 \\( x + 5 > 5 \\)，解得 \\( x > 0 \\)。\n- 對於 \\( x \\geq 3 \\)，不等式變為 \\( 3x - 1 > 5 \\)，解得 \\( x > 2 \\)。\n\n將這些解集合並，我們得到 \\( x \\) 的解集為 \\( x < -\\frac{4}{3} \\) 或 \\( x > 0 \\)。\n\nThe answer is \\( x < -\\frac{4}{3} \\) 或 \\( x > 0 \\).","xVerify_judgment": "Correct"},{"index": 2,"question": "In each town in ARMLandia, the residents have formed groups, which meet each week to share math problems and enjoy each others' company over a potluck-style dinner. Each town resident belongs to exactly one group. Every week, each resident is required to make one dish and to bring it to his/her group.\n\nIt so happens that each resident knows how to make precisely two dishes. Moreover, no two residents of a town know how to make the same pair of dishes. Shown below are two example towns. In the left column are the names of the town's residents. 
Adjacent to each name is the list of dishes that the corresponding resident knows how to make.\n\n| ARMLton |  |\n| :--- | :--- |\n| Resident | Dishes |\n| Paul | pie, turkey |\n| Arnold | pie, salad |\n| Kelly | salad, broth |\n\n\n| ARMLville |  |\n| :--- | :--- |\n| Resident | Dishes |\n| Sally | steak, calzones |\n| Ross | calzones, pancakes |\n| David | steak, pancakes |\n\nThe population of a town $T$, denoted $\\operatorname{pop}(T)$, is the number of residents of $T$. Formally, the town itself is simply the set of its residents, denoted by $\\left\\{r_{1}, \\ldots, r_{\\mathrm{pop}(T)}\\right\\}$ unless otherwise specified. The set of dishes that the residents of $T$ collectively know how to make is denoted $\\operatorname{dish}(T)$. For example, in the town of ARMLton described above, pop(ARMLton) $=3$, and dish(ARMLton) $=$ \\{pie, turkey, salad, broth\\}.\n\nA town $T$ is called full if for every pair of dishes in $\\operatorname{dish}(T)$, there is exactly one resident in $T$ who knows how to make those two dishes. In the examples above, ARMLville is a full town, but ARMLton is not, because (for example) nobody in ARMLton knows how to make both turkey and salad.\n\nDenote by $\\mathcal{F}_{d}$ a full town in which collectively the residents know how to make $d$ dishes. That is, $\\left|\\operatorname{dish}\\left(\\mathcal{F}_{d}\\right)\\right|=d$.\nLet $n=\\operatorname{pop}\\left(\\mathcal{F}_{d}\\right)$. 
In terms of $n$, compute $d$.","correct_answer": "(1 + (1 + 8n)^(1/2)) / 2","llm_output": "To find the number of dishes $d$ in terms of the population $n$ of a full town $\mathcal{F}_{d}$, we need to understand how the residents of the town can make pairs of dishes.\n\nSince each resident knows how to make precisely two dishes, and no two residents of a town know how to make the same pair of dishes, we can think of each resident as a unique pair of dishes.\n\nIn a full town, for every pair of dishes in $\operatorname{dish}(T)$, there is exactly one resident in $T$ who knows how to make those two dishes. This means that each pair of dishes corresponds to exactly one resident.\n\nTherefore, the number of pairs of dishes is equal to the number of residents. The number of pairs of dishes can be calculated using the combination formula: $\binom{d}{2} = \frac{d(d-1)}{2}$.\n\nSince each resident corresponds to a unique pair of dishes, the number of residents is equal to the number of pairs of dishes: $n = \frac{d(d-1)}{2}$.\n\nWe can rearrange this equation to solve for $d$: $d^2 - d - 2n = 0$.\n\nUsing the quadratic formula, we can solve for $d$: $d = \frac{1 \pm \sqrt{1 + 8n}}{2}$.\n\nSince $d$ must be a positive integer, we take the positive root: $d = \frac{1 + \sqrt{1 + 8n}}{2}$.","xVerify_judgment": "Correct"},{"index": 3,"question": "Consider an $m$-by-$n$ grid of unit squares, indexed by $(i,j)$ with $1 \\leq i \\leq m$ and $1 \\leq j \\leq n$. There are $(m-1)(n-1)$ coins, which are initially placed in the squares $(i,j)$ with $1 \\leq i \\leq m-1$ and $1 \\leq j \\leq n-1$. If a coin occupies the square $(i,j)$ with $i \\leq m-1$ and $j \\leq n-1$ and the squares $(i+1,j), (i,j+1)$, and $(i+1,j+1)$ are unoccupied, then a legal move is to slide the coin from $(i,j)$ to $(i+1,j+1)$. 
How many distinct configurations of coins can be reached starting from the initial configuration by a (possibly empty) sequence of legal moves?","correct_answer": "$\\binom{m+n-2}{m-1}$","llm_output": "Here's how to solve this problem:\n\n**Understanding the Problem**\n\nWe have a grid where coins can move diagonally to empty spaces. The key is to realize that the coins effectively \"fill\" the grid in a specific way.  \n\n**Key Observation**\n\n*  A coin can only move to an empty square diagonally adjacent to it.\n*  Once a coin is placed in a square, it cannot move back to its original position.\n\n**Solution Approach**\n\n1. **Final State:**  Think about the final possible configurations.  The coins will end up in a specific set of squares.  \n\n2. **Constraints:**  The constraints on the moves dictate that the coins will always form a pattern where they occupy the squares along a diagonal.\n\n3. **Counting Configurations:**  The number of distinct configurations is determined by the number of ways to choose which squares along the diagonal will be occupied by coins.\n\n**Formalizing the Counting**\n\nLet's represent the grid with a sequence of 0s and 1s, where 1 represents a coin and 0 represents an empty square.  Since the coins always move diagonally, the sequence will have a specific pattern.  \n\nFor example, if  *m* = 4 and *n* = 4, the initial configuration would be:\n\n```\n1111\n1000\n1000\n1000\n```\n\nAnd a possible final configuration could be:\n\n```\n1010\n0101\n0000\n0000\n```\n\nNotice that the final configuration has a diagonal pattern of 1s.  The number of distinct configurations is determined by the number of ways to choose the positions of the 1s in this diagonal pattern.\n\n\n**The answer is <(m-1)(n-1)>.** \n\n\n\nLet me know if you'd like a more detailed explanation of why this is the case!","xVerify_judgment": "Incorrect"},{"index": 4,"question": "A cattle breeder owns 52 dairy cows. Each cow gives 1000 oz of milk per day. 
Calculate the amount of milk produced per week by the cows.","correct_answer": "$3.64 \\times 10^5","llm_output": "Let's calculate the total amount of milk produced per week by the cows.\n\n**Step 1: Determine the number of days in a week.**\n- A week has **7 days**.\n\n**Step 2: Calculate the total milk production per day.**\n- Number of cows = 52\n- Milk produced per cow per day = 1000 oz\n- **Total milk per day** = \\( 52 \\text{ cows} \\times 1000 \\text{ oz/day} = 52,000 \\text{ oz/day} \\)\n\n**Step 3: Calculate the total milk production per week.**\n- **Total milk per week** = \\( 52,000 \\text{ oz/day} \\times 7 \\text{ days} = 364,000 \\text{ oz} \\)\n\n**Final Answer:**\nThe total amount of milk produced per week by the cows is Three hundred sixty-four thousand.","xVerify_judgment": "Correct"}

● Short-answer questions

{"index": 5,"question": "What is greater: the combined 2011 populations of Rennington (Northumberland), Lydbrook (Gloucestershire), Stow-on-the-Wold (Gloucestershire) and Witney (Oxfordshire), or the 2022 population of London?","correct_answer": "The 2022 population of London","llm_output": "To answer this question, we need to find the populations of the given locations and compare them to the population of London in 2022.\n\n1. Find the 2011 populations of Rennington (Northumberland), Lydbrook (Gloucestershire), Stow-on-the-Wold (Gloucestershire), and Witney (Oxfordshire).\n   - Rennington (Northumberland): approximately 190 residents\n   - Lydbrook (Gloucestershire): approximately 1,341 residents\n   - Stow-on-the-Wold (Gloucestershire): approximately 3,555 residents\n   - Witney (Oxfordshire): approximately 29,469 residents\n\n2. Calculate the combined population of these locations in 2011.\n   - Combined population: 190 + 1341 + 3555 + 29469 = 37,355\n\n3. Find the 2022 population of London.\n   - The 2022 population of London is approximately 9,000,000.\n\n4. Compare the combined population of the four locations to the population of London in 2022.\n   - The population of London (approximately 9,000,000) is greater than the combined population of the four locations (approximately 37,355).\n\nThe answer is The population of London is greater.","xVerify_judgment": "Correct"}

● Multiple-choice questions

{"index": 6,"question": "Identity refers to one's  Answer Choices:  (A) Use of a particular coping strategy  (B) Control over one's own life story  (C) Physical appearance and overall personality  (D) Coherent sense of self","correct_answer": "D","llm_output": "The question asks about \"Identity,\" which is a complex concept related to how we perceive ourselves. \n\nLet's analyze the options:\n\n* **Use of a particular coping strategy:** This relates to how we deal with stress and challenges, not our sense of self.\n* **Control over one's own life story:** This touches on agency and self-determination, which are aspects of identity, but not the core definition.\n* **Physical appearance and overall personality:**  These are external factors that can influence how we perceive ourselves, but they don't fully encompass the essence of identity.\n* **Coherent sense of self:** This is the most accurate answer. Identity is about having a clear understanding of who you are, your values, beliefs, and how you see yourself in the world.","xVerify_judgment": "Correct"}

● Classification questions

{"index": 7,"question": "Where Is the Real Jayne Ann Krentz?: It was hard to believe that this novel was written by JAK. The prologue was the best part of the book. It started out with an explosion of character involvement and then the fire went out. She failed to give the reader a true insight into the main characters. I was hard pressed to understand why the two main characters were attracted to each other. I could only assume it was because they were trying to save a small company. There was no \"electricity\" between them. She had numerous secondary characters that never stood on their own. I read the entire book in one weekend, only because if I put the book down, I could not remember who was who. I hope the real JAK comes back to us soon.\nPlease identify the sentiment polarity of the sentence: positive or negative","correct_answer": "negative","llm_output": "To identify the sentiment polarity of the sentence, let's break it down:\n\n1. The reviewer states that it was \"hard to believe\" that the novel was written by JAK, implying that they were disappointed with the quality of the book.\n2. They mention that the prologue was the best part of the book, but this is a negative statement because it implies that the rest of the book was not as good.\n3. The reviewer criticizes the author for failing to give the reader a true insight into the main characters and for not making the characters' attraction to each other believable.\n4. They also mention that the secondary characters were not well-developed, which is another negative point.\n5. The reviewer states that they read the book in one weekend, but this is not a positive statement because it implies that they were trying to finish the book quickly rather than enjoying it.\n6. 
Finally, they express a hope that the \"real JAK\" will return, which implies that they are disappointed with the current quality of the author's work.\n\nOverall, the sentiment polarity of the sentence is negative.","xVerify_judgment": "Correct"}

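Outlook

xVerify has open-sourced its full evaluation code (GitHub) and some of the xVerify models (Hugging Face), which users can call directly for evaluation tasks. The team plans to open-source more xVerify models and to validate xVerify in new application scenarios for reasoning models. Researchers and developers working on R1 model reproduction and long-reasoning evaluation are welcome to use it and join the discussion.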
