
Consider translating "How old are you?" into German. In an informal setting you would say "Wie alt bist du?", while in a formal setting you would say "Wie alt sind Sie?". The correct form depends on the context, and the output has to stay consistent with it. How do we tell the model to do that?

Traditional Beam Search
pip install -q git+https://github.com/huggingface/transformers.git
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

encoder_input_str = "translate English to German: How old are you?"

input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

# Plain beam search: keep the 10 most probable branches at each step.
outputs = model.generate(
    input_ids,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
Wie alt bist du?
Constrained Beam Search

Constrained beam search gives us the force_words_ids argument to control what the model must generate:
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

encoder_input_str = "translate English to German: How old are you?"
force_words = ["Sie"]

input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
# Token ids of the words the output is required to contain.
force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids

outputs = model.generate(
    input_ids,
    force_words_ids=force_words_ids,
    num_beams=5,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
Wie alt sind Sie?
["raining", "rained", "rains",...]
中的任何一個都可以呢。更進一步,我們經常不想要精確到一個字母都不差的詞來作為強制輸出子內容。from
transformers
import
GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

force_word = "scared"
force_flexible = ["scream", "screams", "screaming", "screamed"]

force_words_ids = [
    # One candidate only: "scared" itself is required.
    tokenizer([force_word], add_prefix_space=True, add_special_tokens=False).input_ids,
    # Several candidates: any one of these forms satisfies the constraint.
    tokenizer(force_flexible, add_prefix_space=True, add_special_tokens=False).input_ids,
]

starting_text = ["The soldiers", "The child"]

input_ids = tokenizer(starting_text, return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    force_words_ids=force_words_ids,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[1], skip_special_tokens=True))
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Output:
----------------------------------------------------------------------------------------------------
The soldiers, who were all scared and screaming at each other as they tried to get out of the
The child was taken to a local hospital where she screamed and scared for her life, police said.
The first generated sentence used screaming, the second used screamed, and both used scared.
Another way to view the first step of beam search with num_beams=3: instead of committing to the single most probable continuation, The dog, beam search also allows The nice and The car to stay under consideration.

The number of branches kept alive is num_beams, and we cannot make it too large: following every branch for n generation steps would mean evaluating on the order of num_beams^n branches. That number grows very quickly; num_beams=10 over 10 steps already means 10^10 = 10,000,000,000 branches.

A branch finishes when it emits <eos> or when the number of generated tokens reaches the upper limit. Every step of the computation goes through the same cycle: enumerate all candidate branches, sort them, prune back down to num_beams, and repeat.

Now suppose we want to force the phrase "is fast" into the output. At each step, ordinary beam search takes the top k most probable next tokens and adds them all to the pool under consideration. In constrained beam search we still do this, but we also inject our forced tokens: alongside high-probability tokens like dog and nice, we push the forced token is into the candidate branches, so the search can make progress toward the phrase we want, is fast.
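To make the expand-sort-prune cycle concrete, here is a minimal toy sketch of a single beam-search step. The beams, helper function, and log-probabilities are all made up for illustration; this is not the transformers implementation:

import heapq

# Current beams as (cumulative log-prob, tokens); made-up values.
beams = [(-0.1, ["The", "dog"]), (-0.4, ["The", "nice"]), (-0.9, ["The", "car"])]
num_beams = 3

def next_token_logprobs(tokens):
    # Stand-in for the model's next-token distribution (made-up numbers).
    return {"is": -0.2, "and": -0.7, "runs": -1.1}

# Expand: score every continuation of every beam.
candidates = []
for log_prob, tokens in beams:
    for token, lp in next_token_logprobs(tokens).items():
        candidates.append((log_prob + lp, tokens + [token]))

# Sort all expanded branches and prune back down to num_beams.
beams = heapq.nlargest(num_beams, candidates, key=lambda c: c[0])
print(beams)

Constrained beam search runs this same cycle, but before pruning it would also append each beam extended with its next forced token (here, is), even when that token is not among the top k.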
Banks
When we naively force "is fast" into the output, most of the time we end up with an illogical result such as "The is fast". This is actually a rather complex problem; its subtleties are discussed in depth in the huggingface/transformers feature request issue. With num_beams=3 we only keep three branches, leaving ["The is fast", "The dog is", "The dog and"], the highest-probability candidates of Bank 2, Bank 1, and Bank 0 respectively. "The is fast" fully satisfies our constraint, but it is not a sensible phrase. Fortunately, the "The dog is" and "The dog and" branches remain available for later steps; they stand a good chance of producing more sensible output and eventually displacing "The is fast" from Bank 2's ranking.
"The is fast"
分支的下一個token預測,不再需要加入強制限制token了,因為強制限制token已經完全滿足了。同時注意分支如"The dog is slow"
或"The dog is mad"
,它們雖然包含了限制詞"is"
,但是在"is"
後面加入了"slow"
。因此只能重新開始生成"is fast"
,所以它們從Bank 1回到了Bank 0。"The dog is fast"
,即滿足了強制限制的短語,又滿足較高的輸出機率,即符合常識。"The is fast"
已經在輪序排程選擇(round-robin selection)中被排除掉了,因為它只在Bank 2中排到最後一名,如上圖所示。model.generate()
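As a rough illustration of this bookkeeping, the following sketch (with made-up scores, not the actual transformers code) groups candidates into banks by how many constraint tokens they have fulfilled, then selects beams round-robin from the most-fulfilled bank downward:

from collections import defaultdict

# Candidates as (score, text, constraint tokens fulfilled); made-up values.
candidates = [
    (-2.5, "The is fast", 2),   # constraint fully satisfied, but low quality
    (-0.3, "The dog is", 1),
    (-0.5, "The dog and", 0),
    (-0.6, "The nice is", 1),
    (-0.8, "The car is", 1),
]
num_beams = 3

# Group candidates into banks keyed by fulfilled constraint tokens.
banks = defaultdict(list)
for score, text, fulfilled in candidates:
    banks[fulfilled].append((score, text))
for bank in banks.values():
    bank.sort(reverse=True)  # best score first within each bank

# Round-robin: take the best remaining candidate from each bank,
# highest bank first, until num_beams beams are selected.
selected = []
while len(selected) < num_beams:
    for fulfilled in sorted(banks, reverse=True):
        if banks[fulfilled] and len(selected) < num_beams:
            selected.append(banks[fulfilled].pop(0))
print(selected)

With these particular scores the first round selects exactly ["The is fast", "The dog is", "The dog and"], matching the example above; in later steps, better candidates reaching Bank 2 push "The is fast" out.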
In model.generate() we already have force_words_ids to control forced generation, but we can design the implementation better: model each constraint as a constraint object that keeps track, during beam search, of the next token it still has to generate, as shown below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, PhrasalConstraint
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

encoder_input_str = "translate English to German: How old are you?"

# A PhrasalConstraint forces the token sequence for "Sie" to appear.
constraints = [
    PhrasalConstraint(
        tokenizer("Sie", add_special_tokens=False).input_ids
    )
]

input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    constraints=constraints,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
----------------------------------------------------------------------------------------------------
Wie alt sind Sie?
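The constraint objects also cover the earlier "any of these forms will do" case: alongside PhrasalConstraint, transformers provides a DisjunctiveConstraint that is satisfied by any one of several token sequences. A brief sketch reusing the GPT-2 setup from above (the exact output will depend on the model):

from transformers import GPT2LMHeadModel, GPT2Tokenizer, DisjunctiveConstraint, PhrasalConstraint

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Any one of these surface forms satisfies the disjunctive constraint.
flexible_ids = tokenizer(
    ["scream", "screams", "screaming", "screamed"],
    add_prefix_space=True, add_special_tokens=False,
).input_ids

constraints = [
    PhrasalConstraint(
        tokenizer("scared", add_prefix_space=True, add_special_tokens=False).input_ids
    ),
    DisjunctiveConstraint(flexible_ids),
]

input_ids = tokenizer("The child", return_tensors="pt").input_ids

outputs = model.generate(
    input_ids,
    constraints=constraints,
    num_beams=10,
    num_return_sequences=1,
    no_repeat_ngram_size=1,
    remove_invalid_values=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))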
New constraint types such as OrderedConstraints and TemplateConstraints may be added in the future. The current constraint classes only require that the sub-sequences appear somewhere in the output; their position does not matter. For example, one of the earlier generations had scared followed by screaming, and another had screamed followed by scared. OrderedConstraints would let users specify the order in which the constraints must be fulfilled, and TemplateConstraints would let users specify richer structure, for example:
starting_text = "The woman"
template = ["the", "", "School of", "", "in"]

possible_outputs == [
    "The woman attended the Ross School of Business in Michigan.",
    "The woman was the administrator for the Harvard School of Business in MA.",
]

starting_text = "The woman"
template = ["the", "", "", "University", "", "in"]

possible_outputs == [
    "The woman attended the Carnegie Mellon University in Pittsburgh.",
]
impossible_outputs == [
    "The woman attended the Harvard University in MA.",
]
To summarize, constrained beam search gives us flexible ways to steer generation:

- forcing the output to contain certain phrases;
- allowing some of those phrases to be chosen from a list of alternatives while others are fixed;
- generating the phrases at specified positions.

Work in this direction includes:
- Guided Open Vocabulary Image Captioning with Constrained Beam Search
- Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
- Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting
- Guided Generation of Cause and Effect
