NAACL is a top-tier academic conference in natural language processing. To further promote international academic exchange, the Qingyuan Society will host the "Qingyuan Seminar | NAACL Special Online Sharing Session" on August 4 from 09:00 to 12:20 in the morning, convened by Xiangru Tang, a member of the Qingyuan research group and a PhD student at Yale University.
The session focuses on two frontier topics, language models and text summarization, and brings together eight first authors of NAACL papers on these topics for talks and a panel discussion.

Yusheng Su
蘇裕勝
PhD student in Computer Science, Tsinghua University

Yusheng Su is a third-year PhD student in computer science at Tsinghua University. His research focuses on natural language processing (pre-trained language models). He has published multiple papers at venues such as WWW, NAACL, ACL, and IEEE TASLP, and has served as a reviewer for COLING, EMNLP, ACL, NAACL, ICML, and other conferences.
On Transferability of Prompt Tuning for Natural Language Processing
Prompt tuning (PT), which matches the performance of full-parameter fine-tuning while adjusting only a small number of parameters, is a parameter-efficient way to use very large pre-trained language models (PLMs). Compared with fine-tuning, however, PT requires more training time. We therefore explore whether PT can be enhanced through prompt transfer, and in this work we empirically study the transferability of prompts across different downstream tasks and across PLMs of different types and scales.
We find that:
(1) In the zero-shot setting, trained prompts transfer effectively to similar tasks on the same PLM, and also to other PLMs for similar tasks.
(2) Trained prompts can further serve directly as initializations for prompts on similar tasks, speeding up PT training.
(3) To explore what governs transferability, we study various transferability indicators and find that the overlap rate of the neurons activated by prompts correlates strongly with transferability. Our results suggest that prompt transfer is a promising way to enhance PT, and we encourage further research on how prompts activate PLMs to perform various tasks.
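The recipe in findings (2) and (3) can be sketched in a few lines. Everything below is a toy illustration, not the paper's code: the activation vectors are synthetic, and `neuron_overlap_rate` stands in for the neuron-overlap indicator described in the talk.

```python
import random

def transfer_prompt(source_prompt):
    """Finding (2): reuse a prompt trained on a source task as the
    initialization for prompt tuning on a similar target task."""
    return list(source_prompt)

def neuron_overlap_rate(act_a, act_b, top_k=100):
    """Finding (3), toy version: overlap rate of the top-k neurons
    most strongly activated by two different prompts."""
    top = lambda act: set(sorted(range(len(act)), key=lambda i: -act[i])[:top_k])
    return len(top(act_a) & top(act_b)) / top_k

random.seed(0)
# Synthetic activations of 1,000 feed-forward neurons under three prompts.
act_src = [random.gauss(0, 1) for _ in range(1000)]
act_sim = [0.9 * a + 0.1 * random.gauss(0, 1) for a in act_src]  # similar task
act_dis = [random.gauss(0, 1) for _ in range(1000)]              # unrelated task
# A higher overlap rate predicts better prompt transferability.
```

Under these synthetic activations, `neuron_overlap_rate(act_src, act_sim)` comes out far higher than `neuron_overlap_rate(act_src, act_dis)`, mirroring the correlation the talk reports.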
Xuandong Zhao
趙宣棟
PhD student in Computer Science, UC Santa Barbara

Xuandong Zhao is a third-year PhD student in computer science at UCSB, advised by Lei Li and Yu-Xiang Wang. He has interned at companies including Alibaba and Microsoft. His research interests are machine learning and natural language processing (model protection and privacy protection).
Provably Confidential Language Modelling
Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter all privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
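The screening-and-redaction step can be illustrated with a toy sketch. Note this shows only the redaction of flagged segments; CRT's provable guarantee additionally comes from randomizing parts of the training process itself, which this sketch omits. The regex policy and mask token are hypothetical.

```python
import re

def redact(tokens, is_confidential, mask="<MASK>"):
    """Replace segments flagged by a (possibly imperfect) screening
    policy with a mask token before the example enters training."""
    return [mask if is_confidential(t) else t for t in tokens]

# Hypothetical screening policy: SSN-shaped strings are confidential.
policy = lambda t: bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", t))
sample = "my ssn is 123-45-6789".split()
redacted = redact(sample, policy)  # the SSN never reaches the model
```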
Weiyan Shi
史唯豔
PhD student, Columbia University

My main research focus is dialogue systems, especially strategic and influential dialogue systems (e.g., persuasive dialogue systems). My other research interests include dialogue generation and privacy-preserving NLP models.
Selective Differential Privacy for Language Modeling
With the increasing applications of language models, it has become crucial to protect these models from leaking private information. Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees. However, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens in the data. Given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. To realize such a new notion, we develop a corresponding privacy mechanism, Selective-DPSGD, for RNN-based language models. Besides language modeling, we also apply the method to a more concrete application–dialog systems. Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines.
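The selective notion above can be made concrete with a toy single update in the spirit of Selective-DPSGD, assuming per-token gradients are available: only gradients from tokens flagged as private by a hypothetical policy are clipped and noised, while public tokens keep their exact gradients. This is an illustration of the idea, not the paper's mechanism.

```python
import random
import re

def is_private(token):
    """Hypothetical policy: digit runs (phone numbers, IDs, ...) are sensitive."""
    return bool(re.fullmatch(r"\d+", token))

def selective_dp_step(weights, per_token_grads, tokens,
                      lr=0.1, clip=1.0, noise_std=0.5):
    """One toy update: clip and noise only the gradients of private tokens."""
    new_w = list(weights)
    for grad, tok in zip(per_token_grads, tokens):
        if is_private(tok):
            norm = sum(g * g for g in grad) ** 0.5
            scale = min(1.0, clip / (norm + 1e-12))
            grad = [g * scale + random.gauss(0, noise_std) for g in grad]
        for i, g in enumerate(grad):
            new_w[i] -= lr * g
    return new_w

random.seed(0)
tokens = ["call", "me", "at", "5551234"]
grads = [[0.3, -0.1], [0.2, 0.0], [0.1, 0.1], [5.0, -4.0]]  # private grad is large
updated = selective_dp_step([0.0, 0.0], grads, tokens)
```

Because most tokens are public, their updates stay exact, which is why the selective notion recovers utility that undifferentiated DP training gives up.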
Jingfeng Yang
楊靖鋒
Research Scientist, Amazon

Jingfeng Yang is currently a research scientist at Amazon (having for now declined a PhD offer in NLP from the University of Washington's computer science department). He received his master's degree from Georgia Tech, advised by Prof. Diyi Yang, and a dual bachelor's degree in biology and computer science from Peking University. His research focuses on semantic parsing, text generation, and multilingual NLP. He has published multiple first-author papers at ACL, EMNLP, and NAACL, served as a reviewer for ACL, EMNLP, NAACL, NeurIPS, AAAI, and other venues, and held research internships at Google, Amazon, Microsoft, and the University of Edinburgh.
Compositional Generalization in the Large Language Model Era
Compositional generalization remains one of the most important challenges for large models, and is key to reasoning, out-of-distribution generalization, and the ultimate goal of artificial general intelligence. Our two NAACL papers propose two approaches, from two different perspectives, to strengthen models' compositional generalization. From the model perspective, we use sequential prompt filling and an ensemble of pre-trained and fine-tuned models to improve out-of-distribution generalization while preserving in-distribution performance; we find that constrained decoding with the pre-trained model, together with probability renormalization over the constrained vocabulary, is the key to this technique's success. From the data perspective, we propose data augmentation via subtree substitution on semantic trees, and then use the augmented data to train a seq2seq generation model. Both methods achieve clear improvements on a range of compositional semantic-parsing benchmarks.
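The renormalization step named above is simple to state concretely. A minimal sketch, with a made-up next-token distribution and an allowed set standing in for whatever the constrained vocabulary permits at this decoding step:

```python
def renormalize(next_token_probs, allowed):
    """Constrained decoding: drop tokens outside the allowed vocabulary
    and renormalize the remaining probability mass to sum to 1."""
    total = sum(p for t, p in next_token_probs.items() if t in allowed)
    return {t: p / total for t, p in next_token_probs.items() if t in allowed}

# Hypothetical next-token distribution from a pre-trained LM.
probs = {"SELECT": 0.5, "the": 0.3, "WHERE": 0.2}
constrained = renormalize(probs, allowed={"SELECT", "WHERE"})
```

Without the renormalization, the surviving probabilities would no longer form a distribution, which distorts any score-based comparison between candidate parses.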
Jiacheng Xu
徐嘉誠
Research Scientist, Salesforce Research

Jiacheng Xu is a research scientist at Salesforce Research, focusing on frontier research in natural language processing, especially natural language generation and text summarization. He received his PhD from the University of Texas at Austin in 2022, advised by Greg Durrett, and his bachelor's degree from Fudan University in 2017, working with Profs. Xipeng Qiu and Xuanjing Huang. He previously interned at Google (2020) and Microsoft (2019).
Massive-scale Decoding for Text Generation using Lattices
Conditional neural text generation models generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both summarization and machine translation, we show that our algorithm encodes thousands of diverse options that remain grammatical and high-quality into one lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.
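The two ingredients of the abstract, best-first search and hypothesis recombination, can be illustrated on a toy generation graph. The successor table and the one-token recombination key below are invented for the example; the paper operates on real model distributions and richer merging criteria.

```python
import heapq

# Toy next-token expansions with log-probabilities (hypothetical model).
SUCCESSORS = {
    "<s>": [("the", -0.2), ("a", -0.4)],
    "the": [("cat", -0.3), ("dog", -0.5)],
    "a":   [("cat", -0.6), ("dog", -0.4)],
    "cat": [("</s>", -0.1)],
    "dog": [("</s>", -0.2)],
}

def best_first_lattice(max_steps=50, suffix_len=1):
    """Best-first search with hypothesis recombination: partial hypotheses
    whose last `suffix_len` tokens match are merged into one lattice node
    instead of being expanded separately."""
    frontier = [(0.0, ("<s>",))]   # min-heap on accumulated negative log-prob
    seen = {}                      # suffix -> best score (recombination table)
    edges = []                     # merge edges collected into the lattice
    finished = []
    steps = 0
    while frontier and steps < max_steps:
        score, hyp = heapq.heappop(frontier)
        steps += 1
        key = hyp[-suffix_len:]
        if key in seen:            # recombine: merge into the existing node
            edges.append((hyp[:-1], key, score))
            continue
        seen[key] = score
        if hyp[-1] == "</s>":
            finished.append((score, hyp))
            continue
        for tok, logp in SUCCESSORS.get(hyp[-1], []):
            heapq.heappush(frontier, (score - logp, hyp + (tok,)))
    return finished, edges
```

Each merge records an extra path through an existing node, which is how the lattice comes to encode far more options than the number of expansions performed.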
Xiangru Tang
唐相儒
PhD student, Yale University

Xiangru Tang is a first-year PhD student in computer science at Yale University, advised by Mark Gerstein. He previously received a master's degree in computer science from Yale, working with Dragomir Radev. His main research interests are pre-trained language models, text generation, and computational biology.
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models and conducted human evaluation on popular dialog summarization dataset. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle top factual errors from our annotation, we introduce additional contrastive loss with carefully designed hard negative samples and self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.
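CONFIT's contrastive term can be sketched with a generic InfoNCE-style loss. The scores and negatives here are invented for illustration, and the real method pairs this with the dialogue-specific self-supervised loss mentioned above, which is not shown.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss: push the faithful reference summary (positive)
    above hard negatives that contain injected factual errors."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# Hypothetical model scores for one dialogue: negatives are built by,
# e.g., swapping speaker names in the reference summary.
loss_sharp = contrastive_loss(pos_score=2.0, neg_scores=[0.1, -0.5])
loss_flat = contrastive_loss(pos_score=0.0, neg_scores=[2.0, 1.5])
```

Minimizing this loss widens the margin between the faithful summary and its corrupted variants, which is the mechanism behind the reduction in factual errors.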
Yue Fang
房越
Graduate student, Beijing University of Posts and Telecommunications

Yue Fang is a second-year master's student at the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, researching dialogue summarization.
From spoken dialogue to formal summary: An utterance rewriting for dialogue summarization
Due to the dialogue characteristics of unstructured contexts and multi-parties with first-person perspective, many successful text summarization works have failed when dealing with dialogue summarization. In dialogue summarization task, the input dialogue is usually spoken style with ellipsis and co-references but the output summaries are more formal and complete. Therefore, the dialogue summarization model should be able to complete the ellipsis content and co-reference information and then produce a suitable summary accordingly. However, the current state-of-the-art models pay more attention on the topic or structure of summary, rather than the consistency of dialogue summary with its input dialogue context, which may suffer from the personal and logical inconsistency problem. In this paper, we propose a new model, named ReWriteSum, to tackle this problem. Firstly, an utterance rewriter is conducted to complete the ellipsis content of dialogue content and then obtain the rewriting utterances. Then, the co-reference data augmentation mechanism is utilized to replace the referential person's name with its specific name to enhance the personal information.
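A toy flavor of the rewriting step, assuming the only phenomenon to resolve is first-person reference; the actual rewriter also completes ellipses and relies on learned co-reference resolution rather than string matching.

```python
def rewrite_utterances(dialogue):
    """Rewrite each turn so first-person pronouns name the speaker,
    turning spoken-style turns into more formal, self-contained text."""
    rewritten = []
    for speaker, text in dialogue:
        words = [speaker if w == "I" else w for w in text.split()]
        rewritten.append((speaker, " ".join(words)))
    return rewritten

dialogue = [("Alice", "I will send the report"), ("Bob", "I got it")]
formal = rewrite_utterances(dialogue)
```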
Xiangci Li
李向磁
PhD student in Computer Science, UT Dallas

Xiangci Li is a second-year PhD student at UT Dallas, advised by Prof. Jessica Ouyang. His main research direction is scientific-literature processing (information extraction and related-work generation). He received his master's degree from the University of Southern California, working with Nanyun Peng. He has interned at the Chan Zuckerberg Initiative, Baidu, and Tencent America AI Lab.
CORWA: A Citation-Oriented Related Work Annotation Dataset
Academic research is an exploratory activity to discover new solutions to problems. By this nature, academic research works perform literature reviews to distinguish their novelties from prior work. In natural language processing, this literature review is usually conducted under the “Related Work” section. The task of related work generation aims to automatically generate the related work section given the rest of the research paper and a list of papers to cite. Prior work on this task has focused on the sentence as the basic unit of generation, neglecting the fact that related work sections consist of variable length text fragments derived from different information sources. As a first step toward a linguistically-motivated related work generation framework, we present a Citation Oriented Related Work Annotation (CORWA) dataset that labels different types of citation text fragments from different information sources. We train a strong baseline model that automatically tags the CORWA labels on massive unlabeled related work section texts. We further suggest a novel framework for human-in-the-loop, iterative, abstractive related work generation.
