In spring 2018, the Cx Oath lab released the largest dataset of article summaries to date. The Newsroom dataset consists of 1.3 million article summaries and was designed for training and evaluating automatic summarization systems. The summaries were written in the newsrooms of 38 major publications between 1998 and 2017 and show a wide variety of summarization styles. The dataset is released with tools to explore the data, compare summarization styles across publications and over time, and evaluate the performance of existing state-of-the-art automatic summarization systems.
The Newsroom dataset and its accompanying paper were presented at the 2018 conference of the North American Chapter of the Association for Computational Linguistics (NAACL) by authors Max Grusky, Mor Naaman, and Yoav Artzi.