Reuters Releases Free News Archive

Volume 18, Issue 4 — April 2001

Reuters has announced that, for the first time, it is making large quantities of archived Reuters news stories available free of charge to research communities around the world. The first Reuters Corpus archive includes over 800,000 English-language news stories, equivalent to Reuters’ annual global news output.

According to the announcement, the Reuters Corpus offers researchers a unique body of static information on which to research, test, and benchmark emerging technologies such as language processing, speech synthesis, voice recognition, indexing, search, and information retrieval.

Richard Willis, head of research and standards at the Reuters Chief Technology Office, said: "Reuters has always been heavily involved in language and data research. And to strengthen our links with the research community around the world, we have made available one of the most complete news archives ever released. The data provided will aid research into many aspects of language processing and information retrieval."

The archive includes all English-language stories produced by Reuters globally between August 1996 and August 1997. The news data is formatted in XML and is available on two CD-ROMs. Every story is tagged with category codes for topic, geography, and industry sector, drawn from a total of 775 different codes.

Marc Moens, head of Edinburgh University’s Language Technology Group, said: "Because of its size and the amount of preparation that has gone into it, the Reuters collection provides scope for many new types of research and development work. It allows for the systematic evaluation of progress and comparison of results between different development groups. I am sure this corpus will soon be seen as a standard in document-related work."

Yorick Wilks, a professor at Sheffield University, said: "We can already see the potential benefits of such a corpus for stylistic language analysis. The topic codes would also give us the opportunity to analyze the geographic location, industry area, or topic that received news coverage from Reuters. Areas such as semantic Web applications, categorization research, and machine learning of topic routings would also benefit. This will be a very useful resource."

As part of the research agreement covering use of the archive, researchers will supply Reuters with a copy of any material published using the data. Working with this feedback from research groups, Reuters hopes to introduce other corpora, including multilingual versions and volumes covering other date ranges. Further information on the Corpus is available at http://www.reuters.com/researchandstandards/corpus.

Source: Reuters, London, 011-44-20-7542-6487; http://www.reuters.com.
