YAWN: A semantically annotated Wikipedia XML corpus

Ralf Schenkel, Fabian Suchanek, Gjergji Kasneci

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

73 Scopus citations

Abstract

The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples of how such annotations can be exploited for high-precision queries.
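
As an illustration of the template step mentioned in the abstract, the following is a minimal sketch of how a template invocation with named parameters could be mapped to self-explaining XML tags. It is not the YAWN implementation: the function name `template_to_xml`, the tag-naming rule, and the sample infobox are assumptions made for this example, and nested templates or links containing `|` are not handled.

```python
import re
import xml.etree.ElementTree as ET


def template_to_xml(invocation: str) -> str:
    """Turn a flat template invocation with named parameters into XML.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    body = invocation.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template invocation")

    # Split on '|' (assumes no nested templates or piped links).
    parts = [p.strip() for p in body[2:-2].split("|")]

    # The template name becomes the enclosing tag; runs of
    # non-word characters are replaced by underscores.
    root = ET.Element(re.sub(r"\W+", "_", parts[0].lower()))

    for part in parts[1:]:
        if "=" not in part:
            continue  # skip positional (unnamed) parameters
        key, value = part.split("=", 1)
        tag = re.sub(r"\W+", "_", key.strip().lower())
        if not tag:
            continue  # skip parameters with an empty name
        child = ET.SubElement(root, tag)
        child.text = value.strip()

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "{{Infobox Person | name = Albert Einstein | birth_place = Ulm}}"
    print(template_to_xml(sample))
```

Each named parameter becomes a child element whose tag is the parameter name, so a query can address `birth_place` directly instead of searching free text.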

Original language: English
Title of host publication: Datenbanksysteme in Business, Technologie und Web, BTW 2007 - 12th Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), Proceedings
Pages: 277-291
Number of pages: 15
State: Published - 2007
Externally published: Yes
Event: 12th Symposium of the German Informatics Society Section "Databases and Information Systems" (DBIS) on Database Systems in Business, Technology and Web, BTW 2007 - Aachen, Germany
Duration: 7 Mar 2007 to 9 Mar 2007
