Information extraction

From Wikipedia, the free encyclopedia

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semistructured information from unstructured machine-readable documents. It is a sub-discipline of language engineering, a branch of computer science.

It aims to apply methods and technologies from practical computer science, such as compiler construction and artificial intelligence, to the problem of automatically processing unstructured textual data, with the objective of extracting structured knowledge in some domain. A typical example is the extraction of information on corporate merger events, whereby instances of the relation $\mathrm{MERGE}(\mathit{company}_1, \mathit{company}_2, \mathit{date})$ are extracted from online news ("Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp.").
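The merger example can be illustrated with a small pattern-based extractor. The following is only a minimal sketch, not a description of how any particular IE system works: the names `Merge` and `extract_merges`, the company-suffix list, and the single "announced their/its acquisition of" pattern are assumptions made for illustration, and the only date expression handled is the relative word "yesterday", resolved against a publication date supplied by the caller.

```python
import datetime as dt
import re
from typing import List, NamedTuple, Optional


class Merge(NamedTuple):
    """One instance of the MERGE(company1, company2, date) relation."""
    company1: str
    company2: str
    date: Optional[dt.date]


# Assumption: a company name is one or more capitalized words followed by
# a legal suffix such as "Inc." or "Corp.".
_COMPANY = r"(?:[A-Z]\w* )+(?:Inc|Corp|Ltd)\."

# Assumption: only this one surface pattern signals a merger event.
_MERGE = re.compile(
    rf"(?P<c1>{_COMPANY}) announced (?:their|its) acquisition of (?P<c2>{_COMPANY})"
)


def extract_merges(text: str, published: dt.date) -> List[Merge]:
    """Return every MERGE instance matched by the pattern in `text`."""
    # Resolve only the relative expression "yesterday"; otherwise the
    # date slot is left unknown (None).
    if re.search(r"\byesterday\b", text, re.IGNORECASE):
        when: Optional[dt.date] = published - dt.timedelta(days=1)
    else:
        when = None
    return [Merge(m["c1"], m["c2"], when) for m in _MERGE.finditer(text)]


news = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
merges = extract_merges(news, published=dt.date(2024, 1, 2))
```

Here `merges` holds a single tuple whose slots are filled from the sentence. Hand-written patterns of this kind are brittle, which is one reason IE systems tend to be restricted to narrow domains.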

The significance of information extraction lies in the growing amount of information available in unstructured form (i.e. without metadata), for instance on the Internet. This knowledge can be made more accessible by transforming it into relational form.

A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. Current approaches to IE use natural language processing techniques that focus on very restricted domains. For example, the Message Understanding Conference (MUC) is a competition-based conference that focused on the following domains in the past:

MUC-1 (1987), MUC-2 (1989): Naval operations messages.
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
MUC-5 (1993): Joint ventures and microelectronics.
MUC-6 (1995): News articles on management changes.
MUC-7 (1998): Satellite launch reports.
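The database-population application mentioned above, in which extracted information fills a relational table, might be sketched as follows. This is an illustrative assumption rather than a standard design: the table name `merger_events`, its columns, the function `populate_database`, and the sample record are all hypothetical, and the extraction step itself is taken as already done.

```python
import sqlite3
from typing import Iterable, Tuple

# (company1, company2, ISO-formatted date) as produced by some extractor.
Record = Tuple[str, str, str]


def populate_database(conn: sqlite3.Connection, records: Iterable[Record]) -> None:
    """Store extracted relation instances as rows of a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS merger_events ("
        "company1 TEXT, company2 TEXT, event_date TEXT)"
    )
    conn.executemany("INSERT INTO merger_events VALUES (?, ?, ?)", records)
    conn.commit()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
populate_database(conn, [("Foo Inc.", "Bar Corp.", "2024-01-01")])

# Once in relational form, the information can be queried directly.
rows = conn.execute(
    "SELECT company2, event_date FROM merger_events WHERE company1 = ?",
    ("Foo Inc.",),
).fetchall()
```

The final query shows the benefit of the relational form: a question such as "which companies did Foo Inc. acquire?" becomes an ordinary database lookup.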

Natural language texts may first need some form of text simplification to produce more easily machine-readable text for extraction.

Typical subtasks of IE are:

Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
Terminology extraction: finding the relevant terms for a given corpus.
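
As an illustration of the named entity recognition subtask listed above, the following minimal sketch tags organization names, temporal expressions, and two kinds of numerical expression using hand-written patterns. The labels, the patterns, and the function name `tag_entities` are assumptions chosen for this example; they cover only a few surface forms and are not drawn from any specific system.

```python
import re
from typing import List, Tuple

_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Assumption: each entity type is recognized by one simple pattern.
_PATTERNS = [
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z]\w* )+(?:Inc|Corp|Ltd)\.")),
    ("DATE", re.compile(rf"\b(?:\d{{1,2}} )?(?:{_MONTHS}) \d{{4}}\b")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
]


def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs in order of appearance."""
    spans = []
    for label, pattern in _PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), label, match.group()))
    spans.sort()  # order by start offset in the text
    return [(label, surface) for _, label, surface in spans]


sentence = "Foo Inc. paid $3.5 million for 12% of Bar Corp. on 4 March 1998."
entities = tag_entities(sentence)
```

For the sample sentence, `entities` contains the two organization names, the monetary amount, the percentage, and the date, in the order in which they occur.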