Abstract:
Over the last two decades, a huge amount of data has become available owing to the exponential growth of the World Wide Web. Such data consist mostly of unstructured or semi-structured texts, which often contain references to structured information (e.g., person names, contact records). Information Extraction (IE) is the discipline that aims to discover structured information in unstructured or semi-structured text corpora.
More precisely, in this report we focus on two IE-related tasks, namely Named-Entity Recognition (NER) and Relation Extraction (RE). Solutions to these tasks have been successfully applied in several domains. For example, Web search engines have recently started rendering structured answers on their result pages, even though they draw on largely unstructured Web documents.
Concretely, we propose a novel method to infer relations among entities, which we tested and evaluated on a real-world application scenario: entertainment event news, where, starting from a generic press review, we try to discover new events hidden in it. Our method is divided into two steps, each addressing one IE task. The first step concerns NER and uses a supervised learning technique to automatically identify named entities in unstructured news text. The second step deals with RE and introduces a novel unsupervised learning strategy to automatically infer relations between the entities detected during the first step.
Finally, we evaluated the two parts of the system using well-known measures on a real dataset. For the first part, the results show that our NER approach performs consistently with existing state-of-the-art solutions. For the RE approach, experimental results indicate that, if enough relevant material can be found on the Web (in our case, documents concerning the candidate event), it is possible to infer correct relations that lead to the discovery of new events.