Abstract:
Natural language, spoken and written, is the most important way for humans to communicate information to each other.
In recent decades, \emph{natural language processing} (NLP) researchers have studied methods aimed at making computers ``understand'' the information contained in human language.
\emph{Information Extraction} (IE) is a field of NLP that studies methods aimed at extracting information from text so that it can be used to populate a structured information repository, such as a relational database.
The main approaches to IE rely on supervised machine learning, which needs human-labeled examples in order to train the systems that extract information from previously unseen data.
The best-performing supervised machine learning methods for IE are probabilistic graphical models, and specifically Conditional Random Fields (CRFs).
In this thesis we investigate two major aspects of IE from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art linear-chain CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE.
In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs that, unlike the commonly adopted linear-chain CRFs, are customized to the structure of the task at hand.
We exemplify this approach on two different tasks, i.e., IE from medical documents and \emph{opinion mining} from product reviews.
CRFs, like any machine learning-based approach, may suffer if the quality of training data is low.
Therefore, the second part of the thesis is devoted to (1) the study of how the quality of training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised \emph{active learning} (AL).
We start by addressing the task of IE from medical documents written in Italian; this consists of extracting chunks of text that instantiate concepts of interest for medical practitioners, such as drug dosages, pathologies, and treatments.
We propose two novel approaches: a cascaded, two-stage method composed of two layers of CRFs, and a confidence-weighted ensemble method that combines standard linear-chain CRFs and the proposed two-stage method.
Both proposed models are shown to outperform a standard linear-chain CRF system for IE.
We then investigate the problem of aspect-oriented sentence-level opinion mining from product reviews, which consists of predicting, for each sentence in a review, whether the sentence expresses a positive, neutral, or negative opinion (or no opinion at all) about a specific aspect of the product.
We propose a set of increasingly powerful models based on CRFs, including a hierarchical multi-label CRF scheme that jointly models the overall opinion expressed in a product review and the set of aspect-specific opinions expressed in each of its sentences.
In this task too, the proposed CRF models obtain better results than linear-chain CRFs.
We then study the impact that the quality of training data has on the learning process and thus on the accuracy of a classifier.
Low quality in training data sometimes derives from the fact that the person who annotated the data is not an expert in the domain of the data.
We test the impact of training data quality on the accuracy of IE systems oriented to the clinical domain.
Finally, we investigate AL as a means of obtaining good-quality training data with minimal human annotation effort.
We propose several AL strategies for a type of semi-supervised CRFs specifically devised for partially labeled sequences.
We show that, among the proposed strategies, margin-based strategies always obtain the best results on the four tasks on which we tested them.
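
To make the notion of a margin-based strategy concrete, the following is a minimal sketch under stated assumptions, not the strategies actually proposed in the thesis. It assumes that the CRF exposes per-token label marginals, that the margin of a token is the difference between its two most probable labels, and that a sequence is scored by its smallest token margin. The names \texttt{token\_margin}, \texttt{sequence\_margin}, and \texttt{select\_for\_annotation} are hypothetical.

```python
from typing import Dict, List

# One label distribution per token, e.g. [{"B": 0.9, "O": 0.1}, ...].
# Assumption: these marginals come from some trained CRF; none is trained here.
Marginals = List[Dict[str, float]]


def token_margin(dist: Dict[str, float]) -> float:
    """Difference between the two most probable labels of one token."""
    probs = sorted(dist.values(), reverse=True)
    best = probs[0]
    second = probs[1] if len(probs) > 1 else 0.0
    return best - second


def sequence_margin(marginals: Marginals) -> float:
    """Score a sequence by its least confident token (smallest margin)."""
    return min(token_margin(dist) for dist in marginals)


def select_for_annotation(pool: Dict[str, Marginals], k: int) -> List[str]:
    """Return the ids of the k unlabeled sequences with the smallest margin."""
    ranked = sorted(pool, key=lambda seq_id: sequence_margin(pool[seq_id]))
    return ranked[:k]


if __name__ == "__main__":
    # Toy pool of unlabeled sequences with made-up marginals.
    pool = {
        "s1": [{"B": 0.9, "O": 0.1}, {"B": 0.8, "O": 0.2}],
        "s2": [{"B": 0.55, "O": 0.45}],
        "s3": [{"B": 0.7, "O": 0.3}],
    }
    print(select_for_annotation(pool, k=2))
```

A small margin means the model hesitates between two labels, so the corresponding sequences are the ones sent to the human annotator first. Other aggregations over tokens (for example the mean margin, or selecting individual tokens instead of whole sequences) are equally plausible and are not ruled out by the text above.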