Beyond linear chain: a journey through conditional random fields for information extraction from text

dc.contributor.advisor	Focardi, Riccardo	it_IT
dc.contributor.author	Marcheggiani, Diego <1983>	it_IT
dc.date.accessioned	2014-03-11	it_IT
dc.date.accessioned	2015-01-17T08:21:38Z
dc.date.issued	2014-06-12	it_IT
dc.identifier.uri	http://hdl.handle.net/10579/5588
dc.description.abstract	Natural language, spoken and written, is the most important way for humans to communicate information to each other. In recent decades, natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information enclosed in human language. Information Extraction (IE) is a field of NLP that studies methods for extracting information from text so that it can be used to populate a structured information repository, such as a relational database. The main approaches to IE rely on supervised machine learning, which needs human-labeled examples in order to train the systems that extract information from as yet unseen data. The best-performing supervised machine learning methods for IE are probabilistic graphical models and, specifically, Conditional Random Fields (CRFs). In this thesis we investigate two major aspects of IE from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art linear-chain CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE. In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs that, unlike the commonly adopted linear-chain CRFs, are customized to the structure of the task at hand. We exemplify this approach on two different tasks: IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).
We start by tackling the task of IE from medical documents written in Italian; this consists of extracting chunks of text that instantiate concepts of interest for medical practitioners, such as drug dosages, pathologies, and treatments. We propose two novel approaches: a cascaded, two-stage method composed of two layers of CRFs, and a confidence-weighted ensemble method that combines standard linear-chain CRFs and the proposed two-stage method. Both proposed models are shown to outperform a standard linear-chain CRF IE system. We then investigate the problem of aspect-oriented sentence-level opinion mining from product reviews, which consists of predicting, for each sentence in a review, whether the sentence expresses a positive, neutral, or negative opinion (or no opinion at all) about a specific aspect of the product. We propose a set of increasingly powerful models based on CRFs, including a hierarchical multi-label CRF scheme that jointly models the overall opinion expressed in a product review and the set of aspect-specific opinions expressed in each of its sentences. In this task too, the proposed CRF models obtain better results than linear-chain CRFs. We then study the impact that the quality of training data has on the learning process and thus on the accuracy of a classifier. Low quality in training data sometimes derives from the fact that the person who annotated the data is not an expert in the data domain. We test the impact of training data quality on the accuracy of IE systems oriented to the clinical domain. We finally investigate the process of AL in order to obtain good-quality training data with minimum human annotation effort. We propose several AL strategies for a type of semi-supervised CRFs specifically devised for partially labeled sequences. We show that, among the proposed strategies, margin-based strategies always obtain the best results on the four tasks on which we tested them.	it_IT
dc.language.iso	en	it_IT
dc.publisher	Università Ca' Foscari Venezia	it_IT
dc.rights	© Diego Marcheggiani, 2014	it_IT
dc.title	Beyond linear chain: a journey through conditional random fields for information extraction from text	it_IT
dc.title.alternative		it_IT
dc.type	Doctoral Thesis	it_IT
dc.degree.name	Informatica	it_IT
dc.degree.level	Dottorato di ricerca	it_IT
dc.degree.grantor	Scuola di dottorato in Scienze e tecnologie (SDST)	it_IT
dc.description.academicyear	2014 (proroghe semestrali 2012/2013)	it_IT
dc.description.cycle	26	it_IT
dc.degree.coordinator	Focardi, Riccardo	it_IT
dc.location.shelfmark	D001398	it_IT
dc.location	Venezia, Archivio Università Ca' Foscari, Tesi Dottorato	it_IT
dc.rights.accessrights	openAccess	it_IT
dc.thesis.matricno	955882	it_IT
dc.format.pagenumber	[10], XIV, 117 p.	it_IT
dc.subject.miur	INF/01 INFORMATICA	it_IT
dc.description.note		it_IT
dc.degree.discipline		it_IT
dc.contributor.co-advisor	Sebastiani, Fabrizio	it_IT
dc.date.embargoend
dc.provenance.upload	Diego Marcheggiani (955882@stud.unive.it), 2014-03-11	it_IT
dc.provenance.plagiarycheck	Riccardo Focardi (focardi@unive.it), 2014-05-05	it_IT