Beyond linear chain: a journey through conditional random fields for information extraction from text


dc.contributor.advisor Focardi, Riccardo it_IT
dc.contributor.author Marcheggiani, Diego <1983> it_IT
dc.date.accessioned 2014-03-11 it_IT
dc.date.accessioned 2015-01-17T08:21:38Z
dc.date.issued 2014-06-12 it_IT
dc.identifier.uri http://hdl.handle.net/10579/5588
dc.description.abstract Natural language, spoken and written, is the most important way for humans to communicate information to each other. In recent decades, natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information enclosed in human language. Information Extraction (IE) is a field of NLP that studies methods for extracting information from text so that it can be used to populate a structured information repository, such as a relational database. The main approaches to IE rely on supervised machine learning, which needs human-labeled examples in order to train the systems that extract information from previously unseen data. The best-performing supervised machine learning methods for IE are certainly probabilistic graphical models and, specifically, Conditional Random Fields (CRFs). In this thesis we investigate two major aspects of IE from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art linear-chain CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE. In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs that, unlike the commonly adopted linear-chain CRFs, are customized to the structure of the task under consideration. We exemplify this approach on two different tasks, namely IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).
We start by addressing the task of IE from medical documents written in Italian; this consists of extracting chunks of text that instantiate concepts of interest for medical practitioners, such as drug dosages, pathologies, and treatments. We propose two novel approaches: a cascaded, two-stage method composed of two layers of CRFs, and a confidence-weighted ensemble method that combines standard linear-chain CRFs and the proposed two-stage method. Both proposed models are shown to outperform a standard linear-chain CRF IE system. We then investigate the problem of aspect-oriented sentence-level opinion mining from product reviews, which consists of predicting, for each sentence in a review, whether the sentence expresses a positive, neutral, or negative opinion (or no opinion at all) about a specific aspect of the product. We propose a set of increasingly powerful models based on CRFs, including a hierarchical multi-label CRF scheme that jointly models the overall opinion expressed in a product review and the set of aspect-specific opinions expressed in each of its sentences. In this task, too, the proposed CRF models obtain better results than linear-chain CRFs. We then study the impact that the quality of training data has on the learning process and thus on the accuracy of a classifier. Low quality in training data sometimes derives from the fact that the person who annotated the data is not an expert in the data domain. We test the impact of training data quality on the accuracy of IE systems oriented to the clinical domain. We finally investigate the process of AL in order to obtain good-quality training data with minimum human annotation effort. We propose several AL strategies for a type of semi-supervised CRF specifically devised for partially labeled sequences. We show that, among the proposed strategies, margin-based strategies always obtain the best results on the four tasks we have tested them on. it_IT
dc.language.iso en it_IT
dc.publisher Università Ca' Foscari Venezia it_IT
dc.rights © Diego Marcheggiani, 2014 it_IT
dc.title Beyond linear chain: a journey through conditional random fields for information extraction from text it_IT
dc.type Doctoral Thesis it_IT
dc.degree.name Informatica it_IT
dc.degree.level Dottorato di ricerca it_IT
dc.degree.grantor Scuola di dottorato in Scienze e tecnologie (SDST) it_IT
dc.description.academicyear 2014 (proroghe semestrali 2012/2013) it_IT
dc.description.cycle 26 it_IT
dc.degree.coordinator Focardi, Riccardo it_IT
dc.location.shelfmark D001398 it_IT
dc.location Venezia, Archivio Università Ca' Foscari, Tesi Dottorato it_IT
dc.rights.accessrights openAccess it_IT
dc.thesis.matricno 955882 it_IT
dc.format.pagenumber [10], XIV, 117 p. it_IT
dc.subject.miur INF/01 INFORMATICA it_IT
dc.contributor.co-advisor Sebastiani, Fabrizio it_IT
dc.provenance.upload Diego Marcheggiani (955882@stud.unive.it), 2014-03-11 it_IT
dc.provenance.plagiarycheck Riccardo Focardi (focardi@unive.it), 2014-05-05 it_IT

