Improving the quality of text clustering and cluster labeling

dc.contributor.advisor Orlando, Salvatore <1961> it_IT
dc.contributor.author Pourvali, Mohsen <1984> it_IT
dc.date.accessioned 2016-12-11 it_IT
dc.date.accessioned 2017-05-22T06:02:42Z
dc.date.available 2017-05-22T06:02:42Z
dc.date.issued 2017-03-01 it_IT
dc.identifier.uri http://hdl.handle.net/10579/10311
dc.description.abstract The amount of available electronic information is growing rapidly with advances in digital processing. The resulting volumes of textual data call for efficient techniques that can organize the data into manageable forms. To tackle this challenge, clustering algorithms try to automatically group similar documents. While clustering plays a significant role in categorizing documents, it has intrinsic limits when it comes to helping human users understand the content of documents at a deeper level. This is where cluster labeling techniques come in. The goal of cluster labeling is to label - i.e., describe in an informative way - clusters of documents according to their content. Document clustering and cluster labeling are two vital problems in the information retrieval domain because they make it possible to organize an increasing amount of text and to describe it concisely. In this thesis, we address these problems in three parts. In the first part, we investigate how to improve the effectiveness of text clustering by summarizing some documents in a corpus, specifically those that are significantly longer than the mean. The contribution of this part is twofold. First, we show that text summarization can improve the performance of classical text clustering algorithms, in particular by reducing the noise from long documents that can negatively affect clustering results. Second, we show that clustering quality can be used to quantitatively evaluate different summarization methods. In the second part, we explore a multi-strategy technique that enriches documents in order to improve clustering quality. Specifically, we use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in texts. We further investigate ensemble clustering in order to combine multiple clustering results, generated based on combinations of a specific set of features, into a single result of better quality. In the third part, we investigate the problem of cluster labeling, whose quality depends on the quality of document clustering. To this end, we first explore and categorize cluster labeling techniques, providing a thorough discussion of the relevant state-of-the-art literature. We then present a fusion-based topic modeling approach that enriches the document vectors of a corpus with the aim of improving the quality of text clustering. We further exploit these vectors through a fusion method for cluster labeling. Finally, we experimentally demonstrate the effectiveness of the solutions presented in the three parts on clustering and cluster labeling problems using various datasets. it_IT
dc.language.iso en it_IT
dc.publisher Università Ca' Foscari Venezia it_IT
dc.rights © Mohsen Pourvali, 2017 it_IT
dc.title Improving the quality of text clustering and cluster labeling it_IT
dc.title.alternative it_IT
dc.type Doctoral Thesis it_IT
dc.degree.name Informatica it_IT
dc.degree.level Dottorato di ricerca it_IT
dc.degree.grantor Dipartimento di Scienze Ambientali, Informatica e Statistica it_IT
dc.description.academicyear 2015/2016, session of the 29th cycle it_IT
dc.description.cycle 29 it_IT
dc.degree.coordinator Focardi, Riccardo it_IT
dc.location.shelfmark D001702 it_IT
dc.location Venezia, Archivio Università Ca' Foscari, Tesi Dottorato it_IT
dc.rights.accessrights openAccess it_IT
dc.thesis.matricno 956115 it_IT
dc.format.pagenumber XI, 107 p. it_IT
dc.subject.miur INF/01 INFORMATICA it_IT
dc.description.note it_IT
dc.degree.discipline it_IT
dc.provenance.upload Mohsen Pourvali (956115@stud.unive.it), 2016-12-11 it_IT
dc.provenance.plagiarycheck Salvatore Orlando (orlando@unive.it), 2017-01-19 it_IT

