Improving the quality of text clustering and cluster labeling

dc.contributor.advisor	Orlando, Salvatore <1961>	it_IT
dc.contributor.author	Pourvali, Mohsen <1984>	it_IT
dc.date.accessioned	2016-12-11	it_IT
dc.date.accessioned	2017-05-22T06:02:42Z
dc.date.available	2017-05-22T06:02:42Z
dc.date.issued	2017-03-01	it_IT
dc.identifier.uri	http://hdl.handle.net/10579/10311
dc.description.abstract	The amount of available electronic information is increasing rapidly with advances in digital processing. Furthermore, huge amounts of textual data have given rise to the need for efficient techniques that can organize the data in manageable forms. To tackle this challenge, clustering algorithms try to group similar documents automatically. While clustering plays a significant role in categorizing documents, it has intrinsic limits when it comes to helping human users understand the content of documents at a deeper level. This is where cluster labeling techniques come into play. The goal of cluster labeling is to label - i.e., describe in an informative way - clusters of documents according to their content. Document clustering and cluster labeling are two vital problems in the information retrieval domain because of their ability to organize an increasing amount of text and to describe it concisely. In this thesis, we address these problems in three parts. In the first part, we investigate how we can improve the effectiveness of text clustering by summarizing some documents in a corpus, specifically those that are significantly longer than the mean. The contribution of this part is twofold. First, we show that text summarization can improve the performance of classical text clustering algorithms, in particular by reducing the noise from long documents that can negatively affect clustering results. Second, we show that clustering quality can be used to quantitatively evaluate different summarization methods. In the second part, we explore a multi-strategy technique that aims at enriching documents to improve clustering quality. Specifically, we use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in texts.
We further investigate ensemble clustering in order to combine multiple clustering results, each generated from a combination of a specific set of features, into a single result of better quality. In the third part, we investigate the problem of cluster labeling, whose quality depends on the quality of document clustering. To this end, we first explore and categorize cluster labeling techniques, providing a thorough discussion of the relevant state-of-the-art literature. We then present a fusion-based topic modeling approach that enriches the vectors of the documents in a corpus, with the aim of improving the quality of text clustering. We further exploit these vectors through a fusion method for cluster labeling. Finally, we experimentally demonstrate the effectiveness of the solutions presented in the three parts on the clustering and cluster labeling problems, using various datasets.	it_IT
dc.language.iso	en	it_IT
dc.publisher	Università Ca' Foscari Venezia	it_IT
dc.rights	© Mohsen Pourvali, 2017	it_IT
dc.title	Improving the quality of text clustering and cluster labeling	it_IT
dc.title.alternative		it_IT
dc.type	Doctoral Thesis	it_IT
dc.degree.name	Informatica	it_IT
dc.degree.level	Dottorato di ricerca	it_IT
dc.degree.grantor	Dipartimento di Scienze Ambientali, Informatica e Statistica	it_IT
dc.description.academicyear	2015/2016, session of the 29th cycle	it_IT
dc.description.cycle	29	it_IT
dc.degree.coordinator	Focardi, Riccardo	it_IT
dc.location.shelfmark	D001702	it_IT
dc.location	Venezia, Archivio Università Ca' Foscari, Tesi Dottorato	it_IT
dc.rights.accessrights	openAccess	it_IT
dc.thesis.matricno	956115	it_IT
dc.format.pagenumber	XI, 107 p.	it_IT
dc.subject.miur	INF/01 INFORMATICA	it_IT
dc.description.note		it_IT
dc.degree.discipline		it_IT
dc.provenance.upload	Mohsen Pourvali (956115@stud.unive.it), 2016-12-11	it_IT
dc.provenance.plagiarycheck	Salvatore Orlando (orlando@unive.it), 2017-01-19	it_IT