Abstract:
The amount of available electronic information is increasing rapidly with advances in digital processing. In particular, huge amounts of textual data have created a need for efficient techniques that can organize the data into manageable forms. To tackle this challenge, clustering algorithms try to automatically group similar documents. While clustering plays a significant role in categorizing documents, it has intrinsic limits when it comes to helping human users understand the content of documents at a deeper level. This is where cluster labeling techniques come in. The goal of cluster labeling is to label (i.e., describe in an informative way) clusters of documents according to their content. Document clustering and cluster labeling are two vital problems in the information retrieval domain because of their ability to organize growing volumes of text and to describe them concisely. In this thesis, we address these problems in three parts.
In the first part, we investigate how to improve the effectiveness of text clustering by summarizing some documents in a corpus, specifically those that are significantly longer than the mean. The contribution of this part is twofold. First, we show that text summarization can improve the performance of classical text clustering algorithms, in particular by reducing the noise that long documents introduce and that can negatively affect clustering results. Second, we show that clustering quality can be used to quantitatively evaluate different summarization methods.
In the second part, we explore a multi-strategy technique that enriches documents in order to improve clustering quality. Specifically, we use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in the texts. We further investigate ensemble clustering to combine multiple clustering results, each generated from a specific combination of features, into a single result of better quality.
In the third part, we investigate the problem of cluster labeling, whose quality depends on the quality of the underlying document clustering. To this end, we first explore and categorize cluster labeling techniques, providing a thorough discussion of the relevant state-of-the-art literature. We then present a fusion-based topic modeling approach that enriches the vectors of the documents in a corpus, with the aim of improving the quality of text clustering. We further exploit these vectors through a fusion method for cluster labeling.
Finally, we experimentally demonstrate the effectiveness of the solutions presented in the three parts on clustering and cluster labeling problems, using various datasets.