Beyond words: a comparative analysis of LLM embeddings for effective clustering

Published on April 16, 2024 - Intelligent Data Analysis

Authors: Imed Keraghel, Stanislas Morbieu, Mohamed Nadif

Document clustering groups similar unlabeled text documents. The task relies on document embedding techniques, which can be derived from various models, including traditional and neural network-based approaches. The emergence of Large Language Models (LLMs) has provided a new way of capturing information from texts through customized numerical representations, potentially enhancing text clustering by identifying subtle semantic connections. The objective of this paper is to demonstrate the impact of LLMs of different sizes on text clustering. To accomplish this, we select five different LLMs and compare them with three less resource-intensive embedding methods. Additionally, we utilize six clustering algorithms. We jointly assess the performance of the embedding models and clustering algorithms in terms of clustering quality, and highlight the strengths and limitations of the models under investigation.
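The evaluation design described in the abstract (every embedding method crossed with every clustering algorithm, each pair scored for clustering quality) can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption made for illustration and not taken from the paper: the two toy embedders `word_tf_embed` and `char_trigram_embed` stand in for the actual embedding models, a plain k-means stands in for the six clustering algorithms, purity stands in for whatever quality measures the authors use, and the four example documents are invented.

```python
import math
from collections import Counter
from itertools import product


def _to_unit_vectors(counts):
    """Turn a list of Counters into L2-normalised dense vectors over a shared vocabulary."""
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        v = [float(c[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def word_tf_embed(docs):
    """Hypothetical lightweight embedder: word term-frequency vectors."""
    return _to_unit_vectors([Counter(d.lower().split()) for d in docs])


def char_trigram_embed(docs):
    """Hypothetical lightweight embedder: character-trigram count vectors."""
    return _to_unit_vectors(
        [Counter(d.lower()[i:i + 3] for i in range(len(d) - 2)) for d in docs]
    )


def kmeans(vectors, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [vectors[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(vectors, key=lambda v: min(dist2(v, c) for c in centers)))
    assign = []
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda j: dist2(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(pred, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(pred):
        counts = Counter(t for p, t in zip(pred, truth) if p == cluster)
        total += counts.most_common(1)[0][1]
    return total / len(truth)


def compare(embedders, clusterers, docs, labels, k):
    """Score every (embedder, clusterer) pair on the same labelled corpus."""
    scores = {}
    for (e_name, embed), (c_name, cluster) in product(embedders.items(), clusterers.items()):
        scores[(e_name, c_name)] = purity(cluster(embed(docs), k), labels)
    return scores


# Invented toy corpus with two topics; labels are used only for evaluation.
docs = [
    "cats purr and cats nap",
    "dogs bark and dogs fetch",
    "cats nap on sofas",
    "dogs fetch the ball",
]
labels = [0, 1, 0, 1]

scores = compare(
    {"word_tf": word_tf_embed, "char_trigram": char_trigram_embed},
    {"kmeans": kmeans},
    docs,
    labels,
    k=2,
)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

In a real comparison, the embedder dictionary would hold wrappers around the LLM and baseline embedding models, and the clusterer dictionary would hold the clustering algorithms under study; the grid loop in `compare` would stay the same.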