Computer Science
Clustering and dimension reduction with missing data
Published on - 56ièmes Journées de Statistique de la SFDS
In the context of high-dimensional clustering, we focus on Reduced K-means (RKM), a subspace clustering method. RKM, however, cannot handle missing data. We propose two extensions inspired by common approaches to clustering with missing data: a direct approach based on the K-POD method, and multiple imputation (MI). Because RKM returns both a partition and its associated representation subspace, the MI approach requires an additional step to aggregate the subspaces obtained, which we propose to perform with multiple factor analysis (MFA; AFM in French). Both methods are evaluated on simulated data under the MCAR (Missing Completely At Random) and MAR (Missing At Random) mechanisms, with different rates of missing data. The evaluation covers several scenarios that vary the balance of the clusters, their sizes and their separation. The results show that both methods recover the original clusters and the corresponding subspaces across these scenarios. However, the MI-RKM method requires more time for data imputation and parameter aggregation, which increases its computational cost.
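To make the direct (K-POD style) approach concrete, here is a minimal sketch under stated assumptions. It alternates between clustering a filled-in data matrix and replacing each missing entry with the matching coordinate of its assigned centroid. For brevity it uses plain k-means rather than RKM, so it estimates no subspace and is not the authors' implementation. The function names `kmeans` and `kpod`, the simulated data and the 20% MCAR rate are all illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on complete data; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows of X (fancy indexing copies).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    # Final assignment so labels agree with the returned centroids.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), centroids

def kpod(X, k, n_outer=20, seed=0):
    """K-POD style loop: cluster, then refill missing cells from centroids."""
    mask = np.isnan(X)                       # True where a value is missing
    col_means = np.nanmean(X, axis=0)        # assumes no fully missing column
    X_filled = np.where(mask, col_means, X)  # initial fill: column means
    for _ in range(n_outer):
        labels, centroids = kmeans(X_filled, k, seed=seed)
        # Observed cells are never modified; missing cells take centroid values.
        X_new = np.where(mask, centroids[labels], X)
        if np.allclose(X_new, X_filled):
            break
        X_filled = X_new
    return labels, centroids, X_filled

# Illustrative data: two Gaussian clusters, 20% of entries removed under MCAR.
rng = np.random.default_rng(1)
X_full = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
                    rng.normal(6.0, 1.0, size=(30, 4))])
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan

labels, centroids, X_filled = kpod(X_miss, k=2)
```

Swapping `kmeans` for an RKM routine would yield a subspace alongside the partition. Under multiple imputation, the same clustering step would instead be run once per imputed data set, which is why the results then need to be aggregated.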