Machine Learning

A bipartite ranking approach to the two-sample problem

Published in: Electronic Journal of Statistics

Authors: Stéphan Clémençon, Myrto Limnios, Nicolas Vayatis

The two-sample problem consists in testing whether two independent samples are drawn from the same (unknown) distribution. Its study in high dimension has received much attention, especially because the data acquisition processes at work in the Big Data era often involve various poorly controlled sources, leading to datasets that may exhibit strong sampling bias. Classic methods compute a discrepancy measure between the empirical distributions of the two samples, and their efficiency degrades as the dimension increases. We instead develop a two-step approach based on statistical learning and an extension of rank tests. Each initial sample is split in two. On the first parts, a bipartite ranking algorithm learns a real-valued scoring function that induces a preorder on the multivariate space. Then, a rank statistic based on the scores of the remaining observations tests for differences in distribution. Because the ranking algorithm learns to map the data onto the real line much as the likelihood ratio between the original multivariate distributions does, the approach is robust to high dimensionality (ignoring ranking model bias issues) and preserves the advantages of univariate rank tests. We prove nonasymptotic error bounds based on recent results for two-sample linear rank processes, and show experimentally that the proposed approach outperforms state-of-the-art methods.
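The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. Two assumptions stand in for components the abstract leaves unspecified: the scoring function is a toy linear scorer (projection onto the difference of the sample means) rather than a real bipartite ranking algorithm, and the rank statistic is the classic Mann-Whitney-Wilcoxon statistic with a normal approximation and no tie correction.

```python
import math
import random


def fit_linear_scorer(xs, ys):
    """Toy stand-in for a bipartite ranking algorithm (assumption):
    score a point by its projection onto mean(xs) - mean(ys)."""
    dim = len(xs[0])
    mean_x = [sum(p[j] for p in xs) / len(xs) for j in range(dim)]
    mean_y = [sum(p[j] for p in ys) / len(ys) for j in range(dim)]
    w = [mx - my for mx, my in zip(mean_x, mean_y)]
    return lambda p: sum(wj * pj for wj, pj in zip(w, p))


def mann_whitney_u(scores_x, scores_y):
    """Count pairs (x, y) with score(x) > score(y); ties count 1/2."""
    u = 0.0
    for a in scores_x:
        for b in scores_y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def two_sample_rank_test(xs, ys):
    """Step 1: learn a scorer on the first half of each sample.
    Step 2: run a univariate rank test on the scores of the rest."""
    half_x, half_y = len(xs) // 2, len(ys) // 2
    scorer = fit_linear_scorer(xs[:half_x], ys[:half_y])
    sx = [scorer(p) for p in xs[half_x:]]
    sy = [scorer(p) for p in ys[half_y:]]
    n, m = len(sx), len(sy)
    u = mann_whitney_u(sx, sy)
    # Null mean and standard deviation of U (normal approximation).
    mean_u = n * m / 2.0
    sd_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # One-sided p-value P(Z >= z): the scorer is trained to rank xs above ys.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, z, p_value


# Simulated data (assumption): two Gaussian samples in dimension 5
# that differ by a mean shift of 1 in every coordinate.
rng = random.Random(0)
xs = [[rng.gauss(1.0, 1.0) for _ in range(5)] for _ in range(200)]
ys = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
u, z, p_value = two_sample_rank_test(xs, ys)
print(f"U = {u:.1f}, z = {z:.2f}, one-sided p-value = {p_value:.3g}")
```

Because the scorer is fitted on data that the rank test never sees, the held-out scores are exchangeable under the null hypothesis, so the usual null distribution of the univariate rank statistic applies whatever scorer was learned. Swapping in a more flexible ranking model only changes `fit_linear_scorer`.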