Apprentissage efficient dans les problèmes de semi-bandits stochastiques combinatoires

Publié le 30 novembre 2020

Auteurs : Pierre Perrault

Combinatorial stochastic semi-bandits appear naturally in many contexts where the exploration/exploitation dilemma arises, such as web content optimization (recommendation/online advertising) or shortest path routing methods. This problem is formulated as follows: an agent sequentially optimizes an unknown and noisy objective function, defined on a power set mathcal{P}([n]). For each set A tried out, the agent suffers a loss equal to the expected deviation from the optimal solution while obtaining observations to reduce its uncertainty on the coordinates from A. Our objective is to study the efficiency of policies for this problem, focusing in particular on the following two aspects: statistical efficiency, where the criterion considered is the regret suffered by the policy (the cumulative loss) that measures learning performance; and computational efficiency. It is sometimes difficult to combine these two aspects in a single policy. In this thesis, we propose different directions for improving statistical efficiency, while trying to maintain the computational efficiency of policies.In particular, we have improved optimistic methods by developing approximation algorithms and refining the confidence regions used. We also explored an alternative to the optimistic methods, namely randomized methods, and found them to be a serious candidate for combining the two types of efficiency.