We consider the bandit problem with infinitely many Bernoulli arms whose unknown parameters are assumed to be i.i.d. random variables with a common distribution F. Our goal is to construct optimal strategies for choosing arms so that the expected long-run failure rate is minimized. We first review a class of strategies and establish their asymptotic properties when F is known. Building on these results, we propose a new strategy and prove that it is asymptotically optimal when F is unknown. Finally, we show that the proposed strategy performs well in a number of simulation scenarios.
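To make the setting concrete, the following minimal Python sketch simulates the long-run failure rate of one simple rule on an infinite pool of Bernoulli arms. It is an illustration under stated assumptions, not the strategy proposed in the paper: the rule (keep playing the current arm until it accumulates `max_failures` failures, then discard it and draw a fresh arm from F) is chosen only for simplicity, and the names `simulate_failure_rate`, `draw_parameter`, `horizon`, and `max_failures` are hypothetical.

```python
import random


def simulate_failure_rate(draw_parameter, horizon, max_failures=1, seed=0):
    """Estimate the failure rate of a discard-after-k-failures rule.

    draw_parameter: callable taking a random.Random and returning a success
        probability in [0, 1]; it stands in for sampling from F (assumption).
    horizon: total number of plays.
    max_failures: failures tolerated on one arm before switching to a new arm.
    """
    rng = random.Random(seed)
    p = draw_parameter(rng)          # success probability of the current arm
    failures_on_arm = 0
    total_failures = 0
    for _ in range(horizon):
        if rng.random() < p:
            continue                 # success: keep playing the same arm
        total_failures += 1
        failures_on_arm += 1
        if failures_on_arm >= max_failures:
            # Discard the arm and draw a fresh one from the infinite pool.
            p = draw_parameter(rng)
            failures_on_arm = 0
    return total_failures / horizon


# Example: F taken to be Uniform(0, 1), purely for illustration.
rate = simulate_failure_rate(lambda rng: rng.random(), horizon=10_000)
print(f"empirical failure rate: {rate:.4f}")
```

Running the sketch with increasing `horizon` gives an empirical view of how the failure rate of such a rule behaves in the long run for a given choice of F.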
Journal of Statistical Planning and Inference, 142, 86-94