TY - JOUR
T1 - Bandit learning in concave N-person games
AU - Bravo, Mario
AU - Leslie, David
AU - Mertikopoulos, Panayotis
N1 - Funding Information:
M. Bravo gratefully acknowledges the support provided by FONDECYT grant 11151003. P. Mertikopoulos was partially supported by the Huawei HIRP flagship grant ULTRON, and the French National Research Agency (ANR) grant ORACLESS (ANR-16-CE33-0004-01). Part of this work was carried out with financial support by the ECOS project C15E03.
Publisher Copyright:
© 2018 Curran Associates Inc. All rights reserved.
PY - 2018
Y1 - 2018
N2 - This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.
AB - This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.
UR - http://www.scopus.com/inward/record.url?scp=85064821930&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85064821930
SN - 1049-5258
VL - 2018-December
SP - 5661
EP - 5671
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 32nd Conference on Neural Information Processing Systems, NeurIPS 2018
Y2 - 2 December 2018 through 8 December 2018
ER -