There are significant research opportunities in the integration of Machine Learning (ML) methods and Combinatorial Optimization Problems (COPs). In this work, we focus on metaheuristics to solve COPs that have an important learning component. These algorithms must explore a solution space and learn from the information they obtain in order to find high-quality solutions. Among the metaheuristics, we study Hyper-Heuristics (HHs), algorithms that, given a number of low-level heuristics, iteratively select and apply heuristics to a solution. The HH we consider has a Markov model to produce sequences of low-level heuristics, which we combine with a Multi-Armed Bandit Problem (MAB)-based method to learn its parameters. This work proposes several improvements to the HH metaheuristic that yields a better learning for solving problem instances. Specifically, this is the first work in HHs to present Exponential Weights for Exploration and Exploitation (EXP3) as a learning method, an algorithm that is able to deal with adversarial settings. We also present a case study for the Vehicle Routing Problem with Time Windows (VRPTW), for which we include a list of low-level heuristics that have been proposed in the literature. We show that our algorithms can handle a large and diverse list of heuristics, illustrating that they can be easily configured to solve COPs of different nature. The computational results indicate that our algorithms are competitive methods for the VRPTW (2.16% gap on average with respect to the best known solutions), demonstrating the potential of these algorithms to solve COPs. Finally, we show how algorithms can even detect low-level heuristics that do not contribute to finding better solutions to the problem.