Non-uniform Analysis for Non-convex Optimization in Machine Learning

  • Author / Creator
    Mei, Jincheng
  • The optimization of non-convex objective functions is a topic of central interest in machine learning. Remarkably, it has recently been shown that simple gradient-based optimization can achieve globally optimal solutions in important non-convex problems that arise in machine learning, including policy gradient optimization in reinforcement learning (RL), generalized linear model training in supervised learning (SL), and over-parameterized neural network training in deep learning. However, previous work generally relies on uniform properties of the optimization landscape, ignoring relevant problem structure, which limits both the applicability and strength of the theoretical results that can be obtained. In this thesis, motivated by fundamental problems in RL and SL, I investigate a non-uniform analysis for non-convex optimization. Chapter 2 studies policy gradient optimization (PG) in RL and resolves three open problems in the literature by introducing a new analysis tool called non-uniform Lojasiewicz inequality (NL). In particular, this chapter shows that (i) PG optimization with a softmax parameterization converges to a globally optimal policy at a O(1/t) rate; (ii) adding entropy regularization improves the convergence rate of PG to O(e^{−c t}) (where c > 0) to a regularized optimal policy; and (iii) an Omega(1/t) lower bound can be established on the worst case convergence of softmax PG. The separation of rates is further explained using the concept of the NL degree. These results provide a theoretical explanation of the optimization advantage of entropy regularization. Next, Chapter 3 reconsiders a common policy parameterization used in machine learning: the softmax transform. Two negative results are established for using the softmax transform in gradient based optimization. In particular, this chapter shows that (i) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization (“softmax gravity well”); and (ii) optimizing log-probabilities under the softmax must exhibit slow convergence (“softmax damping”). I propose an alternative escort mapping that demonstrates better optimization properties for PG and cross entropy minimization in SL. This analysis is based on the NL inequality and a new non-uniform smoothness (NS) property. These difficulties with the softmax and the advantage of the escort transform are further explained by the concept of the NL coefficient. Chapter 4 then introduces a non-uniform analysis that combines the nonuniform smoothness (NS) property and the NL inequality, using the combination to more accurately characterize non-convex objective landscapes and inspire new geometry-aware gradient descent methods. One interesting result for general optimization is that geometry-aware first-order methods can converge to global optimality faster than the classical Omega(1/t^2) lower bounds if one additionally considers these non-uniform properties. This chapter then applies new geometry-aware first-order methods to PG and generalized linear model training (GLM). For PG, it is shown that normalizing gradient ascent can accelerate convergence to O(e^{−c t}) for some c > 0, while incurring less overhead than existing algorithms. For GLM, it is shown that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. Additionally, I show that these geometry-aware gradient descent methods can escape landscape plateaus faster than standard gradient descent. Finally, Chapter 5 extends the analysis to stochastic policy optimization, and shows that the preferability of optimization methods depends critically on whether stochastic versus exact gradients are used. By introducing the concept of committal rate, this chapter contributes two key findings: (i) identifying a criterion for determining almost sure global convergence; and (ii) revealing an inherent trade-off between exploiting geometry to accelerate convergence versus achieving almost sure global optimality. This committal rate theory is then used to explain why practical policy optimization methods are sensitive to random initialization, leading to the development of an ensemble method that can be guaranteed to achieve near-optimal solutions with high probability.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.