import numpy as np
from matplotlib import pyplot as plt, animation
from mpl_toolkits.mplot3d import Axes3D
import scipy.optimize as opt
import scipy.linalg as la
Gradient descent is an iterative optimization method which, given a function $f:\mathbb{R}^n \to \mathbb{R}$, finds a minimizer of $f$ by repeatedly taking steps in the direction of the negative gradient until some stopping criterion is met. It's important to recognize that the gradient $\nabla f$ is itself a function which takes as input a vector $\mathbf{x} \in \mathbb{R}^n$ and outputs another vector $\nabla f(\mathbf{x}) \in \mathbb{R}^n$. This output vector points in the direction of greatest increase of $f$ at $\mathbf{x}$. Thus, moving in the opposite direction ensures we move in the direction of greatest decrease.
With that general idea in mind, the actual gradient descent method is defined as follows. Given an initial starting point $\mathbf{x}_0$, define a sequence $\{\mathbf{x}_k\}$ recursively as

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k),$$
where $\alpha_k$ is called the step size. Under suitable conditions on $f$ and on how the step sizes are chosen, it can be shown that the sequence $\{\mathbf{x}_k\}$ converges for all initial guesses. A proof of convergence goes beyond my focus here, but the fact that it does converge is very important to the success of gradient descent. Otherwise, there would be no guarantee that gradient descent ever reaches a minimizer, which would render it useless as a practical optimization method.
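As a minimal sketch of the update rule with a constant step size (the quadratic test function, starting point, and step size below are illustrative choices, not part of the method itself):

```python
import numpy as np

def gradient_descent_fixed(Df, x0, alpha=0.1, maxiter=1000, tol=1e-8):
    """Iterate x_{k+1} = x_k - alpha * Df(x_k) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxiter):
        grad = Df(x)
        if np.linalg.norm(grad) < tol:  # stop once the gradient is (nearly) zero
            break
        x = x - alpha * grad
    return x

# Example: f(x, y) = x**2 + y**2 has gradient (2x, 2y) and minimizer (0, 0).
Df = lambda x: 2 * x
xmin = gradient_descent_fixed(Df, [3.0, -4.0], alpha=0.1)
```

Each step shrinks the distance to the minimizer by a constant factor here, so the iterates converge geometrically to the origin.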
There are many ways one could choose the step size $\alpha_k$. One way is to define a new function $\phi_k$ as

$$\phi_k(\alpha) = f(\mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k))$$
and minimize it at each step. The function $\phi_k$ is a one-dimensional function that agrees with $f$ along the slice through $\mathbf{x}_k$ in the direction of the negative gradient.
A necessary condition for a point to minimize $f$ is that the gradient is the zero vector at that point. If the gradient is not zero there, the point is not a minimizer. Thus, in order to know when to stop the algorithm, we can check the norm of the gradient and stop when it reaches 0. In practice, we actually only need to check that the gradient is very close to zero; it will likely never be exactly zero because of the way floating-point numbers work.
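A small sketch of that tolerance check, using $f(x) = x^2$ (whose gradient is $2x$, zero only at the minimizer $x = 0$) as a stand-in:

```python
import numpy as np

# Stopping-test sketch: stop when the gradient norm falls below a tolerance.
Df = lambda x: 2 * x
tol = 1e-5

stop_far = np.linalg.norm(Df(np.array([3.0]))) < tol    # far from 0: keep iterating
stop_near = np.linalg.norm(Df(np.array([1e-9]))) < tol  # essentially at 0: stop
```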
To help visualize gradient descent, let's do a simple example. Let $f(x, y) = x^2 + 7y^2 + 2x + y + 2$ and let $\mathbf{x}_0 = (0, 1)$, so $\nabla f(x, y) = (2x + 2,\ 14y + 1)$. Suppose we start gradient descent at the point $\mathbf{x}_0$. We have $f(\mathbf{x}_0) = 10$ and $\nabla f(\mathbf{x}_0) = (2, 15)$. A 3d graph and contour plot of $f$ are shown below. The point $\mathbf{x}_0$ is the orange dot and the negative gradient vector is the black arrow.
f = lambda x: x[0]**2 + 7*x[1]**2 + 2*x[0] + x[1] + 2
Df = lambda x: np.array([2*x[0] + 2, 14*x[1] + 1])

x, y = np.linspace(-1.2, 1.2, 1000), np.linspace(-1.2, 1.2, 1000)
X, Y = np.meshgrid(x, y)
Z = f([X, Y])

# 3d surface plot with the starting point marked
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z)
ax.plot([0], [1], [f([0, 1])], marker='.', ms=20)
ax.view_init(elev=20, azim=45)
ax.set_xlabel('x', size=15)
ax.set_ylabel('y', size=15)
plt.show()

# Contour plot with the starting point and the negative gradient vector
plt.contour(X, Y, Z, levels=np.logspace(0, 1, 8))
plt.quiver([0], [1], [-2], [-15])
plt.plot(0, 1, marker='.', color='orange', ms=20)
plt.xlabel('x', size=15)
plt.ylabel('y', size=15)
plt.show()
In the contour plot, darker colors represent lower values. As you can see, the negative gradient vector is pointing in the direction of greatest decrease of $f$ at $\mathbf{x}_0 = (0, 1)$.
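We can also check numerically that the negative gradient really is the locally steepest downhill direction at $(0, 1)$, by comparing a small step along it with equal-length steps in random directions (this comparison is my own illustration, not part of the plots above):

```python
import numpy as np

f = lambda x: x[0]**2 + 7*x[1]**2 + 2*x[0] + x[1] + 2
Df = lambda x: np.array([2*x[0] + 2, 14*x[1] + 1])

x0 = np.array([0.0, 1.0])
g = Df(x0)
step = 1e-3

# Value after a small step along the normalized negative gradient...
downhill = f(x0 - step * g / np.linalg.norm(g))

# ...versus equal-length steps in 100 random unit directions.
rng = np.random.default_rng(1)
dirs = rng.normal(size=(100, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
others = [f(x0 + step * d) for d in dirs]
```

To first order, no direction of the same length can decrease $f$ more than the negative gradient does.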
Now let's visualize the function $\phi_0$. At $\mathbf{x}_0 = (0, 1)$, we have $\phi_0(\alpha) = f(-2\alpha,\ 1 - 15\alpha)$. The function $\phi_0$ is a parameterization of the curve along $f$ in the direction of $-\nabla f(\mathbf{x}_0)$. A value of $\alpha = 0$ represents the input $\mathbf{x}_0$, and as $\alpha$ increases, we move in the 1-d slice of $f$ along $-\nabla f(\mathbf{x}_0)$. The next graph is a visualization of that slice.
phi = lambda a: f([-2*a, 1 - 15*a])
a = np.linspace(-.01, 2/15, 100)
plt.plot(a, phi(a))
plt.plot(0, 10, '.', ms=15)
plt.xlabel(r'$\alpha$', size=15)
plt.show()
The code below implements the gradient descent algorithm.
def gradient_descent(f, Df, x0, maxiter=100, tol=1e-5):
    """Compute the minimizer of f using the method of gradient descent.

    Parameters:
        f (function): The objective function. Accepts a NumPy array of
            shape (n,) and returns a float.
        Df (function): The first derivative of f. Accepts and returns a
            NumPy array of shape (n,).
        x0 ((n,) ndarray): The initial guess.
        maxiter (int): The maximum number of iterations to compute.
        tol (float): The stopping tolerance.

    Returns:
        ((n,) ndarray): The approximate minimizer of f.
        (float): The minimum value of f.
        (int): The number of iterations computed.
    """
    for k in range(1, maxiter+1):
        # Compute the derivative and check the stopping criterion.
        dfx0 = Df(x0)
        if la.norm(dfx0, ord=np.inf) < tol:
            break
        # Define and optimize the 1-D function in the direction of the
        # negative gradient.
        phi = lambda a: f(x0 - a*dfx0)
        x0 = x0 - opt.minimize_scalar(phi).x*dfx0
    return x0, f(x0), k
Let's use this function to find a minimizer of $f$ from our example above.
minimizer, value = gradient_descent(f, Df, np.array([0, 1]))[:2]
print("Minimizer: ", minimizer)
print("Minimum value: ", value)
Minimizer: [-0.99999701 -0.07142863] Minimum value: 0.964285714295
While minimizing $\phi_k$ at each stage of the algorithm works, it may not always be the best choice. One reason is that finding the minimizer of $\phi_k$ may take a significant amount of time, which might be better spent computing the next iteration. Often a good, but not optimal, step size is sufficient to guarantee minimization. Choosing one can be difficult, though, because if the step size is too large, the method can overshoot and move to a higher point rather than descend.
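To see the overshoot problem concretely, here is a sketch using the quadratic from the earlier example: with a small fixed step the iterates settle toward the minimizer $(-1, -1/14)$, while a step that is too large for the steep $y$ direction makes the iterates blow up (the specific step sizes below are illustrative):

```python
import numpy as np

Df = lambda x: np.array([2*x[0] + 2, 14*x[1] + 1])

def run_fixed(alpha, steps=20):
    """Take `steps` gradient steps of constant size `alpha` from (0, 1)."""
    x = np.array([0.0, 1.0])
    for _ in range(steps):
        x = x - alpha * Df(x)
    return x

small = run_fixed(0.05)  # converges toward (-1, -1/14)
large = run_fixed(0.2)   # alpha > 1/7 overshoots in y: iterates diverge
```

In the $y$ coordinate, each step multiplies the distance to the minimizer by $|1 - 14\alpha|$, so any $\alpha > 1/7$ makes the error grow at every step.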
The other reason is that the function might have a narrow canyon near a minimum. In this case, gradient descent can oscillate between the sides of the canyon and descend very slowly. A good example of this type of function is the Rosenbrock function, defined as $f(x, y) = (1 - x)^2 + 100(y - x^2)^2$. The global minimizer of the Rosenbrock function is $(1, 1)$. The contour plot below shows how steep and narrow this function is.
x = np.linspace(-5, 5, 200)
X, Y = np.meshgrid(x, x)
Z = opt.rosen([X, Y])

fig, ax = plt.subplots()
cax = ax.contourf(X, Y, Z, 50)
fig.colorbar(cax)
plt.show()
Gradient descent takes over 11,000 iterations to compute the minimizer starting at $(-2, -2)$, which is relatively close to the minimizer.
gradient_descent(opt.rosen, opt.rosen_der, np.array([-2,-2]), maxiter=20000)
(array([ 0.9999912 , 0.99998234]), 7.775709809680371e-11, 11990)
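For comparison, a quasi-Newton method handles this canyon far better. Here is a sketch using SciPy's BFGS implementation from the same starting point (the exact iteration count varies by SciPy version, but it is orders of magnitude smaller than 11,990):

```python
import numpy as np
import scipy.optimize as opt

# BFGS builds up curvature information as it iterates, so it does not
# zig-zag across the canyon the way plain gradient descent does.
result = opt.minimize(opt.rosen, np.array([-2.0, -2.0]),
                      jac=opt.rosen_der, method='BFGS')
```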
Use in Deep Learning
One of the most well-known uses of gradient descent is in deep learning. Deep learning refers to the use of "deep" neural networks for artificial intelligence. Neural networks can become quite complex, but when it comes down to it, a neural network is just a function. Specifically, it's a function that is defined by a large set of parameters. When someone refers to a neural network "learning", what's really going on is that the network is adjusting these parameters in order to produce more sensible output. One of the most common ways it can adjust these parameters is by treating them as inputs to another function called the loss function, then optimizing the loss function with gradient descent.
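As a toy sketch of that idea, consider a "network" with a single weight $w$, a squared-error loss, and a hand-derived gradient (real frameworks have many parameters and compute gradients automatically via backpropagation; the data and learning rate here are illustrative):

```python
import numpy as np

# Toy model: y_hat = w * x, trained to fit data generated with w_true = 3.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x

# Loss(w) = mean((w*x - y)**2); its gradient is 2 * mean((w*x - y) * x).
w = 0.0
learning_rate = 0.1
for _ in range(200):
    grad = 2 * np.mean((w * x - y) * x)
    w -= learning_rate * grad
```

The loop drives the loss gradient to zero, so the weight converges to the value that generated the data.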