Skimmed the beginning of the paper; ISTM that a more accurate, less clickbait description of the work would be: for L-Lipschitz-smooth convex functions, there is an existing best-known bound on the convergence rate of gradient descent with constant step sizes. The paper proves a better bound for a variant of gradient descent that takes occasional big steps. Unclear to me whether the existing bound is known to be tight (i.e., whether there are known worst-case functions that sit right at the bound).
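For reference, the standard constant-step rate they'd be improving on (a textbook fact; notation mine): for convex f with L-Lipschitz gradient, k iterations of gradient descent with step size 1/L guarantee

$$ f(x_k) - f(x^\star) \le \frac{L \, \|x_0 - x^\star\|^2}{2k}. $$

If I'm reading it right, the new schedule improves the constant in front of the 1/k, not the 1/k rate itself.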
Convex functions do not have suboptimal local optima, so this is seemingly unrelated to the known effective technique for escaping those that other commenters are talking about.
Yea, like the ones where we throw an entire AWS server room at trying multiple random seeds with gradient descent until it works, when honestly an RNG might work just as well in a few contrived cases, and we write our paper / make our product from it.
Seems a little dismissive. I think it's a pretty interesting result from a theoretical point of view, and it might be applied in situations where you break a harder optimization problem into smaller problems whose piecewise parts have certain likely shapes. Regardless, it's not at all intuitive that this pattern should be optimal in any gradient descent setting, so understanding it might yield more fundamental concepts around general optimization problems.
I'm confused, we've known about this for the past 60+ years, most commonly as a method for avoiding local minima/maxima when performing simulated annealing. What's the twist here? That it's proven?
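To be concrete about what I mean, a toy sketch (the neighbor function and constants are whatever fits your problem):

```python
import math, random

def anneal(f, x, neighbor, T0=1.0, cooling=0.995, iters=10_000):
    """Classic simulated annealing: always accept improvements, and
    accept uphill moves with probability exp(-delta/T) while T cools."""
    T, fx = T0, f(x)
    for _ in range(iters):
        y = neighbor(x)                # random nearby candidate
        fy = f(y)
        delta = fy - fx
        if delta < 0 or random.random() < math.exp(-delta / T):
            x, fx = y, fy              # accept the (possibly uphill) move
        T *= cooling                   # cool down: fewer uphill moves later
    return x
```

The uphill acceptances are what shake you out of local minima.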
From the article:
> Grimmer’s paper focused only on smooth functions, which have no sharp kinks, and convex functions, which are shaped like a bowl and only have one optimal value at the bottom.
Yes! I said exactly the same thing; it seems very strange that the whole body of work around operations research, with heuristics like Tabu search, squeaky wheel and other ways to shake out closer-to-optimal solutions without the expense of enumerating every solution…
Operations research has known this probably since the 60s, and certainly since my dissertation in 2000 [1], so I don't really understand the article?
[1] amusingly my dissertation was about routing taxis efficiently around a set of pickup points, I distinctly remember thinking in a cab from the station that it was a shame there weren’t any devices that had GPS, maps, Internet and a programming interface to build something that looked a lot like Uber… oh well, execution and timing is everything I guess.
Typically optimal step sizes (α) along a descent direction in gradient descent are determined using a line search, which is itself a (cheaper) univariate optimization problem.
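E.g. the standard backtracking (Armijo) variant, as a minimal sketch (constants and function names are illustrative):

```python
import numpy as np

def backtracking_line_search(f, grad, x, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition
    holds along the steepest-descent direction d = -grad(x)."""
    g = grad(x)
    d = -g
    alpha, fx = alpha0, f(x)
    # Armijo condition: f(x + alpha*d) <= f(x) + c * alpha * <g, d>
    while f(x + alpha * d) > fx + c * alpha * g.dot(d):
        alpha *= beta                  # step overshoots: back off
    return alpha

# usage on a toy quadratic
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x = np.array([3.0, -4.0])
x_next = x - backtracking_line_search(f, grad, x) * grad(x)
```

As I understand it, the point of the paper's fixed schedules is that they skip exactly this kind of per-step adaptivity and still win in the worst case.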
My guess is that they proved this for a class of target functions, and it's the proof that is novel (novel at least insofar as the chosen distribution of target functions is concerned). Practitioners always knew, but lacked the proof.
Haha, probably not good to say this for my YC application, but I'd probably have made Uber absolutely fantastic for the drivers, not made much money, and gone out of business once other businesses were competing with me… or made V1 so close to my desired product that it took me a year to build.
I assume they're positing a particular distribution over target functions when making this claim; otherwise you run into No Free Lunch. I wonder what that distribution is.
Think of all solutions to a problem as a weirdly shaped, wavy graph: you might select a solution with some states that appear to work great but make other, better states unreachable. Gradient search is good, but you want that plus occasional big jumps to check you aren't stuck at a local peak or trough (toy sketch below). There's loads of stuff in operations research about it.
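Something like this, as a toy sketch (names and constants are mine; this is the non-convex heuristic, not what the paper does):

```python
import numpy as np

def descend_with_jumps(f, grad, x, step=0.01, jump=2.0, iters=5000):
    """Gradient descent plus occasional big random jumps: descend
    locally, but periodically leap elsewhere and keep only the best
    point seen, so a jump into a worse basin costs nothing."""
    rng = np.random.default_rng(0)
    best_x, best_f = x, f(x)
    for i in range(1, iters + 1):
        if i % 100 == 0:
            x = best_x + rng.normal(scale=jump, size=x.shape)  # big jump
        x = x - step * grad(x)                                 # local descent
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x
```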
It's an odd instance: the step schedule is fixed and just takes a big step in the middle of each cycle. Since it's fixed and can't adapt, it's essentially blind, but it still manages to work better than constant small steps:
> Grimmer found that the fastest sequences always had one thing in common: The middle step was always a big one. Its size depended on the number of steps in the repeating sequence. For a three-step sequence, the big step had length 4.9. For a 15-step sequence, the algorithm recommended one step of length 29.7. And for a 127-step sequence, the longest one tested, the big central leap was a whopping 370. At first that sounds like an absurdly large number, Grimmer said, but there were enough total steps to make up for that giant leap, so even if you blew past the bottom, you could still make it back quickly. His paper showed that this sequence can arrive at the optimal point nearly three times faster than it would by taking constant baby steps. “Sometimes, you should really overcommit,” he said.
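Roughly what that looks like in code, as a sketch only: the 4.9 middle step comes from the quote above, but the 1.5 side values are placeholders I made up, and steps are in units of 1/L:

```python
import numpy as np

# Hypothetical repeating schedule with one big middle step (units of 1/L).
# 4.9 is the article's number for the 3-step cycle; the 1.5s are my guess.
SCHEDULE = [1.5, 4.9, 1.5]

def gd_fixed_schedule(grad, x, L, cycles=100):
    """Gradient descent whose step size cycles through a fixed,
    non-adaptive schedule instead of staying constant at 1/L."""
    for _ in range(cycles):
        for h in SCHEDULE:
            x = x - (h / L) * grad(x)
    return x

# toy L-smooth convex objective: f(x) = 0.5 * x^T A x, L = max eigenvalue
A = np.diag([1.0, 10.0])
x_min = gd_fixed_schedule(lambda x: A @ x, np.array([5.0, 5.0]), L=10.0)
```

In this toy example the big step wildly overshoots along the stiff coordinate, but the small steps on either side contract enough that each full cycle still makes net progress.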
Not really sure what is going on with the fractal structure, however:
> The results also raise an additional theoretical mystery that has kept Grimmer up at night. Why did the ideal patterns of step sizes all have such a symmetric shape? Not only is the biggest step always smack in the center, but the same pattern appears on either side of it: Keep zooming in and subdividing the sequence, he said, and you get an “almost fractal pattern” of bigger steps surrounded by smaller steps. The repetition suggests an underlying structure governing the best solutions that no one has yet managed to explain. But Grimmer, at least, is hopeful.