Skimmed the beginning of the paper; ISTM that a more accurate, less clickbait description of the work would be: for L-Lipschitz-smooth convex functions, there is an existing best-known bound on the convergence rate of gradient descent with constant step sizes. The paper proves a better bound for a variant of gradient descent that takes occasional big steps. Unclear to me whether the existing bound is known to be tight (i.e., whether there are known worst-case functions that sit right at the bound).
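For reference, the standard constant-step rate they'd be improving on (a textbook fact; notation mine): for convex f with L-Lipschitz gradient, k iterations of gradient descent with step size 1/L guarantee

$$ f(x_k) - f(x^\star) \le \frac{L \, \|x_0 - x^\star\|^2}{2k}. $$

If I'm reading it right, the new schedule improves the constant in front of the 1/k, not the 1/k rate itself.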
Convex functions do not have suboptimal local optima, so this is seemingly unrelated to the known effective technique for escaping those that other commenters are talking about.
Yea, like the ones where we throw an entire AWS server room at trying multiple random seeds with gradient descent until it works, when honestly an RNG might work just as well in a few contrived cases, and we write our paper / make our product from it.
Seems a little dismissive. I think it's a pretty interesting result from a theoretical point of view, and it might be applied in situations where you break a harder optimization problem into smaller problems whose piecewise parts have certain likely shapes. Regardless, it's not at all intuitive that this pattern should be optimal in any gradient descent setting, so understanding it might yield more fundamental concepts around general optimization problems.
I'm confused, we've known about this for the past 60+ years, most commonly as a method for avoiding local minima/maxima when performing simulated annealing. What's the twist here? That it's proven?
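To be concrete about what I mean, a toy sketch (the neighbor function and constants are whatever fits your problem):

```python
import math, random

def anneal(f, x, neighbor, T0=1.0, cooling=0.995, iters=10_000):
    """Classic simulated annealing: always accept improvements, and
    accept uphill moves with probability exp(-delta/T) while T cools."""
    T, fx = T0, f(x)
    for _ in range(iters):
        y = neighbor(x)                # random nearby candidate
        fy = f(y)
        delta = fy - fx
        if delta < 0 or random.random() < math.exp(-delta / T):
            x, fx = y, fy              # accept the (possibly uphill) move
        T *= cooling                   # cool down: fewer uphill moves later
    return x
```

The uphill acceptances are what shake you out of local minima.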
From the article:
> Grimmer’s paper focused only on smooth functions, which have no sharp kinks, and convex functions, which are shaped like a bowl and only have one optimal value at the bottom.
Yes! I said exactly the same thing; it seems very strange that the whole body of work around operations research, with heuristics like Tabu search, squeaky wheel and other ways to shake out closer-to-optimal solutions without the expense of enumerating every solution…
Operations research has known this probably since the 60s, and certainly since my dissertation in 2000 [1], so I don't really understand the article?
[1] amusingly my dissertation was about routing taxis efficiently around a set of pickup points, I distinctly remember thinking in a cab from the station that it was a shame there weren’t any devices that had GPS, maps, Internet and a programming interface to build something that looked a lot like Uber… oh well, execution and timing is everything I guess.
Typically optimal step sizes (α) along a descent direction in gradient descent are determined using a line search, which is itself a (cheaper) univariate optimization problem.
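E.g. the standard backtracking (Armijo) variant, as a minimal sketch (constants and function names are illustrative):

```python
import numpy as np

def backtracking_line_search(f, grad, x, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition
    holds along the steepest-descent direction d = -grad(x)."""
    g = grad(x)
    d = -g
    alpha, fx = alpha0, f(x)
    # Armijo condition: f(x + alpha*d) <= f(x) + c * alpha * <g, d>
    while f(x + alpha * d) > fx + c * alpha * g.dot(d):
        alpha *= beta                  # step overshoots: back off
    return alpha

# usage on a toy quadratic
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x = np.array([3.0, -4.0])
x_next = x - backtracking_line_search(f, grad, x) * grad(x)
```

As I understand it, the point of the paper's fixed schedules is that they skip exactly this kind of per-step adaptivity and still win in the worst case.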
My guess is that they proved this for a class of target functions, and it's the proof that is novel (novel at least insofar as the chosen distribution of target functions is concerned). Practitioners always knew, but lacked the proof.
Haha, probably not good to say this for my YC application, but I'd probably have made Uber absolutely fantastic for the drivers, not made much money, and gone out of business once other businesses were competing with me… or made V1 so close to my desired product that it took me a year to build.
I assume they're positing a particular distribution over target functions when making this claim; otherwise you run into No Free Lunch. I wonder what that distribution is.
Think of all solutions to a problem as a weirdly shaped, wavy graph: you might select a solution with some states that appear to work great but make other, better states unreachable. Gradient search is good, but you want that plus occasional big jumps to check you aren't stuck at a local peak or trough (toy sketch below). There's loads of stuff in operations research about it.
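Something like this, as a toy sketch (names and constants are mine; this is the non-convex heuristic, not what the paper does):

```python
import numpy as np

def descend_with_jumps(f, grad, x, step=0.01, jump=2.0, iters=5000):
    """Gradient descent plus occasional big random jumps: descend
    locally, but periodically leap elsewhere and keep only the best
    point seen, so a jump into a worse basin costs nothing."""
    rng = np.random.default_rng(0)
    best_x, best_f = x, f(x)
    for i in range(1, iters + 1):
        if i % 100 == 0:
            x = best_x + rng.normal(scale=jump, size=x.shape)  # big jump
        x = x - step * grad(x)                                 # local descent
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x
```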
It's an odd instance: the step schedule is fixed and just takes a big step in the middle of each cycle. Since it's fixed and can't adapt, it's essentially blind, but it still manages to work better than constant small steps:
> Grimmer found that the fastest sequences always had one thing in common: The middle step was always a big one. Its size depended on the number of steps in the repeating sequence. For a three-step sequence, the big step had length 4.9. For a 15-step sequence, the algorithm recommended one step of length 29.7. And for a 127-step sequence, the longest one tested, the big central leap was a whopping 370. At first that sounds like an absurdly large number, Grimmer said, but there were enough total steps to make up for that giant leap, so even if you blew past the bottom, you could still make it back quickly. His paper showed that this sequence can arrive at the optimal point nearly three times faster than it would by taking constant baby steps. “Sometimes, you should really overcommit,” he said.
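Roughly what that looks like in code, as a sketch only: the 4.9 middle step comes from the quote above, but the 1.5 side values are placeholders I made up, and steps are in units of 1/L:

```python
import numpy as np

# Hypothetical repeating schedule with one big middle step (units of 1/L).
# 4.9 is the article's number for the 3-step cycle; the 1.5s are my guess.
SCHEDULE = [1.5, 4.9, 1.5]

def gd_fixed_schedule(grad, x, L, cycles=100):
    """Gradient descent whose step size cycles through a fixed,
    non-adaptive schedule instead of staying constant at 1/L."""
    for _ in range(cycles):
        for h in SCHEDULE:
            x = x - (h / L) * grad(x)
    return x

# toy L-smooth convex objective: f(x) = 0.5 * x^T A x, L = max eigenvalue
A = np.diag([1.0, 10.0])
x_min = gd_fixed_schedule(lambda x: A @ x, np.array([5.0, 5.0]), L=10.0)
```

In this toy example the big step wildly overshoots along the stiff coordinate, but the small steps on either side contract enough that each full cycle still makes net progress.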
Not really sure what is going on with the fractal structure, however:
> The results also raise an additional theoretical mystery that has kept Grimmer up at night. Why did the ideal patterns of step sizes all have such a symmetric shape? Not only is the biggest step always smack in the center, but the same pattern appears on either side of it: Keep zooming in and subdividing the sequence, he said, and you get an “almost fractal pattern” of bigger steps surrounded by smaller steps. The repetition suggests an underlying structure governing the best solutions that no one has yet managed to explain. But Grimmer, at least, is hopeful.