
I'd like to know what more of the use cases are too, but one would be for doing renaming operations where the LSP understands the code and will tell the caller exactly what edits to make to which files. So I assume with LSP integration you could then ask Claude Code to do the rename, and it would do it per LSP instructions rather than itself having to understand the source code and maybe doing a worse job.
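Roughly the kind of thing I mean, as a hypothetical sketch (the file path and new name are made up, but the method and parameter shape are standard LSP):

    # Hypothetical sketch of the LSP "textDocument/rename" request (JSON-RPC).
    # The server replies with a WorkspaceEdit listing the exact text edits to
    # apply, per file - the caller doesn't need to understand the code at all.
    rename_request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/rename",
        "params": {
            "textDocument": {"uri": "file:///project/src/widget.py"},
            "position": {"line": 10, "character": 4},  # cursor on the old name
            "newName": "render_widget",
        },
    }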


Waymos rely on remote operators to take over when the vehicle doesn't know what to do. Obviously, if the remote connection is gone then that help is no longer available, and one might speculate that the cars then "fail safe" by not proceeding when they are in a situation where remote help is called for but inaccessible.

Perhaps the traffic lights being out is what caused the cars to stop operating autonomously and try to phone home for help, or perhaps losing the connection home is itself enough to trigger a fail-safe shutdown mode?

It reminds me a bit of the recent TeslaBot video, another of their teleoperated stunts, where we see the bot appear to remove, with both hands, a headset that it wasn't wearing (but that its remote operator was), then fall over backwards "dead" as the remote operator evidently clocked off his shift or went for a bathroom break.


That’s clearly unacceptable. It needs to gracefully handle not having that fallback. That is an incredibly obvious possible failure.

Things go wrong -> get human help

Human not available -> just block the road???

How is there not a very basic “pull over and wait” final fallback?

I can get staying put if the car thinks it hit someone or ran over something. But in a situation like this where the problem is fully external it should fall back to “park myself” mode.


> How is there not a very basic “pull over and wait” final fallback

Barring everything else, the proper failsafe for any vehicle should be to stop moving and tell the humans inside to evacuate. This is true for autonomous vehicles as well as manned ones: if you can't figure out how to pull over during a disaster, ditching is absolutely a valid move.


If the alternative is that the vehicle explodes, sure. And since GP did say "final fallback", I suppose you're right. But if the cars are actually reaching that point, they probably shouldn't be on the road in the first place.

The not-quite-final fallback should be to pull over.


Yeah. I wasn’t considering people, just getting the car out of the way.

I wasn’t considering people, taking it as a given that any time the car gives up, the doors should be unlocked for passengers to leave if they feel it’s safe.

And as a passenger, I’d feel way safer getting out if it pulled over instead of just stopped in the middle of the street and other cars were trying to drive around it.

No one should ever be trapped inside by the car.


They now apparently run these things on the interstate, so the car needs to do more than just stop.


How is it going to pull over at a four way stoplight intersection? Drive on the sidewalk?


Seems I was pretty much correct.

https://waymo.com/blog/2025/12/autonomously-navigating-the-r...

"Navigating an event of this magnitude presented a unique challenge for autonomous technology. While the Waymo Driver is designed to handle dark traffic signals as four-way stops, it may occasionally request a confirmation check to ensure it makes the safest choice. While we successfully traversed more than 7,000 dark signals on Saturday, the outage created a concentrated spike in these requests. This created a backlog that, in some cases, led to response delays contributing to congestion on already-overwhelmed streets."


Right, how do you know the gene pool now mostly contains large aggressive bears that instinctively stay away from villages, and small cuddly bears that are enjoying leftover pasta suppers?

Maybe it's just that many of the large aggressive bears living near villages have been shot or scared away, but the genetics are unchanged, and the offspring of large aggressive bears currently living away from villages will have no aversion to trying their luck in the village?


The numbers do appear quite staggering. It can't just be the dead drivers - there must be similar numbers of stoned drivers who are causing accidents, maybe killing others, while surviving themselves.

As far as driving goes, any amount of drugs or alcohol is going to slow reaction times, in addition to impairing judgement and the ability to control the vehicle. Even a couple of tenths of a second of added reaction time is enough to make the difference between braking in time and hitting another car or a pedestrian.
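As a rough back-of-the-envelope example (the 50 km/h speed and 0.2 s delay are just assumed numbers):

    # Extra distance covered during a 0.2 s slower reaction at an assumed 50 km/h
    speed_ms = 50 / 3.6                 # 50 km/h is roughly 13.9 m/s
    extra_distance = speed_ms * 0.2     # distance travelled during the extra delay
    print(f"{extra_distance:.1f} m travelled before braking even begins")  # ~2.8 m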


I agree with the sentiment, although not sure if "error" is the right category/verbiage for actionable logs.

In an ideal world, logs and alarms (alerting product support staff) should cleanly separate the things that are merely informative, useful to the developer, from the things that require some human intervention.

If you don't do this then it's like "the boy who cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.
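A minimal sketch of the kind of separation I mean, using Python's stdlib logging (the "ACTIONABLE" marker is just a made-up convention for the example, not anything standard):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("orders")

    # Informative: useful when a developer is digging through behaviour later.
    log.info("retrying upstream call (attempt 2)")

    # Actionable: a human actually needs to look at this. Tagging it explicitly
    # lets alerting and greps key off the category, not just the severity word.
    log.error("ACTIONABLE: refund failed permanently, order_id=123")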


Chain of thought, and now "reasoning", are basically workarounds for the simplistic nature of the Transformer neural network architecture that all LLMs are based on.

The two main limitations of the Transformer that it helps with are:

1) A Transformer is just a fixed-size stack of layers, with a one-way flow of data through the layers from input to output. The fixed number of layers equates to how many "thought" steps the LLM can put into generating each word of output, but good responses to harder questions may require many more steps and iterative thinking...

The idea of "think step by step", aka chain of thought, is to have the model break its response down into a sequence of steps, each building on what came before, so that the scope of each step is within the capability of the fixed number of layers of the Transformer.

2) A Transformer has extremely limited internal memory from one generated word to the next, so telling the model to go one step at a time, feeding its own output back in as input, in effect makes the model's output a kind of memory that makes up for this.

So, chain of thought prompting ultimately gives the model more thinking steps (more words generated), together with a memory of what it has been thinking, in order to generate a better response.
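A toy sketch of that loop, where generate() is just a stand-in for whatever LLM API is being used (not a real library call): each step's output is appended back into the prompt, so it becomes the "memory" the next fixed-depth forward pass can build on.

    # Toy chain-of-thought loop: the model's own output is fed back in as input,
    # acting as external memory across fixed-depth forward passes.
    def solve_step_by_step(question, generate, max_steps=8):
        transcript = f"Question: {question}\nLet's think step by step.\n"
        for _ in range(max_steps):
            step = generate(transcript)   # one more "thought" appended
            transcript += step + "\n"
            if "Answer:" in step:         # model signals it is done
                break
        return transcript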


"Look Ma, no hands!" vibe coding, as described by Karpathy, where you never look at the code being generated, was never a good idea, and still isn't. Some people are now misusing "vibe coding" to describe any use of LLMs for coding, but there is a world of difference between using LLMs in an intelligent considered way as part of the software development process, and taking a hit on the bong and "vibe coding" another "how many calories in this plate of food" app.


Karpathy himself has used "vibe coding" to describe "usage of LLMs for coding," so it's fair to say the definition has expanded.

https://karpathy.bearblog.dev/year-in-review-2025/


Which frankly makes it pretty useless. Describing how I use them at work as "vibe coding" in the same vein as a random redditor generating whatever on Replit is useless. It's a definition so wide as to have no explanatory power.


Yes, it's a strange take. It's not that programmers have changed their mind about unchanging LLMs, but rather that LLMs have changed and are now useful for coding, not just CoPilot autocomplete like the early ones.

What changed was the use of RLVR training for programming, resulting in "reasoning" models that now attempt to optimize for a long-horizon goal (i.e. bias generation towards "reasoning steps" that during training led to a verified reward), as opposed to earlier LLMs where RL was limited to RLHF.

So, yeah, the programmers who characterized early pre-RLVR coding models as of limited use were correct. Now the models are trained differently and developers find them much more useful.


I thought I'd read a lot of these threads this year, and also discussed off-site the use of coding agents and the technology behind them; but this is genuinely the first time I've seen the term "RLVR".


RLVR ("reinforcement learning with verifiable rewards") refers to RL used to encourage reasoning towards long-horizon goals in areas such as math and programming, where the correctness/desirability of a generated response (or perhaps an individual reasoning step) can be verified in some way. For example, generated code can be verified by compiling and running it, or math results verified by comparing to known correct results.

The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.
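A minimal sketch of what a verifiable reward for code generation could look like (the test-running details here are my assumptions for illustration, not any lab's actual pipeline):

    import subprocess, sys, tempfile

    def code_reward(generated_code: str, test_code: str) -> float:
        """Return 1.0 if the generated solution passes the tests, else 0.0.
        The 'verifiable' part: correctness is checked mechanically, so the
        result can be used directly as an RL reward signal."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=10)
            return 1.0 if result.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0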


> generated code can be verified by compiling and running it

I think this gets to the crux of the issue with LLMs for coding (and indeed with 'test oriented development'). For anything beyond the most basic level of complexity (i.e. anything actually useful), code cannot be verified by compiling and running it. It can only be verified - to a point - by skilled human inspection/comprehension. That is the essence of code really: a definition of action, given by humans, to a machine for running with a priori unenumerated inputs. Otherwise it is just a fancy lookup table. By definition, then, not all inputs and expected outputs can be tabulated, tested for, or rewarded for.


I was talking about the RL training process for giving these models coding ability in the first place.

As far as using the trained model to generate code, then of course it's up to the developer to do code reviews, testing, etc as normal, although of course an LLM can be used to assist writing test cases etc as well.


> The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.

Ah, hence the "HF" angle.


RLHF really has a different goal - it's not about rewarding/encouraging reasoning, but rather rewarding outputs that match human preferences for whatever reason (responses that are more on-point, or politer, or longer form, etc, etc).

The way RLHF works is that a smallish amount of feedback data of A/B preferences from actual humans is used to train a preference model, and this preference model is then used to generate RL rewards for the actual RLHF training.
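The preference model is typically trained with a Bradley-Terry style loss over those A/B pairs, something like this sketch (PyTorch, not any particular lab's code):

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_model, chosen, rejected):
        # Bradley-Terry style objective: push the reward model to score the
        # human-preferred response higher than the rejected one.
        r_chosen = reward_model(chosen)      # scalar score per response
        r_rejected = reward_model(rejected)
        return -F.logsigmoid(r_chosen - r_rejected).mean()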

RLHF has been around for a while and is what tamed base models like GPT-3 into GPT-3.5, which was used for the initial ChatGPT, making it behave in a more acceptable way!

RLVR is much more recent, and is the basis of the models that do great at math and programming. If you talk about reasoning models being RL trained then it's normally going to imply RLVR, but it seems there's a recent trend of people calling it RLVR to be more explicit.


Agree with this. The RLVR changes (starting with o1, I think) were what changed/disrupted the industry. Before that I thought these things were just better autocomplete.


Interesting interview that lifts the curtain, a tiny bit, on the development of Gemini, and the views of the DeepMind team.

Matt Turck seems to have a good sense of what areas Sebastian may be able to talk about, and what not, with one off-the-table area evidently being training on synthetic reasoning traces, which must be an area of active research and differentiation.


> The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR

I don't see these descriptions as very insightful.

The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).

For an artificial intelligence to be intelligent in its own right, and therefore be generally intelligent, it would need - like an animal - to be embodied (even if only virtually), autonomous, predicting the outcomes of its own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.

Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists that evolution has therefore honed for adaptive lifetime learning and general intelligence.


> while LLMs are just echoing narrow portions of the intelligent output of humans

But they aren't just echoing, that's the point. You really need to stop ignoring the extrapolation abilities in these domains. The point of the jagged analogy is that they match or exceed human intelligence in specific areas in a way that is not just parroting.


It's tiresome in 2025 to keep on having to use elaborate long winded descriptions to describe how LLMs work, just to prove that one does understand, rather than be able to assume that people generally understand, and be able to use shorter descriptions.

Would "riffing" upset you less than "echoing"? Or an explicit "echoing statistics" rather than "echoing training samples"? Does "Mashups of statistical patterns" do it for you?

The jagged frontier of LLM capability is just a way of noting the fact that they act more like a collection of narrow intelligences than like a general intelligence whose performance might be expected to be more even.

Of course LLMs are built and trained to generate based on language statistics, not to parrot individual samples, but given your objection it's amusing to note that some of the areas where LLMs do best, such as math and programming, are the ones where they have been RL-trained to override these more general language patterns and instead more closely follow the training data.

