More

Y_Y · 2026-01-08T12:29:06 1767875346

I don't feel like the solution to this is having victims set up LLCs.

Y_Y · 2026-01-08T12:25:58 1767875158

Surely they'll sell if the price is right.

Y_Y · 2026-01-07T23:02:52 1767826972

I wholly dis^H^H^Happrove of what you say—but won't defend to the death your right to say it.

(Apologies to not Voltaire)

Y_Y · 2026-01-07T22:55:17 1767826517

I do something that, but just parse the output of `docker ps` or whatever. Since each line has the same format it's very straightforward.

Y_Y · 2026-01-07T22:53:36 1767826416

Fwiw this is how cars work when you change to a country that drives on the other side of the road. It seems like mirroring the car would make sense. But really everything is shifted to the opposite side as a translation without reflection. It's easier to manufacture, but as many of you will know and is apparent to all rental agencies, adapting doesn't take long for the average driver, even on manual transmission.

Y_Y · 2026-01-07T22:28:09 1767824889

I'm taking the Red Cross public next. With the price of healthcare these days my earnings projections are uber-extreme.

Y_Y · 2026-01-07T22:26:42 1767824802

But when you're a moron how can you distinguish?

I'm being (mostly) serious, suppose you're a stuffed ahort trying to boost your valuation, how can you work out who's smart enough to train your LLM? (Never mind how to get them to work for you!)

aspenmartin · 2026-01-07T22:34:11 1767825251

I do a lot of human evaluations. Lots of Bayesian / statistical models that can infer rater quality without ground truth labels. The other thing about preference data you have to worry about (which this article gets at) is: preferences of _who_? Human raters are a significantly biased population of people, different ages, genders, religions, cultures, etc all inform preferences. Lots of work being done to leverage and model this.

Then for LMArena there is the host of other biases / construct validity: people are easily fooled, even PhD experts; in many cases it’s easier for a model to learn how to persuade than actually learn the right answers.

But a lot of dismissive comments as if frontier labs don’t know this, they have some of the best talent in the world. They aren’t perfect but they in a large sene know what they’re doing and what the tradeoffs of various approaches are.

Human annotations are an absolute nightmare for quality which is why coding agents are so nice: they’re verifiable and so you can train them in a way closer to e.g. alphago without the ceiling of human performance

fc417fc802 · 2026-01-07T22:48:00 1767826080

> in many cases it’s easier for a model to learn how to persuade than actually learn the right answers

So we should expect the models to eventually tend toward the same behaviors that politicians exhibit?

c0balt · 2026-01-07T23:42:39 1767829359

Maybe a happy to deceive marketing/sales role would be more accurate.

RA_Fisher · 2026-01-07T23:41:07 1767829267

100% (am a Bayesian statistician).

Isn’t it fascinating how it comes down to quality of judgement (and the descriptions thereof)?

We need an LMArena rated by experts.

Lerc · 2026-01-08T03:14:21 1767842061

As a statistician, do you you think you could, given access to the data, identify the subset of LMArena users that are experts?

RA_Fisher · 2026-01-08T12:09:52 1767874192

Yes, for sure! I can think of a few ways.

zqy123007 · 2026-01-08T01:14:40 1767834880

they always know, they just have non-AGI incentive and asymetric upside to play along...

wongarsu · 2026-01-07T22:43:59 1767825839

Sure, on the surface judging the judge is just as hard as being the judge

But at least the two examples of judging AI provided in the article can be solved by any moron by expending enough effort. Any moron can tell you what Dorothy says to Toto when entering Oz by just watching the first thirty minutes of the movie. And while validating answer B in the pan question takes some ninth-grade math (or a short trip to wikipedia), figuring out that a nine inch diameter circle is in fact not the same area as a 9x13 inch square is not rocket science. And with a bit of craft paper you could evaluate both answers even without math knowledge

So the short answer is: with effort. You spend lots of effort on finding a good evaluator, so the evaluator can judge the LLM for you. Or take "average humans" and force them to spend more effort on evaluating each answer

michaelmrose · 2026-01-08T01:38:47 1767836327

Maybe you need to have people rate others ratings to remove at least the worst idiots.

atleastoptimal · 2026-01-07T22:37:26 1767825446

that’s why Mercor is worth 2billion

Y_Y · 2026-01-07T22:07:06 1767823626

Ah yes, markdown, the ultimate structure for machine-readable data

actionfromafar · 2026-01-08T03:06:13 1767841573

Someone had to come up with something even more annoying than yaml for machine-readable data. :)

Y_Y · 2026-01-07T22:04:58 1767823498

But it sure does seem like that sometimes

Y_Y · 2026-01-07T22:03:25 1767823405

You can't phase out common "knowledge"!

JumpCrisscross · 2026-01-08T02:16:17 1767838577

True. But you can stop recommending bad science. The original food pyramid was an industry wish list.