Hacker News | past | comments | ask | show | jobs | submit | Y_Y's comments

I don't feel like the solution to this is having victims set up LLCs.

Surely they'll sell if the price is right.

I wholly dis^H^H^Happrove of what you say—but won't defend to the death your right to say it.

(Apologies to not Voltaire)


I do something like that, but just parse the output of `docker ps` or whatever. Since each line has the same format, it's very straightforward.

Fwiw, this is how cars work when you move to a country that drives on the other side of the road. It seems like mirroring the car would make sense, but really everything is shifted to the opposite side as a translation, without reflection. It's easier to manufacture, and as many of you will know (and as is apparent to every rental agency), adapting doesn't take long for the average driver, even with a manual transmission.

I'm taking the Red Cross public next. With the price of healthcare these days my earnings projections are uber-extreme.

But when you're a moron how can you distinguish?

I'm being (mostly) serious: suppose you're a stuffed shirt trying to boost your valuation, how can you work out who's smart enough to train your LLM? (Never mind how to get them to work for you!)


I do a lot of human evaluations. There are lots of Bayesian / statistical models that can infer rater quality without ground-truth labels. The other thing about preference data you have to worry about (which this article gets at) is: preferences of _whom_? Human raters are a significantly biased population of people; different ages, genders, religions, cultures, etc. all inform preferences. Lots of work is being done to leverage and model this.

Then for LMArena there is a host of other biases / construct-validity problems: people are easily fooled, even PhD experts; in many cases it’s easier for a model to learn how to persuade than to actually learn the right answers.

But there are a lot of dismissive comments here, as if frontier labs don’t know this; they have some of the best talent in the world. They aren’t perfect, but in a large sense they know what they’re doing and what the tradeoffs of various approaches are.

Human annotations are an absolute nightmare for quality, which is why coding agents are so nice: they’re verifiable, so you can train them in a way closer to e.g. AlphaGo, without the ceiling of human performance.
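For the curious, "infer rater quality without ground-truth labels" can be done with a simple EM loop in the spirit of Dawid and Skene. A minimal toy sketch (illustrative only; the binary-label setup and all names are my assumptions, not any lab's actual pipeline):

```python
# Toy EM in the spirit of Dawid & Skene: jointly estimate each item's
# (unknown) true label and each rater's accuracy, with no ground truth.
from collections import defaultdict

def infer_rater_quality(ratings, iters=20):
    """ratings: list of (rater, item, label) tuples with binary labels 0/1."""
    raters = {r for r, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    accuracy = {r: 0.7 for r in raters}  # mild initial trust in everyone
    post = {}
    for _ in range(iters):
        # E-step: posterior probability that each item's true label is 1,
        # under a uniform prior over the two labels.
        for it in items:
            p1 = p0 = 1.0
            for r, i, label in ratings:
                if i != it:
                    continue
                a = accuracy[r]
                p1 *= a if label == 1 else 1 - a
                p0 *= a if label == 0 else 1 - a
            post[it] = p1 / (p1 + p0)
        # M-step: a rater's accuracy is their expected agreement
        # with the currently inferred labels.
        agree, total = defaultdict(float), defaultdict(float)
        for r, i, label in ratings:
            agree[r] += post[i] if label == 1 else 1 - post[i]
            total[r] += 1
        accuracy = {r: agree[r] / total[r] for r in raters}
    return accuracy, post
```

Feed it synthetic ratings with one reliable rater and one adversarial rater and it separates them after a few iterations; the full Dawid-Skene model additionally handles per-class confusion matrices and more than two labels.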


> in many cases it’s easier for a model to learn how to persuade than actually learn the right answers

So we should expect the models to eventually tend toward the same behaviors that politicians exhibit?


Maybe a happy-to-deceive marketing/sales role would be more accurate.

100% (am a Bayesian statistician).

Isn’t it fascinating how it comes down to quality of judgement (and the descriptions thereof)?

We need an LMArena rated by experts.


As a statistician, do you think you could, given access to the data, identify the subset of LMArena users that are experts?

Yes, for sure! I can think of a few ways.

they always know, they just have non-AGI incentives and asymmetric upside to play along...

Sure, on the surface judging the judge is just as hard as being the judge.

But at least the two examples of judging AI provided in the article can be solved by any moron who expends enough effort. Any moron can tell you what Dorothy says to Toto when entering Oz just by watching the first thirty minutes of the movie. And while validating answer B in the pan question takes some ninth-grade math (or a short trip to Wikipedia), figuring out that a nine-inch-diameter circle does not in fact have the same area as a 9x13 inch rectangle is not rocket science. And with a bit of craft paper you could evaluate both answers even without the math knowledge.

So the short answer is: with effort. You spend lots of effort on finding a good evaluator, so the evaluator can judge the LLM for you. Or you take "average humans" and force them to spend more effort on evaluating each answer.
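The pan arithmetic above is trivial to check, e.g.:

```python
import math

# Area of a 9-inch-diameter round pan vs. a 9x13-inch rectangular pan.
round_area = math.pi * (9 / 2) ** 2  # ~63.6 square inches
rect_area = 9 * 13                   # 117 square inches
print(round_area, rect_area)
```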


Maybe you need to have people rate others' ratings, to filter out at least the worst idiots.

that’s why Mercor is worth $2 billion

Ah yes, markdown, the ultimate structure for machine-readable data

Someone had to come up with something even more annoying than YAML for machine-readable data. :)

But it sure does seem like that sometimes

You can't phase out common "knowledge"!

True. But you can stop recommending bad science. The original food pyramid was an industry wish list.
