Hacker News | nl's comments

Not the OP but I have https://github.com/nlothian/autocoder which supports a GitHub-centric workflow using the following options:

  - Claude
  - Codex
  - Kilocode
  - Amp
  - Mistral Vibe
Very vibe coded though.

The $20 plan does this fine.

The OpenAI token limits seem more generous than the Anthropic ones too.


Listening to Dario at the NYT DealBook summit, and reading between the lines a bit, it seems like he is basically saying Anthropic is trying to be a responsible, sustainable business and charging customers accordingly, and insinuating that OpenAI is being much more reckless, financially.

I think it's difficult to estimate how profitable both are - depends too much on usage and that varies so much.

I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.

In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.

OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.

I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).


> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, eg:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.


I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book: https://jax-ml.github.io/scaling-book/ – TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. This would lead to lower latency as far as I understand it.
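To make the "data needs to move around less" part concrete, here's the rough arithmetic I have in mind (a minimal sketch with placeholder numbers, not measurements of any real GPU or TPU): decoding is roughly memory-bandwidth bound, so per-token latency can't beat bytes-moved divided by bandwidth.

  # Rough lower bound on per-token decode latency for a memory-bandwidth-bound
  # LLM: each generated token streams (roughly) all the weights from memory,
  # so latency >= bytes_to_move / bandwidth. All numbers are illustrative
  # placeholders, not figures for any real accelerator.
  def decode_latency_ms(params_billion, bytes_per_param, bandwidth_tb_s):
      bytes_to_move = params_billion * 1e9 * bytes_per_param
      return bytes_to_move / (bandwidth_tb_s * 1e12) * 1e3

  # A hypothetical 70B-parameter model served with 8-bit weights:
  for name, bw in [("chip A, 1.5 TB/s", 1.5), ("chip B, 2.5 TB/s", 2.5)]:
      print(name, round(decode_latency_ms(70, 1.0, bw), 1), "ms/token lower bound")

Whichever chip moves fewer bytes per token, or moves them faster, wins on that bound; the book's argument is about those data-movement terms rather than peak FLOPs.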

The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.


Sorry I meant Groq custom hardware, not Grok!

I don't see any latency comparisons in the link.


The link is just to the book, the details are scattered throughout. That said the page on GPUs specifically speaks to some of the hardware differences and how TPUs are more efficient for inference, and some of the differences that would lead to lower latency.

https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...

Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.


I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference but I could be wrong. I agree that I don't see why TPUs would necessarily explain latency.

To be clear I'm only suggesting that hardware is a factor here, it's far from the only reason. The parent commenter corrected their comment that it was actually Groq not Grok that they were thinking of, and I believe they are correct about that as Groq is doing something similar to TPUs to accelerate inference.

OpenAI and Anthropic don't train on your questions if you have pressed the opt-out button and are using their UI. LMArena is a different matter.

I have a bunch of private benchmarks I run against new models I'm evaluating.

The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
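Purely as an illustration of the "thousands of verifiable tests" point (this is not my actual benchmark, and the word-unscramble task below is a made-up stand-in): you generate prompts whose answers can be checked programmatically, so scoring never depends on a human judge.

  # Illustration only: generate verifiable, parlor-game-style test cases.
  # Each case carries a prompt plus a programmatic checker, so thousands of
  # cases can be produced and scored automatically.
  import random

  WORDS = ["benchmark", "latency", "systolic", "gradient", "tokenizer"]

  def make_case(rng):
      word = rng.choice(WORDS)
      scrambled = "".join(rng.sample(word, len(word)))
      return {
          "prompt": f"Unscramble '{scrambled}' into an English word.",
          "answer": word,
          "check": lambda output, w=word: output.strip().lower() == w,
      }

  cases = [make_case(random.Random(i)) for i in range(1000)]
  print(cases[0]["prompt"])
  print(cases[0]["check"](cases[0]["answer"]))  # True: verified without a human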


Ok, but then your "post" isn't scientific by definition since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.

For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629


I didn't see anyone claiming any 'science'? Did I miss something?

I guess there's two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.


1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form. (A minimal sketch of what I mean is below.)

2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.
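To illustrate point 1 (again, not my actual harness: call_model is a hypothetical placeholder for whatever API client you use, and the case format just mirrors the toy generator sketched earlier):

  # Illustrative model-selection harness: run the same private cases against
  # several candidate models and keep the one with the highest pass rate.
  def call_model(model, prompt):
      raise NotImplementedError("plug in a real API client here")

  def pass_rate(model, cases):
      return sum(c["check"](call_model(model, c["prompt"])) for c in cases) / len(cases)

  def pick_model(candidates, cases):
      scores = {m: pass_rate(m, cases) for m in candidates}
      return max(scores, key=scores.get)

None of the scores ever need to be published for this to be useful; the benchmark's whole job is that pick_model call.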

> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.


I see the potential value of private evaluations. They aren't scientific but you can certainly beat a "vibe test".

I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

Then you must not be working in an environment where a better benchmark yields a competitive advantage.


> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.


As ChatGPT said to you:

> A secret benchmark is: Useful for internal model selection

That's what I'm doing.


My question was "What's the value of a secret benchmark to anyone but the secret holder?"

The root of this whole discussion was a post about how Gemini 3 outperformed other models on some presumably informal question benchmark (a "vibe test"?). When asked for the benchmark, the response from the OP and someone else was that secrecy was needed to protect the benchmark from contamination. I'm skeptical of the need in the OP's case and I'm skeptical of the effectiveness of the secrecy in general. In a case where secrecy has actual value, why even discuss the benchmark publicly at all?


Flickr failed because they sold to Yahoo, which was a bad place to end up. But a successful Flickr would look a lot like Instagram.

Del.icio.us is the same story. Good product ahead of its time, bought by Yahoo and died. Could have been Pinterest.


Fair point, there's a good chance we'd be living in a techno utopia right now if someone was able to go back in time and prevent Yahoo from murdering so many promising startups. Conversely, if Yahoo had just spent the relative pocket change that Google was asking for back in the day perhaps we'd be living under the oppressive thumb of a trillion dollar market cap Alta Vista.

You realize that AI is driving huge advertising growth at Meta, right?

> Meta, the parent company of Facebook and Instagram, reported strong second-quarter 2025 earnings, driven primarily by robust advertising revenue growth. Total revenue reached US$47.52 billion, up 22% from last year, with advertising accounting for $46.56 billion, an increase of 21%, surpassing Wall Street expectations. The growth was fuelled by an 11% rise in ad impressions across Meta’s Family of Apps and a 9% increase in the average ad price. Net income climbed 36% to $18.34 billion, marking ten consecutive quarters of profit outperformance. The Family of Apps segment generated $47.15 billion in revenue and $24.97 billion in operating income, while Reality Labs posted a $4.53 billion operating loss.

> Much of this growth is credited to Meta’s AI advancements in its advertising offerings, such as smarter ad recommendations and campaign automation. Currently, over 4 million advertisers use the AI-powered Advantage+ campaigns, achieving a 22% improvement in returns. Building on this success, Meta plans to enable brands to fully create and target ads using AI by the end of 2026.

(emphasis mine)

https://www.campaignasia.com/article/metas-q2-ad-revenue-bea...


You realize that Zuck is trying to produce AGI, which is a money pit deeper than anything he's ever thrown money away on.

Humans aren't smart, they are really just good at appearing to be smart.

Prove me wrong.


You'll just claim we only "appeared" to prove you wrong ;)

If you don’t think humans are smart, then what living creature qualifies as smart to you? Or do you think humans created the word but it describes nothing that actually exists in the real world?

I think most things humans do are reflexive, type one "thinking" that AIs do just as well as humans.

I think our type two reasoning is roughly comparable to LLM reasoning when it is within the LLM reinforcement learning distribution.

I think some humans are smarter than LLMs out-of-distribution, but only when we think carefully, and in many cases LLMs perform better than many humans even in this case.


You didn’t answer my question

That's because it's reductionist and I reject the supposition.

I think humans are smart. I also think AI is smart.


Your original comment was:

“Humans aren't smart, they are really just good at appearing to be smart. Prove me wrong.”


Well that's not true - see Terry Tao's article on using AlphaEvolve to discover new proofs.

Additionally, producing "novel ideas" isn't something we require of smart people, so why would it be a requirement for AI?


Is IntelliCode the same as IntelliSense (the non-AI-based suggestions thing)?

It's actually in the third paragraph of the article:

> The classic IntelliSense with language server for the used language is still free – but without AI support.


