I always wondered how they achieved this. Is it just retries while generating tokens, retrying as soon as they find a mismatch? Or is the model itself trained extremely well in this version of 4.5?
They're using the same trick OpenAI has been using for a while: they compile a grammar and then run it as part of token inference, so that only tokens that fit the grammar can be selected as the next token.
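A minimal sketch of that loop (my own illustration, not Anthropic's or OpenAI's actual implementation; the `model` and `grammar` objects and their methods are hypothetical):

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def constrained_decode(model, grammar, prompt_tokens, max_tokens=256):
    """Greedy decoding in which only grammar-valid tokens can be chosen."""
    tokens = list(prompt_tokens)
    state = grammar.initial_state()               # hypothetical grammar API
    for _ in range(max_tokens):
        logits = model.next_token_logits(tokens)  # hypothetical: (vocab_size,) array
        mask = np.full_like(logits, -np.inf)      # forbid everything by default...
        mask[list(grammar.valid_next_tokens(state))] = 0.0  # ...except valid tokens
        probs = softmax(logits + mask)            # renormalize over what's left
        next_tok = int(np.argmax(probs))          # greedy; could sample instead
        tokens.append(next_tok)
        state = grammar.advance(state, next_tok)
        if grammar.is_complete(state):
            break
    return tokens
```

Libraries like outlines and xgrammar do essentially this, with the grammar precompiled into token-level masks so the per-step filtering is cheap.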
Yeah, and now there are mature OSS solutions like outlines and xgrammar, which makes it even weirder that Anthropic is only now supporting this.
This makes me wonder if there are cases where one would want the LLM to generate a syntactically invalid response (which could be identified as such) rather than guarantee syntactic validity at the potential cost of semantic accuracy.
I would have suspected it too, but I’ve been struggling with OpenAI returning syntactically invalid JSON when provided with a simple pydantic class (a list of strings), which shouldn’t be possible unless they have a glaring error in their grammar.
You might be using JSON mode, which doesn't guarantee a schema will be followed, or structured outputs without strict mode. It is possible to get the property that the response is either a valid instance of the schema or an explicit error (e.g. a refusal).
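For reference, here's roughly what the strict path looks like with the OpenAI Python SDK and a Pydantic model (a sketch from memory; exact method names and the model ID may differ in current SDK versions):

```python
from openai import OpenAI
from pydantic import BaseModel

class Keywords(BaseModel):
    items: list[str]   # the "list of strings" case mentioned above

client = OpenAI()
resp = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",   # any model with structured-output support
    messages=[{"role": "user", "content": "Extract keywords from: ..."}],
    response_format=Keywords,    # schema enforced in strict mode
)

msg = resp.choices[0].message
if msg.refusal:                  # explicit refusal instead of broken JSON
    raise RuntimeError(msg.refusal)
keywords = msg.parsed            # a validated Keywords instance
```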
Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?
I'd be surprised if they hadn't also specifically trained for structured "correct" output, in addition to constraining the next token to follow the structure.
In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.
It generally happens when the grammar is highly constrained, for example if a boolean is expected next.
If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
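To make that concrete with made-up numbers: if the unconstrained model puts 0.4% on "true" and 0.3% on "false" (with the other 99.3% spread over tokens the grammar forbids), masking and renormalizing leaves you sampling true about 57% of the time and false about 43%, essentially a coin flip, even though the raw probabilities said the model had no real opinion either way.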
It's always the result of a bad prompt, though: if you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and so it seems less random.
It's not just the prompt that matters, it's also field order (and a bunch of other things).
Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
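A quick illustration of the field-order point with Pydantic (a hypothetical schema of my own, assuming the provider emits fields in schema order, which is how strict structured outputs typically behave):

```python
from pydantic import BaseModel

# Harder for the model: it must commit to `done` before it has even
# written out which task it is talking about.
class TaskBoolFirst(BaseModel):
    done: bool
    description: str

# Easier: the model first writes the task description, then decides
# whether that task was marked as done, as two separate steps.
class TaskDescriptionFirst(BaseModel):
    description: str
    done: bool

class MeetingTasks(BaseModel):
    tasks: list[TaskDescriptionFirst]
```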
Grammars work best when aligned with the prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.
Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy penalty, smoothing, etc. – filtering tokens down to the ones a grammar allows is just another knob. It makes sense and can be used for producing programming-language output as well – what's the point in generating output you know up front is invalid? Better to filter it out and allow only valid completions.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
It was this exact part of the conversation that rubbed me the wrong way too. marsf expresses some very valid criticism that, instead of being publicly addressed, is being handled with "let's discuss it privately". That always means they don't want to discuss it; they just want to shut you down.
I don’t think so. Working in tech with many busy people, I say “hop on a call”, but only in “let’s sync live, it’ll be faster” situations.
This stuck out to me as rude. I would never say that to someone on my team who expressed serious concerns, much less to someone quitting after years of dedication.
I would offer an apology, an explanation, and follow-up questions to understand more in public, then say I'm happy to set up time to talk privately if they would like to or would feel more comfortable.
In my experience, and in my feeling as someone reading such things, you need to tone-match. The resignation message was somewhat formal, structured and serious in tone. Replying in such an informal tone means that you are not taking things seriously, which is insulting. Even more so because that informal answer is public.
I'm tone-deaf by culture and by personality. I often make those kinds of mistakes. But a public resignation like this is a brightly flashing warning light saying: "this needs a serious formal answer".
What about the reply in the link indicates to you that the person has empathy for marsf’s complaints and is willing to change anything at Mozilla in response to them?
For the reasons I stated above, the response comes off as faking understanding to manage a PR issue rather than genuine empathy and possible negotiation, but I am often wrong about many things.
> The Takeaway: Skills are the right abstraction. They formalize the “scripting”-based agent model, which is more robust and flexible than the rigid, API-like model that MCP represents.
Just so there's no confusion: MCP is like an API, but the underlying API can execute a Skill, so it's not MCP vs. Skills as a contest. It's the broad concept of a "flexible" skill vs. a "parameter"-based API. And parameter-based APIs can also be flexible depending on how we write them, except they lack the SKILL.md that, in the case of Skills, guides the LLM to be more generic than a pure API.
By the way, if you are a Mac user, you can execute Skills locally via OpenSkills[1], which I created using Apple containers.
Since you are on a Mac, if you need some kind of code-execution sandbox, check out Coderunner[1], which is based on Apple containers and provides a way to execute any LLM-generated code without risking arbitrary code execution on your machine.
I have recently added Claude Skills to it, so all the Claude Skills can be executed locally on your Mac too.
Yeah, CS skins were one of the biggest markets for digital-only aesthetic items before NFTs came around (and are probably still bigger than NFTs now). The main thing with NFTs was that there's no "central database"; CS skins live solely in Valve's database.
Making a butterfly knife isn't hard for Valve (in the past, Steam Customer Service duplicated items lost in scams). It's hard for the players because they have to "gamble" for it by paying for keys to open cases.
Can you explain the shadow banking / conversion angle? All I found was that selling knives was used to get a discount on Steam wallet balance thanks to price arbitrage.
> "Selling Knives" (挂刀) refers to the technique of buying in-game items from 3rd-party (Chinese) trading sites like NetEase BUFF, C5, IGXE, and UUYP, and then selling them on the Steam Market to obtain a discounted Steam Wallet balance by capitalizing on price differences.
I'm surprised the price difference did not disappear if people make that trade.
China notoriously has intense capital controls. It's difficult for ordinary Chinese citizens to take capital out of the country. CS2 items can be bought and sold in both USD and RMB, and can be transferred between Chinese and international accounts. It's not about Steam wallet balances.
Interesting. I'm curious though: assuming I am Chinese and I trade knives for USD, where would I be able to receive USD to evade capital controls? Surely not my bank account or Steam wallet. Or is it for people with bank accounts in both countries? But in that case wouldn't crypto be more convenient? I'm puzzled.
Yes, you would need to receive it in a foreign USD bank account outside of China; the whole goal is to get the capital out of China and into a foreign account. Cryptocurrency transactions/exchanges are illegal in China, so that's definitely not convenient! Meanwhile you can buy CS2 items with any ordinary payment method.
Remember the 15% transaction fee on the Steam market? That's why the price difference hasn't disappeared. Players can avoid this fee through gifting and off-platform transactions.
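Rough, made-up numbers to illustrate the loop: an item whose buyer pays $100 on the Steam market nets the seller only about $87 of wallet balance after the ~15% cut. If that same item trades for the RMB equivalent of roughly $75 in cash on BUFF, then buying it there and reselling it on Steam turns $75 of cash into ~$87 of wallet balance, i.e. around a 15% discount on anything bought with that wallet. The gap can't be arbitraged below the fee because the reverse trip, wallet balance back to cash, has to eat the fee again and wallet funds can't be withdrawn directly.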
And all of this is just Chinese players trying to buy games cheaper—after all, what else can Steam wallet funds be used for?
Some comments claim this is a way for Chinese people to evade financial regulations, but that's complete nonsense: the Steam market's capacity is nowhere near enough to meet that demand. Anyone seriously moving money could instead legally exchange currency using the foreign-exchange quotas of relatives and friends, engage in cross-border wash trading through underground banks, or use fake trade and fake investment schemes.
I feel watches and cars are different. You can't magically "print" 10,000,000 Bentleys, so supply is constrained, and they are expensive to make. The luxury feels more tangible than just being rare.
A lot of real economies are based on fake constraints. Or the constraint is a closely held secret that's pretty arbitrary and not based on any grand amount of skill or effort.
Hey, we built Coderunner[1] exactly for this purpose. It's completely local. We use Apple containers for this (each of which is mapped 1:1 to a lightweight VM).
Very cool! Apple containers run on Apple silicon (ARM), so it's complementary to my stack, which doesn't support ARM yet (but soon will, once it's extended to QEMU, which does support ARM). Thanks for sharing!