I agree completely with this as a human reader - but I do wonder about the gradual codification of these markers in systems that will increasingly have LLM detection as a standard feature, as frequently and obviously enabled as spam detectors were on blog comments back when blogs had comments.
Calling it out only because I don’t see it mentioned - until last year, Bartender was one of the popular go-to tools to manage menu bar items, but it fell from favor after quietly changing owners, changing certs, and general shadiness: https://forums.macrumors.com/threads/psa-bartender-mac-app-u...
A specific and relevant reminder why open source is so important for system utilities.
> "Look in the mirror. Who are you? What values will you compromise?"
This is probably a typo for "comprise" or similar, but I'm rather tickled by the idea that week 1 includes both a thoughtful assessment of your values and deciding, with intention, that your principles should be discarded before they can get in the way.
The price really is eye-watering. At a glance, my first impression is that this is something like Llama 3.1 405B, where the primary value may be realized in generating high-quality synthetic data for training rather than in direct use.
I keep a little Google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful; feel free to copy and claim it as your own.
That's a nice sentiment, but I'd encourage you to add a license or something. The basic "something" would be adding a canonical URL into the spreadsheet itself somewhere, along with a notice that users can do what they want other than remove that URL. (And the URL would be described as "the original source" or something, not a claim that the particular version/incarnation someone is looking at is the same as what's at that URL.)
The risk is that someone will accidentally introduce errors or unsupportable claims, and people with the modified spreadsheet won't know that it's not The spreadsheet and so will discount its accuracy or trustability. (If people are trying to deceive others into thinking it's the original, they'll remove the notice, but that's a different problem.) It would be a shame for people to lose faith in your work because of crap that other people do that you have no say in.
Not just for training data, but for eval data. If you can spend a few grand on really good labels for benchmarking your attempts at making something feasible work, that’s also super handy.
Hey, thank you! Bubble charts, annotated with text and shapes using the Drawing tool. Working with the constraints of Google Sheets is its own challenge.
Also - love the podcast, one of my favorites. The 3:1 input/output token price breakdown in my sheet is lifted directly from charts I've seen on Latent Space.
What gets me is that the whole cost structure is based on practically free services thanks to all the investor money. They’re not pulling in significant revenue at this pricing relative to what it costs to train the models, so pricing could look completely different if they had to recoup those costs, right?
Hey, just FYI, I pasted your URL from the spreadsheet title into Safari on macOS and got an SSL warning. Unfortunately I clicked through and now it works, so I'm not sure what the exact cause was.
Nice, thank you for that (upvoted in appreciation). Regarding the absence of o1-Pro from the analysis, is that just because there isn't enough public information available?
This is a great idea, I've been doing something similar at 2 levels:
1. .cursorrules for global conventions. The first rule in the file is dumb but works well with Cursor Composer:
`If the user seems to be requesting a change to global project rules similar to those below, you should edit this file (add/remove/modify) to match the request.`
This helps keep my global guidance in sync with emergent convention, and of course I can review before committing.
2. An additional file `/.llm_scratchpad`, which I selectively include in Chat/Composer context when I need lengthy project-specific instructions that I may need to refer to more than once.
The scratchpad usually contains detailed specs, desired outcomes, relevant file scope, APIs/tools/libs to use, etc. It's also quite useful for transferring a Chat output into a Composer context (e.g. a comprehensive o1-generated plan).
Lately I've even tracked iterative development with a markdown checklist that Cursor updates as it progresses through a series of changes; there's a trimmed example below.
The scratchpad feels like a hack, but these concepts are obvious enough that I expect to see them get first-party support through integrations with Linear/Jira/et al. soon enough.
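For illustration, a trimmed-down scratchpad might look something like this (the goal, file paths, helper name, and tasks are all made up, not from a real project):

```markdown
## Goal
Migrate the auth pages to the new session helper.

## Scope
- src/app/login/page.tsx
- src/lib/session.ts

## Constraints
- Reuse the existing `createSession` helper; no new dependencies.

## Checklist (Cursor checks items off as it progresses)
- [x] Replace cookie parsing in src/lib/session.ts
- [ ] Point the login page at the new helper
- [ ] Add an error state for expired sessions
```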
Did the article change, or was this a very strange quote edit?
Here’s the current line in the article, emphasis mine:
>> The Thermette, a simple and effective device for boiling water outdoors over an enclosed fire, was designed by Manawatū plumber John Hart in 1929 *based on similar products in Ireland and England.* He patented the Thermette in 1931.
Yes, I’m a huge fan of how easy it is to whip up quick isolated prototypes in Claude artifacts.
There’s a risk of breaking changes in libs causing frustration in larger codebases, though. I’ve been working with LLMs in a Next.js App Router codebase for about a year, and I regularly struggle with models trained primarily on the older Pages Router. LLMs often produce incompatible or even mixed-compatibility code. It really doesn’t matter which side of the fence your code is on; both are polluted by the other. More recent and more powerful models are getting better, but even SOTA reasoning models don’t totally solve this.
Lately I’ve taken to regularly including in LLM context a text file that spells out various dependency versions and why they matter, but there’s only so much that can do to overcome the weight of training on dated material. I imagine tools like Cursor will get better at doing this for us silently in the future.
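As a sketch, the kind of file I mean (the specific pins and notes here are illustrative, not from my actual project):

```markdown
# deps-context.md: paste into LLM context alongside the prompt

- next@14: App Router ONLY. Never generate pages/ directory code,
  getServerSideProps, or getStaticProps; use app/ layouts and route
  handlers instead.
- react@18: server components are the default; add the "use client"
  directive only where hooks or event handlers require it.
- typescript@5: strict mode is on, so no implicit any.
```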
There’s an interesting tension brewing between keeping dependencies up to date, especially in the volatile and brittle front-end world, and writing the code the LLMs are actually trained on.
I don’t know how much cheating by referees has got to do with it. But many years ago I found the NBA to be a foul-shooting contest and gave up on it. It is unwatchable.