I'm excited to see whether the instruction following improvements play out in the use of Codex.
The biggest issue I've seen _by far_ with using GPT models for coding has been their inability to follow instructions... and also their tendency to act a second time on messages from up-thread instead of acting on what you just asked for.
I think that's part of the issue I have with it constantly.
Let's say I am solving a problem. I suggest strategy Alpha; a few prompts later I realize this is not going to work, so I suggest strategy Bravo. But for whatever reason it will hold on to ideas from Alpha, and the output is a mix of the two. Even if I say "forget about Alpha, we don't want anything to do with it," there will be certain pieces in the Bravo solution that only make sense with Alpha.
I usually just start with a new chat at that point and hope the model is not relying on previous chat context.
This is a hard problem to solve because it's hard to communicate our internal compartmentalization to a remote model.
Unfortunately, if it's in context then it can stay tethered to the subject. Asking it not to pay attention to a subject doesn't remove attention from it, and probably actually reinforces it.
If you use the API playground, you can edit out dead ends and other subjects you don't want addressed anymore in the conversation.
That's just how context works. If you're going to backpedal, go back in the conversation and edit your prompt or start a new session. I'll frequently ask for options, get them, then edit that prompt and just tell it to do whatever I decided on.
I've only had that happen when I use /compact, so I just avoid compacting altogether on Codex/Claude. No great loss and I'm extremely skeptical anyway that the compacted summary will actually distill the specific actionable details I want.
Huh really? It’s the exact opposite of my experience. I find gpt-5-high to be by far the most accurate of the models in following instructions over a longer period of time. Also much less prone to losing focus when context size increases
Are you using the -codex variants or the normal ones?
As in: if you look at this image, can you place yourself on a scale of 1 - 5 as to the fidelity with which you can picture an apple if you try to imagine it?
I'm a 5 for example, and in asking many people this question I've gotten a solid spectrum of answers from 1 - 5. Generally in a single group of a handful of people I'll get several different numbers.
I've had mixed results with this method, especially for folks in category 5, because they grew up in a world where people casually talked about [actual] visualization and they've associated [not actually visualizing] with the word (thinking it is a metaphor for something else). As someone who cannot visualize at all, when faced with this question I feel like my answer wants to be... "null" / "the premise of this question doesn't make sense", and not "5".
A variant that I've found helpful for teasing out this case:
1. Ask the test subject to visualize an apple
2. Ask them for a few very specific details about the apple they are currently visualizing (what color is it? does it have a leaf or a bite out of it? etc.)
In many cases aphantasics will not object to the activity in step 1, but they won't be doing the same thing as the folks who are actually visualizing. They'll just do what they do when people talk about "visualizing".
When you get to step 2, someone who is actually visualizing can immediately answer the questions and doesn't think they are strange; they are just reporting what they are visualizing in front of them.
An aphantasic in step 2 is often confused. They aren't actually visualizing any specific apple, so there isn't a reference to answer the questions from. You'll get a response like... well, what kind of apple is it? How should I know if it has a bite out of it? You first have to either provide more context or reword the question to something like: What is a color an apple could be? or What color is your favorite apple?
I have sometimes wondered whether there is a personality or cognitive trait that makes one unable to respond to tests measuring personality or cognitive traits.
On every personality test I have ever taken, for many of the questions I've felt that I could answer almost anything and still be truthful.
When I see this apple scale, I simultaneously feel that both 1 and 5 apply to how I visualize an apple. It's hard for me to describe what's going on in my brain, and I don't think language or images are very helpful at illuminating it.
If such a meta-trait were to exist, which would have more to do with the narratives and metaphors we use to describe our mental processes than the processes themselves, it would be funny if that's actually a good deal of what was being measured all along.
Something that might help - in this specific instance - is trying to contrast with others.
That is: if you show this photo to people that you know and you compare and contrast _how much detail_ you can each imagine the apple in, that can help.
For example: are you imagining a _specific_ apple? If so, what high-level color is it? How about more specifically? How does the color change across the surface? Does it have any distinguishing features? Leaves on the stem or no? What does the bottom look like? Can you turn it around and describe that?
Folks who are high up on the spectrum (like 1) can often answer these questions specifically, whereas as you go down the spectrum these tasks seem progressively more impossible.
That doesn't really help me, because I feel that I could answer those questions in detail but also truthfully say that there is no image at all of the apple in my head.
I guess what I'm saying is that image / not-image binary doesn't really map on to how I perceive the experience of my own imagination.
But again, maybe that just means I'm a 5 and I'm coping.
I'm definitely in category 4 by default, though I can do category 2 with concentration. But I don't really feel like it's a problem? If things have colors and surfaces, then your view of one object can block your view of another object, which seems like it would make visualizing complex scenes or devices much less convenient.
From the comments here I'm almost getting the impression aphantasia is more common than not, which is wild to me. I'm quite sure I'd place myself at a 1. I can imagine an apple, and it will vividly appear. I can transform it, see reflections on it, or imagine the feel and sound of slicing it. However, I do not experience the apple as part of an overlay, as some others have described it. Rather, it's as if I use a different set of eyes, ears or another skin. The more vividly I imagine the apple, the less aware I become of my actual senses. I can of course also imagine how the apple would interact with the environment around me, but the combined environment is still distinct from reality. I also have very precise memories of faces I've seen. I've always wondered if there's an inverse correlation with my number memory, which is much more diffuse.
I think I'm a 3? I don't see a real apple, more like the concept of an apple, but the harder I try, the more characteristics I can conjure -- although it would be a stretch to say I "see" them.
I think this test is bad at accounting for subjectivity. A literal image you see with your eyes doesn't map exactly to an image you "see" with your mind.
It... doesn't, but I've found that a large number of people (I've asked at least many dozens) find it relatively easy to rank themselves on it, and differentiate amongst one another's subjective perceptions.
Also see my sibling comment about contrasting and tasks!
According to my wife (a 1), yes. Seems wild to me as a ~4. If I concentrate really, really hard on trying to imagine visual detail I can get something to a ~3 at low detail or hold individual small details at a 2 until I stop concentrating on them.
As in: if you look at this image, can you place yourself on a scale of 1 - 5 as to the fidelity with which you can picture an apple if you try to imagine it?
I'm a 5 for example, and in asking many people this question I've gotten a solid spectrum of answers from 1 - 5. Generally in a single group of a handful of people I'll get several different numbers.
I have seen it and unfortunately, I'm a 5. I quite literally cannot picture an apple in any form. I understand what I'm supposed to be picturing but when I try there's nothing that appears. It's fascinating to me too, since I typically have quite vivid dreams and I've been able to lucid dream on a number of occasions.
Now, I've chatted with friends, and my one friend is close to a 2, or maybe a 1 from how he described it (being able to visualize the apple and rotate it 3-dimensionally).
I fully believe this to be real, but I struggle to internalize that there are people who genuinely can't picture an apple. That is a very useful simple tool. Thank you for sharing it.
Even this feels like only a partial scale. I can picture what an apple looks like, rotate it in my mind, and see how light would reflect off of it as it moves.
How about smell? Can you call to mind what it would smell like to slice open an apple and experience that in some sense? Or what it would sound or feel like? I'm curious if it's literally "seeing" or if it's the entire experience of imagining an event.
I'm not the person you're replying to, but I'm also a 5.
I can do none of the things you describe. I know how an apple looks, smells, tastes and sounds when you cut into it, but I can't visualise or hear those sounds at will. I cannot call to mind any visual image of an apple.
I also can't visualise my wife or children's faces, although again, I know what they look like (so it's not face blindness).
I do think I also have SDAM as well, which I think quite often goes hand in hand with total aphantasia.
Hasn't really affected how I go about in the world. I don't feel deficient in any way. It was only a few years ago I found out my experience isn't what the majority experiences.
This is incredible to me. I wonder if you have some other mechanism of "knowing" or recalling that I don't that substitutes in. It's entirely possible, given that so many people report not being aware their experience is atypical.
I find this absolutely fascinating. I appreciate you sharing.
I'm a 4 at a push. When I read, I see _very_ vague images in my head, but that's about it.
I'm very adept at conjuring up sound, though. Maybe it doesn't apply in the same way, but I can hear full symphonies and pick out individual instruments and harmonies and the like.
One thing that I would watch out for with _all_ OpenAI products is that they use Keychain's Access Control lists to encrypt data such that _you_ cannot unlock or access it.
The net effect, however, is that if you use their apps, you can get data _in_, but you can only get it _out_ in ways that their apps themselves allow. The actual encryption of Chrome data seems to be _potentially_ held behind the 'ChatGPT Safe Storage' key that - since it's Chromium - you can unlock (for now), but the rest of Atlas's keys do the same thing as the above - locking you out of the encrypted data on your own computer.
I understand the rationale that it protects the data from malware or from users willingly filling in a password dialogue, but the current approach means that other apps, automated backups, and export tools (even with Full Disk Access or similar) can't see into the apps' data, even when that access is _intentional_.
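To make the lock-out concrete, here's a minimal sketch (not OpenAI's actual implementation) of what another process sees when it asks Keychain Services for such an item. The "ChatGPT Safe Storage" service name comes from the comment above; everything else is an assumption for illustration.

```cpp
// Build on macOS with: clang++ probe.cpp -framework Security -framework CoreFoundation
#include <Security/Security.h>
#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>

int main() {
    // Query the login keychain for a generic-password item by service name.
    // The service name here is an assumption based on the comment above.
    const void* keys[]   = { kSecClass, kSecAttrService, kSecReturnData };
    const void* values[] = { kSecClassGenericPassword,
                             CFSTR("ChatGPT Safe Storage"),
                             kCFBooleanTrue };
    CFDictionaryRef query = CFDictionaryCreate(
        kCFAllocatorDefault, keys, values, 3,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    CFTypeRef secret = nullptr;
    OSStatus status = SecItemCopyMatching(query, &secret);
    // A process that isn't on the item's access list either triggers a user
    // prompt or gets an error back (e.g. errSecAuthFailed or
    // errSecInteractionNotAllowed), so backup/export tools can't silently
    // read the key material, Full Disk Access or not.
    std::printf("SecItemCopyMatching: %d\n", static_cast<int>(status));

    if (secret) CFRelease(secret);
    CFRelease(query);
    return 0;
}
```

Whether you get a prompt or a hard failure depends on how the item's ACL was set up - which is the point: the app that created the item decides, not you.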
It's very 'batteries-included', for one thing - when a novice wants to code, I recommend Zed to them because it'll just handle and manage LSPs for them for a variety of languages. Meanwhile, with VSCode, step 1 of installing and using it for e.g. Rust is to go and install a random extension (and the VSCode store, whilst sorted by popularity, can be intimidating / confusing for a novice, who might install a random/scammy extension). The 'recommended extensions' thing helps, but it's still subpar.
It has some other niceties – I love how, if you Cmd+Shift+F to search across the project, you get a multi-buffer [1] - I often use that for larger manual refactors across a ton of places in my codebase.
But honestly... as others have said, speed is just _such_ a strong feature for my taste - it makes a world of difference compared to VSCode, because in VSC I'll be typing vim commands, the editor will fail to keep up and it'll do the wrong thing - whereas in Zed it's fast enough that I never really run into stalls.
The biggest problem with VSC for me is that sometimes the undo history is completely broken with Vim mode. If you don't commit frequently, it is very easy to mess up the project and lose all your work if you undo anything.
Having everything be an extension is the double-edged sword of VS Code. Zed is great for the ecosystem and I use it as an alternate editor for quick text editing, but I don't foresee it replacing VS Code as my IDE. Once you've configured VS Code to your liking with devcontainers and extensions declared by the config file, it becomes excellent.
I wish they wouldn't use JS to demonstrate the AI's coding abilities - the internet is full of JS code and at this point I expect them to be good at it.
Show me examples in complex (for lack of a better word) languages to impress me.
I recently used OpenAI models to generate OCaml code, and it was eye-opening how much even reasoning models are still just copy-and-paste machines.
The code was full of syntax errors, and they clearly lacked a basic understanding of what functions are in the stdlib vs those from popular (in OCaml terms) libraries.
Maybe GPT-5 is the great leap and I'll have to eat my words, but this experience really made me more pessimistic about AI's potential and the future of programming in general.
I'm hoping that in 10 years niche languages are still a thing, and the world doesn't converge toward writing everything in JS just because AIs make it easier to work with.
> I wish they wouldn't use JS to demonstrate the AI's coding abilities - the internet is full of JS code and at this point I expect them to be good at it. Show me examples in complex (for lack of a better word) languages to impress me.
Agreed. The models break down even on code that isn't that complex, if it's not web/JavaScript. I was playing with Gemini CLI the other day and had it try to make a simple Avalonia GUI app in C#/.NET; it kept going around in circles and couldn't even get a basic starter project to build, so I can imagine how much it'd struggle with OCaml or other more "obscure" languages.
This makes the tech even less useful where it'd be most helpful - on internal, legacy codebases, enterprisey stuff, stacks that don't have numerous examples on github to train from.
> on internal, legacy codebases, enterprisey stuff
Or anything that breaks the norm really.
I recently wrote something where I updated a variable using atomic primitives. Because it was inside a hot path, I read the value without using atomics, as it was okay for the value to be stale.
I handed it the code because I had a question about something unrelated and it wouldn't stop changing this piece of code to use atomic reads.
Even when I prompted it not to change the code or explained why this was fine it wouldn't stop.
FWIW, and this obviously depends on the language, formal memory models typically do forbid races between atomic and non-atomic accesses to the same memory location.
While what you were doing may have been fine given your context, if you're targeting e.g. standard C++, you really shouldn't be doing it (it's UB). You can usually get the same result with relaxed atomic load/store.
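A minimal sketch of that alternative, assuming C++11 or later and a hypothetical shared counter (the original code isn't shown in this thread):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared counter standing in for the variable described above.
std::atomic<uint64_t> counter{0};

// Writer path: update with an atomic read-modify-write.
void record_event() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Hot-path reader: a relaxed load typically compiles to a plain load on
// mainstream 64-bit targets, so it costs about the same as the racy
// non-atomic read, but it is well-defined under the C++ memory model.
// Staleness is still allowed; no ordering guarantees are implied.
uint64_t approximate_count() {
    return counter.load(std::memory_order_relaxed);
}
```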
(As far as AI is concerned, I do agree that the model should just have followed your direction though.)
Yes, for me it is and it was even before this experience.
But, you know, there's a growing crowd that believes AI is almost at AGI level and that they'll vibe code their way to a Fortune 100 company.
Maybe I spend too much time rage baiting myself reading X threads and that's why I feel the need to emphasize that AI isn't what they make it out to be.
The snake game they showcased - if you ask Qwen3-coder-30b to generate a snake game in JS, it generates the exact same layout, the exact same two buttons below, and the exact same text under the two buttons. It just regurgitates its training data.
I used ChatGPT to convert an old piece of OCaml code of mine to Rust and while it didn't really work—and I didn't expect it to—it seemed a very reasonable starting point to actually do the rest of the work manually.
Honestly, why would anyone find this information useful? Creating a brand-new greenfield project is a terrible test, because literally anything it outputs looks good as long as it works on the happy path. Coding with LLMs falls apart in situations where complex reasoning is required - situations such as debugging issues in a service where there's either no framework in use, or the framework has been significantly modified to better suit the authors' needs.
Yeah, I guess it's just the easiest thing to generate and evaluate.
A more useful demonstration like making large meaningful changes to a large complicated codebase would be much harder to evaluate since you need to be familiar with the existing system to evaluate the quality of the transformation.
Would be kinda cool to instead see diffs of nontrivial patches to the Ruby on Rails codebase or something.
> Honestly, why would anyone find this information useful?
This seems to impress the mgmt types a lot, e.g. "I made a WHOLE APP!", when basically most of this is frameworks and tech that had crappy bootstrapping to begin with (React and JS are rife with this, in spite of their popularity).