Something major missing from the LLM toolkit at the moment is the ability to actually run (and e.g. test or benchmark) its own code. Without that, the LLM is flying blind. I guess there are big security risks involved in making this happen. I wonder if anyone has figured out what kind of sandbox could safely be handed to an LLM.
I have experimented with using an LLM to improve unit test coverage of a project. If you provide the model with test execution results and updated test coverage information, which can be automated, the LLM can indeed fix bugs and improve the tests it created. I found it has a high success rate at creating working unit tests with good coverage. I just used Docker to isolate the LLM-generated code from the rest of my system.
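A minimal sketch of that loop, assuming docker is installed and a placeholder image (my-pytest-image) that already has pytest and pytest-cov baked in:

    import subprocess

    def run_tests_in_docker(project_dir: str) -> str:
        """Run pytest with coverage in a throwaway container and return the combined output."""
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",             # generated code gets no network access
                "-v", f"{project_dir}:/src:ro",  # mount the project read-only
                "my-pytest-image",               # placeholder: image with pytest + pytest-cov preinstalled
                "sh", "-c", "cp -r /src /work && cd /work && pytest --cov=. --cov-report=term",
            ],
            capture_output=True, text=True, timeout=600,
        )
        return result.stdout + result.stderr

    # This combined test-result + coverage report is exactly what gets fed back to the LLM.
    print(run_tests_in_docker("/path/to/project"))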
It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.
I've found something similar: when you keep telling the LLM what the compiler says, it keeps adding more and more complexity to try to fix the error, and either it works by chance (leaving you with way overengineered code) or it just never works.
I've very rarely seen it simplify things to get the code to work.
I have the same observation: LLMs seem highly biased toward adding complexity to solve problems, for example adding explicit handling of the edge cases I pointed out rather than reworking the algorithm to eliminate the edge cases altogether. Almost every time it starts with something that's 80% correct, then iterates into something that's 90% correct while being super complex, unmaintainable, and having no chance of ever covering the last 10%.
Unfortunately this is my experience as well, to the point where I can't trust it with any technology that I'm not intimately familiar with and can thoroughly review.
Hmm, I worked with students in an “intro to programming” type course for a couple years. As far as I’m concerned, “I added complexity until it compiled and now it works but I don’t understand it” is pretty close to passing the Turing test, hahaha.
That’s sort of interesting. If code -> tests -> code really is enough to get a clean-room implementation, I wonder if this sort of tool would put that to the test.
OpenAI is moving in that direction. The Canvas mode of ChatGPT can now run its own Python in a WASM interpreter, client side, and interpret the results. They also have a server-side code interpreter mode sandboxed in a VM.
There are a lot of things that people ask LLMs to do, often in a "gotcha" type context, that would be best served by the model actually generating code to solve the problem rather than by endlessly building models with more parameters and more layers. Math questions, data analysis questions, etc. We're getting there.
The new Cursor agent is able to check the linter output for warnings and errors, and will continue to iterate (for a reasonable number of steps) until it has cleared them up. It's not quite executing, but it does improve output quality. It can even back itself out of a corner by restoring a previous checkpoint.
It works remarkably well with typed Python, but struggles miserably with Rust despite having better error reporting.
It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
> It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
What do you mean? Memory management is not related to files in Rust (or most languages).
> It seems like with Rust it [the AI] is not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
In Rust, when you refactor something that deals with the borrow checker's shenanigans, you will likely have to change a bunch of files (from experience). This means that an AI will likely also have to change a bunch of files, which they say the AI isn't so good at. They don't say this HAS to happen, just that it usually does because the borrow checker is an asshole.
This aligns with my experience as well, though I dealt with Rust before there was AI, so I can say little in regards to how the AI deals with that.
I believe that Claude has been running JavaScript code for itself for a bit now[1]. I could have sworn it also runs Python code, but I cannot find any post concretely describing it. I've seen it "iterate" on code by itself a few times now, where it will run a script, maybe run into an error, and instantly re-write it to fix that error.
ChatGPT has Code Interpreter, which lets it write and then execute Python code in a Kubernetes sandbox. It can run other languages too, but that's not documented or supported. I've even had it compile and execute C before: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
Gemini can run Python (including via the Gemini LLM API if you turn on that feature) but it's a lot more restricted than ChatGPT - I don't believe it can install extra wheels, for example.
Claude also has Artifacts, which can write a UI in HTML and JavaScript and show it to the user... but it can't actually execute code in a way that's visible to the LLM itself, so it doesn't serve the same feedback loop purpose as those other tools. https://simonwillison.net/tags/claude-artifacts/
Running code would be a downstream (client) concern. There's the ability to get structured data from LLMs (usually called 'tool use' or 'function calling') which is the first port of call. Then running it is usually an iterative agent<>agent task where fixes need to be made. FWIW Langchain seems to be what people use to link things together but I find it overkill.* In terms of actually running the code, there are a bunch of tools popping up at different areas in the pipeline (replit, agentrun, riza.io, etc)
What we really need (from an end-user POV) is that kinda 'resting assumption' that LLMs we talk to via chat clients are verifying any math they do. For actual programming, I like Replit, Cursor, ClaudeEngineer, Aider, Devin. There are a bunch of others. All of them seem to now include ongoing 'agentic' steps where they keep trying until they get the response they want, with you as the human in the chain, approving each step (usually).
* I (messing locally with my own tooling and chat client) just ask the LLM for what I want, delimited in some way by a boundary I can easily check for, and then I'll grab whatever is in it and run it in a worker or semi-sandboxed area. I'll halt the stream then do another call to the LLM with the latest output so it can continue with a more-informed response.
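For the curious, the core of that is just a few lines; the <<<RUN>>>/<<<END>>> markers and the 30-second limit are arbitrary choices of mine, not anything standard:

    import re
    import subprocess
    import sys
    import tempfile

    CODE_BLOCK = re.compile(r"<<<RUN>>>(.*?)<<<END>>>", re.DOTALL)  # arbitrary boundary markers

    def run_extracted_code(llm_reply: str) -> str:
        """Pull out the delimited snippet and run it in a separate, time-limited process."""
        match = CODE_BLOCK.search(llm_reply)
        if not match:
            return "no runnable block found"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(match.group(1))
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site dirs
                capture_output=True, text=True, timeout=30,
            )
            return result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            return "timed out after 30s"

    # Whatever this returns is what I hand back to the LLM on the next call.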
This is a major issue when it comes to things like GitHub Copilot Workspace, which is a project that promises a development environment purely composed of instructing an AI to do your bidding like fix this issue, add this feature. Currently it often writes code using packages that don't exist, or it uses an old version of a package that it saw most during training. It'll write code that just doesn't even run (like putting comments in JSON files).
The best way I can describe working with GitHub Copilot Workspace is like working with an intern who's been stuck on an isolated island for years, has no access to technology, and communicates with you by mailing letters with code handwritten on them that he thinks will work. And also if you mail too many letters back and forth he gets mad and goes to sleep for the day saying you reached a "rate limit". It's just not how software development works
The only proper way to code with an LLM is to run its code, give it feedback on what's working and what isn't, and reiterate how it should. Then repeat.
The problem with automating it is that the number of environments you'd need to support in order to run arbitrary code is practically infinite, and with local dependencies it's genuinely impossible unless there's direct integration, which means running it on your machine. And that means giving an opaque service full access to your environment. Or at best, a local model that's still a binary blob capable of outputting virtually anything, but at least it won't spy on you.
Any LLM-coding agent that doesn't work inside the same environment as the developer will be a dead end or a toy.
I use ChatGPT to ask for code examples or sketching out pieces of code, but it's just not going to be nearly as good as anything in an IDE. And once it runs in the IDE then it has access to what it needs to be in a feedback loop with itself. The user doesn't need to see any intermediate steps that you would do with a chatbot where you say "The code compiles but fails two tests what should I do?"
Don't they? It highly depends on the errors. Could range from anything like a simple syntax error to a library version mismatch or functionality deprecation that requires some genuine work to resolve and would require at least some opinion input from the user.
Furthermore LLMs make those kinds of "simple" errors less and less, especially if the environment is well defined. "Write a python script" can go horribly wrong, but "Write a python 3.10 script" is most likely gonna run fine but have semantic issues where it made assumptions about the problem because the instructions were vague. Performance should increase with more user input, not less.
They could, but if the LLM can iterate and solve it then the user might not need to know. So when user input is needed, at least it's not merely to do what I do now: feed the compiler messages or test failures back to ChatGPT, which then gives me a slightly modified version. But of course it will sometimes fail, and that will need manual intervention.
I often find that ChatGPT often reasons itself to a better solution (perhaps not correct or final, but better) if it just gets some feedback from e.g. compiler errors. Usually it's like
Me: "Write a function that does X and satisifies this test code"
LLM: responds with function (#1)
Me: "This doesn't compile. Compiler says X and Y"
LLM: Apologies: here is the fixed version (#2)
Me: "Great, now it compiles but it fails one of the two test methods, here is the output from the test run: ..."
LLM: I understand. Here is an improved version that should pass the tests (#3)
Me: "Ok now you have code that could theoretically pass the tests BUT you introduced the same syntax errors you had in #1 again!"
LLM: I apologize, here is a corrected version that should compile and pass the tests (#4)
etc etc.
After about 4-5 iterations with nothing but gentle nudging, it's often working. And there usually isn't more nudging than returning the output from the compiler or test runs. The code at the 4th step might not be perfect but it's a LOT better than it was at first. The problem with this workflow is that it's like having a bad intern on the phone pair programming. Copying and pasting code back and forth and telling the LLM what the problem with it is, is just not very quick. If the iterations were automatic so that the only thing I see is step #4, then at least I could focus on the manual intervention needed there. But fixing a trivial syntax error between #1 and #2 is just a chore. I think ChatGPT is simply pretty bad here, and the better models like Opus probably don't have these issues to the same extent.
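A sketch of what automating those intermediate steps could look like; call_llm is a placeholder for whichever chat API you use, and the pytest command assumes the tests already exist on disk:

    import subprocess

    def call_llm(prompt: str) -> str:
        """Placeholder for your chat API of choice; returns a new version of the code."""
        raise NotImplementedError

    def build_and_test(code: str) -> tuple[bool, str]:
        """Write the code to disk, run the tests, return (passed, combined output)."""
        with open("generated.py", "w") as f:
            f.write(code)
        result = subprocess.run(["python", "-m", "pytest", "-x"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def iterate(task: str, max_rounds: int = 5) -> str:
        code = call_llm(task)
        for _ in range(max_rounds):
            ok, output = build_and_test(code)
            if ok:
                return code  # only now does a human need to look at it
            # The same gentle nudge I'd type by hand: just the raw compiler/test output.
            code = call_llm(task + "\n\nYour last attempt failed:\n" + output + "\nPlease fix it.")
        return code  # still failing after max_rounds: time for manual intervention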
It can't be done in the LLM itself of course, but the wrapper you're talking about already exists in multiple projects competing on SWE-bench. The simplest one is Aider with --auto-test: https://aider.chat/docs/usage/lint-test.html
We have it run code, and the biggest thing we find is that it gets into a loop quite fast if it doesn't recognise the error: it fixes it by causing other errors, and then fixes those by reintroducing the initial error.
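One cheap guard against that, sketched under the assumption that you already have the fix loop in place: remember the error signatures you've seen and bail out as soon as one comes back.

    def error_signature(output: str) -> str:
        """Crude signature: the first line mentioning an error, keeping only the text after the last colon."""
        for line in output.splitlines():
            if "error" in line.lower():
                return line.split(":")[-1].strip()
        return output[:120]

    def should_stop(history: list[str], latest_output: str) -> bool:
        """Stop iterating once the model reintroduces an error we've already tried to fix."""
        sig = error_signature(latest_output)
        if sig in history:
            return True  # we're going in circles; hand it back to the human
        history.append(sig)
        return False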
This is a good idea. You could take a set of problems, have the LLM solve it, then continuously rewrite the LLM's context window to introduce subtle bugs or coding errors in previous code submissions (use another LLM to be fully hands off), and have it try to amend the issues through debugging the compiler or test errors. I don't know to what extent this is already done.
I don't think that's always true. Gemini seemed to run at least some programs, which I believe because if you asked it to write a Python program that would take forever to run, the response would take forever too. For example the prompt "Write a python script that prints 'Hello, World', then prints a billion random characters" used to just time out on Gemini.
I think there should be a guard that checks the code before running it. It can be a human, or another LLM reviewing the code for safety. I'm working on an AI assistant for data science tasks. It works in a Jupyter-like environment, and a human executes the final code by running a cell.
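A sketch of the kind of automated guard I mean, assuming a Python cell; the blocklists are illustrative rather than exhaustive, and anything flagged would go to a human or second LLM for review:

    import ast

    # Illustrative blocklists; anything flagged goes to a human (or second LLM) for review.
    SUSPICIOUS_IMPORTS = {"os", "subprocess", "shutil", "socket", "requests"}
    SUSPICIOUS_CALLS = {"eval", "exec", "open", "__import__"}

    def flag_risky_code(source: str) -> list[str]:
        """Return human-readable warnings about anything in the cell that looks unsafe."""
        warnings = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                warnings += [f"imports {a.name!r}" for a in node.names
                             if a.name.split(".")[0] in SUSPICIOUS_IMPORTS]
            elif isinstance(node, ast.ImportFrom):
                if (node.module or "").split(".")[0] in SUSPICIOUS_IMPORTS:
                    warnings.append(f"imports from {node.module!r}")
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in SUSPICIOUS_CALLS:
                    warnings.append(f"calls {node.func.id}()")
        return warnings

    print(flag_risky_code("import subprocess\nsubprocess.run(['rm', '-rf', '/tmp/x'])"))
    # -> ["imports 'subprocess'"]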
It'd be great if it could describe the performance of code in detail, but for now just adding a skill to detect if a bit of code has any infinite loops would be a quick and easy hack to be going on with.
It is exactly the halting problem. Finding some infinite loops is possible, there are even some obvious cases, but finding "any" infinite loops is not. In fact, even the obvious cases are not if you take interrupts into account.
I think that's the joke. In a sci-fi story, that would make the computer explode.
The halting problem isn't so relevant in most development, and nothing stops you having a classifier that says "yes", "no" or "maybe". You can identify code that definitely finishes, and you can identify code that definitely doesn't. You can also identify some risky code that probably might. Under condition X, it would go into an infinite loop - even if you're not sure if condition X can be met.
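A toy version of that three-way classifier (this sketch only ever answers "no" or "maybe", which is the honest part), plus the timeout fallback that "maybe" turns into in practice; the heuristics are purely illustrative:

    import ast
    import subprocess
    import sys

    def loop_verdict(source: str) -> str:
        """Toy classifier: 'no' for an obvious `while True` with no break, otherwise 'maybe'."""
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.While):
                always_true = isinstance(node.test, ast.Constant) and node.test.value is True
                has_break = any(isinstance(n, ast.Break) for n in ast.walk(node))
                if always_true and not has_break:
                    return "no"  # definitely doesn't halt (ignoring exceptions and interrupts)
        return "maybe"           # the honest answer for everything else

    def run_with_timeout(path: str, seconds: int = 10) -> str:
        """In practice, 'maybe' just means: run it with a timeout and see what happens."""
        try:
            out = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=seconds)
            return out.stdout
        except subprocess.TimeoutExpired:
            return f"gave up after {seconds}s"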
The problem is that you can do this for specific functions/methods, but you cannot do this for a PROGRAM. All programs are "maybe", by definition. You want it to run until you tell it to stop, but you may never tell it to stop. Ergo, all programs have some sort of infinite loop in them somewhere, even if it is buried in your framework or language runtime.
Yeah, sorry, I wasn’t clear: not the user, the programmer. This is true for almost all programs. Even a simple “print hello world” involves at least one intentional infinite loop: sending bytes to the buffer. The buffer could remain full forever.
Ideally you could take this one step further and feed production logs, user session replays, and feedback into the LLM. If the UX is what I'm optimizing for, I want it to have that context, not have it speculate about performance issues that might not exist.
I think the GPT models have been able to run Python (albeit limited) for quite a while now. Expanding that to support a variety of programming languages that exist though? That seems like a monumental task with relatively little reward.
ChatGPT has a Code Interpreter tool that can run Python in a sandbox, but it's not yet enabled for o1. o1 will pretend to use it though; you have to watch very carefully to check whether that actually happened.
That's a bit like saying the drawback of a database is that it doesn't render UIs for end-users, they are two different layers of your stack, just like evaluation of code and generation of text should be.