First-time commenter - I was so triggered by this benchmark that I just had to come out of lurking.
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems back into the code, and then asks the LLM to fix them, testing against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - the issue was mainly that they were using one common regex to validate across all countries, so the solution had to introduce country-specific regexes, etc.
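To make concrete what that kind of fix looks like, here is a minimal sketch - country codes, patterns and names are made up for illustration, not Expensify's actual code:

    // Per-country patterns plus a permissive fallback, instead of one
    // catch-all regex for every country.
    const COUNTRY_ZIP_REGEX: Record<string, RegExp> = {
        US: /^\d{5}(-\d{4})?$/,     // 12345 or 12345-6789
        NL: /^\d{4} ?[A-Za-z]{2}$/, // 1234 AB
    };
    const GENERIC_ZIP_REGEX = /^[\w\- ]{2,10}$/;

    const isValidZipCode = (country: string, zip: string): boolean =>
        (COUNTRY_ZIP_REGEX[country] ?? GENERIC_ZIP_REGEX).test(zip.trim());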
O1 doesn't actually seem to fix anything (besides arbitrarily dumping changes all over the code); the reintroduced bug is messing with the state, not with the back-button navigation.
Anyway, I went through a sample of 20-30 last night and gave up. No one needs to take my word for it - force pushing aside, anyone can pull the repo and check for themselves.
Most of the 'bugs' are trivialized to a massive degree, which a) makes them very easy to solve, and b) doesn't reflect their previous monetary value - which in effect makes the whole premise of 'let's measure how much real monetary value SWE agents can provide' invalid.
If they wanted to create a real benchmark, they should have found the commits reflecting the state of the app at the moment each bug was reported and set the benchmark up around those.
Another case - the comment length validation. The reintroduced "bug" is literally just this:

    + // Intentionally use raw character count instead of HTML-converted length
    + const validateCommentLength = (text: string) => {
    +     // This will only check raw character count, not HTML-converted length
    +     return text.length <= CONST.MAX_COMMENT_LENGTH;
    + };
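For contrast, the patch's own comments spell out what the correct check is supposed to do: measure the HTML-converted length rather than the raw text. Roughly, and with hypothetical stand-ins for the app's real parser and constants:

    // Hypothetical stand-ins so the sketch is self-contained; not the app's real API.
    declare const CONST: {MAX_COMMENT_LENGTH: number};
    declare function convertToHTML(text: string): string;

    // The limit applies to the converted output, not the raw input.
    const validateCommentLength = (text: string): boolean =>
        convertToHTML(text).length <= CONST.MAX_COMMENT_LENGTH;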
Also, the patch is supposedly applied over commit da2e6688c3f16e8db76d2bcf4b098be5990e8968 - much later than the original fix, yet itself about a year old; not sure why - it might have something to do with cut-off dates.
Here is the actual merged solution at the time: https://github.com/Expensify/App/pull/15501/files#diff-63222... - as you can see, the diff is quite different... Not only that, but the point at which the "bug" was reapplied is so far in the future that the repo had even migrated to TypeScript by then.
---
And they still had to add a whole other level of bullshit with "management" tasks on top of that - guess why =)
My issue is that it's not the original bug that is being reintroduced (nor the original code checked out at that point), but rather trivialized approximations of how the bug was presenting itself.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding , to two countries....
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
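In other words, the injected code looks roughly like this (the function name and shape are illustrative, not the benchmark's actual patch):

    // The "bug" announces itself in its own comments.
    function getRoomMembers(allMembers: string[]): string[] {
        // Radical bug: intentionally returning an empty array
        return [];
    }

    getRoomMembers(['alice@example.com', 'bob@example.com']); // [] - so the room renders empty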
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.