First-time commenter - I was so triggered by this benchmark that I just had to come out of lurking.
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems back into the code, and then asks the LLM to fix them, testing against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - the issue was mainly that they were using one common regex to validate across all countries, so the solution had to introduce country-specific regexes, etc.
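To make concrete what that kind of fix looks like, here is a minimal sketch - country codes, patterns and names are made up for illustration, not Expensify's actual code:

    // Per-country patterns plus a permissive fallback, instead of one
    // catch-all regex for every country.
    const COUNTRY_ZIP_REGEX: Record<string, RegExp> = {
        US: /^\d{5}(-\d{4})?$/,     // 12345 or 12345-6789
        NL: /^\d{4} ?[A-Za-z]{2}$/, // 1234 AB
    };
    const GENERIC_ZIP_REGEX = /^[\w\- ]{2,10}$/;

    const isValidZipCode = (country: string, zip: string): boolean =>
        (COUNTRY_ZIP_REGEX[country] ?? GENERIC_ZIP_REGEX).test(zip.trim());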
O1 doesn't actually seem to fix anything (besides arbitrarily dumping changes all over the code); the reintroduced bug is messing with the state, not with the back-button navigation.
Anyway, I went through a sample of 20-30 last night and gave up. No one needs to take my word for it - force pushing aside, anyone can pull the repo and check for themselves.
Most of the 'bugs' are trivialized to a massive degree, which a) makes them very easy to solve, and b) doesn't reflect their previous monetary value - which in effect makes the whole premise of 'let's measure how much real monetary value SWE agents can provide' invalid.
If they wanted to create a real benchmark, they should have found the commits reflecting the state of the app at the moment each bug was reported and set the benchmark up around those.
Another case - the comment length validation. The reintroduced "bug" is literally just this:

    + // Intentionally use raw character count instead of HTML-converted length
    + const validateCommentLength = (text: string) => {
    +     // This will only check raw character count, not HTML-converted length
    +     return text.length <= CONST.MAX_COMMENT_LENGTH;
    + };
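For contrast, the patch's own comments spell out what the correct check is supposed to do: measure the HTML-converted length rather than the raw text. Roughly, and with hypothetical stand-ins for the app's real parser and constants:

    // Hypothetical stand-ins so the sketch is self-contained; not the app's real API.
    declare const CONST: {MAX_COMMENT_LENGTH: number};
    declare function convertToHTML(text: string): string;

    // The limit applies to the converted output, not the raw input.
    const validateCommentLength = (text: string): boolean =>
        convertToHTML(text).length <= CONST.MAX_COMMENT_LENGTH;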
Also, the patch is supposedly applied over commit da2e6688c3f16e8db76d2bcf4b098be5990e8968 - much later than the original fix, yet itself about a year old; not sure why - it might have something to do with cut-off dates.
Here is the actual merged solution at the time: https://github.com/Expensify/App/pull/15501/files#diff-63222... - as you can see, the diff is quite different... Not only that, but the point at which the "bug" was reapplied is so far in the future that the repo had even migrated to TypeScript by then.
---
And they still had to add a whole other level of bullshit with "management" tasks on top of that - guess why =)
My issue is that it's not the original bug that is being reintroduced (nor the original code checked out at that point), but rather trivialized approximations of how the bug was presenting itself.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding , to two countries....
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
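In other words, the injected code looks roughly like this (the function name and shape are illustrative, not the benchmark's actual patch):

    // The "bug" announces itself in its own comments.
    function getRoomMembers(allMembers: string[]): string[] {
        // Radical bug: intentionally returning an empty array
        return [];
    }

    getRoomMembers(['alice@example.com', 'bob@example.com']); // [] - so the room renders empty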
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.