Where are you seeing that Claude 3 scored 85%? That would be a massive jump.

PodgieTar · on March 16, 2024

riku_iki · on March 16, 2024

HumanEval is very different from SWE-bench, which is what Devin was tested on.

PodgieTar · on March 17, 2024

I didn't say it was the same; I compared non-agentic Claude to this. This used HumanEval.

riku_iki · on March 18, 2024

You said:

> how this performs against the same benchmark Devin was using

> ...

> Claude 3 Opus already scored around 85-86% on these benchmarks

Devin used SWE-bench, not HumanEval, which kinda implies you said Opus got 85% on SWE-bench, which is not true. That was my confusion.