Hey. I like your roast of benchmarks.

I also publish my own evals on new models (using coding tasks that I curated myself, run without tools, and rated by humans against rubrics). I'd love for you to check them out and share your thoughts:

Example recent one on GPT-5:

https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...

All results:

https://eval.16x.engineer/evals/coding


