> Let's not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real-world performance was laughable on real tasks.

That's a better criticism of college exams than of the benchmarks. It's also likely that the training data contains either the exact exam questions or very similar ones.

The things LLMs do better than the average human tend to sit squarely in the "problems already solved by above-average humans" realm.