An interesting countermetric would be to ask a fresh LLM (one unaware of the context that created the code), after each iteration, to summarize the purpose of the code, and then evaluate how close those summaries stay to the original problem spec. It might demonstrate the subjectivity of "better" and how optimization usually trades clarity of intent for faster results.
Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.
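For concreteness, a minimal sketch of that countermetric loop. The `summarize` callable here is a hypothetical stand-in for whatever LLM call you'd use (it is not a real API), and the similarity score is a crude lexical proxy via `difflib`; an embedding-based comparison would likely track drift better.

```python
from difflib import SequenceMatcher
from typing import Callable

def intent_drift(spec: str,
                 code_versions: list[str],
                 summarize: Callable[[str], str]) -> list[float]:
    """For each optimized version of the code, ask a context-free model to
    summarize its purpose, then score how close that summary stays to the
    original problem spec (1.0 = identical wording, 0.0 = no overlap)."""
    scores = []
    for code in code_versions:
        summary = summarize(code)  # fresh LLM call, no conversation history
        scores.append(SequenceMatcher(None, spec.lower(), summary.lower()).ratio())
    return scores

# Trivial stand-in summarizer, just to show the shape of the loop:
if __name__ == "__main__":
    spec = "Return the n-th Fibonacci number."
    versions = ["def fib(n): ...", "def f(n, a=0, b=1): ..."]
    print(intent_drift(spec, versions,
                       summarize=lambda code: "Computes Fibonacci numbers."))
```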