Both ChatGPT 4o and Claude 3.5 can trivially solve this puzzle if you direct th...

riku_iki · on Sept 14, 2024

And what prompt did you give them to generate the program? Did you tell them explicitly that they need to fill the cornered cells? If so, that is not what the benchmark is about. The benchmark asks the LLM to figure out the pattern on its own.

I entered the task into Claude and asked it to write Python code, and it failed to recognize the pattern:

> To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:
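
For reference, the rule Claude states there can be sketched roughly as below. This is not Claude's actual output (that isn't reproduced in this thread), and per the comment above it is not the pattern the puzzle is really about. The list-of-strings grid layout and the name `apply_claimed_rule` are assumptions made for illustration.

```python
def apply_claimed_rule(grid, target="O", marker="@", fill="X", count=2):
    """Replace each `target` cell that has exactly `count` `marker` cells
    among its 8 neighbours (horizontal, vertical, diagonal).

    `grid` is assumed to be a list of strings, one per row.
    """
    rows = [list(row) for row in grid]
    out = [row[:] for row in rows]  # write to a copy so counts use the original
    height = len(rows)
    for r in range(height):
        for c in range(len(rows[r])):
            if rows[r][c] != target:
                continue
            neighbours = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the cell itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < len(rows[rr]):
                        if rows[rr][cc] == marker:
                            neighbours += 1
            if neighbours == count:
                out[r][c] = fill
    return ["".join(row) for row in out]


# Made-up grid, not one of the benchmark examples.
print(apply_claimed_rule(["@O@", "OOO", "OOO"]))
```

Running something like this against the two example pairs is a quick way to check whether a guessed rule reproduces the expected outputs before trusting it on the third input.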

usaar333 · on Sept 14, 2024

arc reasoning challenge. I'm going to give you 2 example input/output pairs and then a third bare input. Please produce the correct third output.

It used its CoT to understand cornering -- then I got it to write a program.

But as I try again, it's not reliable.

exe34 · on Sept 14, 2024

> But as I try again, it's not reliable.

This is why I will never try anything like this on a remote server I don't control. All my toy experiments are with local LLMs that I can make sure are the same ones day after day.
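
One way to check the "same ones day after day" part, as a minimal sketch: hash the local weights file and compare it with a digest recorded earlier. The function names and the idea of a single weights file are assumptions here, and this only shows the file is unchanged; sampling settings such as seed and temperature still affect whether outputs repeat.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so large
    weights files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_model(path, expected_hex):
    """True if the file at `path` still matches a previously recorded digest."""
    return file_sha256(path) == expected_hex.lower()
```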