Do AI models really reason?
Sounds like a reasonable enough question.
So, Apple did some research.
They tested the major reasoning models, Claude, DeepSeek-R1 and o3-mini, and found something that perhaps they didn’t want to find:
They don’t reason.
They memorise.
Until things get hard.
Then?
They give up.
Apple ran a series of complex logic puzzles past the models. Puzzles AI had never seen
before. And watched their performance fall
off
a
cliff.
Even when given more thinking time or better prompts, the models hit a complexity wall
and collapsed.
So, what does that mean?
Here’s how AI performs as problem complexity rises:
Low complexity: Your average AI model does fine.
Medium complexity: Reasoning models show a small edge.
High complexity: Everything fails.
Why?
Because the systems aren’t truly logical.
They’re pattern recognisers. Predictive guessers.
They’re brilliant at mimicking intelligence.
That is, until the task requires actual thinking.
What we have are clever mimics that can perform low-complexity tasks, which is not a
bad thing (and we’ll get to that in the next post).
So, scaling will just result in faster mimics, not thinking.
For now, the complexity of creativity is their downfall.
And until that leap is made, there are still plenty of creative leaps to be made in this industry.