GPT‑5 vs o3: A Power User’s Reasoning Test
- Kalen Howell
- Aug 11
- 2 min read

This post is a bit technical. With all the hype and noise around AI, and specifically OpenAI's bumpy release of GPT‑5, I wanted to do some analysis myself. It was a fun little project I completed on a Sunday evening, and it turned out pretty successful.
I built this project to get a clear look at how OpenAI's o3 model compares with the new GPT‑5 model on tasks that require reasoning. Key point: I am not a Ph.D. I consider myself an AI power user who uses AI to strategize, analyze, build solutions, and explore various problem domains. o3 is (or was) my ChatGPT daily driver, so my test focused on that space.
Setup at a glance: both o3 and GPT‑5 were accessed via their APIs, and I used the Zen MCP server to run the models with the same prompts, settings, and stop rules, and to store and evaluate the results. I used Claude Code with Opus 4.1 as my co-designer for the test suite and the scoring rubrics, and as the final evaluator. I ran the suite on August 10, 2025.
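The core idea of the harness, identical prompts and settings for both models, can be sketched roughly like this. Everything here is illustrative: the model names, the `call_model` stub, and the settings values are placeholders, not the actual Zen MCP configuration I used.

```python
from dataclasses import dataclass

# Shared generation settings, applied identically to both models
# (values are illustrative, not the ones from the actual test run).
SETTINGS = {"temperature": 0.2, "max_tokens": 4096}

@dataclass
class Run:
    model: str
    prompt_id: str
    settings: dict
    output: str

def call_model(model: str, prompt: str, settings: dict) -> str:
    """Stub standing in for the real API call (e.g. via the Zen MCP server)."""
    return f"<response from {model}>"

def run_suite(models: list[str], prompts: dict[str, str]) -> list[Run]:
    """Run every prompt against every model with the same shared settings."""
    runs = []
    for model in models:
        for pid, prompt in prompts.items():
            out = call_model(model, prompt, SETTINGS)
            runs.append(Run(model, pid, dict(SETTINGS), out))
    return runs

runs = run_suite(
    ["o3", "gpt-5"],
    {"hanging": "State and resolve the Unexpected Hanging paradox."},
)
```

The point of the structure is auditability: every `Run` records which model saw which prompt under which settings, so the stored results can be compared side by side.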
What did I test? Three bite‑size but demanding areas chosen to stress reasoning rather than trivia:
(1) a “surprise” logic puzzle from game theory (Unexpected Hanging)
(2) a quantum‑style decision task that explains order effects in human choices
(3) a thought experiment about computation at the limits (hypercomputation and transfinite reasoning).
Each task had a plain‑English brief, specific success criteria, and a checklist of edge cases.
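A task definition along those lines can be sketched as a small data structure; the field names, the example criteria, and the simple fraction-based scoring below are my illustrative assumptions, not the exact rubric Opus 4.1 and I used.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    brief: str              # plain-English brief
    criteria: list[str]     # specific success criteria
    edge_cases: list[str]   # checklist of edge cases

def score(task: Task, passed: set[str]) -> float:
    """Fraction of criteria and edge cases a model's response satisfied."""
    items = task.criteria + task.edge_cases
    return len(passed & set(items)) / len(items)

# Illustrative rubric for the first task (not the actual checklist).
hanging = Task(
    name="Unexpected Hanging",
    brief="Resolve the surprise-examination paradox in plain English.",
    criteria=[
        "states the paradox correctly",
        "identifies the flawed backward induction",
    ],
    edge_cases=["handles the one-day variant"],
)
```

Scoring a response is then just tallying which checklist items the evaluator marked as passed, which keeps the rubric easy to audit.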
Headline results:

That said, the models had different strengths. Per Opus 4.1: o3 often produced elegant math and kept a practical, implementation‑first focus. GPT‑5 tended to connect ideas across fields, surface alternative angles, and show deeper theoretical awareness.
The suite was designed so anyone can audit it: you can read the prompts, see the outputs side by side, and easily follow the scoring. Amid the noise and hype, my goal was to assess GPT‑5 and reassure myself that I would get at least the same results as with o3 from the new model, though I expected better.
Also note that this is not an official benchmark, nor should it be considered a state-of-the-art suite of tests; it could even be wrong. But for me, I'm reassured that I can continue strategizing and building with GPT‑5 and get the same or better results than with o3.
What are you building with AI?




