"GPT-5.2 vs Opus 4.5: I Gave Both Models the Same Impossible PRD—Here's What Actually Happened

"Which AI model should I use for coding?" If I had a dollar for every time someone asked me that in the last six months, I'd have enough to cover my A...

11 min read
2,180 words
"GPT-5.2 vs Opus 4.5: I Gave Both Models the Same Impossible PRD—Here's What Actually Happened
Featured image for "GPT-5.2 vs Opus 4.5: I Gave Both Models the Same Impossible PRD—Here's What Actually Happened

"Which AI model should I use for coding?"

If I had a dollar for every time someone asked me that in the last six months, I'd have enough to cover my API costs for a week. Maybe.

Here's the thing—most AI model comparisons are useless. They run synthetic benchmarks, test isolated code snippets, and declare winners based on metrics that have nothing to do with how we actually work. Nobody's sitting around asking AI to write bubble sort implementations anymore. We're trying to build actual software.

So I did something different. I took a monstrous Product Requirements Document—one derived from a real, complete application I'd built—and threw it at four different AI models. Same PRD. Same instructions. Two-hour timer. Let's see who survives.

The results surprised me. And not in the way I expected.

The Setup That Breaks Most Models

Before we dive in, let me explain what I was actually testing. This wasn't some toy project or a "build me a todo app" prompt. The PRD I used was massive—detailed technical documentation covering UI components, backend architecture, AI-powered features, personality prompts, recommendation engines, and more.

Think of it as a streaming platform dashboard. Users browse shows, get AI-powered recommendations, explore thematic concepts, see cast and crew information, check ratings from multiple sources, and interact with an "alchemy" feature that lets you mash shows together to discover new content based on combined concepts.

The kind of thing that would take a team weeks to spec out properly. I gave each model the full document and said "start here."
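To make that scope concrete, here is roughly the shape a single show takes in a spec like this. The type and field names below are my own shorthand for illustration, not lifted from the actual PRD:

  // Hypothetical sketch of one entity from a PRD like this one.
  // Names are illustrative only, not taken from the real document.
  interface Person {
    name: string;
    role: string; // e.g. "Lead", "Director", "Composer"
  }

  interface Season {
    number: number;
    episodes: { title: string; synopsis: string }[];
  }

  interface Show {
    id: string;
    title: string;
    seasons: Season[];                 // episode lists per season
    cast: Person[];
    crew: Person[];
    ratings: Record<string, number>;   // keyed by source (critics, audience, ...)
    concepts: string[];                // thematic tags the AI features work from
    financials?: { budget: number; revenue: number }; // profit / ROI displays
    posterUrl?: string;                // key art from an external API
    trailerUrl?: string;               // inline trailer playback
  }

Multiply that by recommendation logic, personality prompts, and the alchemy feature, and you get a sense of why a single pass was never going to cover everything.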

The contenders:

  • GPT-5.1 Codex Max (the previous champion)
  • GPT-5.2 Medium (OpenAI's recommended tier)
  • GPT-5.2 Extra High (maximum thinking, maximum cost)
  • Claude Opus 4.5 (Anthropic's flagship)

I wanted to know: Can these models actually execute on complex, real-world requirements? Not just write functions—but build coherent applications from dense specifications?

The First 30 Minutes: Where Personalities Emerge

Something interesting happened almost immediately.

GPT-5.1 Codex Max jumped straight into action. Files started appearing. Dependencies installed. A server spun up within the first ten minutes. It felt fast. Productive. Like watching a developer who's had too much coffee just start banging out code.

GPT-5.2 models—both Medium and Extra High—took a different approach. They spent more time thinking. Processing. You could almost see them reading through the PRD, trying to understand the full scope before writing a single line.

And then there was Opus 4.5.

Here's where things got interesting. Opus didn't just start coding. It started talking to me. A to-do list appeared. It verbalized what it understood from the PRD. It asked clarifying questions. It said things like "Based on the requirements, I'm going to structure this as..." before doing anything.

I won't lie—at first, this felt slower. But something about it felt right.

The Communication Gap Nobody Talks About

Let me pause here because this became the single most important insight from the entire experiment.

GPT models start coding immediately. They read your prompt, and they execute. No confirmation. No "here's what I understood." Just action.

Opus models communicate first. They acknowledge what you asked for. They explain their interpretation. They show you the roadmap before building the road.

Why does this matter?

Because when you're working with AI on complex tasks, you need to know what it actually understood. If GPT-5.2 misinterprets a requirement and builds the wrong thing, you won't know until you see the broken output. If Opus misinterprets something, it tells you upfront—and you can correct it before any code gets written.

This isn't about which model is "smarter." It's about which one makes collaboration feel less like gambling.

I've worked with junior developers who code faster than seniors. Speed isn't the same as productivity. The best collaborators—human or AI—confirm understanding before execution.

Hour One: The Feature Race

By the sixty-minute mark, the differences became stark.

GPT-5.1 Codex Max had a running application. It looked... functional. Basic CSS, some missing styling, but the core structure was there. Missing features were obvious though—no recommendation engine, limited metadata display, incomplete navigation.

GPT-5.2 Medium showed improvement over 5.1. The UI felt more polished. It hit more of the requested features. But key art was missing. Some interactive elements weren't wired up. The "medium" in its name felt accurate—middle of the road.

GPT-5.2 Extra High was slower but more thorough. The interface was cleaner. More approachable. Logical layout. The recommendation engine worked. Metadata pulled correctly. Cast and crew information displayed properly. It was genuinely impressive—until I noticed the episode lists under TV show seasons were completely missing.

Opus 4.5 was building something different. Richer visual elements. Animations I hadn't specifically requested but made sense. Financial data displays (profit, ROI). Clear breakdowns of ratings and genres. And—this is key—an inline trailer player that none of the GPT models even attempted.

But Opus's UI was also more complex. Busier. Harder to navigate at first glance. It optimized for completeness over simplicity.

The Gap Document Revelation

Here's something that changed everything.

None of the models fully completed the PRD in one shot. Not even close. They all missed features. They all made assumptions. They all had gaps.

So I created what I'm calling a "gap document"—a second prompt that listed everything the initial build missed, essentially asking each model to audit itself against the original requirements and fill in the blanks.

This is where GPT-5.2 Extra High and Opus 4.5 pulled ahead dramatically.

Both models responded to the gap document by systematically addressing missing features. Episode lists appeared. Recommendation logic improved. UI polish increased. Within another hour of iteration, both had reached what I'd estimate as 90-95% feature completeness.

The older GPT-5.1? It struggled. It kept getting confused about what it had already built versus what was being requested. Context management became a problem.

The lesson: No AI model today can fully execute a complex PRD autonomously. But the best ones can get remarkably close with one round of structured feedback. The gap isn't in raw capability—it's in self-auditing.
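If you want to reproduce the gap-document step, the mechanics are easy to script. Here is a minimal sketch, assuming you keep the PRD's requirements as a flat checklist; the function and feature names are mine, not from any particular tool:

  // Minimal sketch of assembling a gap document from a PRD checklist.
  // `prdFeatures` and `isImplemented` are hypothetical stand-ins for
  // however you track requirements and audit the generated build.
  function buildGapDocument(
    prdFeatures: string[],
    isImplemented: (feature: string) => boolean
  ): string {
    const missing = prdFeatures.filter((feature) => !isImplemented(feature));
    if (missing.length === 0) return "All PRD features appear to be present.";

    return [
      "Audit the current build against the original PRD.",
      "The following requirements are missing or incomplete:",
      ...missing.map((feature, i) => `${i + 1}. ${feature}`),
      "Address each item without regressing features that already work.",
    ].join("\n");
  }

  // Example: the result goes back to the model as a follow-up prompt.
  const gapPrompt = buildGapDocument(
    ["Episode lists per season", "Inline trailer player", "Key art from external APIs"],
    (feature) => feature === "Episode lists per season"
  );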

UI Philosophy: Elegance vs Completeness

Let me talk about the visual outputs because they reveal something important about how these models think.

GPT-5.2 Extra High produced what I'd call "developer-friendly" interfaces. Clean. Predictable. The kind of UI that's easy to use because it doesn't try to do too much at once. Good spacing. Logical information hierarchy. If you showed it to a product manager, they'd probably approve it without many changes.

Opus 4.5 built something more ambitious. Every possible data point surfaced. Multiple interaction patterns. Visual richness that bordered on overwhelming. It felt like Opus was trying to prove it understood every requirement by showing all of them simultaneously.

Neither approach is wrong. But they serve different purposes.

If you need a quick prototype to validate an idea, GPT-5.2's cleaner output gets you there faster. If you need comprehensive feature coverage and don't mind refining the presentation later, Opus gives you more to work with.

I found myself using GPT-5.2's output as a UI reference while pulling feature implementations from Opus. The combination worked better than either alone.

The AI Features: Where It Gets Interesting

Both GPT-5.2 and Opus implemented AI-powered features—concept extraction, thematic recommendations, dynamic exploration based on user preferences. This is where the PRD got complicated, and where I expected models to struggle.

They didn't. Not really.

The recommendation engines worked. They understood semantic relationships between shows. They could explain why two seemingly different series might appeal to the same viewer.

Opus's "alchemy" feature stood out. It let users combine multiple shows and extract overlapping thematic concepts to find new recommendations. This was in the PRD, but it required understanding a fairly abstract concept. Opus nailed it. The mashup interface was intuitive, and the recommendations it generated actually made sense.

GPT-5.2 implemented alchemy too, but more literally. It combined shows and spit out recommendations. The magic—the meta-analysis of why certain combinations worked—wasn't there.

For AI-assisted coding of AI features, Opus had the edge. It understood the intent behind the requirements, not just the literal specifications.
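To make that difference concrete, here is roughly what the meta-analysis version of alchemy has to do. This is my own sketch of the idea, not code that either model produced: intersect the thematic concepts the selected shows share, then rank the catalog by how many of those shared concepts each candidate hits.

  // Sketch of concept-overlap "alchemy" (my illustration, not either model's output).
  type Taggable = { id: string; concepts: string[] };

  function alchemy<T extends Taggable>(selected: T[], catalog: T[], limit = 5): T[] {
    if (selected.length === 0) return [];

    // Concepts that every selected show has in common.
    const shared = selected
      .map((show) => new Set(show.concepts))
      .reduce((acc, set) => new Set([...acc].filter((c) => set.has(c))));

    // Score the rest of the catalog by how many shared concepts it covers.
    return catalog
      .filter((show) => !selected.some((s) => s.id === show.id))
      .map((show) => ({
        show,
        score: show.concepts.filter((c) => shared.has(c)).length,
      }))
      .filter((entry) => entry.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map((entry) => entry.show);
  }

The explanation layer Opus added is, roughly, surfacing that shared-concept set to the user instead of throwing it away once the recommendations come back.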

What Nobody Got Right

Let's talk about failures because they matter as much as successes.

Key art. Every model struggled with show images. The PRD specified integration with external APIs for poster art and thumbnails. No model reliably retrieved and displayed them. Some made placeholder boxes. Some ignored them entirely. None got it fully working.

Trailers. Only Opus attempted inline trailer playback. The others either skipped it or generated broken implementations.

Self-auditing. No model naturally checked its own work against the original PRD. They all required explicit gap documents to identify missing features. This suggests current AI models are good at executing but bad at validating their own output.

Media handling in general. Text-based features were solid. Anything involving external assets—images, videos, API integrations—fell short. The models could write the code, but debugging API responses and handling edge cases still required human intervention.
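For what it's worth, the missing ingredient on key art is usually mundane defensive code: validate the response and fall back to a placeholder. Here is a sketch of the guard rail every model skipped; the endpoint and response shape are invented for illustration, not the actual art API the PRD specifies.

  // Defensive poster fetch with a placeholder fallback.
  // The endpoint and response shape are made up for illustration.
  const PLACEHOLDER_POSTER = "/assets/poster-placeholder.png";

  async function fetchPosterUrl(showId: string): Promise<string> {
    try {
      const res = await fetch(`/api/posters/${encodeURIComponent(showId)}`);
      if (!res.ok) return PLACEHOLDER_POSTER; // 404s, rate limits, auth failures

      const data = (await res.json()) as { posterUrl?: unknown } | null;
      const url = data?.posterUrl;

      // Only trust the field if it is actually a usable URL string.
      return typeof url === "string" && url.startsWith("http")
        ? url
        : PLACEHOLDER_POSTER;
    } catch {
      return PLACEHOLDER_POSTER; // network errors, malformed JSON
    }
  }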

Speed vs Quality: The Real Trade-off

Let me give you concrete numbers.

GPT-5.1 Codex Max: Fastest initial output. Lowest feature completeness. Most bugs. Time to "usable" prototype: ~45 minutes. Time to "complete" application: Couldn't get there in two hours.

GPT-5.2 Medium: Moderate speed. Reasonable feature coverage. Some polish issues. Time to usable prototype: ~60 minutes. Time to near-complete: ~100 minutes with gap document.

GPT-5.2 Extra High: Slowest initial output. High feature completeness. Clean UI. Time to usable prototype: ~75 minutes. Time to near-complete: ~110 minutes with gap document.

Opus 4.5: Fast-ish initial output (server started early). Highest feature completeness. Complex UI. Time to usable prototype: ~50 minutes. Time to near-complete: ~95 minutes with gap document.

If you're optimizing purely for speed, GPT-5.2 Medium makes sense. If you're optimizing for completeness and don't mind iteration, Opus 4.5 delivers more features per dollar spent.

The Cost Equation Nobody Wants to Talk About

These models aren't free. And the "Extra High" tier of GPT-5.2 costs significantly more than Medium.

Here's my honest take: GPT-5.2 Extra High didn't justify its premium over Medium for this task. Yes, the output was cleaner. Yes, it caught a few more requirements. But the marginal improvement wasn't worth 3-4x the cost.

Opus 4.5, despite being expensive, felt worth it. The communication style alone saved debugging time. The feature completeness reduced iteration cycles. The understanding of complex AI requirements meant less rework.

Cost efficiency matters. Anthropic's model delivered more value per conversation turn than OpenAI's highest tier. That's not a fanboy statement—it's accounting.

What This Means for Real Development

Let me be direct: Neither GPT-5.2 nor Opus 4.5 is replacing developers anytime soon.

But they're replacing months of work with hours of work. That's not nothing. That's transformative.

The application I used as the basis for this PRD took a team several weeks to build originally. These AI models got to 90-95% feature parity in under two hours. The remaining 5-10%—the edge cases, the polish, the debugging—still needs human attention. But the heavy lifting? Done.

This changes how we think about prototyping. It changes how we validate ideas. It changes the economics of building software.

We're not at "prompt and ship" yet. We're at "prompt, review, iterate, and ship faster than ever before." That's still a massive shift.

My Recommendation

After running this experiment, here's what I'd tell another developer:

Use Opus 4.5 when:

  • You're building something complex with AI-driven features
  • You need high feature completeness and don't mind UI refinement
  • You value communication and want to verify understanding before execution
  • You're willing to pay premium for reduced iteration cycles

Use GPT-5.2 Medium when:

  • You need quick prototypes for validation
  • The UI matters more than feature depth
  • Budget is a primary constraint
  • You're comfortable with more hands-on debugging

Skip GPT-5.2 Extra High. For most use cases, it doesn't justify the cost premium over Medium. The incremental improvements aren't enough.

Don't bother with GPT-5.1 Codex Max anymore. It's been surpassed. The newer models are meaningfully better.

The Bigger Picture

We're watching AI coding tools evolve in real-time. Six months ago, asking a model to execute a complex PRD would have been laughable. Today, it's viable—with caveats.

The gap between "AI-generated code" and "production-ready code" is shrinking. It's not gone. But it's shrinking fast.

The developers who figure out how to collaborate effectively with these tools—who learn to write better PRDs, who master the art of gap documents, who understand which model to use for which task—they're going to have an enormous advantage.

This isn't about being replaced. It's about being amplified.

The best human developers will use AI to multiply their output. The ones who resist will wonder why everyone else ships faster.

I know which side I'm choosing.

Final Thoughts

If you take one thing from this experiment, let it be this: Stop evaluating AI models on synthetic benchmarks. Test them on your actual work. Give them your messy, complex, real-world requirements. See what happens.

The results might surprise you.

They certainly surprised me.


About the Author

Engr Mejba Ahmed

I'm Engr. Mejba Ahmed, a Software Engineer, Cybersecurity Engineer, and Cloud DevOps Engineer specializing in Laravel, Python, WordPress, cybersecurity, and cloud infrastructure. Passionate about innovation, AI, and automation.
