Thursday, March 5, 2026

GPT-5.4 Is Here -- and It Looks a Lot More Like a Coworker


OpenAI’s newest release, GPT-5.4, feels less like a routine model update and more like a clear statement about where AI is headed next: toward real professional work.

That’s the big idea behind the launch. GPT-5.4 is being positioned as a model for people who don’t just want clever answers—they want polished spreadsheets, usable presentations, better code, stronger research, and agents that can actually move through multi-step workflows without constantly needing rescue. There’s also a GPT-5.4 Pro tier for users who want maximum performance on harder tasks.

What stands out most is not any single feature, but how many strengths have been folded into one system. GPT-5.4 combines reasoning, coding, tool use, vision, long-context handling, and computer interaction into a single model. In plain English: it’s trying to become the model you use when the task starts looking like actual work.

The benchmark story backs that up.

On GDPval, OpenAI’s benchmark for well-specified knowledge work across 44 occupations, GPT-5.4’s outputs win or tie against those of industry professionals 83.0% of the time. That’s a sizable jump from 70.9% for GPT-5.2. This is one of the most telling numbers in the release, because GDPval is not about trivia or math puzzles. It is about producing things professionals actually make: sales decks, accounting spreadsheets, schedules, diagrams, and other business deliverables.

That theme shows up again in more specialized evaluations. On internal investment banking spreadsheet modeling tasks, GPT-5.4 scores 87.3%, up from 68.4% for GPT-5.2. In presentations, human raters preferred GPT-5.4 outputs 68% of the time over GPT-5.2, citing stronger aesthetics, more visual range, and better use of generated imagery. In other words, the model is not only getting more accurate—it is getting better at making work products people would actually want to send.

Coding remains a major part of the story too. GPT-5.4 inherits the strengths of GPT-5.3-Codex and edges past it on SWE-Bench Pro, scoring 57.7% versus 56.8%. That is not an enormous leap, but it matters because GPT-5.4 is doing this while also being a broader general-purpose model. It is not just a coding specialist. On Terminal-Bench 2.0, GPT-5.4 posts 75.1%, slightly behind GPT-5.3-Codex at 77.3%, which is a useful reminder that “best overall” does not mean “best on every single benchmark.” Still, the overall message is that coding ability has been preserved while the rest of the model has grown significantly.

One of the most interesting upgrades is computer use. GPT-5.4 is the first general-purpose OpenAI model with native computer-use capabilities, meaning it can interpret screenshots, navigate interfaces, and interact with software using mouse and keyboard style actions. On OSWorld-Verified, which evaluates desktop task completion, GPT-5.4 reaches 75.0%, beating GPT-5.2’s 47.3% and even surpassing the reported human baseline of 72.4%. That is a headline-level result, because it suggests the model is no longer just advising users what to click—it can increasingly operate in digital environments itself.

Its vision results also improve. GPT-5.4 scores 81.2% on MMMU-Pro without tools, up from 79.5% for GPT-5.2, and shows better document parsing on OmniDocBench with a lower normalized edit distance of 0.109 versus 0.140. It also introduces higher-fidelity image input options, which should matter for dense screenshots, large documents, and tasks where visual precision affects performance.

Then there is tool use, which may be the most important capability for serious agent workflows. GPT-5.4 improves on Toolathlon, scoring 54.6% versus 45.7% for GPT-5.2, and reaches 67.2% on MCP Atlas. OpenAI is also introducing “tool search,” which lets the model pull in tool definitions only when needed instead of stuffing every tool into the prompt upfront. In OpenAI’s example, this cut token usage by 47% while maintaining the same accuracy. That is a practical improvement, not just a benchmark win: cheaper, faster, cleaner workflows.
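To make the idea concrete, here is a minimal Python sketch of the retrieval pattern behind "tool search": rank the available tool definitions against the user's request and include only the top matches in the prompt, rather than every definition. The registry, names, and word-overlap scoring are invented for illustration; OpenAI's actual mechanism is internal to the model and API.

```python
# Hypothetical illustration of the "tool search" pattern: only the tool
# definitions relevant to the current request are placed in the prompt,
# instead of every tool the application supports.

TOOL_REGISTRY = {
    "get_weather": {"description": "Fetch the current weather for a city."},
    "send_email": {"description": "Send an email to a recipient."},
    "query_database": {"description": "Run a read-only SQL query."},
    "create_calendar_event": {"description": "Add an event to a calendar."},
}

def search_tools(user_message: str, registry: dict, limit: int = 2) -> dict:
    """Score tools by word overlap with the request; keep the top matches."""
    words = set(user_message.lower().split())

    def score(item):
        name, spec = item
        desc_words = set(spec["description"].lower().split())
        desc_words |= set(name.split("_"))
        return len(words & desc_words)

    ranked = sorted(registry.items(), key=score, reverse=True)
    return dict(ranked[:limit])

def build_prompt(user_message: str) -> dict:
    # Only the matched definitions reach the prompt; with a large registry,
    # this is where the token savings come from.
    return {
        "message": user_message,
        "tools": search_tools(user_message, TOOL_REGISTRY),
    }
```

With a registry of dozens of tools, passing two definitions instead of all of them is exactly the kind of prompt-size reduction the reported 47% token saving points at, though the real system presumably uses far more robust retrieval than word overlap.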

Web research is another area where GPT-5.4 appears stronger. On BrowseComp, it reaches 82.7%, while GPT-5.4 Pro hits 89.3%, compared with 65.8% for GPT-5.2. That suggests a noticeable jump in persistent, multi-step browsing—the kind needed for hard-to-find information rather than quick fact lookups.

There are also quality-of-life improvements in ChatGPT itself. GPT-5.4 Thinking can now give an upfront plan on longer tasks, and users can redirect it mid-response. That may sound small, but it changes the interaction style: less “ask, wait, retry,” and more “steer while it works.”

The pricing reflects the upgrade. GPT-5.4 costs more than GPT-5.2 in the API—$2.50 per million input tokens versus $1.75, and $15 per million output tokens versus $14—but OpenAI argues that the model’s greater token efficiency can reduce total usage on many tasks. GPT-5.4 Pro, as expected, is much pricier.
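The token-efficiency argument is easy to check with arithmetic. The rates below are the ones quoted above; the token counts are invented purely to illustrate the break-even effect, not measurements from either model.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in USD for one request, given per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# (input rate, output rate) in USD per million tokens, from the launch pricing
GPT_5_2 = (1.75, 14.0)
GPT_5_4 = (2.50, 15.0)

# Hypothetical task: both models read the same 40k-token context, but the
# newer model finishes in fewer output tokens (illustrative numbers only).
cost_52 = request_cost(40_000, 8_000, *GPT_5_2)  # $0.182
cost_54 = request_cost(40_000, 5_000, *GPT_5_4)  # $0.175
```

In this made-up scenario, cutting output tokens by a bit over a third is enough for the pricier model to come out cheaper per task; whether that holds in practice depends entirely on real token counts for real workloads.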

The simplest way to read this launch is that OpenAI is no longer just shipping smarter chatbots. It is shipping models designed to function as capable digital workers: better at research, better at documents, better at code, better at tools, and increasingly able to act instead of only respond.

GPT-5.4 is not just trying to sound intelligent. It is trying to be useful where usefulness is hardest to fake: in the messy middle of real work.

