GPT-5.4 Mini and Nano: OpenAI Just Validated the Multi-Model Agent Architecture

OpenAI released GPT-5.4 mini and nano this week — the latest entries in a model lineup that's getting more deliberately tiered. The headline numbers are strong: GPT-5.4 mini scores 54.4% on SWE-Bench Pro versus 45.7% for GPT-5 mini, runs more than 2x faster, and approaches the performance of the full GPT-5.4 model on several evaluations. GPT-5.4 nano sits below it, optimized for classification, data extraction, ranking, and simple coding tasks at $0.20 per million input tokens.
The specs matter, but the more significant signal is the use case framing OpenAI chose. The announcement explicitly describes these models as built for subagent roles in hierarchical AI systems — where a larger model plans and coordinates, while smaller models execute quickly in parallel. That architectural pattern has real implications for how enterprise teams should think about building and running multi-agent systems.
The subagent tier is now explicit
Previous model releases treated smaller models as cost-reduced versions of larger ones — same use cases, lower capability and price. GPT-5.4 mini and nano are framed differently. OpenAI describes them as optimized specifically for the executor role in multi-model systems: fast, tool-reliable, capable enough for well-defined subtasks, but not the reasoning center of a system.
The Codex integration makes this concrete. GPT-5.4 handles planning and final judgment while delegating to GPT-5.4 mini subagents that handle narrower tasks in parallel — searching a codebase, reviewing a large file, processing supporting documents. GPT-5.4 mini uses 30% of the GPT-5.4 quota, so running multiple subagents in parallel becomes cost-tractable.
OpenAI's framing: "Instead of using one model for everything, developers can compose systems where larger models decide what to do and smaller models execute quickly at scale." That's close to how production-grade multi-agent systems actually need to work.
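The orchestrator/subagent split described above can be sketched in a few lines. This is a minimal sketch, not OpenAI's implementation: `call_model` is a stub standing in for whatever API client your stack uses, and the model names simply follow the announcement.

```python
from concurrent.futures import ThreadPoolExecutor


def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call (e.g. an SDK request).
    # Stubbed here so the sketch is self-contained and runnable.
    return f"[{model}] {prompt}"


def run_with_subagents(task: str, subtasks: list[str]) -> str:
    # Fan narrow, well-scoped subtasks out to the cheaper executor tier
    # in parallel -- searching a codebase, reviewing a large file, etc.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(
            lambda sub: call_model("gpt-5.4-mini", sub), subtasks))
    # The orchestrator tier sees only the condensed findings and keeps
    # planning and final judgment for itself.
    summary = "\n".join(results)
    return call_model("gpt-5.4", f"Task: {task}\nSubagent findings:\n{summary}")
```

In a real system the subagent results would be filtered or summarized before reaching the orchestrator, but the shape is the same: the large model decides, the small models execute in parallel.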
What the benchmark numbers mean in practice
The performance gaps are not uniform, and that's worth reading carefully.
On SWE-Bench Pro, GPT-5.4 mini (54.4%) is close to GPT-5.4 (57.7%) and well ahead of GPT-5 mini (45.7%). On OSWorld-Verified for computer use, mini (72.1%) nearly matches the full model (75.0%). These are the numbers that matter for agentic tasks — coding, tool use, multimodal reasoning.
The gaps are wider on long-context tasks. On MRCR v2 with 128K–256K context, mini drops to 33.6% against GPT-5.4's 79.3%. That tells you where mini isn't the right fit: tasks requiring deep reasoning across very long documents. For narrower, well-scoped subtasks — the subagent role — the performance profile holds.
GPT-5.4 nano trades more capability for speed and cost. At $0.20/1M input tokens, it's priced for high-volume classification and routing work. Its 52.4% on SWE-Bench Pro still beats the previous-generation GPT-5 mini (45.7%), making it a meaningful upgrade for simple coding subtasks even at nano pricing.
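To make the high-volume framing concrete, here is a back-of-the-envelope cost estimate using the announced $0.20/1M input price. The workload numbers (a million short classification calls a day) are illustrative assumptions, not figures from the announcement, and output-token cost is excluded.

```python
def monthly_input_cost(requests_per_day: int, tokens_per_request: int,
                       price_per_m_input: float, days: int = 30) -> float:
    # Straight-line input-token cost estimate; output tokens are billed
    # separately and not included here.
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_m_input


# At nano's $0.20/1M input rate, 1M calls/day at ~200 input tokens each:
cost = monthly_input_cost(1_000_000, 200, 0.20)  # 1200.0 dollars/month
```

At that scale the input bill lands around $1,200/month, which is the kind of arithmetic that makes nano-tier routing and classification economical where a full-size model would not be.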
Tool use is the underrated capability
One number worth attention: tool calling. GPT-5.4 mini scores 93.4% on τ2-bench versus GPT-5 mini's 74.1%, and 57.7% on MCP Atlas versus 47.6%. For enterprise AI agents, tool use reliability is often the binding constraint. An agent that reasons well but calls tools incorrectly creates failures that are hard to catch and harder to debug. The improvement in tool-calling accuracy at mini latencies is probably more practically significant than the headline coding benchmarks for most production workflows.
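One defensive pattern for the failure mode described above is validating a model-proposed tool call against the declared tool schema before executing it. The sketch below is illustrative and framework-agnostic: the dict shapes are assumptions for the example, not an OpenAI API.

```python
def validate_tool_call(call: dict, tools: dict) -> list[str]:
    # Check a model-proposed tool call against its declared spec before
    # executing it. Unknown tool names, missing required arguments, and
    # unexpected arguments are the common failure modes worth catching.
    spec = tools.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]
    errors = []
    args = call.get("arguments", {})
    for req in spec.get("required", []):
        if req not in args:
            errors.append(f"missing required argument: {req!r}")
    for arg in args:
        if arg not in spec.get("parameters", {}):
            errors.append(f"unexpected argument: {arg!r}")
    return errors
```

Higher tool-calling benchmark scores reduce how often this check fires, but in production the check itself is still worth keeping: a rejected call can be retried with the validation errors fed back to the model instead of silently corrupting downstream state.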
Pricing and availability
GPT-5.4 mini: $0.75/1M input, $4.50/1M output. 400k context window. Available today in the API, Codex, and ChatGPT. Supports text and image inputs, tool use, function calling, web search, file search, computer use, and skills.
GPT-5.4 nano: $0.20/1M input, $1.25/1M output. API only.
For comparison, GPT-5.4 mini uses 30% of the GPT-5.4 quota, so shifting executor work down a tier cuts cost substantially relative to running everything on GPT-5.4, while maintaining near-parity on the agentic tasks most relevant to enterprise automation.
What this means for enterprise AI architecture
The practical implication is that the case for hierarchical multi-agent architectures just got stronger. Running a high-capability orchestrator alongside faster, cheaper executors is now better supported by the underlying model capabilities — and explicitly validated as a design pattern by OpenAI. For teams running AI agents in production, this reinforces a few design principles: scope subtasks narrowly enough that smaller models can execute reliably, keep complex reasoning and final judgment with the orchestrator, and design tool interfaces to be model-agnostic so you can route tasks to the right tier as capabilities evolve.
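The "model-agnostic tool interfaces" principle can be sketched as a small indirection layer. Everything here is a hypothetical illustration, assuming the model names from the announcement: orchestration code talks to an executor interface, and the model-to-tier mapping lives in one config table.

```python
from typing import Protocol


class Executor(Protocol):
    # Any model tier can sit behind this interface; orchestration code
    # never names a specific model.
    def run(self, subtask: str) -> str: ...


class ModelExecutor:
    def __init__(self, model: str):
        self.model = model

    def run(self, subtask: str) -> str:
        # Placeholder for a real API call; stubbed so the sketch runs as-is.
        return f"{self.model}: {subtask}"


# The routing table is configuration, so re-tiering a task type as model
# capabilities evolve is a one-line change rather than a refactor.
EXECUTORS: dict[str, Executor] = {
    "extract": ModelExecutor("gpt-5.4-nano"),
    "code": ModelExecutor("gpt-5.4-mini"),
    "plan": ModelExecutor("gpt-5.4"),
}


def execute(kind: str, subtask: str) -> str:
    return EXECUTORS[kind].run(subtask)
```

The point of the indirection is the last paragraph's argument in code: when the next mini-tier model ships, only the `EXECUTORS` table changes.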
The model landscape is moving fast. What GPT-5.4 mini does today at mini pricing would have required a full-size model twelve months ago. Teams that design their agent architectures around capability tiers — rather than around specific models — will have more room to take advantage of that trajectory.