
Google Gemma 4 Under Apache 2.0: What Changes for Enterprise AI Agents

Google recently released Gemma 4 under the Apache 2.0 license. The release spans four model sizes, from edge-ready variants to a 31B dense model that currently ranks #3 on the Arena AI leaderboard.

The benchmarks are impressive. The license change is what actually matters for enterprise teams.

Previous Gemma releases shipped under a custom Google license with restrictions on commercial use and content policies. That made legal teams nervous and procurement slow. Gemma 4 ships under the same permissive license as Linux, Kubernetes, and most of the software stack enterprises already run. No usage thresholds, no geographic restrictions, no acceptable use policies beyond what the law already requires.

For teams building AI agents in production, this is the first time a top-3 open model comes with zero licensing friction.

What Gemma 4 Actually Ships With

Four model sizes, each designed for a different deployment scenario:

Gemma 4 31B Dense is the flagship. It scores 85.2% on MMLU Pro, 89.2% on AIME 2026 math benchmarks, and 80.0% on LiveCodeBench v6. On the Arena AI leaderboard it sits at 1452, outperforming models with 20 times its parameter count. This is the model you fine-tune for domain-specific agent tasks.

Gemma 4 26B MoE (Mixture of Experts) trades a small amount of quality for significantly better latency. It activates only 3.8 billion of its 25.2 billion parameters during inference, which means faster tokens-per-second at lower compute cost. It scores 1441 on Arena AI, close enough to the 31B that most production workloads will not notice the difference. For high-throughput agent orchestration where you are running dozens of parallel calls, this is the practical choice.

Gemma 4 E4B and E2B are the edge models. The E2B runs in under 1.5GB of memory with 2-bit and 4-bit quantization, which means it fits on a phone, a Raspberry Pi, or an NVIDIA Jetson Orin Nano. Both edge models support a 128K context window and run completely offline with near-zero latency. For enterprises that need agent capabilities on factory floors, in field operations, or anywhere a cloud round-trip is not viable, these models make that possible without building custom infrastructure.
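The memory figures follow from simple arithmetic: weight storage is roughly parameter count times bits per weight. A sketch of that estimate, where the ~2B parameter count for E2B and the 10% overhead for activations and bookkeeping are assumptions, not published specs:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough on-device memory estimate: weight storage plus ~10% overhead
    (assumed) for activations and KV-cache bookkeeping."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# E2B at an assumed ~2e9 parameters with 4-bit quantization
print(round(quantized_size_gb(2e9, 4), 2))  # prints 1.1
```

At 4-bit that lands around 1.1GB, consistent with the "under 1.5GB" figure; 2-bit quantization roughly halves it again.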

All four sizes share the same architecture improvements: alternating local sliding-window and global full-context attention layers, and a new technique called Per-Layer Embeddings that adds a parallel conditioning pathway alongside the main residual stream. The practical result is better performance per parameter across the board.

Native Function Calling Changes the Agent Equation

The feature that matters most for AI agent platforms is native function calling across all four model sizes.

Gemma 4 supports defining tools as JSON schemas, and the model natively generates structured tool calls in response. No prompt engineering hacks, no output parsing, no hoping the model follows your formatting instructions. You define the interface, the model calls it correctly.
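A minimal sketch of that pattern, defining a tool as a JSON schema and dispatching a structured call the model emits. The tool name, wire format, and registry are illustrative assumptions; the exact format Gemma 4 expects may differ:

```python
import json

# Hypothetical tool definition in the JSON-schema style the article describes.
get_invoice_status = {
    "name": "get_invoice_status",
    "description": "Look up the payment status of an invoice.",
    "parameters": {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
}

def dispatch(tool_call_json: str, registry: dict) -> str:
    """Route a structured tool call emitted by the model to local code."""
    call = json.loads(tool_call_json)
    fn = registry[call["name"]]
    return fn(**call["arguments"])

registry = {"get_invoice_status": lambda invoice_id: f"{invoice_id}: paid"}

# A structured call as the model might emit it.
model_output = '{"name": "get_invoice_status", "arguments": {"invoice_id": "INV-1042"}}'
print(dispatch(model_output, registry))  # prints INV-1042: paid
```

The point of native function calling is that the model's side of this exchange is reliably machine-parseable, so the dispatch layer stays this thin.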

This works alongside structured JSON output and multi-step planning. In practice, that means you can build an agent workflow where the model receives a task, breaks it into steps, calls external APIs at each step, and returns structured results. All natively, without the brittle scaffolding that most open-source agent frameworks currently require.
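The task-to-steps-to-APIs loop described above can be sketched as a short skeleton. Here `call_model` stands in for your actual Gemma 4 inference call, and the action format (a JSON object with a `type` field) is an assumption for illustration:

```python
import json

def run_agent(task: str, call_model, tools: dict, max_steps: int = 5) -> dict:
    """Multi-step loop: the model plans, calls tools, and returns a
    structured final result. `call_model` is a placeholder, not a real API."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = json.loads(call_model(history))   # model emits a JSON action
        if action["type"] == "final":              # structured final answer
            return action["result"]
        out = tools[action["tool"]](**action["arguments"])
        history.append({"role": "tool", "content": str(out)})
    raise RuntimeError("agent did not finish within max_steps")

# Stubbed model: one tool call, then a final structured answer.
calls = iter([
    '{"type": "tool", "tool": "add", "arguments": {"a": 1, "b": 2}}',
    '{"type": "final", "result": {"sum": 3}}',
])
print(run_agent("add 1 and 2", lambda h: next(calls), {"add": lambda a, b: a + b}))
```

Because each model turn is structured, the loop needs no regex parsing or retry-on-malformed-output scaffolding.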

For enterprise teams, structured output is the feature that determines whether a model is production-ready or permanently stuck in prototyping. When an agent processes an invoice, routes a support ticket, or scores a candidate, the downstream system needs structured data. Not prose. Not "here's my analysis." A JSON object with the fields your system expects, every time.

Gemma 4 delivers this at the model level rather than as a post-processing layer. That removes an entire class of production failures where the model generates valid reasoning but invalid output format.
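Even with model-level structured output, production systems typically keep a thin validation gate so a malformed response fails loudly instead of corrupting downstream state. A minimal sketch with hypothetical field names for a ticket-routing agent:

```python
import json

# Assumed schema for illustration; substitute your system's real fields.
EXPECTED_FIELDS = {"ticket_id": str, "category": str, "priority": int}

def parse_agent_output(raw: str) -> dict:
    """Fail fast if the model's output is not the JSON the downstream system expects."""
    data = json.loads(raw)  # raises on prose or malformed output
    for field, typ in EXPECTED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

print(parse_agent_output('{"ticket_id": "T-9", "category": "billing", "priority": 2}'))
```

When the model emits valid structure at the source, this layer becomes a cheap safety net rather than the elaborate repair logic prompt-based approaches require.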

Extended thinking mode is also available: the model can be configured to reason through complex problems step by step before producing a final answer. For agent tasks that involve multi-criteria decisions, conditional logic, or ambiguous inputs, this is the difference between an agent that handles the easy 80% and one that handles the hard 20% that actually creates value.

The License Gap Nobody Talks About

Open-source AI models have a licensing problem that most technical evaluations ignore completely.

Meta's Llama 4, the other dominant open model family, ships under the Llama 4 Community License. It provides access to weights but imposes real restrictions: applications exceeding 700 million monthly active users require a separate commercial agreement. The Acceptable Use Policy restricts entire categories of applications. The Open Source Initiative has repeatedly stated that Llama licensing does not meet the Open Source Definition.

For a startup building a consumer app, these restrictions might never matter. For an enterprise deploying AI agents across business operations, they create procurement friction that slows adoption by months.

Enterprise legal teams evaluate AI model licenses the same way they evaluate any software dependency. Can we modify it? Can we distribute derivative works? Are there usage thresholds we might hit? Does the acceptable use policy conflict with any of our business activities? With Llama, the answer to several of these questions is "it depends" or "check with Meta." With Apache 2.0, the answer is uniformly "yes, go ahead."

Google is not the first to use Apache 2.0 for AI models. Qwen 3.6 and Mistral Small 4 also use it. But Gemma 4 is the first Apache 2.0 model to rank in the top 3 globally. That combination of capability and licensing clarity did not exist before April 2.

What This Means for Enterprise AI Agent Teams

Three implications that matter for teams running agent workflows in production:

Fine-tuning without legal review. Apache 2.0 means you can fine-tune Gemma 4 on proprietary data and deploy the resulting model commercially without additional licensing. For enterprises building domain-specific agents for finance, HR, or procurement, this removes the legal overhead that made fine-tuning open models impractical. You own your fine-tuned model the same way you own any code you write.

Edge deployment becomes real. The E2B and E4B models are not just smaller versions of the big model. They are purpose-built for on-device inference with multimodal capabilities, offline operation, and memory footprints that fit on commodity hardware. For manufacturing, logistics, or field service operations, this means agent intelligence at the point of work. No cloud dependency, no latency, no data leaving the device.

Multi-model orchestration gets a new option. Most production agent systems already use multiple models for different tasks. A reasoning model for complex decisions, a fast model for classification, a code model for structured output. Gemma 4's range from 2B to 31B, all under the same license and architecture, means you can build a coherent multi-model stack from a single family. The 31B handles complex reasoning, the 26B MoE handles high-throughput classification, and the E4B handles edge inference. Same fine-tuning approach, same output format, same license across all of them.
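The single-family stack described above reduces routing to a lookup table. A sketch, where the model identifiers and task labels are illustrative, not official names:

```python
# Hypothetical routing of agent tasks to Gemma 4 sizes in a multi-model stack.
MODEL_BY_TASK = {
    "complex_reasoning": "gemma-4-31b",      # flagship dense model
    "classification": "gemma-4-26b-moe",     # high-throughput MoE
    "edge_inference": "gemma-4-e4b",         # on-device variant
}

def pick_model(task_type: str) -> str:
    """Route by task type; default to the MoE as the throughput workhorse."""
    return MODEL_BY_TASK.get(task_type, "gemma-4-26b-moe")
```

Because all sizes share one license, architecture, and output format, the router only has to choose a size, not reconcile incompatible model families.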

The Speed Question

The community has flagged one concern worth acknowledging: inference speed. Early benchmarks show the 31B model running slower than comparably sized competitors on some providers. The 26B MoE addresses this for latency-sensitive workloads by activating only 3.8B parameters per forward pass, but teams evaluating Gemma 4 should benchmark on their specific hardware before committing to production deployment.
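A benchmark on your own hardware does not need to be elaborate. A crude throughput harness, where `generate` is any callable wrapping your real inference call (the stub below is a placeholder):

```python
import time

def tokens_per_second(generate, prompt: str, n_runs: int = 3) -> float:
    """Average tokens/sec over several runs of `generate`, a callable that
    returns a list of tokens. Swap in a real model call before trusting numbers."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        rates.append(len(tokens) / (time.perf_counter() - start))
    return sum(rates) / len(rates)

# Stub generator standing in for a real inference call.
fake_generate = lambda prompt: ["tok"] * 128
print(f"{tokens_per_second(fake_generate, 'hello'):.0f} tok/s")
```

Run the same harness against each candidate quantization and provider; the comparison matters more than any single absolute number.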

This is a solvable problem. Quantization, provider optimization, and the MoE variant all offer paths to acceptable latency. But it is real, and pretending otherwise would not help anyone planning a deployment.

The Bigger Picture

The Gemma family has been downloaded over 400 million times, with more than 100,000 community variants. Gemma 4 under Apache 2.0 will accelerate this.

But the real story is not about one model release. It is about the gap closing between proprietary and open models. A year ago, choosing an open model meant accepting significantly worse performance. Today, a 31B parameter model you can run on your own hardware, fine-tune on your own data, and deploy without a license review outperforms models 20 times its size.

For enterprise teams building AI agents, the question is no longer whether open models are good enough. It is whether the operational advantages of running your own models (data privacy, latency control, cost predictability, and full customization) outweigh the convenience of API-only access.

With Gemma 4, that calculation just tipped further toward running your own stack.

Start Today

Start building AI agents to automate processes

Join our platform and start building AI agents for various types of automations.
