NVIDIA Launches Nemotron 3 Ultra: A 550-Billion-Parameter Open AI Model Built for Long-Running Agents

NVIDIA’s latest open model promises up to five times faster inference and 30% lower costs for complex AI agents handling coding, research and enterprise tasks.

There’s a familiar frustration in the world of enterprise AI right now. Businesses want to build AI systems that can actually *do things* — not just answer a question, but plan, reason, call tools, write and debug code, and keep going across long, complex workflows. The problem has been that the models capable of doing this well are either eye-wateringly expensive to run, locked behind proprietary APIs, or too slow to be practical. NVIDIA says it’s taken a serious swing at fixing that.

The company has announced the release of Nemotron 3 Ultra, the largest and most capable model in its Nemotron 3 family, sitting at the top of a range that also includes the smaller Nano and Super variants. It’s a big number: 550 billion total parameters. But here’s the clever bit — thanks to a design approach called Mixture-of-Experts (MoE), only 55 billion of those parameters are actually active at any given moment when the model is processing a request. Think of it like a very large team of specialists where only the relevant experts are called into the room for each job, rather than everyone turning up at once. That keeps things faster and cheaper than you’d expect for a model this size.
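To make the "specialists in the room" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only, not NVIDIA's implementation: the ten toy experts, the router scores, and the names `route_top_k` and `moe_forward` are all hypothetical, and the article does not say how many experts Nemotron 3 Ultra has or how many it selects per token. Only the 550 billion and 55 billion figures come from the announcement.

```python
import math

# Figures from the announcement: 55B of 550B parameters are active per request.
TOTAL_PARAMS = 550e9
ACTIVE_PARAMS = 55e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # one tenth of the model


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalised so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and blend their outputs by weight."""
    chosen = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]


# Hypothetical toy setup: ten experts, each just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 11)]
router_scores = [0.2, -1.0, 0.5, 3.0, 0.1, 0.0, -0.3, 1.2, 0.4, -2.0]

output, used = moe_forward(2.0, experts, router_scores, k=1)
print(f"active fraction: {active_fraction:.0%}")
print(f"experts used: {used}, output: {output}")
```

With one of ten toy experts selected, nine are never evaluated for this input, which is where the speed and cost savings come from in a real MoE model.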

What Makes Nemotron 3 Ultra Different

The architecture behind Nemotron 3 Ultra is, frankly, a mouthful. NVIDIA describes it as a hybrid Mamba-Transformer design, combining the Mamba state-space approach with traditional attention mechanisms, with several further techniques layered in. LatentMoE improves accuracy per active parameter. Multi-Token Prediction (MTP) lets the model predict several tokens at once in a single forward pass, boosting throughput. And NVFP4, a numerical format developed by NVIDIA, makes both training and inference more efficient.

The result, according to NVIDIA’s technical report, is up to around 5.9 times the inference throughput of GLM-5.1-754B-A40B, 4.8 times that of Kimi-K2.6-1T-A32B, and 1.6 times that of Qwen-3.5-397B-17B, all tested on a benchmark with an 8,000-token input and a 64,000-token output, while maintaining comparable accuracy across reasoning and agentic benchmarks. NVIDIA’s public messaging rounds this to “up to 5× faster inference” and “up to 30% lower cost” for agentic tasks. These are vendor-supplied figures and haven’t been independently verified by UK regulatory or research bodies.

One number that stands out is the context window: up to one million tokens. In practical terms, that means an AI agent can hold an enormous amount of information in its working memory (entire codebases, long research logs, extended plans) without losing track of what it was doing. That’s particularly relevant for the kind of multi-step, long-running tasks NVIDIA is targeting.

Genuinely Open — Weights, Data and All

What sets Nemotron 3 Ultra apart from many frontier-scale models is that NVIDIA is releasing it as fully open. Not just the model weights, but the training data and recipes too. It’s released under an Open Model, Weights and Data licence coordinated with the Linux Foundation’s OpenMDW framework.

That matters. Most of the world’s most powerful AI models, such as GPT-4o or Claude Opus 4, are closed. You can access them via an API, but you can’t inspect them, fine-tune them on your own infrastructure, or understand exactly how they were built. With Nemotron 3 Ultra, developers and researchers can do all of that. They can deploy it in the cloud, on-premises, or at the edge using NVIDIA’s platform stack, including NVIDIA AI Enterprise and NIM microservices.

Third-party analysis from independent AI benchmarking group Artificial Analysis has described Nemotron 3 Ultra as one of the strongest Western open-weight models currently available, scoring 48 on their Intelligence Index in early testing. Separately, pre-release developer testing reported output speeds of over 300 tokens per second on specific hardware configurations — though both figures are third-party and non-official.

Who Is It Actually For?

NVIDIA is clear that Nemotron 3 Ultra isn’t designed for casual chat. It’s built for agentic systems — AI pipelines that decompose a complex task into steps, call external tools and APIs, check their own outputs, and retry until the job is done. Code generation and debugging, scientific research workflows, document analysis, long-context planning, tool orchestration. That’s the target.

Independent developers who’ve tested the model early have flagged one caveat worth knowing about. Structured-output reliability — getting the model to consistently return data in a strict format, like a specific JSON schema — can still require careful prompting, external validation layers, and retry logic before it’s production-ready. It’s not a criticism unique to this model, but it’s a practical reality for teams building automated pipelines.
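For teams wondering what "external validation layers and retry logic" looks like in practice, the sketch below shows one common pattern: parse the reply, check it against the expected fields, and ask again with the error fed back if it fails. Everything here is an assumption for illustration. `call_model` stands in for whatever client a team actually uses, the three-field schema is invented, and `make_stub` merely fakes a model that gets it right on the third try. None of it reflects an NVIDIA API.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"task": str, "status": str, "attempts": int}


def validate(raw):
    """Parse raw text as JSON and check it against REQUIRED_FIELDS.

    Returns (data, None) on success or (None, error_message) on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value must be an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return None, f"missing field: {name}"
        # Exact type match, so that True/False is not accepted as an int.
        if type(data[name]) is not expected:
            return None, f"field {name} must be {expected.__name__}"
    return data, None


def generate_structured(call_model, prompt, max_attempts=3):
    """Call the model until its reply validates, feeding errors back in."""
    feedback = ""
    error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        data, error = validate(raw)
        if error is None:
            return data
        feedback = f"\n\nYour previous reply was rejected ({error}). Reply with JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")


def make_stub(replies):
    """Fake model for demonstration: returns the canned replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


stub = make_stub([
    "Sure! Here is the JSON you asked for.",          # not JSON at all
    '{"task": "triage", "status": "done"}',           # missing a field
    '{"task": "triage", "status": "done", "attempts": 3}',
])
result = generate_structured(stub, "Summarise the ticket as JSON.")
print(result)
```

Capping the number of attempts matters in an automated pipeline: a model that never produces valid output should fail loudly rather than loop forever and run up a bill.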

Some analysts have also pointed out that, open weights or not, running a 550-billion-parameter model is still a serious undertaking. You need hefty GPU infrastructure to do it on-premises. Most organisations will end up using it through cloud providers offering NVIDIA GPU instances, or through managed services. Smaller teams may find the engineering overhead significant.
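A rough back-of-envelope calculation shows why. The sketch below estimates the memory needed just to hold the weights at different numerical precisions. The assumptions are deliberately crude: NVFP4 is treated as a flat 4 bits per weight with no scaling-factor overhead, the 80 GB per-GPU figure is a hypothetical example rather than a recommended configuration, and the working memory a million-token context would need on top is ignored entirely. Note too that although only 55 billion parameters are active per request, all 550 billion still have to be stored.

```python
import math

TOTAL_PARAMS = 550e9          # from the announcement
GPU_MEMORY_GB = 80            # hypothetical per-GPU memory, for illustration


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, bits_per_param, gpu_memory_gb):
    """Smallest GPU count whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit (nominal NVFP4)", 4)]:
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    gpus = min_gpus_for_weights(TOTAL_PARAMS, bits, GPU_MEMORY_GB)
    print(f"{label}: {size:.0f} GB of weights, at least {gpus} GPUs")
```

Under these assumptions the weights alone come to roughly 1,100 GB at 16-bit precision and 275 GB at 4-bit, so even the most compact version is a multi-GPU job before any real workload is accounted for.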

What This Means for Kent Residents

For most people in Kent, the immediate impact is indirect — but it’s real. As open models like Nemotron 3 Ultra become available, businesses and public-sector organisations across the UK gain access to powerful AI tools without being locked into expensive proprietary contracts, which could eventually feed through into more efficient local services and lower costs. Tech companies, developers, and researchers based in Kent — including those connected to the University of Kent — could use the model’s open weights for AI research, software development tools, and teaching, provided they have access to sufficient computing resources, most likely through cloud platforms. Any adoption by public bodies such as Kent County Council or NHS Kent and Medway ICB would need to comply with UK data protection law and government AI assurance frameworks, and no local deployments have been announced.

Source: @nvidia
