OpenAI’s GPT-5.6 Sol Claims Top Spot on Terminal-Bench 2.1 Coding Benchmark

OpenAI has launched a limited preview of its GPT-5.6 model family, with the flagship Sol variant claiming the highest score yet recorded on a demanding command-line coding benchmark.

OpenAI announced the GPT-5.6 series in late June 2026, describing Sol as its “strongest model yet” and “most capable model yet for cybersecurity.” The launch covers three variants: Sol at the top end, Terra as a balanced mid-tier option, and Luna as a fast, lower-cost model for high-volume tasks.

Access is currently restricted to select customers and trusted partners. General availability is expected within weeks.

What Is Terminal-Bench 2.1?

Terminal-Bench 2.1 is one of the most demanding public benchmarks for AI coding and agentic work. It tests how well a model handles complex, multi-step command-line workflows — running shell commands, editing files, recovering from errors, and coordinating tools across a long session. It’s not about completing a single line of code. It’s about whether an AI can plan, execute, debug, and iterate the way a software engineer would over hours of work.

GPT-5.6 Sol Ultra, a higher-performance mode of Sol, scores 91.9% on the benchmark, according to benchmark tables reproduced by independent technology publishers citing OpenAI’s evaluation data. Base Sol scores 88.8%. Both sit above the previous generation: GPT-5.5 scored 88.0% on the same test, which puts Sol Ultra 3.9 percentage points ahead and base Sol 0.8 points ahead.

Anthropic’s Claude Mythos 5, Claude Fable 5, and Claude Opus 4.8, along with Google’s Gemini 3.1 Pro Preview, all trail Sol Ultra on this particular benchmark, according to comparative capability tables in technical analyses.

Beyond Coding: Cybersecurity and Biology

OpenAI’s announcement doesn’t stop at coding performance. The company says GPT-5.6 Sol leads on cybersecurity benchmarks including ExploitBench and ExploitGym, where Sol is reported to be competitive with Anthropic’s Mythos-class model while using markedly fewer output tokens to reach results.

Biology reasoning has also improved. Technical deep-dives citing OpenAI’s evaluation summaries report around a nine percentage-point average increase across virology, molecular biology, human pathogen, and broader biology capability tests compared with GPT-5.5. The exact sub-scores are unverified by any UK public body, and independent commentators have noted caveats around evaluation methodology.

OpenAI says GPT-5.6 Sol includes an enhanced safety stack designed to constrain misuse of its cybersecurity capabilities, even as those raw capabilities increase. The company is working with government partners on frontier risk management — a point it has emphasised in its own materials around the launch.

How Much Does It Cost?

Pricing is set per one million tokens. Sol costs around $5 (roughly £4) for input and $30 (around £24) for output. Terra comes in at $2.50 input and $15 output, while Luna is the cheapest at $1 input and $6 output. OpenAI has also introduced a prompt caching system: cache writes are billed at 1.25 times the standard input rate, while cache reads carry a 90% discount.
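To make the per-token pricing concrete, here is a rough Python sketch of how a developer might estimate the cost of a single request from the reported rates. It is an illustration only: the tier labels ("sol", "terra", "luna") and the function name are hypothetical rather than confirmed API identifiers, and it assumes the 90% cache-read discount applies to the standard input rate and that cached tokens are counted separately from ordinary input, neither of which has been confirmed.

```python
# Reported USD prices per one million tokens: (input, output).
# The dictionary keys are illustrative labels, not confirmed API model IDs.
PRICES_PER_MILLION = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}

CACHE_WRITE_MULTIPLIER = 1.25  # cache writes: 1.25x the standard input rate
CACHE_READ_MULTIPLIER = 0.10   # assumption: 90% discount is off the input rate


def estimate_cost_usd(tier, input_tokens, output_tokens,
                      cache_write_tokens=0, cache_read_tokens=0):
    """Estimate the USD cost of one request.

    Assumes input_tokens counts only uncached input, with cached tokens
    passed separately; real billing may count them differently.
    """
    input_rate, output_rate = PRICES_PER_MILLION[tier]
    # Express every input-side token as an equivalent number of
    # standard-rate input tokens, then price the total once.
    billable_input_equivalent = (
        input_tokens
        + cache_write_tokens * CACHE_WRITE_MULTIPLIER
        + cache_read_tokens * CACHE_READ_MULTIPLIER
    )
    return (billable_input_equivalent * input_rate
            + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Compare the three tiers for the same hypothetical workload.
    for tier in PRICES_PER_MILLION:
        cost = estimate_cost_usd(tier, input_tokens=200_000, output_tokens=20_000)
        print(f"{tier}: ${cost:.2f}")
```

Running the script prints an estimate for each tier on the same hypothetical workload of 200,000 input tokens and 20,000 output tokens. The figures are in US dollars, so UK businesses would still need to apply an exchange rate on top.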

For businesses building on OpenAI’s API, choosing between Sol, Terra, and Luna will come down to balancing performance needs against cost.

Context window capacity has reportedly expanded too. Developer reports suggest GPT-5.6 can handle around 1.5 million tokens for complex tasks — up from roughly 400,000 tokens with GPT-5.5 — meaning far longer coding and research sessions without the model losing track of earlier context. That figure comes from developer commentary rather than an official regulatory document, so it should be treated as unverified for formal purposes.

Not Everyone Is Convinced

Some independent commentators and technical bloggers have pushed back on the benchmark numbers. Their concern is that high scores on specific evaluations don’t always translate to real-world reliability, and that the tasks chosen may favour certain model architectures. Others have raised broader questions about the risks that come with more capable cybersecurity and biology models, calling for stronger external auditing and regulatory oversight.

Sam Altman, OpenAI’s chief executive, said in materials accompanying the launch that GPT-5.6 represents a step forward in long-horizon agentic tasks — the kind of multi-hour, autonomous workflows that have long been a target for frontier AI development.

But critics argue that as raw capability climbs, the gap between what a model can do and what safeguards can reliably prevent grows wider. That tension is unlikely to go away.

What This Means for Kent Residents

For software developers, IT teams, and data engineers in Kent using OpenAI’s API, GPT-5.6 Sol’s stronger performance on terminal workflows and DevOps tasks could mean fewer manual steps and fewer errors once the model reaches general availability. Local businesses and start-ups building on OpenAI tools will need to weigh the new pricing tiers — Sol, Terra, and Luna — against their budgets, bearing in mind that OpenAI bills in US dollars. Kent organisations with security teams, including public-sector bodies and NHS Kent and Medway ICB, should also review their internal AI-use policies in light of GPT-5.6’s expanded cybersecurity capabilities, and keep an eye on guidance from the National Cyber Security Centre as frontier models grow more powerful.

Source: @OpenAI

Published: 28 June 2026

Source: @OpenAI on X. This article has been researched and rewritten with editorial balance by Kent Local News.