Opus is the Claude you use when you want fewer compromises. It is expensive, but it feels built for serious knowledge work rather than casual throughput.
Model Tracker
Model release tracking, price shifts, and practical benchmark summaries.
Use this page intro to explain how you evaluate models and what readers should look for.
Fast picks for common decisions
High-autonomy agentic coding
Balanced production use where you want strong reasoning
Large-context reasoning
Complex reasoning
Compare practical model fit
Scores are Post Reboot Practical Scores for editorial comparison. They are not official benchmark claims unless a sourced benchmark is listed on the model profile.
No models selected.
| Model | Overall | Best fit | Context | Pricing | Speed | Reasoning | Coding | Access | Verified |
|---|---|---|---|---|---|---|---|---|---|
|
Claude Opus 4.7
Anthropic
|
90.0 |
Coding
Long context
Opus is the Claude you use when you want fewer compromises. It is expensive, but it feels built for serious knowledge work rather than casual throughput. |
1M tokens | In $5 / MTok Out $25 / MTok | 82.0 | 95.0 | 95.0 | API, Enterprise, Web app | May 2, 2026 |
|
Claude Opus 4.8
Anthropic
|
90.9 |
Agents
Coding
Opus 4.8 looks like Anthropic's current premium reliability play: less about raw novelty than making high-autonomy agent work steadier, sharper, and easier to trust. |
1M tokens | In $5 / 1M tokens Out $25 / 1M tokens | 83.0 | 96.0 | 96.0 | API, Enterprise, Web app | May 31, 2026 |
|
Claude Sonnet 4.6
Anthropic
|
90.1 |
Agents
Coding
Sonnet is often the pragmatic Anthropic choice. It gives you enough intelligence to do real work without immediately making every request feel expensive. |
1M tokens | In $3 / MTok Out $15 / MTok | 90.0 | 91.0 | 92.0 | API, Enterprise, Web app | May 2, 2026 |
|
DeepSeek V4 Pro
DeepSeek
|
87.4 |
Coding
Long context
DeepSeek keeps forcing the market to take price compression seriously. The model matters, but the strategic story is the economics. |
1M tokens | In $1.74 / 1M input tokens Out $3.48 / 1M output tokens | 88.0 | 89.0 | 89.0 | API | May 2, 2026 |
|
DeepSeek-V4-Flash
DeepSeek
|
84.3 |
Coding
Long context
V4 Flash is the budget pressure point in the current tracker set: if the official prices hold, it changes the cost floor for long-context API work. |
1M tokens | In $0.14 / 1M tokens cache miss Out $0.28 / 1M tokens | 92.0 | 84.0 | 85.0 | API | May 31, 2026 |
|
Devstral 2
Mistral
|
83.4 |
Agents
Coding
Devstral 2 matters because it is aimed directly at the coding-agent stack instead of trying to be another broad general-purpose flagship. |
256K tokens | In $0.40 / 1M tokens Out $2.00 / 1M tokens | 87.0 | 82.0 | 92.0 | API, Enterprise | May 12, 2026 |
|
Gemini 3.1 Pro Preview
Google
|
90.1 |
Coding
Long context
This looks like Google’s real flagship lane now: less about novelty, more about making Gemini feel dependable enough for serious agent and coding work. |
1M tokens | In $2.00 / 1M tokens (200k) Out $12.00 / 1M tokens (200k) | 85.0 | 94.0 | 95.0 | API, Enterprise, Web app | May 12, 2026 |
|
GPT-5.4 mini
OpenAI
|
87.5 |
Agents
Coding
This is the kind of model that actually changes unit economics. It makes a lot of agent patterns viable where flagship pricing still feels heavy. |
400K tokens | In $0.75 / 1M tokens Out $4.50 / 1M tokens | 92.0 | 86.0 | 90.0 | API | May 2, 2026 |
|
GPT-5.5
OpenAI
|
90.1 |
Coding
Long context
This is the kind of model you reach for when failure is expensive. It looks less like a casual chatbot and more like a premium cognitive engine for serious work. |
1.05M tokens | In $5 / 1M tokens Out $30 / 1M tokens | 84.0 | 96.0 | 97.0 | API | May 2, 2026 |
|
Grok 4.3
xAI
|
88.8 |
Coding
Generalist
The important change is less the name than the economics: xAI's official docs now position Grok 4.3 as a practical API model instead of a premium curiosity. |
1M tokens | In $1.25 / 1M tokens Out $2.50 / 1M tokens | 91.0 | 90.0 | 89.0 | API | May 31, 2026 |
|
Mistral Medium 3.5
Mistral
|
84.0 |
Coding
Generalist
Mistral’s role in the market is strategic as much as technical. It gives serious teams another commercial frontier path that is not just a copy of the biggest platforms. |
256K tokens | In $1.5 / 1M tokens Out $7.5 / 1M tokens | 86.0 | 84.0 | 88.0 | API, Enterprise | May 2, 2026 |
|
Mistral Small 4
Mistral AI
|
82.4 |
Agents
Coding
Small 4 is not the glamour entry, but its hybrid design and low price make it a practical model-tracker candidate for builders watching margins. |
256k tokens | In $0.15 / 1M tokens Out $0.60 / 1M tokens | 90.0 | 82.0 | 86.0 | API | May 31, 2026 |
|
Qwen3-235B-A22B
Alibaba / Qwen
|
86.1 |
Coding
Multilingual
Qwen matters because it keeps the open-weight lane credible at the high end. It is not just a hobbyist release; it is a real option for builders who want control. |
128K tokens | Local | 76.0 | 90.0 | 91.0 | API, Local, Open weights, Web app | May 2, 2026 |
|
Qwen3.6-Plus
Alibaba / Qwen
|
82.5 |
Generalist
Long context
Potential tracker replacement/refresh for Qwen3.5-Plus, but hold for pricing/context details and English/global docs alignment before publishing. |
See Alibaba Model Studio docs | In See Alibaba Model Studio pricing Out See Alibaba Model Studio pricing | 85.0 | 86.0 | 82.0 | API | May 16, 2026 |
Opus 4.8 looks like Anthropic's current premium reliability play: less about raw novelty than making high-autonomy agent work steadier, sharper, and easier to trust.
Sonnet is often the pragmatic Anthropic choice. It gives you enough intelligence to do real work without immediately making every request feel expensive.
DeepSeek keeps forcing the market to take price compression seriously. The model matters, but the strategic story is the economics.
V4 Flash is the budget pressure point in the current tracker set: if the official prices hold, it changes the cost floor for long-context API work.
Devstral 2 matters because it is aimed directly at the coding-agent stack instead of trying to be another broad general-purpose flagship.
This looks like Google’s real flagship lane now: less about novelty, more about making Gemini feel dependable enough for serious agent and coding work.
This is the kind of model that actually changes unit economics. It makes a lot of agent patterns viable where flagship pricing still feels heavy.
This is the kind of model you reach for when failure is expensive. It looks less like a casual chatbot and more like a premium cognitive engine for serious work.
The important change is less the name than the economics: xAI's official docs now position Grok 4.3 as a practical API model instead of a premium curiosity.
Mistral’s role in the market is strategic as much as technical. It gives serious teams another commercial frontier path that is not just a copy of the biggest platforms.
Small 4 is not the glamour entry, but its hybrid design and low price make it a practical model-tracker candidate for builders watching margins.
Qwen matters because it keeps the open-weight lane credible at the high end. It is not just a hobbyist release; it is a real option for builders who want control.
Potential tracker replacement/refresh for Qwen3.5-Plus, but hold for pricing/context details and English/global docs alignment before publishing.
No models match those filters yet.