Eungyeup Kim, Chenchen Gu*, Vashisth Tiwari*, J. Zico Kolter (*co-second author)

📖 Paper: https://arxiv.org/pdf/2605.11209

🧑‍💻 Code: https://github.com/EungyeupKim/Five_Nines_Reliability

<aside> 💡

TL;DR

While achieving extremely high reliability, e.g., five-nines (99.999%) accuracy, is fundamental for deploying LLMs in real-world settings, evaluating such a rare failure probability with tight confidence bounds requires a prohibitively large number of inferences.

In this project, we observe that capable LLMs, such as Qwen2.5-Math-7B-Instruct, gpt-oss-20b, and Gemini 2.5 Flash Lite, exhibit systematic failures, not random ones, when tested on parameterized GSM problems. By concentrating samples on failure-prone inputs, we reduce the number of inferences by up to 156.22$\times$ compared to naive uniform sampling. The reliability evaluations enabled by this efficiency highlight that reliability is a distinct and measurable axis of model quality.

Find the details below. 🧵👇

</aside>

🤔 A 99.9% benchmark score looks like success, but it quietly hides a risk of 1,000 failures per million real-world queries.

Large language models (LLMs) show rapid progress not just in completing tasks faster, but in saturating numerous benchmarks: e.g., on AIME 2025, GPT-5.2 Thinking and Gemini-3-Pro achieve perfect scores, and Claude Opus 4.6 reaches 99.8%.

While these “near-perfect” scores seem to indicate reliable models at first glance, they may instead obscure how genuinely reliable these models are in real-world deployments! For example, in safety-critical settings such as refusing harmful requests, 99.9% vs. 99.999% accuracy over millions of queries translates into a two-orders-of-magnitude difference in failures, e.g., 1,000 vs. 10 failures per million queries.
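The arithmetic behind this gap is easy to check. A minimal sketch (the function name is ours, chosen for illustration):

```python
# Hypothetical illustration: failure counts implied by an accuracy level
# over a fixed query volume (numbers match the example above).
def expected_failures(accuracy: float, num_queries: int) -> float:
    """Expected number of failed queries given per-query accuracy."""
    return (1.0 - accuracy) * num_queries

million = 1_000_000
print(round(expected_failures(0.999, million)))    # three-nines: 1000 failures
print(round(expected_failures(0.99999, million)))  # five-nines: 10 failures
```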

Then… why not just go ahead and evaluate reliability?

💵 Expensive LLM inference bottlenecks reliability evaluation.

The real problem is that LLMs are expensive to evaluate.

For such rare failures, achieving tight confidence bounds requires a prohibitively large sample size via Monte Carlo sampling under a uniform distribution $P$. For instance, to estimate a failure rate of $p = 10^{-5}$ (equivalently, 99.999% accuracy) with a small standard error $\varepsilon$, the binomial estimator's standard error $\sqrt{p(1-p)/n}$ implies $n \approx p/\varepsilon^2 = 10^{-5}/(10^{-6})^2 = 10^7$ inferences when $\varepsilon = 10^{-6}$.
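This sample-size arithmetic can be sketched in a few lines (function name ours, for illustration):

```python
import math

# The binomial estimator of a failure rate p from n uniform Monte Carlo
# draws has standard error sqrt(p(1-p)/n); hitting a target standard
# error eps therefore needs n = p(1-p)/eps^2, which is ~p/eps^2 for small p.
def required_samples(p: float, eps: float) -> int:
    """Samples needed so that sqrt(p(1-p)/n) <= eps."""
    return math.ceil(p * (1.0 - p) / eps**2)

# Estimating p = 1e-5 to standard error 1e-6 needs on the order of 10^7 calls.
print(required_samples(1e-5, 1e-6))
```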

So the key question becomes:

Is there any way to estimate it without running millions of model calls?

💫 Surprisingly, LLM failures are systematic, not random

We empirically observe that LLM failures aren’t just random noise; they exhibit structured patterns. Specifically, across the broad input space of parameterized GSM problems (i.e., GSM-Symbolic), only a tiny set of input values accounts for the majority of model failures. Examples are below:

<aside> 📎

In a “parameterized” GSM8K problem (Template 6), Gemini 2.5 Flash Lite fails disproportionately (~82% of all failures) on certain input parameter values, e.g., when multiplying $35 \times 43$. Toggle below for details of each model’s failures on the templates (red highlights potentially failure-prone input values and wrong reasoning):

While there is no a priori explanation for why this happens, we empirically observe the phenomenon across a broad range of models and templates (check below for additional results 👇).
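To see why concentrating samples on failure-prone values helps, here is a toy stratified-sampling sketch. This is a hypothetical illustration, not the paper's exact estimator: `model_fails`, the parameter lists, and the sample-allocation choice are all ours.

```python
import random

# Toy sketch: estimate an overall failure rate by splitting the input
# space into a small failure-prone stratum and the rest, then spending
# inference budget separately on each. `model_fails` stands in for an
# actual (expensive) LLM call that returns True on failure.
def stratified_failure_rate(model_fails, prone_params, rest_params,
                            n_prone, n_rest, rng=random):
    """Weighted mix of per-stratum failure-rate estimates."""
    total = len(prone_params) + len(rest_params)
    w_prone = len(prone_params) / total
    w_rest = len(rest_params) / total
    p_prone = sum(model_fails(rng.choice(prone_params))
                  for _ in range(n_prone)) / n_prone
    p_rest = sum(model_fails(rng.choice(rest_params))
                 for _ in range(n_rest)) / n_rest
    return w_prone * p_prone + w_rest * p_rest

# Toy model that fails only on 10 of 1000 parameter values: a few hundred
# calls already pin down the overall rate of 0.01, because most of the
# budget lands where failures concentrate.
estimate = stratified_failure_rate(lambda x: x < 10,
                                   list(range(10)), list(range(10, 1000)),
                                   n_prone=100, n_rest=100)
print(estimate)
```

Because failures cluster in a small stratum, most of the estimator's variance lives there, so allocating inferences to that stratum shrinks the confidence interval far faster than uniform sampling over the whole input space.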