AI Hallucination Risk Calculator

Score the relative risk that an LLM output will contain a hallucination, given the model class, task, grounding and verification you have in place. The score is a 0-100 sense-check, not a benchmark. The recommendation factors in how much hallucination would actually cost in your context.

Explain like I'm 5 (what even is this calculator?)

You tell it which model you are using, what you are using it for, whether the model can look things up, and who or what checks the answer before it goes anywhere. It returns a risk score from 0 (very safe) to 100 (alarming) and a sentence or two on what to do about it. The maths is a multiplication, not a black box, and the Prove-it panel shows every number.

Score your setup

Heuristic, browser-only. No API calls, no sign-up, nothing leaves the page. The maths is pure multiplication and you can audit every step in the Prove-it panel.

Model class multiplier: frontier 1.0, mid 1.4, small 2.2, fine-tuned 0.7.

Base risk (out of 100) ranges from 10 for creative writing to 85 for citation generation.

Grounding multiplier: none 1.0, web search 0.7, RAG 0.5, knowledge graph 0.3.

Verification multiplier: none 1.0, human review 0.4, second-pass LLM 0.7, programmatic 0.5.

Stakes shape the recommendation, not the numeric score.
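
The same tables in code, as a sketch you could drop into a script. This is TypeScript with illustrative names; the calculator does not publish its internal identifiers, and only three task base risks are stated on this page.

```ts
// Lookup tables transcribed from the list above. Names are illustrative.
const MODEL_MULTIPLIER = { frontier: 1.0, mid: 1.4, small: 2.2, fineTuned: 0.7 } as const;
const GROUNDING_MULTIPLIER = { none: 1.0, webSearch: 0.7, rag: 0.5, knowledgeGraph: 0.3 } as const;
const VERIFICATION_MULTIPLIER = { none: 1.0, humanReview: 0.4, secondPassLLM: 0.7, programmatic: 0.5 } as const;

// Base risk per task, out of 100. Only these three values appear on this page.
const BASE_RISK = { creativeWriting: 10, factualLookup: 55, citationGeneration: 85 } as const;
```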

Prove it

The score is a single product of four numbers, then clamped to 100:

score = baseRisk(task) × multiplier(model) × multiplier(grounding) × multiplier(verification)

Worked example. Mid-tier model (1.4), factual lookup (base 55), web search grounding (0.7), human review (0.4):

55 × 1.4 × 0.7 × 0.4 = 21.56 ⇒ 22

If the raw product is over 100, the displayed score is 100. The cap is there because a "180" risk score is meaningless and would distract from the real point: anything above 75 is already in the "do not ship" band.
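
In code, the whole rule fits in a few lines. A minimal sketch, assuming the displayed score is rounded to the nearest integer (the page shows 21.56 ⇒ 22 but does not state the rounding rule):

```ts
// Sketch of the scoring rule: multiply, round, clamp to 100.
function score(baseRisk: number, modelMult: number, groundingMult: number, verifMult: number): number {
  const raw = baseRisk * modelMult * groundingMult * verifMult;
  return Math.min(100, Math.round(raw)); // a raw 187 displays as 100
}

console.log(score(55, 1.4, 0.7, 0.4)); // 22  (the worked example: 21.56 rounds up)
console.log(score(85, 2.2, 1.0, 1.0)); // 100 (raw 187, capped)
```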

What this is not. The numbers are opinionated, informed by published evaluation rates (TruthfulQA, HaluEval, SimpleQA) and operational experience, but not derived from a formal model. Two reasonable practitioners could pick different multipliers within an order of magnitude. Use this to spot obvious problems in a design before review, not to certify a system. If you need a defensible figure, run the actual model against your own evaluation set.

Stakes are intentionally kept outside the score. A Moderate score is acceptable for an internal drafting bot and unacceptable for a clinical decision aid. Mixing stakes into the score would hide that difference. The recommendation is where context enters the picture.

When to use this

Run the calculator before a design review for any LLM-touching feature, and again after the design has been agreed but before code freeze. The first pass surfaces the obvious mistakes (small model, no grounding, citation-heavy task, high stakes). The second confirms that the controls actually went in. It is a sense-check, the same way a 30-second pre-flight glance at the dials catches the worst-case errors without replacing the actual checklist.

Where it earns its keep is in conversations with people who do not work with LLMs day to day: product managers, lawyers, clinical leads. They will reasonably ask "how risky is this?" and a number with a four-band label is more useful than a paragraph of caveats. Show them the breakdown so they can see what would push the number up or down, and they end up driving the design themselves.

Common mistakes that show up in the score

Three keep recurring. First, asking a small or mid-tier model to generate citations or numerical reasoning without grounding. The score lands in High or Critical, and that is the right answer: the failure mode is too well documented to ignore. Second, treating a "second-pass LLM check" as equivalent to human review. It is not. Two language models share most of their failure modes, so a second-pass check catches the obvious mistakes and rubber-stamps the subtle ones. The multiplier (0.7) reflects that. Third, assuming fine-tuning on a small domain corpus rescues a poor base model. It helps within the domain you tuned for, and hardly at all outside it. The 0.7 multiplier on the fine-tuned class is generous, and only valid in that narrow case.
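
To put the first mistake in the page's own numbers: citation generation (base 85) on a small model (2.2) with no grounding and no verification gives

85 × 2.2 × 1.0 × 1.0 = 187 ⇒ capped at 100

which is as far into the do-not-ship band as the scale goes.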

Edge cases the heuristic does not cover

Multimodal hallucinations (a model misreading an image or chart) are not modelled here. The grounding multipliers assume the source itself is reliable; if your RAG corpus is full of marketing copy or out-of-date documentation, the model will quote that confidently and the score will look better than reality. Adversarial inputs are also out of scope: a user trying to jailbreak the system around its grounding belongs in a security review, not a hallucination heuristic. And finally, agentic loops where the model takes actions: every additional turn compounds the error rate, which the single-step score does not capture. If you are building an agent, treat the result here as the single-step floor and assume the multi-step rate is meaningfully higher.
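
As a rough illustration of the compounding, treat the single-step score as if it were a per-step failure probability (a simplification the heuristic itself does not claim): a 22 over five independent turns gives

1 − (1 − 0.22)^5 ≈ 0.71 ⇒ roughly 71

so a configuration that reads 22 for a single step behaves more like a 71 across five turns.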

How the multipliers were chosen

The model class numbers come from public hallucination evaluations on factual tasks. Frontier models score in the 5-15 percent range on hard factual benchmarks; mid-tier models cluster around 20-30 percent on the same benchmarks, hence the 1.4 multiplier; small open-weights models often double that again, hence 2.2. Fine-tuned domain models, when kept within their domain, beat much larger general models, with reductions of 30-50 percent commonly reported, hence 0.7.

The grounding multipliers come from RAG evaluations: web search reduces hallucinations by roughly a third, RAG over a curated corpus by half, and a structured knowledge graph or database by two-thirds or more, because the failure mode collapses from generation to lookup.

Verification multipliers are based on the operational reality that a qualified human reviewer catches most factual errors (0.4), a programmatic checker catches structured errors but misses semantic ones (0.5), and a second-pass LLM catches the obvious slips but shares the subtle blind spots of the first pass (0.7).

Frequently asked questions

Is this score based on a real benchmark?

No. The numbers are an opinionated heuristic. The base task weights are loosely informed by published evaluation rates (TruthfulQA, HaluEval, SimpleQA), and the multipliers reflect how much grounding and verification have been shown to reduce hallucination in practice. Treat the score as a sense-check before a design review, not as evidence in a regulator-facing report. If you need defensible numbers, run the model against your own evaluation set.

Why does grounding lower the score so much?

Because, in practice, it does. Most hallucinations come from the model filling gaps with plausible-sounding text. Give the model a verified source to quote from, and the failure mode shifts from inventing facts to occasionally misreading the source, which is a much smaller class of error. A verified knowledge graph is even stronger because the model is constrained to lookup rather than generation. The multipliers (1.0, 0.7, 0.5, 0.3) match the rough order of magnitude reported in retrieval-augmented evaluation studies.

Why does the recommendation change with stakes if the score does not?

A Moderate score is fine for an internal drafting assistant and unacceptable for a clinical decision aid. The score measures how likely the output is to contain a hallucination; the stakes measure how much that matters. Both feed the recommendation, but only the technical configuration goes into the score, because we did not want a high-stakes setting to inflate the number and make a low-risk configuration look bad. The recommendation is where the two combine.
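
One way the combination could look in code. This is a sketch, not the calculator's actual logic: the band thresholds below 75 and the stakes levels are assumptions for illustration; only the above-75 "do not ship" band is stated on this page.

```ts
type Band = "Low" | "Moderate" | "High" | "Critical";
type Stakes = "internal" | "customerFacing" | "clinical";

// Hypothetical banding: only the above-75 "do not ship" threshold is stated
// on this page; the lower cut-offs are assumptions for illustration.
function band(score: number): Band {
  if (score >= 75) return "Critical";
  if (score >= 50) return "High";
  if (score >= 25) return "Moderate";
  return "Low";
}

// Stakes tighten the recommendation without touching the score.
function recommend(score: number, stakes: Stakes): string {
  const b = band(score);
  if (b === "Critical") return "Do not ship in this configuration.";
  if (stakes === "clinical" && b !== "Low") {
    return "Add grounding or human review before this goes anywhere near a patient.";
  }
  if (stakes === "customerFacing" && b === "High") {
    return "Add at least one control before launch.";
  }
  return "Acceptable for this context; re-score if the design changes.";
}

console.log(recommend(40, "internal")); // Acceptable for this context; ...
console.log(recommend(40, "clinical")); // Add grounding or human review ...
```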

Why is citation generation rated highest?

Because LLMs are notorious for fabricated references. The model knows what a citation looks like (author, year, journal, DOI) and will happily produce one that does not exist, often for a real author writing in the right field. This pattern keeps appearing in legal filings, academic papers and policy briefs, with consequences ranging from awkward to professionally fatal. If you ask a model for citations without a retrieval step, expect about half of them to be wrong on a hard question.
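
The arithmetic backs this up: even on a frontier model with no retrieval, citation generation scores

85 × 1.0 × 1.0 × 1.0 = 85

which is already past the do-not-ship line. Only grounding and verification move it: add RAG and human review and it becomes 85 × 1.0 × 0.5 × 0.4 = 17.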

Does fine-tuning really reduce hallucinations?

On the domain it was tuned for, usually yes. A small model fine-tuned on a focused domain corpus often beats a much larger general model on factual accuracy within that domain, because it has seen the right vocabulary, conventions and edge cases. Off-domain, fine-tuning helps less, and can even make things worse if the tuning data was thin. The 0.7 multiplier here assumes you stay within the domain you tuned for.