AI Fine-Tuning Cost Calculator

Work out what it costs to fine-tune an LLM, plus what the tuned model will cost to run afterwards. Pick a model, drop in your training data tokens (or word count, with auto-conversion), set the number of epochs, and the calculator returns the training bill and the per-1000-call hosted inference cost. USD, list price, browser-only.

Explain like I'm 5 (what even is this calculator?)

Fine-tuning is when you take a pre-trained model and teach it to be better at one specific job using your own examples. You pay twice: once to do the training (data tokens times number of passes through the data), and again every time you call the tuned model afterwards, at a slightly higher rate than the base model. This calculator works both numbers out so you do not get a surprise on the invoice.

Calculate

Training data

If you only have a word count, use the field below and the calculator will convert.

Converts at 1.33 tokens per word.

Number of full passes over the dataset. Three is a sensible starting point for most instruction-style sets.

Model
Inference (per call)

Press Calculate to see the bill.

Prove it

Training

Inference

Training cost is (data tokens × epochs) ÷ 1,000,000 × the model's training rate per million tokens. Inference cost per call is (input tokens × tuned input rate + output tokens × tuned output rate) ÷ 1,000,000, then multiplied by 1000 for the per-1000-call figure. Word counts are converted at 1.33 tokens per word, the rough heuristic OpenAI publishes for English text. List prices only, no caching, no batch discount.
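
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic, not the calculator's actual source; the function name and parameters are made up for the example.

```python
def fine_tune_costs(data_tokens, epochs, train_rate_per_m,
                    in_tokens, out_tokens, in_rate_per_m, out_rate_per_m):
    """Return (training_cost_usd, cost_per_1000_calls_usd) at list prices."""
    # Training: every data token is billed once per epoch.
    training = data_tokens * epochs / 1_000_000 * train_rate_per_m
    # Inference: input and output tokens billed at the tuned per-million rates.
    per_call = (in_tokens * in_rate_per_m + out_tokens * out_rate_per_m) / 1_000_000
    return training, per_call * 1000
```

With the page's GPT-4o-mini figures (500,000 data tokens, 3 epochs, $3/M training, $0.30/$1.20 tuned rates, 500 input and 200 output tokens per call) it returns $4.50 to train and roughly $0.39 per 1000 calls.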

Useful? Save this calculator: press Ctrl + D to bookmark it.

What fine-tuning actually buys you

Fine-tuning takes a pre-trained model and shifts its behaviour on a narrow task using examples you supply. Done well, it gets a small model behaving like a much larger one on that one job. Done badly, it makes a capable model worse, more expensive to run, and locked into your provider's tuned-inference fleet. The maths on this page covers the bill. It does not cover whether you should be doing it at all.

Reach for fine-tuning when you have a stable task, a clear quality bar, and at least a few hundred high-quality examples. Good candidates include style transfer (turning a generic model into one that writes in your house style), structured output that the base model gets close on but not consistently, and classification or extraction with a domain vocabulary the base model has not seen enough of. Avoid fine-tuning when the underlying problem is missing context: if the model needs facts it does not have, retrieval-augmented generation will beat fine-tuning on cost, freshness, and accuracy.

Reading the bill

Two costs sit in tension. Training is one-off (per training run): you pay for every token of data, multiplied by the number of epochs. Inference is the recurring cost: every call to the tuned model afterwards is billed at the provider's tuned-inference rate. On OpenAI, the tuned rate is meaningfully above base. On Together and Mistral, the gap is smaller. So the question is rarely "is the training cost worth it?" and almost always "can I absorb the higher inference rate at my expected call volume?"

A worked example. A 500,000-token dataset on GPT-4o-mini at 3 epochs is 1.5M training tokens at $3 per million, so $4.50 to train. The tuned inference rate is $0.30 input and $1.20 output per million. At 500 input tokens and 200 output tokens per call, that is $0.39 per 1000 calls. Cheap. Switch to GPT-4o and the same training run costs $112.50, and inference becomes $4.88 per 1000 calls, which is over twelve times more per call. Whether that is worth it depends entirely on whether the tuned mini hits your quality bar or you genuinely need the heavier model.
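
The worked example's arithmetic, checked in a few lines. Only the rates and totals quoted above are used; this is a sketch, not the calculator's source.

```python
# GPT-4o-mini figures from the worked example above (list rates).
train = 500_000 * 3 / 1_000_000 * 3.00                        # $4.50 one-off
mini_per_1000 = (500 * 0.30 + 200 * 1.20) / 1_000_000 * 1000  # ~$0.39 per 1000 calls
# GPT-4o tuned inference from the same example is $4.88 per 1000 calls:
ratio = 4.88 / mini_per_1000                                  # ~12.5x more per call
```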

Common mistakes that cost real money

  • Counting words instead of tokens. The 1.33 ratio is rough. Code, JSON, and non-English text run higher. If your dataset is mostly schemas or French legal prose, expect a 30 to 50 percent under-estimate from word counts. Use a tokeniser if accuracy matters.
  • Training too many epochs. Each extra epoch is another full pass through the data at the same per-token rate. Past three or four, you are usually buying overfitting, not improvement. Run a short eval after each epoch and stop when quality plateaus.
  • Forgetting the inference premium. A tuned model that costs the same to train as a competitor's but twice as much to run will lose every cost comparison at any reasonable call volume. Always project the inference cost out to your monthly call rate before committing.
  • Tuning before trying a system prompt. A well-written system prompt with a few in-prompt examples will hit the quality bar for most tasks. Try that first. Tuning is the right answer when prompting plateaus, not before.
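
Two of the bullets above reduce to quick projections: the marginal cost of an extra epoch, and the per-1000-call figure scaled to a month. A minimal sketch; the 2-million-call volume is an illustrative assumption.

```python
def epoch_cost(data_tokens, train_rate_per_m):
    # Marginal cost of one additional epoch: one more full pass over the data.
    return data_tokens / 1_000_000 * train_rate_per_m

def monthly_inference(cost_per_1000_calls, calls_per_month):
    # Project the per-1000-call figure out to a monthly bill.
    return cost_per_1000_calls / 1000 * calls_per_month

# Example: $0.39 per 1000 calls at 2 million calls a month is $780/month,
# which dwarfs the one-off $4.50 training bill in the worked example.
```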

Edge cases worth thinking through

Multi-turn datasets are billed by total token count across the conversation, not per turn. If your training examples include long system prompts repeated across thousands of rows, factor that in. Some providers count the system prompt once per example, others per turn. Check the docs before you assume.

If you re-train every time the source data changes, the training cost becomes recurring. Decide upfront whether tuning is a one-shot exercise or a monthly job, because the answer changes the cost calculus completely. A weekly retrain on a 500,000-token dataset is fifty-two times the figure shown here, and at that point a static tuned model probably is not the right tool.
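
The recurring-retrain arithmetic from the paragraph above, as a quick sketch (function name is illustrative):

```python
def annual_retrain_cost(data_tokens, epochs, train_rate_per_m, retrains_per_year):
    # One training run, repeated on a schedule, becomes a recurring bill.
    one_run = data_tokens * epochs / 1_000_000 * train_rate_per_m
    return one_run * retrains_per_year

# Weekly retrains of the 500,000-token, 3-epoch, $3/M example:
# 52 runs at $4.50 each is $234 a year instead of a one-off $4.50.
```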

Related calculators

Fine-tuning is one fix among several. These calculators cover the alternatives and the numbers that make the case either way.

Frequently asked questions

How is training cost worked out?

Total training tokens equals your data tokens multiplied by the number of epochs. The bill is then total training tokens divided by one million, multiplied by the provider's training rate per million tokens. So a 500,000-token dataset trained for 3 epochs is 1.5M training tokens, and at $3 per million you would pay $4.50.

Why is hosted inference more expensive on a tuned model?

Tuned models run on dedicated capacity, not the shared base-model fleet, so providers charge a premium. OpenAI's tuned GPT-4o-mini and GPT-4o list at roughly 2x and 1.5x the base inference rates respectively. Together and Mistral keep tuned and base inference closer in price. The calculator uses each provider's published tuned rates, not a flat multiplier.

Should I count words or tokens?

Tokens, if you can. If you only have a word count, the calculator converts at 1.33 tokens per English word, which is the rough heuristic OpenAI publishes. For non-English text, code, or JSON, the ratio shifts higher, so treat a word-based figure as an under-estimate and budget extra when working with those.
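
The conversion, with higher ratios for the material the answer warns about. The 1.33 figure is the page's heuristic; the code/JSON and non-English ratios are illustrative guesses, not published numbers.

```python
# Rough tokens-per-word ratios. 1.33 is the English heuristic this page uses;
# the other two are assumed values for material that tokenises less efficiently.
RATIOS = {"english": 1.33, "code_or_json": 1.8, "non_english": 1.7}

def words_to_tokens(words, kind="english"):
    return round(words * RATIOS[kind])
```

A 375,000-word English dataset comes out at roughly 498,750 tokens; the same word count as JSON schemas would budget nearer 675,000 under the assumed ratio.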

Why is three epochs the default?

Three is what most fine-tuning guides land on as a sensible starting point for instruction-style datasets. One epoch under-fits, ten over-fits. The right number depends on dataset size and quality, so treat the default as a budget anchor rather than a recommendation.

Does the calculator include data prep or evaluation cost?

No. List training and tuned inference rates only. Cleaning the data, writing eval scripts, and the LLM calls behind any LLM-as-judge evaluation harness are all extra. On a real project they often cost more than the training itself, especially for the first iteration.

Does this calculator send my numbers anywhere?

No. Everything runs in your browser. Nothing is uploaded.