JSONL Token Counter
Estimate token totals and per-record token counts across a JSONL file, with a model picker and fine-tune cost estimate. Flags records that exceed the chosen model's context window before you waste a training run on them.
How the estimate works
Real tokenizers (OpenAI's tiktoken, Anthropic's tokenizer) require shipping a
~2 MB WASM blob to the browser. To keep this tool fast and offline-capable, it uses a
characters-per-token heuristic calibrated per model family:
- GPT-family models — ~4.0 chars/token for typical English chat data.
- Claude models — ~3.6 chars/token; Claude's tokenizer splits a bit more aggressively on punctuation.
In practice, the estimate is within a few percent of tiktoken on natural-language
English. Code, non-Latin scripts, and emoji-heavy text produce more tokens per
character under every tokenizer; for those, treat the number as a lower bound and
add ~30% headroom for budgeting.
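For concreteness, here is a minimal TypeScript sketch of the heuristic. The ratios are the ones listed above; the function and constant names are illustrative, not the tool's actual code:

```ts
// Chars-per-token ratios calibrated per model family (from the list above).
const CHARS_PER_TOKEN = {
  gpt: 4.0,    // GPT-family, typical English chat data
  claude: 3.6, // Claude's tokenizer splits a bit more aggressively on punctuation
} as const;

type ModelFamily = keyof typeof CHARS_PER_TOKEN;

// Round up so the estimate errs toward an upper bound.
function estimateTokens(text: string, family: ModelFamily): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN[family]);
}
```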
Fine-tune cost estimate
For OpenAI models that have a published fine-tune training price, the tool computes
cost = total_tokens × epochs × price_per_1M ÷ 1,000,000 (see the sketch after the
pricing list below). For models without published training pricing
(Claude, GPT-4 Turbo currently), the cost row shows a note instead of a fake number.
Pricing reference (subject to change — check the provider's pricing page before committing):
- GPT-4o: $25 / 1M training tokens
- GPT-4o mini: $3 / 1M training tokens
- GPT-3.5 Turbo: $8 / 1M training tokens
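A sketch of the cost math in TypeScript, assuming the prices above. The table keys and function name are hypothetical; a null price renders as a note rather than a number:

```ts
// USD per 1M training tokens; null = no published fine-tune price.
const TRAINING_PRICE_PER_1M: Record<string, number | null> = {
  "gpt-4o": 25,
  "gpt-4o-mini": 3,
  "gpt-3.5-turbo": 8,
  "claude": null, // shows a note instead of a fake number
};

function trainingCostUSD(totalTokens: number, epochs: number, model: string): number | null {
  const pricePer1M = TRAINING_PRICE_PER_1M[model];
  if (pricePer1M == null) return null; // unknown or unpublished price
  return (totalTokens * epochs * pricePer1M) / 1_000_000;
}
```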
Context-window check
Each model's per-request context window is fixed (e.g., 128k for GPT-4o, 200k for Claude). Records longer than that will fail at training or inference time. The summary flags how many records in your file exceed the limit so you can split or trim them before submitting — the JSONL Splitter won't help here (records stay intact), but the JSONL Viewer can help you find the long ones.
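Continuing the earlier sketch (it reuses estimateTokens; the window sizes are the ones quoted above, everything else is illustrative):

```ts
// Per-request context windows quoted above (tokens).
const CONTEXT_WINDOW: Record<string, number> = {
  "gpt-4o": 128_000,
  "claude": 200_000,
};

// Count JSONL records whose estimated length exceeds the model's window.
function countOversized(lines: string[], family: ModelFamily, model: string): number {
  return lines.filter((line) => estimateTokens(line, family) > CONTEXT_WINDOW[model]).length;
}
```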
Tips & common pitfalls
- Estimate the message body, not the JSON wrapper. This tool counts the whole line as-is, which slightly overcounts: keys like "role" and "content" aren't actually tokenized as part of the model input. For chat fine-tunes the overcount is consistent (a few percent), so treat it as a safe upper bound.
- Epochs matter for cost, not for context. Every epoch trains on the full file, so cost scales linearly with epochs. The default is 3 because that's what OpenAI's fine-tune jobs use unless you override it.
- Run after dedup. Duplicate examples pay tokens twice and hurt fine-tune quality. Use the JSONL Deduplicator first.
- Validate the structure first. If you're fine-tuning, confirm the file passes the OpenAI Fine-Tune Validator before estimating cost — invalid records get rejected without refund.
Example
Input — a small fine-tune dataset:
{"messages":[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello!"}]}
{"messages":[{"role":"user","content":"What is 2+2?"},{"role":"assistant","content":"4."}]}
At 3 epochs on GPT-4o mini, the summary will show ~50 tokens × 3 = 150 training tokens, costing fractions of a cent. Real datasets are millions of tokens; the estimate scales linearly.
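Plugging the two records into the earlier sketches reproduces roughly the numbers above (exact values depend on rounding):

```ts
const records = [
  '{"messages":[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello!"}]}',
  '{"messages":[{"role":"user","content":"What is 2+2?"},{"role":"assistant","content":"4."}]}',
];

// ~176 chars total / 4.0 chars per token ≈ 45 tokens, the same ballpark as "~50" above.
const total = records.reduce((sum, line) => sum + estimateTokens(line, "gpt"), 0);

// 45 tokens × 3 epochs × $3 / 1M ≈ $0.0004: fractions of a cent.
console.log(total, trainingCostUSD(total, 3, "gpt-4o-mini"));
```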
Frequently asked questions
Why not use the real tokenizer?
It's a tradeoff. The real tokenizer is more accurate but adds a ~2 MB WASM download. For budgeting fine-tunes the estimate is close enough — within a few percent on English chat data — and the tool stays instant on big files.
Does this work for inference cost (input/output token billing)?
Not directly — fine-tune training pricing and inference pricing differ. This tool focuses on training. For inference cost, multiply the input-token total by the published per-million input price for the model.
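In sketch form, under the same naming assumptions as the snippets above:

```ts
// Inference input cost: look up the per-1M input price on the provider's pricing page.
const inferenceInputCostUSD = (inputTokens: number, pricePer1M: number) =>
  (inputTokens * pricePer1M) / 1_000_000;
```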
Why don't all Claude / GPT-4 Turbo entries show a price?
Anthropic doesn't publish a self-serve fine-tune price for Claude as of the date on this page; GPT-4 Turbo's fine-tune isn't generally available. Rather than make up numbers, those rows show "—".
Is my data sent to a server?
Never. Counting happens in your browser. See the privacy policy.