LLM Fine-Tune JSONL Validator

updated 4 May 2026

Your training data never leaves this tab. OpenAI and Anthropic both upload your data when a job starts, but this pre-flight check is fully local — useful when the dataset contains anything you would rather not upload twice.

Validate your fine-tune file against three common schemas before you pay for a training run: OpenAI chat (messages), Anthropic messages (top-level system + alternating user/assistant), and the legacy OpenAI prompt/completion format. Every line is parsed and every message checked for role, content, and shape. 100% in-browser.

— S., [email protected]

Before you start

You need a .jsonl or .ndjson file. For the default OpenAI chat format, every line must be a standalone JSON object containing a messages array. This tool is designed to catch schema errors locally so you don't waste time or money on failed training jobs.

Your training data never leaves your computer. I wrote this because I got tired of waiting for OpenAI's CLI to upload a 50 MB file only for it to error out on line 4,000 because of a simple typo in a role name. The validation happens entirely in your browser using JavaScript.

The default check targets the {"messages": [...]} format. Legacy prompt/completion pairs can still be validated by picking the legacy option in the format dropdown, but OpenAI has deprecated that style for newer models like GPT-4o, so convert them to the chat format before training on those models.

How to use it

Drop your .jsonl file into the dashed box, or paste the text directly into the input area.
Click Validate. I'll scan every line and verify the structure of every message.
Check the Summary section for total row counts and a rough token estimate (calculated as characters / 4).
If errors are found, look at the Error List to see the exact line number and what went wrong (e.g., a missing assistant turn).
If you just want the good stuff, click Download valid examples only to save a cleaned file with the broken lines stripped out.

Options explained

Validate vs. Download clean

Validate simply tells you what is wrong and where. Use this if you want to go back to your source script and fix the logic producing the errors.

Download valid examples only is a "quick fix" button. It iterates through your file, keeps only the lines that pass 100% of the schema checks, and lets you download the result as openai-fine-tune-clean.jsonl. It's a lifesaver when you have thousands of rows and just a few malformed entries you're happy to discard.
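The clean-up step can be pictured as a filter over lines. The sketch below is an illustration, not the tool's actual source: keepValidLines and hasMessages are hypothetical names, and the stand-in predicate only checks for a non-empty messages array rather than the full schema.

```javascript
// Hypothetical sketch: keep only the lines that pass a validity predicate.
function keepValidLines(text, isValid) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "") // skip blank lines
    .filter((line) => {
      try {
        return isValid(JSON.parse(line));
      } catch {
        return false; // malformed JSON is discarded
      }
    })
    .join("\n");
}

// Stand-in predicate (assumption): an object with a non-empty messages array.
const hasMessages = (obj) =>
  obj !== null &&
  typeof obj === "object" &&
  Array.isArray(obj.messages) &&
  obj.messages.length > 0;
```

Calling keepValidLines(fileText, hasMessages) returns the surviving lines joined by newlines, ready to be saved as a new .jsonl file.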

Example

A valid training line looks like this:

{"messages": [{"role": "system", "content": "You are a poet."}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}

Common mistakes I catch:

{"messages": [{"role": "usre", "content": "Hi"}]} 
// Error: Typo in role "usre" and missing "assistant" turn.

I'll also flag missing content fields, empty message arrays, and invalid tool/function call structures.
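A minimal sketch of how checks like these could be written for one parsed line. It is an assumption about the general approach, not the validator's real code: checkChatExample is a hypothetical name, and tool/function call structures are only checked for presence here.

```javascript
// Roles taken from the list on this page.
const ROLES = new Set(["system", "user", "assistant", "tool", "function", "developer"]);

// Hypothetical helper: returns a list of error strings for one parsed example.
function checkChatExample(example) {
  const errors = [];
  const messages = example && example.messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    return ["messages must be a non-empty array"];
  }
  messages.forEach((msg, i) => {
    if (msg === null || typeof msg !== "object") {
      errors.push(`message ${i}: not an object`);
      return;
    }
    if (!ROLES.has(msg.role)) {
      errors.push(`message ${i}: unknown role "${msg.role}"`);
    }
    // Empty content is tolerated only when tool_calls are present.
    const hasToolCalls = Array.isArray(msg.tool_calls) && msg.tool_calls.length > 0;
    const hasContent = Array.isArray(msg.content)
      ? msg.content.length > 0
      : typeof msg.content === "string" && msg.content !== "";
    if (!hasContent && !hasToolCalls) {
      errors.push(`message ${i}: missing or empty content`);
    }
  });
  if (!messages.some((m) => m && m.role === "assistant")) {
    errors.push("no assistant message");
  }
  return errors;
}
```

Run against the "usre" line above, this sketch reports two problems: the unknown role and the missing assistant message.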

Tips & common pitfalls

Assistant turns are mandatory. Every example must have at least one assistant message. That's what the model is actually learning to generate.
Role typos are common. I check for system, user, assistant, tool, function, and developer. Typos like "asistant" are one of the most frequent causes of job failures I see.
JSONL is not a JSON array. Your file should not start with [ or end with ]. It should be one object per line. If you have an array, use the "JSONL ↔ JSON Array" tool below to flip it.
Empty content is usually an error. Unless you are providing tool_calls, a message with "content": "" will likely be rejected by the fine-tuning API.
The "developer" role. I've updated the validator to support the developer role used in newer models (like o1), which often replaces the traditional system role.
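For the array pitfall above, converting by hand is short. This is a sketch under the assumption that the whole array fits in memory; arrayToJsonl is a hypothetical name and is not the site's converter.

```javascript
// Hypothetical helper: turn a JSON array of examples into JSONL text.
function arrayToJsonl(jsonText) {
  const data = JSON.parse(jsonText);
  if (!Array.isArray(data)) {
    throw new TypeError("expected a top-level JSON array");
  }
  // JSON.stringify without an indent argument never emits raw newlines,
  // so each record stays on exactly one line.
  return data.map((record) => JSON.stringify(record) + "\n").join("");
}
```

Newlines inside string values come out escaped as \n, which is what keeps one record per line.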

Troubleshooting

The tool says "Unexpected token" or "Invalid JSON".

This means the line itself is malformed JSON. Check for trailing commas at the end of your objects or unescaped newlines inside your strings. Every line must be a single, valid JSON object.
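To locate the offending lines yourself, something along these lines works. It is a sketch, with findJsonErrors as a hypothetical name; the exact wording of the parse error message depends on the JavaScript engine.

```javascript
// Hypothetical helper: report which lines fail to parse as a JSON object.
function findJsonErrors(text) {
  const problems = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // ignore blank lines
    try {
      const value = JSON.parse(line);
      if (value === null || typeof value !== "object" || Array.isArray(value)) {
        problems.push({ line: index + 1, message: "not a JSON object" });
      }
    } catch (err) {
      // Syntax errors such as trailing commas land here.
      problems.push({ line: index + 1, message: err.message });
    }
  });
  return problems;
}
```

Line numbers are 1-based so they match what a text editor shows.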

My file is 100 MB and the browser tab is lagging.

Pasting 100 MB of text into a textarea is very hard on browser memory. Use the File Drop zone instead; it reads the file in chunks and is significantly faster and more stable for large datasets.
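Chunked reading has one wrinkle: a chunk can end in the middle of a line. The sketch below shows one way to handle that, assuming the chunks arrive as already-decoded strings (in a browser they could come from file.stream() passed through a TextDecoder in streaming mode). LineSplitter is a hypothetical name, not the tool's internals.

```javascript
// Hypothetical sketch: split text that arrives in arbitrary chunks into lines.
class LineSplitter {
  constructor() {
    this.buffer = ""; // holds the unfinished tail of the previous chunk
  }

  // Accept one chunk and return the complete lines it finished.
  push(chunk) {
    this.buffer += chunk;
    const parts = this.buffer.split("\n");
    this.buffer = parts.pop(); // the last piece may be incomplete
    return parts.map((line) => line.replace(/\r$/, ""));
  }

  // Call once at end of input to collect a final line with no trailing newline.
  flush() {
    const rest = this.buffer.replace(/\r$/, "");
    this.buffer = "";
    return rest === "" ? [] : [rest];
  }
}
```

Only the current partial line is held in memory, so each complete line can be validated and discarded as soon as push returns it.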

Everything is valid but OpenAI still rejects the file.

OpenAI sometimes adds new constraints (like max total tokens per line). While I check the schema and basic requirements, always check their latest docs if you're hitting exotic edge cases with very large individual examples.

Related tools

See also: if you need to do something adjacent on this site, try Validator to check each line of a JSONL file for syntax errors, Formatter to pretty-print or minify each JSONL record, or JSONL to CSV to flatten JSONL into a CSV with dotted keys.

Frequently asked questions

Does this support the legacy prompt/completion format?

Yes. Pick OpenAI prompt/completion (legacy) in the format dropdown. It checks for non-empty prompt and completion strings and flags completions that do not follow the leading-space convention. The format is deprecated for newer models, but it still turns up in older datasets.
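As a rough illustration of those legacy checks, here is a sketch. checkLegacyExample is a hypothetical name, and treating the leading-space rule as a warning on the completion (rather than an error) is my assumption.

```javascript
// Hypothetical helper: check one legacy prompt/completion record.
function checkLegacyExample(example) {
  const errors = [];
  const warnings = [];
  const record = example !== null && typeof example === "object" ? example : {};
  for (const key of ["prompt", "completion"]) {
    if (typeof record[key] !== "string" || record[key] === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  // Convention from the legacy format: completions begin with a space.
  if (typeof record.completion === "string" &&
      record.completion !== "" &&
      !record.completion.startsWith(" ")) {
    warnings.push("completion does not start with a space");
  }
  return { errors, warnings };
}
```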

Does this validate Anthropic / Claude fine-tune files?

Yes — pick Anthropic (system + messages). The validator enforces Anthropic's actual shape: an optional top-level system string (not a message with role: "system"), a messages array starting with a user turn, and strict user / assistant alternation. Multi-modal content arrays with text and image parts are supported.
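The shape described above can be sketched as follows. checkAnthropicExample is a hypothetical name, content validation is left out for brevity, and the function illustrates the rules rather than reproducing the validator's code.

```javascript
// Hypothetical helper: check the Anthropic-style shape described on this page.
function checkAnthropicExample(example) {
  const errors = [];
  if (example === null || typeof example !== "object") return ["not an object"];
  if ("system" in example && typeof example.system !== "string") {
    errors.push("system must be a string when present");
  }
  const messages = example.messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    errors.push("messages must be a non-empty array");
    return errors;
  }
  messages.forEach((msg, i) => {
    // Even positions must be user turns, odd positions assistant turns.
    const expected = i % 2 === 0 ? "user" : "assistant";
    if (!msg || msg.role !== expected) {
      errors.push(`message ${i}: expected role "${expected}"`);
    }
  });
  return errors;
}
```

A message with role "system" inside the array fails the alternation check, which matches the rule that system belongs at the top level.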

What about vision or multi-modal data?

Supported. If your content is an array of parts (text and image_url), I'll verify that each part has the required type and structure.
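For reference, a sketch of a per-part check, assuming the OpenAI-style part shapes {"type": "text", "text": ...} and {"type": "image_url", "image_url": {"url": ...}}. checkContentParts is a hypothetical name.

```javascript
// Hypothetical helper: check an OpenAI-style content array of parts.
function checkContentParts(parts) {
  const errors = [];
  if (!Array.isArray(parts)) return ["content is not an array"];
  parts.forEach((part, i) => {
    if (part === null || typeof part !== "object") {
      errors.push(`part ${i}: not an object`);
    } else if (part.type === "text") {
      if (typeof part.text !== "string") {
        errors.push(`part ${i}: text must be a string`);
      }
    } else if (part.type === "image_url") {
      const url = part.image_url && part.image_url.url;
      if (typeof url !== "string" || url === "") {
        errors.push(`part ${i}: image_url.url must be a non-empty string`);
      }
    } else {
      errors.push(`part ${i}: unknown type "${part.type}"`);
    }
  });
  return errors;
}
```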

Is my data uploaded to your servers?

Absolutely not. I don't even have a backend for this project. Everything runs in your browser's memory. You can even turn off your internet after the page loads and it will still work.

Why is the token count an estimate?

Precise tokenization requires a tokenizer such as the tiktoken library, loaded with the specific model's encoding (for example cl100k_base for GPT-4 or o200k_base for GPT-4o). I use a character-based heuristic (chars / 4) to give you a ballpark figure without bloating the page size.
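The heuristic itself is one line. In this sketch estimateTokens is a hypothetical name, and rounding up is my assumption; the page only specifies characters divided by four.

```javascript
// Rough estimate: characters divided by four, rounded up (rounding is an assumption).
// Note: String.length counts UTF-16 code units, so emoji count as two characters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```

For mostly English text this tends to land in the right neighborhood; code-heavy or non-English datasets can drift further from the true count.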