jsonlkit.com
JSONL (JSON Lines) utilities, in the browser

Mistral Fine-Tune JSONL Validator — Online Alternative to validate_data.py

updated 17 May 2026 · Mistral AI Studio · La Plateforme · Azure AI Foundry · mistral-finetune · separate pages for OpenAI, Anthropic, Gemini, Llama

Mistral fine-tune JSONL validator. Same checks as Mistral's official mistralai/mistral-finetune/utils/validate_data.py — no pip install mistral-finetune, no YAML, no upload. Validates Mistral AI Studio, La Plateforme, and Azure AI Foundry SFT datasets in four shapes: default Instruct (messages), pretrain (text), function-calling Instruct with the 9-character id/tool_call_id rule, and Pixtral image content. Catches UnrecognizedRoleError, InvalidAssistantMessageException, ToolCallFormatError, and the minimum-conversation/byte checks before La Plateforme rejects your file.

Your training data never leaves this tab. Mistral pulls your file when a job starts; this pre-flight check is fully local.

⌨ Prefer the terminal? jsonlkit validate --mistral data.jsonl — same checks, in a pipe.


Mistral Fine-Tune JSONL Validator

Mistral's official validator is utils/validate_data.py in the mistralai/mistral-finetune repo — a Python script that requires the full mistral-finetune install, the v3 tokenizer files, and a YAML config. This page implements the same structural rules in your browser: the four supported formats (Instruct, pretrain, function-calling, Pixtral image), the role and shape checks, the 9-character tool-call ID rule, and the minimum-size guards that fail upload.

Validating a different provider? OpenAI, Anthropic (Claude), Google Gemini, Llama / ShareGPT / Alpaca.

Where can I fine-tune Mistral in 2026?

Four paths, same JSONL format under the hood: La Plateforme, Mistral AI Studio, Azure AI Foundry, and the open-source mistral-finetune repo.

The four Mistral fine-tune formats

1. Default Instruct (messages)

{"messages":[
  {"role":"system","content":"You are a terse code reviewer."},
  {"role":"user","content":"Review: print('hi')"},
  {"role":"assistant","content":"Looks fine.","weight":1}
]}

Roles: system, user, assistant, tool. There is no developer role. Setting weight: 0 on an assistant turn excludes it from the training loss (useful when you want the turn as context without teaching the model to imitate it); weight: 1 is the default.
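As a rough illustration of the role and weight rules, here is a minimal Python sketch. The function name check_instruct_record and the problem strings are hypothetical, and the assumption that weight may only be 0 or 1 and only on assistant turns is an inference from the description above, not a confirmed rule of validate_data.py.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_instruct_record(record):
    """Return a list of problems found in one default-Instruct record."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unrecognized role {role!r}")
        # Assumption: weight is only meaningful on assistant turns and is 0 or 1.
        if "weight" in msg and (role != "assistant" or msg["weight"] not in (0, 1)):
            problems.append(f"message {i}: weight must be 0 or 1 on an assistant turn")

    # Training examples end on the turn the model should learn to produce.
    if messages[-1].get("role") != "assistant":
        problems.append("last message must have role 'assistant'")
    return problems
```

An empty list means the record passed these particular checks; it does not mean the official validator would accept it.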

2. Pretrain

{"text": "Raw corpus document here..."}

Used for continued pretraining rather than supervised fine-tuning. One document per line.

3. Function-calling Instruct

Mistral enforces an exact-9-character ID rule for id and tool_call_id: must match the regex [A-Za-z0-9]{9}. Top-level tools is required. Arguments must be a JSON string, not an object.

{"messages":[
  {"role":"system","content":"Use tools when helpful."},
  {"role":"user","content":"Weather in Paris?"},
  {"role":"assistant","tool_calls":[{
    "id":"aB3kL9mNz",
    "type":"function",
    "function":{"name":"get_weather","arguments":"{\"city\":\"Paris\"}"}
  }]},
  {"role":"tool","tool_call_id":"aB3kL9mNz","name":"get_weather","content":"{\"temp_c\":17}"},
  {"role":"assistant","content":"It's 17 C in Paris."}
],
"tools":[{"type":"function","function":{
  "name":"get_weather",
  "description":"Get current weather",
  "parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}
}}]}
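The three function-calling rules (9-character IDs, required top-level tools, string-typed arguments) can be sketched in a few lines of Python. This is an approximation for illustration only: check_tool_calls and its messages are hypothetical names, not the code in validate_data.py.

```python
import json
import re

ID_PATTERN = re.compile(r"[A-Za-z0-9]{9}")


def _valid_id(value):
    """True if value is a string of exactly 9 ASCII letters or digits."""
    return isinstance(value, str) and ID_PATTERN.fullmatch(value) is not None


def check_tool_calls(record):
    """Return a list of problems with tool-call fields in one record."""
    problems = []
    if not record.get("tools"):
        problems.append("top-level 'tools' is required")

    for i, msg in enumerate(record.get("messages", [])):
        for call in msg.get("tool_calls") or []:
            if not _valid_id(call.get("id")):
                problems.append(f"message {i}: bad tool call id {call.get('id')!r}")
            args = (call.get("function") or {}).get("arguments")
            if not isinstance(args, str):
                problems.append(f"message {i}: 'arguments' must be a JSON string")
            else:
                try:
                    json.loads(args)
                except json.JSONDecodeError:
                    problems.append(f"message {i}: 'arguments' is not valid JSON")
        if msg.get("role") == "tool" and not _valid_id(msg.get("tool_call_id")):
            problems.append(f"message {i}: bad tool_call_id {msg.get('tool_call_id')!r}")
    return problems
```

Using fullmatch rather than match matters here: match would accept a 12-character ID because its first 9 characters fit the pattern.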

4. Pixtral / vision content

For Pixtral fine-tuning, images live inside a typed content array. Base64 data URIs must be prefixed with data:image/jpeg;base64, (or the corresponding MIME).

{"messages":[
  {"role":"user","content":[
    {"type":"text","text":"What's in this image?"},
    {"type":"image_url","image_url":"data:image/jpeg;base64,/9j/4AAQ..."}
  ]},
  {"role":"assistant","content":"A red barn in a wheat field."}
]}
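A small Python sketch of the content-array shape check follows. The helper name check_content_parts is hypothetical, and the sketch only inspects data: URIs; whether plain http(s) image URLs are accepted for fine-tuning is not covered by this page, so they are passed through unchecked.

```python
import re

# Matches "data:image/<subtype>;base64," at the start of the string.
DATA_URI_PREFIX = re.compile(r"^data:image/[A-Za-z0-9.+-]+;base64,")


def check_content_parts(content):
    """Return a list of problems with a message's content field."""
    if isinstance(content, str):
        return []  # plain-string content needs no part checks
    if not isinstance(content, list):
        return ["content must be a string or a list of typed parts"]

    problems = []
    for i, part in enumerate(content):
        kind = part.get("type")
        if kind == "text":
            if not isinstance(part.get("text"), str):
                problems.append(f"part {i}: 'text' must be a string")
        elif kind == "image_url":
            url = part.get("image_url")
            if not isinstance(url, str):
                problems.append(f"part {i}: 'image_url' must be a raw string")
            elif url.startswith("data:") and not DATA_URI_PREFIX.match(url):
                problems.append(f"part {i}: data URI needs a data:image/...;base64, prefix")
        else:
            problems.append(f"part {i}: unknown part type {kind!r}")
    return problems
```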

Mistral fine-tune limits (from validate_data.py)

| Item | Value |
| --- | --- |
| Minimum conversations | 10 |
| Minimum file size | 1,000 bytes |
| Maximum lines | 10,000,000 |
| Maximum file size | ~10 GB |
| Tool-call ID format | exactly 9 characters, [A-Za-z0-9]{9} |
| Last role in training data | must be assistant (loss is computed on assistant tokens only) |
| Tokenizer compatibility | v3 (vocab 32,768) for mistral-finetune; Tekken accepted on Mistral AI Studio for Nemo / Pixtral / Ministral / Small 3 / Magistral / Mistral 3 |

Note the apparent contradiction with inference: training data must end on an assistant turn, but at inference time the last role must be user or tool. The error Expected last role User or Tool ... but got assistant fires only at serve time, not during training.
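The file-level size guards are simple enough to approximate before uploading. The sketch below uses the thresholds listed above; the function name check_file_limits and its messages are hypothetical and do not reproduce the official error strings.

```python
MIN_CONVERSATIONS = 10
MIN_BYTES = 1_000
MAX_LINES = 10_000_000


def check_file_limits(data: bytes):
    """Return a list of file-level problems for the raw bytes of a .jsonl file."""
    problems = []
    # One conversation per non-blank line.
    conversations = sum(1 for line in data.splitlines() if line.strip())
    if conversations < MIN_CONVERSATIONS:
        problems.append(f"only {conversations} conversations, need at least {MIN_CONVERSATIONS}")
    if conversations > MAX_LINES:
        problems.append(f"{conversations} conversations, cap is {MAX_LINES}")
    if len(data) < MIN_BYTES:
        problems.append(f"only {len(data)} bytes, need at least {MIN_BYTES}")
    return problems
```

For a file on disk, something like check_file_limits(open("data.jsonl", "rb").read()) would work for small files; a multi-gigabyte file would need a streaming count instead.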

What this validator checks

Common mistakes this validator catches

{"messages":[{"role":"user","content":"Hi"}]}
// Error: missing assistant turn — model has no target to learn.

{"messages":[{"role":"developer","content":"..."}]}
// Error: Mistral does not support 'developer' role (use 'system').

{"messages":[
  {"role":"assistant","tool_calls":[{"function":{"name":"f"}}]}
]}
// Error: tool_calls entry is missing 'id' and 'type'; 'id' must be a 9-character alphanumeric string.

{"messages":[
  {"role":"assistant","content":"hello","tool_calls":[...]}
]}
// Error: InvalidAssistantMessageException — content AND tool_calls together.

{"messages":[
  {"role":"user","content":"Hi"},
  {"role":"assistant","tool_calls":[{"id":"call_abc","type":"function",...}]}
]}
// Error: tool_call id 'call_abc' is 8 chars and contains '_'; it must be exactly 9 characters from [A-Za-z0-9].
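The content-versus-tool_calls exclusivity rule from the mistakes above can be sketched as a single Python function. The name check_assistant_message is hypothetical, and treating an empty string as "content present" is an assumption of this sketch; the upstream check may treat empty content differently.

```python
def check_assistant_message(msg):
    """Return a problem string, or None if the assistant turn is well-formed."""
    # Assumption: content "" still counts as content being set.
    has_content = msg.get("content") is not None
    has_calls = bool(msg.get("tool_calls"))
    if has_content and has_calls:
        return "assistant message has both content and tool_calls"
    if not has_content and not has_calls:
        return "assistant message has neither content nor tool_calls"
    return None
```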

Real error strings from La Plateforme and validate_data.py

| Error | Source | Fix |
| --- | --- | --- |
| Invalid file format | La Plateforme upload | File is not strict JSONL: one record per line, no array, no pretty-print, no blank lines. |
| has only N conversation which is less than the minimum amount | validate_data.py | Need ≥ 10 conversations. Add more rows. |
| has only N bytes which is less than the minimum | validate_data.py | Need ≥ 1,000 bytes. Add more content. |
| has N conversations which is more than the maximum | validate_data.py | Cap is 10,000,000 lines. Split into multiple files. |
| UnrecognizedRoleError | validate_data.py | Roles must be system, user, assistant, or tool. Fix human, bot, developer, typos. |
| InvalidAssistantMessageException: Assistant message must have either content or tool_calls, but not both | mistral_common | Pick one. Never set content:"" alongside tool_calls. |
| ToolCallFormatError / FunctionFormatError | validate_data.py | id and tool_call_id must be exactly 9 chars; arguments must be a JSON string (not an object). |
| could not be tokenized | validate_data.py | Non-UTF-8 character, or v3-tokenizer-incompatible model selected. |
| Expected last role User or Tool (or Assistant with prefix True) for serving but got assistant | Runtime / inference | This is the inference rule, not training. Training data ends on assistant intentionally. |

How to use it

  1. Drop a .jsonl file or paste records directly.
  2. Click Validate. Each line is parsed and checked against the same rules validate_data.py enforces.
  3. Inspect the Error List — every issue maps to either a La Plateforme upload error or a validate_data.py exception name.
  4. The "Download valid examples only" button rebuilds a clean file with the broken lines stripped.
  5. Upload to Mistral AI Studio or run mistral-finetune locally.
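For readers who want to reproduce the "keep only valid lines" step in a script, here is a minimal Python sketch. It is not this page's implementation: split_valid is a hypothetical helper, and check stands for any function that takes a parsed record and returns a list of problem strings.

```python
import json


def split_valid(lines, check):
    """Split JSONL lines into (valid_lines, errors).

    errors is a list of (line_number, message) tuples, 1-indexed.
    """
    valid, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Blank lines land here too, since "" is not valid JSON.
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        problems = check(record)
        if problems:
            errors.extend((lineno, p) for p in problems)
        else:
            valid.append(line.rstrip("\n"))
    return valid, errors
```

Writing "\n".join(valid) + "\n" to a new file then gives a cleaned dataset, while errors keeps the original line numbers for fixing the source.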

Tips & common pitfalls

Troubleshooting

I'm getting Invalid file format from La Plateforme.

Three common causes: (1) the file is a JSON array [ {…}, {…} ] instead of JSONL; (2) records are pretty-printed and span multiple lines; (3) there's a blank line somewhere in the file. Minify each record to one line and remove blank lines.
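All three causes can be repaired mechanically. The Python sketch below (to_jsonl is a hypothetical helper name) accepts either a JSON array or a stream of pretty-printed records and emits one minified record per line with no blank lines.

```python
import json


def to_jsonl(text: str) -> str:
    """Normalize a JSON array or pretty-printed records to strict JSONL."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            records.extend(value)  # top-level array: unpack its items
        else:
            records.append(value)
    return "".join(
        json.dumps(r, ensure_ascii=False, separators=(",", ":")) + "\n"
        for r in records
    )
```

The sketch holds the whole input in memory, which is fine for small datasets but not for files near the size cap.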

UnrecognizedRoleError — what does Mistral accept?

Exactly four roles: system, user, assistant, tool. No developer, no human, no bot, no function.

Why does validate_data.py say my tool-call ID is wrong?

Mistral requires id and tool_call_id to be exactly 9 characters matching [A-Za-z0-9]{9}. OpenAI-style call_abc123 IDs (which vary in length) will fail.
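One way to migrate a dataset with OpenAI-style IDs is to map each original ID to a deterministic 9-character replacement, so that each tool_calls id and its matching tool_call_id stay paired. This is a sketch of one possible approach, not a Mistral-provided tool; short_id and remap_tool_ids are hypothetical names.

```python
import hashlib
import string

ALPHABET = string.ascii_letters + string.digits  # the 62 characters in [A-Za-z0-9]


def short_id(original: str) -> str:
    """Deterministically map any ID to 9 characters from [A-Za-z0-9]."""
    n = int.from_bytes(hashlib.sha256(original.encode("utf-8")).digest(), "big")
    chars = []
    for _ in range(9):
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    return "".join(chars)


def remap_tool_ids(record):
    """Rewrite id / tool_call_id in place and return the record."""
    for msg in record.get("messages", []):
        for call in msg.get("tool_calls") or []:
            if "id" in call:
                call["id"] = short_id(call["id"])
        if "tool_call_id" in msg:
            msg["tool_call_id"] = short_id(msg["tool_call_id"])
    return record
```

Because the mapping is a hash, the same input always yields the same output; collisions between different IDs are possible in principle but very unlikely within a single conversation.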

How do I run validate_data.py myself?

git clone https://github.com/mistralai/mistral-finetune; cd mistral-finetune; pip install -r requirements.txt; python -m utils.validate_data --train_yaml example.yaml. This page is the no-install equivalent.

Can I fine-tune Codestral / Ministral / Pixtral / Nemo?

Yes, all on Mistral AI Studio. Azure AI Foundry covers Large 2411, Nemo, and Ministral 3B. Codestral, Pixtral, and Magistral are AI Studio only.

Is my data uploaded?

Never. Everything runs in your browser. See the privacy policy.

Frequently asked questions

Which Mistral models support fine-tuning in 2026?

Mistral AI Studio: Mistral Small (25.x), Mistral Large 2411, Mistral Nemo, Ministral 3B / 8B, Codestral (25.08), Pixtral (12B / Large), Magistral. Azure AI Foundry: Mistral Large 2411, Mistral Nemo, Ministral 3B. Self-hosted (mistral-finetune): any v3-tokenizer weight.

Is Mistral fine-tune JSONL the same as OpenAI?

Mostly. The messages shape is identical, and the roles are the same minus developer. The differences: tool-call IDs must be exactly 9 characters, and content arrays for vision use image_url as a raw URL string with the data: prefix, not the OpenAI {"url": "..."} object. Both formats accept a weight field on assistant turns for loss masking.

How do I run validate_data.py?

Clone mistralai/mistral-finetune, install requirements, run python -m utils.validate_data --train_yaml your.yaml. Or use this page to get the same checks without the install.

Is there an online Mistral JSONL validator?

This is it. No login, no upload, no install.

La Plateforme vs Mistral AI Studio vs Azure AI Foundry vs mistral-finetune — which should I use?

Mistral AI Studio is the active managed path for almost every Mistral model. Azure AI Foundry is the right answer if you're already on Azure and need a subset of models. mistral-finetune (the open-source repo) is for self-hosted LoRA on Mistral 7B / Mixtral / Nemo / Large v2 / Codestral 22B with v3 tokenizer.

What's the minimum number of examples?

10 conversations and 1,000 bytes per validate_data.py. Practical recommendation: ≥ 100 examples to see signal, thousands for production.

Do id and tool_call_id really have to be 9 characters?

Yes — Mistral enforces this in validate_data.py with the regex [A-Za-z0-9]{9}. OpenAI-style IDs (call_abc123) will fail.

How do I fine-tune Pixtral on images?

Use Mistral AI Studio (not mistral-finetune — Pixtral needs Tekken tokenizer). Each user turn's content is an array of {"type": "text", "text": ...} and {"type": "image_url", "image_url": "data:image/jpeg;base64,..."} parts. Assistant turns stay plain strings.
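Building one such record from raw image bytes might look like the following Python sketch. The function name pixtral_record and its parameters are hypothetical; the record shape follows the description above.

```python
import base64
import json


def pixtral_record(image_bytes, question, answer, mime="image/jpeg"):
    """Return one JSONL line pairing an image question with a text answer."""
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    record = {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": data_uri},
            ]},
            # Assistant turns stay plain strings.
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record, ensure_ascii=False)
```

The mime argument should match the actual image encoding, for example "image/png" for PNG files.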
