Mistral Fine-Tune JSONL Validator — Online Alternative to validate_data.py
Mistral fine-tune JSONL validator. Same checks as Mistral's official mistralai/mistral-finetune/utils/validate_data.py — no pip install mistral-finetune, no YAML, no upload. Validates Mistral AI Studio, La Plateforme, and Azure AI Foundry SFT datasets in four shapes: default Instruct (messages), pretrain (text), function-calling Instruct with the 9-character id/tool_call_id rule, and Pixtral image content. Catches UnrecognizedRoleError, InvalidAssistantMessageException, ToolCallFormatError, and the minimum-conversation/byte checks before La Plateforme rejects your file.
Your training data never leaves this tab. Mistral pulls your file when a job starts; this pre-flight check is fully local.
Mistral Fine-Tune JSONL Validator
Mistral's official validator is utils/validate_data.py in the mistralai/mistral-finetune repo — a Python script that requires the full mistral-finetune install, the v3 tokenizer files, and a YAML config. This page implements the same structural rules in your browser: the four supported formats (Instruct, pretrain, function-calling, Pixtral image), the role and shape checks, the 9-character tool-call ID rule, and the minimum-size guards that fail upload.
Validating a different provider? OpenAI, Anthropic (Claude), Google Gemini, Llama / ShareGPT / Alpaca.
Where can I fine-tune Mistral in 2026?
Four paths, same JSONL format under the hood:
- Mistral AI Studio — Mistral's hosted fine-tuning, successor to the La Plateforme fine-tune endpoint (now flagged deprecated in the public docs). Supports Mistral Small (25.x), Mistral Large 2411, Mistral Nemo, Ministral 3B / 8B, Codestral (25.08), Pixtral (12B / Large), Magistral. $4 minimum job fee, $2/month per stored model.
- Azure AI Foundry — Microsoft's managed Mistral fine-tuning. Supports Mistral Large 2411, Mistral Nemo, Ministral 3B.
- mistral-finetune (self-hosted LoRA): the open-source repo at github.com/mistralai/mistral-finetune. Works with any v3-tokenizer-compatible weight: Mistral 7B v0.3, Mixtral 8x7B (after vocab extension), Mistral Nemo, Mistral Large v2, Codestral 22B.
- Third-party (community weights): Together, Fireworks, RunPod, Unsloth, Axolotl all accept the same Mistral messages shape. For ShareGPT-style datasets see the Llama validator.
The four Mistral fine-tune formats
1. Default Instruct (messages)
{"messages":[
{"role":"system","content":"You are a terse code reviewer."},
{"role":"user","content":"Review: print('hi')"},
{"role":"assistant","content":"Looks fine.","weight":1}
]}
Roles: system, user, assistant, tool. No developer role. Setting weight: 0 on an assistant turn excludes that turn from the training loss (useful when you want context without teaching imitation); weight: 1 is the default.
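The role rules above can be sketched in a few lines of Python. This is an illustration of the checks as described on this page, not the code from validate_data.py; the names check_roles and trained_turns are hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def check_roles(record: dict) -> list[str]:
    """Return the problems found in one Instruct record (empty list if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unrecognized role {role!r}")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message must have role 'assistant'")
    return problems

def trained_turns(record: dict) -> list[dict]:
    """Assistant turns that count toward the loss (weight defaults to 1)."""
    return [
        m for m in record["messages"]
        if m.get("role") == "assistant" and m.get("weight", 1) != 0
    ]
```

Returning a list of problems rather than raising on the first one lets you report every issue on a line in a single pass.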
2. Pretrain
{"text": "Raw corpus document here..."}
Used for continued pretraining rather than supervised fine-tuning. One document per line.
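If your corpus documents contain line breaks, serializing them with a JSON encoder keeps each record on a single line. A minimal sketch, where to_pretrain_jsonl is a hypothetical helper name:

```python
import json

def to_pretrain_jsonl(documents: list[str]) -> str:
    """Serialize each document as one {"text": ...} record per line."""
    # json.dumps escapes embedded newlines as \n, so a multi-line
    # document still occupies exactly one line of the output file.
    return "".join(json.dumps({"text": doc}) + "\n" for doc in documents)
```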
3. Function-calling Instruct
Mistral enforces an exact-9-character ID rule for id and tool_call_id: must match the regex [A-Za-z0-9]{9}. Top-level tools is required. Arguments must be a JSON string, not an object.
{"messages":[
{"role":"system","content":"Use tools when helpful."},
{"role":"user","content":"Weather in Paris?"},
{"role":"assistant","tool_calls":[{
"id":"aB3kL9mNz",
"type":"function",
"function":{"name":"get_weather","arguments":"{\"city\":\"Paris\"}"}
}]},
{"role":"tool","tool_call_id":"aB3kL9mNz","name":"get_weather","content":"{\"temp_c\":17}"},
{"role":"assistant","content":"It's 17 C in Paris."}
],
"tools":[{"type":"function","function":{
"name":"get_weather",
"description":"Get current weather",
"parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}
}}]}
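A minimal Python sketch of the two rules above: IDs that match [A-Za-z0-9]{9} exactly, and arguments serialized to a JSON string. The helper names are hypothetical and this is not Mistral's own code.

```python
import json
import re
import secrets
import string

ID_PATTERN = re.compile(r"[A-Za-z0-9]{9}")
ALPHABET = string.ascii_letters + string.digits

def new_tool_call_id() -> str:
    """Generate a random ID of exactly 9 alphanumeric characters."""
    return "".join(secrets.choice(ALPHABET) for _ in range(9))

def is_valid_tool_call_id(value) -> bool:
    """True only if the whole value is 9 alphanumeric characters."""
    return isinstance(value, str) and ID_PATTERN.fullmatch(value) is not None

def make_tool_call(name: str, arguments: dict) -> dict:
    """Build one tool_calls entry; arguments become a JSON string."""
    return {
        "id": new_tool_call_id(),
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }
```

Using fullmatch rather than search matters here: a longer ID that merely contains nine alphanumeric characters in a row must still be rejected.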
4. Pixtral / vision content
For Pixtral fine-tuning, images live inside a typed content array. Base64 data URIs must be prefixed with data:image/jpeg;base64, (or the equivalent prefix with the MIME type of your image format).
{"messages":[
{"role":"user","content":[
{"type":"text","text":"What's in this image?"},
{"type":"image_url","image_url":"data:image/jpeg;base64,/9j/4AAQ..."}
]},
{"role":"assistant","content":"A red barn in a wheat field."}
]}
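One way to produce the image part from raw bytes, as a sketch; image_part is a hypothetical helper and the MIME type is whatever matches your file.

```python
import base64

def image_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content part with a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": f"data:{mime};base64,{encoded}"}
```

Pair it with a {"type": "text", ...} part in the same content array, as in the record above.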
Mistral fine-tune limits (from validate_data.py)
| Item | Value |
|---|---|
| Minimum conversations | 10 |
| Minimum file size | 1,000 bytes |
| Maximum lines | 10,000,000 |
| Maximum file size | ~10 GB |
| Tool-call ID format | exactly 9 characters, [A-Za-z0-9]{9} |
| Last role in training data | must be assistant (loss is computed on assistant tokens only) |
| Tokenizer compatibility | v3 (vocab 32,768) for mistral-finetune; Tekken accepted on Mistral AI Studio for Nemo / Pixtral / Ministral / Small 3 / Magistral / Mistral 3 |
Note the contradiction with inference: training data must end on assistant, but at inference time the last role must be user or tool. The error Expected last role User or Tool ... but got assistant only fires at serve time, not training.
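The size guards in the table are simple to check before you upload. The sketch below takes the file contents as bytes for brevity; for files anywhere near the upper limits you would stream the file instead. The function name and message wording are hypothetical, and the approximate 10 GB cap is left out because the table gives no exact figure.

```python
MIN_CONVERSATIONS = 10
MIN_BYTES = 1_000
MAX_LINES = 10_000_000

def check_size_limits(data: bytes) -> list[str]:
    """Compare raw JSONL bytes against the minima and the line cap."""
    problems = []
    # Count non-blank lines; each is expected to hold one record.
    n_records = sum(1 for line in data.split(b"\n") if line.strip())
    if len(data) < MIN_BYTES:
        problems.append(f"{len(data)} bytes is below the {MIN_BYTES}-byte minimum")
    if n_records < MIN_CONVERSATIONS:
        problems.append(f"{n_records} records is below the minimum of {MIN_CONVERSATIONS}")
    if n_records > MAX_LINES:
        problems.append(f"{n_records} records is above the maximum of {MAX_LINES}")
    return problems
```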
What this validator checks
- Each line is valid JSON, one record per line.
- Top-level messages is present and is a non-empty array (or text for pretrain mode).
- Every role is one of system, user, assistant, tool. No developer, no human/bot.
- At least one assistant turn; last message must be assistant in training data.
- Non-empty content (or valid tool_calls if content is null/missing).
- tool_calls shape: 9-character id, type: "function", function.name, function.arguments as a JSON string.
- tool role messages have a matching 9-character tool_call_id.
- Assistant turn cannot have both content and tool_calls populated (the InvalidAssistantMessageException rule).
- File size and conversation count flagged against the 1,000-byte / 10-conversation minima.
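Two of these checks depend on state carried across messages: the content-or-tool_calls rule on assistant turns, and tool messages pointing at a known call. A rough sketch follows. It assumes "matching" means the tool_call_id equals the id of an earlier tool call in the same conversation, and the function name is hypothetical.

```python
def check_assistant_and_tools(messages: list[dict]) -> list[str]:
    """Flag assistant turns with both/neither payload and orphan tool replies."""
    problems = []
    seen_ids = set()
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant":
            content = msg.get("content")
            calls = msg.get("tool_calls")
            if content is not None and calls:
                # An empty string next to tool_calls also counts as "both".
                problems.append(f"message {i}: content and tool_calls both set")
            elif not content and not calls:
                problems.append(f"message {i}: neither content nor tool_calls")
            for call in calls or []:
                seen_ids.add(call.get("id"))
        elif role == "tool":
            if msg.get("tool_call_id") not in seen_ids:
                problems.append(f"message {i}: tool_call_id has no earlier tool call")
    return problems
```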
Common mistakes this validator catches
{"messages":[{"role":"user","content":"Hi"}]}
// Error: missing assistant turn — model has no target to learn.
{"messages":[{"role":"developer","content":"..."}]}
// Error: Mistral does not support 'developer' role (use 'system').
{"messages":[
{"role":"assistant","tool_calls":[{"function":{"name":"f"}}]}
]}
// Error: tool_calls missing 'id', 'type', and 9-char ID format.
{"messages":[
{"role":"assistant","content":"hello","tool_calls":[...]}
]}
// Error: InvalidAssistantMessageException — content AND tool_calls together.
{"messages":[
{"role":"user","content":"Hi"},
{"role":"assistant","tool_calls":[{"id":"call_abc","type":"function",...}]}
]}
// Error: tool_call id 'call_abc' is 8 chars — must be exactly 9.
Real error strings from La Plateforme and validate_data.py
| Error | Source | Fix |
|---|---|---|
| Invalid file format | La Plateforme upload | File is not strict JSONL: one record per line, no array, no pretty-print, no blank lines. |
| has only N conversation which is less than the minimum amount | validate_data.py | Need ≥ 10 conversations. Add more rows. |
| has only N bytes which is less than the minimum | validate_data.py | Need ≥ 1,000 bytes. Add more content. |
| has N conversations which is more than the maximum | validate_data.py | Cap is 10,000,000 lines. Split into multiple files. |
| UnrecognizedRoleError | validate_data.py | Roles must be system, user, assistant, or tool. Fix human, bot, developer, typos. |
| InvalidAssistantMessageException: Assistant message must have either content or tool_calls, but not both | mistral_common | Pick one. Never set content:"" alongside tool_calls. |
| ToolCallFormatError / FunctionFormatError | validate_data.py | id and tool_call_id must be exactly 9 chars; arguments must be a JSON string (not an object). |
| could not be tokenized | validate_data.py | Non-UTF-8 character, or v3-tokenizer-incompatible model selected. |
| Expected last role User or Tool (or Assistant with prefix True) for serving but got assistant | Runtime / inference | This is the inference rule, not training. Training data ends on assistant intentionally. |
How to use it
- Drop a .jsonl file or paste records directly.
- Click Validate. Each line is parsed and checked against the same rules validate_data.py enforces.
- Inspect the Error List: every issue maps to either a La Plateforme upload error or a validate_data.py exception name.
- Download valid examples only rebuilds a clean file with broken lines stripped.
- Upload to Mistral AI Studio or run mistral-finetune locally.
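If you would rather script the "valid examples only" step yourself, the idea is roughly the following. This is an illustration, not this page's implementation; split_valid_lines is a hypothetical name, and is_valid stands for whatever per-record check you supply.

```python
import json
from typing import Callable

def split_valid_lines(text: str, is_valid: Callable[[object], bool]):
    """Return (clean_jsonl, rejected), where rejected is [(line_no, reason)]."""
    kept, rejected = [], []
    for line_no, line in enumerate(text.rstrip("\n").split("\n"), start=1):
        line = line.rstrip("\r")
        if not line.strip():
            rejected.append((line_no, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if is_valid(record):
            kept.append(line)
        else:
            rejected.append((line_no, "failed validation"))
    return "".join(line + "\n" for line in kept), rejected
```

Keeping the original line numbers in the rejected list makes it easy to go back and repair the source file instead of silently dropping rows.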
Tips & common pitfalls
- The 9-character tool-call ID is non-negotiable. OpenAI's call_abc123-style IDs will fail Mistral validation. Use a generator that emits exactly 9 alphanumeric characters.
- Arguments are strings, not objects. "arguments": "{\"city\":\"Paris\"}" ✓; "arguments": {"city": "Paris"} ✗.
- Content XOR tool_calls on assistant turns. Never both, never neither.
- The deprecated docs banner is real. Mistral's docs.mistral.ai/guides/finetuning/ page has a deprecation banner; fine-tuning now lives on Mistral AI Studio and Azure AI Foundry.
- Pixtral vs v3 tokenizer. Self-hosted mistral-finetune only supports v3-tokenizer weights. Pixtral uses the Tekken tokenizer; train it on Mistral AI Studio, not the self-hosted repo.
- Base64 needs the data URI prefix. Pixtral expects data:image/jpeg;base64,..., not raw base64.
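If an existing dataset already carries OpenAI-style IDs, one option is to rewrite them deterministically so that each tool call and its tool reply stay paired. The sketch below takes the first 9 hex characters of a SHA-256 digest; that scheme is an assumption chosen for illustration, not something Mistral prescribes, and remap_tool_call_ids is a hypothetical name.

```python
import copy
import hashlib
import re

ID_PATTERN = re.compile(r"[A-Za-z0-9]{9}")

def _to_nine_chars(old_id) -> str:
    """Keep IDs that are already valid; otherwise derive a stable 9-char ID."""
    if isinstance(old_id, str) and ID_PATTERN.fullmatch(old_id):
        return old_id
    # Hex digits are alphanumeric, and the same input always maps to the same ID.
    return hashlib.sha256(str(old_id).encode("utf-8")).hexdigest()[:9]

def remap_tool_call_ids(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with id and tool_call_id rewritten."""
    out = copy.deepcopy(messages)
    for msg in out:
        for call in msg.get("tool_calls") or []:
            if "id" in call:
                call["id"] = _to_nine_chars(call["id"])
        if msg.get("role") == "tool" and "tool_call_id" in msg:
            msg["tool_call_id"] = _to_nine_chars(msg["tool_call_id"])
    return out
```

Because the mapping is a pure function of the old ID, no lookup table is needed to keep the two sides in sync.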
Troubleshooting
I'm getting Invalid file format from La Plateforme.
Three common causes: (1) the file is a JSON array [ {…}, {…} ] instead of JSONL; (2) records are pretty-printed and span multiple lines; (3) there's a blank line somewhere in the file. Minify each record to one line and remove blank lines.
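All three causes can be repaired with one pass over the file. The sketch below uses the standard library's json.JSONDecoder.raw_decode to read one JSON value at a time, whatever the whitespace between values; to_strict_jsonl is a hypothetical name.

```python
import json

def to_strict_jsonl(text: str) -> str:
    """Convert a JSON array or pretty-printed records to one record per line."""
    decoder = json.JSONDecoder()
    records = []
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            records.extend(value)  # a top-level array becomes one line per element
        else:
            records.append(value)
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
```

Input that is not valid JSON at all still raises json.JSONDecodeError, which is the right outcome: those records need a manual fix.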
UnrecognizedRoleError — what does Mistral accept?
Exactly four roles: system, user, assistant, tool. No developer, no human, no bot, no function.
Why does validate_data.py say my tool-call ID is wrong?
Mistral requires id and tool_call_id to be exactly 9 characters matching [A-Za-z0-9]{9}. OpenAI-style call_abc123 IDs (which vary in length) will fail.
How do I run validate_data.py myself?
git clone https://github.com/mistralai/mistral-finetune; cd mistral-finetune; pip install -r requirements.txt; python -m utils.validate_data --train_yaml example.yaml. This page is the no-install equivalent.
Can I fine-tune Codestral / Ministral / Pixtral / Nemo?
Yes, all on Mistral AI Studio. Azure AI Foundry covers Large 2411, Nemo, and Ministral 3B. Codestral, Pixtral, and Magistral are AI Studio only.
Is my data uploaded?
Never. Everything runs in your browser. See the privacy policy.
Frequently asked questions
Which Mistral models support fine-tuning in 2026?
Mistral AI Studio: Mistral Small (25.x), Mistral Large 2411, Mistral Nemo, Ministral 3B / 8B, Codestral (25.08), Pixtral (12B / Large), Magistral. Azure AI Foundry: Mistral Large 2411, Mistral Nemo, Ministral 3B. Self-hosted (mistral-finetune): any v3-tokenizer weight.
Is Mistral fine-tune JSONL the same as OpenAI?
Mostly. The messages shape is identical and the roles are the same, minus developer. The differences: tool-call IDs must be exactly 9 characters, and content arrays for vision use image_url as a raw URL string with the data: prefix, not the OpenAI {"url": "..."} object. Both formats accept a weight of 0 or 1 on assistant turns for loss masking.
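As a rough illustration of bridging those differences, the sketch below renames developer to system and flattens the OpenAI image_url object to a plain string. The function name is hypothetical, and it deliberately leaves tool-call IDs alone, since those need the separate 9-character treatment.

```python
import copy

def openai_to_mistral(record: dict) -> dict:
    """Return a copy of an OpenAI-style record adjusted toward Mistral's shape."""
    out = copy.deepcopy(record)
    for msg in out.get("messages", []):
        if msg.get("role") == "developer":
            msg["role"] = "system"
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    image = part.get("image_url")
                    # OpenAI nests the URL in an object; Mistral expects the string itself.
                    if isinstance(image, dict) and "url" in image:
                        part["image_url"] = image["url"]
    return out
```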
How do I run validate_data.py?
Clone mistralai/mistral-finetune, install requirements, run python -m utils.validate_data --train_yaml your.yaml. Or use this page to get the same checks without the install.
Is there an online Mistral JSONL validator?
This is it. No login, no upload, no install.
La Plateforme vs Mistral AI Studio vs Azure AI Foundry vs mistral-finetune — which should I use?
Mistral AI Studio is the active managed path for almost every Mistral model. Azure AI Foundry is the right answer if you're already on Azure and need a subset of models. mistral-finetune (the open-source repo) is for self-hosted LoRA on Mistral 7B / Mixtral / Nemo / Large v2 / Codestral 22B with v3 tokenizer.
What's the minimum number of examples?
10 conversations and 1,000 bytes per validate_data.py. Practical recommendation: ≥ 100 examples to see signal, thousands for production.
Do id and tool_call_id really have to be 9 characters?
Yes — Mistral enforces this in validate_data.py with the regex [A-Za-z0-9]{9}. OpenAI-style IDs (call_abc123) will fail.
How do I fine-tune Pixtral on images?
Use Mistral AI Studio (not mistral-finetune — Pixtral needs Tekken tokenizer). Each user turn's content is an array of {"type": "text", "text": ...} and {"type": "image_url", "image_url": "data:image/jpeg;base64,..."} parts. Assistant turns stay plain strings.