Train / Val / Test Splitter
100% client-side. No upload.
The reproducible split every ML pipeline needs. Drop in a JSONL file, set the ratios (80/10/10 by default), choose a seed so your results are repeatable, and optionally stratify by a label key to keep the class distribution consistent across all three splits. Three named files come out: train.jsonl, val.jsonl, test.jsonl. 100% in-browser.
Random split
Shuffles the input with a seeded PRNG (so the same input + same seed always gives the same three files), then takes the first train% for train, the next val% for validation, and the rest for test. The seed defaults to 42 — change it if you want to try a different shuffle.
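The shuffle-then-slice procedure described above can be sketched in a few lines of Python. This is an illustration only: the function name random_split is hypothetical, and the splitter's actual PRNG and rounding rules are not documented here, so this sketch will not reproduce the tool's files row for row. It assumes whole-number percentages and rounds train and val down, leaving the remainder for test.

```python
import random


def random_split(rows, train_pct=80, val_pct=10, seed=42):
    """Shuffle a copy of rows with a seeded PRNG, then cut it into three slices."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # private PRNG, global state unaffected
    n = len(shuffled)
    n_train = n * train_pct // 100             # rounding down is an assumption
    n_val = n * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # everything left over goes to test
    return train, val, test
```

Calling random_split(rows) twice on the same input returns the same three lists, which is the property that makes a seeded split reproducible.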
Stratified split
Set a key (typically the label or class field, such as label, category, or intent) and the splitter keeps each class's proportion approximately the same in all three files. This is critical when classes are imbalanced: a pure random split can leave val or test with zero examples of a rare class and silently invalidate your evaluation.
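One common way to implement stratification is to bucket rows by the key, split each bucket with the same ratios, and then merge the results. The sketch below follows that approach; it is not necessarily how this splitter does it. The name stratified_split is hypothetical, rows are assumed to be already-parsed JSON objects (dicts), percentages are assumed to be whole numbers, and rows missing the key are grouped together under None.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, train_pct=80, val_pct=10, seed=42):
    """Split each class separately so every output keeps roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row.get(key)].append(row)     # rows without the key share the None bucket

    train, val, test = [], [], []
    for label in sorted(by_class, key=str):    # fixed class order keeps the result deterministic
        members = by_class[label]
        rng.shuffle(members)
        n = len(members)
        n_train = n * train_pct // 100
        n_val = n * val_pct // 100
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]

    for part in (train, val, test):
        rng.shuffle(part)                      # avoid leaving rows grouped by class
    return train, val, test
```

Because this sketch rounds down per class, a class with fewer than 10 rows at a 10% ratio still contributes nothing to val, so very rare classes may need a larger eval ratio or more data.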
Why a separate test set?
Standard ML hygiene: use train for fitting, val for hyperparameter tuning and early stopping, and test for the final unbiased evaluation. If you tune on test, your reported metrics will be optimistic and the model will likely fall short of them in production.
Ratios that aren't 80/10/10
- Tiny datasets (< 500 rows): 70/15/15 or even 60/20/20 to keep eval sets statistically meaningful.
- Large datasets (> 1M): 98/1/1 or 90/5/5. Even 1% of 1M rows is 10,000 examples, so once val and test are big enough in absolute terms, give the rest to train.
- Fine-tune only (no test set): 90/10/0. Set test to 0 and you get only train and val files.
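To judge whether a ratio leaves enough rows for evaluation, it helps to convert the percentages into absolute counts. The helper below is a minimal sketch with a hypothetical name, split_sizes. It assumes whole-number percentages, rounds train and val down, and sends leftover rows to test (or to val when test is 0); the splitter's own rounding may differ.

```python
def split_sizes(n_rows, train_pct, val_pct, test_pct):
    """Return (train, val, test) row counts for whole-number percentages."""
    if min(train_pct, val_pct, test_pct) < 0 or train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must be non-negative and sum to 100")
    n_train = n_rows * train_pct // 100
    if test_pct == 0:
        # No test file requested: everything after train goes to val.
        return n_train, n_rows - n_train, 0
    n_val = n_rows * val_pct // 100
    return n_train, n_val, n_rows - n_train - n_val


print(split_sizes(400, 70, 15, 15))        # (280, 60, 60)
print(split_sizes(1_000_000, 98, 1, 1))    # (980000, 10000, 10000)
print(split_sizes(1000, 90, 10, 0))        # (900, 100, 0)
```

The first two calls show why the ratio should shift with dataset size: 15% of 400 rows is only 60 examples, while 1% of a million rows is already 10,000.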
Tips & common pitfalls
- Ratios must sum to 100. The splitter complains if your three percentages don't add up.
- Dedupe before splitting. A duplicate that lands in both train and test inflates your eval scores. Run the JSONL Deduplicator first.
- Group leakage. If your dataset has groups (e.g. multiple rows per user), a random split can put rows from the same user in both train and test. The stratify mode here splits within each class, so it does not prevent this. For true group-aware splitting, split the list of group IDs first, then send every row to the split its group was assigned to, so no group straddles two files.
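The group leakage tip above can be handled outside the splitter by shuffling group IDs rather than rows. The following is a minimal sketch under stated assumptions: group_split and the user_id field are hypothetical names, every row is assumed to carry the group key, and the percentages apply to the number of groups, so row counts per split will drift when groups differ in size.

```python
import random


def group_split(rows, group_key, train_pct=80, val_pct=10, seed=42):
    """Assign whole groups to one split so no group appears in two files."""
    # Sort before shuffling so the result depends only on the seed,
    # not on set iteration order.
    groups = sorted({row[group_key] for row in rows}, key=str)
    random.Random(seed).shuffle(groups)

    n = len(groups)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    assignment = {}
    for i, group in enumerate(groups):
        if i < n_train:
            assignment[group] = "train"
        elif i < n_train + n_val:
            assignment[group] = "val"
        else:
            assignment[group] = "test"

    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assignment[row[group_key]]].append(row)
    return splits
```

For example, group_split(rows, "user_id") returns a dict of three lists in which each user's rows all sit in the same list.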