Split JSONL into Train, Validation, Test
Train, validation and test splitter. Split a JSONL (JSON Lines) dataset into train, val and test sets. Random with seed, optional stratify by key, configurable ratios. Three downloads in one click. Browser-only.
100% client-side. No upload.
Train / Val / Test Splitter
The reproducible split every ML pipeline needs. Drop in a JSONL file, set the ratios (80/10/10 by default), choose a seed so your results are repeatable, and optionally stratify by a label key to keep the class distribution identical across all three splits. Three named files come out: train.jsonl, val.jsonl, test.jsonl. 100% in-browser.
Random split
Shuffles the input with a seeded PRNG (so the same input + same seed always gives the same three files), then takes the first train% for train, the next val% for validation, and the rest for test. The seed defaults to 42 — change it if you want to try a different shuffle.
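The shuffle-then-slice procedure described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function names `random_split` and `read_jsonl` are hypothetical, the floor-based rounding is an assumption, and Python's PRNG differs from whatever the browser tool uses, so seed 42 here will not reproduce the tool's files.

```python
import json
import random


def read_jsonl(text):
    """Parse JSONL text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def random_split(rows, train_pct=80, val_pct=10, test_pct=10, seed=42):
    """Shuffle rows with a seeded PRNG, then cut them into train/val/test."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # same input + same seed -> same order
    n = len(shuffled)
    n_train = n * train_pct // 100          # assumption: round down
    n_val = n * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # the rest, including rounding leftovers
    return train, val, test
```

Calling `random_split(rows, seed=42)` twice on the same list returns identical splits, which is the property the seed exists to provide.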
Stratified split
Set a key (typically the label or class field, such as label, category, or intent) and the splitter keeps each class's proportion approximately the same in all three files. Critical when classes are imbalanced: a pure random split can put zero examples of a rare class into val/test and silently destroy your evaluation.
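One common way to get this behaviour is to bucket rows by the key, shuffle each bucket, and apply the ratios per bucket. The sketch below assumes that approach; the name `stratified_split`, the rounding rule, and the handling of rows that lack the key are illustrative choices, not a description of the tool's internals.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, train_pct=80, val_pct=10, test_pct=10, seed=42):
    """Apply the train/val/test ratios separately inside each class."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row.get(key)].append(row)    # rows missing the key share the None bucket
    train, val, test = [], [], []
    for label in sorted(by_class, key=str):   # fixed class order keeps runs repeatable
        bucket = by_class[label]
        rng.shuffle(bucket)
        n_train = len(bucket) * train_pct // 100   # assumption: round down per class
        n_val = len(bucket) * val_pct // 100
        train += bucket[:n_train]
        val += bucket[n_train:n_train + n_val]
        test += bucket[n_train + n_val:]
    return train, val, test
```

With 90 rows of one class and 10 of another at 80/10/10, the rare class gets 8 train, 1 val and 1 test row instead of relying on luck.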
Why a separate test set?
Standard ML hygiene: use train for fitting, val for hyperparameter tuning and early stopping, test for the final unbiased evaluation. If you tune on test, your reported metrics will be optimistic and the model will likely do worse in production than those numbers suggest.
Ratios that aren't 80/10/10
- Tiny datasets (< 500 rows): 70/15/15 or even 60/20/20 to keep eval sets statistically meaningful.
- Large datasets (> 1M): 98/1/1 or 90/5/5 — when val and test are big enough in absolute terms, give the train set more.
- Fine-tune only (no held-out test): 90/10/0. Set test to 0 and you get only train and val files.
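What matters for the eval sets is their absolute size, so it can help to turn ratios into row counts before choosing. A small arithmetic sketch, assuming floor rounding with leftovers going to test (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(n_rows, train_pct, val_pct, test_pct):
    """Return (train, val, test) row counts for the given percentages."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    return n_train, n_val, n_rows - n_train - n_val


# Compare a tiny dataset with a large one.
for n_rows, ratios in [(400, (70, 15, 15)), (2_000_000, (98, 1, 1))]:
    print(n_rows, ratios, split_sizes(n_rows, *ratios))
```

At 400 rows, 70/15/15 leaves 60 rows each for val and test; at 2,000,000 rows, even 98/1/1 leaves 20,000 each.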
Tips & common pitfalls
- Sums must be 100. The splitter complains if your three percentages don't add up.
- Dedupe before splitting. A duplicate that lands in both train and test inflates your eval scores. Run the JSONL Deduplicator first.
- Group leakage. If your dataset has groups (e.g. multiple rows per user), a random split can put rows from the same user in both train and test. The stratify mode here splits within each class and does not prevent this. For true group-aware splitting, assign whole groups to a split so that every row from a given user ends up in exactly one file.
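To keep a group from straddling train and test, each group has to land entirely in one file. A minimal sketch of that idea, done outside the tool: shuffle the group IDs rather than the rows, then apply the ratios to the groups. The function name `group_split` and the `group_key` parameter are hypothetical, and because the ratios apply to group counts, the row counts only match them roughly when groups differ in size.

```python
import random
from collections import defaultdict


def group_split(rows, group_key, train_pct=80, val_pct=10, test_pct=10, seed=42):
    """Split by whole groups so no group appears in more than one split."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    group_ids = sorted(groups, key=str)          # stable starting order
    random.Random(seed).shuffle(group_ids)       # shuffle groups, not rows
    n_train = len(group_ids) * train_pct // 100  # ratios apply to group counts
    n_val = len(group_ids) * val_pct // 100
    parts = (
        group_ids[:n_train],
        group_ids[n_train:n_train + n_val],
        group_ids[n_train + n_val:],
    )
    # Expand each list of group IDs back into its rows.
    return tuple([row for g in ids for row in groups[g]] for ids in parts)
```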
Before you start
You need a single JSONL file representing your full dataset. The splitter shuffles it (using your seed) and produces three files: train, val and test.
How to use it
- Drop your JSONL or paste it.
- Set the ratios — default 80/10/10 (train/val/test). Any three numbers that sum to 100 work.
- Pin a Seed for reproducible splits.
- Optional: enable Stratify by key to preserve a label distribution across splits.
- Click Split, then download each file.
Stratified split
For classification or labelled data, stratify by the label key (e.g. label, category). The split keeps the proportion of each label roughly equal in all three sets — important when a class is rare.
Tips & common pitfalls
- Dedupe first. Run the prompt deduper before splitting; otherwise duplicate prompts leak from train into val/test and inflate eval numbers.
- Pin the seed for every paper or experiment. Splits that change between runs make results irreproducible.
- Don't split and then reshuffle each part. The shuffle happens before the split; reshuffling a file afterwards changes its row order, so it no longer matches what the same input and seed produce.
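If you are unsure whether duplicates slipped through, you can check the downloaded files for rows that appear in more than one split. The sketch below is one way to do that, assuming exact-duplicate detection after normalising key order; the helper names `canonical` and `overlap` are hypothetical, and near-duplicates (different whitespace inside strings, paraphrases) are not caught.

```python
import json


def canonical(line):
    """Normalise one JSONL line so key order and spacing don't matter."""
    return json.dumps(json.loads(line), sort_keys=True)


def overlap(a_text, b_text):
    """Return the set of rows (in canonical form) present in both JSONL texts."""
    a = {canonical(l) for l in a_text.splitlines() if l.strip()}
    b = {canonical(l) for l in b_text.splitlines() if l.strip()}
    return a & b
```

Reading train.jsonl and test.jsonl as text and passing both to `overlap` should return an empty set; anything else is a row that leaked across the split.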
Frequently asked questions
Can I do a 90/5/5 or 70/15/15?
Yes. Any three percentages that sum to 100 work, and the test ratio can be 0.
Can I skip the test set?
Yes. Set the test ratio to 0 and the tool produces just train and val.
k-fold cross-validation?
Not yet — on the roadmap.