Shuffle, Sample, or Split a JSONL Training Set
JSONL sampler. Take a random sample of N records from a JSONL (JSON Lines) file, or grab the first or last N. Stratified sampling by key. Reservoir sampling for large files. Up to 1 GB, in your browser.
100% client-side. No upload.
JSONL Sampler
Take a representative slice of a big JSONL file without writing a one-liner: random sample (uses reservoir sampling so a 100M-line file still works), head, tail, every Nth record, or a stratified sample that keeps the same per-category ratio. Seed-controlled for reproducibility. 100% in-browser.
Sampling modes
Random N (reservoir)
Picks N records uniformly at random across the whole file using Algorithm R reservoir sampling. It makes a single pass and holds only N records in memory at a time, so it handles files much larger than the result. Each record has equal probability of being kept.
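As a rough illustration of Algorithm R, here is a minimal Python sketch. It is not the tool's own code: the function name `reservoir_sample` and its parameters are hypothetical, and it assumes one JSON object per non-blank line.

```python
import json
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], n: int, seed: Optional[int] = None) -> List[dict]:
    """Keep n records uniformly at random from a stream of JSONL lines (Algorithm R)."""
    rng = random.Random(seed)  # seed=None gives a different sample on each run
    reservoir: List[dict] = []
    seen = 0  # number of non-blank records read so far
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < n:
            reservoir.append(json.loads(line))  # fill the reservoir first
        else:
            j = rng.randrange(seen)  # uniform index in [0, seen)
            if j < n:
                reservoir[j] = json.loads(line)  # overwrite a random slot
    return reservoir
```

Only the records currently in the reservoir are parsed and held, so memory grows with N rather than with the file size. Note that this sketch returns records in reservoir-slot order, not in file order.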
First N / Last N
Grab the first or last N records. Useful for spot-checking, smoke tests, or "show me the most recent log lines."
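For comparison, first-N and last-N can each be done in one streaming pass. The sketch below uses only the Python standard library; the helper names `head` and `tail` are illustrative, not part of the tool.

```python
from collections import deque
from itertools import islice
from typing import Iterable, List


def head(lines: Iterable[str], n: int) -> List[str]:
    """First n lines; stops reading as soon as n lines are collected."""
    return list(islice(lines, n))


def tail(lines: Iterable[str], n: int) -> List[str]:
    """Last n lines; the bounded deque discards older lines as new ones arrive."""
    return list(deque(lines, maxlen=n))
```

Both accept any iterable of lines, such as an open file handle, and never hold more than n lines at once.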
Stratified by key
Keeps the same per-category ratio as the input. Set the key path (e.g. category or user.tier) and a target N; the sampler partitions records by that key and takes a proportional random sample from each group. Categories smaller than their share are taken in full. Useful when one category dominates the file and a pure random sample would miss the rare classes.
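One way to sketch proportional stratified sampling in Python is shown below. This is an assumption about how such a sampler could work, not the tool's actual implementation: `stratified_sample` and its `key` callback are hypothetical names, all records are held in memory, and each group's quota is simply rounded, so the total can differ from N by a record or two and a very small group can round down to zero.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Optional


def stratified_sample(
    records: Iterable[dict],
    key: Callable[[dict], Hashable],
    n: int,
    seed: Optional[int] = None,
) -> List[dict]:
    """Sample about n records, keeping each group's share of the input."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)  # partition records by key value
    total = sum(len(members) for members in groups.values())
    sample: List[dict] = []
    for members in groups.values():
        # Proportional quota, capped at the group size (small groups are taken in full).
        quota = min(len(members), round(n * len(members) / total))
        sample.extend(rng.sample(members, quota))
    return sample
```

For example, with 800 records where `plan` is `free` and 200 where it is `pro`, calling `stratified_sample(records, lambda r: r["plan"], 100)` yields 80 free and 20 pro records.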
Every Nth record
Deterministic systematic sampling: keep record 1, N+1, 2N+1, … Good for downsampling time-ordered logs where you want even temporal coverage rather than random spikes.
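Systematic sampling needs no randomness at all. A minimal Python sketch, assuming N is at least 1 (the name `every_nth` is illustrative):

```python
from itertools import islice
from typing import Iterable, Iterator


def every_nth(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield lines 1, n+1, 2n+1, ... (1-based): every n-th line, starting with the first."""
    return islice(lines, 0, None, n)
```

Because the result is a lazy iterator, it can be written out line by line without loading the input.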
Seed
Random and stratified modes use a seeded PRNG so the same input + seed always produces the same sample. Leave the seed blank for a fresh non-deterministic sample on each run.
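The effect of a seed can be illustrated with Python's own seeded generator. This is only a sketch of the idea: `pick_indices` is a hypothetical helper, and because different PRNGs produce different streams, the same seed value will not select the same records here as it does in the browser tool.

```python
import random
from typing import List, Optional


def pick_indices(total: int, n: int, seed: Optional[int] = None) -> List[int]:
    """Choose n distinct record indices out of total, in ascending order."""
    rng = random.Random(seed)  # seed=None draws from OS entropy, so runs differ
    return sorted(rng.sample(range(total), n))
```

Calling `pick_indices(1000, 5, seed=7)` twice returns the same five indices; omitting the seed gives a fresh selection each time.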
Tips & common pitfalls
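- Reservoir vs. shuffle. Reservoir sampling is the standard approach when the stream size is not known in advance. It needs one pass and keeps only N records in memory, whereas a full shuffle has to load every record before it can pick any.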
- Stratify needs the key to exist on every record. Records missing the key are bucketed under __missing__.
- Output order. Random mode preserves the original order of the kept records; stratified mode interleaves by group.
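To make the `__missing__` bucket concrete, here is a hedged Python sketch of resolving a dotted key path such as `user.tier`. The function name `resolve_key` and the exact lookup rules are assumptions; the tool may treat edge cases (for example null values or arrays) differently.

```python
from typing import Any

MISSING = "__missing__"  # bucket label for records without the key


def resolve_key(record: dict, path: str) -> Any:
    """Follow a dotted path through nested objects; fall back to the missing bucket."""
    value: Any = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return MISSING  # path breaks off before reaching a value
        value = value[part]
    return value
```

With this helper, `{"user": {"tier": "pro"}}` resolves to `pro` for the path `user.tier`, while `{"user": {}}` falls into `__missing__`.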
Before you start
The sampler reads your JSONL and returns a smaller subset — useful when you need a representative slice for inspection, prototyping or sharing.
How to use it
- Drop a file or paste JSONL.
- Pick a mode: Random, Head (first N), Tail (last N), or Stratified (by a key).
- Set the sample size N.
- For stratified, set the stratify key (e.g. plan or country).
- Optionally pin a Seed for reproducible random samples.
- Click Sample, then Copy or Download.
Modes explained
Random
Uses reservoir sampling — single pass over the file, memory-efficient even for huge inputs. Every record has equal probability.
Head / Tail
The first or last N lines. Fast and exact.
Stratified
Keeps the distribution of a chosen key. If 80% of your rows have plan: free and 20% plan: pro, a stratified sample of 100 returns roughly 80 free + 20 pro records.
Tips & common pitfalls
- Pin the seed when sharing a sample so the recipient can reproduce it.
- Don't head-sample a sorted file — you'll get a biased slice (e.g. only the oldest signups).
- Stratified beats random when you want to be sure rare categories appear in the sample.
Frequently asked questions
Is random sampling truly uniform?
Yes — reservoir sampling gives each row an equal probability, even when the file is too large to fit in memory.
Can I stratify by multiple keys?
Not directly. Concatenate the keys upstream (plan_country) and stratify on the combined value.
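The upstream concatenation could look like the following Python sketch. The helper name `add_combined_key`, the `_` separator, and the use of `plan_country` as the new field name are all assumptions made for illustration; absent keys are filled with the same `__missing__` label the sampler uses.

```python
import json
from typing import Iterable, Iterator, Sequence


def add_combined_key(
    lines: Iterable[str],
    keys: Sequence[str],
    name: str = "plan_country",  # hypothetical name for the combined field
    sep: str = "_",
) -> Iterator[str]:
    """Rewrite each JSONL line with an extra field joining the given top-level keys."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        rec[name] = sep.join(str(rec.get(k, "__missing__")) for k in keys)
        yield json.dumps(rec)
```

A record like `{"plan": "free", "country": "DE"}` gains `"plan_country": "free_DE"`, and the sampler can then stratify on that single field.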