jsonlkit CLI
Everything jsonlkit.com does, on the command line. Validate, fix, convert, dedupe, and transform JSONL with one tool. Streams line-by-line — handles files larger than RAM. No telemetry, no network calls.
Same logic, on big data. The browser tools are great for one-off work up to ~1 GB. The CLI is what you reach for when your training file is 50 GB or you want to wire JSONL processing into a shell pipeline.
Install
# from GitHub (current — until npm publish)
npm install -g github:tinytoolkit-org/jsonlkit-cli
# one-off, no install
npx github:tinytoolkit-org/jsonlkit-cli --help
# from source
git clone https://github.com/tinytoolkit-org/jsonlkit-cli
cd jsonlkit-cli && npm test && npm link
Requires Node 18+. Single binary, zero runtime dependencies.
Recipes — copy and paste
Validate an OpenAI fine-tune file
jsonlkit validate --openai training.jsonl
Catches every error the OpenAI cookbook checks: bad roles, missing assistant turns, wrong tool_calls shape, extra keys, examples over 16,385 tokens. Exit code 1 if anything's off — wires into CI.
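To make two of the checks above concrete (bad roles, missing assistant turns) along with the exit-code rule, here is a minimal JavaScript sketch. It is not jsonlkit's source: the function names, the error names, and the allowed-role list are assumptions made for illustration.

```javascript
// Hypothetical sketch of a per-line chat-format check; not jsonlkit's actual code.
// The role list and the error names are assumptions for illustration.
const ALLOWED_ROLES = new Set(["system", "user", "assistant", "tool"]);

function checkChatExample(line) {
  let record;
  try {
    record = JSON.parse(line);
  } catch {
    return ["invalid_json"];
  }
  const messages = record && record.messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    return ["missing_messages"];
  }
  const errors = [];
  for (const message of messages) {
    if (!message || !ALLOWED_ROLES.has(message.role)) errors.push("bad_role");
  }
  if (!messages.some((m) => m && m.role === "assistant")) {
    errors.push("missing_assistant_turn");
  }
  return errors;
}

// Non-zero when any line fails, so a CI step can gate on the result.
function exitCodeFor(lines) {
  return lines.some((line) => checkChatExample(line).length > 0) ? 1 : 0;
}
```

A CI job would pass the result of exitCodeFor to process.exit, which mirrors the exit-code-1 behaviour described above.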
One flag per provider, same as the web tool:
jsonlkit validate --anthropic data.jsonl # Claude on Bedrock
jsonlkit validate --gemini data.jsonl # Vertex AI
jsonlkit validate --llama data.jsonl # ChatML
jsonlkit validate --sharegpt data.jsonl # conversations + from/value
jsonlkit validate --alpaca data.jsonl # instruction/input/output
jsonlkit validate --mistral data.jsonl # with the 9-char tool_call id rule
The web versions for reference: OpenAI · Anthropic · Gemini · Llama · Mistral.
Spreadsheet → fine-tune JSONL
jsonlkit from csv dataset.csv > training.jsonl
Auto-types numbers and booleans. Use --nest dot to expand user.name-style columns into nested objects. Streaming RFC 4180 parser — works on huge CSVs. Web version: /csv-to-jsonl.
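As a rough illustration of what auto-typing and dot-nesting mean for a single CSV row, here is a small JavaScript sketch. The helper names (autoType, nestDot, rowToRecord) and the exact typing rules are assumptions, not jsonlkit's implementation.

```javascript
// Illustrative only: cell typing and dot-nesting, not jsonlkit's real code.
function autoType(cell) {
  if (cell === "true") return true;
  if (cell === "false") return false;
  // Only plain decimal literals become numbers, so "007" stays a string.
  if (/^-?(0|[1-9]\d*)(\.\d+)?$/.test(cell)) return Number(cell);
  return cell;
}

// Expand {"user.name": "Ada"} into {user: {name: "Ada"}}.
function nestDot(flat) {
  const out = {};
  for (const [path, value] of Object.entries(flat)) {
    const parts = path.split(".");
    let node = out;
    for (const part of parts.slice(0, -1)) {
      if (typeof node[part] !== "object" || node[part] === null) node[part] = {};
      node = node[part];
    }
    node[parts[parts.length - 1]] = value;
  }
  return out;
}

function rowToRecord(header, cells) {
  const flat = {};
  header.forEach((name, i) => {
    flat[name] = autoType(cells[i] ?? "");
  });
  return nestDot(flat);
}

console.log(JSON.stringify(rowToRecord(["id", "user.name", "active"], ["7", "Ada", "true"])));
// prints {"id":7,"user":{"name":"Ada"},"active":true}
```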
Repair a broken JSONL
jsonlkit fix broken.jsonl > clean.jsonl
Fixes trailing commas, single quotes, smart quotes, BOMs, NaN, // comments, unquoted keys, and the SSE data: prefix. Drops lines it can't repair and reports them on stderr. Web version: /fix-jsonl.
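The repair-or-drop behaviour can be sketched in a few lines of JavaScript. This covers only four of the repairs listed above (BOM, SSE data: prefix, smart double quotes, trailing commas), uses a deliberately naive regex, and is an assumption about the general approach rather than jsonlkit's actual code.

```javascript
// A deliberately small sketch of repair-or-drop; not jsonlkit's code.
function repairLine(raw) {
  // Strip a leading byte-order mark, then an SSE "data:" prefix.
  const stripped = raw.replace(/^\uFEFF/, "").replace(/^data:\s*/, "");
  const candidates = [
    stripped, // already valid? then leave string contents untouched
    stripped
      .replace(/[\u201C\u201D]/g, '"') // smart double quotes to straight quotes
      .replace(/,\s*([}\]])/g, "$1"), // naive: drop trailing commas before } or ]
  ];
  for (const candidate of candidates) {
    try {
      return JSON.stringify(JSON.parse(candidate));
    } catch {
      // fall through to the next, more aggressive candidate
    }
  }
  return null; // unrepairable
}

function fixLines(lines) {
  const kept = [];
  const dropped = []; // 1-based line numbers, for reporting on stderr
  lines.forEach((line, i) => {
    const repaired = repairLine(line);
    if (repaired === null) dropped.push(i + 1);
    else kept.push(repaired);
  });
  return { kept, dropped };
}
```

Parsing the untouched line first keeps valid records from being altered by the regex pass.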
Peek at the last records of a multi-GB log
jsonlkit tail -n 10 server.log.jsonl
Seeks from the end of the file — runs in ~50 ms on a 50 GB log instead of parsing the whole thing. Same -n flag for head and sample. Web version: /viewer.
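Seeking from the end can be sketched with Node's synchronous fs calls: read fixed-size chunks backwards until enough newlines have been seen, then decode only that suffix. The function name tailLines and the 64 KB chunk size are assumptions; jsonlkit's own implementation may differ.

```javascript
const fs = require("node:fs");

// Sketch of "seek from the end"; illustrative, not jsonlkit's actual code.
function tailLines(path, n, chunkSize = 64 * 1024) {
  if (n <= 0) return [];
  const fd = fs.openSync(path, "r");
  try {
    let position = fs.fstatSync(fd).size;
    const chunks = [];
    let newlines = 0;
    // Seeing n + 1 newlines guarantees n complete lines, even when the file
    // ends with a trailing newline.
    while (position > 0 && newlines <= n) {
      const length = Math.min(chunkSize, position);
      position -= length;
      const chunk = Buffer.alloc(length);
      fs.readSync(fd, chunk, 0, length, position);
      for (const byte of chunk) if (byte === 0x0a) newlines++;
      chunks.unshift(chunk);
    }
    // Decode after concatenating so multi-byte UTF-8 characters that straddle
    // chunk boundaries are reassembled.
    const lines = Buffer.concat(chunks).toString("utf8").split("\n");
    if (lines[lines.length - 1] === "") lines.pop(); // trailing newline
    return lines.slice(-n);
  } finally {
    fs.closeSync(fd);
  }
}
```

The work done depends on the length of the last n lines, not on the size of the file, which is why this style of tail stays fast on very large logs.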
Dedupe a fine-tune corpus
# by full record
jsonlkit dedupe data.jsonl > deduped.jsonl
# by a single key (supports dot.notation for nested)
jsonlkit dedupe --key user.email data.jsonl
Stores one SHA-1 digest per unique record in memory (40 hex characters each), so it handles tens of millions of unique records. Web version: /deduplicator.
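The digest-based approach can be sketched as follows: hash each record (or the value at a key), remember only the hash, and keep the first occurrence. A SHA-1 digest is 40 characters when hex-encoded. The function names are assumptions, and this version takes an array of lines for brevity where a real tool would stream.

```javascript
const { createHash } = require("node:crypto");

// Resolve "user.email"-style paths; undefined when any segment is missing.
function getPath(record, dotted) {
  return dotted.split(".").reduce(
    (node, part) => (node !== null && typeof node === "object" ? node[part] : undefined),
    record
  );
}

// Keep-first dedupe that remembers only a SHA-1 hex digest per unique record.
// Sketch only; jsonlkit's internals may differ.
function dedupeLines(lines, key) {
  const seen = new Set();
  const kept = [];
  for (const line of lines) {
    // Without a key the raw line text is hashed, so {"a":1} and {"a": 1}
    // count as different records in this sketch.
    const basis = key ? (JSON.stringify(getPath(JSON.parse(line), key)) ?? "") : line;
    const digest = createHash("sha1").update(basis).digest("hex");
    if (seen.has(digest)) continue;
    seen.add(digest);
    kept.push(line);
  }
  return kept;
}
```

Because only digests are retained, memory grows with the number of unique records rather than with their size.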
Split a text corpus for RAG
# one record per paragraph (default)
jsonlkit from txt article.txt > chunks.jsonl
# fixed 500-char chunks for embeddings
jsonlkit from txt corpus.txt --split chars --size 500
# one line, one record, with a numeric id field
jsonlkit from txt prompts.txt --split line --add-id
Streams the input — works on 10+ GB log files. Web version: /txt-to-jsonl.
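The three split modes above can be sketched in JavaScript as shown here. The output field names (text, id), the 1-based ids, and the trimming rules are assumptions for illustration; check jsonlkit from txt --help for the real output shape. This version works on an in-memory string, where the CLI streams.

```javascript
// Sketch of paragraph / chars / line splitting. Field names are assumptions.
function splitText(text, { split = "paragraph", size = 500, addId = false } = {}) {
  let pieces;
  if (split === "chars") {
    pieces = [];
    // Note: slices by UTF-16 code unit, so a surrogate pair could be split.
    for (let i = 0; i < text.length; i += size) pieces.push(text.slice(i, i + size));
  } else if (split === "line") {
    pieces = text.split(/\r?\n/);
  } else {
    pieces = text.split(/\r?\n\s*\r?\n/); // one or more blank lines end a paragraph
  }
  return pieces
    .map((piece) => (split === "chars" ? piece : piece.trim()))
    .filter((piece) => piece.length > 0)
    .map((piece, i) => JSON.stringify(addId ? { id: i + 1, text: piece } : { text: piece }));
}
```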
Stats on a huge file
jsonlkit stats huge.jsonl
records: 1,200,000
valid: 1,199,987
invalid: 13
bytes: 4.2 GB
unique keys: 12
max depth: 4
top keys (coverage · types):
id 100.0% number×1199987
messages 100.0% array×1199987
metadata 87.4% object×1048217
Streams the file — ~50 MB of RAM regardless of file size. Add --json for machine-readable output, --progress for live records/sec on stderr. Web version: /stats.
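Constant memory comes from keeping running totals instead of records. The sketch below shows one way to do that in Node with readline. The class and function names, and the definition of depth (a flat object counts as depth 1), are assumptions rather than jsonlkit's implementation.

```javascript
const fs = require("node:fs");
const readline = require("node:readline");

// Running totals only: memory grows with the number of distinct keys,
// not with the number of records. A sketch, not jsonlkit's code.
class StatsAccumulator {
  constructor() {
    this.records = 0;
    this.valid = 0;
    this.invalid = 0;
    this.maxDepth = 0;
    this.keys = new Map(); // key -> Map(type -> count)
  }

  add(line) {
    if (line.trim() === "") return; // ignore blank lines
    this.records++;
    let value;
    try {
      value = JSON.parse(line);
    } catch {
      this.invalid++;
      return;
    }
    this.valid++;
    this.maxDepth = Math.max(this.maxDepth, depthOf(value));
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      for (const [key, v] of Object.entries(value)) {
        const types = this.keys.get(key) ?? new Map();
        const type = v === null ? "null" : Array.isArray(v) ? "array" : typeof v;
        types.set(type, (types.get(type) ?? 0) + 1);
        this.keys.set(key, types);
      }
    }
  }
}

// Scalars have depth 0; each object or array level adds 1.
function depthOf(value) {
  if (value === null || typeof value !== "object") return 0;
  return 1 + Math.max(0, ...Object.values(value).map(depthOf));
}

// readline yields one line at a time, so the file is never fully loaded.
async function statsForFile(path) {
  const acc = new StatsAccumulator();
  const rl = readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity });
  for await (const line of rl) acc.add(line);
  return acc;
}
```

Coverage for a key is then its total count divided by acc.valid.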
JSONL → CSV for spreadsheet review
jsonlkit to csv data.jsonl > review.csv
Flattens nested keys with dots (user.name etc.). Discovers the header from the first record. Web version: /jsonl-to-csv.
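Flattening with dots and taking the header from the first record can be sketched like this. The helper names, the choice to serialise arrays as JSON text, and the CRLF line endings are assumptions for illustration, not a description of jsonlkit's exact output.

```javascript
// Sketch: flatten nested objects into dot-separated column names, then emit
// CSV with the header taken from the first record. Not jsonlkit's actual code.
function flatten(value, prefix = "", out = {}) {
  for (const [key, v] of Object.entries(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (v !== null && typeof v === "object" && !Array.isArray(v)) flatten(v, path, out);
    else out[path] = v;
  }
  return out;
}

function csvCell(v) {
  if (v === undefined || v === null) return "";
  const text = typeof v === "object" ? JSON.stringify(v) : String(v);
  // RFC 4180: quote cells containing commas, quotes, or line breaks,
  // and double any inner quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(records) {
  if (records.length === 0) return "";
  const rows = records.map((record) => flatten(record));
  const header = Object.keys(rows[0]);
  const lines = [header, ...rows.map((row) => header.map((name) => row[name]))];
  return lines.map((cells) => cells.map(csvCell).join(",")).join("\r\n") + "\r\n";
}
```

In this sketch, a key that first appears in a later record gets no column, which is the trade-off of fixing the header from the first record so the output can stream.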
The headline pipeline
cat raw.csv \
| jsonlkit from csv \
| jsonlkit dedupe --key prompt \
| jsonlkit validate --openai --quiet \
> clean.jsonl
CSV in → validated, deduped fine-tune JSONL out. Streams end-to-end.
All commands
| Command | What it does | Web equivalent |
|---|---|---|
| validate | Schema-check JSONL line by line. 7 fine-tune formats + JSON Schema | openai-fine-tune-validator et al. |
| fix | Auto-repair common breakages | fix-jsonl |
| format / minify | Pretty-print or minify each record | formatter / jsonl-minifier |
| count | Fast record count | none |
| stats | Records, bytes, key coverage, types, depth | stats |
| head / tail / sample | Slice. tail seeks from end | viewer / jsonl-sampler |
| dedupe | By full record or by --key | deduplicator |
| from csv\|json\|txt | CSV, JSON, or text → JSONL on stdout | csv / json / txt |
| to csv\|json | JSONL → CSV or JSON on stdout | to csv / to json |
Run jsonlkit <command> --help for full flag listings.
Streaming & memory
Most commands stream line-by-line and write with backpressure awareness, so a 10 GB file uses ~50 MB of RAM. The table below lists the exceptions.
| Command | Streams? | Note |
|---|---|---|
| validate, count, stats, format, fix, head, sample, from csv, from txt, to csv, to json | ✅ | O(1) records buffered |
| dedupe --keep first | ✅ | 40-byte sha1 per unique record |
| tail | ✅ | Seeks from end; constant time on real files |
| dedupe --keep last | ⚠️ | Buffers every unique record |
| from json | ❌ | JSON arrays need the whole file; use jq -c '.[]' input.json for huge arrays |
| from txt --split whole\|regex | ❌ | Needs the whole input |
Add --progress to validate / count / stats / dedupe for live records/sec on stderr.
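Backpressure-aware writing, as mentioned at the top of this section, usually means pausing input when the output buffer is full. A minimal Node sketch of that pattern follows; the function name pipeLines and the transform callback are hypothetical, and this is not taken from jsonlkit's source.

```javascript
const { once } = require("node:events");
const readline = require("node:readline");

// Line-by-line filter that respects backpressure: when write() returns false
// the output buffer is full, so wait for "drain" before reading more input.
async function pipeLines(input, output, transform) {
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  for await (const line of rl) {
    const result = transform(line);
    if (result === null) continue; // the transform may drop a line
    if (!output.write(result + "\n")) await once(output, "drain");
  }
}
```

Called as pipeLines(process.stdin, process.stdout, (line) => line), this behaves like a pass-through stage in a shell pipeline whose memory use stays flat when the downstream consumer is slow.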
Tested under load
The repo ships a load test: JSONLKIT_HUGE=1 node --test test/huge.test.js generates 200 MB fixtures and runs each streaming command under a 192 MB Node heap. Timings on an M-series Mac:
validate 988 ms
stats 537 ms
dedupe 3.5 s
from csv 5.9 s
tail -n 5 46 ms (seek from end)
head -n 5 131 ms (early exit)
FAQ
Browser tool or CLI — which should I use?
Browser for one-off work up to ~1 GB and when you want a visual table. CLI when you need to wire JSONL into a pipeline, when files are bigger than RAM, or when you want exit codes for CI. The validation logic is identical — same checks, same error names.
Is my data ever sent anywhere?
No. The CLI has no network code. Same privacy promise as the website. MIT licensed, source is on GitHub.
Does it work on Windows?
Yes — Node 18+, plain JS, no native deps. PowerShell and Git Bash both work.
How big a file can it actually handle?
For the streaming commands, the limit is disk read speed, not memory. We've tested up to 200 MB in CI, and it works on multi-GB files in practice. The non-streaming commands (from json, from txt --split whole or regex, dedupe --keep last) need the whole input in memory.
What about JSONL queries (jq, SQL)?
Planned for v0.2: jsonlkit jq '<expr>', jsonlkit sql '<query>' via DuckDB-WASM, plus filter / sort / flatten / schema infer. The web tools already support jq and SQL today.
Source & license
github.com/tinytoolkit-org/jsonlkit-cli — MIT. Bug reports and PRs welcome.