jsonlkit CLI

updated 17 May 2026 · github · MIT · Node 18+

Everything jsonlkit.com does, on the command line. Validate, fix, convert, dedupe, and transform JSONL with one tool. Streams line-by-line — handles files larger than RAM. No telemetry, no network calls.

Same logic, on big data. The browser tools are great for one-off work up to ~1 GB. The CLI is what you reach for when your training file is 50 GB or you want to wire JSONL processing into a shell pipeline.

Install

# from GitHub (current — until npm publish)
npm install -g github:tinytoolkit-org/jsonlkit-cli

# one-off, no install
npx github:tinytoolkit-org/jsonlkit-cli --help

# from source
git clone https://github.com/tinytoolkit-org/jsonlkit-cli
cd jsonlkit-cli && npm test && npm link

Requires Node 18+. Single binary, zero runtime dependencies.

Recipes — copy and paste

Validate an OpenAI fine-tune file

jsonlkit validate --openai training.jsonl

Catches every error the OpenAI cookbook checks: bad roles, missing assistant turns, wrong tool_calls shape, extra keys, examples over 16,385 tokens. Exit code 1 if anything's off — wires into CI.
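To make the idea concrete, here is a minimal Python sketch of two of the checks named above (bad roles, missing assistant turns) and of the exit-code convention. It is not jsonlkit's code: the function names and the allowed-role set are assumptions for illustration, and the real validator checks far more.

```python
import json

# Assumed role set for illustration; the real validator's list may differ.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def check_record(line: str) -> list[str]:
    """Return the problems found in one chat-format JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: bad role {role!r}")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant turn")
    return problems

def exit_code(lines) -> int:
    """1 if any non-blank line has a problem, else 0 (the CI-friendly convention)."""
    return 1 if any(check_record(line) for line in lines if line.strip()) else 0
```

A CI step only needs the exit status: zero means the file passed, non-zero fails the build.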

Same flag-per-provider as the web tool:

jsonlkit validate --anthropic data.jsonl   # Claude on Bedrock
jsonlkit validate --gemini    data.jsonl   # Vertex AI
jsonlkit validate --llama     data.jsonl   # ChatML
jsonlkit validate --sharegpt  data.jsonl   # conversations + from/value
jsonlkit validate --alpaca    data.jsonl   # instruction/input/output
jsonlkit validate --mistral   data.jsonl   # with the 9-char tool_call id rule

The web versions for reference: OpenAI · Anthropic · Gemini · Llama · Mistral.

Spreadsheet → fine-tune JSONL

jsonlkit from csv dataset.csv > training.jsonl

Auto-types numbers and booleans. Use --nest dot to expand user.name-style columns into nested objects. Streaming RFC 4180 parser — works on huge CSVs. Web version: /csv-to-jsonl.
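For a feel of what auto-typing and dot nesting do to a row, here is a small Python sketch. It is an illustration under assumptions, not jsonlkit's parser: the typing rules (plain integers, simple decimals, lowercase true/false) and the helper names are made up for the example.

```python
import csv
import io
import json
import re

INT_RE = re.compile(r"-?\d+")
FLOAT_RE = re.compile(r"-?\d+\.\d+")

def auto_type(value):
    """Turn a CSV cell into a bool or number when it clearly looks like one."""
    if not isinstance(value, str):
        return value  # DictReader uses None for missing cells
    if value in ("true", "false"):
        return value == "true"
    if INT_RE.fullmatch(value):
        return int(value)
    if FLOAT_RE.fullmatch(value):
        return float(value)
    return value

def nest(flat: dict) -> dict:
    """Expand 'user.name'-style keys into nested objects.

    Assumes no column is both a leaf and a parent (e.g. 'user' and 'user.name').
    """
    out = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return out

def csv_to_jsonl(text: str, nested: bool = False):
    """Yield one JSON line per CSV row, reading rows one at a time."""
    for row in csv.DictReader(io.StringIO(text)):
        typed = {k: auto_type(v) for k, v in row.items()}
        yield json.dumps(nest(typed) if nested else typed)
```

With nesting on, a row `Ada,36,true` under the header `user.name,age,active` becomes `{"user": {"name": "Ada"}, "age": 36, "active": true}`.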

Repair a broken JSONL

jsonlkit fix broken.jsonl > clean.jsonl

Fixes trailing commas, single quotes, smart quotes, BOMs, NaN, // comments, unquoted keys, and the SSE data: prefix. Drops lines it can't repair and reports them on stderr. Web version: /fix-jsonl.
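The repair loop can be pictured as "try to parse, apply cheap fixes, try again, otherwise drop and report". The Python sketch below covers only four of the breakages listed above (BOM, SSE data: prefix, trailing commas, smart double quotes). It is an assumption-laden illustration, not jsonlkit's implementation: the regex can also touch commas inside string values, which a real repairer would avoid by tokenizing.

```python
import json
import re

TRAILING_COMMA = re.compile(r",\s*([}\]])")
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})

def repair_line(line: str):
    """Return the parsed record, or None if the line can't be repaired."""
    text = line.lstrip("\ufeff").strip()      # drop a leading BOM
    if text.startswith("data:"):              # drop the SSE prefix
        text = text[len("data:"):].strip()
    repaired = TRAILING_COMMA.sub(r"\1", text.translate(SMART_QUOTES))
    for candidate in (text, repaired):        # try the untouched text first
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

def fix(lines):
    """Return (clean JSON lines, 1-based numbers of dropped lines)."""
    clean, dropped = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = repair_line(line)
        if record is None:
            dropped.append(number)
        else:
            clean.append(json.dumps(record))
    return clean, dropped
```

Parsing the untouched text first means valid lines are never altered by the repair rules.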

Peek at the last records of a multi-GB log

jsonlkit tail -n 10 server.log.jsonl

Seeks from the end of the file — runs in ~50 ms on a 50 GB log instead of parsing the whole thing. Same -n flag for head and sample. Web version: /viewer.
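Seeking from the end is why the runtime does not grow with file size: only the last few blocks are ever read. A minimal Python sketch of the technique follows; the function name and block size are assumptions, and jsonlkit's own implementation may differ.

```python
import os

def tail_lines(path: str, n: int, block_size: int = 64 * 1024) -> list[bytes]:
    """Return the last n lines of a file by reading blocks backwards from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep reading earlier blocks until we hold more than n newlines,
        # which guarantees the first of the n lines is complete.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    return data.splitlines()[-n:]
```

This only works on seekable files. Input arriving on a pipe has no end to seek to, so it would have to be read through.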

Dedupe a fine-tune corpus

# by full record
jsonlkit dedupe data.jsonl > deduped.jsonl

# by a single key (supports dot.notation for nested)
jsonlkit dedupe --key user.email data.jsonl

Stores one 40-byte sha1 digest per unique record in memory, so it handles tens of millions of unique records. Web version: /deduplicator.
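Keeping a digest instead of the record itself is what keeps memory flat. Here is a Python sketch of keep-first dedupe under stated assumptions: a SHA-1 digest is 20 raw bytes or 40 hex characters, so the 40-byte figure suggests hex, which is what this sketch stores. Full-record mode here compares raw line text; whether jsonlkit normalizes key order or whitespace is not stated.

```python
import hashlib
import json

def dedupe(lines, key=None):
    """Yield the first line seen for each distinct record (or value at `key`)."""
    seen = set()  # one 40-character hex digest per unique record
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if key is None:
            identity = line
        else:
            value = json.loads(line)
            for part in key.split("."):  # walk dot.notation into nested objects
                value = value.get(part) if isinstance(value, dict) else None
            identity = json.dumps(value, sort_keys=True)
        digest = hashlib.sha1(identity.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

Because a record is emitted the moment it is first seen, output streams while the input is still being read.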

Split a text corpus for RAG

# one record per paragraph (default)
jsonlkit from txt article.txt > chunks.jsonl

# fixed 500-char chunks for embeddings
jsonlkit from txt corpus.txt --split chars --size 500

# one line, one record, with a numeric id field
jsonlkit from txt prompts.txt --split line --add-id

Streams the input — works on 10+ GB log files. Web version: /txt-to-jsonl.
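The split modes are easy to picture in a few lines of Python. This is a rough sketch only: the output field names ("text", "id"), the id starting at 1, and the blank-line paragraph rule are all assumptions, and unlike the CLI it holds the whole text in memory.

```python
import json
import re

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_chars(text: str, size: int) -> list[str]:
    """Cut the text into fixed-size chunks; the last one may be shorter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def to_jsonl(chunks, add_id: bool = False):
    """Yield one JSON line per chunk, optionally with a numeric id."""
    for number, chunk in enumerate(chunks, start=1):
        record = {"id": number, "text": chunk} if add_id else {"text": chunk}
        yield json.dumps(record, ensure_ascii=False)
```

Fixed-size character chunks can cut words in half, which is usually fine for embeddings but worth knowing before you review the output by eye.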

Stats on a huge file

jsonlkit stats huge.jsonl

records:      1,200,000
  valid:      1,199,987
  invalid:    13
bytes:        4.2 GB
unique keys:  12
max depth:    4

top keys (coverage · types):
  id           100.0%   number×1199987
  messages     100.0%   array×1199987
  metadata     87.4%    object×1048217

Streams the file — ~50 MB of RAM regardless of file size. Add --json for machine-readable output, --progress for live records/sec on stderr. Web version: /stats.
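Memory stays flat because the only state kept is a handful of counters per key, never the records themselves. A Python sketch of that bookkeeping is below. The names, the depth definition (each object or array adds one level), and the choice to count non-object lines as invalid are assumptions, not jsonlkit's documented behavior.

```python
import json
from collections import Counter, defaultdict

def type_name(value) -> str:
    """Map a parsed value to its JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"

def depth(value) -> int:
    """Nesting depth: scalars are 0, each object or array adds 1."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0

def stats(lines) -> dict:
    """One pass over the lines, keeping only counters."""
    valid = invalid = max_depth = 0
    types = defaultdict(Counter)
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            invalid += 1
            continue
        valid += 1
        max_depth = max(max_depth, depth(record))
        for key, value in record.items():
            types[key][type_name(value)] += 1
    coverage = {k: round(100 * sum(c.values()) / valid, 1) for k, c in types.items()}
    return {"valid": valid, "invalid": invalid, "max_depth": max_depth,
            "coverage": coverage, "types": {k: dict(c) for k, c in types.items()}}
```

Coverage is the share of valid records that contain the key, which is how a key can sit at 87.4% while others are at 100%.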

JSONL → CSV for spreadsheet review

jsonlkit to csv data.jsonl > review.csv

Flattens nested keys with dots (user.name etc.). Discovers the header from the first record. Web version: /jsonl-to-csv.
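Flattening plus first-record header discovery fits in a short Python sketch. Treat it as an illustration under assumptions: the helper names are invented, arrays are written as JSON text in a single cell, and keys that first appear in later records are dropped here. How jsonlkit handles those two cases is not stated above.

```python
import csv
import io
import json

def flatten(value: dict, prefix: str = "") -> dict:
    """Turn {"user": {"name": "Ada"}} into {"user.name": "Ada"}."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten(item, name + "."))
        elif isinstance(item, list):
            flat[name] = json.dumps(item)  # assumption: arrays stay as JSON text
        else:
            flat[name] = item
    return flat

def jsonl_to_csv(lines) -> str:
    """Write CSV whose header comes from the first record's flattened keys."""
    out = io.StringIO()
    writer = None
    for line in lines:
        if not line.strip():
            continue
        row = flatten(json.loads(line))
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(row),
                                    extrasaction="ignore", lineterminator="\n")
            writer.writeheader()
        writer.writerow(row)  # missing keys become empty cells
    return out.getvalue()
```

Fixing the header from the first record is what lets the conversion stream, since no second pass over the file is needed. If your records have uneven keys, put a complete record first.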

The headline pipeline

cat raw.csv \
  | jsonlkit from csv \
  | jsonlkit dedupe --key prompt \
  | jsonlkit validate --openai --quiet \
  > clean.jsonl

CSV in → validated, deduped fine-tune JSONL out. Streams end-to-end.

All commands

Command	What it does	Web equivalent
`validate`	Schema-check JSONL line by line. 7 fine-tune formats + JSON Schema	openai-fine-tune-validator et al.
`fix`	Auto-repair common breakages	fix-jsonl
`format` / `minify`	Pretty-print or minify each record	formatter / jsonl-minifier
`count`	Fast record count	—
`stats`	Records, bytes, key coverage, types, depth	stats
`head` / `tail` / `sample`	Slice. `tail` seeks from end	viewer / jsonl-sampler
`dedupe`	By full record or by `--key`	deduplicator
`from csv|json|txt`	CSV, JSON, or text → JSONL on stdout	csv / json / txt
`to csv|json`	JSONL → CSV or JSON on stdout	to csv / to json

Run jsonlkit <command> --help for full flag listings.

Streaming & memory

Most commands stream line-by-line and write with backpressure awareness, so a 10 GB file uses ~50 MB of RAM. The table lists the exceptions.

Command	Streams?	Note
`validate`, `count`, `stats`, `format`, `fix`, `head`, `sample`, `from csv`, `from txt`, `to csv`, `to json`	✅	O(1) records buffered
`dedupe --keep first`	✅	40-byte sha1 per unique record
`tail`	✅	Seeks from the end; constant time on regular files
`dedupe --keep last`	⚠️	Buffers every unique record
`from json`	❌	JSON arrays need the whole file — use `jq -c '.[]' input.json` for huge arrays
`from txt --split whole|regex`	❌	Needs the whole input

Add --progress to validate / count / stats / dedupe for live records/sec on stderr.

Tested under load

The repo ships a load test, run with JSONLKIT_HUGE=1 node --test test/huge.test.js, which generates 200 MB fixtures and runs each streaming command under a 192 MB Node heap. Timings on an M-series Mac:

validate        988 ms
stats           537 ms
dedupe          3.5 s
from csv        5.9 s
tail  -n 5      46 ms     (seek from end)
head  -n 5      131 ms    (early exit)

FAQ

Browser tool or CLI — which should I use?

Browser for one-off work up to ~1 GB and when you want a visual table. CLI when you need to wire JSONL into a pipeline, when files are bigger than RAM, or when you want exit codes for CI. The validation logic is identical — same checks, same error names.

Is my data ever sent anywhere?

No. The CLI has no network code. Same privacy promise as the website. MIT licensed, source is on GitHub.

Does it work on Windows?

Yes — Node 18+, plain JS, no native deps. PowerShell and Git Bash both work.

How big a file can it actually handle?

For the streaming commands, the limit is disk read speed, not memory. CI runs them against 200 MB fixtures, and they work on multi-GB files in practice. The non-streaming commands (from json, from txt --split whole or regex, dedupe --keep last) need the whole input in memory.

What about JSONL queries (jq, SQL)?

Planned for v0.2: jsonlkit jq '<expr>' and jsonlkit sql '<query>' (via DuckDB-WASM), plus filter / sort / flatten / schema infer. The web tools already offer jq and SQL.

Source & license

github.com/tinytoolkit-org/jsonlkit-cli — MIT. Bug reports and PRs welcome.