jsonlkit.com
JSONL (JSON Lines) utilities, in the browser

jsonlkit CLI

updated 17 May 2026 · github · MIT · Node 18+

Everything jsonlkit.com does, on the command line. Validate, fix, convert, dedupe, and transform JSONL with one tool. Streams line-by-line — handles files larger than RAM. No telemetry, no network calls.

Same logic, on big data. The browser tools are great for one-off work up to ~1 GB. The CLI is what you reach for when your training file is 50 GB or you want to wire JSONL processing into a shell pipeline.

Install

# from GitHub (current — until npm publish)
npm install -g github:tinytoolkit-org/jsonlkit-cli

# one-off, no install
npx github:tinytoolkit-org/jsonlkit-cli --help

# from source
git clone https://github.com/tinytoolkit-org/jsonlkit-cli
cd jsonlkit-cli && npm test && npm link

Requires Node 18+. Single binary, zero runtime dependencies.

Recipes — copy and paste

Validate an OpenAI fine-tune file

jsonlkit validate --openai training.jsonl

Catches every error the OpenAI cookbook checks: bad roles, missing assistant turns, wrong tool_calls shape, extra keys, and examples over 16,385 tokens. Exits with code 1 if any check fails, so it can gate a CI job.
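To make the idea of a per-line check concrete, here is a minimal Python sketch of two of the checks named above (bad roles, missing assistant turn). It is not jsonlkit's implementation: the function name `check_chat_record`, the error strings, and the exact set of allowed roles are assumptions for illustration.

```python
import json

# Roles accepted in a chat fine-tune record (an assumption for this sketch).
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

def check_chat_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: bad role {role!r}")
    # A training example with no assistant turn has nothing to learn from.
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant turn")
    return problems
```

A wrapper script could run this over every line and call `sys.exit(1)` when any list is non-empty, which mirrors the exit-code behaviour described above.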

Same flag-per-provider as the web tool:

jsonlkit validate --anthropic data.jsonl   # Claude on Bedrock
jsonlkit validate --gemini    data.jsonl   # Vertex AI
jsonlkit validate --llama     data.jsonl   # ChatML
jsonlkit validate --sharegpt  data.jsonl   # conversations + from/value
jsonlkit validate --alpaca    data.jsonl   # instruction/input/output
jsonlkit validate --mistral   data.jsonl   # with the 9-char tool_call id rule

The web versions for reference: OpenAI · Anthropic · Gemini · Llama · Mistral.

Spreadsheet → fine-tune JSONL

jsonlkit from csv dataset.csv > training.jsonl

Auto-types numbers and booleans. Use --nest dot to expand user.name-style columns into nested objects. Streaming RFC 4180 parser — works on huge CSVs. Web version: /csv-to-jsonl.
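The two transformations described here, auto-typing and dot-nesting, can be sketched in a few lines of Python. This is an illustration under assumptions, not the CLI's parser: the helper names are hypothetical, only lowercase `true`/`false` are treated as booleans, and the input is assumed to be a well-formed rectangular CSV with no column that is both a scalar and a dotted prefix.

```python
import csv
import io
import json
import re

# Plain JSON-style numbers only, so values like "nan" or "1_000" stay strings.
NUMBER = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")

def auto_type(cell):
    """Turn a CSV cell into a bool, int, or float when it looks like one."""
    if not isinstance(cell, str):
        return cell
    if cell in ("true", "false"):
        return cell == "true"
    if NUMBER.fullmatch(cell):
        return float(cell) if any(c in cell for c in ".eE") else int(cell)
    return cell

def nest_dotted(flat: dict) -> dict:
    """Expand keys like 'user.name' into {'user': {'name': ...}}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested

def csv_to_jsonl(text: str):
    """Yield one JSON line per CSV row; the first row is the header."""
    for row in csv.DictReader(io.StringIO(text)):
        typed = {k: auto_type(v) for k, v in row.items()}
        yield json.dumps(nest_dotted(typed))
```

For example, the row `Ada,36,true` under the header `user.name,user.age,active` becomes `{"user": {"name": "Ada", "age": 36}, "active": true}`.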

Repair a broken JSONL

jsonlkit fix broken.jsonl > clean.jsonl

Fixes trailing commas, single quotes, smart quotes, BOMs, NaN, // comments, unquoted keys, and the SSE data: prefix. Drops lines it can't repair and reports them on stderr. Web version: /fix-jsonl.
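As a rough illustration of the repair-or-drop approach, the Python sketch below handles three of the listed breakages (BOM, the SSE data: prefix, trailing commas). It is not the CLI's repair logic; `repair_line` is a hypothetical name, and the trailing-comma regex is deliberately naive. To limit damage, the regex is only tried after the line has already failed to parse as-is.

```python
import json
import re

# Naive: matches a comma directly before a closing brace or bracket.
TRAILING_COMMA = re.compile(r",\s*([}\]])")

def repair_line(line: str):
    """Return a re-serialized JSON line, or None if it cannot be repaired."""
    candidate = line.strip().lstrip("\ufeff")        # drop a leading BOM
    if candidate.startswith("data:"):                # drop the SSE prefix
        candidate = candidate[len("data:"):].strip()
    # Try the line unchanged first, then with trailing commas removed.
    for attempt in (candidate, TRAILING_COMMA.sub(r"\1", candidate)):
        try:
            return json.dumps(json.loads(attempt))
        except json.JSONDecodeError:
            continue
    return None
```

A caller would write non-None results to stdout and report the line numbers of None results on stderr, matching the behaviour described above.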

Peek at the last records of a multi-GB log

jsonlkit tail -n 10 server.log.jsonl

Seeks from the end of the file — runs in ~50 ms on a 50 GB log instead of parsing the whole thing. Same -n flag for head and sample. Web version: /viewer.
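Seeking from the end is what makes tail cheap: only the final few blocks of the file are ever read. A minimal Python sketch of the technique follows; `tail_lines` and `block_size` are hypothetical names, and this is not the CLI's code.

```python
import os

def tail_lines(path: str, n: int, block_size: int = 64 * 1024) -> list[bytes]:
    """Return the last n lines of a file by reading blocks backwards from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep prepending blocks until more than n newlines have been seen,
        # which guarantees the last n lines are complete.
        while position > 0 and data.count(b"\n") <= n:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            data = f.read(read_size) + data
    return data.splitlines()[-n:]
```

The work done depends on the size of the last n lines, not on the size of the file. It only applies to seekable files; input arriving on a pipe has to be read in full.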

Dedupe a fine-tune corpus

# by full record
jsonlkit dedupe data.jsonl > deduped.jsonl

# by a single key (supports dot.notation for nested)
jsonlkit dedupe --key user.email data.jsonl

Keeps only a SHA-1 digest (40 hex characters) per unique record in memory, not the record itself, so it handles tens of millions of unique records. Web version: /deduplicator.
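The memory saving comes from remembering a fixed-size hash instead of each record. The Python sketch below shows first-occurrence dedupe with SHA-1 digests and an optional dotted key. It is an illustration only: the function name is hypothetical, and canonicalizing with sorted keys (so that key order does not matter) is an assumption, not a documented behaviour of the CLI.

```python
import hashlib
import json

def dedupe(lines, key=None):
    """Yield each line whose record (or dotted-key value) has not been seen yet."""
    seen = set()
    for line in lines:
        target = json.loads(line)
        if key is not None:
            # Walk 'user.email' style paths; a missing key raises KeyError here.
            for part in key.split("."):
                target = target[part]
        # Canonical serialization so key order does not affect the hash (assumption).
        canonical = json.dumps(target, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha1(canonical.encode("utf-8")).digest()  # 20 raw bytes
        if digest not in seen:
            seen.add(digest)
            yield line
```

Because the first occurrence is emitted immediately, the output streams. Keeping the last occurrence instead would mean holding records back until the end of the input.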

Split a text corpus for RAG

# one record per paragraph (default)
jsonlkit from txt article.txt > chunks.jsonl

# fixed 500-char chunks for embeddings
jsonlkit from txt corpus.txt --split chars --size 500

# one line, one record, with a numeric id field
jsonlkit from txt prompts.txt --split line --add-id

Streams the input — works on 10+ GB log files. Web version: /txt-to-jsonl.
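For readers who want to see what the split modes amount to, here is a small Python sketch of paragraph and fixed-size splitting. The helper names and the output field names `text` and `id` are assumptions (the CLI's actual field names may differ), and unlike the CLI this version holds the whole text in memory.

```python
import json
import re

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_chars(text: str, size: int) -> list[str]:
    """Cut the text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def to_jsonl(chunks, add_id: bool = False):
    """Yield one JSON line per chunk; field names are hypothetical."""
    for i, chunk in enumerate(chunks):
        record = {"id": i, "text": chunk} if add_id else {"text": chunk}
        yield json.dumps(record, ensure_ascii=False)
```

Fixed-size chunks can cut through the middle of a word or sentence, so paragraph splitting is usually the gentler default for retrieval.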

Stats on a huge file

jsonlkit stats huge.jsonl
records:      1,200,000
  valid:      1,199,987
  invalid:    13
bytes:        4.2 GB
unique keys:  12
max depth:    4

top keys (coverage · types):
  id           100.0%   number×1199987
  messages     100.0%   array×1199987
  metadata     87.4%    object×1048217

Streams the file — ~50 MB of RAM regardless of file size. Add --json for machine-readable output, --progress for live records/sec on stderr. Web version: /stats.
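Constant memory is possible because the report only needs running counters, never the records themselves. A Python sketch of that idea, with hypothetical names and output shape (this is not the CLI's --json format), and with coverage computed against valid records as an assumption:

```python
import json
from collections import Counter, defaultdict

def json_type(value) -> str:
    """Map a parsed value to its JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # check bool before int: True is an int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def stream_stats(lines) -> dict:
    """Single pass over the lines; memory grows with distinct keys, not records."""
    valid = invalid = 0
    key_counts = Counter()
    key_types = defaultdict(Counter)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        valid += 1
        if isinstance(record, dict):
            for key, value in record.items():
                key_counts[key] += 1
                key_types[key][json_type(value)] += 1
    return {
        "records": valid + invalid,
        "valid": valid,
        "invalid": invalid,
        "coverage": {k: round(100 * c / valid, 1) for k, c in key_counts.items()},
        "types": {k: dict(c) for k, c in key_types.items()},
    }
```

Passing an open file object as `lines` keeps only one line in memory at a time.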

JSONL → CSV for spreadsheet review

jsonlkit to csv data.jsonl > review.csv

Flattens nested keys with dots (user.name etc.). Discovers the header from the first record. Web version: /jsonl-to-csv.
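A short Python sketch of dot-flattening with a header taken from the first record. Names are hypothetical, and two behaviours are assumptions rather than documented CLI behaviour: arrays are written as JSON strings, and keys that first appear in later records are dropped because they are not in the header.

```python
import csv
import io
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Turn {'user': {'name': 'Ada'}} into {'user.name': 'Ada'}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)   # assumption: arrays kept as JSON text
        else:
            flat[path] = value
    return flat

def jsonl_to_csv(lines) -> str:
    """Convert JSONL to CSV; the header comes from the first record only."""
    out = io.StringIO()
    writer = None
    for line in lines:
        row = flatten(json.loads(line))
        if writer is None:
            # Keys absent from the first record are ignored in later rows.
            writer = csv.DictWriter(out, fieldnames=list(row), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(row)
    return out.getvalue()
```

Taking the header from the first record is what allows a single streaming pass; a header covering every key would need a second pass over the file.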

The headline pipeline

cat raw.csv \
  | jsonlkit from csv \
  | jsonlkit dedupe --key prompt \
  | jsonlkit validate --openai --quiet \
  > clean.jsonl

CSV in → validated, deduped fine-tune JSONL out. Streams end-to-end.

All commands

Command               What it does                                                         Web equivalent
validate              Schema-check JSONL line by line. 7 fine-tune formats + JSON Schema   openai-fine-tune-validator et al.
fix                   Auto-repair common breakages                                         fix-jsonl
format / minify       Pretty-print or minify each record                                   formatter / jsonl-minifier
count                 Fast record count                                                    -
stats                 Records, bytes, key coverage, types, depth                           stats
head / tail / sample  Slice. tail seeks from end                                           viewer / jsonl-sampler
dedupe                By full record or by --key                                           deduplicator
from csv|json|txt     → JSONL on stdout                                                    csv / json / txt
to csv|json           ← JSONL on stdout                                                    to csv / to json

Run jsonlkit <command> --help for full flag listings.

Streaming & memory

Most commands stream line by line and write with backpressure awareness, so a 10 GB file uses ~50 MB of RAM. The exceptions are listed in the table below.

Command                                                                                  Streams?  Note
validate, count, stats, format, fix, head, sample, from csv, from txt, to csv, to json   yes       O(1) records buffered
dedupe --keep first                                                                      yes       One SHA-1 digest (40 hex characters) per unique record
tail                                                                                     yes       Seeks from end; constant time on real files
dedupe --keep last                                                                       ⚠️        Buffers every unique record
from json                                                                                no        JSON arrays need the whole file; use jq -c '.[]' input.json for huge arrays
from txt --split whole|regex                                                             no        Needs the whole input

Add --progress to validate / count / stats / dedupe for live records/sec on stderr.

Tested under load

The repo ships a load test, run with JSONLKIT_HUGE=1 node --test test/huge.test.js, which generates 200 MB fixtures and runs each streaming command under a 192 MB Node heap. Timings on an M-series Mac:

validate        988 ms
stats           537 ms
dedupe          3.5 s
from csv        5.9 s
tail  -n 5      46 ms     (seek from end)
head  -n 5      131 ms    (early exit)

FAQ

Browser tool or CLI — which should I use?

Browser for one-off work up to ~1 GB and when you want a visual table. CLI when you need to wire JSONL into a pipeline, when files are bigger than RAM, or when you want exit codes for CI. The validation logic is identical — same checks, same error names.

Is my data ever sent anywhere?

No. The CLI has no network code. Same privacy promise as the website. MIT licensed, source is on GitHub.

Does it work on Windows?

Yes — Node 18+, plain JS, no native deps. PowerShell and Git Bash both work.

How big a file can it actually handle?

For the streaming commands, the limit is disk read speed, not memory. We've tested up to 200 MB in CI, and it works on multi-GB files in practice. The non-streaming commands (from json, from txt --split whole|regex, dedupe --keep last) need the whole input in memory.

What about JSONL queries (jq, SQL)?

Planned for v0.2: jsonlkit jq '<expr>', jsonlkit sql '<query>' via DuckDB-WASM, plus filter / sort / flatten / schema infer. The web tools already offer jq and SQL today.

Source & license

github.com/tinytoolkit-org/jsonlkit-cli — MIT. Bug reports and PRs welcome.