JSONL Best Practices
JSONL is a simple format with a small number of recurring pitfalls. This page is the field guide for using it well — covering streaming, large files, compression choices, encoding traps, schema evolution, and the canonical library for each language.
1. Always stream, never slurp
The whole point of JSONL is that you can read one record at a time. Don't load the file into memory.
Python
# Good: streams line by line, constant memory
import json
with open('events.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        process(record)
# Bad: reads everything, blows up on a 50 GB file
with open('events.jsonl') as f:
    records = [json.loads(l) for l in f.readlines()]
Node.js
// Good: node:readline streams the file line by line
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
const rl = createInterface({
  input: createReadStream('events.jsonl', 'utf-8'),
  crlfDelay: Infinity,
});
for await (const line of rl) {
  if (!line) continue; // skip blank lines
  const record = JSON.parse(line);
  handleRecord(record); // your own handler; `process` is a Node global, so avoid that name
}
Go
// Good: bufio.Scanner streams; bump the buffer for long lines
f, err := os.Open("events.jsonl")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
s := bufio.NewScanner(f)
s.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // up to 16 MB lines
lineNum := 0
for s.Scan() {
	lineNum++
	var rec map[string]any
	if err := json.Unmarshal(s.Bytes(), &rec); err != nil {
		log.Printf("line %d: %v", lineNum, err)
		continue
	}
	process(rec)
}
if err := s.Err(); err != nil { // e.g. bufio.ErrTooLong for an oversized line
	log.Fatal(err)
}
Rust
use std::io::{BufRead, BufReader};
use std::fs::File;
let f = File::open("events.jsonl")?;
for line in BufReader::new(f).lines() {
    let line = line?;
    let rec: serde_json::Value = serde_json::from_str(&line)?;
    process(rec);
}
2. Compress for storage and transport
JSONL compresses extremely well because keys repeat every record. Typical ratios:
| Codec | Ratio | Decompress speed | When to use |
|---|---|---|---|
| gzip (level 6) | 5–10× | Medium | Universal compatibility, default everywhere |
| zstd (level 3) | 5–12× | Fast | Modern stacks (DuckDB, Pandas, ClickHouse); strongly recommended |
| zstd (level 19) | 8–20× | Fast | Archival; small files, cold storage |
| brotli | 6–15× | Medium | HTTP-static delivery (Cloudflare, browsers) |
| xz / LZMA | 10–25× | Slow | Cold archives where compute is cheap |
Recommendation: Use zstd for working files, gzip for cross-tool compatibility, xz only for long-term cold archives. Most tooling reads .jsonl.gz and .jsonl.zst natively without decompressing the whole file first.
Concatenation trick: Both gzip and zstd support multi-frame concatenation:
# Concatenate compressed JSONL files without decompressing
cat 2026-05-21-*.jsonl.gz > day.jsonl.gz
# Consumers see one continuous stream. Works for zstd too.
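A minimal Python sketch of the same idea, using only the standard library: it concatenates gzip parts byte for byte (what cat does) and then streams the result record by record. The function names iter_jsonl_gz and concat_files are illustrative, not from any library, and zstd is left out because reading it from Python would require a third-party package on most versions.

```python
import gzip
import json
import shutil
from typing import Any, Iterable, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-blank line of a gzip-compressed JSONL file."""
    # Text mode ("rt") decompresses incrementally, so memory use stays flat.
    # The gzip module reads multi-member files as one continuous stream.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def concat_files(parts: Iterable[str], dest: str) -> None:
    """Byte-level concatenation, equivalent to `cat part1 part2 > dest`."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # copies in chunks, no decompression
```

Each part must itself end with a newline before compression, otherwise the last record of one part and the first record of the next end up on the same line.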
3. Set the right MIME type for HTTP
| Header | Use case |
|---|---|
| Content-Type: application/x-ndjson | HTTP requests and responses carrying JSONL |
| Content-Type: application/x-ndjson; charset=utf-8 | Explicit, the most defensive choice |
| Transfer-Encoding: chunked | Streaming: flush after each record |
| Content-Encoding: gzip / br | Compressed-on-the-wire transport |
Never send JSONL with Content-Type: application/json — clients will assume it's a single document and fail at the second record.
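As a rough illustration of the headers above, here is a small WSGI application sketch that serves records as application/x-ndjson, one line per record. The names make_ndjson_app, encode_ndjson, and NDJSON_CONTENT_TYPE are made up for this example; whether the bytes actually go out with chunked transfer encoding depends on the WSGI server in front of it, which this sketch does not control.

```python
import json
from typing import Any, Callable, Iterable, Iterator

NDJSON_CONTENT_TYPE = "application/x-ndjson; charset=utf-8"


def encode_ndjson(records: Iterable[Any]) -> Iterator[bytes]:
    """Encode each record as one compact JSON line terminated by \\n."""
    for record in records:
        line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
        yield (line + "\n").encode("utf-8")


def make_ndjson_app(get_records: Callable[[], Iterable[Any]]):
    """Build a WSGI app that streams whatever get_records() returns."""

    def app(environ, start_response):
        # No Content-Length is set, so the server is free to stream the body.
        start_response("200 OK", [("Content-Type", NDJSON_CONTENT_TYPE)])
        return encode_ndjson(get_records())

    return app
```

Returning a generator instead of a joined byte string is the design choice that matters: the server can send each record as soon as it is produced.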
4. Handle malformed lines gracefully
Real-world JSONL files often have a few bad lines, especially when produced by ad-hoc loggers or by tools that crashed mid-write. The robust pattern is:
import json
def read_jsonl(path):
    errors = []
    with open(path, encoding='utf-8') as f:
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:  # blank line: skip silently
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((line_num, str(e), line[:200]))
    if errors:
        # decide: warn, log, or fail
        log_errors(errors)
Don't blow up the whole import on a single bad row. Collect errors with line numbers, then decide the policy explicitly (warn, log, or fail fast). If you've inherited a file with widespread corruption, run it through the auto-fixer first: it repairs trailing commas, single quotes, smart quotes, BOMs, comments, and the dozen other things that break naive parsers.
5. Encoding traps
- BOM at the start of the file. Tools that emit a UTF-8 BOM (Notepad, some Windows exports) leave the bytes EF BB BF at the very start. Many parsers treat the BOM as part of the first character of line 1. Strip on read; never emit on write.
- Mixed encodings. JSONL is UTF-8. If you receive a file claiming to be UTF-8 but containing Windows-1252 bytes, you'll see "garbage" characters like ’ where ' should be. Diagnose with file --mime-encoding events.jsonl and convert with iconv.
- Surrogate pairs. Emoji and other supplementary-plane characters use UTF-16 surrogate pairs in \uXXXX escapes ("\uD83D\uDE00" for 😀). Most parsers handle this correctly, but some hand-rolled ones don't.
- Invisible whitespace. Non-breaking space (U+00A0), zero-width joiner (U+200D), and friends are valid inside JSON strings but invisible in editors. If a key lookup mysteriously fails, copy the literal bytes and inspect them.
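Two of these traps are easy to handle in a few lines of Python. The sketch below uses the standard utf-8-sig codec to drop a leading BOM on read, and a helper that lists characters editors tend to hide. Both function names are illustrative, and the choice of Unicode categories (Cf, plus Zs other than the plain space) is an assumption about what counts as "invisible".

```python
import json
import unicodedata
from typing import Any, Iterator, List, Tuple


def iter_jsonl_bom_safe(path: str) -> Iterator[Any]:
    """Stream records, tolerating a UTF-8 BOM at the start of the file."""
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present.
    with open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def invisible_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, Unicode name) for characters that editors usually hide."""
    hits = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Cf: format characters such as the zero-width joiner.
        # Zs: space separators; the ordinary space is excluded.
        if category == "Cf" or (category == "Zs" and ch != " "):
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits
```

Running invisible_chars over the keys of a record whose lookup fails is a quick way to confirm or rule out this class of problem.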
6. Schema evolution
JSONL is schemaless by design, which is great for prototyping and brutal for production unless you have a strategy:
- Backward-compatible changes — always safe. Adding a new field; making a required field optional; adding allowed enum values.
- Backward-incompatible changes — never silent. Renaming a field; changing a type (string → int); removing a field; tightening enum values.
- Version with a schema_version field on every record. Consumers can branch on it. Cheap insurance.
- Validate at the boundary. When you receive a file, run it through the schema validator against an explicit JSON Schema. Catch drift before it propagates downstream.
- Infer first, then commit. The schema inferrer produces a draft from real data — review and tighten, then check it in. Re-run weekly to catch drift early.
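Branching on schema_version can be as small as a table of upgrade functions. The sketch below is hypothetical throughout: the rename from "user" to "user_id" is an invented example of a backward-incompatible change, and treating unversioned records as version 1 is an assumption you would need to confirm with the producer.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]
CURRENT_VERSION = 2


def _v1_to_v2(rec: Record) -> Record:
    # Hypothetical change: version 2 renamed "user" to "user_id".
    out = dict(rec)
    out["user_id"] = out.pop("user")
    out["schema_version"] = 2
    return out


# One upgrade step per old version; each step moves a record up by one version.
UPGRADES: Dict[int, Callable[[Record], Record]] = {1: _v1_to_v2}


def upgrade(rec: Record) -> Record:
    """Bring a record up to CURRENT_VERSION, or raise if that is not possible."""
    version = rec.get("schema_version", 1)  # assumption: unversioned means v1
    while version < CURRENT_VERSION:
        rec = UPGRADES[version](rec)
        version = rec["schema_version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    return rec
```

Consumers then work with one record shape only, and a record from a newer producer fails loudly instead of being misread.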
7. Field naming conventions
- Stick to one case style. snake_case is most common in Python / data engineering; camelCase in JavaScript / TypeScript. Mixing in one file causes consumer-side bugs.
- Avoid leading underscores for normal fields. Many systems use _id, _meta, _index for system fields (MongoDB, Elasticsearch). Leave that namespace alone.
- Avoid dots in key names. Many query languages (jq, JSONPath, Mongo dotted paths) interpret . as nesting. A key like "user.name" with a literal dot needs special quoting in some tools and is hard or impossible to address in others.
- Reserve types. Don't make a field hold strings sometimes and arrays other times. Pick one, even if it means "tags": [] for tagless records.
8. Sort keys for stable diffs
JSON object keys are unordered, but many sources of value (git diffs, SHA hashes for cache busting, reproducible builds) depend on deterministic byte output. Sort keys recursively at write time when stability matters:
# Python
json.dumps(record, sort_keys=True, ensure_ascii=False)
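# Node (a replacer function sorts nested objects too; an array replacer would drop nested keys)
JSON.stringify(record, (_key, value) =>
  value && typeof value === 'object' && !Array.isArray(value)
    ? Object.fromEntries(Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)))
    : value)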
# jq filter applied to existing file
jq -c 'walk(if type == "object" then to_entries | sort_by(.key) | from_entries else . end)' \
input.jsonl > sorted.jsonl
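To see why sorted keys matter for hashing, here is a short Python sketch that derives a SHA-256 digest from a canonical line. The helper names are illustrative, and the compact separators are an added assumption to pin down whitespace; this is not a full canonicalization scheme such as RFC 8785, since number formatting is left to json.dumps.

```python
import hashlib
import json
from typing import Any


def canonical_line(record: Any) -> str:
    """Serialize with recursively sorted keys and fixed separators."""
    return json.dumps(
        record,
        sort_keys=True,          # applies to nested objects as well
        ensure_ascii=False,
        separators=(",", ":"),   # assumption: no optional whitespace
    )


def record_digest(record: Any) -> str:
    """Hex SHA-256 of the canonical UTF-8 bytes of a record."""
    return hashlib.sha256(canonical_line(record).encode("utf-8")).hexdigest()
```

Two records with the same content but different key order, for example {"b": 1, "a": {"d": 2, "c": 3}} and {"a": {"c": 3, "d": 2}, "b": 1}, serialize to the same line and therefore get the same digest.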
9. Handle big files with the right tool
| File size | Tool | Why |
|---|---|---|
| < 100 MB | This site (browser-based) | Loads in seconds, no install, all features available |
| 100 MB – 1 GB | This site or jq / DuckDB locally | Browser memory usually fits; jq for filters, DuckDB for SQL-style |
| 1 GB – 50 GB | jsonlkit CLI or jq | Streaming, line-by-line, never loads whole file |
| > 50 GB | DuckDB, Spark, or partition by date | Native parallel readers; consider Parquet for analytics |
10. Canonical libraries by language
| Language | Read / write | Validate | Query |
|---|---|---|---|
| Python | stdlib json + line iteration; orjson for speed | jsonschema, pydantic | jq (subprocess), jsonpath-ng, duckdb |
| JavaScript / Node | readline + JSON.parse; ndjson npm package | ajv | node-jq, jsonpath-plus |
| Go | encoding/json + bufio.Scanner | gojsonschema | itchyny/gojq |
| Rust | serde_json + BufRead | jsonschema crate | jaq (jq in Rust) |
| Java / Kotlin | Jackson JsonFactory + line stream | everit-json-schema | JsonPath (Jayway) |
| Shell | cat, head, tail, wc -l | jq empty per line; our validator | jq |
11. Useful shell one-liners
# Count records
wc -l events.jsonl
# First and last record
head -1 events.jsonl ; tail -1 events.jsonl
# Pretty-print one record
head -1 events.jsonl | jq
# Filter by field
jq -c 'select(.user_id == 4287)' events.jsonl > user-4287.jsonl
# Extract one field across all records (TSV output)
jq -r '[.ts, .user_id, .event_type] | @tsv' events.jsonl
# Validate every line (jq stops at the first parse error)
jq empty events.jsonl && echo "all valid"
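# Sort by a field (-s slurps the whole file into memory; -c keeps one record per line)
jq -s -c 'sort_by(.ts) | .[]' events.jsonl > sorted.jsonl
# Dedupe by full line
sort -u events.jsonl
# Dedupe by a key (also slurps; output comes back sorted by that key)
jq -s -c 'unique_by(.event_id) | .[]' events.jsonl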
# Random sample of 1000 lines
shuf -n 1000 events.jsonl
# Split into 10 equal parts
split -n l/10 -d events.jsonl events_part_
12. Privacy: scrub PII before sharing
JSONL files often pick up personal data — emails, IP addresses, names, account IDs. Before sharing externally:
- Run through the anonymizer to redact obvious PII (emails, IPs, phone numbers, credit-card-shaped numbers, common token formats).
- Consider keyed hashing (HMAC-SHA-256) for IDs that must remain joinable across datasets but not reversible to the original value.
- For statistical sharing, use the sampler to ship a sample with k-anonymity rather than the full file.
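The keyed-hashing suggestion can be sketched with Python's standard hmac module. The function names and the default field list ("user_id", "account_id") are placeholders for illustration; the key must be kept secret and reused across datasets that need to stay joinable.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def pseudonymize(value: str, key: bytes) -> str:
    """HMAC-SHA-256 of a value: stable for a given key, not reversible without it."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_record(
    rec: Dict[str, Any],
    key: bytes,
    id_fields: Iterable[str] = ("user_id", "account_id"),  # hypothetical field names
) -> Dict[str, Any]:
    """Return a copy of rec with the listed ID fields replaced by keyed hashes."""
    out = dict(rec)
    for field in id_fields:
        if out.get(field) is not None:
            # str() so that 4287 and "4287" map to the same pseudonym.
            out[field] = pseudonymize(str(out[field]), key)
    return out
```

A plain unkeyed hash would not be enough here: small ID spaces such as numeric user IDs can be reversed by hashing every candidate value.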
13. Producer-consumer contracts
Document these explicitly between teams:
- Schema (a checked-in JSON Schema, ideally Draft 2020-12).
- Line ending policy (always \n).
- Sort policy (sorted keys for stable diffs? sorted records by some key?).
- Compression (which codec; concatenation policy).
- Error policy on the consumer (skip bad lines? fail fast? quarantine?).
- Versioning (a schema_version field).
- Trailing-newline policy (we recommend yes, so wc -l matches the record count).
14. Common mistakes
Treating .jsonl as a JSON array
The mistake: JSON.parse(fileContents) on a JSONL file. The fix: read line-by-line, parse each. Almost every "invalid JSON" error on a JSONL file is this.
Pretty-printing JSONL
Pretty-printing introduces newlines inside records, which breaks the format. JSONL records are always on one line each. Use the formatter to flip between pretty JSON and JSONL.
Trailing commas (LLM output)
LLMs love to insert , after the last property. Strict JSON rejects this. Run through the auto-fixer or strip with sed.
Missing newline before EOF
Common with naively concatenated files. Symptom: wc -l is off-by-one; the last record may be silently dropped by some consumers. Always end with a \n.
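One way to avoid the problem when merging files is to check the last byte of each part while copying. This is a sketch using only the standard library; concat_jsonl is an illustrative name, and it assumes the parts are uncompressed.

```python
from typing import Iterable


def concat_jsonl(parts: Iterable[str], dest: str) -> None:
    """Concatenate JSONL files, adding a newline after any part that lacks one."""
    with open(dest, "wb") as out:
        for part in parts:
            last_byte = b"\n"  # an empty part needs no separator
            with open(part, "rb") as src:
                # Copy in 1 MiB chunks so large parts never sit fully in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")
```

Without the check, the last record of one part and the first record of the next would be fused into a single invalid line.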
Mixed record shapes
Spec-legal but consumer-hostile. Stick to one shape per file. If you genuinely need heterogeneous records, add a discriminator field ("event_type":"...") and document it.
Using application/json for the MIME type
Clients will try to parse the whole body as one document. Use application/x-ndjson instead.
BOM in the file
EF BB BF at the very start of the file (byte offset 0) breaks parsing of the first record on naive parsers. Don't write a BOM; do strip it on read.
15. Where to go from here
- Read the formal specification for the edge-case rules.
- Compare against other formats in JSONL vs JSON vs NDJSON.
- See real-world JSONL shapes in examples.
- Try the tools — every one runs in your browser, no upload.
- For shell pipelines on multi-GB files: jsonlkit CLI.
— S., [email protected]