JSONL Best Practices

Performance, gotchas, libraries · updated 21 May 2026

JSONL is a simple format with a small number of recurring pitfalls. This page is the field guide for using it well — covering streaming, large files, compression choices, encoding traps, schema evolution, and the canonical library for each language.

1. Always stream, never slurp

The whole point of JSONL is that you can read one record at a time. Don't load the file into memory.

Python

# Good — streams line-by-line, constant memory
import json
with open('events.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        process(record)

# Bad — reads everything, blows up on a 50 GB file
with open('events.jsonl') as f:
    records = [json.loads(l) for l in f.readlines()]

Node.js

// Good: readline streams one line at a time
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const rl = createInterface({
  input: createReadStream('events.jsonl', 'utf-8'),
  crlfDelay: Infinity,
});

for await (const line of rl) {
  if (!line) continue;
  const record = JSON.parse(line);
  handleRecord(record);  // your handler; avoid the name process, which is a Node global
}

Go

// Good: bufio.Scanner streams; raise the buffer limit for long lines
f, err := os.Open("events.jsonl")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
s := bufio.NewScanner(f)
s.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)  // up to 16 MB lines
lineNum := 0
for s.Scan() {
    lineNum++
    var rec map[string]any
    if err := json.Unmarshal(s.Bytes(), &rec); err != nil {
        log.Printf("line %d: %v", lineNum, err)
        continue
    }
    process(rec)
}
if err := s.Err(); err != nil {  // e.g. bufio.ErrTooLong for an oversized line
    log.Fatal(err)
}

Rust

use std::io::{BufRead, BufReader};
use std::fs::File;
let f = File::open("events.jsonl")?;
for line in BufReader::new(f).lines() {
    let line = line?;
    let rec: serde_json::Value = serde_json::from_str(&line)?;
    process(rec);
}

2. Compress for storage and transport

JSONL compresses extremely well because keys repeat every record. Typical ratios:

Codec	Ratio	Decompress speed	When to use
gzip (level 6)	5–10×	Medium	Universal compatibility, default everywhere
zstd (level 3)	5–12×	Fast	Modern stacks (DuckDB, Pandas, ClickHouse); strongly recommended
zstd (level 19)	8–20×	Fast	Archival; small files, cold storage
brotli	6–15×	Medium	HTTP-static delivery (Cloudflare, browsers)
xz / LZMA	10–25×	Slow	Cold archives where compute is cheap

Recommendation: Use zstd for working files, gzip for cross-tool compatibility, xz only for long-term cold archives. Most tooling reads .jsonl.gz and .jsonl.zst natively without decompressing the whole file first.

Concatenation trick: gzip streams can be concatenated member by member and zstd streams frame by frame, and a compliant decoder reads the result as a single stream:

# Concatenate compressed JSONL files without decompressing
cat 2026-05-21-*.jsonl.gz > day.jsonl.gz
# Consumers see one continuous stream. Works for zstd too.
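
On the consumer side, a compressed file can be streamed the same way as a plain one. The following is a minimal Python sketch, assuming gzip input and the standard library only; the helper name `iter_gzip_jsonl` is illustrative. Python's `gzip` module reads concatenated members through to the end, so a file built with the `cat` trick above is handled without special casing.

```python
import gzip
import json

def iter_gzip_jsonl(path):
    """Yield records from a .jsonl.gz file one line at a time."""
    # "rt" decompresses incrementally and decodes to text, so memory use
    # stays bounded by the longest line rather than by the file size.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

For `.jsonl.zst` the same loop applies, but the decompressor would come from a third-party package, since the standard library did not ship a zstd codec until recently.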

3. Set the right MIME type for HTTP

Header	Use case
`Content-Type: application/x-ndjson`	HTTP requests and responses carrying JSONL
`Content-Type: application/x-ndjson; charset=utf-8`	Explicit, the most defensive choice
`Transfer-Encoding: chunked`	Streaming — flush after each record
`Content-Encoding: gzip` / `br`	Compressed-on-the-wire transport

Never send JSONL with Content-Type: application/json — clients will assume it's a single document and fail at the second record.
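
When a response is streamed, chunk boundaries on the wire need not line up with record boundaries, so the consumer has to buffer partial lines. Here is a minimal Python sketch of both sides using only the standard library; the names `encode_ndjson` and `decode_ndjson` are illustrative and do not come from any particular framework.

```python
import json

NDJSON_CONTENT_TYPE = "application/x-ndjson; charset=utf-8"

def encode_ndjson(records):
    """Yield one UTF-8 encoded line per record, ready to write and flush."""
    for record in records:
        text = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
        yield (text + "\n").encode("utf-8")

def decode_ndjson(chunks):
    """Parse records from byte chunks whose boundaries may fall mid-record."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the rest.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final record that arrived without a trailing newline
        yield json.loads(buffer)
```

Splitting on the raw byte `\n` before decoding is safe because that byte never occurs inside a multi-byte UTF-8 sequence.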

4. Handle malformed lines gracefully

Real-world JSONL files often have a few bad lines, especially when produced by ad-hoc loggers or by tools that crashed mid-write. The robust pattern is:

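import json
import logging

def read_jsonl(path):
    """Yield parsed records; collect bad lines instead of aborting."""
    errors = []
    with open(path, encoding='utf-8') as f:  # closed even if the caller stops early
        for line_num, line in enumerate(f, start=1):
            line = line.strip()
            if not line:  # blank line: skip silently
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((line_num, str(e), line[:200]))
    if errors:
        # decide: warn, log, or fail (this version logs once the file is exhausted)
        for line_num, message, snippet in errors:
            logging.warning("line %d: %s: %r", line_num, message, snippet)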

Don't blow up the whole import on a single bad row. Collect errors with line numbers, then decide the policy explicitly (warn / log / fail-fast). If you've inherited a file with widespread corruption, run it through the auto-fixer first: it repairs trailing commas, single quotes, smart quotes, BOMs, comments, and the dozen other things that break naive parsers.

5. Encoding traps

BOM at the start of the file. Tools that emit a UTF-8 BOM (Notepad, some Windows exports) leave the bytes EF BB BF at the very start. Many parsers then see U+FEFF as the first character of line 1 and reject the first record. Strip on read; never emit on write.
Mixed encodings. JSONL is UTF-8. If a file claims to be UTF-8 but contains Windows-1252 bytes, strict decoders raise errors or substitute U+FFFD; if UTF-8 text was decoded as Windows-1252 somewhere upstream and saved again, you'll see mojibake like â€™ where ’ should be. Diagnose with file --mime-encoding events.jsonl and convert with iconv.
Surrogate pairs. Emoji and other supplementary-plane characters use UTF-16 surrogate pairs in \uXXXX escapes ("\ud83d\ude00" for 😀). Most parsers handle this correctly, but some hand-rolled ones don't.
Invisible whitespace. Non-breaking space (U+00A0), zero-width joiner (U+200D), and friends are valid inside JSON strings but invisible in editors. If a key lookup mysteriously fails, copy the literal bytes and inspect them.
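
The traps above can be handled or at least detected with a few lines of Python. This is a sketch under stated assumptions: the helper names `read_jsonl_skip_bom` and `invisible_chars` are illustrative, and the "invisible" check only covers the Unicode categories Cf (format) and Zs (space separator), which is a heuristic rather than a complete list.

```python
import json
import unicodedata

def read_jsonl_skip_bom(path):
    """Yield records, tolerating a UTF-8 BOM at the start of the file."""
    # The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def invisible_chars(text):
    """Return (index, Unicode name) for format and non-ASCII space characters."""
    return [
        (i, unicodedata.name(ch, hex(ord(ch))))
        for i, ch in enumerate(text)
        if ch != " " and unicodedata.category(ch) in ("Cf", "Zs")
    ]

# A surrogate-pair escape decodes to one supplementary-plane character.
assert json.loads('"\\ud83d\\ude00"') == "\U0001F600"
```

Running `invisible_chars` over the keys of a record that fails a lookup is usually quicker than inspecting bytes by hand.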

6. Schema evolution

JSONL is schemaless by design, which is great for prototyping and brutal for production unless you have a strategy:

Backward-compatible changes — always safe. Adding a new field; making a required field optional; adding allowed enum values.
Backward-incompatible changes — never silent. Renaming a field; changing a type (string → int); removing a field; tightening enum values.
Version with a schema_version field on every record. Consumers can branch on it. Cheap insurance.
Validate at the boundary. When you receive a file, run it through the schema validator against an explicit JSON Schema. Catch drift before it propagates downstream.
Infer first, then commit. The schema inferrer produces a draft from real data — review and tighten, then check it in. Re-run weekly to catch drift early.
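
Branching on schema_version can be as small as one upgrade function at the consumer boundary. The sketch below is hypothetical throughout: the v1 to v2 change (a string field `uid` becoming an integer `user_id`), the rule that unversioned records count as v1, and the name `upgrade` are all assumptions made for illustration.

```python
CURRENT_VERSION = 2  # hypothetical: the newest schema this consumer understands

def upgrade(record):
    """Return the record in CURRENT_VERSION shape; raise on unknown versions."""
    version = record.get("schema_version", 1)  # assumption: unversioned means v1
    if version == 1:
        # Hypothetical v1 -> v2 change: "uid" (string) became "user_id" (int).
        upgraded = {k: v for k, v in record.items() if k != "uid"}
        upgraded["user_id"] = int(record["uid"])
        upgraded["schema_version"] = 2
        record, version = upgraded, 2
    if version != CURRENT_VERSION:
        # Never silent: an unknown version is an explicit failure.
        raise ValueError(f"unsupported schema_version: {version!r}")
    return record
```

Downstream code then only ever sees the current shape, and a producer that ships a newer version fails loudly instead of being misread.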

7. Field naming conventions

Stick to one case style. snake_case is most common in Python / data engineering; camelCase in JavaScript / TypeScript. Mixing in one file causes consumer-side bugs.
Avoid leading underscores for normal fields — many systems use _id, _meta, _index for system fields (MongoDB, Elasticsearch). Leave that namespace alone.
Avoid dots in key names. Many query languages (jq, JSONPath, Mongo dotted paths) interpret . as nesting. A key like "user.name" with a literal dot needs special quoting (in jq, .["user.name"]) and in some tools cannot be addressed at all.
Reserve types. Don't make a field hold strings sometimes and arrays other times — pick one, even if it means "tags": [] for tagless records.
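
These conventions are easy to check mechanically. Below is a rough Python sketch of a linter, assuming every record is a JSON object, that snake_case is the house style, and that only top-level keys matter; `lint_records` is an illustrative name, and a field that is sometimes null will also be reported as mixed.

```python
import json
import re
from collections import defaultdict

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def lint_records(lines):
    """Return sorted warnings about key names and per-field type consistency."""
    warnings = set()
    types_seen = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for key, value in record.items():
            if key.startswith("_"):
                warnings.add(f"{key}: leading underscore")
            elif "." in key:
                warnings.add(f"{key}: dot in key name")
            elif not SNAKE_CASE.match(key):
                warnings.add(f"{key}: not snake_case")
            types_seen[key].add(type(value).__name__)
    for key, names in types_seen.items():
        if len(names) > 1:
            warnings.add(f"{key}: mixed types {sorted(names)}")
    return sorted(warnings)
```

Because it accepts any iterable of lines, it can be run over an open file handle without loading the file.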

8. Sort keys for stable diffs

JSON object keys are unordered, but many workflows (git diffs, SHA hashes for cache busting, reproducible builds) depend on deterministic byte output. Sort keys recursively at write time when stability matters:

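# Python (sort_keys also applies to nested objects)
json.dumps(record, sort_keys=True, ensure_ascii=False)

# Node (an array replacer is an allowlist and would drop nested keys not in it,
# so rebuild each object with sorted keys in a replacer function instead)
JSON.stringify(record, (key, value) =>
  value && typeof value === 'object' && !Array.isArray(value)
    ? Object.fromEntries(Object.keys(value).sort().map((k) => [k, value[k]]))
    : value);

# jq filter applied to existing file
jq -c 'walk(if type == "object" then to_entries | sort_by(.key) | from_entries else . end)' \
   input.jsonl > sorted.jsonl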

9. Handle big files with the right tool

File size	Tool	Why
< 100 MB	This site (browser-based)	Loads in seconds, no install, all features available
100 MB – 1 GB	This site or jq / DuckDB locally	Browser memory usually fits; jq for filters, DuckDB for SQL-style
1 GB – 50 GB	jsonlkit CLI or jq	Streaming, line-by-line, never loads whole file
> 50 GB	DuckDB, Spark, or partition by date	Native parallel readers; consider Parquet for analytics

10. Canonical libraries by language

Language	Read / write	Validate	Query
Python	stdlib `json` + line iteration; `orjson` for speed	`jsonschema`, `pydantic`	`jq` (subprocess), `jsonpath-ng`, `duckdb`
JavaScript / Node	`readline` + `JSON.parse`; `ndjson` npm package	`ajv`	`node-jq`, `jsonpath-plus`
Go	`encoding/json` + `bufio.Scanner`	`gojsonschema`	`itchyny/gojq`
Rust	`serde_json` + `BufRead`	`jsonschema` crate	`jaq` (jq in Rust)
Java / Kotlin	Jackson `JsonFactory` + line stream	`everit-json-schema`	JsonPath (Jayway)
Shell	`cat`, `head`, `tail`, `wc -l`	`jq empty` (exits non-zero at the first invalid record); our validator	`jq`

11. Useful shell one-liners

# Count records
wc -l events.jsonl

# First and last record
head -1 events.jsonl ; tail -1 events.jsonl

# Pretty-print one record
head -1 events.jsonl | jq

# Filter by field
jq -c 'select(.user_id == 4287)' events.jsonl > user-4287.jsonl

# Extract one field across all records (TSV output)
jq -r '[.ts, .user_id, .event_type] | @tsv' events.jsonl

# Validate every line (jq exits non-zero at the first parse error)
jq -c . events.jsonl > /dev/null && echo "all valid"

# Sort by a field (-s loads the whole file into memory; -c keeps one record per line)
jq -c -s 'sort_by(.ts) | .[]' events.jsonl > sorted.jsonl

# Dedupe by full line
sort -u events.jsonl

# Dedupe by a key (-s loads the whole file into memory; output is sorted by the key)
jq -c -s 'unique_by(.event_id) | .[]' events.jsonl

# Random sample of 1000 lines
shuf -n 1000 events.jsonl

# Split into 10 equal parts
split -n l/10 -d events.jsonl events_part_
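
For files too large to slurp, deduplication by key can also be done in a single streaming pass. A minimal Python sketch, assuming each record is a JSON object and that the set of distinct key values fits in memory even when the file does not; `dedupe_by_key` is an illustrative name.

```python
import json

def dedupe_by_key(lines, key):
    """Yield the first line seen for each value of `key`, preserving order."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        value = json.loads(line).get(key)
        # Serialize the value so unhashable ones (lists, objects) can be tracked.
        marker = json.dumps(value, sort_keys=True)
        if marker not in seen:
            seen.add(marker)
            yield line

# Usage sketch (file names are placeholders):
# with open("events.jsonl", encoding="utf-8") as src, \
#      open("deduped.jsonl", "w", encoding="utf-8", newline="\n") as dst:
#     dst.writelines(dedupe_by_key(src, "event_id"))
```

Unlike a slurp-based approach, this keeps the original record order and holds only the key values in memory.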

12. Privacy: scrub PII before sharing

JSONL files often pick up personal data — emails, IP addresses, names, account IDs. Before sharing externally:

Run through the anonymizer to redact obvious PII (emails, IPs, phone numbers, credit-card-shaped numbers, common token formats).
Consider keyed hashing (HMAC-SHA-256) for IDs that must remain joinable across datasets but not reversible to the original value.
For statistical sharing, use the sampler to ship a sample with k-anonymity rather than the full file.
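
Keyed hashing of IDs takes only a few lines with Python's standard library. This is a sketch, not a complete anonymization scheme: the function name `pseudonymize` is illustrative, and it assumes the secret key is generated once, stored outside the dataset, and never shared with recipients.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Return a hex HMAC-SHA-256 digest of `value` under `secret_key` (bytes)."""
    message = str(value).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

The same ID under the same key always maps to the same token, so joins across datasets still work, while a different key yields unrelated tokens. Because the value is passed through `str()`, the integer 4287 and the string "4287" map to the same token; drop that conversion if the distinction matters.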

13. Producer-consumer contracts

Document these explicitly between teams:

Schema (a checked-in JSON Schema, ideally Draft 2020-12).
Line ending policy (always \n).
Sort policy (sorted keys for stable diffs? sorted records by some key?).
Compression (which codec; concatenation policy).
Error policy on the consumer (skip bad lines? fail fast? quarantine?).
Versioning (a schema_version field).
Trailing-newline policy (we recommend yes, so wc -l matches).
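
Several of these contract points can be enforced in one place on the producer side. A minimal Python sketch, assuming sorted keys, a trailing newline after every record, and a default schema_version that a record's own field overrides; `write_jsonl` is an illustrative name.

```python
import json

def write_jsonl(records, out, schema_version=1):
    """Write records to a text stream, one compact JSON document per line."""
    count = 0
    for record in records:
        # The record's own schema_version, if present, wins over the default.
        row = {"schema_version": schema_version, **record}
        # json.dumps escapes newlines inside strings, so a record never spans lines.
        out.write(json.dumps(row, sort_keys=True, ensure_ascii=False,
                             separators=(",", ":")))
        out.write("\n")  # always \n, including after the last record
        count += 1
    return count
```

When writing to a file, opening it with `open(path, "w", encoding="utf-8", newline="\n")` keeps Python from translating `\n` to `\r\n` on Windows.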

14. Common mistakes

Treating .jsonl as a JSON array

The mistake: JSON.parse(fileContents) on a JSONL file. The fix: read line-by-line, parse each. Almost every "invalid JSON" error on a JSONL file is this.

Pretty-printing JSONL

Pretty-printing introduces newlines inside records, which breaks the format. JSONL records are always on one line each. Use the formatter to flip between pretty JSON and JSONL.

Trailing commas (LLM output)

LLMs love to insert , after the last property. Strict JSON rejects this. Run through the auto-fixer or strip with sed.

Missing newline before EOF

Common with naively concatenated files. Symptom: wc -l is off-by-one; the last record may be silently dropped by some consumers. Always end with a \n.
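
A concatenation step can guard against this by checking the last byte of each input. A small Python sketch, with the illustrative names `ends_with_newline` and `safe_concat`; it assumes uncompressed inputs.

```python
import os
import shutil

def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return True
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"

def safe_concat(paths, out_path):
    """Concatenate JSONL files, adding a newline wherever one is missing."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in blocks, not all at once
            if not ends_with_newline(path):
                out.write(b"\n")
```

Without the extra newline, the last record of one file and the first record of the next would end up on the same line.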

Mixed record shapes

Spec-legal but consumer-hostile. Stick to one shape per file. If you genuinely need heterogeneous records, add a discriminator field ("event_type":"...") and document it.

Using application/json for the MIME type

Clients will try to parse the whole body as one document. Use application/x-ndjson instead.

BOM in the file

The bytes EF BB BF at offset 0 break parsing of the first record in naive parsers. Don't write a BOM; do strip it on read.

15. Where to go from here

Read the formal specification for the edge-case rules.
Compare against other formats in JSONL vs JSON vs NDJSON.
See real-world JSONL shapes in examples.
Try the tools — every one runs in your browser, no upload.
For shell pipelines on multi-GB files: jsonlkit CLI.

— S., [email protected]