JSONL vs JSON vs NDJSON vs CSV vs Parquet
Five formats that look interchangeable but solve different problems. JSON is for documents. JSONL and NDJSON are streamable record sets. CSV is for spreadsheets. Parquet is for analytical warehouses. Picking right matters — converting later is cheap with tools like ours, but discovering you picked wrong after writing 50 GB is painful.
The 30-second version
- JSON — one document, often hierarchical. Configs, API responses, single-record exchanges.
- JSONL / NDJSON — many independent records, one per line. Logs, datasets, streams, ML training data.
- CSV — tabular rows, all the same flat shape. Spreadsheets, simple exchange, BI imports.
- Parquet — typed columnar storage with metadata. Analytical workloads at scale.
If your data has nested structure and you want to stream/append, JSONL wins. If it's a single rich document, regular JSON. If it's a flat table for non-developers, CSV. If it's millions of rows for analytics, Parquet.
JSONL vs JSON
| Aspect | JSON | JSONL |
|---|---|---|
| Top-level shape | One document (object, array, value) | Sequence of independent values, one per line |
| Streamable | No — must read entire document | Yes — parse line by line |
| Appendable | No — would need to rewrite or hack ] | Yes — append a line |
| Splittable for parallel processing | No | Yes — any \n is a safe boundary |
| Memory cost | Whole document in RAM | One record at a time |
| Resilient to corruption | One bad byte = whole file unparseable | One bad line = skip and continue |
| Diff-friendly (git) | Poor — formatting changes look like content changes | Excellent — line-based diffs work natively |
| Pretty-printing | Multi-line, indented | One record per line; pretty-printing breaks the format |
| Used for | Configs, API responses, single records | Datasets, logs, ML training, ETL streams |
Round-trip both ways with JSON → JSONL and JSONL → JSON.
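As a rough illustration of what that round trip involves, here is a minimal Python sketch using only the standard library. The function names are hypothetical, and it assumes the JSON side is a top-level array of records.

```python
import json

def json_array_to_jsonl(text: str) -> str:
    """Turn a JSON array (as text) into JSONL: one compact record per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)

def jsonl_to_json_array(text: str) -> str:
    """Turn JSONL text back into a single JSON array, skipping blank lines."""
    # Split on "\n" only: str.splitlines() would also split on other separators.
    records = [json.loads(line) for line in text.split("\n") if line.strip()]
    return json.dumps(records)

doc = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Babbage"}]'
jsonl = json_array_to_jsonl(doc)
print(jsonl, end="")
print(jsonl_to_json_array(jsonl))
```

Note that the JSONL-to-JSON direction has to hold every record in memory at once, which is exactly the cost the table above attributes to JSON.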
When to pick JSON over JSONL
- You're sending a single response over HTTP that fits comfortably in memory.
- The data is a document, not a set — e.g. a configuration file, a search response with nested facets, an OpenAPI spec.
- You need to cross-reference between records inside the same payload (a JSON array makes this easy; a JSONL stream doesn't).
- The consumer is a browser or a tool that expects a single document (JSON.parse on the whole body).
When to pick JSONL over JSON
- The data is a stream or batch of records, all the same shape.
- The file may not fit in memory.
- You want to append records over time without rewriting the file.
- You want line-based Unix tooling (head, tail, grep, wc, jq) to work directly.
- You want partial failure tolerance: one bad record shouldn't poison the rest.
- You're feeding an ML training pipeline, log aggregator, or warehouse importer; most of these accept JSONL, and many expect it.
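Several of these points (one record in memory at a time, appending, tolerating a bad line) can be sketched in a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and it silently skips bad lines where a real pipeline would probably log them.

```python
import io
import json
from typing import Iterator, TextIO

def iter_jsonl(stream: TextIO) -> Iterator[tuple[int, object]]:
    """Yield (line_number, record) pairs, skipping lines that fail to parse."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError:
            continue  # one bad line is dropped; the rest of the file is still usable

def append_record(stream: TextIO, record: object) -> None:
    """Appending is just writing one more line; nothing existing is rewritten."""
    stream.write(json.dumps(record) + "\n")

# An in-memory stand-in for a log file whose second line is corrupt.
log = io.StringIO('{"id": 1}\n{"id": 2, broken\n{"id": 3}\n')
good = list(iter_jsonl(log))
append_record(log, {"id": 4})
print(good)
```

With a real file you would open it in append mode ("a") for writes and iterate over the file object for reads; only one line is held in memory at a time.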
JSONL vs NDJSON
Same format, different names. NDJSON is the spelling used in JavaScript / Node circles and pushed by ndjson.org. JSONL is the spelling used in Python, ML, and data engineering, and documented at jsonlines.org. Neither is a formal standard, and there is no functional difference.
If you receive .ndjson and your tool expects .jsonl (or vice versa), just rename the file. Both refer to "one JSON value per line, separated by \n." See the overview's naming section for history.
JSONL vs CSV
| Aspect | JSONL | CSV |
|---|---|---|
| Schema | Self-describing per record (keys present) | External — first row is usually the header |
| Nested data | Native — objects and arrays inside records | Not supported — must flatten via dot-keys or JSON-in-cells |
| Types | Distinguishes string / number / bool / null | Everything is a string; types must be re-parsed |
| Encoding | UTF-8 | UTF-8 in theory, often Windows-1252 or weird locales in practice |
| Quoting rules | Strict JSON quoting | RFC 4180, but in practice quoting is everyone's footgun |
| Heterogeneous shapes | Possible (not recommended) | Impossible — every row must have the same column count |
| File size | Larger — keys repeat every record | Smaller — header once, then values |
| Spreadsheet-friendly | No — Excel can't open JSONL directly | Yes — Excel, Sheets, Numbers all open CSV |
| Streamable / appendable | Yes | Yes |
Round-trip with CSV → JSONL and JSONL → CSV.
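To make the flattening trade-off concrete, here is a minimal Python sketch of a JSONL to CSV conversion. It assumes nested objects become dot-keys and arrays become JSON-in-cells (the two options named in the table); the function names are hypothetical.

```python
import csv
import io
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested objects into dot-keys; arrays become JSON-in-cells."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat

def jsonl_to_csv(jsonl_text: str) -> str:
    rows = [flatten(json.loads(line)) for line in jsonl_text.split("\n") if line.strip()]
    # Union of keys in first-seen order, so records with different keys share one header.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return out.getvalue()

sample = (
    '{"id": 1, "user": {"name": "Ada"}, "tags": ["math", "code"]}\n'
    '{"id": 2, "user": {"name": "Babbage"}, "tags": []}\n'
)
csv_text = jsonl_to_csv(sample)
print(csv_text, end="")
```

The whole input is read first because the CSV header must be known before the first row is written; that is one place where CSV's external schema costs you streaming.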
When to pick CSV
- Your audience is non-developers who'll open the file in Excel or Sheets.
- The data is genuinely flat (no nested objects, no arrays inside cells).
- The file is going into a BI tool that prefers CSV imports.
- File size matters more than self-description.
When to pick JSONL over CSV
- You have nested data and don't want to flatten lossily.
- You need type distinctions, for example a string ID such as "00123" versus the number 123. CSV can't distinguish them, and spreadsheet imports often drop the leading zeros.
- Your downstream is a programming language, not a spreadsheet.
- You want resilience to one row being malformed.
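The type point is easy to check for yourself. The sketch below round-trips one made-up record through both formats using Python's standard json and csv modules; the field names are illustrative only.

```python
import csv
import io
import json

record = {"zip": "00123", "count": 123, "active": True, "note": None}

# JSONL round trip: string, number, bool, and null all survive.
from_jsonl = json.loads(json.dumps(record))

# CSV round trip: every cell comes back as a string, and null becomes "".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

print(from_jsonl)
print(from_csv)
```

After the CSV trip, the reader cannot tell whether "123" was a number or a string, or whether the empty cell was null or an empty string; that decision moves into the consumer's code.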
JSONL vs Parquet
| Aspect | JSONL | Parquet |
|---|---|---|
| Storage | Row-oriented text | Column-oriented binary |
| Human-readable | Yes — open in any text editor | No — need a Parquet reader |
| Schema | Implicit per record | Embedded in file metadata |
| Compression ratio | Good with gzip/zstd | Excellent — column compression + dictionary encoding |
| Read pattern | Full file scan | Read only needed columns |
| Append | Trivial — write a new line | Hard — file is structured; multi-file partitions instead |
| Streaming | Natural fit | Designed for files at rest, not streams |
| Used for | Logs, fine-tune datasets, ETL transport | Analytical queries (DuckDB, Spark, Athena, BigQuery) |
Convert with JSONL ↔ Parquet (DuckDB-WASM in your browser).
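The "read only needed columns" and "dictionary encoding" rows can be illustrated with a toy column store in plain Python. This is not the Parquet file format, only a sketch of the two ideas; all names are hypothetical.

```python
import json

def to_columns(jsonl_text: str) -> dict[str, list]:
    """Pivot row-oriented JSONL into one list per field (a toy column store)."""
    rows = [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]
    names = list(dict.fromkeys(key for row in rows for key in row))
    return {name: [row.get(name) for row in rows] for name in names}

def dictionary_encode(values: list) -> tuple[list, list[int]]:
    """Replace repeated values with small integer codes into a lookup table."""
    table: list = []
    index: dict = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        codes.append(index[value])
    return table, codes

data = (
    '{"id":1,"country":"UK","bio":"long text"}\n'
    '{"id":2,"country":"UK","bio":"more text"}\n'
    '{"id":3,"country":"FR","bio":"even more"}\n'
)
columns = to_columns(data)

# A filter on "country" touches two columns and never looks at "bio".
matching_ids = [i for i, c in zip(columns["id"], columns["country"]) if c == "FR"]

# Repeated strings collapse to a small table plus integer codes.
table, codes = dictionary_encode(columns["country"])
print(matching_ids, table, codes)
```

In a real Parquet file the same layout is what lets a reader skip the bytes of unused columns on disk, rather than parsing every record as a JSONL scan must.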
The typical workflow
Many data pipelines use both: JSONL for the landing zone, staging area, or streaming transport (append-friendly, debuggable in a text editor), then Parquet for queryable storage after a daily or hourly compaction (columnar reads, dictionary compression). Tools such as dbt, Airbyte, and Snowflake's external tables fit into this pattern.
Decision flowchart
Is the data ONE document (config, single API response)?
├── Yes → JSON
└── No → continue
Are records flat with no nesting?
├── Yes → audience non-developers?
│   ├── Yes → CSV
│   └── No → continue
└── No → continue
Is the data at-rest for analytical queries (warehouse, BI)?
├── Yes → Parquet
└── No → JSONL
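The flowchart can also be read as a short function. This is one possible encoding, assuming each question reduces to a yes/no answer; the function and parameter names are hypothetical.

```python
def pick_format(single_document: bool, flat: bool,
                non_developer_audience: bool, analytical_at_rest: bool) -> str:
    """Walk the questions in the same order as the flowchart."""
    if single_document:
        return "JSON"
    if flat and non_developer_audience:
        return "CSV"
    if analytical_at_rest:
        return "Parquet"
    return "JSONL"

# A config file, a spreadsheet export, warehouse tables, and application logs.
print(pick_format(True, False, False, False))
print(pick_format(False, True, True, False))
print(pick_format(False, True, False, True))
print(pick_format(False, False, False, False))
```

The order of the checks matters: flat data for developers that will sit in a warehouse falls through the CSV question and lands on Parquet.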
Side-by-side: the same data in each format
JSON
{
  "users": [
    {"id": 1, "name": "Ada", "active": true, "tags": ["math", "code"]},
    {"id": 2, "name": "Babbage", "active": false, "tags": ["engine"]}
  ]
}
JSONL
{"id":1,"name":"Ada","active":true,"tags":["math","code"]}
{"id":2,"name":"Babbage","active":false,"tags":["engine"]}
CSV (lossy — arrays flattened)
id,name,active,tags
1,Ada,true,"math|code"
2,Babbage,false,engine
Parquet (binary; logical schema shown)
id : int64
name : utf8
active : bool
tags : list<utf8>
Row 0: 1, "Ada", true, ["math", "code"]
Row 1: 2, "Babbage", false, ["engine"]
Performance: what you can expect
Rough numbers on a modern laptop (M-series Mac, 16 GB RAM) for a 1 GB dataset of nested user records:
| Format | File size | Full read | Filter on one column |
|---|---|---|---|
| JSONL (raw) | 1.0 GB | ~12 s | ~12 s (full scan) |
| JSONL + gzip | ~150 MB | ~16 s | ~16 s |
| JSONL + zstd -3 | ~120 MB | ~10 s | ~10 s |
| Parquet (snappy) | ~180 MB | ~6 s | ~0.3 s (column-pruned) |
The headline: Parquet wins for analytical filters because it can skip most of the file. JSONL wins for streaming, debuggability, and append-only workloads.
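The "full scan" entries for compressed JSONL follow from how such a file is read. Here is a minimal Python sketch using the standard gzip module, with an in-memory buffer standing in for a file on disk; the helper names are hypothetical.

```python
import gzip
import io
import json

def write_jsonl_gz(target, records) -> None:
    """Write records as gzip-compressed JSONL to a path or binary file object."""
    with gzip.open(target, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def scan_jsonl_gz(source, predicate):
    """Full scan: every line is decompressed and parsed, even if it is filtered out."""
    with gzip.open(source, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

buf = io.BytesIO()
write_jsonl_gz(buf, ({"id": i, "active": i % 2 == 0} for i in range(10)))
buf.seek(0)
active_ids = [r["id"] for r in scan_jsonl_gz(buf, lambda r: r["active"])]
print(active_ids)
```

The scan streams in constant memory, but its cost grows with the size of the whole file, not with the number of matching rows. That is why the filter column for the JSONL rows matches the read column.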
FAQ
Can I use JSONL as an HTTP response body?
Yes. Set Content-Type: application/x-ndjson (the commonly used media type) and use chunked transfer encoding. Ending each chunk at a line boundary lets the client parse complete records as they arrive, though clients should still buffer partial lines. Note that the OpenAI and Anthropic streaming APIs use Server-Sent Events (text/event-stream) rather than NDJSON; both use JSONL files for batch jobs.
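On the client side, it is safer not to rely on chunks lining up with records. The sketch below buffers raw byte chunks and yields a record whenever a full line is available; the function name is hypothetical, and the list of chunks stands in for whatever your HTTP client yields.

```python
import json
from typing import Iterable, Iterator

def iter_ndjson_chunks(chunks: Iterable[bytes]) -> Iterator[object]:
    """Parse NDJSON from arbitrary byte chunks, buffering any partial line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last "\n" is complete; the tail may be partial.
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # The final record may arrive without a trailing newline.
        yield json.loads(buffer)

# The second record is split across two chunks.
chunks = [b'{"id": 1}\n{"id"', b': 2}\n', b'{"id": 3}']
records = list(iter_ndjson_chunks(chunks))
print(records)
```

Splitting on the byte 0x0A is safe for UTF-8 input because that byte never occurs inside a multi-byte character.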
Do databases support JSONL natively?
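Many do. BigQuery, Snowflake, and DuckDB load JSONL directly, and Athena queries it in place on S3. Postgres has no dedicated JSONL loader, but COPY can load one line per row into a single json or jsonb column (watch backslash escaping in COPY's text format). MySQL has no COPY command; the usual route is LOAD DATA INFILE into a JSON column or a small client-side script.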
Is JSONL slower than Parquet?
For analytical scans, yes: Parquet's column pruning and dictionary encoding give big wins on selective queries. For streaming, appending, debugging, and small-to-medium files, JSONL is often quicker end-to-end because a write is a plain append, with no row groups to buffer and no footer to rewrite.
Can I store binary data (images, audio) in JSONL?
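Only if you base64-encode it. JSON has no binary type. For real binary blobs, store them outside the JSONL and reference by path or URL: base64 adds about 33% overhead, and a record carrying a large blob must be read fully into memory before it can be parsed, which undercuts line-by-line streaming.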
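The overhead figure comes straight from the encoding: base64 emits 4 output characters for every 3 input bytes. A quick check in Python, using made-up bytes as a stand-in for an image (the field name image_b64 is hypothetical):

```python
import base64
import json

blob = bytes(range(256)) * 3  # 768 bytes of stand-in binary data
encoded = base64.b64encode(blob).decode("ascii")

# The blob travels as an ordinary JSON string inside one JSONL line.
line = json.dumps({"id": 1, "image_b64": encoded})

overhead = len(encoded) / len(blob) - 1
decoded = base64.b64decode(json.loads(line)["image_b64"])

print(len(blob), len(encoded), f"{overhead:.1%}")
```

The round trip is lossless, but the whole line, blob included, has to be in memory before json.loads can return anything.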
— S., [email protected]