JSONL Examples

Real-world JSONL files, annotated · updated 21 May 2026

Eleven real-world examples covering the situations where JSONL shows up: application logs, LLM fine-tuning datasets for the major providers, search engine bulk indexing, data warehouse imports, HTTP streaming, event sourcing, telemetry, and dataset releases. Every data snippet is valid JSONL you can paste into the viewer, validator, or jq playground.

1. Structured application logs

Structured loggers (pino, winston, slog, zap, structlog) can all emit JSONL, and pino does so by default. Each line is one event with timestamp, level, message, and context. Log aggregators (Datadog, Loki, ELK) can ingest this format directly, with no custom parsing rules for each message shape.

{"ts":"2026-05-21T08:14:02.114Z","level":"info","msg":"server started","port":8080,"env":"prod","pid":4287}
{"ts":"2026-05-21T08:14:05.881Z","level":"info","msg":"request","method":"GET","path":"/api/users","status":200,"duration_ms":12,"trace_id":"5e7c1a"}
{"ts":"2026-05-21T08:14:06.224Z","level":"warn","msg":"slow query","duration_ms":482,"query":"SELECT * FROM orders WHERE ...","trace_id":"5e7c1a"}
{"ts":"2026-05-21T08:14:07.001Z","level":"error","msg":"db connection lost","err":"connection refused","retry":1,"trace_id":"5e7c1a"}

Conventions: ISO-8601 timestamps in UTC (sortable as strings), level as a lowercase string, msg as a short human description, all other fields as typed context.

What to do with it: filter by trace_id to follow a request through the system. In jsonlkit-query: select(.trace_id == "5e7c1a"). See the jq query playground.
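The same trace filter can be applied offline in a few lines of JavaScript. This is a minimal sketch that assumes the log is already in memory as a string; parseJsonl and byTrace are hypothetical helper names, not part of any logger's API.

```javascript
// Parse JSONL text into an array of records, skipping blank lines.
function parseJsonl(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Keep only the events that belong to one request.
function byTrace(records, traceId) {
  return records.filter((r) => r.trace_id === traceId);
}

const log = [
  '{"ts":"2026-05-21T08:14:02.114Z","level":"info","msg":"server started","port":8080}',
  '{"ts":"2026-05-21T08:14:05.881Z","level":"info","msg":"request","status":200,"trace_id":"5e7c1a"}',
  '{"ts":"2026-05-21T08:14:07.001Z","level":"error","msg":"db connection lost","trace_id":"5e7c1a"}',
].join('\n');

const traced = byTrace(parseJsonl(log), '5e7c1a');
console.log(traced.map((r) => `${r.level}: ${r.msg}`));
```

Records without a trace_id (such as the startup line) simply drop out of the result, which is usually what you want when following a single request.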

2. OpenAI fine-tuning dataset (chat format)

OpenAI fine-tuning expects one conversation per line. Each conversation is a messages array with role/content turns. This is one of the most common JSONL shapes in ML today.

{"messages":[{"role":"system","content":"You are a helpful customer-support agent."},{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Click 'Forgot password' on the login page, enter your email, and follow the link we send. The link expires in 30 minutes."}]}
{"messages":[{"role":"system","content":"You are a helpful customer-support agent."},{"role":"user","content":"Where can I download my invoices?"},{"role":"assistant","content":"Open Settings → Billing → Invoices. You can download any invoice as PDF from the list."}]}
{"messages":[{"role":"system","content":"You are a helpful customer-support agent."},{"role":"user","content":"Do you offer student discounts?"},{"role":"assistant","content":"Yes — we offer a 50% discount on all paid plans for verified students. Apply at /students with your .edu email."}]}

Gotchas: system prompt should be the first message and consistent across examples; assistant turns must be present (not empty); role names are system, user, assistant (and tool/function if you use those features).

Validate it: drop the file into the OpenAI fine-tune validator — catches role-name typos, missing system, alternation issues, and content-array mistakes.
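For a quick local pre-check, the gotchas above can be turned into code. The sketch below covers only those listed rules, not OpenAI's full server-side validation; checkChatExample is a hypothetical name chosen for illustration.

```javascript
const ALLOWED_ROLES = new Set(['system', 'user', 'assistant', 'tool', 'function']);

// Return a list of problems found in one parsed training example.
function checkChatExample(example) {
  const messages = example.messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    return ['messages must be a non-empty array'];
  }
  const problems = [];
  messages.forEach((m, i) => {
    if (!ALLOWED_ROLES.has(m.role)) {
      problems.push(`message ${i}: unknown role "${m.role}"`);
    }
    if (m.role === 'system' && i !== 0) {
      problems.push(`message ${i}: system prompt is not first`);
    }
    if (m.role === 'assistant' && typeof m.content === 'string' && m.content.trim() === '') {
      problems.push(`message ${i}: empty assistant turn`);
    }
  });
  if (!messages.some((m) => m.role === 'assistant')) {
    problems.push('no assistant turn');
  }
  return problems;
}

// A role-name typo is reported twice: once as an unknown role,
// once because no valid assistant turn remains.
const typoExample = {
  messages: [
    { role: 'user', content: 'How do I reset my password?' },
    { role: 'asistant', content: "Click 'Forgot password' on the login page." },
  ],
};
console.log(checkChatExample(typoExample));
```

Run it over each parsed line and report the line number alongside the problems so the offending example is easy to find.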

3. Anthropic fine-tuning dataset

This follows Anthropic's Messages API format. The system prompt is a top-level string, not a message; leaving it inside messages is the most common mistake when migrating from OpenAI.

{"system":"You are a helpful customer-support agent.","messages":[{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Click 'Forgot password' on the login page, enter your email, and follow the link we send. The link expires in 30 minutes."}]}
{"system":"You are a helpful customer-support agent.","messages":[{"role":"user","content":"Where can I download my invoices?"},{"role":"assistant","content":"Open Settings → Billing → Invoices."}]}

Conventions: only user and assistant roles inside messages; alternation must be strict (no two user turns in a row); content can be a string or an array of content blocks (for multi-modal).

Validate it: Anthropic fine-tune validator.
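Migrating an OpenAI-style line to this shape mostly means hoisting the system message. The sketch below is one way to do that, under two assumptions of ours: content is a plain string, and a conversation starts with a user turn. Both function names are hypothetical.

```javascript
// Move system message(s) out of the messages array into a top-level string.
function openAiToAnthropicShape(example) {
  const systemParts = [];
  const messages = [];
  for (const m of example.messages) {
    if (m.role === 'system') systemParts.push(m.content);
    else messages.push({ role: m.role, content: m.content });
  }
  return systemParts.length > 0
    ? { system: systemParts.join('\n'), messages }
    : { messages };
}

// Strict alternation check: user, assistant, user, assistant, ...
// (assumes the first turn is a user turn).
function alternates(messages) {
  return messages.every((m, i) => m.role === (i % 2 === 0 ? 'user' : 'assistant'));
}

const openAiLine = {
  messages: [
    { role: 'system', content: 'You are a helpful customer-support agent.' },
    { role: 'user', content: 'Where can I download my invoices?' },
    { role: 'assistant', content: 'Open Settings → Billing → Invoices.' },
  ],
};

const converted = openAiToAnthropicShape(openAiLine);
console.log(JSON.stringify(converted));
console.log(alternates(converted.messages));
```

Checking alternation after the conversion is deliberate: removing a system turn from the middle of a conversation can leave two user turns adjacent.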

4. Google Gemini fine-tuning

Gemini puts each turn's content in a parts array and uses the role model where other providers use assistant.

{"contents":[{"role":"user","parts":[{"text":"Summarize the French Revolution in one sentence."}]},{"role":"model","parts":[{"text":"The French Revolution (1789–1799) overthrew the monarchy, redistributed power and land, and gave rise to Napoleonic France."}]}]}
{"contents":[{"role":"user","parts":[{"text":"What's the capital of Bhutan?"}]},{"role":"model","parts":[{"text":"Thimphu."}]}]}

Gotchas: role is model, not assistant; content lives under parts (an array, because Gemini supports multi-modal blocks); system instructions are typically a separate system_instruction field, not a turn.

Validate it: Gemini fine-tune validator.
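A small shape check catches lines that were copied from another provider's format without renaming roles. This sketch checks only the two gotchas named above (role names and the parts array); checkGeminiExample is a hypothetical name, and the real tuning service may enforce more.

```javascript
// Report shape problems in one parsed Gemini-style example.
function checkGeminiExample(example) {
  if (!Array.isArray(example.contents)) return ['contents must be an array'];
  const problems = [];
  example.contents.forEach((turn, i) => {
    if (turn.role !== 'user' && turn.role !== 'model') {
      problems.push(`turn ${i}: role must be "user" or "model", got "${turn.role}"`);
    }
    if (!Array.isArray(turn.parts) || turn.parts.length === 0) {
      problems.push(`turn ${i}: parts must be a non-empty array`);
    }
  });
  return problems;
}

// A turn left in OpenAI style fails both checks.
const halfMigrated = {
  contents: [
    { role: 'user', parts: [{ text: "What's the capital of Bhutan?" }] },
    { role: 'assistant', content: 'Thimphu.' },
  ],
};
console.log(checkGeminiExample(halfMigrated));
```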

5. Llama / ShareGPT format

Used by HuggingFace datasets and most open-weight model training pipelines (Llama, Mistral, Qwen, Yi).

{"conversations":[{"from":"human","value":"Translate 'hello world' to French."},{"from":"gpt","value":"Bonjour le monde."}]}
{"conversations":[{"from":"system","value":"You are a precise translator."},{"from":"human","value":"Translate 'good night' to Japanese."},{"from":"gpt","value":"おやすみなさい (oyasumi nasai)."}]}

Conventions: roles are human / gpt / system (historical, from the ShareGPT dataset); content lives in value; conversations are wrapped in conversations. Validate with the Llama / ShareGPT validator.
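Because the ShareGPT shape differs from the role/content shape only in key and role names, converting between them is a straight mapping. A minimal sketch follows; shareGptToMessages is a hypothetical helper, and it handles only the three roles listed above.

```javascript
// ShareGPT role names mapped to role/content-style role names.
const SHAREGPT_ROLES = new Map([
  ['human', 'user'],
  ['gpt', 'assistant'],
  ['system', 'system'],
]);

// Convert one ShareGPT row into a { messages: [...] } object.
function shareGptToMessages(row) {
  return {
    messages: row.conversations.map((turn) => {
      const role = SHAREGPT_ROLES.get(turn.from);
      if (role === undefined) {
        throw new Error(`unknown ShareGPT role: ${turn.from}`);
      }
      return { role, content: turn.value };
    }),
  };
}

const row = {
  conversations: [
    { from: 'system', value: 'You are a precise translator.' },
    { from: 'human', value: "Translate 'hello world' to French." },
    { from: 'gpt', value: 'Bonjour le monde.' },
  ],
};
console.log(JSON.stringify(shareGptToMessages(row)));
```

Throwing on an unknown role, rather than passing it through, keeps a mislabeled turn from silently entering a training set.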

6. Elasticsearch / OpenSearch _bulk indexing

The _bulk API uses a quirky form of JSONL: each document is preceded by an action line. Two lines per indexed doc.

{"index":{"_index":"products","_id":"1"}}
{"name":"Mechanical keyboard","price":129.99,"stock":42,"tags":["electronics","peripheral"]}
{"index":{"_index":"products","_id":"2"}}
{"name":"USB-C cable, 2m","price":12.99,"stock":315,"tags":["electronics","cable"]}
{"delete":{"_index":"products","_id":"obsolete-99"}}
{"update":{"_index":"products","_id":"1"}}
{"doc":{"price":119.99}}

Conventions: alternating action / document lines; index and update need a following document line, delete does not. Get this wrong and Elasticsearch returns a confusing "missing action" error. The request body must also end with a newline after the last line.

Send it: curl -X POST 'https://es.example/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @bulk.jsonl.
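Generating the body in code avoids most action/document pairing mistakes. The sketch below builds index and delete operations only; buildBulkBody and the id field on each input document are our own conventions for the example, not part of the Elasticsearch API.

```javascript
// Build a _bulk request body: one action line per operation,
// followed by a document line for index operations.
function buildBulkBody(indexName, docs, deleteIds = []) {
  const lines = [];
  for (const { id, ...source } of docs) {
    lines.push(JSON.stringify({ index: { _index: indexName, _id: id } }));
    lines.push(JSON.stringify(source));
  }
  for (const id of deleteIds) {
    // delete takes no document line
    lines.push(JSON.stringify({ delete: { _index: indexName, _id: id } }));
  }
  // The body has to end with a newline.
  return lines.join('\n') + '\n';
}

const body = buildBulkBody(
  'products',
  [
    { id: '1', name: 'Mechanical keyboard', price: 129.99 },
    { id: '2', name: 'USB-C cable, 2m', price: 12.99 },
  ],
  ['obsolete-99'],
);
console.log(body);
```

JSON.stringify never emits raw newlines inside a value, so each operation is guaranteed to stay on a single line.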

7. BigQuery / Snowflake bulk import

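Both warehouses load JSONL natively, one record per line. BigQuery checks each record against the table schema (or a schema file passed at load time). Snowflake parses each line into a VARIANT value, or maps JSON keys to table columns when MATCH_BY_COLUMN_NAME is set.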

{"event_id":"evt_0001","user_id":4287,"event_type":"page_view","page":"/pricing","ts":"2026-05-21T08:00:14Z","session_id":"s_5e7c1a","properties":{"referrer":"google","plan":"trial"}}
{"event_id":"evt_0002","user_id":4287,"event_type":"button_click","page":"/pricing","ts":"2026-05-21T08:00:42Z","session_id":"s_5e7c1a","properties":{"button":"start_trial"}}
{"event_id":"evt_0003","user_id":4287,"event_type":"sign_up_complete","page":"/signup","ts":"2026-05-21T08:01:55Z","session_id":"s_5e7c1a","properties":{"method":"google_oauth"}}

Loading commands:

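# BigQuery (shell)
bq load --source_format=NEWLINE_DELIMITED_JSON \
    analytics.events events.jsonl schema.json

-- Snowflake (SQL); MATCH_BY_COLUMN_NAME maps JSON keys to table columns
COPY INTO analytics.events
FROM @stage/events.jsonl
FILE_FORMAT = (TYPE = JSON)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;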
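Because the warehouse applies one schema to every line, it can pay to check a file for shape drift before loading it. The sketch below compares each record's top-level keys and JSON types with the first record's. findShapeDrift is a hypothetical helper; it does not descend into nested objects such as properties, whose keys legitimately vary in the example above.

```javascript
// Return the 1-indexed line numbers of records whose top-level keys or
// JSON types differ from the first record's.
function findShapeDrift(records) {
  const typeOf = (v) => (v === null ? 'null' : Array.isArray(v) ? 'array' : typeof v);
  const shape = (r) =>
    Object.keys(r).sort().map((k) => `${k}:${typeOf(r[k])}`).join(',');
  const expected = shape(records[0]);
  const drift = [];
  records.forEach((r, i) => {
    if (shape(r) !== expected) drift.push(i + 1);
  });
  return drift;
}

const events = [
  { event_id: 'evt_0001', user_id: 4287, event_type: 'page_view', properties: { referrer: 'google' } },
  { event_id: 'evt_0002', user_id: 4287, event_type: 'button_click', properties: { button: 'start_trial' } },
  { event_id: 'evt_0003', user_id: '4287', event_type: 'sign_up_complete', properties: {} },
];
console.log(findShapeDrift(events));
```

Here the third record is flagged because user_id is a string rather than a number, the kind of mismatch that otherwise surfaces as a load-time type error.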

8. HTTP streaming (NDJSON over chunked transfer)

The canonical pattern for an API that returns many results without making the client wait. Each record ends with \n, but network chunks can split a record at any byte, so the client buffers incoming text and parses only complete lines as they arrive.

HTTP/1.1 200 OK
Content-Type: application/x-ndjson
Transfer-Encoding: chunked

{"id":"r_001","title":"Result one","score":0.94}
{"id":"r_002","title":"Result two","score":0.91}
{"id":"r_003","title":"Result three","score":0.88}
{"id":"r_004","title":"Result four","score":0.86}

JavaScript client pattern:

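// Stream NDJSON from `url`, calling handle(record) for each parsed line.
async function streamNdjson(url, handle) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    const lines = buffer.split('\n');
    buffer = lines.pop();  // last (possibly incomplete) line stays buffered
    for (const line of lines) {
      if (!line.trim()) continue;  // skip blank lines
      handle(JSON.parse(line));
    }
  }
  // Flush a final record that arrived without a trailing newline.
  if (buffer.trim()) handle(JSON.parse(buffer));
}

await streamNdjson('/search?q=jsonl', (record) => console.log(record));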

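The same buffering pattern works for any server that pushes results as a stream rather than a single big response. OpenAI / Anthropic streaming completions are a close relative: they use Server-Sent Events (text/event-stream), where each JSON payload sits on a line prefixed with data:, so strip that prefix before parsing.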
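One caveat when applying this to LLM completion streams: they are delivered as Server-Sent Events, so each JSON payload arrives on a line starting with data: rather than as a bare NDJSON line. The sketch below extracts those payloads from already-buffered text. It is a simplification that treats every data: line as one complete JSON document and ignores multi-line data fields; parseSseData and the delta field in the sample are hypothetical.

```javascript
// Extract JSON payloads from Server-Sent Events text.
function parseSseData(text) {
  const payloads = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.replace(/\r$/, ''); // tolerate CRLF line endings
    if (!line.startsWith('data:')) continue; // skip event:, id:, comments, blanks
    const data = line.slice('data:'.length).trim();
    // "[DONE]" is an end-of-stream sentinel used by some APIs, not JSON.
    if (data === '' || data === '[DONE]') continue;
    payloads.push(JSON.parse(data));
  }
  return payloads;
}

const sse = [
  'event: message',
  'data: {"delta":"Hel"}',
  '',
  'data: {"delta":"lo"}',
  '',
  'data: [DONE]',
  '',
].join('\n');

console.log(parseSseData(sse).map((p) => p.delta).join(''));
```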

9. Event sourcing / append-only log

Every state change in the system is appended as a JSONL record. The current state is the fold of all events.

{"event_id":"evt_0001","aggregate":"order","aggregate_id":"o_42","type":"OrderCreated","ts":"2026-05-21T10:00:00Z","data":{"customer_id":"c_7","items":[{"sku":"SKU-A","qty":2}]}}
{"event_id":"evt_0002","aggregate":"order","aggregate_id":"o_42","type":"ItemAdded","ts":"2026-05-21T10:01:14Z","data":{"sku":"SKU-B","qty":1}}
{"event_id":"evt_0003","aggregate":"order","aggregate_id":"o_42","type":"OrderPaid","ts":"2026-05-21T10:05:42Z","data":{"amount":54.99,"currency":"USD","method":"card_visa"}}
{"event_id":"evt_0004","aggregate":"order","aggregate_id":"o_42","type":"OrderShipped","ts":"2026-05-22T09:15:00Z","data":{"carrier":"DHL","tracking":"JD0001AB"}}

Why JSONL: events are appended over time, never edited; one bad event shouldn't break the whole log; replaying the log to rebuild state requires streaming, not random access.
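To make "the fold of all events" concrete, the sketch below replays the four order events above with a reducer. The state shape (status, items, paid, tracking) is our own choice for illustration; the event log format does not prescribe one.

```javascript
// Apply one event to the current order state and return the new state.
function applyEvent(order, event) {
  switch (event.type) {
    case 'OrderCreated':
      return { id: event.aggregate_id, status: 'created', items: [...event.data.items] };
    case 'ItemAdded':
      return { ...order, items: [...order.items, event.data] };
    case 'OrderPaid':
      return { ...order, status: 'paid', paid: event.data.amount };
    case 'OrderShipped':
      return { ...order, status: 'shipped', tracking: event.data.tracking };
    default:
      return order; // ignore event types this reducer does not know
  }
}

// Rebuild state by folding every line of the log, in order.
function replay(jsonl) {
  return jsonl
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line))
    .reduce((state, event) => applyEvent(state, event), null);
}

const eventLog = [
  '{"aggregate_id":"o_42","type":"OrderCreated","data":{"customer_id":"c_7","items":[{"sku":"SKU-A","qty":2}]}}',
  '{"aggregate_id":"o_42","type":"ItemAdded","data":{"sku":"SKU-B","qty":1}}',
  '{"aggregate_id":"o_42","type":"OrderPaid","data":{"amount":54.99,"currency":"USD","method":"card_visa"}}',
  '{"aggregate_id":"o_42","type":"OrderShipped","data":{"carrier":"DHL","tracking":"JD0001AB"}}',
].join('\n');

const order = replay(eventLog);
console.log(order);
```

Each step returns a new object instead of mutating the previous one, so replaying a prefix of the log gives the state as of any earlier point in time.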

10. Telemetry / OpenTelemetry logs export

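OpenTelemetry's Logs Data Model maps cleanly to JSONL: each log record is one line, and trace_id / span_id tie it to the span that emitted it. The field names below are simplified; the data model itself calls them Timestamp, SeverityText, Body, TraceId, SpanId, and Attributes.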

{"timestamp":"2026-05-21T08:00:14.114Z","severity":"INFO","body":"User logged in","trace_id":"d28b71f2","span_id":"5b1f3a","attributes":{"service.name":"auth","user.id":"u_4287","method":"oauth_google"}}
{"timestamp":"2026-05-21T08:00:14.231Z","severity":"INFO","body":"Session created","trace_id":"d28b71f2","span_id":"6c2a4b","attributes":{"service.name":"sessions","session.id":"s_5e7c1a","ttl_s":3600}}
{"timestamp":"2026-05-21T08:00:15.001Z","severity":"WARN","body":"Rate limit warning","trace_id":"d28b71f2","span_id":"7d3b5c","attributes":{"service.name":"api","limit":"100/min","current":94}}
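A common first query on such a file is "everything at WARN or above". The sketch below ranks the severity strings using the lowest SeverityNumber of each range in the OpenTelemetry data model (TRACE 1, DEBUG 5, INFO 9, WARN 13, ERROR 17, FATAL 21). It assumes the simplified severity field used in the lines above, and atLeast is a hypothetical helper.

```javascript
// Lowest SeverityNumber of each named range in the OpenTelemetry logs data model.
const SEVERITY_NUMBER = new Map([
  ['TRACE', 1], ['DEBUG', 5], ['INFO', 9],
  ['WARN', 13], ['ERROR', 17], ['FATAL', 21],
]);

// Keep records whose severity is at or above minSeverity.
function atLeast(records, minSeverity) {
  const min = SEVERITY_NUMBER.get(minSeverity);
  if (min === undefined) throw new Error(`unknown severity: ${minSeverity}`);
  // Records with an unrecognised severity rank as 0 and are filtered out.
  return records.filter((r) => (SEVERITY_NUMBER.get(r.severity) ?? 0) >= min);
}

const records = [
  { severity: 'INFO', body: 'User logged in' },
  { severity: 'INFO', body: 'Session created' },
  { severity: 'WARN', body: 'Rate limit warning' },
];
console.log(atLeast(records, 'WARN').map((r) => r.body));
```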

11. HuggingFace dataset row format

A typical HF dataset shipped as train.jsonl. Each row is one training example with whatever fields the task needs.

{"text":"Inception is a 2010 science fiction film written and directed by Christopher Nolan.","label":"film","meta":{"source":"wikipedia","year":2010}}
{"text":"The Suez Canal is an artificial sea-level waterway in Egypt connecting the Mediterranean to the Red Sea.","label":"geography","meta":{"source":"wikipedia","year":1869}}
{"text":"Ada Lovelace is regarded as the first computer programmer.","label":"person","meta":{"source":"wikipedia","year":1843}}

Tools to try with these examples

Paste any of the snippets above into:

Viewer — see records laid out with tree view and field-by-field inspection
Validator — confirm every line is well-formed
Schema inferrer — auto-generate a JSON Schema describing the dataset's shape
jq query — slice, filter, reshape with jq syntax
JSONL → CSV — flatten for spreadsheet use
Anonymizer — redact emails, IPs, tokens, and other PII before sharing
Token counter — estimate fine-tuning cost across providers

Downloadable sample files

The Sample datasets page has downloadable JSONL files in each of these shapes — logs, OpenAI fine-tune format, ShareGPT, ndjson event streams — so you can experiment with the tools without finding your own data first.

FAQ

Why is OpenAI's format different from Anthropic's?

Each provider designed its own training schema before any standard emerged. OpenAI puts the system prompt inside the messages array; Anthropic makes it a top-level string. Practically, you need a provider-specific validator before uploading. We have one per provider (OpenAI, Anthropic, Gemini, Llama, Mistral).

What's the difference between application/x-ndjson and application/jsonl as MIME types?

Functionally none. application/x-ndjson is the older, more widely-recognised type and what most HTTP clients understand out of the box. application/jsonl is gaining ground in newer APIs. Some clients treat both as opaque — set whichever your downstream expects.

Are line numbers in error messages 0-indexed or 1-indexed?

Most JSONL tools (including ours, jq, and jsonlines.org's references) report 1-indexed line numbers, matching what editors and grep -n show. The first record is line 1. Always check each tool's convention when comparing reports across systems.
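If you write your own checker, following the same convention keeps its reports comparable. A minimal sketch of a validator that reports 1-indexed line numbers is shown below; validateJsonl is a hypothetical name, and treating blank lines as acceptable is a choice made here, not a rule of the format.

```javascript
// Validate JSONL text and report problems with 1-indexed line numbers.
function validateJsonl(text) {
  const errors = [];
  text.split('\n').forEach((line, i) => {
    if (line.trim() === '') return; // tolerate blank lines, e.g. after a trailing newline
    try {
      JSON.parse(line);
    } catch (err) {
      errors.push({ line: i + 1, message: err.message });
    }
  });
  return errors;
}

const sample = '{"a":1}\n{"a":\n{"a":3}\n';
console.log(validateJsonl(sample).map((e) => e.line));
```

The array index from split is 0-based, so the + 1 is the only thing separating this from a tool whose reports are off by one relative to an editor.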

-- S.