TXT to JSONL Converter
TXT to JSONL converter. Turn a plain .txt file into JSONL — one {"text": "…"} object per record. Split by paragraph, line, sentence, or fixed chunks. Built for LLM fine-tune corpora and RAG ingestion. Up to 1 GB, runs in your browser, nothing uploaded.
100% client-side. Your text stays in your browser.
Convert
TXT to JSONL Converter
Plain-text in, JSONL out. Pick how you want the file sliced — paragraph, line, sentence, fixed-size chunk — and every piece becomes one {"text": "…"} record on its own line. Ready for OpenAI / Anthropic fine-tune uploads, HuggingFace datasets, or a vector-DB bulk import.
Before you start
You need a plain-text file or a snippet you can paste — anything from a transcribed interview to an Apache log. The tool doesn't care about encoding as long as your browser can read it (UTF-8 is safest; Windows-1252 generally works too). There's no header row like CSV: the file is interpreted byte-for-byte as one stream that gets sliced.
Decide upfront what one record means for your data. For a book or article, a paragraph is usually right. For a list of one-liners (prompts, queries, log lines), one line per record makes sense. For a long-form document headed to a RAG index, fixed-size chunks of 300–800 characters are typical. The split mode is the most important setting on the page — everything else is a tweak.
Why convert TXT to JSONL?
Three patterns drive almost every visit to this page:
- Fine-tune corpora. OpenAI, HuggingFace, and the open-source training scripts (axolotl, llama-factory) all consume JSONL where each line is one training example. If your source is a blob of prose, you split it into paragraphs and wrap each one as
{"text": "…"}. - RAG / embeddings ingestion. Pinecone, Weaviate, Qdrant, and pgvector all bulk-load NDJSON. You chunk your document into 300–800 char pieces, optionally with overlap, and produce a JSONL where each line gets embedded.
- Annotation prep. Label Studio, Argilla, and Prodigy expect one item per JSON line. Plain text → JSONL is the missing step between "I have a doc" and "I have a project I can label."
How to use it
- Paste your text into the Input pane, or drop a
.txt/.md/.logfile. - Pick Split by — the most important choice. Paragraph (blank line) is the default and matches what most fine-tune corpora want.
- Set the Field key if your downstream tool wants something other than
text. Common alternatives:prompt,content,input. - If you picked every N characters or every N words, set the Chunk size. If you picked custom regex, paste a JavaScript regex into Separator regex (it's used as the argument to
String.split). - Decide whether to trim each record (strips leading/trailing whitespace), whether to skip empty records, and whether to add an id field (handy if you'll later re-order or sample and want to trace records back).
- Click Convert. Copy or Download .jsonl.
Example: the canonical case
Input (two paragraphs separated by a blank line):
A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so.
These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.
Output (split=paragraph, key=text):
{"text":"A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text":"These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}
That's the most-asked TXT→JSONL question on Stack Overflow, solved in a click.
Split modes
Paragraph (default)
Splits on one or more blank lines (a "blank line" is any line that's empty after trim). Internal line breaks inside a paragraph are preserved or collapsed to spaces depending on trim whitespace. This is the right default for prose corpora: articles, transcripts, books, emails.
Line
Every \n ends a record. Use this when your input is already one-thought-per-line: chat logs, prompt lists, search queries, log lines, scraped headlines.
Sentence
Splits on . ! ? followed by whitespace. The sentence-ending punctuation stays attached to the preceding sentence. Caveats: this is a regex, not an NLP sentence segmenter, so abbreviations like "Dr. Smith" or "e.g. this" will produce false splits. Good enough for clean prose, not for legal or scientific writing dense with abbreviations.
Every N characters / Every N words
Greedy fixed-size chunking. Every N characters takes the next N characters until the file is consumed. Every N words does the same with whitespace-delimited words. Use these for RAG when you want predictable chunk sizes for embedding budgets. Neither mode is sentence-aware — they will cut mid-sentence — which is fine for vector search but bad for fine-tune training.
Custom regex
Whatever pattern you paste is used as the split delimiter. \n---\n is the default (Markdown horizontal rules); \f for form-feed separators is also common in old-school text dumps. The regex is JavaScript-flavored.
Whole file
The entire input becomes one record: {"text": "<entire file>"}. Use this when you're batching one document per file into a multi-file JSONL via shell concatenation.
Options explained
Field key
The JSON key that holds each text chunk. text is the convention (HuggingFace datasets, OpenAI completion fine-tunes, most embeddings APIs). Switch to prompt if you're building a prompt list, or content for Anthropic-style message bodies.
Trim whitespace
Strips leading/trailing whitespace from each record. With paragraph mode, this also collapses the internal single-newline-inside-a-paragraph into a space, which is what you almost always want — most fine-tune trainers treat a paragraph as one continuous string.
Skip empty
After splitting (and after trimming if enabled), records that are empty get dropped. Turn this off only if a downstream consumer expects exactly N records and you want padding preserved.
Add id field
Adds a "id": N integer to every record, counting from 1. Useful if you'll later sort, sample, or filter and want to trace each surviving record back to the original document position.
Fine-tune workflow
If you're prepping a completion-style fine-tune (raw prose, no chat structure), the recipe is short:
- Put your corpus in a single
.txtfile, paragraphs separated by blank lines. - Paste here, split by paragraph, key =
text, trim + skip empty on. Convert and download. - Train with axolotl / llama-factory / Hugging Face
Trainer— they all accept{"text": "…"}as the canonical "raw-text" record.
For chat-style fine-tunes (OpenAI messages, Anthropic), TXT is the wrong starting point — you need structured turns. Use CSV → JSONL with a user,assistant sheet, then validate with the OpenAI Fine Tune Validator.
RAG / embeddings workflow
- Drop in your document.
- Split by every N characters, 500–800 is the sweet spot for OpenAI
text-embedding-3-small(≈ 200 tokens per chunk). - Turn on add id field so you can map a search hit back to a document offset.
- Download and feed into your vector DB's bulk import.
If you need overlap between chunks, do it in a small post-processing pass — most vector DB SDKs ship a chunker with overlap; this tool deliberately stays simple.
Tips & common pitfalls
- Paragraph mode and Windows line endings. CRLF line endings are normalised to LF before splitting, so
\r\n\r\nworks as a paragraph break. - Sentence mode and abbreviations. "Dr. Smith said the U.S. economy…" produces three records, not one. If that bites you, use paragraph or line mode and accept the larger chunks.
- Chunking by characters cuts mid-word. By design — predictable chunk size matters more than word boundaries for vector search. If you need word boundaries, switch to every N words.
- Custom regex is JavaScript regex. No
\A,\Z, or look-behind groups in older browsers. Test in your console first. - Field key collides with id. If you set add id field and pick
idas the field key, the id wins (added second). Don't do this — pick a different key. - Big files. The whole file is loaded into a JS string then split — about 200 MB is comfortable, beyond that use a CLI pipeline like
awk 'BEGIN{RS=""} {gsub(/\n/, " "); print "{\"text\":" json_escape($0) "}"}' file.txt.
Troubleshooting
I picked paragraph mode but got one giant record.
Your file probably doesn't have blank lines between paragraphs — it might be hard-wrapped to 80 columns with single newlines between every line. Switch to line mode, or pre-process with fmt -w 99999 to merge the wraps.
Sentence mode is splitting on "Dr." and "e.g."
That's a limitation of regex-based sentence splitting. For research-quality sentence segmentation, run your text through spaCy's sents pipeline first, write one sentence per line, then use line mode here.
Output records have stray backslash-n inside them.
Those are real newlines preserved inside each record (they show up as \n in the JSON string because that's how JSON escapes a newline). Turn on trim whitespace to collapse them into spaces.
How do I keep the original line numbers?
Turn on add id field. The id is the 1-based index of the record after splitting; it isn't the source line number per se, but combined with the chosen split mode it's enough to map back to the source.
Related tools
See also: Formatter to pretty-print the result, Viewer to scan the records, or OpenAI Fine Tune Validator if you're prepping a chat-completion dataset (you'll need to reshape from text to messages first).
Frequently asked questions
What's the equivalent Python one-liner?
For paragraph splitting:
import json
text = open("input.txt", encoding="utf8").read()
with open("output.jsonl", "w", encoding="utf8") as f:
for para in text.split("\n\n"):
para = para.strip()
if para:
f.write(json.dumps({"text": para}) + "\n")
That's the snippet I'd give as an answer to the canonical Stack Overflow question — this tool just spares you the script.
Can I use a different field name than text?
Yes — type whatever you want into the Field key input. prompt, content, input, body are all valid JSON keys.
Will the output be valid JSON?
Each line is valid JSON. The file as a whole is JSONL (also called NDJSON) — not a single JSON document. If you need a JSON array, run the result through JSONL to JSON.
How are special characters escaped?
The standard JSON.stringify rules: quotes become \", backslashes \\, newlines \n, tabs \t, and control characters get \uXXXX escapes. The output is always safe to parse with any JSON parser.
Is my text uploaded?
No — the conversion runs entirely in your browser. Good for sensitive corpora, drafts you haven't published, or internal documents.