Deduplicate Fine-Tune Prompts in JSONL
Fine-tune prompt deduplicator. Find duplicate user prompts in an OpenAI, Anthropic or Gemini fine-tune JSONL. The same prompt with different assistant outputs is a data-quality issue. Browser-only.
100% client-side. No upload.
Dedupe
System-prompt Deduplicator
Find rows in a fine-tune dataset that share the same user prompt but produce different assistant outputs — a quiet but expensive data-quality bug. Two examples with "Translate Hello to French" mapping to Hola in one row and Bonjour in another teach the model to be inconsistent. This tool finds them, groups them, and lets you keep the first, the last, or just report and leave the file unchanged. Works for OpenAI, Anthropic, Gemini, and ShareGPT shapes. 100% in-browser.
What "duplicate" means here
This is not whole-line deduplication — for that, use the JSONL Deduplicator. This tool dedupes on the prompt side of each example, ignoring assistant outputs. Three ways to define "same prompt":
- First user turn — the most common signal of duplication for single-turn fine-tunes.
- All user turns concatenated — for multi-turn data, considers the whole user-side conversation.
- System prompt + first user turn — when the same user prompt under different system prompts is intentional.
Normalization
Tiny whitespace differences will hide otherwise-identical prompts. Trim & collapse normalizes runs of whitespace to a single space and strips ends — the safest default. Lowercase goes further and is useful if you have inconsistent casing across rows.
Report mode
Sometimes you want to see what's duplicated before deciding to drop it — maybe the duplicates are intentional (e.g. paraphrased outputs for diversity). Pick Report only to get a list of duplicate groups with line numbers and their assistant outputs, without mutating the file.
Tips & common pitfalls
- Don't dedupe paraphrase datasets. If your training plan deliberately has multiple outputs per prompt for diversity, this tool will flatten that. Pick "Report only" to inspect first.
- System prompt collisions. If many of your rows share the same system prompt, "System + first user" reduces to "first user" in effect.
- Check token spend before/after. Pipe the deduped file through the token counter to see how many tokens you saved.
Before you start
Use this on a chat-format fine-tune JSONL (OpenAI / Anthropic / Gemini shape). It looks for duplicate user prompts — same input, different assistant output — which is a data-quality issue: the model sees conflicting targets and learns nothing useful.
How to use it
- Drop your fine-tune JSONL or paste it.
- Pick a normalisation strategy (whitespace-normalised exact match is the default).
- Click Find duplicates. The report shows each group of duplicate prompts and how many different assistant answers exist for each.
- Optional: click Drop duplicates to keep only the first occurrence of each prompt, then download.
Why this matters
When the same prompt has two different "correct" answers in your training set, the model averages the gradients — you pay tokens for noise. Worse, if your eval set leaks into your train set (same prompt in both), eval numbers are inflated. The deduper finds both problems.
Tips & common pitfalls
- Inspect before dropping. Sometimes "same prompt different answer" is intentional (a sampling-temperature dataset). Read the report before bulk-deleting.
- Run this before train/val/test split, otherwise duplicates get spread across the splits and leak.
- Combine with the matching OpenAI / Anthropic / Gemini validator as a pre-flight before paying for a training run.
Frequently asked questions
What counts as "the user prompt"?
The last user-turn content (or the concatenation of all user content, depending on the strategy). System prompts are handled separately by configuration.
Does it treat case differences as duplicates?
Yes when whitespace-normalised matching is selected; case-sensitive matching is also available.