System-prompt Deduplicator
100% client-side. No upload.
Find rows in a fine-tune dataset that share the same user prompt but produce different assistant outputs, a quiet but expensive data-quality bug. For example, if "Translate Hello to French" maps to Hola in one row and Bonjour in another, the model is trained on contradictory targets and learns to be inconsistent. This tool finds such rows, groups them by prompt, and lets you keep the first row in each group, keep the last, or just report the groups and leave the file unchanged. It works with OpenAI, Anthropic, Gemini, and ShareGPT shapes, and runs entirely in the browser.
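The keep-first and keep-last choices can be pictured with a minimal Python sketch. This is not the tool's own code. It assumes OpenAI-style rows with a `messages` list and plain-string `content`, matches on the first user message only, and collapses every group that shares a prompt whether or not the outputs differ. The names `first_user_text` and `resolve_duplicates` are hypothetical.

```python
import json

def first_user_text(row):
    """Return the first user message's content, or None if the row has no user turn.

    Assumes an OpenAI-style row: {"messages": [{"role": ..., "content": ...}, ...]}.
    """
    for msg in row.get("messages", []):
        if msg.get("role") == "user":
            return msg.get("content", "")
    return None

def resolve_duplicates(lines, keep="first"):
    """Keep one JSONL line per prompt; `keep` is "first" or "last".

    Assumes every line is a non-empty JSON object. Rows without a user
    turn are passed through untouched.
    """
    chosen = {}        # prompt -> index of the line to keep
    passthrough = []   # indices of rows that have no user turn
    for i, line in enumerate(lines):
        prompt = first_user_text(json.loads(line))
        if prompt is None:
            passthrough.append(i)
        elif keep == "last" or prompt not in chosen:
            # "first": only the first sighting is recorded.
            # "last": every later sighting overwrites the earlier one.
            chosen[prompt] = i
    kept = sorted(list(chosen.values()) + passthrough)
    return [lines[i] for i in kept]
```

Surviving rows stay in their original file order in this sketch, so a keep-last pass does not reshuffle the dataset.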
What "duplicate" means here
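This is not whole-line deduplication; for that, use the JSONL Deduplicator. This tool matches rows on the prompt side of each example only. Assistant outputs are not part of the match key, so two rows with the same prompt and different outputs land in the same group. There are three ways to define "same prompt":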
- First user turn — the most common signal of duplication for single-turn fine-tunes.
- All user turns concatenated — for multi-turn data, considers the whole user-side conversation.
- System prompt + first user turn — when the same user prompt under different system prompts is intentional.
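As a rough illustration, the three definitions could be expressed as three ways of building a match key. The sketch below is an assumption, not the tool's implementation: it handles only OpenAI-style `messages` rows with string `content`, the mode names are hypothetical, and it returns tuples instead of a concatenated string so that turn boundaries cannot collide (for example `["ab", "c"]` versus `["a", "bc"]`).

```python
def prompt_key(row, mode="first_user"):
    """Build a hashable match key for one example, or None if it has no user turn.

    Hypothetical modes:
      "first_user"        -> first user turn only
      "all_user"          -> every user turn, in order
      "system_first_user" -> first system prompt plus first user turn
    """
    messages = row.get("messages", [])
    users = [m.get("content", "") for m in messages if m.get("role") == "user"]
    systems = [m.get("content", "") for m in messages if m.get("role") == "system"]
    if not users:
        return None
    if mode == "first_user":
        return (users[0],)
    if mode == "all_user":
        return tuple(users)
    if mode == "system_first_user":
        # A row without a system prompt is keyed with an empty string.
        return (systems[0] if systems else "", users[0])
    raise ValueError(f"unknown mode: {mode}")
```

Two rows count as having the same prompt when their keys are equal, so the same grouping code can serve all three modes.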
Normalization
Tiny whitespace differences can hide otherwise-identical prompts. Trim & collapse replaces each run of whitespace with a single space and strips leading and trailing whitespace; it is the safest default. Lowercase additionally folds case, which helps when casing is inconsistent across rows but also merges prompts that differ only in case.
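The two normalization levels can be sketched in a few lines of Python. The function name `normalize` and its `lowercase` flag are hypothetical, and the tool's exact whitespace rules may differ from Python's definition of whitespace.

```python
def normalize(text, lowercase=False):
    """Trim & collapse: split on any whitespace run and rejoin with single spaces.

    With lowercase=True the result is also lowercased before comparison.
    """
    collapsed = " ".join(text.split())
    return collapsed.lower() if lowercase else collapsed
```

Normalization is applied to the match key only. The original row text would be written back unchanged, so the training data keeps its own spacing and casing.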
Report mode
Sometimes you want to see what's duplicated before deciding to drop it — maybe the duplicates are intentional (e.g. paraphrased outputs for diversity). Pick Report only to get a list of duplicate groups with line numbers and their assistant outputs, without mutating the file.
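A report of this kind can be approximated offline with a short Python sketch. It is an assumption-laden illustration, not the tool's code: rows are OpenAI-style `messages` objects with string `content`, the key is the first user message, the output is the first assistant message, and the name `report_conflicts` is hypothetical.

```python
import json
from collections import defaultdict

def report_conflicts(lines):
    """Return {prompt: [(line_number, assistant_output), ...]} for conflicting groups.

    A group is reported only when it contains at least two distinct outputs.
    Line numbers are 1-based. The input is never modified.
    """
    groups = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        messages = json.loads(line).get("messages", [])
        prompt = next((m.get("content") for m in messages if m.get("role") == "user"), None)
        output = next((m.get("content") for m in messages if m.get("role") == "assistant"), None)
        if prompt is None:
            continue  # no user turn, nothing to match on
        groups[prompt].append((line_no, output))
    return {
        prompt: entries
        for prompt, entries in groups.items()
        if len({output for _, output in entries}) > 1
    }
```

Reading the line numbers next to each output makes it easier to judge whether a group is a labeling mistake or deliberate variation before removing anything.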
Tips & common pitfalls
- Don't dedupe paraphrase datasets. If your training plan deliberately has multiple outputs per prompt for diversity, this tool will flatten that. Pick "Report only" to inspect first.
- System prompt collisions. If every row shares the same system prompt, "System prompt + first user turn" produces the same groups as "First user turn". The system prompt only changes the result where it actually varies between rows.
- Check token spend before/after. Pipe the deduped file through the token counter to see how many tokens you saved.