jsonlkit.com
JSONL (JSON Lines) utilities, in the browser
Say hi →

Deduplicate Fine-Tune Prompts in JSONL

updated 16 May 2026 · for fine-tune dataset hygiene

Fine-tune prompt deduplicator. Find duplicate user prompts in an OpenAI, Anthropic or Gemini fine-tune JSONL. The same prompt with different assistant outputs is a data-quality issue. Browser-only.

100% client-side. No upload.

Dedupe

Drop a .jsonl file here, or

System-prompt Deduplicator

Find rows in a fine-tune dataset that share the same user prompt but produce different assistant outputs — a quiet but expensive data-quality bug. Two examples with "Translate Hello to French" mapping to Hola in one row and Bonjour in another teach the model to be inconsistent. This tool finds them, groups them, and lets you keep the first, the last, or just report and leave the file unchanged. Works for OpenAI, Anthropic, Gemini, and ShareGPT shapes. 100% in-browser.

What "duplicate" means here

This is not whole-line deduplication — for that, use the JSONL Deduplicator. This tool dedupes on the prompt side of each example, ignoring assistant outputs. Three ways to define "same prompt":

Normalization

Tiny whitespace differences will hide otherwise-identical prompts. Trim & collapse normalizes runs of whitespace to a single space and strips ends — the safest default. Lowercase goes further and is useful if you have inconsistent casing across rows.

Report mode

Sometimes you want to see what's duplicated before deciding to drop it — maybe the duplicates are intentional (e.g. paraphrased outputs for diversity). Pick Report only to get a list of duplicate groups with line numbers and their assistant outputs, without mutating the file.

Tips & common pitfalls

Before you start

Use this on a chat-format fine-tune JSONL (OpenAI / Anthropic / Gemini shape). It looks for duplicate user prompts — same input, different assistant output — which is a data-quality issue: the model sees conflicting targets and learns nothing useful.

How to use it

  1. Drop your fine-tune JSONL or paste it.
  2. Pick a normalisation strategy (whitespace-normalised exact match is the default).
  3. Click Find duplicates. The report shows each group of duplicate prompts and how many different assistant answers exist for each.
  4. Optional: click Drop duplicates to keep only the first occurrence of each prompt, then download.

Why this matters

When the same prompt has two different "correct" answers in your training set, the model averages the gradients — you pay tokens for noise. Worse, if your eval set leaks into your train set (same prompt in both), eval numbers are inflated. The deduper finds both problems.

Tips & common pitfalls

Frequently asked questions

What counts as "the user prompt"?

The last user-turn content (or the concatenation of all user content, depending on the strategy). System prompts are handled separately by configuration.

Does it treat case differences as duplicates?

Yes when whitespace-normalised matching is selected; case-sensitive matching is also available.

Related tools