How to Remove Duplicate Lines from Text, CSV, and Log Files

Q: How do I remove duplicate lines but keep the original order?

Most command-line tools like sort -u will alphabetize your lines, changing the order. To preserve order, use awk '!seen[$0]++' on Linux/Mac, or paste your text into Article Formatter's Remove Duplicate Lines tool, which keeps the first occurrence of each line in its original position.

You have a text file with 500 lines and you know at least half of them are duplicates. Maybe it's a list of email addresses exported from two different systems. Maybe it's server log entries where the same error fires every 30 seconds. Maybe you merged two CSV files and now the data has overlapping rows.

Whatever the source, the fix is the same: strip out the repeated lines and keep one copy of each. This sounds simple, and it usually is - but there are a few decisions to make. Do you want to keep the first occurrence or the last? Should "Hello" and "hello" count as the same line? Does whitespace matter? And do you care about preserving the original order?

Here's every practical way to remove duplicate lines, from the fastest browser-based option to the command-line tools that handle million-line files without flinching.

The Quick Way: Paste and Click

If you have a chunk of text with duplicates and just want it cleaned up right now, the Remove Duplicate Lines tool does it in your browser. No account, no download, nothing leaves your machine.

How it works:

1. Paste your text into the input area
2. Choose whether to preserve original order or sort the results
3. Click Remove Duplicates
4. Copy the deduplicated output

It handles thousands of lines instantly and keeps the first occurrence of each unique line by default.

This works best for quick, one-off tasks - cleaning up a mailing list, deduplicating a list of URLs, trimming a set of keywords. If you need to process files programmatically or handle very large datasets, the methods below give you more control.

Where Duplicate Lines Come From

Before jumping into solutions, it helps to know why duplicates appear in the first place. The source affects which deduplication method works best.

Merged data exports

You exported contacts from Mailchimp and also from HubSpot. You combined them into one file. Now "[email protected]" appears three times because she was in both systems and one of them had her twice. This is the most common source of duplicates in CSV data.

Log file repetition

A failing API call retries every 10 seconds and writes the same error message each time. After a few hours, your error log has 800 copies of "Connection refused: api.example.com:443." You need one of those, not 800.

Web scraping results

Your scraper crawled a paginated listing and the last item on page 3 is the same as the first item on page 4. Or the site changed its pagination mid-crawl. Duplicate entries are almost inevitable in scraped datasets.

Copy-paste accumulation

You've been collecting product names, keywords, or notes in a plain text file over weeks. You pasted the same batch twice without realizing it, or you added items individually that were already there. The file grew organically and now it's full of repeats.

Command-Line Methods (Linux, Mac, WSL)

If you're comfortable with the terminal, there are fast, powerful tools built right into your operating system. These scale to very large files - even gigabytes - though in different ways: sort spills to temporary files on disk when the data outgrows memory, while awk reads line by line but keeps one copy of every unique line in memory.

sort -u (simplest, but changes order)

The sort command with the -u (unique) flag is the most common one-liner for deduplication. It sorts lines alphabetically and removes exact duplicates in one step.

# Remove duplicates (output is sorted alphabetically)
sort -u input.txt > output.txt

# Case-insensitive deduplication
sort -uf input.txt > output.txt

# Deduplicate and count occurrences of each line
sort input.txt | uniq -c | sort -rn

The catch: sort -u sorts your output alphabetically. If you need to keep lines in their original order, this isn't the right tool. But if order doesn't matter (like a list of email addresses you plan to import), it's fast and simple.

awk (preserves original order)

This classic awk one-liner is probably the most-used deduplication command among developers. It removes duplicates while keeping lines in the order they first appeared.

# Remove duplicates, preserve order (keep first occurrence)
awk '!seen[$0]++' input.txt > output.txt

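How it works: awk reads each line. The expression !seen[$0]++ looks up how many times this exact line has been seen so far, then increments that count. The first time, the count is 0, so the negation is true and awk prints the line (printing is the default action). On every later occurrence the count is nonzero, so the line is skipped. The $0 represents the entire line. Clean, fast, and limited only by the memory needed to hold one copy of each unique line.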
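For readers who think in Python rather than awk, the same seen-set idea can be sketched as below. This is a minimal illustration of the technique, not a description of what awk does internally, and the function name dedupe_keep_first is made up for this example.

```python
from typing import Iterable, Iterator


def dedupe_keep_first(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance."""
    seen = set()  # plays the role of awk's seen[] array
    for line in lines:
        if line not in seen:  # first time we meet this exact line
            seen.add(line)
            yield line


sample = ["apple", "banana", "apple", "cherry", "banana"]
print(list(dedupe_keep_first(sample)))  # ['apple', 'banana', 'cherry']
```

Because it is a generator, it can be fed an open file object directly. As with the awk version, memory use grows with the number of unique lines.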

For case-insensitive matching, convert to lowercase before comparing:

# Case-insensitive dedup (preserves original casing of first occurrence)
awk '!seen[tolower($0)]++' input.txt > output.txt

uniq (only removes adjacent duplicates)

The uniq command only removes consecutive duplicate lines - it won't catch duplicates separated by other lines. This is a common source of confusion. If your file has:

apple
banana
apple
cherry

Running uniq alone won't remove the second "apple" because there's a "banana" between them. You'd need to sort first, then pipe to uniq:

# Sort then deduplicate (same result as sort -u)
sort input.txt | uniq > output.txt

# Show only lines that appear more than once
sort input.txt | uniq -d

# Show only lines that appear exactly once (unique lines)
sort input.txt | uniq -u

# Count how many times each line appears
sort input.txt | uniq -c

Where uniq really shines is with -c for counting occurrences. Pipe the output through sort -rn to see the most-repeated lines first - great for identifying which log entries fire most often.
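The difference between adjacent-only and whole-file deduplication is easy to see in a few lines of Python. The sketch below uses the standard library's itertools.groupby, which, like uniq, only groups neighbouring items, and collections.Counter as a rough stand-in for sort | uniq -c | sort -rn. The sample list is the fruit example from above.

```python
from collections import Counter
from itertools import groupby

lines = ["apple", "banana", "apple", "cherry"]

# Like plain `uniq`: only runs of identical neighbours are collapsed.
adjacent_only = [key for key, _group in groupby(lines)]
print(adjacent_only)  # ['apple', 'banana', 'apple', 'cherry']

# Like `sort | uniq`: sorting first makes identical lines adjacent.
sorted_unique = [key for key, _group in groupby(sorted(lines))]
print(sorted_unique)  # ['apple', 'banana', 'cherry']

# Like `sort | uniq -c | sort -rn`: most frequent lines first.
counts = Counter(lines).most_common()
print(counts[0])  # ('apple', 2)
```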

Removing Duplicates from CSV Files

CSV files add a wrinkle that plain text doesn't have: structure. A CSV line like Jane,Doe,[email protected] and Jane,Doe,[email protected] are obviously duplicates. But what about Jane,Doe,[email protected] and Jane Marie,Doe,[email protected]? Same email, different first name. Whether those are "duplicates" depends on your use case.

Whole-row duplicates (exact matches)

If you just want to remove rows that are completely identical across all columns, any of the methods above work. Paste your CSV into the Remove Duplicate Lines tool, or use sort -u on the command line. Just remember to handle the header row:

# Keep header, deduplicate body, recombine (body rows come out sorted)
head -n 1 data.csv > output.csv
tail -n +2 data.csv | sort -u >> output.csv
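If the row order matters, a small Python sketch can keep the header and preserve the order of the remaining rows. It treats each row as a raw line, so it assumes no quoted field contains a newline. The function name and the sample rows are invented for illustration.

```python
from typing import Iterable, List


def dedupe_rows_keep_header(lines: Iterable[str]) -> List[str]:
    """Drop repeated body rows; always keep the first line (the header)."""
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:  # empty input
        return []
    result = [header]
    seen = set()
    for row in iterator:
        if row not in seen:  # first time this exact row appears
            seen.add(row)
            result.append(row)
    return result


rows = [
    "first,last,city",
    "Jane,Doe,Austin",
    "Sam,Lee,Boston",
    "Jane,Doe,Austin",
]
print(dedupe_rows_keep_header(rows))
# ['first,last,city', 'Jane,Doe,Austin', 'Sam,Lee,Boston']
```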

Column-based deduplication

When you need to deduplicate based on a specific column (like keeping only the first row for each email address), you need a tool that understands CSV structure. Python handles this well:

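import csv

seen = set()
# newline='' lets the csv module handle line endings itself, including
# newlines inside quoted fields
with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    header = next(reader)
    writer.writerow(header)

    email_col = header.index('email')  # Find column by name (ValueError if absent)

    for row in reader:
        # Compare case-insensitively and ignore surrounding whitespace
        key = row[email_col].lower().strip()
        if key not in seen:
            seen.add(key)
            writer.writerow(row)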

This keeps the first row for each unique email and discards later duplicates. To deduplicate on a different column, change the name passed to header.index(), or set email_col directly to a zero-based column index. For visual inspection of your CSV data before and after, the CSV to Table Converter can show it in a readable table format.
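Sometimes a single column is not enough to decide what counts as a duplicate. As a hedged sketch, the same approach extends to a combination of columns by building a tuple key. The function name dedupe_by_columns, the column names, and the sample data are all made up for this example, and the input is read from a string so the snippet runs on its own.

```python
import csv
import io
from typing import List, Sequence


def dedupe_by_columns(csv_text: str, key_columns: Sequence[str]) -> List[dict]:
    """Keep the first row for each distinct combination of key_columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    kept = []
    for row in reader:
        # Normalise each key field: ignore case and surrounding whitespace.
        key = tuple(row[col].strip().lower() for col in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


data = """first,last,city
Jane,Doe,Austin
JANE,doe,Austin
Jane,Doe,Boston
"""
rows = dedupe_by_columns(data, ["first", "last"])
print([row["city"] for row in rows])  # ['Austin']
```

Adding "city" to key_columns would keep the Boston row as well, which is the kind of decision that depends on your data.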

Spreadsheet method

If you prefer a visual approach, both Excel and Google Sheets have built-in duplicate removal. In Excel, select your data and go to Data > Remove Duplicates. In Google Sheets, use Data > Data cleanup > Remove duplicates. Both let you choose which columns to compare.

The advantage of spreadsheets is that you can preview what will be removed. The disadvantage is that they choke on very large files - anything over 100,000 rows gets slow in Google Sheets, and Excel has a hard limit of 1,048,576 rows per worksheet.

Deduplicating Log Files

Server logs have a specific challenge: each line usually contains a timestamp, so technically every line is "unique" even if the message is identical. You need to strip the timestamp before comparing, then keep the full original line.

# Deduplicate log lines ignoring the first 20 characters (timestamp)
awk '!seen[substr($0, 21)]++' server.log > deduped.log

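# Count unique error messages (ignoring timestamps)
# Assumes the timestamp occupies the first three whitespace-separated fields.
# Every message is printed (no dedup here) so that uniq -c can count repeats.
awk '{$1=$2=$3=""; print substr($0,4)}' \
  server.log | sort | uniq -c | sort -rn | head -20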

The second command is particularly useful for incident investigation. It shows you the 20 most frequent unique log messages, helping you spot the real issues buried in noise. Pipe those cleaned-up lines through Article Formatter if you need to paste them into a report or Slack thread with clean formatting.
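When the timestamp is not a fixed width or a fixed number of fields, a regular expression can be easier to adapt. The sketch below assumes lines that start with a "YYYY-MM-DD HH:MM:SS" timestamp. That layout, the function name top_messages, and the sample log lines are assumptions for illustration only.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed layout: "YYYY-MM-DD HH:MM:SS message". Adjust for your own logs.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+")


def top_messages(lines: Iterable[str], limit: int = 20) -> List[Tuple[str, int]]:
    """Count messages with the leading timestamp removed, most frequent first."""
    counts = Counter(TIMESTAMP.sub("", line.rstrip("\n")) for line in lines)
    return counts.most_common(limit)


log = [
    "2024-01-01 12:00:00 Connection refused: api.example.com:443",
    "2024-01-01 12:00:10 Connection refused: api.example.com:443",
    "2024-01-01 12:00:15 Disk usage at 91%",
    "2024-01-01 12:00:20 Connection refused: api.example.com:443",
]
for message, count in top_messages(log):
    print(count, message)
```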

Handling Near-Duplicates

Exact duplicates are easy. But what about lines that are almost the same? This comes up all the time in real data:

John Smith, [email protected]
John  Smith, [email protected]
john smith, [email protected]
John Smith , [email protected]

These are all "the same" to a human, but different to a computer. To catch them, you need to normalize each line before comparing. Here's a practical approach:

Trim whitespace. Remove leading and trailing spaces from each line, and collapse multiple internal spaces to one. This catches "John  Smith" (two spaces) vs "John Smith" and " John Smith " vs "John Smith".

Normalize case. Convert to lowercase for comparison purposes, but keep the original casing in the output. "John SMITH" and "john smith" become the same key.

Strip punctuation differences. If "John Smith" and "John Smith," (with a trailing comma) should be the same, strip non-alphanumeric characters from the comparison key.

In awk, combining whitespace normalization and case-insensitive matching:

# Normalize whitespace and case for comparison, keep original line
awk '{
  key = tolower($0)
  gsub(/^[[:space:]]+|[[:space:]]+$/, "", key)  # trim
  gsub(/[[:space:]]+/, " ", key)                 # collapse spaces
} !seen[key]++' input.txt > output.txt
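The awk version covers whitespace and case but not the punctuation step. One way to sketch all three normalizations in Python is shown below. The function names comparison_key and dedupe_normalized are hypothetical, and treating every non-word character as punctuation is a deliberately blunt assumption that may be too aggressive for some data.

```python
import re
from typing import Iterable, List


def comparison_key(line: str, strip_punctuation: bool = False) -> str:
    """Build the key used for comparison; the original line is left untouched."""
    key = line.lower()
    if strip_punctuation:
        # Replace anything that is not a letter, digit, underscore or whitespace.
        key = re.sub(r"[^\w\s]", " ", key)
    # split() with no arguments trims both ends and collapses internal runs.
    return " ".join(key.split())


def dedupe_normalized(lines: Iterable[str], strip_punctuation: bool = False) -> List[str]:
    seen = set()
    kept = []
    for line in lines:
        key = comparison_key(line, strip_punctuation)
        if key not in seen:
            seen.add(key)
            kept.append(line)  # keep the first spelling exactly as written
    return kept


names = ["John Smith", "John  Smith", "john smith", " John Smith ", "John Smith,"]
print(dedupe_normalized(names))                          # ['John Smith', 'John Smith,']
print(dedupe_normalized(names, strip_punctuation=True))  # ['John Smith']
```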

Python and JavaScript Solutions

When you need more control than command-line one-liners offer, a short script handles edge cases better. Here are ready-to-use solutions in both languages.

Python

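def remove_duplicates(text, case_sensitive=True, preserve_order=True):
    """Return text with repeated lines removed, keeping the first occurrence."""
    lines = text.splitlines()
    seen = set()
    result = []

    for line in lines:
        # Note: case-insensitive mode also ignores leading/trailing whitespace
        key = line if case_sensitive else line.lower().strip()
        if key not in seen:
            seen.add(key)
            result.append(line)  # keep the original line, not the key

    if not preserve_order:
        result.sort()

    # splitlines() drops line endings, so the output has no trailing newline
    return '\n'.join(result)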

# Usage
with open('input.txt') as f:
    text = f.read()

cleaned = remove_duplicates(text, case_sensitive=False)

with open('output.txt', 'w') as f:
    f.write(cleaned)

JavaScript (Node.js)

const fs = require('fs');

function removeDuplicates(text, caseSensitive = true) {
  const lines = text.split(/\r?\n/);  // accept both Unix and Windows line endings
  const seen = new Set();
  const result = [];

  for (const line of lines) {
    const key = caseSensitive ? line : line.toLowerCase().trim();
    if (!seen.has(key)) {
      seen.add(key);
      result.push(line);
    }
  }

  return result.join('\n');
}

const input = fs.readFileSync('input.txt', 'utf-8');
const output = removeDuplicates(input, false);
fs.writeFileSync('output.txt', output);

Both scripts preserve original line order and keep the first occurrence. For a quick test without writing a script, just use the Remove Duplicate Lines tool in your browser.
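If you would rather keep the last occurrence of each line, one possible approach is to walk the lines backwards and then restore the order. This is a small sketch, and the function name dedupe_keep_last is made up for illustration.

```python
from typing import Iterable, List


def dedupe_keep_last(lines: Iterable[str]) -> List[str]:
    """Keep the final occurrence of each line, in the order those occurrences appear."""
    items = list(lines)
    seen = set()
    kept_reversed = []
    for line in reversed(items):  # walk backwards so the last copy is met first
        if line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    return kept_reversed[::-1]  # restore forward order


print(dedupe_keep_last(["a", "b", "a", "c", "b"]))  # ['a', 'c', 'b']
```

For exact duplicates, first versus last only changes where the surviving line sits. With normalized keys it also changes which spelling of the line survives.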

Performance: What About Huge Files?

For files under 100,000 lines, any method works fine. But when you're dealing with millions of lines - a month of access logs, a full database export, a massive word list - performance matters.

Rough performance comparison (1 million lines, average 50 chars per line):

Method	Time	Memory
`sort -u`	~2 seconds	Low (streams to disk)
`awk '!seen[$0]++'`	~1 second	Moderate (hash in memory)
Python (set-based)	~3 seconds	Moderate (set in memory)
Browser tool	~5 seconds	Limited by browser tab

For truly massive files (10+ million lines), sort -u is the most reliable because it uses temporary files when memory runs low. The awk approach is faster but stores every unique line in memory - if your file has 10 million unique lines averaging 100 characters each, that's roughly 1 GB of RAM for the line data alone, before any hash table overhead.
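One way to reduce the memory cost of order-preserving deduplication is to remember a fixed-size hash of each line instead of the line itself. This is a different technique from the tools above, sketched here under assumptions: the function name is hypothetical, the actual saving depends on line length and interpreter overhead, and two different lines could in principle share a digest (very unlikely with 16 bytes, but not impossible). Measure before relying on it.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_by_digest(lines: Iterable[str]) -> Iterator[str]:
    """Yield first occurrences, remembering a 16-byte digest per unique line."""
    seen = set()
    for line in lines:
        # Store a short fixed-size fingerprint rather than the whole line.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


print(list(dedupe_by_digest(["a", "b", "a", "c"])))  # ['a', 'b', 'c']
```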

Common Mistakes and Edge Cases

A few things that trip people up when removing duplicates:

Trailing newline at end of file

Some files end with a blank line, some don't. If your deduplication tool treats empty strings as lines, you might lose (or gain) a trailing newline. Check the last line of your output if precision matters.
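In Python, the trailing-newline issue shows up as the difference between split('\n') and splitlines(). The sketch below, with a made-up function name, deduplicates a string and ends the output with a newline only if the input did.

```python
def dedupe_text(text: str) -> str:
    """Deduplicate lines and keep the input's trailing-newline status."""
    had_trailing_newline = text.endswith("\n")
    seen = set()
    kept = []
    for line in text.splitlines():  # splitlines() yields no phantom empty last line
        if line not in seen:
            seen.add(line)
            kept.append(line)
    result = "\n".join(kept)
    if had_trailing_newline and kept:
        result += "\n"
    return result


print(repr("a\nb\na\n".split("\n")))   # ['a', 'b', 'a', ''] -- note the extra ''
print(repr("a\nb\na\n".splitlines()))  # ['a', 'b', 'a']
print(repr(dedupe_text("a\nb\na\n")))  # 'a\nb\n'
```

Note that splitlines() also splits on other line boundaries such as \r\n, so the output always uses plain \n.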

Windows vs Unix line endings

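A line ending in \r\n (Windows) and the same line ending in \n (Unix) are different strings to most tools. If your data came from mixed sources, normalize line endings first. With GNU sed: sed -i 's/\r$//' input.txt (the sed that ships with macOS needs -i '' and may not understand \r, so check your version).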
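A rough Python equivalent is a pair of string replacements. Unlike the sed command, this sketch also converts a lone \r to \n. The helper name normalize_newlines is made up for illustration.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "apple\r\nbanana\napple\n"
print(len(set(mixed.split("\n"))))                      # 4: 'apple\r', 'banana', 'apple', ''
print(len(set(normalize_newlines(mixed).split("\n"))))  # 3: 'apple', 'banana', ''
```

When a file is opened in Python's default text mode, \r\n is already translated to \n on read, so this mostly matters for data read in binary mode, with newline='', or received from somewhere other than a file.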

Unicode normalization

The character "é" can be represented as a single Unicode code point (U+00E9) or as "e" followed by a combining acute accent (U+0301) - two different byte sequences that look identical on screen. If your data contains accented characters from mixed sources, consider normalizing to NFC form before deduplication. In Python: import unicodedata; key = unicodedata.normalize('NFC', line)
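A short, self-contained demonstration of the problem and the fix, using only the standard library. The word "café" is just a sample value, and nfc_key is a hypothetical helper name.

```python
import unicodedata

composed = "caf\u00e9"     # é as one code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by combining acute accent (U+0301)

print(composed == decomposed)  # False: different code point sequences
print(
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)  # True once both are in NFC


def nfc_key(line: str) -> str:
    """Comparison key that treats canonically equivalent strings as equal."""
    return unicodedata.normalize("NFC", line)
```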

BOM (Byte Order Mark)

Some Windows-created text files start with a hidden BOM character (U+FEFF). The first line of the file will have this invisible prefix, making it different from an identical line elsewhere in the file. Strip it first (GNU sed syntax): sed -i '1s/^\xEF\xBB\xBF//' input.txt. For more on encoding issues, see Windows-1252 vs UTF-8.
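In Python, the utf-8-sig codec handles the BOM for you. The sketch below builds a small byte string with a BOM in memory so it runs without any input file.

```python
import codecs

raw = codecs.BOM_UTF8 + "apple\nbanana\napple\n".encode("utf-8")

# Plain utf-8 decoding keeps the BOM as an invisible first character.
print(repr(raw.decode("utf-8").splitlines()[0]))      # '\ufeffapple'

# The utf-8-sig codec drops a leading BOM if one is present.
print(repr(raw.decode("utf-8-sig").splitlines()[0]))  # 'apple'
```

The same codec name works when reading files, for example open('input.txt', encoding='utf-8-sig'), and it is harmless on files that have no BOM.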

Frequently Asked Questions

How do I remove duplicate lines but keep the original order?

Most command-line tools like sort -u will alphabetize your lines, changing the order. To preserve order, use awk '!seen[$0]++' on Linux/Mac, or paste your text into the Remove Duplicate Lines tool, which keeps the first occurrence of each line in its original position.

What's the difference between removing duplicates and deduplicating?

They mean the same thing in the context of text processing. Removing duplicates, deduplicating, and deduping all refer to keeping only unique lines (or the first occurrence of each repeated line) and discarding the rest. The term "deduplication" is more common in database and storage contexts.

Can I remove duplicate lines from a CSV without breaking the data?

Yes, but be careful. Simple line-by-line duplicate removal works if each row is on its own line and you want to remove rows that are completely identical. If your CSV has quoted fields containing newlines, or if you only want to deduplicate based on one column (like email addresses), you'll need a tool that understands CSV structure - like Python's csv module or a spreadsheet application.

How do I remove duplicate lines that aren't exactly the same?

Lines that differ only by whitespace or capitalization need preprocessing before deduplication. Normalize the lines first - trim whitespace and convert to lowercase - then compare. In the Remove Duplicate Lines tool on Article Formatter, the case-insensitive option treats "Hello World" and "hello world" as duplicates.