Remove (Delete) Duplicate Email Addresses in Text Files — 5 Simple Ways
1) Use sort + uniq (Linux/macOS)
- Command:
sort emails.txt | uniq > deduped.txt
- Preserves one instance of each exact line. Use sort -u to combine both steps into one.
- To keep the original order, use the awk or Python methods below.
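A minimal sketch of both variants; the file names emails.txt, deduped.txt, and deduped2.txt and the sample addresses are illustrative:

```shell
# Sample file with one duplicate line (illustrative data).
printf 'bob@example.com\nalice@example.com\nbob@example.com\n' > emails.txt

# Classic pipeline: sort groups identical lines together so uniq can drop repeats.
sort emails.txt | uniq > deduped.txt

# Equivalent single step: sort -u sorts and deduplicates at once.
sort -u emails.txt > deduped2.txt

# Both produce the same sorted, duplicate-free output.
diff deduped.txt deduped2.txt && echo "identical"
# identical
```

Note that both forms sort the output; the original line order is lost.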
2) awk to preserve first occurrence order
- Command:
awk '!seen[$0]++' emails.txt > deduped.txt
- Keeps the first appearance of each exact line and removes later duplicates.
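How it works: seen[$0]++ evaluates to 0 (false) the first time a line appears and a positive count afterwards, so negating it prints only first occurrences. A short demonstration with an illustrative emails.txt:

```shell
# Sample file: bob appears twice, with alice in between (illustrative data).
printf 'bob@example.com\nalice@example.com\nbob@example.com\n' > emails.txt

# !seen[$0]++ is true only on a line's first occurrence, so later
# duplicates are dropped while the original order is preserved.
awk '!seen[$0]++' emails.txt > deduped.txt

cat deduped.txt
# bob@example.com
# alice@example.com
```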
3) Python script for flexible parsing
- Example (handles emails within larger text and normalizes case):
Code
import re

# Read the whole file; for very large files, process line by line instead.
with open('emails.txt') as f:
    text = f.read()

# Extract email-like strings from surrounding text (the dot before the TLD
# must be escaped, or it matches any character).
emails = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', text)

# Deduplicate case-insensitively while keeping the first occurrence's casing.
seen, out = set(), []
for e in emails:
    k = e.lower()
    if k not in seen:
        seen.add(k)
        out.append(e)

with open('deduped.txt', 'w') as f:
    f.write("\n".join(out))
4) PowerShell (Windows)
- Command:
Get-Content emails.txt | Sort-Object -Unique | Set-Content deduped.txt
- To preserve first occurrence order:
Code
$seen = @{}
Get-Content emails.txt | ForEach-Object { if (-not $seen.ContainsKey($_)) { $seen[$_] = $true; $_ } } | Set-Content deduped.txt
5) Text editors / spreadsheet tools
- Use editors with regex find/replace (e.g., VS Code) or import into Excel/Sheets and use “Remove duplicates”.
- Good for small files and visual review; prone to manual error on large files.
Tips & considerations
- Normalization: lowercase emails, trim whitespace, remove surrounding punctuation before deduping.
- Email parsing: use robust regex or libraries for complex text; avoid naive patterns that capture invalid strings.
- Large files: use streaming approaches (awk, Python iterator, or external tools) to avoid high memory use.
- Back up original file before changes.
- If you need a ready-to-run script for your platform or want handling for emails embedded in paragraphs, tell me your OS and file sample.
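The normalization and streaming tips above can be combined in one pipeline. A sketch assuming one address per line (file names and sample data are illustrative):

```shell
# Sample file with mixed case, stray whitespace, and a duplicate (illustrative).
printf '  Bob@Example.com \nbob@example.com\nALICE@example.com\n' > emails.txt

# Lowercase, trim surrounding whitespace, drop blank lines, then keep the
# first occurrence of each normalized line. Every stage streams, so memory
# use stays flat even on very large files.
tr '[:upper:]' '[:lower:]' < emails.txt \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
  | awk 'NF && !seen[$0]++' > deduped.txt

cat deduped.txt
# bob@example.com
# alice@example.com
```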