You write an article in Word, paste it into your blog, and hit publish. Everything looks fine on your end. But your readers see something like this: “Hello world†instead of "Hello world". Curly quotes became gibberish. Dashes turned into â€". Apostrophes became ’.
This isn't random corruption. It's a specific, predictable problem caused by two encoding systems interpreting the same bytes differently. Once you understand why it happens, you can fix it in seconds and prevent it from ever happening again.
What Is Character Encoding?
Computers store text as numbers. The letter "A" is stored as the number 65. The letter "z" is stored as 122. A character encoding is just the lookup table that maps numbers to characters.
The original standard, ASCII, defined 128 characters - the English alphabet (uppercase and lowercase), digits 0-9, punctuation marks, and some control characters. Every computer on Earth agrees on these first 128 mappings. The letter "A" is always 65, no matter what system you're on.
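The lookup-table idea is easy to check for yourself. The short sketch below uses Python's built-in `ord` and `chr`; the variable name `ascii_bytes` is only for illustration.

```python
# Look up the number behind a character, and the character behind a number.
assert ord("A") == 65
assert ord("z") == 122
assert chr(65) == "A"

# Encoding pure ASCII text yields one byte per character,
# and each byte is simply that character's number.
ascii_bytes = "Az".encode("ascii")
print(list(ascii_bytes))  # [65, 122]
```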
But 128 characters weren't enough. People needed accented letters, curly quotes, em dashes, copyright symbols, and characters from other languages. That's where things went sideways.
Windows-1252: Microsoft's Extension
In the early 1990s, Microsoft created Windows-1252 (also called CP-1252) to extend ASCII with up to 128 additional characters (a handful of positions were left unassigned). Byte values 128-255 were assigned to useful Western European characters like curly quotes (“ ” ‘ ’), em dashes (—), ellipses (…), and accented characters (é ñ ü).
This worked fine as long as both the writer and the reader were on Windows. Microsoft Word, Outlook, and Internet Explorer all used Windows-1252 by default. For years, most of the English-speaking web ran on this encoding without anyone thinking about it.
The problem? Windows-1252 is a single-byte encoding. It can only represent 256 total characters. That's enough for English and Western European languages, but it can't handle Chinese, Arabic, Hindi, Japanese, Korean, or even the full set of mathematical symbols. And the way it uses byte positions 128-159 conflicts with other encoding standards.
UTF-8: The Universal Standard
UTF-8 was designed to solve the "too many languages, not enough bytes" problem. It's a variable-length encoding that uses 1 to 4 bytes per character, supporting over 1.1 million characters from every writing system on Earth. It was invented in 1992 and has steadily replaced Windows-1252 across the web.
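To make "variable-length" concrete, here is a small sketch that counts how many bytes UTF-8 needs for a few sample characters. The sample list is an arbitrary choice for illustration, picked so that each length from 1 to 4 appears once.

```python
# One sample character for each UTF-8 length: A, e-acute,
# right single quote, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u2019", "\U0001F600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```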
Today, over 98% of all websites use UTF-8. WordPress uses UTF-8. Google expects UTF-8. Your email provider uses UTF-8. It's the standard, and everything else is legacy.
Here's the clever part: UTF-8 is backwards compatible with ASCII. The first 128 characters are identical to ASCII. Plain English text looks exactly the same in both encodings. But for characters beyond the ASCII range - position 128 and above - UTF-8 uses a completely different scheme than Windows-1252. And that's where the garbled text comes from.
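A quick way to see both halves of that claim is to encode the same text under each encoding and compare the bytes. This is a minimal sketch using Python's standard codecs, where `cp1252` is Python's name for Windows-1252.

```python
plain = "Hello world"

# Plain English text produces identical bytes in all three encodings.
same_bytes = (
    plain.encode("ascii") == plain.encode("cp1252") == plain.encode("utf-8")
)

# A curly right single quote (U+2019) is where the encodings part ways.
quote = "\u2019"
cp1252_bytes = quote.encode("cp1252")  # one byte
utf8_bytes = quote.encode("utf-8")     # three bytes

print(same_bytes, cp1252_bytes.hex(), utf8_bytes.hex())
```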
Where the Collision Happens
The core issue is that Windows-1252 and UTF-8 use positions 128-159 differently. In Windows-1252, these positions hold useful characters. In UTF-8, these byte values are used as continuation bytes in multi-byte sequences.
When you type a curly right single quote in Word, it stores byte value 0x92 (146 in decimal). In Windows-1252, that's the ’ character. But in UTF-8, the proper curly right single quote is the three-byte sequence 0xE2 0x80 0x99.
When a browser expects UTF-8 but receives the lone Windows-1252 byte 0x92, it doesn't find a valid UTF-8 sequence, because 0x92 is a continuation byte with no lead byte in front of it. The browser shows a replacement character (�) instead. The more familiar garbling runs in the opposite direction: when software expects Windows-1252 but receives the UTF-8 bytes 0xE2 0x80 0x99, it decodes each byte as a separate character and produces a multi-character sequence like ’.
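Both directions of the mismatch can be reproduced in a few lines. The sketch below uses Python's codecs as a stand-in for what a browser or CMS does; real software may differ in how it handles invalid input, so treat this as an illustration rather than a description of any particular browser.

```python
RIGHT_SINGLE_QUOTE = "\u2019"

# Direction 1: a lone Windows-1252 byte decoded as UTF-8.
# 0x92 is a continuation byte with no lead byte, so strict decoding fails.
lone_byte = b"\x92"
try:
    lone_byte.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# A lenient decoder substitutes the replacement character U+FFFD.
as_utf8 = lone_byte.decode("utf-8", errors="replace")

# Direction 2: the character's UTF-8 bytes decoded as Windows-1252.
# Each of the three bytes becomes its own character.
as_cp1252 = RIGHT_SINGLE_QUOTE.encode("utf-8").decode("cp1252")

print(repr(as_utf8), repr(as_cp1252))
```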
| Character | Windows-1252 Byte | UTF-8 Bytes | Garbled Output (UTF-8 read as Windows-1252) |
|---|---|---|---|
| ’ Right single quote | 0x92 | 0xE2 0x80 0x99 | ’ |
| “ Left double quote | 0x93 | 0xE2 0x80 0x9C | “ |
| ” Right double quote | 0x94 | 0xE2 0x80 0x9D | â€\x9D (0x9D is unassigned in Windows-1252) |
| — Em dash | 0x97 | 0xE2 0x80 0x94 | â€” |
| … Ellipsis | 0x85 | 0xE2 0x80 0xA6 | … |
| • Bullet | 0x95 | 0xE2 0x80 0xA2 | • |
| ™ Trademark | 0x99 | 0xE2 0x84 0xA2 | â„¢ |
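See the pattern? Almost every one of these characters turns into a sequence starting with †when its UTF-8 bytes are misread as Windows-1252 (the trademark sign, which starts with â„, is the one exception in this table). If you see †scattered through your article, UTF-8 text is being decoded as Windows-1252 somewhere between the writer and the reader.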
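The garbled column of the table can be regenerated rather than memorized. The sketch below is one way to do it; the `CHARACTERS` mapping and the `garble` helper are hypothetical names introduced here. It uses `errors="replace"` because Python's `cp1252` codec treats byte 0x9D as unassigned and would otherwise raise an error on the right double quote.

```python
CHARACTERS = {
    "Right single quote": "\u2019",
    "Left double quote": "\u201c",
    "Right double quote": "\u201d",
    "Em dash": "\u2014",
    "Ellipsis": "\u2026",
    "Bullet": "\u2022",
    "Trademark": "\u2122",
}


def garble(ch: str) -> str:
    """Decode a character's UTF-8 bytes as Windows-1252.

    Unassigned bytes (such as 0x9D) become U+FFFD instead of raising.
    """
    return ch.encode("utf-8").decode("cp1252", errors="replace")


rows = {name: garble(ch) for name, ch in CHARACTERS.items()}

for name, ch in CHARACTERS.items():
    cp1252_hex = "0x" + ch.encode("cp1252").hex().upper()
    utf8_hex = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    print(f"{name} | {cp1252_hex} | {utf8_hex} | {rows[name]}")

# Count how many garbled forms share the a-circumflex + euro prefix.
shared_prefix = sum(g.startswith("\u00e2\u20ac") for g in rows.values())
```

Six of the seven rows share the same two-character prefix, which is why searching for it is a reasonable first test for this kind of damage.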
Common Scenarios That Cause This Problem
This encoding collision shows up in a few predictable situations:
Pasting from Microsoft Word
Word's "smart quotes" feature automatically replaces straight quotes with curly quotes using Windows-1252. When you paste into a UTF-8 CMS like WordPress, those curly quotes become garbled. This is the single most common cause of encoding issues on blogs.
Copying from Outlook or email clients
Some email clients still produce Windows-1252 text, especially for forwarded messages and replies to older threads. Pasting email content into a web form or CMS can trigger the same encoding mismatch.
Importing old database content
If you're migrating content from an older website or database that stored text in Windows-1252 (or Latin-1, its close relative), importing that data into a modern UTF-8 database without conversion produces garbled text across your entire site.
RSS feeds and content syndication
When content passes through an RSS feed or content aggregator, encoding can get mangled at each hop. If any system in the chain decodes the bytes with the wrong encoding, the output will contain garbled characters.
How to Fix It Right Now
If you already have garbled text on your site, the fastest fix is to run it through Article Formatter. Paste your broken text, make sure the "Fix Word/Office Characters" option is checked, and click Format. It converts all those garbled sequences back to the correct characters.
For a single blog post, this takes about 30 seconds. For a large batch of articles, you'll need a programmatic approach.
For developers: command-line conversion
If you have files encoded in Windows-1252 that need to be converted to UTF-8, the iconv command handles it cleanly:
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt
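In PHP, you can use mb_convert_encoding:
// Arguments are (string, target encoding, source encoding).
$clean = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
In Python, to repair a string that is already garbled (its UTF-8 bytes were decoded as Windows-1252):
# Re-encode to recover the original bytes, then decode them as UTF-8.
clean = broken_text.encode('windows-1252').decode('utf-8')
In JavaScript (Node.js), the iconv-lite package works well:
const iconv = require('iconv-lite');
// Assumes `text` is a binary string holding Windows-1252 bytes.
const clean = iconv.decode(Buffer.from(text, 'binary'), 'win1252');
For WordPress users: database-level fix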
If your entire WordPress database has encoding issues, you can fix it with a SQL query. But back up your database first - this is a destructive operation if something goes wrong.
UPDATE wp_posts
SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8mb4)
WHERE post_content LIKE '%â€%';
This re-encodes the stored text back into raw bytes using latin1 (MySQL's latin1 character set is effectively Windows-1252), then reinterprets those bytes as UTF-8. The WHERE clause limits the update to rows that actually have garbled text.
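If you would rather test the repair on a few strings before touching the database, the same round trip can be sketched in Python. The `repair_mojibake` function below is a hypothetical helper, not part of WordPress or MySQL. It approximates the query's logic, but Python's `cp1252` codec is stricter than MySQL's latin1 about unassigned bytes such as 0x9D, so some strings the SQL would convert are left unchanged here.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Windows-1252'.

    Returns the input unchanged when the round trip is not possible,
    which is what happens for text that was never garbled this way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = "It\u00e2\u20ac\u2122s"        # displays as: It + a-circumflex, euro, TM + s
print(repair_mojibake(garbled))          # the apostrophe is restored
print(repair_mojibake("It\u2019s fine")) # already clean, returned as-is
```

Falling back to the original string on failure plays the same role as the WHERE clause: clean rows are left alone instead of being damaged by the conversion.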
How to Prevent Encoding Problems
Fixing broken text is fine for existing content, but prevention is better. Here are the habits that keep your articles clean:
Paste as plain text. In WordPress, use Ctrl+Shift+V (or Cmd+Shift+V on Mac) to paste without formatting. This strips Word's formatting and hidden markup before the text reaches your editor. You'll lose bold and italic formatting, but your text will be clean.
Run text through Article Formatter first. Before pasting into your CMS, paste your content here with the "Fix Word/Office Characters" option enabled. This converts problematic characters to safe UTF-8 equivalents while preserving your intended formatting.
Write in Google Docs instead of Word. Google Docs uses UTF-8 natively. Text copied from Google Docs into WordPress won't have encoding issues. If your workflow allows it, this is the simplest long-term fix.
Turn off "smart quotes" in Word. Go to File > Options > Proofing > AutoCorrect Options > AutoFormat As You Type, then uncheck "Straight quotes with smart quotes." This prevents Word from creating Windows-1252 characters in the first place.
Verify your site's encoding. Make sure your web server sends the Content-Type: text/html; charset=utf-8 header, and that your HTML includes <meta charset="UTF-8">. If either is missing or set to something else, browsers might fall back to Windows-1252 and misinterpret your content.
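Both declarations can be checked with a short script once you have the header value and the page source in hand. The sketch below works on strings you supply, so it makes no network requests. The function names are hypothetical, and the regular expression is a rough heuristic rather than a full HTML parser.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, or None if absent


# Matches <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)


def charset_from_html(html: str) -> Optional[str]:
    """Return the charset declared in a meta tag, if any."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def declares_utf8(header_value: str, html: str) -> bool:
    """True only when both the header and the HTML declare UTF-8."""
    return (
        charset_from_content_type(header_value) == "utf-8"
        and charset_from_html(html) == "utf-8"
    )


print(declares_utf8("text/html; charset=utf-8", '<meta charset="UTF-8">'))
```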
Why This Matters for SEO
Garbled characters don't just look bad to readers - they can hurt your search rankings too. Here's how:
- Crawl quality: Google's crawler sees the garbled bytes, not the intended characters. If your title tag or meta description contains encoding errors, your search snippet might display incorrectly.
- Bounce rate: Readers who see garbled text assume the page is broken or untrustworthy. They leave immediately, and that high bounce rate signals low quality to search engines.
- Rich snippets: If your JSON-LD structured data contains encoding errors, Google may reject it entirely. No rich snippets means lower click-through rates in search results.
- Indexing gaps: In severe cases, encoding errors can prevent search engines from correctly parsing your content, leading to incomplete indexing.
Running your published content through a word counter tool can help you spot encoding issues: garbled characters inflate the character count, because each curly quote or dash turns into two or three characters.
The Technical Deep Dive: Why ’ Specifically?
If you're curious about exactly what's happening under the hood, here's the byte-level explanation for the most common garbled sequence.
Take the right single quote (’). In Unicode, this character is U+2019. In UTF-8, U+2019 is encoded as three bytes: 0xE2 0x80 0x99.
When a system that expects Windows-1252 encounters those three bytes, it looks up each one individually in the Windows-1252 table:
- 0xE2 = â (Latin small letter a with circumflex)
- 0x80 = € (Euro sign)
- 0x99 = ™ (Trade mark sign)
Put together, those three characters are ’, which is exactly the sequence you see in garbled text. When text goes through this misinterpretation more than once (decoded wrongly, saved again as UTF-8, then decoded wrongly a second time), the sequences grow even longer. The exact garbled output depends on which encoding fallbacks the software chain applies, but the †prefix is the telltale signature of UTF-8 data being read as Windows-1252.
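The per-byte lookup can be reproduced directly. The sketch below decodes each UTF-8 byte of U+2019 on its own using Python's `cp1252` codec. It then applies the same misreading a second time, which is an assumption about one way longer garbled sequences can arise, not a claim about any specific software chain.

```python
quote = "\u2019"
utf8_bytes = quote.encode("utf-8")  # b'\xe2\x80\x99'

# Look up each byte individually in the Windows-1252 table.
lookup = [(f"0x{b:02X}", bytes([b]).decode("cp1252")) for b in utf8_bytes]
for hex_value, char in lookup:
    print(hex_value, "=", char)

# One round of misreading: three bytes become three characters.
once = utf8_bytes.decode("cp1252")

# A second round: the garbled text is saved as UTF-8 and misread again.
twice = once.encode("utf-8").decode("cp1252")

# Each round can be undone by reversing the two steps.
undone = twice.encode("cp1252").decode("utf-8")

print(len(quote), len(once), len(twice))  # 1 3 8
```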
Quick Reference: What to Do
Already have garbled text? Paste it into Article Formatter with "Fix Word/Office Characters" enabled.
About to paste from Word? Use Ctrl+Shift+V to paste as plain text, or run it through Article Formatter first.
Building a new website? Set your encoding to UTF-8 everywhere: database, server headers, HTML meta tags. Never mix encodings.
Migrating an old site? Convert the database with iconv or SQL before deploying the new site.
Frequently Asked Questions
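What is the difference between Windows-1252 and UTF-8?
Windows-1252 is a single-byte encoding limited to 256 characters, enough for English and Western European languages. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character and covers every writing system. The two agree on the ASCII range and differ on everything above it.
Why does copy-pasting from Word create garbled characters?
Word's smart quotes feature inserts curly quotes, dashes, and ellipses, all of which sit outside the ASCII range. If any system between Word and your reader decodes those characters with the wrong encoding, they display as sequences like ’.
How do I convert Windows-1252 text to UTF-8?
Use iconv -f WINDOWS-1252 -t UTF-8 on the command line or your language's encoding conversion function.
Is UTF-8 backwards compatible with ASCII?
Yes. The first 128 characters are identical to ASCII, so plain English text looks exactly the same in both encodings.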