You have a chunk of HTML and you need just the text. Maybe you scraped a webpage and want the content without the markup. Maybe you're migrating data from a CMS and the export came wrapped in tags. Maybe someone pasted formatted text into a form field and now there are stray span and div tags sitting in your database.
Whatever the reason, stripping HTML tags is one of the most common text-processing tasks around. And there are a dozen ways to do it, each with tradeoffs. This guide covers all of them - from the fastest (paste it into a tool) to the most controlled (write your own parser).
The Quick Way: Use a Browser Tool
If you just need to strip tags from a piece of text right now, the fastest approach is a browser-based converter. Paste your HTML in, get plain text out.
The HTML to Plain Text tool does exactly this. Paste any HTML - from a single paragraph with bold tags to an entire webpage source - and it extracts the visible text content. It handles nested tags, entities like &amp; and &nbsp;, and even strips out script and style content that shouldn't appear in the output.
Quick steps:
- Go to articleformatter.com/html-to-plain-text
- Paste your HTML into the input area
- Click Convert
- Copy the clean text from the output
Everything happens in your browser - nothing gets sent to a server. Useful when you're working with content you don't want uploaded anywhere.
JavaScript: Strip Tags in the Browser or Node.js
If you need to strip HTML tags programmatically - inside a web app, a build script, or a Node.js backend - JavaScript gives you several good options.
Method 1: DOM textContent (browser)
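The simplest approach in a browser environment, best kept for HTML you trust: even on a detached element, assigning innerHTML can start image loads and fire inline handlers such as onerror. Create a temporary DOM element, set its innerHTML, then read back the textContent: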
function stripHtml(html) {
  const tmp = document.createElement('div');
  tmp.innerHTML = html;
  return tmp.textContent || tmp.innerText || '';
}
// Usage
const clean = stripHtml('<p>Hello <strong>world</strong></p>');
// "Hello world"
This works because the browser's HTML parser does all the heavy lifting. It correctly handles nested tags, self-closing tags, HTML entities, and malformed markup. The parser is the same one that renders web pages, so it resolves those edge cases exactly as the browser itself would.
One thing to watch: textContent returns the text of all elements, including script and style tags. If your HTML might contain those, remove them first:
function stripHtml(html) {
  const tmp = document.createElement('div');
  tmp.innerHTML = html;
  // Remove script and style elements
  tmp.querySelectorAll('script, style').forEach(el => el.remove());
  return tmp.textContent || '';
}
Method 2: DOMParser (browser and Deno)
If you don't want to create elements in the live document, DOMParser creates an isolated document:
function stripHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.textContent || '';
}
Same result, but the parsed content never touches the live DOM. This matters when you're processing untrusted HTML - there's no risk of accidentally executing scripts or loading images.
Method 3: Regex (simple cases only)
For quick-and-dirty stripping where you know the input is well-formed and simple:
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, '');
}
// Works fine for:
stripHtml('<p>Hello <b>world</b></p>');
// "Hello world"
// Breaks on:
stripHtml('<img alt="2 > 1" />');
// ' 1" />' -- oops, the > in the attribute ended the match early
Warning: Regex-based HTML stripping is fine for controlled input like CMS output you trust. But it will produce wrong results on HTML with angle brackets in attributes, comments, CDATA sections, or malformed tags. If you're processing user-submitted or arbitrary HTML, use a DOM-based method instead.
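The attribute failure above is not specific to JavaScript. As a rough illustration, the Python sketch below runs the same tag-matching regex next to the standard library's html.parser. The sample string and the helper names (TextOnly, strip_with_regex, strip_with_parser) are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; the parser, not a regex, decides where each tag ends."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_regex(markup):
    # Same pattern as the JavaScript version: everything from "<" to the next ">".
    return re.sub(r"<[^>]*>", "", markup)


def strip_with_parser(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical input with a ">" inside a quoted attribute value.
sample = '<p>Math: <img alt="2 > 1" /> done</p>'

print(repr(strip_with_regex(sample)))   # attribute debris is left in the text
print(repr(strip_with_parser(sample)))  # only the real text nodes remain
```

The regex version stops at the first > it sees, while the parser tracks quoting and treats the whole img element as one tag.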
Method 4: Node.js (no browser DOM available)
Node.js doesn't have a built-in DOM, so you need a library. The most popular options:
// Option A: striptags (lightweight, fast)
// npm install striptags
const striptags = require('striptags');
striptags('<p>Hello <strong>world</strong></p>');
// "Hello world"
// Option B: sanitize-html (more control)
// npm install sanitize-html
const sanitize = require('sanitize-html');
sanitize('<p>Hello <strong>world</strong></p>', {
  allowedTags: [],
  allowedAttributes: {}
});
// "Hello world"
// Option C: cheerio (full jQuery-style DOM)
// npm install cheerio
const cheerio = require('cheerio');
const $ = cheerio.load('<p>Hello <strong>world</strong></p>');
$.text();
// "Hello world"
Use striptags if you just need to remove tags fast. Use sanitize-html if you want to allow certain tags through (like keeping links and bold). Use cheerio when you also need to traverse or modify the HTML structure.
Python: Strip Tags from HTML
Python has strong HTML processing options built in and through libraries. Here's each approach.
Method 1: BeautifulSoup (most common)
from bs4 import BeautifulSoup
html = '<div><p>Hello <strong>world</strong></p><script>alert("x")</script></div>'
# Basic: get all text
soup = BeautifulSoup(html, 'html.parser')
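text = soup.get_text()
# Older Beautiful Soup releases return "Hello worldalert(\"x\")" here. Since 4.9.0,
# the html.parser and lxml builders leave <script>/<style> contents out of get_text().
# Better: remove script/style explicitly, then get text
for tag in soup(['script', 'style']):
    tag.decompose()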
text = soup.get_text(separator=' ', strip=True)
# "Hello world"
The separator parameter is important - without it, text from adjacent elements runs together. Using separator=' ' inserts a space between every text node (so it can also split a word that has inline markup in the middle of it), and strip=True trims the whitespace around each piece.
Method 2: html.parser (standard library, no install)
from html.parser import HTMLParser
from io import StringIO
class HTMLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = StringIO()
        self.skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self.skip = False
    def handle_data(self, data):
        if not self.skip:
            self.result.write(data)
    def get_text(self):
        return self.result.getvalue()
def strip_html(html):
    s = HTMLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_text()
strip_html('<p>Hello <strong>world</strong></p>')
# "Hello world"
More code than BeautifulSoup, but it has zero dependencies. Useful in environments where you can't install packages - Lambda functions, embedded systems, or scripts you're distributing.
Method 3: lxml (fastest for large documents)
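from lxml.html import fromstring
# Since lxml 5.2 the cleaner ships separately (pip install lxml_html_clean);
# this import path keeps working once that package is installed.
from lxml.html.clean import Cleaner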
html = '<div><p>Hello <strong>world</strong></p></div>'
# Quick text extraction
doc = fromstring(html)
text = doc.text_content()
# "Hello world"
# With cleaning (removes scripts, styles, etc.)
cleaner = Cleaner(scripts=True, style=True, page_structure=False)
cleaned = cleaner.clean_html(html)
text = fromstring(cleaned).text_content()
# "Hello world"
lxml is built on the C library libxml2, so it is typically much faster than BeautifulSoup with the pure-Python html.parser, and the gap grows on multi-megabyte HTML files. Install with pip install lxml.
Command-Line Methods
When you need to strip tags from files in a terminal or as part of a shell script, these one-liners get the job done.
sed
# Strip all HTML tags
sed 's/<[^>]*>//g' input.html > output.txt
# Strip tags and decode common entities (&amp; goes last so "&amp;lt;" is not decoded twice)
sed 's/<[^>]*>//g; s/&lt;/</g; s/&gt;/>/g; s/&nbsp;/ /g; s/&amp;/\&/g' input.html
awk
# Strip tags line by line (like sed, this misses tags that span multiple lines)
awk '{gsub(/<[^>]*>/,"")}1' input.html
Python one-liner
# Uses only standard library
python3 -c "
import re, sys, html
text = sys.stdin.read()
text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL)
text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)
text = re.sub(r'<[^>]+>', '', text)
print(html.unescape(text))
" < input.html
lynx / w3m (full rendering)
For the most accurate conversion, text-based browsers render the HTML properly and output formatted text:
# lynx - preserves headings, lists, links
lynx -dump -nolist input.html > output.txt
# w3m - similar but different formatting choices
w3m -dump input.html > output.txt
# Pipe from curl for live pages
curl -s https://example.com | lynx -dump -stdin
lynx -dump is especially good because it lays the page out the way a text browser would - headings, lists, and tables stay visually distinct instead of collapsing into one run of text. The -nolist flag suppresses the link reference list that normally appears at the bottom.
Handling HTML Entities
Stripping tags is only half the job. Most HTML also contains entities - encoded characters like &amp; for &, &lt; for <, &#39; for an apostrophe, and &nbsp; for non-breaking spaces. After removing tags, these entities are left behind as literal text.
DOM-based methods (like textContent in JavaScript or BeautifulSoup's get_text()) decode entities automatically. That's a major advantage over regex approaches, which leave entities as-is.
If you're using regex or sed, decode entities as a second pass:
// JavaScript (browser only - needs document)
function decodeEntities(text) {
  const textarea = document.createElement('textarea');
  textarea.innerHTML = text;
  return textarea.value;
}
# Python
import html
clean = html.unescape('Caf&eacute; &amp; cr&egrave;me')
# "Café & crème"
// PHP
$clean = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
Preserving Some Structure
Sometimes you don't want raw text - you want text with some structure preserved. Paragraph breaks, list formatting, heading hierarchy. Pure tag stripping loses all of that. Here's how to keep it.
Convert block elements to line breaks
The idea: before removing tags, replace block-level elements (p, div, br, li, h1-h6) with newlines so the text retains its structure:
function htmlToStructuredText(html) {
  let text = html;
  // Replace block-level tags with newlines (\b stops "p" matching <pre> and "li" matching <link>)
  text = text.replace(/<\/?(p|div|h[1-6]|li|tr|br)\b[^>]*>/gi, '\n');
  // Remove remaining tags
  text = text.replace(/<[^>]*>/g, '');
  // Clean up multiple newlines
  text = text.replace(/\n{3,}/g, '\n\n');
  return text.trim();
}
Convert to Markdown instead
If you want to preserve headings, bold, links, and lists in a readable format, converting HTML to Markdown is often better than stripping to plain text. Libraries like turndown (JavaScript) and markdownify (Python) handle this:
// JavaScript (npm install turndown)
const TurndownService = require('turndown');
// headingStyle defaults to 'setext' (underlined headings); 'atx' gives "# Title"
const turndown = new TurndownService({ headingStyle: 'atx' });
turndown.turndown('<h1>Title</h1><p>Text with <strong>bold</strong></p>');
// "# Title\n\nText with **bold**"
# Python (pip install markdownify)
from markdownify import markdownify
markdownify('<h1>Title</h1><p>Text with <strong>bold</strong></p>', heading_style='ATX')
# "# Title\n\nText with **bold**" (surrounding blank lines can vary by version)
You can also convert the other direction - our Markdown to HTML converter handles that, and the detailed guide covers every library and edge case.
Selective Stripping: Keep Some Tags
Full stripping isn't always what you want. Common scenarios where you need to keep certain tags:
- User comments: Allow basic formatting (b, i, a) but strip everything else
- Email HTML: Keep paragraph and line break tags, strip styling
- CMS migration: Preserve structure tags but remove CMS-specific markup
This is HTML sanitization - allowing a whitelist of tags while stripping everything else:
// JavaScript: DOMPurify (npm install dompurify)
// In the browser DOMPurify works as-is. In Node.js it needs a DOM to run against,
// for example from jsdom (npm install jsdom):
const { JSDOM } = require('jsdom');
const DOMPurify = require('dompurify')(new JSDOM('').window);
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
  ALLOWED_ATTR: ['href']
});
# Python: bleach (pip install bleach)
import bleach
clean = bleach.clean(dirty,
    tags=['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
    attributes={'a': ['href']}
)
// PHP: strip_tags with allowed tags (attributes on the allowed tags are kept as-is)
$clean = strip_tags($dirty, '<b><i><em><strong><a><p><br>');
Common Pitfalls
A few traps that catch people when stripping HTML tags:
Script and style content leaking through
Regex-based stripping removes the script and style tags but leaves their content. So <style>body { color: red }</style> becomes body { color: red } in your output. Always remove these elements entirely before stripping other tags.
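A small Python sketch of this pitfall, for trusted input only. The sample page and the function names are hypothetical. The second function removes whole script and style elements, contents included, before the generic tag pattern runs.

```python
import re

# Hypothetical sample input for illustration.
page = "<style>body { color: red }</style><p>Hello</p><script>track()</script>"

# Any single tag.
TAG = re.compile(r"<[^>]*>")
# A whole <script>...</script> or <style>...</style> element, contents included.
NON_VISIBLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)


def strip_tags_only(markup):
    # Removes the tags but keeps whatever sat between them.
    return TAG.sub("", markup)


def strip_tags_and_hidden_content(markup):
    # Drop script/style elements first, then the remaining tags.
    return TAG.sub("", NON_VISIBLE.sub("", markup))


print(strip_tags_only(page))                # body { color: red }Hellotrack()
print(strip_tags_and_hidden_content(page))  # Hello
```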
Words running together
When you strip </p><p>, the closing and opening tags disappear but no space is added. "End of paragraph.Start of next paragraph." looks wrong. Insert a space or newline when removing block-level tags.
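One possible way to do that with only the Python standard library is sketched below. The BLOCK_TAGS set is an assumption about which elements count as block-level in your markup, and BlockAwareStripper and strip_keeping_breaks are illustrative names.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; extend it to match your markup.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockAwareStripper(HTMLParser):
    """Collect text and emit a newline wherever a block-level tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_keeping_breaks(markup):
    parser = BlockAwareStripper()
    parser.feed(markup)
    parser.close()
    # Trim each line and drop the empty ones left by adjacent block tags.
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(strip_keeping_breaks("<p>End of paragraph.</p><p>Start of next paragraph.</p>"))
```

Inline tags such as strong or em add no break, so words inside a sentence stay together.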
Entities left behind
After stripping tags, &amp;, &nbsp;, and numeric entities like &#8217; remain as literal text. Always decode entities after stripping tags, or use a method that does it automatically (DOM-based approaches).
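The order matters. The Python sketch below uses a hypothetical input in which the author wrote a literal <b> using entities, and compares decoding after stripping with decoding before.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")

# Hypothetical input: a literal "<b>" written with entities, plus a numeric entity.
source = "<p>Use &lt;b&gt; for bold &amp; &#8217;smart&#8217; quotes.</p>"

# Strip first, decode second: the escaped "<b>" survives as visible text.
strip_then_decode = html.unescape(TAG.sub("", source))

# Decode first, strip second: the escaped "<b>" becomes a tag and is deleted.
decode_then_strip = TAG.sub("", html.unescape(source))

print(strip_then_decode)
print(decode_then_strip)
```

Decoding first turns escaped text into something that looks like markup, so the stripping pass then removes content the author meant readers to see.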
Security: XSS from incomplete stripping
If you're stripping HTML to prevent cross-site scripting (XSS), regex is not enough. Attackers craft input like <scr<script>ipt>, which turns into a working <script> tag once a filter deletes the inner <script> in a single pass. Always use a proper sanitizer like DOMPurify or bleach for security-critical stripping.
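The nested-tag trick can be reproduced in a few lines of Python. The filter below is a deliberately naive, hypothetical one written for this demonstration. It shows the failure, not a fix.

```python
import re

# A naive filter that deletes opening and closing script tags in one pass.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def remove_script_tags_once(markup):
    return SCRIPT_TAG.sub("", markup)


# Hypothetical attacker input: each script tag is hidden inside a broken outer one.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"

once = remove_script_tags_once(payload)
print(once)  # <script>alert(1)</script>
```

Deleting the inner tags splices the outer fragments together into exactly the markup the filter was meant to remove. Running the filter in a loop handles this particular payload but not the many other encodings attackers use, which is why a maintained sanitizer is the safer choice.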
Method Comparison
| Method | Handles Entities | Handles Malformed HTML | Strips Script Content | Best For |
|---|---|---|---|---|
| DOM textContent | Yes | Yes | With extra code | Browser JavaScript |
| BeautifulSoup | Yes | Yes | Yes (with decompose) | Python projects |
| Regex | No | No | No | Quick scripts, trusted input |
| sed / awk | No | No | No | Shell scripts, piped workflows |
| lynx -dump | Yes | Yes | Yes | Formatted text output |
| lxml | Yes | Yes | Yes (with Cleaner) | Large documents, speed critical |
Real-World Use Cases
Here are the most common situations where you'd need to strip HTML and which approach works best for each.
Search indexing. You're building a search feature and need to index page content as plain text. Use DOM-based extraction (BeautifulSoup or cheerio) to get clean text, strip script and style content, and normalize whitespace. This gives your search engine text without markup interfering with relevance scoring.
Email plain-text version. Email is commonly sent as multipart/alternative, with a plain-text part alongside the HTML part for clients that cannot or will not render HTML. Convert your HTML email body to plain text by stripping tags but preserving paragraph breaks: replace block-level tags with newlines first, then strip all remaining tags.
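As a rough sketch of that workflow, the Python example below builds a message with both parts using the standard library's email package. The html_to_plain and build_message helpers, the tag list, and the sample body are assumptions for illustration, and the regex approach presumes the HTML comes from your own trusted template.

```python
import html
import re
from email.message import EmailMessage

# Assumed list of block-level tags that should become line breaks.
BLOCK_TAG = re.compile(r"</?(p|div|h[1-6]|li|tr|br)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def html_to_plain(body_html):
    text = BLOCK_TAG.sub("\n", body_html)   # block tags become newlines
    text = ANY_TAG.sub("", text)            # remaining (inline) tags disappear
    text = html.unescape(text)              # decode entities after stripping
    text = re.sub(r"[ \t]+\n", "\n", text)  # trim trailing spaces on each line
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()


def build_message(subject, body_html):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(html_to_plain(body_html))       # text/plain part
    msg.add_alternative(body_html, subtype="html")  # text/html part
    return msg


# Hypothetical output of a trusted email template.
body = "<h1>Welcome</h1><p>Thanks for signing up &amp; joining.</p>"
message = build_message("Welcome", body)

print(message.get_content_type())
print(message.get_body(preferencelist=("plain",)).get_content())
```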
Text preview / excerpt generation. Showing a content preview on a blog listing page or in search results. Strip all HTML, decode entities, then truncate to your desired length. Watch for truncating mid-entity - truncate after decoding, not before.
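A minimal Python sketch of that order of operations is shown below. The function name make_excerpt, the 80-character default, and the trailing "..." are arbitrary choices for this example, and the tag regex assumes simple, trusted markup.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")


def make_excerpt(markup, max_chars=80):
    # Strip tags, then decode entities, so truncation counts real characters.
    text = html.unescape(TAG.sub(" ", markup))
    # Collapse all whitespace (including non-breaking spaces) to single spaces.
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    # Cut at the last space inside the limit to avoid splitting a word.
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."


raw = "<p>Caf&eacute; au lait &amp; croissants, served daily</p>"

print(raw[3:11])               # Caf&eacu  (cutting the raw markup splits an entity)
print(make_excerpt(raw, 20))   # Café au lait &...
```

Because decoding happens before the length check, the limit applies to the characters the reader actually sees.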
Data cleaning after web scraping. You scraped product descriptions or article content and it came with HTML markup. Use BeautifulSoup or lxml to parse the HTML properly, extract the specific content you need (not the nav, footer, or ads), then get the text. The Word Counter can help verify you're getting the right amount of content.
CMS migration. Moving content between systems often means dealing with HTML exported from one CMS that doesn't render correctly in another. Strip the old CMS's custom markup, keep the semantic HTML (headings, paragraphs, lists), and let the new CMS apply its own styling.
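For a migration of content you already trust, a small allowlist filter can be sketched with Python's standard html.parser. The KEEP and KEEP_ATTRS sets, the class name, and the sample markup are all assumptions for illustration. This is not a security sanitizer (it does not vet href values or drop script contents, for example), so untrusted input still belongs in DOMPurify or bleach.

```python
import html
from html.parser import HTMLParser

# Assumed allowlist of semantic tags to keep; any other tag is unwrapped.
KEEP = {"h1", "h2", "h3", "p", "ul", "ol", "li", "strong", "em", "a"}
# Assumed per-tag attribute allowlist; every other attribute is dropped.
KEEP_ATTRS = {"a": {"href"}}


class SemanticFilter(HTMLParser):
    """Re-emit allowed tags without styling attributes; keep all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KEEP:
            return
        allowed = KEEP_ATTRS.get(tag, set())
        kept = "".join(
            f' {name}="{html.escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives decoded, so escape it again for the HTML output.
        self.out.append(html.escape(data, quote=False))


def keep_semantic_html(markup):
    parser = SemanticFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


# Hypothetical export from the old CMS.
old = (
    '<div class="cms-block" data-id="7"><h2 style="color:red">Title</h2>'
    '<p>Text with <span class="x">a</span> '
    '<a href="/x" onclick="y()">link</a> &amp; more</p></div>'
)
print(keep_semantic_html(old))
```

The wrapper div and span disappear, the style and onclick attributes are dropped, and the headings, paragraphs, and link survive for the new CMS to style.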
Frequently Asked Questions
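What is the fastest way to remove HTML tags from text?
For a one-off job, paste the HTML into a browser-based converter such as the HTML to Plain Text tool and copy the result. In code, a DOM-based method (textContent in the browser, BeautifulSoup's get_text() in Python) is the quickest reliable option.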
Can I strip HTML tags using regex?
Yes, for simple and trusted input. Use the pattern <[^>]*> to match anything between angle brackets. But regex fails on edge cases: malformed HTML, angle brackets in attribute values, script content, and HTML comments. For production code, always use a proper HTML parser instead.
How do I strip tags but keep the text content?
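In JavaScript, set the HTML as the innerHTML of a temporary element and read its textContent. In Python, use BeautifulSoup's get_text() method. Both approaches extract all visible text while discarding every tag.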
How do I remove tags but keep certain ones like links or bold?
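Use an HTML sanitizer with a list of allowed tags, such as DOMPurify in JavaScript or bleach in Python. PHP's strip_tags() function also supports an allowed tags parameter.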