How to Strip HTML Tags from Text (Every Method)

You have a chunk of HTML and you need just the text. Maybe you scraped a webpage and want the content without the markup. Maybe you're migrating data from a CMS and the export came wrapped in tags. Maybe someone pasted formatted text into a form field and now there are stray span and div tags sitting in your database.

Whatever the reason, stripping HTML tags is one of the most common text-processing tasks around. And there are a dozen ways to do it, each with tradeoffs. This guide covers all of them - from the fastest (paste it into a tool) to the most controlled (write your own parser).

The Quick Way: Use a Browser Tool

If you just need to strip tags from a piece of text right now, the fastest approach is a browser-based converter. Paste your HTML in, get plain text out.

The HTML to Plain Text tool does exactly this. Paste any HTML - from a single paragraph with bold tags to an entire webpage source - and it extracts the visible text content. It handles nested tags, entities like &amp; and &nbsp;, and even strips out script and style content that shouldn't appear in the output.

Quick steps:

Go to articleformatter.com/html-to-plain-text
Paste your HTML into the input area
Click Convert
Copy the clean text from the output

Everything happens in your browser - nothing gets sent to a server. Useful when you're working with content you don't want uploaded anywhere.

JavaScript: Strip Tags in the Browser or Node.js

If you need to strip HTML tags programmatically - inside a web app, a build script, or a Node.js backend - JavaScript gives you several good options.

Method 1: DOM textContent (browser)

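The simplest approach in a browser environment. Create a temporary DOM element, set its innerHTML, then read back the textContent. Script tags inserted this way don't run, but the browser may still load images and fire inline handlers such as onerror, so use this only on HTML you trust: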

function stripHtml(html) {
  const tmp = document.createElement('div');
  tmp.innerHTML = html;
  return tmp.textContent || tmp.innerText || '';
}

// Usage
const clean = stripHtml('<p>Hello <strong>world</strong></p>');
// "Hello world"

This works because the browser's HTML parser does all the heavy lifting. It correctly handles nested tags, self-closing tags, HTML entities, and malformed markup. It is the same parser that renders web pages, so it copes with input that trips up regex.

One thing to watch: textContent returns the text of all elements, including the contents of script and style elements. If your HTML might contain those, remove them first:

function stripHtml(html) {
  const tmp = document.createElement('div');
  tmp.innerHTML = html;
  // Remove script and style elements
  tmp.querySelectorAll('script, style').forEach(el => el.remove());
  return tmp.textContent || '';
}

Method 2: DOMParser (browser)

If you don't want to create elements in the live document, DOMParser creates an isolated document:

function stripHtml(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.textContent || '';
}

Same result, but the parsed content never touches the live DOM. This matters when you're processing untrusted HTML - there's no risk of accidentally executing scripts or loading images.

Method 3: Regex (simple cases only)

For quick-and-dirty stripping where you know the input is well-formed and simple:

function stripHtml(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Works fine for:
stripHtml('<p>Hello <b>world</b></p>');
// "Hello world"

// Breaks on:
stripHtml('<img alt="2 > 1" />');
// ' 1" />'  -- oops, the > in the attribute ended the match early

Warning: Regex-based HTML stripping is fine for controlled input like CMS output you trust. But it will produce wrong results on HTML with angle brackets in attributes, comments, CDATA sections, or malformed tags. If you're processing user-submitted or arbitrary HTML, use a DOM-based method instead.
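
The same failure is easy to reproduce outside the browser. Here's a minimal sketch in Python, using only the standard library, that runs the regex from above and a parser-based stripper on the same input. The names strip_with_regex and strip_with_parser are made up for this illustration.

```python
import re
from html.parser import HTMLParser


def strip_with_regex(markup):
    # Same pattern as the JavaScript example: drop anything between < and >.
    return re.sub(r"<[^>]*>", "", markup)


class TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


sample = '<p>Compare:</p><img alt="2 > 1" />'
print(repr(strip_with_regex(sample)))   # 'Compare: 1" />'
print(repr(strip_with_parser(sample)))  # 'Compare:'
```

The parser understands that the > sits inside a quoted attribute value, so the whole img tag is treated as one tag and nothing leaks into the output.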

Method 4: Node.js (no browser DOM available)

Node.js doesn't have a built-in DOM, so you need a library. The most popular options:

// Option A: striptags (lightweight, fast)
// npm install striptags
const striptags = require('striptags');
striptags('<p>Hello <strong>world</strong></p>');
// "Hello world"

// Option B: sanitize-html (more control)
// npm install sanitize-html
const sanitize = require('sanitize-html');
sanitize('<p>Hello <strong>world</strong></p>', {
  allowedTags: [],
  allowedAttributes: {}
});
// "Hello world"

// Option C: cheerio (full jQuery-style DOM)
// npm install cheerio
const cheerio = require('cheerio');
const $ = cheerio.load('<p>Hello <strong>world</strong></p>');
$.text();
// "Hello world"

Use striptags if you just need to remove tags fast. Use sanitize-html if you want to allow certain tags through (like keeping links and bold). Use cheerio when you also need to traverse or modify the HTML structure.

Python: Strip Tags from HTML

Python has strong HTML processing options built in and through libraries. Here's each approach.

Method 1: BeautifulSoup (most common)

from bs4 import BeautifulSoup

html = '<div><p>Hello <strong>world</strong></p><script>alert("x")</script></div>'

# Basic: get all text
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
# Newer Beautiful Soup 4 releases return "Hello world";
# older ones also include the script text: "Hello worldalert(\"x\")"

# Safer across versions: remove script/style explicitly, then get text
for tag in soup(['script', 'style']):
    tag.decompose()
text = soup.get_text(separator=' ', strip=True)
# "Hello world"

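The separator parameter is important - without it, text from adjacent elements runs together. separator=' ' inserts a space between every text fragment (including around inline tags, so <b>Hel</b>lo becomes "Hel lo"), and strip=True trims whitespace from each fragment and drops empty ones.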

Method 2: html.parser (standard library, no install)

from html.parser import HTMLParser
from io import StringIO

class HTMLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = StringIO()
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self.skip = False

    def handle_data(self, data):
        if not self.skip:
            self.result.write(data)

    def get_text(self):
        return self.result.getvalue()

def strip_html(html):
    s = HTMLStripper()
    s.feed(html)
    return s.get_text()

strip_html('<p>Hello <strong>world</strong></p>')
# "Hello world"

More code than BeautifulSoup, but it has zero dependencies. Useful in environments where you can't install packages - Lambda functions, embedded systems, or scripts you're distributing.

Method 3: lxml (fastest for large documents)

from lxml.html import fromstring
# Since lxml 5.2 the Cleaner lives in a separate package: pip install lxml_html_clean
from lxml.html.clean import Cleaner

html = '<div><p>Hello <strong>world</strong></p></div>'

# Quick text extraction
doc = fromstring(html)
text = doc.text_content()
# "Hello world"

# With cleaning (removes scripts, styles, etc.)
cleaner = Cleaner(scripts=True, style=True, page_structure=False)
cleaned = cleaner.clean_html(html)
text = fromstring(cleaned).text_content()
# "Hello world"

lxml is a C-based parser, so it's typically much faster than BeautifulSoup with html.parser on large documents. The exact speedup depends on the input, so benchmark on your own multi-megabyte files if speed matters. Install with pip install lxml.

Command-Line Methods

When you need to strip tags from files in a terminal or as part of a shell script, these one-liners get the job done.

sed

# Strip all HTML tags
sed 's/<[^>]*>//g' input.html > output.txt

# Strip tags and decode common entities (&amp; goes last so &amp;lt; is not decoded twice)
sed 's/<[^>]*>//g; s/&lt;/</g; s/&gt;/>/g; s/&nbsp;/ /g; s/&amp;/\&/g' input.html

awk

# Strip tags line by line (a tag split across lines is not matched)
awk '{gsub(/<[^>]*>/,"")}1' input.html

Python one-liner

# Uses only standard library
python3 -c "
import re, sys, html
text = sys.stdin.read()
text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL)
text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)
text = re.sub(r'<[^>]+>', '', text)
print(html.unescape(text))
" < input.html

lynx / w3m (full rendering)

For the most accurate conversion, text-based browsers render the HTML properly and output formatted text:

# lynx - preserves headings, lists, links
lynx -dump -nolist input.html > output.txt

# w3m - similar but different formatting choices
w3m -dump input.html > output.txt

# Pipe from curl for live pages
curl -s https://example.com | lynx -dump -stdin

lynx -dump is especially good because it renders HTML the way a browser would - so headings get underlined, lists get bullet points, and tables get formatted. The -nolist flag suppresses the link reference list that normally appears at the bottom.

Handling HTML Entities

Stripping tags is only half the job. Most HTML also contains entities - encoded characters like &amp; for &, &lt; for <, &#39; for apostrophe, and &nbsp; for non-breaking spaces. After removing tags, these entities are left behind as literal text.

DOM-based methods (like textContent in JavaScript or BeautifulSoup's get_text()) decode entities automatically. That's a major advantage over regex approaches, which leave entities as-is.

If you're using regex or sed, decode entities as a second pass:

// JavaScript
function decodeEntities(text) {
  const textarea = document.createElement('textarea');
  textarea.innerHTML = text;
  return textarea.value;
}

# Python
import html
clean = html.unescape('Caf&eacute; &amp; cr&egrave;me')
# "Café & crème"

// PHP
$clean = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

Preserving Some Structure

Sometimes you don't want raw text - you want text with some structure preserved. Paragraph breaks, list formatting, heading hierarchy. Pure tag stripping loses all of that. Here's how to keep it.

Convert block elements to line breaks

The idea: before removing tags, replace block-level elements (p, div, br, li, h1-h6) with newlines so the text retains its structure:

function htmlToStructuredText(html) {
  let text = html;
  // Replace block-level tags with newlines (\b keeps "p" from matching <pre> and "li" from matching <link>)
  text = text.replace(/<\/?(?:p|div|h[1-6]|li|tr|br)\b[^>]*>/gi, '\n');
  // Remove remaining tags
  text = text.replace(/<[^>]*>/g, '');
  // Clean up multiple newlines
  text = text.replace(/\n{3,}/g, '\n\n');
  return text.trim();
}
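
If you'd rather not lean on regex for this, the same idea can be sketched with Python's standard-library parser: emit a newline whenever a block-level tag opens or closes, and skip script and style content entirely. The class name, the function name, and the exact set of block tags below are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class StructuredText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_structured_text(markup):
    parser = StructuredText()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


doc = "<h1>Title</h1><p>First <b>para</b>.</p><script>x()</script><p>Second.</p>"
print(html_to_structured_text(doc))
# Title
#
# First para.
#
# Second.
```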

Convert to Markdown instead

If you want to preserve headings, bold, links, and lists in a readable format, converting HTML to Markdown is often better than stripping to plain text. Libraries like turndown (JavaScript) and markdownify (Python) handle this:

// JavaScript (npm install turndown)
const TurndownService = require('turndown');
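// headingStyle 'atx' gives "# Title"; the default (setext) underlines headings instead
const turndown = new TurndownService({ headingStyle: 'atx' });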
turndown.turndown('<h1>Title</h1><p>Text with <strong>bold</strong></p>');
// "# Title\n\nText with **bold**"

# Python (pip install markdownify)
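from markdownify import markdownify
# heading_style='ATX' gives "# Title"; the default underlines headings instead
markdownify('<h1>Title</h1><p>Text with <strong>bold</strong></p>', heading_style='ATX')
# "# Title\n\nText with **bold**" (surrounding blank lines can vary by version)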

You can also convert the other direction - our Markdown to HTML converter handles that, and the detailed guide covers every library and edge case.

Selective Stripping: Keep Some Tags

Full stripping isn't always what you want. Common scenarios where you need to keep certain tags:

User comments: Allow basic formatting (b, i, a) but strip everything else
Email HTML: Keep paragraph and line break tags, strip styling
CMS migration: Preserve structure tags but remove CMS-specific markup

This is HTML sanitization - allowing a whitelist of tags while stripping everything else:

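// JavaScript: DOMPurify (npm install dompurify)
// In the browser, use DOMPurify directly. In Node.js it needs a DOM, e.g. jsdom (npm install jsdom):
const createDOMPurify = require('dompurify');
const { JSDOM } = require('jsdom');
const DOMPurify = createDOMPurify(new JSDOM('').window);
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
  ALLOWED_ATTR: ['href']
});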

# Python: bleach (pip install bleach)
import bleach
clean = bleach.clean(dirty,
  tags=['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
  attributes={'a': ['href']},
  strip=True  # remove disallowed tags; the default escapes them instead
)

// PHP: strip_tags with allowed tags
$clean = strip_tags($dirty, '<b><i><em><strong><a><p><br>');
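
To show what an allowlist filter actually does under the hood, here's a rough Python sketch built on the standard-library parser. It is for illustration only: it hasn't been hardened the way DOMPurify or bleach have, so don't use it as a security boundary. The names AllowlistFilter and keep_allowed, and the choice to accept only http, https, and mailto links, are assumptions made for this example.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_TAGS = {"b", "i", "em", "strong", "a", "p", "br"}
VOID_TAGS = {"br"}                          # no closing tag is emitted for these
DROP_CONTENT = {"script", "style"}          # tag and its content both vanish
SAFE_SCHEMES = {"http", "https", "mailto"}  # anything else loses its href


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.drop_depth += 1
        elif tag in ALLOWED_TAGS and self.drop_depth == 0:
            self.out.append("<" + tag + self._href(tag, attrs) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS and self.drop_depth == 0:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Text arrives entity-decoded, so re-escape it on the way out.
        if self.drop_depth == 0:
            self.out.append(html.escape(data))

    @staticmethod
    def _href(tag, attrs):
        # Only <a> keeps an attribute, and only href with a safe scheme.
        if tag != "a":
            return ""
        for name, value in attrs:
            if name == "href" and value:
                url = value.strip()
                if urlsplit(url).scheme.lower() in SAFE_SCHEMES:
                    return ' href="' + html.escape(url, quote=True) + '"'
        return ""


def keep_allowed(markup):
    f = AllowlistFilter()
    f.feed(markup)
    f.close()
    return "".join(f.out)


dirty = ('<p onclick="x()">Hi <b>there</b><script>alert(1)</script> '
         '<a href="javascript:alert(1)">bad</a> '
         '<a href="https://example.com">ok</a><br/></p>')
print(keep_allowed(dirty))
# <p>Hi <b>there</b> <a>bad</a> <a href="https://example.com">ok</a><br></p>
```

Notice how much has to be decided even in a toy version: which attributes survive, which URL schemes are safe, what happens to void tags. That's the work a maintained sanitizer does for you.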

Common Pitfalls

A few traps that catch people when stripping HTML tags:

Script and style content leaking through

Regex-based stripping removes the script and style tags but leaves their content. So <style>body { color: red }</style> becomes body { color: red } in your output. Always remove these elements entirely before stripping other tags.

Words running together

When you strip </p><p>, the closing and opening tags disappear but no space is added. "End of paragraph.Start of next paragraph." looks wrong. Insert a space or newline when removing block-level tags.
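
A quick Python sketch of the fix, assuming trusted, well-formed input (it's still regex): replace block-level tags with a space first, remove the remaining tags, then collapse the whitespace. The function name and the list of block tags are illustrative choices.

```python
import re

BLOCK_TAG = re.compile(r"</?(?:p|div|br|li|tr|h[1-6])\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def strip_with_spacing(markup):
    text = BLOCK_TAG.sub(" ", markup)   # block boundaries become a space
    text = ANY_TAG.sub("", text)        # inline tags vanish without a gap
    return re.sub(r"\s+", " ", text).strip()


html_in = "<p>End of paragraph.</p><p>Start of <em>next</em> paragraph.</p>"
print(ANY_TAG.sub("", html_in))     # End of paragraph.Start of next paragraph.
print(strip_with_spacing(html_in))  # End of paragraph. Start of next paragraph.
```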

Entities left behind

After stripping tags, &amp;, &nbsp;, and numeric entities like &#8217; remain as literal text. Always decode entities after stripping tags, or use a method that does it automatically (DOM-based approaches).

Security: XSS from incomplete stripping

If you're stripping HTML to prevent cross-site scripting (XSS), regex is not enough. Attackers craft input like <scr<script>ipt> that survives a single regex pass. Always use a proper sanitizer like DOMPurify or bleach for security-critical stripping.
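
Here's a short Python demonstration of that trick, using a script-removal regex like the one in the command-line section. The payload is a made-up example; the point is that removing the inner tags reassembles a working outer one.

```python
import html
import re

SCRIPT_BLOCK = re.compile(r"<script[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

payload = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"

# One pass removes the two inner <script></script> pairs...
one_pass = SCRIPT_BLOCK.sub("", payload)
print(one_pass)               # <script>alert(1)</script>

# If the value is only ever displayed as text inside HTML,
# escaping it neutralizes whatever survived:
print(html.escape(one_pass))  # &lt;script&gt;alert(1)&lt;/script&gt;
```

Escaping only helps when the output is meant to be plain text. If you need to keep some markup, that's a job for a real sanitizer.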

Method Comparison

Method	Handles Entities	Handles Malformed HTML	Strips Script Content	Best For
DOM textContent	Yes	Yes	With extra code	Browser JavaScript
BeautifulSoup	Yes	Yes	Yes (with decompose)	Python projects
Regex	No	No	No	Quick scripts, trusted input
sed / awk	No	No	No	Shell scripts, piped workflows
lynx -dump	Yes	Yes	Yes	Formatted text output
lxml	Yes	Yes	Yes (with Cleaner)	Large documents, speed critical

Real-World Use Cases

Here are the most common situations where you'd need to strip HTML and which approach works best for each.

Search indexing. You're building a search feature and need to index page content as plain text. Use DOM-based extraction (BeautifulSoup or cheerio) to get clean text, strip script and style content, and normalize whitespace. This gives your search engine text without markup interfering with relevance scoring.

Email plain-text version. HTML emails are normally sent as multipart messages with a plain-text alternative for clients that can't or won't render HTML. Convert your HTML email body to plain text by stripping tags but preserving paragraph breaks. Add newlines before block elements first, then strip all remaining tags.

Text preview / excerpt generation. Showing a content preview on a blog listing page or in search results. Strip all HTML, decode entities, then truncate to your desired length. Watch for truncating mid-entity - truncate after decoding, not before.
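
As a sketch of that order of operations in Python (standard library only): the parser decodes entities while it extracts text, and truncation happens last, at a word boundary. The names make_excerpt and ExcerptText, the default limit, and the trailing "..." are all choices made for this example.

```python
from html.parser import HTMLParser

BLOCKS = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class ExcerptText(HTMLParser):
    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities for us
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in BLOCKS:
            self.parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in BLOCKS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def make_excerpt(markup, limit=60):
    parser = ExcerptText()
    parser.feed(markup)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # normalize whitespace
    if len(text) <= limit:
        return text
    cut = text[:limit].rsplit(" ", 1)[0]            # back up to a word boundary
    return cut.rstrip(" ,.;:") + "..."


print(make_excerpt("<p>Fish &amp; chips</p><p>are served daily</p>", limit=14))
# Fish & chips...
```

Truncating the raw string first could have cut &amp; in half and left a stray "&am" in the preview.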

Data cleaning after web scraping. You scraped product descriptions or article content and it came with HTML markup. Use BeautifulSoup or lxml to parse the HTML properly, extract the specific content you need (not the nav, footer, or ads), then get the text. The Word Counter can help verify you're getting the right amount of content.

CMS migration. Moving content between systems often means dealing with HTML exported from one CMS that doesn't render correctly in another. Strip the old CMS's custom markup, keep the semantic HTML (headings, paragraphs, lists), and let the new CMS apply its own styling.

Frequently Asked Questions

What is the fastest way to remove HTML tags from text?

The fastest way is to paste your HTML into a browser-based tool like the HTML to Plain Text converter. It strips all tags instantly and gives you clean plain text. No installation, no coding required.

Can I strip HTML tags using regex?

Regex works for simple cases using a pattern like <[^>]*> to match anything between angle brackets. But regex fails on edge cases: malformed HTML, angle brackets in attribute values, script content, and HTML comments. For production code, always use a proper HTML parser instead.

How do I strip tags but keep the text content?

Use the DOM's built-in text extraction. In JavaScript, create a temporary element, set its innerHTML to your HTML string, then read its textContent property. In Python, use BeautifulSoup's get_text() method. Both approaches extract all visible text while discarding every tag.

How do I remove tags but keep certain ones like links or bold?

This is called tag whitelisting or sanitization. Libraries like DOMPurify (JavaScript) and bleach (Python) let you specify exactly which tags to allow. Everything else gets stripped while your permitted tags remain intact. PHP's strip_tags() function also supports an allowed tags parameter.
