How to Extract Heading Content (h1, h2, etc.) from an HTML String Using Regex

Headlines and headings are usually among the most relevant and descriptive pieces of information on any HTML page. You might want to include them in the description <meta> tag of that page. Here is a simple regular expression to extract all those headings:

preg_match_all( '|<h[1-6][^>]*>(.*)</h[1-6]>|isU', $html, $headings );

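where $html is the HTML source and $headings will be populated with the matches: $headings[0] holds the full matched elements, tags included, and $headings[1] holds the content captured between the opening and closing tags.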
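
As a rough sketch of how the extracted headings could feed a description <meta> tag, the example below runs a heading pattern over a small made-up page. The sample markup, the $clean and $description variables, and the comma-separated format are assumptions for illustration only. The pattern used here is restricted to h1 through h6 so that tags such as <html> or <head> are not picked up, and it uses the "s" modifier so a heading may span several lines.

```php
<?php
// Hypothetical sample input; in practice $html would hold the page source.
$html = <<<HTML
<html><head><title>Demo</title></head>
<body>
<h1 class="title">Regex <em>Tips</em></h1>
<p>Intro text.</p>
<h2>Extracting
headings</h2>
</body></html>
HTML;

// Match only h1-h6. The "s" modifier lets "." span line breaks,
// and "U" makes the quantifiers lazy by default.
preg_match_all( '|<h[1-6][^>]*>(.*)</h[1-6]>|isU', $html, $headings );

// $headings[0] holds the full matches, $headings[1] the inner HTML.
$clean = array();
foreach ( $headings[1] as $inner ) {
    // Drop nested tags and collapse runs of whitespace.
    $clean[] = trim( preg_replace( '/\s+/', ' ', strip_tags( $inner ) ) );
}

// Assumed use: join the headings into a description value.
$description = htmlspecialchars( implode( ', ', $clean ) );
echo '<meta name="description" content="' . $description . '">' . "\n";
```

With the sample input above, $clean ends up as "Regex Tips" and "Extracting headings". A regular expression is fine for quick jobs like this, but it can still be fooled by unusual markup (for example, headings inside HTML comments), so treat the result as a best-effort extraction.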

2 Comments

  1. David says:

    Saved my day thanks!

  2. Georgios Stampolis says:

    Thank you very much!!

    I have modified it a bit to get the h-tags AND the ids of the h-tags:

    preg_match_all('|<h[1-6][^>]*\sid="([^"]*)"[^>]*>(.*)</h[1-6]>|isU', ...)
