HTML Parser
XML::HTMLParser parses HTML documents, including malformed HTML that a strict XML parser would reject. It produces the same kind of DOM tree as XML::Parser.
Parsing HTML
# From a file
doc = XML::HTMLParser.file('page.html').parse
# From a string
doc = XML::HTMLParser.string('<html><body><p>Hello</p></body></html>').parse
# From an IO
File.open('page.html') do |io|
  doc = XML::HTMLParser.io(io).parse
end
Example: Extract Links from HTML
html = <<~HTML
  <html>
    <body>
      <a href="https://example.com">Example</a>
      <a href="https://ruby-lang.org">Ruby</a>
      <a href="/about">About</a>
    </body>
  </html>
HTML
doc = XML::HTMLParser.string(html).parse
doc.find('//a[@href]').each do |link|
  puts "#{link.content} -> #{link['href']}"
end
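The third link in the example has a relative href (`/about`). When absolute URLs are needed, one option is Ruby's standard `URI.join`. The following is a minimal sketch rather than part of the library: the `absolutize` helper and the `base_url` value are hypothetical, and the `hrefs` array stands in for the values read through `link['href']`.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page's base URL.
# Absolute hrefs pass through; relative ones are joined to the base.
def absolutize(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

# Stand-in for the values returned by link['href'] in the example above.
hrefs = ['https://example.com', 'https://ruby-lang.org', '/about']

# Assumed URL of the page that was parsed.
base_url = 'https://example.com/docs/index.html'

absolutize(base_url, hrefs).each { |url| puts url }
```

Resolving against the URL of the page itself, rather than the site root, keeps hrefs such as `../other.html` pointing at the right location.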
Example: Extract Text Content
doc = XML::HTMLParser.file('article.html').parse
# Get all paragraph text
doc.find('//p').each do |p|
  puts p.content
end
# Get the page title
title = doc.find_first('//title')
puts title.content if title
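Text taken from real pages usually carries the newlines and indentation of the HTML source. A small clean-up step can be sketched as follows; `normalize_text` is a hypothetical helper, and the `raw` string stands in for a value returned by `content`.

```ruby
# Hypothetical helper: collapse runs of whitespace (newlines, tabs,
# indentation) into single spaces and trim both ends.
def normalize_text(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# Stand-in for the content of an indented, multi-line paragraph.
raw = "\n      Ruby is a dynamic,\n      open source language.\n    "

puts normalize_text(raw)
```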
Example: Parse a Table
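# Sample input; any HTML containing a table works here
html = '<table><tr><td>A1</td><td>B1</td></tr><tr><td>A2</td><td>B2</td></tr></table>'
doc = XML::HTMLParser.string(html).parse
doc.find('//table//tr').each do |row|
  # Collect the text of each <td> cell in this row (header <th> cells are not matched)
  cells = row.find('td').map(&:content)
  puts cells.join(' | ')
end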
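When the first row of a table holds column names, the per-row arrays can be turned into records keyed by those names. This is only a sketch under that assumption: `rows_to_records` is a hypothetical helper, and the `rows` array stands in for the arrays of cell text collected above.

```ruby
# Hypothetical helper: treat the first row as the header and turn each
# remaining row into a Hash keyed by column name.
def rows_to_records(rows)
  header, *body = rows
  return [] if header.nil?

  body.map { |cells| header.zip(cells).to_h }
end

# Stand-in for the arrays of cell text, one array per table row.
rows = [
  ['Name', 'Language'],
  ['Matz', 'Ruby'],
  ['Guido', 'Python']
]

rows_to_records(rows).each { |record| puts record.inspect }
```

If a row has fewer cells than the header, `zip` fills the missing values with `nil`, so short rows do not raise.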
Handling Malformed HTML
The HTML parser is lenient. It handles missing close tags, incorrect nesting, and other common HTML issues:
# This would fail as XML but parses fine as HTML
html = '<p>First<p>Second<br><b>Bold<i>BoldItalic</b></i>'
doc = XML::HTMLParser.string(html).parse
doc.find('//p').each { |p| puts p.content }
Options
HTML parser options are constants defined on XML::HTMLParser::Options. Combine them with a bitwise OR:
parser = XML::HTMLParser.string(html)
parser.options = XML::HTMLParser::Options::NOERROR |
                 XML::HTMLParser::Options::NOWARNING
doc = parser.parse
Suppressing errors and warnings is common with real-world HTML, which often triggers parser warnings.
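Options combine with `|` because each one occupies its own bit in an integer. The sketch below illustrates that idea in plain Ruby. The `DemoOptions` module and its values are stand-ins chosen for the demonstration, not read from the library, and `option_set?` is a hypothetical helper.

```ruby
# Illustrative stand-ins for option constants; each uses a distinct bit.
# These values are assumptions for the demo, not the library's own.
module DemoOptions
  RECOVER   = 1 << 0
  NOERROR   = 1 << 5
  NOWARNING = 1 << 6
end

# Hypothetical helper: true when every bit of `flag` is set in `options`.
def option_set?(options, flag)
  (options & flag) == flag
end

# Bitwise OR merges the two flags into a single integer.
options = DemoOptions::NOERROR | DemoOptions::NOWARNING

puts option_set?(options, DemoOptions::NOERROR)
puts option_set?(options, DemoOptions::RECOVER)
```

Since the flags do not share bits, adding one option never disturbs another, and a single integer can carry the whole configuration.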
Encoding
Specify encoding when the HTML doesn't declare it: