class LibXML::XML::HTMLParser
The HTML parser implements an HTML 4.0 non-verifying parser with an API compatible with XML::Parser. In contrast with XML::Parser, it can parse “real world” HTML, even if it is severely broken from a specification point of view.
The HTML parser creates an in-memory document object that consists of any number of XML::Node instances. This is a simple and powerful model, but it has one major limitation: the size of the document that can be processed is bounded by the amount of memory available.
Using the HTML parser is simple:
parser = XML::HTMLParser.file('my_file')
doc = parser.parse
You can also parse documents (see XML::HTMLParser.document), strings (see XML::HTMLParser.string) and IO objects (see XML::HTMLParser.io).
Attributes
Public Class Methods
Creates a new parser by parsing the specified file or uri.
Parameters:
path - Path to the file to parse.
encoding - The document encoding, defaults to nil. Valid values are the encoding constants defined on XML::Encoding.
options - Parser options. Valid values are the constants defined on XML::HTMLParser::Options. Multiple options can be combined using bitwise OR (|).
# File lib/libxml/html_parser.rb
def self.file(path, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.file(path)
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end
Creates a new parser by parsing the specified IO object.
Parameters:
io - IO object that contains the HTML to parse.
base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values are the encoding constants defined on XML::Encoding.
options - Parser options. Valid values are the constants defined on XML::HTMLParser::Options. Multiple options can be combined using bitwise OR (|).
# File lib/libxml/html_parser.rb
def self.io(io, base_uri: nil, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.io(io)
  context.base_uri = base_uri if base_uri
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end
Initializes a new parser instance from an existing parser context. The context argument is required: calling XML::HTMLParser.new without one raises ArgumentError.
static VALUE rxml_html_parser_initialize(int argc, VALUE *argv, VALUE self)
{
  VALUE context = Qnil;

  rb_scan_args(argc, argv, "01", &context);

  if (context == Qnil)
  {
    rb_raise(rb_eArgError, "An instance of a XML::Parser::Context must be passed to XML::HTMLParser.new");
  }

  rb_ivar_set(self, CONTEXT_ATTR, context);
  return self;
}
Creates a new parser by parsing the specified string.
Parameters:
string - String to parse.
base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values are the encoding constants defined on XML::Encoding.
options - Parser options. Valid values are the constants defined on XML::HTMLParser::Options. Multiple options can be combined using bitwise OR (|).
# File lib/libxml/html_parser.rb
def self.string(string, base_uri: nil, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.string(string)
  context.base_uri = base_uri if base_uri
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end
Public Instance Methods
Parses the input HTML and creates an XML::Document with its content. If an error occurs and the context is not in recovery mode, XML::Parser::ParseError is raised.
static VALUE rxml_html_parser_parse(VALUE self)
{
  xmlParserCtxtPtr ctxt;
  VALUE context = rb_ivar_get(self, CONTEXT_ATTR);

  Data_Get_Struct(context, xmlParserCtxt, ctxt);

  if (htmlParseDocument(ctxt) == -1 && ! ctxt->recovery)
  {
    rxml_raise(&ctxt->lastError);
  }

  rb_funcall(context, rb_intern("close"), 0);

  return rxml_document_wrap(ctxt->myDoc);
}