document.htmlparse.lxml_ Module
Parsing using lxml and libxml2.
- class wpull.document.htmlparse.lxml_.HTMLParser
Bases: wpull.document.htmlparse.base.BaseParser
HTML document parser.
This reader uses lxml as the parser.
- BUFFER_SIZE = 131072
- classmethod detect_parser_type(file, encoding=None)
Get the suitable parser type for the document.
Returns: str
- classmethod parse_doctype(file, encoding=None)
Get the doctype from the document.
Returns: str, None
- parse_lxml(file, encoding=None, target_class=wpull.document.htmlparse.lxml_.HTMLParserTarget, parser_type='html')
Return an iterator of elements found in the document.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
- target_class – A class to be used for target parsing.
- parser_type (str) – The type of parser to use. Accepted values: html, xhtml, xml.
Returns: Each item is an element from document.htmlparse.element
Return type: iterator
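The chunked, iterator-style parsing that parse_lxml describes can be illustrated with a minimal sketch. This is not wpull's implementation: it swaps in the standard library's xml.etree.ElementTree.XMLPullParser instead of lxml, and the function name iter_elements and the (event, tag) tuples are hypothetical. It only shows the general idea of reading a file object in BUFFER_SIZE-sized chunks and yielding items as they become available.

```python
import io
import xml.etree.ElementTree as ET

# Same value as HTMLParser.BUFFER_SIZE; used here only as a chunk size.
BUFFER_SIZE = 131072


def iter_elements(file, buffer_size=BUFFER_SIZE):
    """Yield (event, tag) pairs while reading `file` in fixed-size chunks.

    Hypothetical helper: a stdlib stand-in, not the wpull API.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    while True:
        chunk = file.read(buffer_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over whatever events the parser has produced so far.
        for event, element in parser.read_events():
            yield event, element.tag
    parser.close()
    # Drain events that only became available once the stream ended.
    for event, element in parser.read_events():
        yield event, element.tag


# A tiny buffer size forces several feed() calls on this short document.
document = io.BytesIO(b"<html><body><a href='/x'>link</a></body></html>")
events = list(iter_elements(document, buffer_size=16))
```

Because the function is a generator, the caller can stop consuming early without the rest of the file being read.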
- parser_error
- class wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)
Bases: object
An HTML parser target.
Parameters: callback – A callback function. The function should accept one argument from document.htmlparse.element.
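As a rough illustration of the parser-target pattern, the sketch below forwards each parse event to a single-argument callback. It is an assumption-laden stand-in, not wpull's HTMLParserTarget: the class name CallbackTarget is hypothetical, the callback receives plain tuples rather than objects from document.htmlparse.element, and the standard library's xml.etree.ElementTree.XMLParser is used in place of lxml (both accept a target object with start, data, end and close methods).

```python
import xml.etree.ElementTree as ET


class CallbackTarget:
    """Hypothetical parser target that forwards every event to a callback."""

    def __init__(self, callback):
        self.callback = callback

    def start(self, tag, attrib):
        # Called when an opening tag is seen.
        self.callback(("start", tag, dict(attrib)))

    def data(self, text):
        # Called for character data; may arrive in more than one piece.
        self.callback(("data", text))

    def end(self, tag):
        # Called when a closing tag is seen.
        self.callback(("end", tag))

    def close(self):
        # Called once parsing finishes; nothing to return in this sketch.
        return None


received = []
parser = ET.XMLParser(target=CallbackTarget(received.append))
parser.feed("<p class='note'>hi</p>")
parser.close()
```

Passing received.append as the callback collects the events in a list; any callable that takes one argument would work the same way.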