document.html Module

HTML document readers.

wpull.document.html.COMMENT = <object object>

Comment element

class wpull.document.html.HTMLLightParserTarget(callback, text_elements=frozenset({'link', 'icon', 'style', 'script', 'url'}))

Bases: object

An HTML parser target for partial elements.

Parameters:
  • callback –

    A callback function. The function should accept these arguments:

      1. tag (str): The tag name of the element.
      2. attrib (dict): The attributes of the element.
      3. text (str, None): The text of the element.

  • text_elements – A frozenset of element tag names whose text should be tracked.
close()
data(data)
end(tag)
start(tag, attrib)
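
The following is a minimal sketch of how a "light" parser target with this method set might behave. It is not wpull's implementation: the class name LightTargetSketch, the reduced text_elements default, and the buffering logic are assumptions made for illustration. The standard library's XMLParser stands in for the HTML parser here only because it also drives targets through start(), data(), end() and close().

```python
import io
from xml.etree.ElementTree import XMLParser


class LightTargetSketch:
    """Hypothetical stand-in for a light parser target (not wpull's class)."""

    def __init__(self, callback, text_elements=frozenset({'script', 'style'})):
        self.callback = callback
        self.text_elements = text_elements
        self._tag = None
        self._attrib = None
        self._buffer = None

    def start(self, tag, attrib):
        if tag not in self.text_elements:
            # Text is not tracked for this element: report it right away.
            self.callback(tag, attrib, None)
            return
        # Text is tracked: buffer character data until the end tag.
        self._tag, self._attrib, self._buffer = tag, attrib, io.StringIO()

    def data(self, data):
        if self._buffer is not None:
            self._buffer.write(data)

    def end(self, tag):
        if self._buffer is not None and tag == self._tag:
            self.callback(self._tag, self._attrib, self._buffer.getvalue())
            self._buffer = None

    def close(self):
        return True


seen = []


def record(tag, attrib, text):
    # Callback with the three-argument shape described above.
    seen.append((tag, dict(attrib), text))


parser = XMLParser(target=LightTargetSketch(record))
parser.feed('<html><a href="/x">hi</a><script>var a = 1;</script></html>')
parser.close()
```

In this sketch, elements outside text_elements reach the callback with text set to None, while a script element arrives once with its buffered text.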
class wpull.document.html.HTMLParserTarget(callback)

Bases: object

An HTML parser target.

Parameters:callback – A callback function. The function should accept these arguments:

  1. tag (str): The tag name of the element.
  2. attrib (dict): The attributes of the element.
  3. text (str, None): The text of the element.
  4. tail (str, None): The text after the element.
  5. end (bool): Whether the tag is an end tag.
close()
comment(text)
data(data)
end(tag)
start(tag, attrib)
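
As a rough illustration of the five-argument callback, the sketch below emits one record per start tag (carrying the text that follows it) and one per end tag (carrying the tail that follows it). The class name TargetSketch, the flushing strategy, and the use of the COMMENT sentinel in comment() are assumptions, not wpull's actual code. The standard library's XMLParser is used only as a convenient driver for the target methods.

```python
from xml.etree.ElementTree import XMLParser

COMMENT = object()  # Local sentinel standing in for wpull.document.html.COMMENT.


class TargetSketch:
    """Hypothetical parser target reporting (tag, attrib, text, tail, end)."""

    def __init__(self, callback):
        self.callback = callback
        self._pending = None  # (tag, attrib, is_end) awaiting its text or tail
        self._chunks = []

    def _flush(self):
        text = ''.join(self._chunks) or None
        self._chunks = []
        if self._pending is None:
            return
        tag, attrib, is_end = self._pending
        self._pending = None
        if is_end:
            # Data seen after an end tag is that element's tail.
            self.callback(tag, None, None, text, True)
        else:
            # Data seen after a start tag is that element's text.
            self.callback(tag, attrib, text, None, False)

    def start(self, tag, attrib):
        self._flush()
        self._pending = (tag, dict(attrib), False)

    def end(self, tag):
        self._flush()
        self._pending = (tag, None, True)

    def data(self, data):
        self._chunks.append(data)

    def comment(self, text):
        self._flush()
        self.callback(COMMENT, None, text, None, None)

    def close(self):
        self._flush()
        return True


records = []
parser = XMLParser(target=TargetSketch(lambda *args: records.append(args)))
parser.feed('<p>Hello <b>bold</b> world</p>')
parser.close()
```

With this input the callback receives four records: the start of p with text 'Hello ', the start of b with text 'bold', the end of b with tail ' world', and the end of p with no tail.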
class wpull.document.html.HTMLReadElement(tag, attrib, text, tail, end)

Bases: object

Results from HTMLReader.read_links().

Attributes:
  • tag (str) – The element tag name.
  • attrib (dict) – The element attributes.
  • text (str, None) – The element text.
  • tail (str, None) – The text after the element.
  • end (bool) – Whether the tag is an end tag.

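
A short sketch of how records with these five fields might be consumed. ElementSketch is a hypothetical namedtuple standing in for HTMLReadElement, and iter_hrefs is an illustrative helper, not part of wpull.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field names as HTMLReadElement.
ElementSketch = namedtuple(
    'ElementSketch', ['tag', 'attrib', 'text', 'tail', 'end'])


def iter_hrefs(elements):
    """Yield (tag, href) for start-tag records that carry an href."""
    for element in elements:
        # End-tag records are skipped; only start-tag records are inspected.
        if element.end or not element.attrib:
            continue
        href = element.attrib.get('href')
        if href:
            yield element.tag, href


elements = [
    ElementSketch('a', {'href': '/about'}, 'About', None, False),
    ElementSketch('a', None, None, ' | ', True),
    ElementSketch('link', {'href': '/style.css', 'rel': 'stylesheet'},
                  None, None, False),
    ElementSketch('p', {}, 'No link here', None, False),
]
hrefs = list(iter_hrefs(elements))
```

Checking end first keeps the helper from looking up attributes on end-tag records, where attrib may be None.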
class wpull.document.html.HTMLReader(html_parser)

Bases: wpull.document.base.BaseDocumentDetector, wpull.document.base.BaseHTMLReader

HTML document reader.

Parameters:html_parser (document.htmlparse.BaseParser) – An HTML parser.
classmethod is_file(file)

Return whether the file is likely to be HTML.
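
One way such a "likely HTML" check could work is to peek at the start of the file for common HTML markers. The sketch below is an assumption for illustration only: looks_like_html, HTML_MARKERS and peek_size are hypothetical names, and the actual heuristic used by is_file() may differ.

```python
import io

# Hypothetical marker list; the real detector may use different markers.
HTML_MARKERS = (b'<!doctype html', b'<html', b'<head', b'<title')


def looks_like_html(file, peek_size=4096):
    """Guess whether a seekable binary file holds HTML, without consuming it."""
    position = file.tell()
    try:
        sample = file.read(peek_size).lower()
    finally:
        # Restore the read position so later readers see the whole file.
        file.seek(position)
    return any(marker in sample for marker in HTML_MARKERS)


html_file = io.BytesIO(b'<!DOCTYPE html><html><body></body></html>')
png_file = io.BytesIO(b'\x89PNG\r\n\x1a\n')
html_result = looks_like_html(html_file)
png_result = looks_like_html(png_file)
```

The result is only a guess: a text file that merely mentions one of the markers would also match.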

classmethod is_request(request)

Return whether the Request is likely to be HTML.

classmethod is_response(response)

Return whether the Response is likely to be HTML.

classmethod is_url(url_info)

Return whether the URLInfo is likely to be HTML.

iter_elements(file, encoding=None)