document.base Module

Document bases.

class wpull.document.base.BaseDocumentDetector

Bases: object

Base class for classes that detect document types.

classmethod is_file(file)

Return whether the reader is likely able to read the file.

Parameters: file – A file object containing the document.
Returns: bool

classmethod is_request(request)

Return whether the request is likely supported.

Parameters: request (http.request.Request) – An HTTP request.
Returns: bool

classmethod is_response(response)

Return whether the response is likely able to be read.

Parameters: response (http.request.Response) – An HTTP response.
Returns: bool

classmethod is_supported(file=None, request=None, response=None, url_info=None)

Given the hints, return whether the document is supported.

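Parameters:
  • file – A file object containing the document.
  • request (http.request.Request) – An HTTP request.
  • response (http.request.Response) – An HTTP response.
  • url_info (url.URLInfo) – A URLInfo.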
Returns:

If True, the reader should be able to read it.

Return type:

bool

classmethod is_url(url_info)

Return whether the URL is likely to be supported.

Parameters: url_info (url.URLInfo) – A URLInfo.
Returns: bool
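
The detector interface above can be illustrated with a minimal, self-contained sketch. It does not import wpull: PlainTextDetector, FakeURLInfo, and the rule that any single positive hint is enough are assumptions made for illustration, not the library's implementation. Only the method names and the is_supported() signature come from the documentation above.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for url.URLInfo; only a ``path`` field is assumed here.
FakeURLInfo = namedtuple("FakeURLInfo", ["path"])


class PlainTextDetector:
    """Hypothetical detector that mirrors the documented method names."""

    @classmethod
    def is_file(cls, file):
        # Peek at the first bytes, then restore the read position so that
        # a reader can still consume the file from where it was.
        position = file.tell()
        sample = file.read(1024)
        file.seek(position)
        return b"\x00" not in sample

    @classmethod
    def is_url(cls, url_info):
        return url_info.path.lower().endswith(".txt")

    @classmethod
    def is_supported(cls, file=None, request=None, response=None,
                     url_info=None):
        # Assumed combination rule: any positive hint is enough.
        # The request and response hints are ignored in this sketch.
        if file is not None and cls.is_file(file):
            return True
        if url_info is not None and cls.is_url(url_info):
            return True
        return False


text_file = io.BytesIO(b"hello world\n")
binary_file = io.BytesIO(b"\x00\x01\x02")

print(PlainTextDetector.is_supported(file=text_file))
print(PlainTextDetector.is_supported(file=binary_file))
print(PlainTextDetector.is_supported(url_info=FakeURLInfo("/notes/README.TXT")))
```

Because every hint is optional, a caller can pass whichever of the four it has on hand at that point in the fetch.
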

class wpull.document.base.BaseExtractiveReader

Bases: object

Base class for document readers that can only extract links.

Return links from file.

Returns: Each item is a str which represents a link.
Return type: iterator
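
A reader of this kind might look like the following sketch. It is not wpull code: SimpleLinkReader, the method name iter_links, its parameters, and the URL pattern are all assumptions chosen for illustration. The only property taken from the documentation is that the result is an iterator whose items are str links.

```python
import io
import re

# Deliberately simple pattern (an assumption); real link extraction is
# usually format-specific.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


class SimpleLinkReader:
    """Hypothetical extract-only reader: it yields links and nothing else."""

    def iter_links(self, file, encoding="utf-8"):
        # Decode line by line so large files are not loaded all at once.
        for raw_line in file:
            line = raw_line.decode(encoding, errors="replace")
            for match in URL_PATTERN.finditer(line):
                yield match.group(0)


sample = io.BytesIO(
    b"See http://example.com/a and https://example.org/b?x=1\n"
    b"no links here\n"
)
links = list(SimpleLinkReader().iter_links(sample))
print(links)
```
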

class wpull.document.base.BaseHTMLReader

Bases: object

Base class for document readers for handling SGML-like documents.

iter_elements(file, encoding=None)

Return an iterator of elements found in the document.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
Returns:

Each item is an element from document.htmlparse.element

Return type:

iterator
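
The shape of iter_elements() can be sketched with the standard library's html.parser. This is an illustration only: the Element tuple below is a hypothetical stand-in and does not reproduce the element classes in document.htmlparse.element, and the UTF-8 default is an assumption.

```python
import io
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical element type; the real element classes may differ.
Element = namedtuple("Element", ["tag", "attrib"])


class _StartTagCollector(HTMLParser):
    """Collects one Element per start tag encountered."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append(Element(tag, dict(attrs)))


class SimpleHTMLReader:
    def iter_elements(self, file, encoding=None):
        # Assumption: fall back to UTF-8 when no encoding is given.
        text = file.read().decode(encoding or "utf-8", errors="replace")
        collector = _StartTagCollector()
        collector.feed(text)
        collector.close()
        yield from collector.elements


document = io.BytesIO(b'<p>Read <a href="/x">this</a></p>')
elements = list(SimpleHTMLReader().iter_elements(document))
print([element.tag for element in elements])
```
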

class wpull.document.base.BaseTextStreamReader

Bases: object

Base class for document readers that filter link and non-link text.

Return the links.

This function is a convenience function for calling iter_text() and returning only the links.

iter_text(file, encoding=None)

Return the file text and links.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
Returns:

Each item is a tuple:

  1. str: The text
  2. bool (or truthy value): Whether the text is likely a link. A truthy value containing additional context about the link may be provided instead.

Return type:

iterator

The links returned are raw text and will require further processing.
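
The relationship between iter_text() and the link-only convenience function can be sketched as below. This is not wpull's implementation: SimpleTextStreamReader, the name iter_links for the convenience function, the URL pattern, and the UTF-8 default are assumptions. The (text, is_link) tuple shape follows the description above.

```python
import io
import re

# The capturing group makes re.split() keep the matched links in its output.
URL_PATTERN = re.compile(r"(https?://[^\s\"'<>]+)")


class SimpleTextStreamReader:
    """Hypothetical reader that yields (text, is_link) tuples."""

    def iter_text(self, file, encoding=None):
        text = file.read().decode(encoding or "utf-8", errors="replace")
        # With one capturing group, the split result alternates between
        # non-link text (even indexes) and link text (odd indexes).
        for index, piece in enumerate(URL_PATTERN.split(text)):
            if piece:
                yield piece, index % 2 == 1

    def iter_links(self, file, encoding=None):
        # Convenience wrapper: keep only the items flagged as links.
        return (text for text, is_link in self.iter_text(file, encoding)
                if is_link)


reader = SimpleTextStreamReader()
pieces = list(reader.iter_text(io.BytesIO(b"Visit http://example.com/ now")))
only_links = list(reader.iter_links(io.BytesIO(b"Visit http://example.com/ now")))
print(pieces)
print(only_links)
```

The links come back exactly as they appear in the text, so a caller would still need to resolve and validate them.
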

wpull.document.base.VeryFalse = <wpull.document.base.VeryFalseType object>

Document is definitely not supported.

class wpull.document.base.VeryFalseType

Bases: object
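
A sentinel such as VeryFalse is useful when a detector needs a stronger negative than a plain False. The sketch below uses stand-in definitions rather than importing wpull. That the sentinel is falsy (via __bool__) and is told apart from False by identity is an assumption about how it is meant to be used, and describe() is a hypothetical helper.

```python
class VeryFalseType:
    """Stand-in sentinel type: falsy, but distinguishable from False."""

    def __bool__(self):
        return False


VeryFalse = VeryFalseType()


def describe(result):
    # The identity check must come first, because VeryFalse is also falsy.
    if result is VeryFalse:
        return "definitely unsupported"
    if result:
        return "supported"
    return "no positive hint"


print(describe(True))
print(describe(False))
print(describe(VeryFalse))
```
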