document.base Module
Document bases.
-
class wpull.document.base.BaseDocumentDetector
Bases: object
Base class for classes that detect document types.
-
classmethod is_file(file)
Return whether the reader is likely able to read the file.
Parameters: file – A file object containing the document.
Returns: bool
-
classmethod is_request(request)
Return whether the request is likely supported.
Parameters: request (http.request.Request) – An HTTP request.
Returns: bool
-
classmethod is_response(response)
Return whether the response is likely able to be read.
Parameters: response (http.request.Response) – An HTTP response.
Returns: bool
-
classmethod is_supported(file=None, request=None, response=None, url_info=None)
Given the hints, return whether the document is supported.
Parameters:
- file – A file object containing the document.
- request (http.request.Request) – An HTTP request.
- response (http.request.Response) – An HTTP response.
- url_info (url.URLInfo) – A URLInfo.
Returns: If True, the reader should be able to read it.
Return type: bool
-
classmethod is_url(url_info)
Return whether the URL is likely to be supported.
Parameters: url_info (url.URLInfo) – A URLInfo.
Returns: bool
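The detector interface above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `BaseDocumentDetector` class below is a stand-in that only mirrors the documented method names (it is not wpull's code), `CSSDetector` is a hypothetical subclass, the order in which `is_supported` tries the hints is a guess, and a `SimpleNamespace` with a `path` attribute stands in for a `url.URLInfo`.

```python
import io
from types import SimpleNamespace


class BaseDocumentDetector:
    """Stand-in mirroring the documented interface (not wpull's code)."""

    @classmethod
    def is_file(cls, file):
        raise NotImplementedError()

    @classmethod
    def is_request(cls, request):
        raise NotImplementedError()

    @classmethod
    def is_response(cls, response):
        raise NotImplementedError()

    @classmethod
    def is_url(cls, url_info):
        raise NotImplementedError()

    @classmethod
    def is_supported(cls, file=None, request=None, response=None,
                     url_info=None):
        # Assumed behaviour: try every hint that was given and
        # accept the document as soon as one check answers yes.
        checks = (
            (cls.is_response, response),
            (cls.is_file, file),
            (cls.is_request, request),
            (cls.is_url, url_info),
        )
        for check, hint in checks:
            if hint is None:
                continue
            try:
                if check(hint):
                    return True
            except NotImplementedError:
                # The subclass does not implement this check; skip it.
                continue
        return False


class CSSDetector(BaseDocumentDetector):
    """Hypothetical detector using a crude heuristic for CSS."""

    @classmethod
    def is_file(cls, file):
        # Peek at the head of the file, then restore the position.
        position = file.tell()
        head = file.read(4096)
        file.seek(position)
        return b'{' in head and b'}' in head

    @classmethod
    def is_url(cls, url_info):
        return url_info.path.lower().endswith('.css')


print(CSSDetector.is_supported(url_info=SimpleNamespace(path='/site.css')))
print(CSSDetector.is_supported(file=io.BytesIO(b'body { color: red }')))
```

A subclass only overrides the checks it can answer; in this sketch the remaining ones raise `NotImplementedError` and are skipped by `is_supported`.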
-
class wpull.document.base.BaseExtractiveReader
Bases: object
Base class for document readers that can only extract links.
-
class wpull.document.base.BaseHTMLReader
Bases: object
Base class for document readers for handling SGML-like documents.
-
iter_elements(file, encoding=None)
Return an iterator of elements found in the document.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is an element from document.htmlparse.element
Return type: iterator
-
class wpull.document.base.BaseTextStreamReader
Bases: object
Base class for document readers that filter link and non-link text.
-
iter_links(file, encoding=None, context=False)
Return the links.
This function is a convenience function for calling iter_text() and returning only the links.
-
iter_text(file, encoding=None)
Return the file text and links.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is a tuple:
- str: The text.
- bool (or truthy value): Whether the text is likely a link. A truthy value containing additional context about the link may be provided instead of True.
Return type: iterator
The links returned are raw text and will require further processing.
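The relationship between iter_text() and iter_links() can be sketched as follows. This is an illustration under stated assumptions, not wpull's implementation: `BaseTextStreamReader` here is a stand-in mirroring the documented method names, `URLTextReader` and its regular expression are hypothetical, and the behaviour of `context=True` (yielding the text together with the truthy link value) is a guess.

```python
import io
import re


class BaseTextStreamReader:
    """Stand-in mirroring the documented interface (not wpull's code)."""

    def iter_text(self, file, encoding=None):
        raise NotImplementedError()

    def iter_links(self, file, encoding=None, context=False):
        # Keep only the items that iter_text() flagged as links.
        for text, link_info in self.iter_text(file, encoding=encoding):
            if link_info:
                # Assumption: with context=True the flag is passed along.
                yield (text, link_info) if context else text


class URLTextReader(BaseTextStreamReader):
    """Hypothetical reader treating http(s) URLs in plain text as links."""

    URL_PATTERN = re.compile(r'(https?://[^\s<>"]+)')

    def iter_text(self, file, encoding=None):
        text = file.read().decode(encoding or 'utf-8', errors='replace')
        # re.split() with one capturing group alternates between
        # non-matching text (even index) and matches (odd index).
        for index, piece in enumerate(self.URL_PATTERN.split(text)):
            if piece:
                yield piece, index % 2 == 1


reader = URLTextReader()
data = b'See http://example.com/a and https://example.org/b now'
for text, is_link in reader.iter_text(io.BytesIO(data)):
    print(repr(text), is_link)
print(list(reader.iter_links(io.BytesIO(data))))
```

As the documentation notes, the yielded links are raw text; resolving them against a base URL or unescaping them would be a separate step.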
-
wpull.document.base.VeryFalse = <wpull.document.base.VeryFalseType object>
Document is definitely not supported.
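One way such a sentinel could work is sketched below. This assumes VeryFalse is a falsy singleton that lets a detector signal a stronger "no" than a plain False; the class body and the `describe` helper are hypothetical and are not taken from wpull.

```python
class VeryFalseType:
    """Hypothetical sentinel type: falsy, but distinguishable from False."""

    def __bool__(self):
        return False


VeryFalse = VeryFalseType()


def describe(result):
    """Hypothetical helper interpreting a detector's return value."""
    # An identity check separates the sentinel from an ordinary False.
    if result is VeryFalse:
        return 'definitely not supported'
    if result:
        return 'likely supported'
    return 'not detected'


print(describe(True))
print(describe(False))
print(describe(VeryFalse))
```

Because the sentinel is falsy, callers that only test truthiness keep working, while callers that care about the stronger answer can compare with `is`.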