processor.rule Module
Fetching rules.
-
class wpull.processor.rule.FetchRule(url_filter: wpull.urlfilter.DemuxURLFilter=None, robots_txt_checker: wpull.protocol.http.robots.RobotsTxtChecker=None, http_login: typing.Union=None, ftp_login: typing.Union=None, duration_timeout: typing.Union=None)
Bases: wpull.application.hook.HookableMixin
Decide on what URLs should be fetched.
-
check_ftp_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple
Check URL filters and scripting hook.
Returns: (bool, str)
Return type: tuple
-
check_generic_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple
Check URL filters and scripting hook.
Returns: (bool, str)
Return type: tuple
-
check_initial_web_request(item_session: wpull.pipeline.session.ItemSession, request: wpull.protocol.http.request.Request) → typing.Tuple
Check robots.txt, URL filters, and scripting hook.
Returns: (bool, str)
Return type: tuple
Coroutine.
-
check_subsequent_web_request(item_session: wpull.pipeline.session.ItemSession, is_redirect: bool=False) → typing.Tuple
Check URL filters and scripting hook.
Returns: (bool, str)
Return type: tuple
-
consult_filters(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord, is_redirect: bool=False) → typing.Tuple
Consult the URL filter.
Parameters: - url_record – The URL record.
- is_redirect – Whether the request is a redirect and it is desired that it spans hosts.
Returns: tuple:
- bool: The verdict
- str: A short reason string: nofilters, filters, redirect
- dict: The result from DemuxURLFilter.test_info()
-
consult_hook(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reason: str, test_info: dict)
Consult the scripting hook.
Returns: (bool, str)
Return type: tuple
-
consult_robots_txt(request: wpull.protocol.http.request.Request) → bool
Consult by fetching robots.txt as needed.
Parameters: request – The request to be made to get the file.
Returns: True if the URL can be fetched.
Coroutine.
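The decision itself ("True if the URL can be fetched") can be illustrated with the standard library's `urllib.robotparser`. This is only a sketch of the robots.txt check, not Wpull's `RobotsTxtChecker`: the real method is a coroutine that fetches the file as needed, while the example below parses a hard-coded file, and the helper name `can_fetch` is made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (invented for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_fetch(ROBOTS_TXT, 'ExampleBot', 'https://example.com/public/page'))
print(can_fetch(ROBOTS_TXT, 'ExampleBot', 'https://example.com/private/page'))
```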
-
classmethod is_only_span_hosts_failed(test_info: dict) → bool
Return whether only the SpanHostsFilter failed.
-
static plugin_accept_url(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reasons: dict) → bool
Return whether to download this URL.
Parameters: - item_session – Current URL item.
- verdict – A bool indicating whether Wpull wants to download the URL.
- reasons – A dict containing information for the verdict:
  - filters (dict): A mapping (str to bool) from filter name to whether the filter passed or not.
  - reason (str): A short reason string. Current values are: filters, robots, redirect.
Returns: If True, the URL should be downloaded. Otherwise, the URL is skipped.
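A callback with this signature might look like the sketch below. It runs without Wpull: the item session is replaced by a `SimpleNamespace`, and the attribute path `item_session.url_record.url` is an assumption about what the session exposes. How the function gets registered with Wpull's plugin system is not shown here.

```python
from types import SimpleNamespace


def failed_filters(reasons: dict) -> list:
    """Names of the filters that did not pass, sorted for stable output."""
    filters = reasons.get('filters') or {}
    return sorted(name for name, passed in filters.items() if not passed)


def accept_url(item_session, verdict: bool, reasons: dict) -> bool:
    """Skip disk images; otherwise keep whatever Wpull decided."""
    # Assumption: the session exposes the URL as url_record.url.
    url = item_session.url_record.url
    if url.endswith('.iso'):
        return False
    return verdict


# Stand-in session object for demonstration only.
session = SimpleNamespace(
    url_record=SimpleNamespace(url='https://example.com/disk.iso'))
reasons = {'filters': {'span_hosts': True, 'recursive': False},
           'reason': 'filters'}
print(accept_url(session, True, reasons), failed_filters(reasons))
```

Returning `verdict` unchanged in the default case keeps the callback a pure veto: it can only narrow what Wpull would have downloaded anyway.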
-
-
class wpull.processor.rule.ProcessingRule(fetch_rule: wpull.processor.rule.FetchRule, document_scraper: wpull.scraper.base.DemuxDocumentScraper=None, sitemaps: bool=False, url_rewriter: wpull.urlrewrite.URLRewriter=None)
Bases: wpull.application.hook.HookableMixin
Document processing rules.
Parameters: - fetch_rule – The FetchRule instance.
- document_scraper – The document scraper.
-
add_extra_urls(item_session: wpull.pipeline.session.ItemSession)
Add additional URLs such as robots.txt, favicon.ico.
-
static parse_url(url, encoding='utf-8')
Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
-
static plugin_get_urls(item_session: wpull.pipeline.session.ItemSession)
Add additional URLs to the URL Table.
When this event is dispatched, the caller should add any URLs needed using ItemSession.add_child_url().
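A handler for this event could be sketched as follows. `FakeItemSession` is a stand-in written for this example: it only mimics a one-argument `add_child_url()` call and a `url_record.url` attribute (the latter is an assumption), so the snippet runs without Wpull installed. The `feed.xml` convention is likewise invented.

```python
from types import SimpleNamespace
from urllib.parse import urljoin


class FakeItemSession:
    """Stand-in for ItemSession that only records the URLs added to it."""

    def __init__(self, url: str):
        self.url_record = SimpleNamespace(url=url)
        self.children = []

    def add_child_url(self, url: str) -> None:
        self.children.append(url)


def get_urls(item_session) -> None:
    """Queue a feed that a hypothetical site keeps next to every page."""
    feed_url = urljoin(item_session.url_record.url, 'feed.xml')
    item_session.add_child_url(feed_url)


session = FakeItemSession('https://example.com/blog/post')
get_urls(session)
print(session.children)
```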
-
class wpull.processor.rule.ResultRule(ssl_verification: bool=False, retry_connrefused: bool=False, retry_dns_error: bool=False, waiter: typing.Union=None, statistics: typing.Union=None)
Bases: wpull.application.hook.HookableMixin
Decide on the results of a fetch.
Parameters: - ssl_verification – If True, don’t ignore certificate errors.
- retry_connrefused – If True, don’t consider a connection refused error to be a permanent error.
- retry_dns_error – If True, don’t consider a DNS resolution error to be a permanent error.
- waiter – The Waiter.
- statistics – The Statistics.
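One way to read the two retry flags is as switches that decide whether a given error class counts as permanent. The sketch below illustrates that reading with built-in exceptions as stand-ins (`ConnectionRefusedError` and `socket.gaierror`). Wpull defines its own exception types, and the helper `is_permanent_error` is hypothetical, so treat this as an illustration of the idea rather than the library's logic.

```python
import socket


def is_permanent_error(error: BaseException,
                       retry_connrefused: bool = False,
                       retry_dns_error: bool = False) -> bool:
    """Classify an error as permanent unless the matching flag allows retry."""
    if isinstance(error, ConnectionRefusedError):
        return not retry_connrefused
    if isinstance(error, socket.gaierror):
        return not retry_dns_error
    # Assumption for this sketch: every other error is worth retrying.
    return False


print(is_permanent_error(ConnectionRefusedError()))
print(is_permanent_error(ConnectionRefusedError(), retry_connrefused=True))
```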
-
consult_error_hook(item_session: wpull.pipeline.session.ItemSession, error: BaseException)
Return scripting action when an error occurred.
-
consult_pre_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Return scripting action when a response begins.
-
consult_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Return scripting action when a response ends.
-
get_wait_time(item_session: wpull.pipeline.session.ItemSession, error=None)
Return the wait time in seconds between requests.
-
handle_document(item_session: wpull.pipeline.session.ItemSession, filename: str) → wpull.application.hook.Actions
Process a successful document response.
Returns: A value from hook.Actions.
-
handle_document_error(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Callback for when the document only describes a server error.
Returns: A value from hook.Actions.
-
handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions
Process an error.
Returns: A value from hook.Actions.
-
handle_intermediate_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Callback for successful intermediate responses.
Returns: A value from hook.Actions.
-
handle_no_document(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Callback for successful responses containing no useful document.
Returns: A value from hook.Actions.
-
handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Process a response that is starting.
-
handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Generic handler for a response.
Returns: A value from hook.Actions.
-
static plugin_handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions
Return an action to handle the error.
Parameters: - item_session –
- error –
Returns: A value from Actions. The default is Actions.NORMAL.
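An error callback of this shape might be sketched as below. The `Actions` enum here is a local stand-in so the example runs on its own. Only `NORMAL` is named in the text above; the `RETRY` member and both string values are assumptions for illustration, as is the choice to retry on `TimeoutError`.

```python
import enum


class Actions(enum.Enum):
    """Stand-in for wpull.application.hook.Actions (members assumed)."""
    NORMAL = 'normal'
    RETRY = 'retry'


def handle_error(item_session, error: BaseException) -> Actions:
    """Ask for a retry on timeouts; otherwise defer to default handling."""
    if isinstance(error, TimeoutError):
        return Actions.RETRY
    return Actions.NORMAL


# item_session is unused by this sketch, so None is enough for a demo.
print(handle_error(None, TimeoutError('read timed out')))
print(handle_error(None, ValueError('unexpected')))
```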
-
static plugin_handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions
Return an action to handle a response status before a download.
Parameters: item_session –
Returns: A value from Actions. The default is Actions.NORMAL.