processor.rule Module

Fetching rules.

class wpull.processor.rule.FetchRule(url_filter: wpull.urlfilter.DemuxURLFilter=None, robots_txt_checker: wpull.protocol.http.robots.RobotsTxtChecker=None, http_login: typing.Union=None, ftp_login: typing.Union=None, duration_timeout: typing.Union=None)[source]

Bases: wpull.application.hook.HookableMixin

Decide on what URLs should be fetched.

check_ftp_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple

Check URL filters and scripting hook.

Returns: (bool, str)
Return type: tuple

check_generic_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple[source]

Check URL filters and scripting hook.

Returns: (bool, str)
Return type: tuple

check_initial_web_request(item_session: wpull.pipeline.session.ItemSession, request: wpull.protocol.http.request.Request) → typing.Tuple[source]

Check robots.txt, URL filters, and scripting hook.

Returns: (bool, str)
Return type: tuple

Coroutine.
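
How these checks might combine into the (bool, str) result can be sketched as follows. This is a minimal illustration, not wpull's implementation: the three callables are hypothetical stand-ins for consult_filters(), consult_robots_txt(), and consult_hook(), and the order in which the filters and robots.txt are consulted is an assumption.

```python
import asyncio


async def check_initial_web_request_sketch(consult_filters,
                                           consult_robots_txt,
                                           consult_hook):
    """Combine filter, robots.txt, and hook checks into (verdict, reason).

    All three arguments are hypothetical stand-ins, not wpull objects.
    """
    verdict, reason, test_info = consult_filters()

    # Only spend a robots.txt lookup on URLs the filters already accept.
    if verdict and not await consult_robots_txt():
        verdict, reason = False, 'robots'

    # The scripting hook sees the tentative verdict and may override it.
    return consult_hook(verdict, reason, test_info)


async def robots_disallow():
    """Stand-in robots.txt check that always forbids the fetch."""
    return False


result = asyncio.run(check_initial_web_request_sketch(
    consult_filters=lambda: (True, 'filters', {}),
    consult_robots_txt=robots_disallow,
    consult_hook=lambda verdict, reason, test_info: (verdict, reason),
))
print(result)
```

Passing the checks in as callables keeps the sketch independent of any wpull objects; in the class itself they are methods on FetchRule.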

check_subsequent_web_request(item_session: wpull.pipeline.session.ItemSession, is_redirect: bool=False) → typing.Tuple[source]

Check URL filters and scripting hook.

Returns: (bool, str)
Return type: tuple

consult_filters(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord, is_redirect: bool=False) → typing.Tuple[source]

Consult the URL filter.

Parameters:
  • url_info – The parsed URL to test.
  • url_record – The URL record.
  • is_redirect – Whether the request is a redirect that should be allowed to span hosts.
Returns: tuple:

  1. bool: The verdict
  2. str: A short reason string: nofilters, filters, redirect
  3. dict: The result from DemuxURLFilter.test_info()

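
The three reason strings can be illustrated with a small self-contained sketch. The layout of the test_info dict used here (verdict, failed, and map keys), the filter names, and the helper names are assumptions made for the example, not a statement of what DemuxURLFilter.test_info() returns.

```python
def only_span_hosts_failed(test_info):
    """Return True if SpanHostsFilter is the single failing filter."""
    return (len(test_info['failed']) == 1
            and test_info['map'].get('SpanHostsFilter') is False)


def consult_filters_sketch(test_info, is_redirect=False):
    """Return (verdict, reason, test_info) for an assumed test_info dict."""
    if test_info is None:
        # No URL filter configured: everything is allowed.
        return True, 'nofilters', None
    if test_info['verdict']:
        return True, 'filters', test_info
    if is_redirect and only_span_hosts_failed(test_info):
        # A redirect is allowed to leave the original host.
        return True, 'redirect', test_info
    return False, 'filters', test_info


# Hypothetical filter result: only the span-hosts check failed.
off_host = {
    'verdict': False,
    'failed': ['SpanHostsFilter'],
    'map': {'SpanHostsFilter': False, 'SchemeFilter': True},
}
print(consult_filters_sketch(off_host)[:2])
print(consult_filters_sketch(off_host, is_redirect=True)[:2])
```

In this sketch the same off-host URL is rejected as an ordinary link but accepted when it is reached through a redirect.
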
consult_helix_fossil() → bool[source]

Consult the helix fossil.

Returns: True if the URL can be fetched.

consult_hook(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reason: str, test_info: dict)[source]

Consult the scripting hook.

Returns: (bool, str)
Return type: tuple

consult_robots_txt(request: wpull.protocol.http.request.Request) → bool[source]

Consult by fetching robots.txt as needed.

Parameters: request – The request to be made to get the file.
Returns: True if the URL can be fetched.

Coroutine.

classmethod is_only_span_hosts_failed(test_info: dict) → bool[source]

Return whether only the SpanHostsFilter failed.

static plugin_accept_url(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reasons: dict) → bool[source]

Return whether to download this URL.

Parameters:
  • item_session – Current URL item.
  • verdict – A bool indicating whether Wpull wants to download the URL.
  • reasons – A dict containing information for the verdict:

    • filters (dict): A mapping (str to bool) from filter name to whether the filter passed or not.
    • reason (str): A short reason string. Current values are: filters, robots, redirect.
Returns: If True, the URL should be downloaded. Otherwise, the URL is skipped.
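
A callback for this hook might look like the sketch below. It relies only on the verdict and reasons arguments documented above; item_session is left unused, the policy itself is just an example, and how the function is registered with Wpull is not shown.

```python
def failed_filters(reasons):
    """List the names of the filters that did not pass."""
    filters = reasons.get('filters') or {}
    return sorted(name for name, passed in filters.items() if not passed)


def accept_url(item_session, verdict, reasons):
    """Follow Wpull's verdict, except for URLs accepted only as redirects."""
    if not verdict:
        # Keep Wpull's rejections (failed filters or robots.txt) as they are.
        print('skipped; failed filters:', failed_filters(reasons))
        return False
    if reasons.get('reason') == 'redirect':
        # Example policy: do not follow redirects that needed an exception.
        return False
    return True
```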

class wpull.processor.rule.ProcessingRule(fetch_rule: wpull.processor.rule.FetchRule, document_scraper: wpull.scraper.base.DemuxDocumentScraper=None, sitemaps: bool=False, url_rewriter: wpull.urlrewrite.URLRewriter=None)[source]

Bases: wpull.application.hook.HookableMixin

Document processing rules.

Parameters:
  • fetch_rule – The FetchRule instance.
  • document_scraper – The document scraper.
add_extra_urls(item_session: wpull.pipeline.session.ItemSession)[source]

Add additional URLs such as robots.txt, favicon.ico.

static parse_url(url, encoding='utf-8')

Parse and return a URLInfo.

This function logs a warning if the URL cannot be parsed and returns None.
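
The warn-and-return-None behaviour can be sketched with the standard library alone. The example below uses urllib.parse.urlsplit as a stand-in for wpull's URLInfo parsing and ignores the encoding argument; the function and logger names are hypothetical.

```python
import logging
from urllib.parse import urlsplit

_logger = logging.getLogger(__name__)


def parse_url_or_none(url):
    """Parse url with the standard library; warn and return None on failure."""
    try:
        return urlsplit(url)
    except ValueError as error:
        _logger.warning('Discarding malformed URL %r: %s', url, error)
        return None


print(parse_url_or_none('http://example.com/a').hostname)
# An unbalanced IPv6 bracket makes urlsplit raise ValueError.
print(parse_url_or_none('http://[::1/'))
```

Callers then only need to test for None instead of handling a parsing exception.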

static plugin_get_urls(item_session: wpull.pipeline.session.ItemSession)[source]

Add additional URLs to the URL Table.

When this event is dispatched, the caller should add any URLs needed using ItemSession.add_child_url().

rewrite_url(url_info: wpull.url.URLInfo) → wpull.url.URLInfo[source]

Return a rewritten URL such as escaped fragment.
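
As an illustration of the kind of rewrite meant here, the sketch below converts a hash-bang (#!) fragment into an _escaped_fragment_ query parameter, following the convention of the AJAX crawling scheme. It uses only the standard library and plain strings rather than URLInfo objects, it is not wpull's URLRewriter, and its percent-encoding is broader than the scheme requires; the function name is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def rewrite_hashbang(url):
    """Move a '#!' fragment into an '_escaped_fragment_' query parameter."""
    parts = urlsplit(url)
    if not parts.fragment.startswith('!'):
        # Ordinary fragments are left untouched.
        return url
    escaped = quote(parts.fragment[1:], safe='=')
    query = parts.query + '&' if parts.query else ''
    query += '_escaped_fragment_=' + escaped
    # The fragment is dropped because its content now lives in the query.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))


print(rewrite_hashbang('http://example.com/#!page=1'))
```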

scrape_document(item_session: wpull.pipeline.session.ItemSession)[source]

Process document for links.

class wpull.processor.rule.ResultRule(ssl_verification: bool=False, retry_connrefused: bool=False, retry_dns_error: bool=False, waiter: typing.Union=None, statistics: typing.Union=None)[source]

Bases: wpull.application.hook.HookableMixin

Decide on the results of a fetch.

Parameters:
  • ssl_verification – If True, certificate errors are not ignored.
  • retry_connrefused – If True, a connection refused error is not considered a permanent error.
  • retry_dns_error – If True, a DNS resolution error is not considered a permanent error.
  • waiter – The Waiter.
  • statistics – The Statistics.
consult_error_hook(item_session: wpull.pipeline.session.ItemSession, error: BaseException)[source]

Return scripting action when an error occurred.

consult_pre_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return scripting action when a response begins.

consult_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return scripting action when a response ends.

get_wait_time(item_session: wpull.pipeline.session.ItemSession, error=None)[source]

Return the wait time in seconds between requests.

handle_document(item_session: wpull.pipeline.session.ItemSession, filename: str) → wpull.application.hook.Actions[source]

Process a successful document response.

Returns: A value from hook.Actions.

handle_document_error(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for when the document only describes a server error.

Returns: A value from hook.Actions.

handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]

Process an error.

Returns: A value from hook.Actions.

handle_intermediate_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for successful intermediate responses.

Returns: A value from hook.Actions.

handle_no_document(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for successful responses containing no useful document.

Returns: A value from hook.Actions.

handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Process a response that is starting.

handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Generic handler for a response.

Returns: A value from hook.Actions.

static plugin_handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]

Return an action to handle the error.

Parameters:
  • item_session
  • error
Returns: A value from Actions. The default is Actions.NORMAL.
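
A callback for this hook could map error types to actions, as in the sketch below. The Actions enum defined here is a local stand-in so the example runs without wpull; only NORMAL is confirmed by the text above, and the other member names are assumptions. The built-in exception types are likewise placeholders for whatever errors Wpull passes in.

```python
import enum
import errno


class Actions(enum.Enum):
    """Stand-in for wpull.application.hook.Actions (member set assumed)."""
    NORMAL = 'normal'
    RETRY = 'retry'
    FINISH = 'finish'
    STOP = 'stop'


def handle_error(item_session, error):
    """Retry timeouts, stop when the disk is full, else defer to Wpull."""
    if isinstance(error, TimeoutError):
        return Actions.RETRY
    if isinstance(error, OSError) and error.errno == errno.ENOSPC:
        return Actions.STOP
    # NORMAL leaves the decision to Wpull's default handling.
    return Actions.NORMAL
```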

static plugin_handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return an action to handle a response status before a download.

Parameters: item_session
Returns: A value from Actions. The default is Actions.NORMAL.

static plugin_handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return an action to handle the response.

Parameters: item_session
Returns: A value from Actions. The default is Actions.NORMAL.

static plugin_wait_time(seconds: float, item_session: wpull.pipeline.session.ItemSession, error: typing.Union=None) → float[source]

Return the wait time between requests.

Parameters:
  • seconds – The original time in seconds.
  • item_session
  • error
Returns: The time in seconds.
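
A callback for this hook might lengthen the wait after a failed request, as in the following sketch. The back-off policy (doubling with a one-second floor and a one-minute cap) is an arbitrary choice for illustration, not Wpull's default behaviour, and item_session is unused.

```python
def wait_time(seconds, item_session, error=None):
    """Double the proposed wait after an error, capped at one minute."""
    if error is None:
        # Successful requests keep the wait time Wpull proposed.
        return seconds
    return min(max(seconds, 1.0) * 2, 60.0)


print(wait_time(0.5, None))
print(wait_time(0.5, None, error=OSError('connection reset')))
```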