database.base Module

Base table class.

wpull.database.base.AddURLInfo

alias of _AddURLInfo

class wpull.database.base.BaseURLTable

Bases: object

URL table.

add_many(new_urls: typing.Iterator) → typing.Iterator

Add the URLs to the table.

Parameters: new_urls – URLs to be added.
Returns: The URLs added. Useful for tracking duplicates.
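The return value can be illustrated with a minimal sketch. `ToyURLTable` below is a hypothetical in-memory stand-in, not wpull's implementation, and it takes plain URL strings for simplicity. It assumes that "the URLs added" means only the URLs that were not already in the table, which is what makes the result useful for spotting duplicates.

```python
from typing import Dict, Iterable, Iterator


class ToyURLTable:
    """Hypothetical in-memory stand-in; not wpull's implementation."""

    def __init__(self) -> None:
        self._urls: Dict[str, str] = {}  # url -> status name

    def add_many(self, new_urls: Iterable[str]) -> Iterator[str]:
        # Yield only the URLs that were actually inserted, so the
        # caller can work out which inputs were duplicates.
        for url in new_urls:
            if url not in self._urls:
                self._urls[url] = "todo"
                yield url


table = ToyURLTable()
first = list(table.add_many(["http://example.com/", "http://example.com/a"]))

batch = ["http://example.com/a", "http://example.com/b"]
second = list(table.add_many(batch))

# Anything requested but not returned was already in the table.
duplicates = set(batch) - set(second)
```

Because this sketch yields lazily, the generator has to be consumed (here with `list`) before any URL is stored.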

add_one(url: str, url_properties: typing.Union=None, url_data: typing.Union=None)

Add a single URL to the table.

Parameters:
  • url – The URL to be added
  • url_properties – Additional values to be saved
  • url_data – Additional data to be saved

add_visits(visits)

Add visited URLs from CDX file.

Parameters: visits (iterable) – An iterable of items. Each item is a tuple containing a URL, the WARC ID, and the payload digest.

check_in(url: str, new_status: wpull.pipeline.item.Status, increment_try_count: bool=True, url_result: typing.Union=None)

Update record for processed URL.

Parameters:
  • url – The URL.
  • new_status – Update the item status to new_status.
  • increment_try_count – Whether to increment the try counter for the URL.
  • url_result – Additional values.

check_out(filter_status: wpull.pipeline.item.Status, filter_level: typing.Union=None) → wpull.pipeline.item.URLRecord

Find a URL, mark it in progress, and return it.

Parameters:
  • filter_status – Gets first item with given status.
  • filter_level – Gets item with filter_level or lower.
Raises: NotFound
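The check_out and check_in pair suggests a simple work loop: take an item, process it, record the outcome, and stop once NotFound is raised. The sketch below shows that pattern under stated assumptions. `Status`, `NotFound`, and `ToyURLTable` are hypothetical stand-ins defined locally, and `check_out` returns a plain URL string here rather than a URLRecord.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical stand-in for wpull.pipeline.item.Status.
    todo = "todo"
    in_progress = "in_progress"
    done = "done"


class NotFound(Exception):
    """Stand-in for wpull.database.base.NotFound."""


class ToyURLTable:
    """Hypothetical in-memory table; not wpull's implementation."""

    def __init__(self, urls):
        self.records = {
            url: {"status": Status.todo, "try_count": 0} for url in urls
        }

    def check_out(self, filter_status):
        # Find the first record with the requested status and mark it
        # as in progress before handing it to the caller.
        for url, record in self.records.items():
            if record["status"] == filter_status:
                record["status"] = Status.in_progress
                return url
        raise NotFound()

    def check_in(self, url, new_status, increment_try_count=True):
        record = self.records[url]
        record["status"] = new_status
        if increment_try_count:
            record["try_count"] += 1


table = ToyURLTable(["http://example.com/", "http://example.com/a"])
processed = []

while True:
    try:
        url = table.check_out(Status.todo)
    except NotFound:
        break  # Nothing left with the todo status.
    processed.append(url)  # Real work would happen here.
    table.check_in(url, Status.done)
```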

close()

Run any clean-up actions and close the table.

contains(url: str)

Return whether the URL is in the table.

convert_check_in(file_id: int, status: wpull.pipeline.item.Status)

convert_check_out() → (int, wpull.pipeline.item.URLRecord)

count() → int

Return the number of URLs in the table.

This call may be expensive.

get_all() → typing.Iterator

Return all URLRecord objects.

get_hostnames()

Return a list of hostnames.

get_one(url: str) → wpull.pipeline.item.URLRecord

Return a URLRecord for the URL.

Raises: NotFound

get_revisit_id(url, payload_digest)

Return the WARC ID corresponding to the visit.

Returns: str, None
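A rough sketch of how add_visits and get_revisit_id might fit together: visits are stored keyed by URL and payload digest, and a later lookup returns the matching WARC ID or None. `ToyVisitTable`, the keying scheme, and the sample WARC ID and digest values are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


class ToyVisitTable:
    """Hypothetical in-memory stand-in; not wpull's implementation."""

    def __init__(self) -> None:
        # (url, payload_digest) -> WARC ID
        self._visits: Dict[Tuple[str, str], str] = {}

    def add_visits(self, visits: Iterable[Tuple[str, str, str]]) -> None:
        # Each item is (URL, WARC ID, payload digest), as documented.
        for url, warc_id, payload_digest in visits:
            self._visits[(url, payload_digest)] = warc_id

    def get_revisit_id(self, url: str, payload_digest: str) -> Optional[str]:
        return self._visits.get((url, payload_digest))


table = ToyVisitTable()
table.add_visits([
    # Placeholder values, made up for this example.
    ("http://example.com/", "<urn:uuid:example-1>", "sha1:EXAMPLEDIGEST"),
])

hit = table.get_revisit_id("http://example.com/", "sha1:EXAMPLEDIGEST")
miss = table.get_revisit_id("http://example.com/", "sha1:OTHERDIGEST")
```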

get_root_url_todo_count() → int

release()

Reset any URLs marked in_progress to the todo status.
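The effect of release can be sketched on a plain dictionary. The `release` function and the status strings below are hypothetical and only illustrate the idea that items left in progress (for example after an interrupted run) become available again.

```python
def release(records):
    """Reset every in_progress entry to todo (illustrative only)."""
    for url, status in records.items():
        if status == "in_progress":
            # Reassigning an existing key is safe while iterating.
            records[url] = "todo"


records = {
    "http://example.com/": "done",
    "http://example.com/a": "in_progress",
    "http://example.com/b": "todo",
}
release(records)
```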

remove_many(urls)

Remove the URLs from the database.

remove_one(url)

Remove a URL from the database.

update_one(url, **kwargs)

Arbitrarily update values for a URL.

exception wpull.database.base.DatabaseError

Bases: Exception

Any database error.

exception wpull.database.base.NotFound

Bases: wpull.database.base.DatabaseError

Item not found in the table.
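Because NotFound derives from DatabaseError, a handler for the base class also catches missing-item errors. The sketch below recreates the two-class hierarchy locally to show this. The `lookup` helper is hypothetical.

```python
class DatabaseError(Exception):
    """Any database error."""


class NotFound(DatabaseError):
    """Item not found in the table."""


def lookup(records, url):
    # Hypothetical helper: translate a missing key into NotFound.
    try:
        return records[url]
    except KeyError:
        raise NotFound(url) from None


try:
    lookup({}, "http://example.com/missing")
except DatabaseError as error:
    # The base-class handler receives the more specific exception.
    caught = type(error).__name__
```

Code that needs to tell a missing item apart from other failures should catch NotFound before DatabaseError.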