`url` Module¶

URL parsing based on WHATWG URL living standard.

wpull.url.C0_CONTROL_SET = frozenset({'\x19', '\x07', '\x17', '\x1c', '\x03', '\x15', '\x0c', '\n', '\x00', '\t', '\x02', '\x1b', '\r', '\x16', '\x01', '\x05', '\x1a', '\x08', '\x1f', '\x0b', '\x18', '\x12', '\x04', '\x06', '\x14', '\x1e', '\x0e', '\x1d', '\x11', '\x0f', '\x13', '\x10'})¶: Characters from 0x00 to 0x1f inclusive
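The set above prints in arbitrary order because frozensets are unordered. An equivalent set can be built directly from the stated range; the name `c0_controls` is illustrative only and is not part of the module.

```python
# Build the 32 C0 control characters, U+0000 through U+001F inclusive.
c0_controls = frozenset(chr(code) for code in range(0x20))

print(len(c0_controls))      # 32
print('\n' in c0_controls)   # True
print(' ' in c0_controls)    # False: U+0020 lies just outside the range
```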

wpull.url.DEFAULT_ENCODE_SET = frozenset({32, 96, 34, 35, 60, 62, 63})¶

Percent encoding set as defined by WHATWG URL living standard.

Does not include U+0000 to U+001F nor U+007F or above.

wpull.url.FORBIDDEN_HOSTNAME_CHARS = frozenset({'[', ']', '@', ':', '%', '/', '#', '?', '\\', ' '})¶

Forbidden hostname characters.

Does not include non-printing characters. Meant for ASCII.

wpull.url.FRAGMENT_ENCODE_SET = frozenset({32, 96, 34, 60, 62})¶: Encoding set for fragment.

wpull.url.PASSWORD_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})¶: Encoding set for passwords.

class wpull.url.PercentEncoderMap(encode_set)[source]¶

Bases: collections.defaultdict

Helper map for percent encoding.
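One way such a helper map could work is sketched below: a `defaultdict` subclass that computes the encoded form of a byte value on first lookup and caches it. The class name `EncoderMapSketch` and the rule that control and non-ASCII bytes are always encoded are assumptions made for illustration, not the documented behavior of `PercentEncoderMap`.

```python
import collections


class EncoderMapSketch(collections.defaultdict):
    """Hypothetical stand-in: maps a byte value to its encoded text."""

    def __init__(self, encode_set):
        super().__init__()
        self.encode_set = frozenset(encode_set)

    def __missing__(self, byte_value):
        # Assumption: bytes in the set, C0 controls, and bytes above
        # 0x7E are percent-encoded; everything else passes through.
        if (byte_value in self.encode_set
                or byte_value < 0x20 or byte_value > 0x7E):
            encoded = '%{:02X}'.format(byte_value)
        else:
            encoded = chr(byte_value)
        self[byte_value] = encoded  # cache for later lookups
        return encoded


encoder = EncoderMapSketch({32, 96, 34, 35, 60, 62, 63})
print(''.join(encoder[b] for b in 'a b?'.encode('utf-8')))  # a%20b%3F
```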

wpull.url.QUERY_ENCODE_SET = frozenset({96, 34, 35, 60, 62})¶

Encoding set for query strings.

This set does not include U+0020 (space) so it can be replaced with U+002B (plus sign) later.
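The two-step idea can be sketched as follows: percent-encode first with a set that leaves spaces alone, then turn the surviving spaces into `+`. The function name `encode_query_sketch` is hypothetical, and the rule that control and non-ASCII bytes are also encoded is an assumption.

```python
# Values copied from QUERY_ENCODE_SET above.
QUERY_SET = frozenset({96, 34, 35, 60, 62})


def encode_query_sketch(text, encoding='utf-8'):
    # Step 1: percent-encode bytes in the set, plus (assumed) control
    # and non-ASCII bytes. A space (0x20) is not touched here.
    encoded = ''.join(
        '%{:02X}'.format(b)
        if b in QUERY_SET or b < 0x20 or b > 0x7E else chr(b)
        for b in text.encode(encoding)
    )
    # Step 2: spaces survived step 1, so they can become plus signs.
    return encoded.replace(' ', '+')


print(encode_query_sketch('q=hello world#1'))  # q=hello+world%231
```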

wpull.url.QUERY_VALUE_ENCODE_SET = frozenset({96, 34, 35, 37, 38, 43, 60, 62})¶: Encoding set for a query value.

class wpull.url.URLInfo[source]¶

Bases: object

Represent parts of a URL.

raw¶

str

Original string.

scheme¶

str

Protocol (for example, HTTP, FTP).

authority¶

str

Raw userinfo and host.

path¶

str

Location of resource. This value always begins with a slash (/).

query¶

str

Additional request parameters.

fragment¶

str

Named anchor of a document.

userinfo¶

str

Raw username and password.

username¶

str

Username.

password¶

str

Password.

host¶

str

Raw hostname and port.

hostname¶

str

Hostname or IP address.

port¶

int

IP address port number.

resource¶

str

Raw path, query, and fragment. This value always begins with a slash (/).

query_map¶

dict

Mapping of the query. Values are lists.

url¶

str

A normalized URL without userinfo and fragment.

encoding¶

str

Codec name for IRI support.

If scheme is not something like HTTP or FTP, the remaining attributes are None.

All attributes are read only.

For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html
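For rough orientation only, the standard library's `urllib.parse.urlsplit` breaks a URL into comparable parts. This is not `URLInfo` and it performs no normalization; the example URL is made up.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://user:pass@example.com:8080/dir/page.html?a=1&b=2#top')

print(parts.scheme)    # http
print(parts.netloc)    # user:pass@example.com:8080  (the authority)
print(parts.username)  # user
print(parts.password)  # pass
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /dir/page.html
print(parts.query)     # a=1&b=2
print(parts.fragment)  # top
```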

authority

encoding

fragment

host

hostname

hostname_with_port¶: Return the host portion but omit default port if needed.

is_ipv6()[source]¶: Return whether the URL is IPv6.

is_port_default()[source]¶: Return whether the URL is using the default port.
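A minimal sketch of the default-port check and of omitting the port from the host, assuming a small scheme-to-port table. The table and both function names are illustrative and are not taken from the module.

```python
# Assumed table of well-known default ports.
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}


def is_port_default_sketch(scheme, port):
    # True only when the scheme is known and the port matches its default.
    return DEFAULT_PORTS.get(scheme) == port


def hostname_with_port_sketch(scheme, hostname, port):
    # Drop the port when it adds no information.
    if is_port_default_sketch(scheme, port):
        return hostname
    return '{}:{}'.format(hostname, port)


print(hostname_with_port_sketch('http', 'example.com', 80))     # example.com
print(hostname_with_port_sketch('https', 'example.com', 8443))  # example.com:8443
```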

classmethod parse(url, default_scheme='http', encoding='utf-8')[source]¶: Parse a URL and return a URLInfo.

classmethod parse_authority(authority)[source]¶: Parse the authority part and return userinfo and host.

classmethod parse_host(host)[source]¶: Parse the host and return hostname and port.
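Splitting a host into hostname and port needs care with bracketed IPv6 literals, whose colons are not port separators. The sketch below shows one way to do it; `split_host_sketch` is a hypothetical name and the real method may validate or normalize more.

```python
def split_host_sketch(host):
    if host.startswith('['):
        # Bracketed IPv6 literal: a port, if any, follows the ']'.
        end = host.index(']')
        hostname = host[1:end]
        rest = host[end + 1:]
        port_text = rest[1:] if rest.startswith(':') else ''
    else:
        hostname, _, port_text = host.partition(':')
    port = int(port_text) if port_text else None
    return hostname, port


print(split_host_sketch('example.com:8080'))  # ('example.com', 8080)
print(split_host_sketch('[::1]:80'))          # ('::1', 80)
print(split_host_sketch('example.com'))       # ('example.com', None)
```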

classmethod parse_hostname(hostname)[source]¶: Parse the hostname and normalize.

classmethod parse_ipv6_hostname(hostname)[source]¶: Parse and normalize an IPv6 address.

classmethod parse_userinfo(userinfo)[source]¶: Parse the userinfo and return username and password.
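The authority and userinfo splits can be sketched with plain string partitioning, assuming the last `@` separates userinfo from host and the first `:` separates username from password. Both function names are hypothetical.

```python
def split_authority_sketch(authority):
    # Use the last '@' so a stray '@' inside the userinfo stays there.
    userinfo, _, host = authority.rpartition('@')
    return userinfo, host


def split_userinfo_sketch(userinfo):
    # Use the first ':' so the password may itself contain colons.
    username, _, password = userinfo.partition(':')
    return username, password


print(split_authority_sketch('user:pass@example.com:8080'))
# ('user:pass', 'example.com:8080')
print(split_userinfo_sketch('user:pa:ss'))
# ('user', 'pa:ss')
```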

password

path

port

query

query_map

raw

resource

scheme

split_path()[source]¶

Return the directory and filename from the path.

The results are not percent-decoded.
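A minimal sketch, assuming the split happens at the last slash and that no decoding is applied. The name `split_path_sketch` is illustrative, and the exact form of the returned directory is an assumption.

```python
def split_path_sketch(path):
    # Everything before the last '/' is the directory, the rest the filename.
    directory, _, filename = path.rpartition('/')
    return directory, filename


print(split_path_sketch('/a/b%20c/file.txt'))  # ('/a/b%20c', 'file.txt')
```

Note that `%20` is returned as-is rather than being turned into a space.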

to_dict()[source]¶: Return a dict of the attributes.

url

userinfo

username

wpull.url.USERNAME_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})¶: Encoding set for usernames.

wpull.url.flatten_path(path, flatten_slashes=False)[source]¶

Flatten an absolute URL path by removing the dot segments.

urllib.parse.urljoin() has some support for removing dot segments, but it is conservative and only removes them as needed.

Parameters:
path (str) – The URL path.
flatten_slashes (bool) – If True, consecutive slashes are removed.

The path returned will always have a leading slash.
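Dot-segment removal can be sketched as a single pass over the path segments with a stack. This is an illustrative version written for this page, not the module's code; `flatten_path_sketch` is a hypothetical name, and edge cases may differ from `flatten_path`.

```python
def flatten_path_sketch(path, flatten_slashes=False):
    # Ignore the leading slash; it is added back at the end.
    segments = (path[1:] if path.startswith('/') else path).split('/')
    # A trailing '/', '.' or '..' means the result names a directory.
    is_dir = segments[-1] in ('', '.', '..')

    kept = []
    for segment in segments:
        if segment == '.':
            continue                # current directory: drop it
        if segment == '..':
            if kept:
                kept.pop()          # parent directory: back up one level
            continue
        if segment == '' and flatten_slashes:
            continue                # empty segment from a doubled slash
        kept.append(segment)

    result = '/' + '/'.join(kept)
    if is_dir and not result.endswith('/'):
        result += '/'
    return result


print(flatten_path_sketch('/a/b/../c'))                    # /a/c
print(flatten_path_sketch('/a/./b/'))                      # /a/b/
print(flatten_path_sketch('/a//b', flatten_slashes=True))  # /a/b
print(flatten_path_sketch('/../a'))                        # /a
```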

wpull.url.is_subdir(base_path, test_path, trailing_slash=False, wildcards=False)[source]¶

Return whether a path is a subpath of another.

Parameters:
base_path – The base path.
test_path – The path being tested.
trailing_slash – If True, the trailing slash is significant. For example, `/images/` is a directory while `/images` is a file.
wildcards – If True, globbing wildcards are matched against paths.

wpull.url.normalize(url, **kwargs)[source]¶

Normalize a URL.

This function is a convenience function that is equivalent to:

>>> URLInfo.parse('http://example.com').url
'http://example.com'

See also: `URLInfo.parse()`.

wpull.url.normalize_fragment(text, encoding='utf-8')[source]¶

Normalize a fragment.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_hostname(hostname)[source]¶: Normalize a hostname so that it is ASCII and a valid domain name.
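Converting an internationalized hostname to ASCII can be sketched with Python's built-in `idna` codec. Whether `normalize_hostname` uses this codec is an assumption; `normalize_hostname_sketch` is a hypothetical name and does no further validation.

```python
def normalize_hostname_sketch(hostname):
    # The 'idna' codec converts non-ASCII labels to their 'xn--' form.
    # ASCII labels pass through unchanged, so lowercase them afterwards.
    return hostname.encode('idna').decode('ascii').lower()


print(normalize_hostname_sketch('bücher.EXAMPLE'))  # xn--bcher-kva.example
```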

wpull.url.normalize_ipv4_address(address)[source]¶

wpull.url.normalize_password(text, encoding='utf-8')[source]¶

Normalize a password.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_path(path, encoding='utf-8')[source]¶

Normalize a path string.

Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_query(text, encoding='utf-8')[source]¶

Normalize a query string.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_username(text, encoding='utf-8')[source]¶

Normalize a username.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.parse_ipv4_int(text)[source]¶

wpull.url.parse_url_or_log(url, encoding='utf-8')[source]¶

Parse and return a URLInfo.

This function logs a warning if the URL cannot be parsed and returns None.
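The parse-or-log pattern can be sketched with the standard library standing in for the parser. Here `urllib.parse.urlsplit` is used in place of `URLInfo.parse`, and the assumption is that parse failures surface as `ValueError`; the name `parse_or_log_sketch` is hypothetical.

```python
import logging
from urllib.parse import urlsplit

_logger = logging.getLogger(__name__)


def parse_or_log_sketch(url):
    try:
        return urlsplit(url)
    except ValueError as error:
        # Report the problem and let the caller carry on with None.
        _logger.warning('Unable to parse URL %r: %s', url, error)
        return None


print(parse_or_log_sketch('http://example.com/').hostname)  # example.com
print(parse_or_log_sketch('http://[::1'))                   # None (unbalanced bracket)
```

Returning `None` lets a caller skip bad links in a loop without wrapping every call in its own `try` block.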

wpull.url.percent_encode(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]¶

Percent encode text.

Unlike Python’s urllib.parse.quote, this function accepts a blacklist of characters to encode instead of a whitelist of safe characters.
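The difference between the two styles is sketched below. `encode_blacklist_sketch` is a hypothetical stand-in that escapes only what the caller lists (plus, by assumption, control and non-ASCII bytes), while `urllib.parse.quote` escapes everything except what the caller marks safe.

```python
from urllib.parse import quote


def encode_blacklist_sketch(text, encode_set, encoding='utf-8'):
    # encode_set holds the code points that must be escaped.
    return ''.join(
        '%{:02X}'.format(b)
        if b in encode_set or b < 0x20 or b > 0x7E else chr(b)
        for b in text.encode(encoding)
    )


# Blacklist: only the space is named, so '|' is left alone.
print(encode_blacklist_sketch('a b|c', {32}))  # a%20b|c

# Whitelist: '|' survives only if it is declared safe.
print(quote('a b|c', safe='|'))                # a%20b|c
print(quote('a b|c', safe=''))                 # a%20b%7Cc
```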

wpull.url.percent_encode_plus(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]¶

Percent encode text for query strings.

Unlike Python’s urllib.parse.quote_plus, this function accepts a blacklist of characters to encode instead of a whitelist of safe characters.

wpull.url.percent_encode_query_value(text, encoding='utf-8')[source]¶: Percent encode a query value.

wpull.url.query_to_map(text)[source]¶

Return a key-values mapping from a query string.

Plus symbols are replaced with spaces.
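The standard library's `urllib.parse.parse_qs` produces a similar key-to-list mapping and also turns plus signs into spaces. It is shown here as an analogue only; `query_to_map` may differ in details such as blank values.

```python
from urllib.parse import parse_qs

# Repeated keys collect into one list; '+' decodes to a space.
result = parse_qs('a=1&a=2&b=hello+world')
print(result)  # {'a': ['1', '2'], 'b': ['hello world']}
```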

wpull.url.schemes_similar(scheme1, scheme2)[source]¶

Return whether URL schemes are similar.

This function considers the following schemes to be similar:

HTTP and HTTPS
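A minimal sketch of such a check, assuming lowercase scheme names and that identical schemes also count as similar. The name `schemes_similar_sketch` is illustrative.

```python
def schemes_similar_sketch(scheme1, scheme2):
    if scheme1 == scheme2:
        return True
    # The only cross-scheme pair treated as similar is HTTP and HTTPS.
    return {scheme1, scheme2} == {'http', 'https'}


print(schemes_similar_sketch('http', 'https'))  # True
print(schemes_similar_sketch('http', 'ftp'))    # False
```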

wpull.url.split_query(qs, keep_blank_values=False)[source]¶

Split the query string.

Note for empty values: If an equal sign (=) is present, the value will be an empty string (''). Otherwise, the value will be None:

>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]

No processing is done on the actual values.
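The distinction between an empty string and `None` can be reproduced with `str.partition`, as in the sketch below. `split_query_sketch` is a hypothetical name, and dropping blank values when `keep_blank_values` is False is an assumption.

```python
def split_query_sketch(qs, keep_blank_values=False):
    for pair in qs.split('&'):
        if not pair:
            continue
        name, separator, value = pair.partition('=')
        if not separator:
            value = None  # no '=' at all: the value is missing
        if not value and not keep_blank_values:
            continue      # assumed: blank and missing values are dropped
        yield name, value


print(list(split_query_sketch('a=&b', keep_blank_values=True)))
# [('a', ''), ('b', None)]
```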

wpull.url.uppercase_percent_encoding(text)[source]¶: Uppercases percent-encoded sequences.
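Uppercasing percent-encoded sequences can be sketched with a regular expression that touches only a `%` followed by two hex digits. The name `uppercase_percent_encoding_sketch` is illustrative and this is not the module's code.

```python
import re


def uppercase_percent_encoding_sketch(text):
    # Only '%xx' escapes change; ordinary letters keep their case.
    return re.sub(
        r'%[0-9a-fA-F]{2}',
        lambda match: match.group(0).upper(),
        text,
    )


print(uppercase_percent_encoding_sketch('a%2fb%c3%A9'))  # a%2Fb%C3%A9
```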

wpull.url.urljoin(base_url, url, allow_fragments=True)[source]¶: Join URLs like urllib.parse.urljoin but allow scheme-relative URL.
