url Module

URL parsing based on WHATWG URL living standard.

wpull.url.C0_CONTROL_SET = frozenset({'\x19', '\x07', '\x17', '\x1c', '\x03', '\x15', '\x0c', '\n', '\x00', '\t', '\x02', '\x1b', '\r', '\x16', '\x01', '\x05', '\x1a', '\x08', '\x1f', '\x0b', '\x18', '\x12', '\x04', '\x06', '\x14', '\x1e', '\x0e', '\x1d', '\x11', '\x0f', '\x13', '\x10'})

Characters from U+0000 to U+001F inclusive.
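The unordered repr above is hard to scan. As a minimal sketch, the same 32 characters can be rebuilt from a range; `c0_control_set` is a hypothetical local name, not an import from wpull:

```python
# Hypothetical local reconstruction of the documented set; nothing is imported from wpull.
c0_control_set = frozenset(chr(code) for code in range(0x20))

print(len(c0_control_set))       # 32
print('\x1f' in c0_control_set)  # True
print(' ' in c0_control_set)     # False, U+0020 lies just outside the range
```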

wpull.url.DEFAULT_ENCODE_SET = frozenset({32, 96, 34, 35, 60, 62, 63})

Percent encoding set as defined by WHATWG URL living standard.

Does not include U+0000 to U+001F nor U+007F or above.

wpull.url.FORBIDDEN_HOSTNAME_CHARS = frozenset({'[', ']', '@', ':', '%', '/', '#', '?', '\\', ' '})

Forbidden hostname characters.

Does not include non-printing characters. Meant for ASCII.

wpull.url.FRAGMENT_ENCODE_SET = frozenset({32, 96, 34, 60, 62})

Encoding set for fragment.

wpull.url.PASSWORD_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})

Encoding set for passwords.

class wpull.url.PercentEncoderMap(encode_set)[source]

Bases: collections.defaultdict

Helper map for percent encoding.
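How the helper map works internally is not described here. The sketch below shows one plausible shape for a `defaultdict`-based encoder map: entries are computed on first lookup and cached. `SketchEncoderMap` and its rule for control and non-ASCII bytes are assumptions for illustration, not wpull's implementation.

```python
import collections


class SketchEncoderMap(collections.defaultdict):
    """Hypothetical stand-in: maps a byte value to its encoded text, lazily."""

    def __init__(self, encode_set):
        super().__init__()
        self.encode_set = encode_set

    def __missing__(self, byte_value):
        # Assumption: controls, non-ASCII bytes, and members of the set are encoded.
        if byte_value < 0x20 or byte_value > 0x7E or byte_value in self.encode_set:
            text = '%{:02X}'.format(byte_value)
        else:
            text = chr(byte_value)
        self[byte_value] = text  # cache so later lookups skip this method
        return text


encoder = SketchEncoderMap(frozenset({32, 96, 34, 35, 60, 62, 63}))
print(''.join(encoder[b] for b in 'a b?'.encode('utf-8')))  # a%20b%3F
```

Caching per byte value keeps the per-character work to a dictionary lookup once a byte has been seen.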

wpull.url.QUERY_ENCODE_SET = frozenset({96, 34, 35, 60, 62})

Encoding set for query strings.

This set does not include U+0020 (space) so that spaces can be replaced with U+002B (plus sign) later.

wpull.url.QUERY_VALUE_ENCODE_SET = frozenset({96, 34, 35, 37, 38, 43, 60, 62})

Encoding set for a query value.

class wpull.url.URLInfo[source]

Bases: object

Represent parts of a URL.

raw

str

Original string.

scheme

str

Protocol (for example, HTTP, FTP).

authority

str

Raw userinfo and host.

path

str

Location of resource. This value always begins with a slash (/).

query

str

Additional request parameters.

fragment

str

Named anchor of a document.

userinfo

str

Raw username and password.

username

str

Username.

password

str

Password.

host

str

Raw hostname and port.

hostname

str

Hostname or IP address.

port

int

IP address port number.

resource

str

Raw path, query, and fragment. This value always begins with a slash (/).

query_map

dict

Mapping of the query. Values are lists.

url

str

A normalized URL without userinfo and fragment.

encoding

str

Codec name for IRI support.

If scheme is not something like HTTP or FTP, the remaining attributes are None.

All attributes are read only.

For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html
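To see how these parts relate to one another, the standard library's `urllib.parse.urlsplit` produces a comparable breakdown. This is only an analogy: `urlsplit` is not wpull, and its `netloc` corresponds roughly to the `authority` attribute described above.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://user:secret@example.com:8080/a/b.html?q=1#top')

print(parts.scheme)    # http
print(parts.netloc)    # user:secret@example.com:8080
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /a/b.html
print(parts.query)     # q=1
print(parts.fragment)  # top
```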

authority
encoding
fragment
host
hostname
hostname_with_port

Return the host portion but omit default port if needed.

is_ipv6()[source]

Return whether the hostname is an IPv6 address.

is_port_default()[source]

Return whether the URL is using the default port.
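Both hostname_with_port and is_port_default depend on knowing a scheme's default port. A minimal sketch of that logic as free functions follows; the `DEFAULT_PORTS` table and both function names are assumptions for illustration, not wpull's code.

```python
# Assumed default ports for illustration; the table wpull uses is not shown in this document.
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}


def port_is_default(scheme, port):
    """Return True when the port is the well-known default for the scheme."""
    return DEFAULT_PORTS.get(scheme) == port


def host_without_default_port(scheme, hostname, port):
    """Return 'hostname' or 'hostname:port', omitting a default port."""
    if port_is_default(scheme, port):
        return hostname
    return '{}:{}'.format(hostname, port)


print(host_without_default_port('http', 'example.com', 80))    # example.com
print(host_without_default_port('http', 'example.com', 8080))  # example.com:8080
```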

classmethod parse(url, default_scheme='http', encoding='utf-8')[source]

Parse a URL and return a URLInfo.

classmethod parse_authority(authority)[source]

Parse the authority part and return userinfo and host.

classmethod parse_host(host)[source]

Parse the host and return hostname and port.
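As a rough illustration of splitting a host into hostname and port, the sketch below handles plain hosts and bracketed IPv6 literals. `split_host` is a hypothetical helper; the validation and normalization that wpull performs are not reproduced.

```python
def split_host(host):
    """Hypothetical helper: split 'hostname:port' into (hostname, port or None)."""
    if host.startswith('['):
        # Bracketed IPv6 literal, for example [::1]:8080
        hostname, _, rest = host[1:].partition(']')
        port_text = rest[1:] if rest.startswith(':') else ''
    else:
        hostname, _, port_text = host.partition(':')
    # int() raises ValueError if the port is not numeric.
    return hostname, (int(port_text) if port_text else None)


print(split_host('example.com:8080'))  # ('example.com', 8080)
print(split_host('[::1]:443'))         # ('::1', 443)
print(split_host('example.com'))       # ('example.com', None)
```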

classmethod parse_hostname(hostname)[source]

Parse the hostname and normalize.

classmethod parse_ipv6_hostname(hostname)[source]

Parse and normalize an IPv6 address.
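One way to normalize an IPv6 address with the standard library is the `ipaddress` module, sketched below. `normalize_ipv6` is a hypothetical name, and whether wpull uses `ipaddress` internally is an assumption.

```python
import ipaddress


def normalize_ipv6(text):
    """Hypothetical helper: return the compressed, lowercase form of an IPv6 address."""
    # Brackets are stripped in case the address came straight from a URL host.
    # IPv6Address raises ValueError for malformed input.
    return ipaddress.IPv6Address(text.strip('[]')).compressed


print(normalize_ipv6('2001:0DB8:0000:0000:0000:0000:0000:0001'))  # 2001:db8::1
```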

classmethod parse_userinfo(userinfo)[source]

Parse the userinfo and return username and password.
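The split itself can be sketched in a few lines. The assumption here is that the first colon separates username from password, and that a missing colon yields `None` for the password; `split_userinfo` is a hypothetical name.

```python
def split_userinfo(userinfo):
    """Hypothetical helper: split 'username:password' at the first colon."""
    username, separator, password = userinfo.partition(':')
    # Without a colon there is no password at all, as opposed to an empty one.
    return username, (password if separator else None)


print(split_userinfo('user:secret'))  # ('user', 'secret')
print(split_userinfo('user'))         # ('user', None)
print(split_userinfo('user:a:b'))     # ('user', 'a:b')
```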

password
path
port
query
query_map
raw
resource
scheme
split_path()[source]

Return the directory and filename from the path.

The results are not percent-decoded.
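A comparable split is available from the standard library's `posixpath.split`, shown below as an illustration. That wpull behaves identically in every edge case is an assumption. Note that the percent-encoded sequence in the filename is left untouched.

```python
import posixpath

print(posixpath.split('/images/cat%20photo.jpg'))  # ('/images', 'cat%20photo.jpg')
print(posixpath.split('/index.html'))              # ('/', 'index.html')
print(posixpath.split('/images/'))                 # ('/images', '')
```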

to_dict()[source]

Return a dict of the attributes.

url
userinfo
username

wpull.url.USERNAME_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})

Encoding set for usernames.

wpull.url.flatten_path(path, flatten_slashes=False)[source]

Flatten an absolute URL path by removing the dot segments.

urllib.parse.urljoin() has some support for removing dot segments, but it is conservative and only removes them as needed.

Parameters:
  • path (str) – The URL path.
  • flatten_slashes (bool) – If True, consecutive slashes are removed.

The path returned will always have a leading slash.
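Dot-segment removal can be sketched with a simple stack of path segments. The function below is an illustration under stated assumptions (a trailing `.` or `..` keeps a trailing slash, and `..` at the root is ignored); `flatten_dot_segments` is a hypothetical name, not wpull's implementation.

```python
def flatten_dot_segments(path, flatten_slashes=False):
    """Hypothetical sketch: remove '.' and '..' segments from an absolute path."""
    parts = path.split('/')
    if parts and parts[0] == '':
        parts = parts[1:]  # drop the empty string before the leading slash
    segments = []
    for index, part in enumerate(parts):
        is_last = index == len(parts) - 1
        if part in ('.', '..'):
            if part == '..' and segments:
                segments.pop()       # step up one directory, never above the root
            if is_last:
                segments.append('')  # a trailing dot segment leaves a trailing slash
        elif part == '' and flatten_slashes and not is_last:
            continue                 # collapse consecutive slashes
        else:
            segments.append(part)
    return '/' + '/'.join(segments)


print(flatten_dot_segments('/a/b/../c/./d'))                 # /a/c/d
print(flatten_dot_segments('/a/b/..'))                       # /a/
print(flatten_dot_segments('/../x'))                         # /x
print(flatten_dot_segments('/a//b/', flatten_slashes=True))  # /a/b/
```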

wpull.url.is_subdir(base_path, test_path, trailing_slash=False, wildcards=False)[source]

Return whether a path is a subpath of another.

Parameters:
  • base_path – The base path
  • test_path – The path which we are testing
  • trailing_slash – If True, the trailing slash is treated with importance. For example, /images/ is a directory while /images is a file.
  • wildcards – If True, globbing wildcards are matched against paths
wpull.url.normalize(url, **kwargs)[source]

Normalize a URL.

This convenience function is equivalent to:

>>> URLInfo.parse('http://example.com').url
'http://example.com'

See also: URLInfo.parse().

wpull.url.normalize_fragment(text, encoding='utf-8')[source]

Normalize a fragment.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_hostname(hostname)[source]

Normalize a hostname so that it is ASCII and a valid domain name.
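Converting an internationalized hostname to ASCII can be sketched with Python's built-in `idna` codec. `to_ascii_hostname` is a hypothetical name; the codec follows the older IDNA 2003 rules, and whether wpull relies on it is an assumption.

```python
def to_ascii_hostname(hostname):
    """Hypothetical helper: lowercase a hostname and convert it to its ASCII (Punycode) form."""
    # The codec raises UnicodeError for labels that are empty or too long.
    return hostname.lower().encode('idna').decode('ascii')


print(to_ascii_hostname('Bücher.Example'))  # xn--bcher-kva.example
print(to_ascii_hostname('EXAMPLE.com'))     # example.com
```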

wpull.url.normalize_ipv4_address(address)[source]

wpull.url.normalize_password(text, encoding='utf-8')[source]

Normalize a password.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_path(path, encoding='utf-8')[source]

Normalize a path string.

Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_query(text, encoding='utf-8')[source]

Normalize a query string.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_username(text, encoding='utf-8')[source]

Normalize a username.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.parse_ipv4_int(text)[source]

wpull.url.parse_url_or_log(url, encoding='utf-8')[source]

Parse and return a URLInfo.

This function logs a warning if the URL cannot be parsed and returns None.

wpull.url.percent_encode(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]

Percent encode text.

Unlike Python’s urllib.parse.quote, this function accepts a blacklist of characters to encode instead of a whitelist of safe characters.
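The difference between the two approaches shows up with characters such as `=` and `&`. In the sketch below, `encode_listed` is a hypothetical blacklist-style encoder written for illustration (its handling of control and non-ASCII bytes is an assumption), compared against the standard library's whitelist-style `quote`.

```python
from urllib.parse import quote

# Copied from the default encode set listed in this document.
ENCODE_SET = frozenset({32, 96, 34, 35, 60, 62, 63})


def encode_listed(text, encode_set=ENCODE_SET, encoding='utf-8'):
    """Hypothetical sketch: encode only listed bytes, controls, and non-ASCII bytes."""
    pieces = []
    for byte_value in text.encode(encoding):
        if byte_value < 0x20 or byte_value > 0x7E or byte_value in encode_set:
            pieces.append('%{:02X}'.format(byte_value))
        else:
            pieces.append(chr(byte_value))
    return ''.join(pieces)


print(quote('k=v&x y', safe=''))  # k%3Dv%26x%20y  (everything not whitelisted is encoded)
print(encode_listed('k=v&x y'))   # k=v&x%20y      (only blacklisted characters are encoded)
```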

wpull.url.percent_encode_plus(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]

Percent encode text for query strings.

Unlike Python’s urllib.parse.quote_plus, this function accepts a blacklist of characters to encode instead of a whitelist of safe characters.
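A minimal sketch of the plus-sign variant follows. Because the query encode set omits the space, spaces can be written as `+` while the listed characters are percent-encoded. `encode_plus_sketch` is a hypothetical name and its treatment of control and non-ASCII bytes is an assumption.

```python
# Copied from the query encode set listed in this document.
QUERY_SET = frozenset({96, 34, 35, 60, 62})


def encode_plus_sketch(text, encode_set=QUERY_SET, encoding='utf-8'):
    """Hypothetical sketch: percent-encode listed bytes and turn spaces into '+'."""
    pieces = []
    for byte_value in text.encode(encoding):
        if byte_value == 0x20:
            pieces.append('+')
        elif byte_value < 0x20 or byte_value > 0x7E or byte_value in encode_set:
            pieces.append('%{:02X}'.format(byte_value))
        else:
            pieces.append(chr(byte_value))
    return ''.join(pieces)


print(encode_plus_sketch('a b#c'))  # a+b%23c
```

A literal `+` in the input passes through unchanged with this set, which is why a stricter set that lists 43 (plus sign) is needed for individual query values.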

wpull.url.percent_encode_query_value(text, encoding='utf-8')[source]

Percent encode a query value.

wpull.url.query_to_map(text)[source]

Return a key-values mapping from a query string.

Plus symbols are replaced with spaces.
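The standard library's `urllib.parse.parse_qs` builds a similar key-to-list mapping and also turns plus signs into spaces, so it serves as a rough illustration. It additionally percent-decodes values and drops blank ones by default; whether query_to_map matches it on those points is not stated here.

```python
from urllib.parse import parse_qs

mapping = parse_qs('a=1&a=2&b=hello+world')
print(mapping)  # {'a': ['1', '2'], 'b': ['hello world']}
```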

wpull.url.schemes_similar(scheme1, scheme2)[source]

Return whether URL schemes are similar.

This function considers the following schemes to be similar:

  • HTTP and HTTPS

wpull.url.split_query(qs, keep_blank_values=False)[source]

Split the query string.

Note for empty values: If an equal sign (=) is present, the value will be an empty string (''). Otherwise, the value will be None:

>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]

No processing is done on the actual values.
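The empty-value rule can be reproduced with `str.partition`, which reports whether an equal sign was found. The generator below is a sketch consistent with the documented example; `split_query_sketch` is a hypothetical name and other edge cases may differ from wpull.

```python
def split_query_sketch(qs, keep_blank_values=False):
    """Hypothetical sketch: yield (name, value) pairs without decoding anything."""
    for pair in qs.split('&'):
        name, separator, value = pair.partition('=')
        if not separator:
            value = None  # no equal sign at all, as opposed to an empty value
        if keep_blank_values or value:
            yield name, value


print(list(split_query_sketch('a=&b', keep_blank_values=True)))  # [('a', ''), ('b', None)]
print(list(split_query_sketch('a=&b')))                          # []
print(list(split_query_sketch('a=1&b=x+y')))                     # [('a', '1'), ('b', 'x+y')]
```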

wpull.url.uppercase_percent_encoding(text)[source]

Uppercases percent-encoded sequences.
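A regular expression is one straightforward way to do this. In the sketch below, `uppercase_escapes` is a hypothetical name; only the `%XX` sequences are changed, and other letters keep their case.

```python
import re


def uppercase_escapes(text):
    """Hypothetical helper: uppercase the hex digits of every %XX sequence."""
    return re.sub(r'%[0-9a-fA-F]{2}', lambda match: match.group(0).upper(), text)


print(uppercase_escapes('caf%c3%a9%2fmenu'))  # caf%C3%A9%2Fmenu
```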

wpull.url.urljoin(base_url, url, allow_fragments=True)[source]

Join URLs like urllib.parse.urljoin but allow scheme-relative URL.
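A scheme-relative URL begins with `//` and inherits the scheme of the base URL. The sketch below makes that explicit by prefixing the base scheme before delegating to the standard library; `join_scheme_relative` is a hypothetical name, and this is one possible approach rather than a description of wpull's code.

```python
from urllib.parse import urljoin, urlsplit


def join_scheme_relative(base_url, url, allow_fragments=True):
    """Hypothetical sketch: resolve '//host/path' against the base URL's scheme."""
    if url.startswith('//') and len(url) > 2:
        scheme = urlsplit(base_url).scheme
        if scheme:
            url = '{}:{}'.format(scheme, url)
    return urljoin(base_url, url, allow_fragments=allow_fragments)


print(join_scheme_relative('https://example.com/a/b', '//cdn.example.net/x.js'))
# https://cdn.example.net/x.js
print(join_scheme_relative('http://example.com/a/b', 'c'))
# http://example.com/a/c
```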