url
Module¶
URL parsing based on WHATWG URL living standard.
-
wpull.url.
C0_CONTROL_SET
= frozenset({'\x19', '\x07', '\x17', '\x1c', '\x03', '\x15', '\x0c', '\n', '\x00', '\t', '\x02', '\x1b', '\r', '\x16', '\x01', '\x05', '\x1a', '\x08', '\x1f', '\x0b', '\x18', '\x12', '\x04', '\x06', '\x14', '\x1e', '\x0e', '\x1d', '\x11', '\x0f', '\x13', '\x10'})¶ Characters from 0x00 to 0x1f inclusive
-
wpull.url.
DEFAULT_ENCODE_SET
= frozenset({32, 96, 34, 35, 60, 62, 63})¶ Percent encoding set as defined by WHATWG URL living standard.
Does not include U+0000 to U+001F nor U+007F or above.
-
wpull.url.
FORBIDDEN_HOSTNAME_CHARS
= frozenset({'[', ']', '@', ':', '%', '/', '#', '?', '\\', ' '})¶ Forbidden hostname characters.
Does not include non-printing characters. Meant for ASCII.
-
wpull.url.
FRAGMENT_ENCODE_SET
= frozenset({32, 96, 34, 60, 62})¶ Encoding set for fragment.
-
wpull.url.
PASSWORD_ENCODE_SET
= frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})¶ Encoding set for passwords.
-
class
wpull.url.
PercentEncoderMap
(encode_set)[source]¶ Bases:
collections.defaultdict
Helper map for percent encoding.
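A helper map of this kind can be pictured as a dictionary that computes and caches the encoded form of each byte on first lookup. The following is a minimal sketch under that assumption; `SketchEncoderMap` is a hypothetical name, and the rule that control and non-ASCII bytes are always encoded is inferred from the encode set descriptions, not taken from the wpull source.

```python
import collections


class SketchEncoderMap(collections.defaultdict):
    """Hypothetical stand-in: maps a byte value to its encoded text."""

    def __init__(self, encode_set):
        super().__init__()
        self.encode_set = encode_set

    def __missing__(self, byte_value):
        # Assumption: C0 controls, bytes above 0x7E, and members of the
        # encode set are percent-encoded; everything else passes through.
        if (byte_value < 0x20 or byte_value > 0x7E
                or byte_value in self.encode_set):
            text = '%{:02X}'.format(byte_value)
        else:
            text = chr(byte_value)
        self[byte_value] = text  # cache so the next lookup is a plain dict hit
        return text


encoder = SketchEncoderMap(frozenset({32, 96, 34, 35, 60, 62, 63}))
encoded = ''.join(encoder[b] for b in 'a b?'.encode('utf-8'))
print(encoded)
```

Caching per byte value keeps repeated encoding cheap because at most 256 entries are ever computed.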
-
wpull.url.
QUERY_ENCODE_SET
= frozenset({96, 34, 35, 60, 62})¶ Encoding set for query strings.
This set does not include U+0020 (space) so it can be replaced with U+002B (plus sign) later.
-
wpull.url.
QUERY_VALUE_ENCODE_SET
= frozenset({96, 34, 35, 37, 38, 43, 60, 62})¶ Encoding set for a query value.
-
class
wpull.url.
URLInfo
[source]¶ Bases:
object
Represent parts of a URL.
-
raw
¶ str
Original string.
-
scheme
¶ str
Protocol (for example, HTTP, FTP).
-
authority
¶ str
Raw userinfo and host.
-
path
¶ str
Location of resource. This value always begins with a slash (/).
-
query
¶ str
Additional request parameters.
-
fragment
¶ str
Named anchor of a document.
-
userinfo
¶ str
Raw username and password.
-
username
¶ str
Username.
-
password
¶ str
Password.
-
host
¶ str
Raw hostname and port.
-
hostname
¶ str
Hostname or IP address.
-
port
¶ int
IP address port number.
-
resource
¶ str
Raw path, query, and fragment. This value always begins with a slash (/).
-
query_map
¶ dict
Mapping of the query. Values are lists.
-
url
¶ str
A normalized URL without userinfo and fragment.
-
encoding
¶ str
Codec name for IRI support.
If scheme is not something like HTTP or FTP, the remaining attributes are None.
All attributes are read only.
For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html
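To illustrate how the parts relate to one another, the sketch below derives roughly the same pieces with the standard library's `urllib.parse`. It is only an approximation for orientation: it does not use `URLInfo`, and the example URL and variable names are assumptions chosen to mirror the attribute names above.

```python
from urllib.parse import parse_qs, urlsplit

parts = urlsplit('http://user:secret@example.com:8080/a/b.html?q=1&q=2#top')

scheme = parts.scheme
authority = parts.netloc                      # userinfo and host together
userinfo, _, host = authority.rpartition('@')  # split at the last '@'
username = parts.username
password = parts.password
hostname = parts.hostname
port = parts.port
path = parts.path
query = parts.query
fragment = parts.fragment

# Path, query, and fragment joined back together.
resource = path
if query:
    resource += '?' + query
if fragment:
    resource += '#' + fragment

# Mapping of the query; every value is a list.
query_map = parse_qs(query)
```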
-
authority
-
encoding
-
fragment
-
host
-
hostname
-
hostname_with_port
¶ Return the host portion but omit default port if needed.
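The idea of omitting a default port can be sketched as follows. The function name, the port table, and the omission of IPv6 bracket handling are assumptions made for illustration; the actual property may behave differently.

```python
# Assumed table of well-known default ports, for illustration only.
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}


def hostname_with_port_sketch(scheme, hostname, port):
    """Return 'hostname' or 'hostname:port' depending on the scheme default."""
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        return hostname
    return '{}:{}'.format(hostname, port)


print(hostname_with_port_sketch('http', 'example.com', 80))
print(hostname_with_port_sketch('http', 'example.com', 8080))
```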
-
classmethod
parse
(url, default_scheme='http', encoding='utf-8')[source]¶ Parse a URL and return a URLInfo.
Parse the authority part and return userinfo and host.
-
password
-
path
-
port
-
query
-
query_map
-
raw
-
resource
-
scheme
-
split_path
()[source]¶ Return the directory and filename from the path.
The results are not percent-decoded.
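A comparable split can be done with the standard library's `posixpath.split`, shown here as a rough analogue; whether `split_path` is implemented this way is an assumption. Note that the percent-encoded segment is left untouched.

```python
import posixpath

# Split at the last slash; percent-encoding is not decoded.
directory, filename = posixpath.split('/a/b%20c/file.txt')
print(directory, filename)

# A path ending in a slash yields an empty filename.
directory_only, empty_name = posixpath.split('/dir/')
```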
-
url
-
userinfo
-
username
-
-
wpull.url.
USERNAME_ENCODE_SET
= frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})¶ Encoding set for usernames.
-
wpull.url.
flatten_path
(path, flatten_slashes=False)[source]¶ Flatten an absolute URL path by removing the dot segments.
urllib.parse.urljoin() has some support for removing dot segments, but it is conservative and only removes them as needed.
Parameters: - path (str) – The URL path.
- flatten_slashes (bool) – If True, consecutive slashes are removed.
The path returned will always have a leading slash.
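Dot-segment removal can be sketched with a simple stack of path segments. This is a minimal illustration that follows the general approach of RFC 3986 dot-segment removal; `flatten_path_sketch` is a hypothetical name and its edge-case behavior is not guaranteed to match `flatten_path`.

```python
def flatten_path_sketch(path, flatten_slashes=False):
    """Remove '.' and '..' segments from an absolute URL path."""
    if not path.startswith('/'):
        path = '/' + path

    output = []
    segments = path.split('/')[1:]  # drop the empty string before the first '/'

    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == '.':
            if is_last:
                output.append('')  # keep the trailing slash
        elif segment == '..':
            if output:
                output.pop()  # go up one level, never above the root
            if is_last:
                output.append('')
        elif segment == '' and flatten_slashes and not is_last:
            continue  # collapse consecutive slashes
        else:
            output.append(segment)

    return '/' + '/'.join(output)


print(flatten_path_sketch('/a/./b/../c'))
print(flatten_path_sketch('/a//b', flatten_slashes=True))
```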
-
wpull.url.
is_subdir
(base_path, test_path, trailing_slash=False, wildcards=False)[source]¶ Return whether a path is a subpath of another.
Parameters: - base_path – The base path.
- test_path – The path which we are testing.
- trailing_slash – If True, the trailing slash is treated with importance. For example, /images/ is a directory while /images is a file.
- wildcards – If True, globbing wildcards are matched against paths.
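One way the parameters above could interact is sketched below. The function name and the exact handling of each flag are assumptions for illustration: with `trailing_slash` the last path component is treated as a file and dropped, otherwise both paths are treated as directories, and `wildcards` switches from prefix matching to `fnmatch` globbing.

```python
import fnmatch


def is_subdir_sketch(base_path, test_path, trailing_slash=False,
                     wildcards=False):
    """Return whether test_path lies under base_path (illustrative only)."""
    if trailing_slash:
        # Anything after the last slash is a filename; keep the directory.
        base_path = base_path.rsplit('/', 1)[0] + '/'
        test_path = test_path.rsplit('/', 1)[0] + '/'
    else:
        # Treat both paths as directories.
        if not base_path.endswith('/'):
            base_path += '/'
        if not test_path.endswith('/'):
            test_path += '/'

    if wildcards:
        return fnmatch.fnmatchcase(test_path, base_path)
    return test_path.startswith(base_path)


print(is_subdir_sketch('/images', '/images/photo.png'))
print(is_subdir_sketch('/images', '/imagesother/x'))
```

Appending the slash before comparing prevents `/images` from matching the unrelated prefix `/imagesother`.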
-
wpull.url.
normalize
(url, **kwargs)[source]¶ Normalize a URL.
This function is a convenience function that is equivalent to:
>>> URLInfo.parse('http://example.com').url
'http://example.com'
See also: URLInfo.parse().
-
wpull.url.
normalize_fragment
(text, encoding='utf-8')[source]¶ Normalize a fragment.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
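The uppercase step shared by the normalize functions can be illustrated with a regular expression that rewrites only well-formed percent escapes. This is a sketch of that single step; the helper name is hypothetical and it does not perform the percent-encoding of unacceptable characters.

```python
import re

_PERCENT_ESCAPE = re.compile(r'%[0-9a-fA-F]{2}')


def uppercase_percent_escapes(text):
    """Uppercase the hex digits of each valid percent escape."""
    return _PERCENT_ESCAPE.sub(lambda match: match.group(0).upper(), text)


print(uppercase_percent_escapes('a%2fb%3a'))
```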
-
wpull.url.
normalize_hostname
(hostname)[source]¶ Normalize a hostname so that it is ASCII and a valid domain name.
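Converting an internationalized hostname to ASCII can be sketched with Python's built-in `idna` codec. This is an assumption about the general technique, not a statement of how `normalize_hostname` is implemented, and the helper name is hypothetical.

```python
def normalize_hostname_sketch(hostname):
    """Encode a hostname to its ASCII (Punycode) form and lowercase it."""
    # The 'idna' codec raises UnicodeError for labels it cannot encode.
    return hostname.encode('idna').decode('ascii').lower()


print(normalize_hostname_sketch('b\u00fccher.example'))
print(normalize_hostname_sketch('Example.COM'))
```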
-
wpull.url.
normalize_password
(text, encoding='utf-8')[source]¶ Normalize a password.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_path
(path, encoding='utf-8')[source]¶ Normalize a path string.
Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_query
(text, encoding='utf-8')[source]¶ Normalize a query string.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_username
(text, encoding='utf-8')[source]¶ Normalize a username.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
parse_url_or_log
(url, encoding='utf-8')[source]¶ Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
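The log-and-return-None pattern can be sketched with the standard library parser standing in for `URLInfo.parse`. The function name, the logger, and the choice of `ValueError` as the failure signal are assumptions for illustration.

```python
import logging
from urllib.parse import urlsplit

_logger = logging.getLogger(__name__)


def parse_or_log_sketch(url):
    """Return the parsed URL, or None after logging a warning."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing the port validates it and may raise ValueError
    except ValueError as error:
        _logger.warning('Discarding malformed URL %r: %s', url, error)
        return None
    return parts


good = parse_or_log_sketch('http://example.com/')
bad = parse_or_log_sketch('http://[::1/')  # unbalanced IPv6 bracket
```

Callers can then skip unparsable URLs with a simple `if result is None` check instead of wrapping every call in `try`.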
-
wpull.url.
percent_encode
(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]¶ Percent encode text.
Unlike Python’s quote, this function accepts a blacklist instead of a whitelist of safe characters.
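The difference between the two approaches can be seen by comparing the standard library's `quote` with a small blacklist-based encoder. `percent_encode_sketch` is a hypothetical stand-in written for this comparison; treating control and non-ASCII bytes as always encoded is an assumption.

```python
from urllib.parse import quote


def percent_encode_sketch(text,
                          encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}),
                          encoding='utf-8'):
    """Encode only listed bytes, plus control and non-ASCII bytes."""
    pieces = []
    for byte_value in text.encode(encoding):
        if byte_value < 0x20 or byte_value > 0x7E or byte_value in encode_set:
            pieces.append('%{:02X}'.format(byte_value))
        else:
            pieces.append(chr(byte_value))
    return ''.join(pieces)


# quote() encodes everything that is not explicitly safe.
whitelist_result = quote('a b{c}', safe='')
# The blacklist approach leaves the braces alone.
blacklist_result = percent_encode_sketch('a b{c}')
print(whitelist_result, blacklist_result)
```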
-
wpull.url.
percent_encode_plus
(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]¶ Percent encode text for query strings.
Unlike Python’s quote_plus, this function accepts a blacklist instead of a whitelist of safe characters.
-
wpull.url.
query_to_map
(text)[source]¶ Return a key-values mapping from a query string.
Plus symbols are replaced with spaces.
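A key-values mapping of this shape can be built as in the sketch below. The function name is hypothetical, and whether the real function also percent-decodes keys and values is not stated here, so this sketch only performs the plus-to-space replacement.

```python
import collections


def query_to_map_sketch(text):
    """Return a dict mapping each key to a list of its values."""
    result = collections.defaultdict(list)
    for pair in text.split('&'):
        if not pair:
            continue  # skip empty pieces such as a trailing '&'
        key, _, value = pair.partition('=')
        result[key.replace('+', ' ')].append(value.replace('+', ' '))
    return dict(result)


print(query_to_map_sketch('a=1&a=2&b=x+y'))
```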
-
wpull.url.
schemes_similar
(scheme1, scheme2)[source]¶ Return whether URL schemes are similar.
This function considers the following schemes to be similar:
- HTTP and HTTPS
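The comparison described above can be sketched as follows. The function name is hypothetical and the assumption that scheme names arrive in lowercase is made for brevity.

```python
def schemes_similar_sketch(scheme1, scheme2):
    """Return True for identical schemes or the HTTP/HTTPS pair."""
    if scheme1 == scheme2:
        return True
    return {scheme1, scheme2} <= {'http', 'https'}


print(schemes_similar_sketch('http', 'https'))
print(schemes_similar_sketch('http', 'ftp'))
```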
-
wpull.url.
split_query
(qs, keep_blank_values=False)[source]¶ Split the query string.
Note for empty values: If an equal sign (=) is present, the value will be an empty string (''). Otherwise, the value will be None:
>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]
No processing is done on the actual values.
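The empty-value rule can be reproduced with `str.partition`, which reports whether the separator was found. The sketch below is an illustration of that rule under a hypothetical name, not the library's code; its behavior when `keep_blank_values` is False is an assumption.

```python
def split_query_sketch(qs, keep_blank_values=False):
    """Yield (name, value) pairs; value is None when '=' is absent."""
    for pair in qs.split('&'):
        if not pair:
            continue
        name, separator, value = pair.partition('=')
        if not separator:
            value = None  # no equal sign at all
        if keep_blank_values or value:
            yield name, value


print(list(split_query_sketch('a=&b', keep_blank_values=True)))
```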