scraper.util Module

Miscellaneous functions.

Strip whitespace from a link in HTML soup.

Parameters: link (str) – A string containing the link with lots of whitespace.

The link is split into lines. Leading and trailing whitespace is stripped from each line, and every tab character is removed. The lines are then concatenated without a separator and returned. Spaces inside a line are kept.

For example, passing the href value of:

<a href=" http://example.com/

        blog/entry/

    how smaug stole all the bitcoins.html
">

will return http://example.com/blog/entry/how smaug stole all the bitcoins.html.
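The cleaning steps described above can be approximated with a minimal sketch. This is not the library's source; the name clean_link_soup_sketch is an assumption made for illustration only.

```python
def clean_link_soup_sketch(link: str) -> str:
    """Hypothetical re-implementation of the steps described above."""
    # Split into lines, strip the edges of each line, drop tabs,
    # then join the pieces without any separator.
    return ''.join(
        line.strip().replace('\t', '') for line in link.splitlines()
    )


# The href value from the example above, written as a Python string.
href = (
    " http://example.com/\n"
    "\n"
    "        blog/entry/\n"
    "\n"
    "    how smaug stole all the bitcoins.html\n"
)
cleaned = clean_link_soup_sketch(href)
```

Only the edges of each line are stripped, which is why the spaces in the filename survive.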

Returns: The cleaned link.
Return type: str

Return link type guessed by filename extension.

Returns: A value from item.LinkType.
Return type: str
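Guessing a link type from a filename extension could look roughly like the sketch below. The function name, the extension table, and the type strings are all hypothetical stand-ins; the actual members of item.LinkType and the library's lookup method are not shown in this document.

```python
import posixpath
from typing import Optional

# Hypothetical mapping; the real item.LinkType values may differ.
_EXTENSION_TO_TYPE = {
    '.html': 'html',
    '.htm': 'html',
    '.css': 'css',
    '.js': 'javascript',
    '.png': 'media',
    '.jpg': 'media',
    '.gif': 'media',
}


def guess_link_type_sketch(filename: str) -> Optional[str]:
    """Return a guessed type string, or None if the extension is unknown."""
    # splitext returns (root, extension); compare case-insensitively.
    extension = posixpath.splitext(filename)[1].lower()
    return _EXTENSION_TO_TYPE.get(extension)
```

A fixed table keeps the sketch deterministic; a real implementation might consult a MIME type database instead.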
wpull.scraper.util.is_likely_inline(link)

Return whether the link is likely to be inline.

Return whether the text is likely to be a link.

This function assumes that leading/trailing whitespace has already been removed.

Returns: bool

Return whether the text is likely to cause false positives.

This function assumes that leading/trailing whitespace has already been removed.

Returns: bool
wpull.scraper.util.parse_refresh(text)

Parses text for HTTP Refresh URL.

Returns: str, None
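As an illustration of what parsing a Refresh value involves, here is a minimal sketch. It assumes the common "seconds; url=target" form and is not the library's implementation; the name parse_refresh_sketch is hypothetical.

```python
import re
from typing import Optional


def parse_refresh_sketch(text: str) -> Optional[str]:
    """Return the URL from a Refresh value, or None if there is none."""
    # Find "url=" (case-insensitive) and capture the rest of the line.
    match = re.search(r'url\s*=\s*(.+)', text, re.IGNORECASE)
    if not match:
        return None
    url = match.group(1).strip()
    # Remove one pair of matching surrounding quotes, if present.
    if len(url) >= 2 and url[0] == url[-1] and url[0] in '"\'':
        url = url[1:-1]
    return url or None
```

For instance, a value with only a delay and no URL part yields None, matching the "str, None" return described above.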
wpull.scraper.util.urljoin_safe(base_url, url, allow_fragments=True)

urljoin with warning log on error.

Returns: str, None
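The pattern of joining URLs and logging a warning on failure can be sketched as follows. This assumes the standard library's urllib.parse.urljoin, which raises ValueError for some malformed inputs; the name urljoin_safe_sketch and the log message are hypothetical, not taken from the library.

```python
import logging
from typing import Optional
from urllib.parse import urljoin

_logger = logging.getLogger(__name__)


def urljoin_safe_sketch(base_url: str, url: str,
                        allow_fragments: bool = True) -> Optional[str]:
    """Join two URLs, returning None and logging a warning on error."""
    try:
        return urljoin(base_url, url, allow_fragments=allow_fragments)
    except ValueError as error:
        # For example, an unbalanced "[" in the host part is rejected.
        _logger.warning('Unable to join %r with %r: %s',
                        base_url, url, error)
        return None
```

Returning None instead of raising lets a scraper skip a single bad link and keep processing the rest of the page.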