`scraper.util` Module

Miscellaneous functions.

wpull.scraper.util.clean_link_soup(link)

Strip whitespace from a link in HTML soup.

Parameters:	link (str) – A string containing the link, possibly with extraneous whitespace.

The link is split into lines. Leading and trailing whitespace is stripped from each line, and all tab characters are removed. The lines are then concatenated without a separator and returned. Spaces inside a line are preserved.

For example, passing the href value of:

<a href=" http://example.com/

        blog/entry/

    how smaug stole all the bitcoins.html
">

will return http://example.com/blog/entry/how smaug stole all the bitcoins.html.

Returns:	The cleaned link.
Return type:	str
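
The cleaning steps described above can be sketched as follows. This is a minimal illustration of the documented behavior, not necessarily the library's own implementation; the name `clean_link_soup_sketch` is hypothetical.

```python
def clean_link_soup_sketch(link: str) -> str:
    """Hypothetical re-implementation of the documented behavior."""
    # Split into lines, strip each line, drop every tab,
    # then join the pieces with no separator.
    return "".join(
        line.strip().replace("\t", "") for line in link.splitlines()
    )


href = (
    " http://example.com/\n"
    "\n"
    "        blog/entry/\n"
    "\n"
    "    how smaug stole all the bitcoins.html\n"
)
cleaned = clean_link_soup_sketch(href)
print(cleaned)
# http://example.com/blog/entry/how smaug stole all the bitcoins.html
```

Note that the spaces inside the last line survive, because only leading and trailing whitespace is stripped from each line.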

wpull.scraper.util.identify_link_type(filename)

Return the link type guessed from the filename extension.

Returns:	A value from `item.LinkType`.
Return type:	str

wpull.scraper.util.is_likely_inline(link)

Return whether the link is likely to be inline.

wpull.scraper.util.is_likely_link(text)

Return whether the text is likely to be a link.

This function assumes that leading/trailing whitespace has already been removed.

Returns:	bool

wpull.scraper.util.is_unlikely_link(text)

Return whether the text is likely to cause false positives.

This function assumes that leading/trailing whitespace has already been removed.

Returns:	bool

wpull.scraper.util.parse_refresh(text)

Parse text for the HTTP Refresh URL.

Returns:	str, None
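
As a rough illustration of what parsing a Refresh value might involve, the sketch below extracts the target from strings such as `5; url=http://example.com/`. The regular expression, the quote handling, and the name `parse_refresh_sketch` are all assumptions made for this example; the actual function may accept or reject different inputs.

```python
import re
from typing import Optional

# Assumed pattern: anything following "url=" (case-insensitive) is the target.
_REFRESH_RE = re.compile(r"url\s*=\s*(.+)", re.IGNORECASE)


def parse_refresh_sketch(text: str) -> Optional[str]:
    """Hypothetical parser for an HTTP Refresh value."""
    match = _REFRESH_RE.search(text)
    if not match:
        return None
    url = match.group(1).strip()
    # Drop one pair of matching surrounding quotes, if present.
    if len(url) >= 2 and url[0] == url[-1] and url[0] in "\"'":
        url = url[1:-1].strip()
    return url or None


print(parse_refresh_sketch("5; url=http://example.com/"))
print(parse_refresh_sketch("10"))
```

The first call yields the URL and the second yields `None`, mirroring the `str, None` return values listed above.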

wpull.scraper.util.urljoin_safe(base_url, url, allow_fragments=True)

Join URLs with urljoin, logging a warning on error.

Returns:	str, None
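
The pattern of a join that logs instead of raising can be sketched as below. This assumes the standard library's `urllib.parse.urljoin` and assumes that `ValueError` is the error being caught; the real function may rely on a different `urljoin` or handle other exceptions. The name `urljoin_safe_sketch` is hypothetical.

```python
import logging
from typing import Optional
from urllib.parse import urljoin

_logger = logging.getLogger(__name__)


def urljoin_safe_sketch(
    base_url: str, url: str, allow_fragments: bool = True
) -> Optional[str]:
    """Hypothetical wrapper: join URLs, returning None on failure."""
    try:
        return urljoin(base_url, url, allow_fragments=allow_fragments)
    except ValueError as error:
        # Assumption: malformed input (e.g. a broken IPv6 host) raises ValueError.
        _logger.warning("Unable to join %r with %r: %s", base_url, url, error)
        return None


print(urljoin_safe_sketch("http://example.com/a/b", "../c"))
# http://example.com/c
```

Returning `None` lets a scraper skip a malformed link and continue with the rest of the document rather than aborting on one bad value.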
