WARC Specification
This page describes additional de facto and custom extensions to the WARC standard.
Wpull follows the latest draft of the ISO 28500 specification.
FTP
FTP recording follows Heritrix specifications.
Control Conversation
The Control Conversation is recorded as
- WARC-Type: metadata
- Content-Type: text/x-ftp-control-conversation
- WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
- WARC-IP-Address: an IPv4 address with port or an IPv6 address with brackets and port
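As a rough illustration of the WARC-IP-Address value described above, the following Python sketch formats an address and port pair. The function name `format_warc_ip_address` and the sample addresses are hypothetical and are not taken from Wpull's code.

```python
import ipaddress


def format_warc_ip_address(host: str, port: int) -> str:
    """Format an IP literal and port as described for WARC-IP-Address.

    Hypothetical helper: IPv4 becomes 'a.b.c.d:port' and IPv6 becomes
    '[addr]:port'.
    """
    # ip_address() raises ValueError if host is not an IP literal.
    address = ipaddress.ip_address(host)
    if address.version == 6:
        return f"[{address}]:{port}"
    return f"{address}:{port}"


print(format_warc_ip_address("192.0.2.10", 21))
print(format_warc_ip_address("2001:db8::1", 21))
```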
The resource is formatted as follows:
- Events are prefixed with an ASCII asterisk and space.
- Requests are prefixed with an ASCII greater-than sign and space.
- Responses are prefixed with an ASCII less-than sign and space.
The document encoding is UTF-8.
Changed in version 1.2a1: The document encoding previously used Latin-1.
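The line format above can be sketched in Python as follows. This is a minimal sketch, not Wpull's implementation: the function name, the `(kind, text)` input shape, the sample FTP exchange, and the use of a bare newline as line terminator are all assumptions made for illustration.

```python
# Markers described above: asterisk for events, greater-than for requests,
# less-than for responses, each followed by a space.
PREFIXES = {"event": "* ", "request": "> ", "response": "< "}


def format_control_conversation(entries):
    """Render (kind, text) pairs as a UTF-8 encoded control conversation.

    Hypothetical helper; the newline line terminator is an assumption.
    """
    lines = [PREFIXES[kind] + text for kind, text in entries]
    return ("\n".join(lines) + "\n").encode("utf-8")


# Illustrative exchange only; not a captured log.
sample = [
    ("event", "Opening control connection"),
    ("request", "USER anonymous"),
    ("response", "230 Login successful."),
]
print(format_control_conversation(sample).decode("utf-8"), end="")
```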
Response data
The response data is recorded as
- WARC-Type: resource
- WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
- WARC-Concurrent-To: a WARC Record ID of the Control Conversation
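To show how the two FTP records relate, here is a hedged Python sketch that builds the header fields for both and links them through WARC-Concurrent-To. The helper names, the plain-dict representation, and the `<urn:uuid:...>` form of the record IDs are assumptions for illustration; Wpull's own record classes may differ.

```python
import uuid


def new_record_id() -> str:
    """Return a record ID in the commonly used <urn:uuid:...> form (assumed)."""
    return f"<urn:uuid:{uuid.uuid4()}>"


def build_ftp_record_headers(url: str, ip_address: str):
    """Build header dicts for the control conversation and response data."""
    control_id = new_record_id()
    control = {
        "WARC-Type": "metadata",
        "WARC-Record-ID": control_id,
        "Content-Type": "text/x-ftp-control-conversation",
        "WARC-Target-URI": url,
        "WARC-IP-Address": ip_address,
    }
    response = {
        "WARC-Type": "resource",
        "WARC-Record-ID": new_record_id(),
        "WARC-Target-URI": url,
        # Points back at the control conversation record.
        "WARC-Concurrent-To": control_id,
    }
    return control, response


control, response = build_ftp_record_headers(
    "ftp://anonymous@example.com/treasure.txt", "192.0.2.10:21"
)
print(response["WARC-Concurrent-To"] == control["WARC-Record-ID"])
```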
PhantomJS
Snapshot
A PhantomJS Snapshot represents the state of the DOM at the time of capture.
A Snapshot is recorded as
- WARC-Type: resource
- WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page.
- Content-Type: one of application/pdf, text/html, image/png
- WARC-Concurrent-To: a WARC Record ID of a Snapshot Action Metadata.
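The snapshot target URI could be built along these lines in Python. The function name is hypothetical, and the choice to percent-encode every reserved character (`safe=""`) is an assumption: the text above only says the URL is percent-encoded, not which characters are escaped.

```python
from urllib.parse import quote


def snapshot_target_uri(page_url: str) -> str:
    """Build a urn:X-wpull:snapshot URI for a page URL.

    Hypothetical helper; safe="" escapes all reserved characters, which is
    an assumption about the encoding rather than documented behavior.
    """
    return "urn:X-wpull:snapshot?url=" + quote(page_url, safe="")


print(snapshot_target_uri("http://example.com/a b?x=1"))
```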
Snapshot Action Metadata
An Action Metadata is a log of steps performed before a Snapshot is taken.
It is recorded as
- WARC-Type: metadata
- Content-Type: application/json
- WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page.
Wpull Metadata
Log
Wpull’s log is recorded as
- WARC-Type: resource
- Content-Type: text/plain
- WARC-Target-URI: urn:X-wpull:log
The document encoding is UTF-8.
youtube-dl
The JSON file is recorded as
- WARC-Type: metadata
- Content-Type: application/vnd.youtube-dl_formats+json
- WARC-Target-URI: metadata://AUTHORITY_AND_RESOURCE where AUTHORITY_AND_RESOURCE is the hierarchical part, query, and fragment of the URL passed to youtube-dl. In other words, the URI is the URL with its scheme replaced by metadata.
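The scheme replacement described above might look like this in Python. It is a minimal sketch with a hypothetical function name, and it assumes the input URL contains a "://" separator after its scheme.

```python
def youtube_dl_metadata_uri(url: str) -> str:
    """Replace the scheme of a URL with 'metadata'.

    Hypothetical helper; everything after '://' (authority, path, query,
    and fragment) is kept unchanged.
    """
    scheme, separator, rest = url.partition("://")
    if not scheme or not separator:
        raise ValueError(f"URL has no scheme separator: {url!r}")
    return "metadata://" + rest


print(youtube_dl_metadata_uri("https://example.com/watch?v=abc#t=10"))
```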