Data set originally created 11/6/2018

- UPDATE 11/19/19 (Corrected digest definition)

I. About This Data Set

On June 26, 2018, the Web Archiving Team began exploratory research into the web archive's content. Because of the size of the archive, a set of indexes of the archive's content, known as CDX files, was used in lieu of the WARCs for preliminary analysis.
The CDX files used in this initial analysis totaled 6 TB. For reference, the web archive was 1.484 PB in size at that time.
For the preliminary analysis, the Web Archiving Team employed a process it calls CDX Line Extraction. See the “How Was It Created?” section for more information on this process.

Using CDX Line Extraction, this derivative data set was created by filtering on the MIME type field within each CDX line entry. Although the CDX specification refers to this field as "mime type," its values are officially recognized as media types (see http://webarchive.loc.gov/all/20171105042213/http://www.iana.org/assignments/media-types/media-types.xhtml). Therefore, outside of the discussion of CDX Line Extraction, the MIME type field is referred to as media type in all other sections.


II. What's Included?

This data set includes:

* lcwa_gov_image_data.zip - compressed bag containing the 1000 randomly selected images from the archive, as well as manifest files with sha256 and sha512 checksums (see the BagIt specification for more information: http://webarchive.loc.gov/all/20160830141859/https://tools.ietf.org/html/draft-kunze-bagit-08#section-2).

* lcwa_gov_image_metadata.csv - a CSV containing metadata derived from the CDX line entry for each image; this and the additional methods used are described in the "How Was It Created?" section.


III. How Was It Created?

As mentioned above in the “About This Data Set” section, the bulk of this data set was created using CDX Line Extraction. The extraction process utilized AWS's processing power by creating an Elastic MapReduce (EMR) cluster to run a series of MapReduce jobs. The jobs filtered and sorted the CDX lines based on a few fields from the CDX line entries:

* digest: a unique cryptographic hash of the web object’s payload at the time of the crawl, which provides a distinct fingerprint for that object; it is a Base32 encoded SHA-1 hash.
* mime type: two-part designation (type/subtype) that describes the nature and format of the web object, as reported by the server at the time of the crawl.
* status code: represents the HTTP response code from the server at the time of the crawl, e.g. 200, 404, etc.
* original URL: the URL that was captured.
See the CDX specification for more information about the fields: https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.
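
As an illustration of the digest field described above, the following is a minimal sketch of how a Base32 encoded SHA-1 hash can be computed from a payload in Python. The function name is hypothetical, and this is not the crawler's own code; it only shows the encoding the field uses.

```python
import base64
import hashlib


def base32_sha1(payload: bytes) -> str:
    """Return the Base32 encoded SHA-1 hash of a payload (hypothetical helper)."""
    raw = hashlib.sha1(payload).digest()          # 20 raw bytes
    return base64.b32encode(raw).decode("ascii")  # 32 characters, no padding


if __name__ == "__main__":
    # Any bytes will do; a real digest is computed over the captured payload.
    print(base32_sha1(b"example payload"))
```

Because a SHA-1 hash is 160 bits long and Base32 encodes 5 bits per character, the resulting string is always 32 characters long.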

The first MapReduce Job filtered the 6 TB corpus, keeping only the lines that:

1. were of the mime type requested, e.g. pdf
2. had a status code of 200
3. had an original URL whose top-level domain was ".gov"
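
The three conditions above can be sketched in plain Python as a single predicate. This is not the MapReduce code that was run; the field order in CDX_COLUMNS, the space delimiter, and the function names are assumptions made for illustration, and the actual order is given by the header line of each CDX file.

```python
from urllib.parse import urlsplit

# Assumed order of the leading fields in a space-delimited CDX line.
# This is a hypothetical layout; check the header of the CDX file in use.
CDX_COLUMNS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest"]


def parse_cdx_line(line: str) -> dict:
    """Map the leading fields of one CDX line to the assumed column names."""
    return dict(zip(CDX_COLUMNS, line.split()))


def keep_line(line: str, wanted_mime: str) -> bool:
    """Return True if the line meets all three filter conditions."""
    record = parse_cdx_line(line)
    if len(record) < len(CDX_COLUMNS):
        return False  # malformed or truncated line
    host = urlsplit(record["original"]).hostname or ""
    return (
        record["mimetype"] == wanted_mime      # 1. requested mime type
        and record["statuscode"] == "200"      # 2. status code of 200
        and host.endswith(".gov")              # 3. ".gov" top-level domain
    )
```

For example, keep_line(line, "image/png") would return True only for a 200 response of type image/png whose original URL is on a ".gov" host.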

The resulting CDX lines were written to CDX files and stored in an S3 bucket to be used by the second MapReduce Job.

The second MapReduce Job pulled all the digests from the filtered CDX files created by the first job and wrote them to a list that was likewise stored in an S3 bucket. A Python script was then used to randomly select one thousand digests from this larger digest list. The third and final MapReduce Job took these one thousand digests as input, extracted every CDX line in which each digest was referenced, and wrote those lines to a CDX file that was stored in an S3 bucket. The CDX file from the final MapReduce Job was downloaded and converted to a CSV using a Python script; the CSV was then used as the basis for additional computational methods of exploration.
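
The sampling and conversion steps described above might look roughly like the following in Python. This is a sketch under stated assumptions, not the scripts that were actually used: the column order in CDX_COLUMNS, the space delimiter, and all function names are hypothetical, and the real jobs read from and wrote to S3 rather than working on in-memory lists.

```python
import csv
import random

# Assumed order of the leading fields in a space-delimited CDX line (hypothetical).
CDX_COLUMNS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest"]
DIGEST_INDEX = CDX_COLUMNS.index("digest")


def digest_of(line):
    """Return the digest field of a CDX line, or None if the line is too short."""
    parts = line.split()
    return parts[DIGEST_INDEX] if len(parts) > DIGEST_INDEX else None


def collect_digests(cdx_lines):
    """Second job, sketched: gather the distinct digests from the filtered lines."""
    return {d for d in map(digest_of, cdx_lines) if d is not None}


def sample_digests(digests, k=1000, seed=None):
    """Randomly select up to k digests; sorting makes a seeded run repeatable."""
    pool = sorted(digests)
    return set(random.Random(seed).sample(pool, min(k, len(pool))))


def lines_for_digests(cdx_lines, chosen):
    """Third job, sketched: keep every line that references a sampled digest."""
    return [line for line in cdx_lines if digest_of(line) in chosen]


def cdx_to_csv(cdx_lines, out):
    """Write the leading CDX fields of each line as CSV rows with a header."""
    writer = csv.writer(out)
    writer.writerow(CDX_COLUMNS)
    for line in cdx_lines:
        writer.writerow(line.split()[: len(CDX_COLUMNS)])
```

Note that one digest can be referenced by more than one CDX line, so the extraction step may return more lines than the number of digests sampled.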


IV. Data Set Field Descriptions

* urlkey: the URL of the captured web object, without the protocol (http://) or the leading www, derived from the CDX index file.
* timestamp: timestamp in the form YYYYMMDDhhmmss. It represents the point at which the web object was captured, derived from the CDX index file.
* original: the URL of the captured web object, including the protocol (http://) and the leading www, if applicable, derived from the CDX index file.
* statuscode: the HTTP response code received from the server at the time of capture, e.g., 200, 404, etc.
* digest: a unique cryptographic hash of the web object’s payload at the time of the crawl, which provides a distinct fingerprint for that object; it is a Base32 encoded SHA-1 hash, derived from the CDX index file.
* height: measured in pixels, derived from additional processing methods used in conjunction with Tika.
* width: measured in pixels, derived from additional processing methods used in conjunction with Tika.
* sha256: a unique cryptographic hash of the downloaded web object, computed using the SHA-256 function from the SHA-2 family of algorithms. It serves as a checksum for the downloaded web object and was created during the BagIt process.
* sha512: a unique cryptographic hash of the downloaded web object, computed using the SHA-512 function from the SHA-2 family of algorithms. It serves as a checksum for the downloaded web object and was created during the BagIt process.
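
For readers working with the CSV, the following sketch shows one way to parse the timestamp field and to recompute the sha256 and sha512 checksums of a downloaded file for comparison. The function names are hypothetical, and the comparison assumes the checksums are stored as hexadecimal strings, which should be confirmed against the CSV and the bag's manifest files.

```python
import hashlib
from datetime import datetime


def parse_capture_timestamp(timestamp: str) -> datetime:
    """Parse a YYYYMMDDhhmmss timestamp into a datetime object."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def file_checksums(path: str, chunk_size: int = 1 << 20) -> tuple:
    """Return (sha256, sha512) hex digests of a file, read in chunks."""
    sha256 = hashlib.sha256()
    sha512 = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            sha512.update(chunk)
    return sha256.hexdigest(), sha512.hexdigest()


def matches_metadata(path: str, expected_sha256: str, expected_sha512: str) -> bool:
    """Compare a file's checksums to expected values (assumed to be hex strings)."""
    actual_256, actual_512 = file_checksums(path)
    return (
        actual_256 == expected_sha256.lower()
        and actual_512 == expected_sha512.lower()
    )
```

Reading the file in chunks keeps memory use flat regardless of image size, and both hashes are computed in a single pass over the file.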


V. Rights Statement

This data set was derived from content in the Library’s web archives. The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page, https://www.loc.gov/programs/web-archiving/about-this-program/. Files were extracted from a variety of archived United States government web sites collected in a number of event and thematic archives. See a representative Rights & Access statement for a sample collection which applies to all of the content in this data set: https://www.loc.gov/collections/legislative-branch-web-archive/about-this-collection/rights-and-access/.


VI. Creator and Contributor Information

Creator: Chase Dooley
Contributors: Pedro Gonzalez-Fernandez, Abbie Grotke, Kate Murray, Trevor Owens, Grace Thomas


VII. Contact Information

Please direct all questions and comments to webcapture@loc.gov.