Dataset originally created 08/30/2019

I. About This Dataset

This dataset was generated from content harvested from the Library of Congress's web archive of qwantz.com (Dinosaur Comics!): https://www.loc.gov/item/lcwaN0009953/. It includes minimal metadata about 3,325 image objects from the Dinosaur Comics! web archive as well as the files themselves.

This dataset was created as part of exploratory work done by the Library of Congress's Web Archiving Team. The goal of the work is to explore the contents of the Library's web archives through analysis of the indexes containing metadata from the harvested web content, as stored in CDX files (https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/). The metadata contained in the indexes was used for initial analysis, rather than the archived content stored in WARC (https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml) and ARC (https://www.loc.gov/preservation/digital/formats/fdd/fdd000235.shtml) container files, since W/ARC files present significant challenges due to their large size and high processing requirements.

The majority of the information provided alongside the files was extracted by the Library's Web Archiving Team through a process called CDX Line Extraction, which focused on the mimetype field pulled from the CDX indexes. Except when referencing this specific field, we refer to this designation by its more current name, "media type" (http://webarchive.loc.gov/all/20171105042213/http://www.iana.org/assignments/media-types/media-types.xhtml); all references to "media type" below may therefore be considered to be derived from the mimetype field in the CDX indexes. See the "How Was It Created?" section below for more information on this process.

II. Why Dinosaur Comics!?

Dinosaur Comics, published since 2003, is an internationally known webcomic that uses a fixed format of six panels with the same clip art images of the three main characters, T-Rex, Utahraptor, and Dromiceiomimus, in every comic. It is the text of each comic that changes, featuring dialog that comments on everything from relationships and holidays to philosophy, math, and contemporary culture. The comic also includes hovertext and additional metadata for each comic, making it a rich resource for future analysis. Created by Ryan North and featured on numerous "best of" lists, this long-running webcomic has generated a number of books and merchandise, and has led to additional comics and books by North, including Adventure Time and The Unbeatable Squirrel Girl.

The Library of Congress began collecting webcomics in 2012, as they are an increasingly popular format utilized by contemporary creators in the field and often include material by artists not available elsewhere. In addition to preserving this material for the future via our web archive collections, the Library is interested in making this web archive material available to researchers in formats that allow users to engage with and learn more from our archives. The Dinosaur Comics dataset includes not only metadata but also the image files of the comics themselves, presenting a unique opportunity to explore this format in new ways.

III. What's Included?

This dataset includes:

- lcwa_dinosaurcomics_image_data.zip - compressed bag containing the 3,325 files from the "qwantz" domain in the archive that have "image" in their media type, as well as manifest files with SHA-256 and SHA-512 checksums. The structure of the content follows the BagIt specification (see http://webarchive.loc.gov/all/20160830141859/https://tools.ietf.org/html/draft-kunze-bagit-08#section-2).
- lcwa_dinosaurcomics_image_metadata.csv - a CSV containing metadata derived from the CDX line entry for each image file as well as information derived from the archived instance; e.g., the Comic ID taken from the URL query string and the Comic Title taken from the "title" attribute in the HTML image tag. The fields and their contents are described in the "Dataset Field Descriptions" section.

IV. How Was It Created?

As mentioned above in the "About This Dataset" section, the bulk of this dataset was created using CDX Line Extraction. The extraction process used an Elastic MapReduce (EMR) cluster on AWS cloud services to run a series of queries using Apache Spark. The jobs filtered and sorted the CDX lines based on the following fields from the CDX line entries:

- digest: a unique cryptographic hash of the web object's payload at the time of the crawl, which provides a distinct fingerprint for that object; it is a Base32-encoded SHA-1 hash.
- mimetype: two-part designation (type/subtype) that describes the nature and format of the web object, as reported by the server at the time of the crawl.
- statuscode: represents the HTTP response code from the server at the time of the crawl, e.g. 200, 404, etc.
- urlkey: the original URL that was captured during the web harvesting process, in SURT-ordered format (http://crawler.archive.org/articles/user_manual/glossary.html#surt).

See the CDX specification for more information about the fields: https://web.archive.org/web/20171123000432/https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/.

The CDX Line Extraction process involved multiple phases.
First, a query was run to extract lines from the six-terabyte corpus that:

1) were of the media type requested, in this case, media types that contained the string "image"
2) had a status code of 200
3) had a urlkey containing "com,qwantz,www)/comics" or "com,qwantz)/comics"

The query results were then written out to an AWS S3 bucket as a CSV file.

Additional information not extracted from the CDX index was also included to provide further context for each image object. As mentioned above, this additional metadata was gleaned by extracting text from the comic's landing page; e.g., https://webarchive.loc.gov/all/20180813010042/http://www.qwantz.com/index.php?comic=93.

V. Dataset Field Descriptions

This section lists and describes each of the fields included in lcwa_qwantz_image_metadata.csv. The CSV contains 8 fields (listed in the first line) and 3,325 lines with the corresponding information for each field, as follows:

- Comic ID: The number present in the URL query string for a particular Comic's landing page; e.g., 93 in http://www.qwantz.com/index.php?comic=93.
- Comic Image: The file name of the image on the Comic's landing page; e.g., comic2-91.png for http://www.qwantz.com/index.php?comic=93.
- Original: The URL of the captured web object, including the protocol (http://) and the leading www, if applicable, extracted from the CDX index file.
- MIME Type: The media type as recorded in the CDX.
- Archived URL: A link to the first capture of the image object.
  - The archived URL is composed of four parts:
    1. the web archive domain, webarchive.loc.gov
    2. the access point, all
    3. the date range: a wildcard character of * is used to bring up all captures; in this instance, 0 is a shortcut for the first capture
    4. the resource URL, http://www.qwantz.com/comics/comic2-91.png
  - More about the Wayback Machine and the URL construction can be found here: https://github.com/iipc/openwayback/wiki/OpenWayback-Replay-API
- Comic Title: Text from the "title" attribute of the img tag on the Comic's landing page.
- Filename: Name of the image object that the other seven fields reference; found in the accompanying lcwa_qwantz_image_data.zip file. The filename is the Digest field from the CDX, which the "How Was It Created?" section describes as a Base32-encoded SHA-1 hash of the web object's payload at the time it was captured. Digests were chosen because several items in the Comic Image field are duplicates, whereas the digests are not.
- File Size (In Bytes): The size of the image object, in bytes, derived from additional processing methods outside of the CDX Line Extraction.

VI. Rights Statement

The Library follows a notification and permission process in the acquisition of content for the web archives, and to allow researcher access to the archived content, as described on the web archiving program page, https://www.loc.gov/programs/web-archiving/about-this-program/. See the Rights & Access statement (https://www.loc.gov/collections/webcomics-web-archive/about-this-collection/rights-and-access/), which applies to content in this dataset.

VII. Creator and Contributor Information

Creator: Chase Dooley

VIII. Contact Information

Please direct all questions and comments to webcapture@loc.gov.
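
Illustrative sketch (not the Web Archiving Team's actual code)

For readers who want to experiment with the metadata, the following Python sketch restates two rules described above: the three filter conditions applied to each CDX line, and the four-part layout of the Archived URL. The function names, argument names, and the https:// scheme on the archived URL are assumptions made for illustration; the original extraction ran as Apache Spark queries, which are not reproduced here.

```python
# Hypothetical helpers; names and signatures are illustrative only.

WEB_ARCHIVE_DOMAIN = "webarchive.loc.gov"  # part 1 of the Archived URL
ACCESS_POINT = "all"                       # part 2 of the Archived URL

# SURT-ordered urlkey substrings named in the README's filter description.
URLKEY_PATTERNS = ("com,qwantz,www)/comics", "com,qwantz)/comics")


def matches_extraction_filter(urlkey, mimetype, statuscode):
    """Return True if a CDX line meets all three documented conditions."""
    is_image = "image" in mimetype                       # condition 1
    is_ok = str(statuscode) == "200"                     # condition 2
    in_comics = any(p in urlkey for p in URLKEY_PATTERNS)  # condition 3
    return is_image and is_ok and in_comics


def archived_url(original, capture="0"):
    """Assemble an Archived URL from its four documented parts.

    capture="0" requests the first capture; "*" lists all captures.
    The https:// prefix is an assumption based on the example URLs above.
    """
    return f"https://{WEB_ARCHIVE_DOMAIN}/{ACCESS_POINT}/{capture}/{original}"


if __name__ == "__main__":
    print(matches_extraction_filter(
        "com,qwantz)/comics/comic2-91.png", "image/png", "200"))
    print(archived_url("http://www.qwantz.com/comics/comic2-91.png"))
```

Passing the value of the Original field to archived_url with the default capture of 0 should yield a link of the same shape as the Archived URL field; whether it matches the CSV value character for character has not been verified against the dataset.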