Forty-Five Years of Digitizing Ebooks: Project Gutenberg's Practices
Part 2
For HTML, EPUB and MOBI, pairs of files are generated: one with images, and one without. The set of files without images is intended to be friendlier to readers with limited bandwidth, or who lack the storage space for the images included with an eBook.
After uploading, a team of human experts — known as the “whitewashers,” after a scene in Mark Twain’s “The Adventures of Tom Sawyer” — does final formatting, attaches the Project Gutenberg header and footer, and uploads the new item to the server at www.gutenberg.org.
CATALOGING AND MIRRORING
The Project Gutenberg catalog database includes metadata from within each eBook: the author, title, available file formats, upload/publication date, language, etc. Human catalogers later add further metadata, including Library of Congress Subject Headings. This catalog is available for free download in machine-readable form (XML/RDF or MARC).
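As a rough illustration of what such a machine-readable record might carry, the sketch below models the fields named above and serializes them to a small XML fragment. The class name, field names, and element names are assumptions made for this example; they are not the actual Project Gutenberg RDF or MARC schema, and the ID, date, and subject values are placeholders.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class CatalogRecord:
    """Hypothetical record holding the metadata fields described in the text."""
    ebook_id: int
    title: str
    author: str
    language: str
    release_date: str  # ISO date string
    formats: list[str] = field(default_factory=list)
    subjects: list[str] = field(default_factory=list)  # added later by catalogers


def to_xml(record: CatalogRecord) -> str:
    """Serialize a record to XML (element names are illustrative only)."""
    root = ET.Element("ebook", id=str(record.ebook_id))
    for tag in ("title", "author", "language", "release_date"):
        ET.SubElement(root, tag).text = getattr(record, tag)
    for fmt in record.formats:
        ET.SubElement(root, "format").text = fmt
    for subject in record.subjects:
        ET.SubElement(root, "subject").text = subject
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values: the ID, date, and subject are not real catalog data.
    record = CatalogRecord(
        ebook_id=0,
        title="The Adventures of Tom Sawyer",
        author="Mark Twain",
        language="en",
        release_date="2000-01-01",
        formats=["HTML", "EPUB", "MOBI"],
        subjects=["Example subject heading"],
    )
    print(to_xml(record))
```

Keeping the subject list separate from the fields taken from the eBook itself mirrors the two-stage process described above, where catalogers add headings after the initial record exists.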
Organizations that desire to redistribute Project Gutenberg’s content, freely and without limitations, are invited to do so. The catalog may be used for this purpose, and various mechanisms are available to automatically maintain a copy of the collection itself (i.e., “mirroring”), including for generated content.
“NO SWEAT OF THE BROW COPYRIGHT”
An important innovation during the evolution of Project Gutenberg was to clarify the notion of “authorship” and its critical role in establishing copyright. In the early days, it was common to think that applying HTML markup, reformatting, or changing spelling qualified an item for a new copyright. Historically, some print publishers even claimed new copyrights simply for typesetting a new edition.
Today, we know US copyright is based on the creative expression of ideas through authorship. Markup and spelling changes do not qualify. As a result, Project Gutenberg volunteers are able to “harvest” public domain materials on the Internet, once they are determined to match public domain print materials. This is not a frequent occurrence, however, since most volunteers prefer to work on items that are not yet digitized.
Similarly, Project Gutenberg claims no copyright on the “sweat of the brow” labor which is applied to make eBooks from print sources. There were a few earlier items where such copyright was claimed erroneously, but this is no longer done.
EBOOKS, OR PICTURES OF BOOKS?
Project Gutenberg has over 50,000 eBooks in its collection. This is far fewer than Google Books, or The Internet Archive, or other large-scale digitization projects of historical items. An important distinction is that Project Gutenberg engages in the proofreading, formatting, markup/encoding, and other activities described above. Those other very large projects are primarily devoted to scanning, and then provide raw OCR output with a few automatically generated formats.
Such items are only partial eBooks; really, they are pictures (scans) of books, with some additional automated features. These are valuable, but do not provide the reading experience or quality of presentation that Project Gutenberg strives for. Using current technology, it takes human intellect and effort to convert a picture of a book into a true, functional eBook.
PAST INNOVATIONS AND FUTURE INITIATIVES
Project Gutenberg has evolved its practices over the years, and has often been a leader in the creation and distribution of eBooks. Some past innovations include the following, and all are still in active use today:
Development of an open content trademark license (1991-1993), which is intended to guarantee to readers that public domain items remain free, while placing restrictions on the trademarked name “Project Gutenberg” to protect against abusive practices by those who would sell the public domain items;
File/directory-based access to the collection, guaranteeing ease of copying (by file, or subcollection, or the entire collection), mirroring, and large-scale redistribution (1994);
Anonymous access for all readers, requiring no logins or authorization for any items (1994);
Web-based access to content, and development of procedures to assure HTML is valid and well-formed (1996);
The Copyright How-To, including the Rule 6 How-To for non-renewed items (2000 & 2008);
Support of Distributed Proofreaders (2002-2004), for crowdsourced proofreading and other aspects of new eBook creation;
Implementation of eBook reader formats, for free use on mobile phones, tablets, and other devices (2009);
Free redistribution of metadata as a separate download (2007 & 2012);
Integration with OneDrive, Dropbox, and other mechanisms for readers to employ “cloud” storage for eBooks (2013);
Fully automated conversion from master formats to eBook formats (2013).
Project Gutenberg has ongoing initiatives to improve service offerings to readers. There are no definite timelines for these, and offers of assistance (or partnerships!) are always of interest. Some future initiatives may include:
Continued efforts to separate the “collection” from the “interface,” making it easier for different Web-based skins to be used to access content;
Mechanisms for creation of personal bookshelves, “shopping carts” or other reading lists, for users to more easily track items of interest;
Crowdsourced reviews, errata and improvements to eBooks, including capabilities for forked versions, versioning, and other techniques common among developers of free software;
Improvements in the ability to identify and filter items by the author’s death date, which is the most common criterion for determining the public domain status of older items in countries other than the US;
Better tracking of sources used, including for harvested scans; even with no guarantee of faithfulness to a particular print source, information about the source is frequently requested;
More languages, more formats, and additional content types;
Encouragement of innovative ideas by Project Gutenberg’s readers and other fans;
Ongoing evolution in the utility of Project Gutenberg eBooks for future reading devices.
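The death-date criterion mentioned in the list above can be sketched in a few lines. This is a deliberately simplified model, assuming a "life plus N years" rule in which a work enters the public domain on 1 January of the year after the term ends. Actual rules vary by country and by type of work, and the `Author` class and function names are hypothetical, not part of any Project Gutenberg tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    name: str
    death_year: Optional[int]  # None when the death date is unknown


def is_public_domain(author: Author, term_years: int, current_year: int) -> bool:
    """Simplified 'life + term_years' check (an assumption, not legal advice)."""
    if author.death_year is None:
        # Unknown death date: treat conservatively as still protected.
        return False
    return current_year > author.death_year + term_years


def filter_public_domain(authors, term_years: int, current_year: int):
    """Return only the authors whose works pass the simplified check."""
    return [a for a in authors if is_public_domain(a, term_years, current_year)]


if __name__ == "__main__":
    twain = Author("Mark Twain", 1910)
    unknown = Author("Author with unknown death date", None)
    # With a 70-year term, 1910 + 70 = 1980, so the first qualifying year is 1981.
    print(is_public_domain(twain, term_years=70, current_year=1980))
    print(is_public_domain(twain, term_years=70, current_year=1981))
    print([a.name for a in filter_public_domain([twain, unknown], 70, 2000)])
```

Passing the term as a parameter rather than fixing it reflects the point in the list: the relevant term differs between countries, so a filter of this kind would need to be applied per jurisdiction.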
APPRECIATION FOR VOLUNTEERS
Project Gutenberg is thankful to the tens of thousands of volunteers who, over more than 45 years, have contributed to the creation and distribution of free electronic books. It is through the efforts of these volunteers that Project Gutenberg has been successful, and continues to thrive.