Forty-Five Years of Digitizing Ebooks: Project Gutenberg's Practices

Part 1

FORTY-FIVE YEARS OF DIGITIZING EBOOKS

PROJECT GUTENBERG’S PRACTICES

By Gregory B. Newby

CEO, Project Gutenberg Literary Archive Foundation

ABSTRACT

Project Gutenberg creates and freely distributes electronic books (eBooks). This document offers elements of the story of Project Gutenberg’s methods and practices for creating those eBooks, and the surrounding procedures for making them as widely available as possible. Project Gutenberg seeks to make the world’s great literature enjoyable and accessible.

HISTORICAL ROOTS

The first Project Gutenberg eBook was created on July 4, 1971. Michael S. Hart had been granted access to a powerful mainframe computer at the University of Illinois at Urbana-Champaign, and realized that his greatest impact would be by digitizing and distributing free literature (for more history, see: The eBook is 40 (1971-2011), by Marie Lebert, https://www.gutenberg.org/ebooks/36985).

Michael took a printed copy of the United States Declaration of Independence (www.gutenberg.org/ebooks/1) to the computer laboratory, where he sat at the teletype terminal and typed this first eBook. He distributed it via email to the people he knew about via the Internet’s predecessor, ARPAnet, which was available at UIUC. At that moment, the first eBook had been freely distributed to the online community of the day.

Digitization and production techniques, at the time of this first eBook, were /ad hoc/ and informal. A single eBook producer would edit a single file, from a single source. The first eBook’s printed source was a single sheet of paper, without hyphenation, a book cover, images, or other characteristics of book-length sources. In 1971, capitalization was not an issue, as only upper case letters were available in the character set used by the system.

Figure 1: Top view of a Model 33 Teletype, salvaged from the computer laboratory where Michael Hart typed the first eBook. The paper roll was where output would be printed.

During the next twenty years, from approximately 1971-1991, techniques of digitization would be dramatically improved, and regularized. Ongoing developments since then have tracked the available technologies for eBook creation and use, as well as preferences and interests of the many volunteers who would produce those eBooks.

Throughout the history of Project Gutenberg, these techniques, while refined and clearly articulated, have remained flexible (see the Volunteers’ FAQ at https://www.gutenberg.org/help/volunteers_faq.html).

EMPHASIS ON THE PUBLIC DOMAIN

Project Gutenberg’s founder, Michael Hart, was motivated by completely free and unencumbered redistribution of literary works. Access to literary works enables literacy, which in turn opens the door to education and, it is hoped, opportunity. Interest in literary works that could be freely redistributed led to an emphasis on books and other items that are in the public domain.

The public domain is, today, understood to be those items that are not copyrighted. Copyright in the United States, where Project Gutenberg operates, is defined as a temporary monopoly granted to authors (or their agents), so that they can benefit from a work’s commercial potential, thereby fostering continued creation:

“To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries” (United States Constitution, https://www.gutenberg.org/ebooks/5).

ITEMS ARE IN THE PUBLIC DOMAIN FOR ONE OF THREE REASONS

1. They are ineligible for copyright. In the US, this includes works created by the US Government;

2. Their copyright term has expired; or

3. They are granted to the public domain by the creator or their agent (i.e., the rights holder).

Because of its emphasis on literary works, Project Gutenberg has mostly focused on items for which the copyright term has expired. Until 1998, this included items published 75 years earlier. For example, items from 1920 entered the public domain when their copyrights expired in 1995. The US Copyright Term Extension Act of 1998 changed the term to 95 years for most literary works, so new items (from 1923 onward) will not enter the public domain before 2019.
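The term arithmetic in the two examples above can be sketched as follows. This is only an illustration of the rule of thumb described in this section, not a copyright clearance tool; the function name is hypothetical, and it assumes the term is counted from the year of publication and runs through the end of the calendar year in which it expires.

```python
def first_public_domain_year(publication_year: int, term_years: int) -> int:
    """Return the first calendar year in which a work is out of copyright.

    Assumption: the term is counted from the publication year and lasts
    through December 31 of the year in which it expires.
    """
    expiry_year = publication_year + term_years
    return expiry_year + 1


# 75-year term: an item from 1920 expires during 1995.
print(first_public_domain_year(1920, 75))  # 1996

# 95-year term: an item from 1923 is not free before 2019.
print(first_public_domain_year(1923, 95))  # 2019
```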

Figure 2: Michael Hart’s sunroom workspace in his Urbana home

There are over one million published works from before 1923, and these are the main items that Project Gutenberg continues to digitize and distribute. In addition, there were approximately one million works published in the United States from 1923-1964 but not renewed. Those items entered the public domain when their first copyright term ended, 28 years after publication. The copyright procedures utilized are online at https://www.gutenberg.org/help/copyright.html.

COLLECTION DEVELOPMENT POLICY AND EARLY MARKUP

The eBook collection, and all other aspects of Project Gutenberg, relies on volunteers to grow. Therefore, selection of items is done mainly by volunteers. Project Gutenberg seeks to limit duplication in the collection, and instead prefers to add items not already in the collection. Improvement of existing items is ongoing, mainly when errata reports are submitted by readers.

It took over two decades to release the first 100 eBooks, with #100 being published in 1994. Most of those first eBooks were collected through personal interaction with Hart. He would guide or participate in the digitization process, often developing procedures to deal with new characteristics. Footnotes and endnotes, italics and underscores, bold text, and different fonts all presented challenges for representation as plain text. Primitive markup techniques were developed, such as using an underscore character to surround underscored text, _like this_.
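As an illustration of how such lightweight markup can later be interpreted by software, here is a minimal sketch that turns underscore-delimited spans into HTML italics. The choice of the <i> element and the function name are assumptions made for this example; the pattern is deliberately naive and would also match underscores used for other purposes.

```python
import re

# A span that starts and ends with an underscore on the same line.
UNDERSCORE_SPAN = re.compile(r"_([^_\n]+)_")


def underscores_to_html(text: str) -> str:
    """Replace _emphasized_ spans with <i>emphasized</i> (assumed mapping)."""
    return UNDERSCORE_SPAN.sub(r"<i>\1</i>", text)


print(underscores_to_html("underscored text, _like this_."))
```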

It was not until the mid-1990s that hypertext markup language (HTML) was first used, and at the time it was decided that Project Gutenberg eBooks should be wholly self-contained. A zip file would include all of the needed images, and external links were discouraged.

Throughout the entire history of Project Gutenberg, volunteers have been encouraged to work on items they are interested in, and to make their own decisions about how to best represent the content.

PROOFREADING

The first eBooks were created by typing the text of printed books into word processor or text editing programs, and then submitting the files for final formatting and redistribution. Typists would perform basic formatting, including:

Omitting page headers/footers and pagination;

Spelling correction (spelling modernization was optional, and some transcribers preferred to leave the original spelling);

De-hyphenation;

Relocating any footnotes to endnotes;

Adding basic markup or emphasis, as described above;

Standard formatting for headings and chapters. Chapter titles would have two blank lines before, and one blank line after;

Line and paragraph formatting, including line endings with carriage returns + line feed at approximately 72 characters, no paragraph indentation (unless it is a block quote or similar), and a blank line between paragraphs.
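Several of the conventions listed above are mechanical enough to sketch in code. The following is a minimal illustration, not Project Gutenberg's actual tooling: the function name is hypothetical, and the de-hyphenation rule is naive (it rejoins every word split across a line break, including words that legitimately contain a hyphen).

```python
import re
import textwrap


def format_plain_text(paragraphs: list[str], width: int = 72) -> str:
    """Wrap paragraphs at `width` characters, with CRLF line endings,
    no indentation, and one blank line between paragraphs."""
    formatted = []
    for paragraph in paragraphs:
        # De-hyphenation: "digi-\ntized" becomes "digitized".
        joined = re.sub(r"(\w)-\n(\w)", r"\1\2", paragraph)
        # Collapse all runs of whitespace, including old line breaks.
        collapsed = " ".join(joined.split())
        formatted.append(textwrap.fill(collapsed, width=width))
    body = "\n\n".join(formatted) + "\n"
    # Carriage return + line feed at the end of every line.
    return body.replace("\n", "\r\n")


sample = ["A digi-\ntized book.", "Second   paragraph."]
print(repr(format_plain_text(sample)))
```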

Plain text eBooks, which were the only major format until HTML became more frequent by the mid- to late-1990s, were designed to be viewed on computer monitors with fixed-width fonts with 80-character lines. Plain text is still provided for nearly all Project Gutenberg eBooks today, although HTML and other formats are also provided.

Once an item is typed into an electronic file, and basic formatting is completed, one or more rounds of proofreading will help to improve quality by catching typos, poor formatting, and inconsistent presentation. In practice, all eBooks published by Project Gutenberg still have errors, even if they are far better than 99% accurate. For example, an eBook that is 99.99% accurate (i.e., “four nines”) will still have one wrong character in 10,000. That amounts to approximately 30 errors in a typical 50,000 word novel (roughly 300,000 characters). Proofreading is, by its nature, asymptotic. Subsequent rounds of proofreading improve an eBook, but that eBook is still likely to contain some errors.
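The estimate of about 30 errors can be reproduced with a short calculation. The figure of six characters per word (including the following space) is an assumption made here for illustration; it is not stated in the text.

```python
def expected_errors(words: int, error_rate: float, chars_per_word: float = 6.0) -> float:
    """Estimate wrong characters in a text.

    chars_per_word is an assumed average that includes the trailing space.
    """
    characters = words * chars_per_word
    return characters * error_rate


# One wrong character in 10,000, for a 50,000 word novel.
print(round(expected_errors(50_000, 1 / 10_000)))  # 30
```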

Errors in eBooks often reflect errors in their printed sources, and Project Gutenberg encourages fixing those errors.

EVOLUTION IN PROOFREADING: DISTRIBUTED PROOFREADERS

From 2002-2004 an important innovation was developed, in support of the creation of new Project Gutenberg eBooks. This was Distributed Proofreaders, an early example of what is now known as crowdsourcing. Through Distributed Proofreaders, volunteers engage in a portion of the eBook creation process — whether it is copyright clearances, proofreading (a page at a time!), or formatting, checking, and finalization before uploading. Those portions, when coordinated together, lead to the creation of new eBooks from printed sources.

Distributed Proofreaders has become the single largest source for new eBooks to the collection, accounting for approximately half of all titles. Distributed Proofreaders has also innovated substantially in the use of HTML+CSS (cascading style sheets) for very attractive presentation of eBooks in Web browsers.

SCANNING

By the early 1990s, scanning and optical character recognition (OCR) started to become widely available. Hart received a full scanning station via a grant from a computer manufacturer, which was used to produce several of the first 100 eBooks. The scanner was a flatbed model, which required the user to hold the book open, scan a page (or pair of pages) for ingest to the OCR software, then flip to the next page.

The OCR software would then automatically recognize the characters from the scan, and create an editable view of the text. Proofreading and formatting would then occur in the same way as for a typed text.

A few years later, Project Gutenberg worked with Distributed Proofreaders to acquire sheet-fed scanners. These scanners, which are still in operation, are faster. They also tend to produce an image that is properly aligned, versus the skewing that sometimes occurs with flatbed scanners. An important difference is that the printed books are damaged: prior to scanning, the spines of the books are cut off, in order for the individual pages to be ingested by the scanner.

Figure 3: Image from the Doré illustrations of Dante’s Inferno

It has been Project Gutenberg’s intention to make all the original images from the scanners available, alongside the finished eBook. This is to have a more complete record of the eBook’s source(s), and also to facilitate improvements by finding typos. Most eBook producers to date have chosen to not provide the scans, however.

Scanners are used for images within printed books, which are typically included as JPEG, GIF or PNG items within HTML and other formats. Inline images may be at a lower resolution, and then clickable to obtain higher resolution images. Color scanners are used, whenever possible, for color images.

Project Gutenberg has no prohibition against using items scanned by other parties. Several excellent sources of scans are freely available, including Google Books, Gallica, and The Internet Archive. Scans, and raw OCR output (if available), may then be transformed into Project Gutenberg eBooks by volunteers.

COPYRIGHT CLEARANCE OR PERMISSION

From approximately 1994-2004, procedures for digitization became more clearly articulated. This included the notion that a copyright “clearance” was the necessary first step for starting any new eBook for contribution to Project Gutenberg. The “copyright how-to” mentioned above was developed and refined, with guidance from a number of lawyers with expertise in US copyright law.

Project Gutenberg has always operated within the copyright laws of the US, and includes text in each eBook, and online at www.gutenberg.org, making it clear that readers in other countries must follow the laws that apply to them. Project Gutenberg affiliates, which operate completely independently, exist to emphasize the literary works and languages of different countries, and they follow the copyright laws of the country or region in which they operate.

Generally, copyright clearance is simple. Items published prior to 1923, anywhere in the world, are in the public domain in the US. Prior to 1993, all copyright clearance actions required mailing a photocopy of the title page and verso (reverse) page of a candidate book to Michael Hart or Greg Newby, but then an online system was developed that accepted scans of those pages. A database maintains records of cleared items, and who submitted them. A few other copyright rules are sometimes applied, for items published after 1923.

Sometimes, copyrighted items are submitted by authors. For many years, Project Gutenberg was one of few online repositories of user-contributed literary works, and therefore accepted items from contemporary authors. The two requirements for such content were:

1. A perpetual, worldwide, non-exclusive, irrevocable license be granted to Project Gutenberg, for unlimited redistribution of the item; and

2. The item must be made available as plain text, (valid) HTML, or both.

However, user-contributed content is generally no longer accepted for the main collection at www.gutenberg.org. Instead, a new self-publishing portal, operated by an affiliate, The World EBook Library, is available at self.gutenberg.org.

With the self-publishing portal, authors may use any license they wish (such as a Creative Commons license), and can provide items in PDF or other formats. This simplifies the process for the authors, and removes the need for Project Gutenberg’s volunteers to be involved with author-contributed content.

MULTIPLE SOURCES

Project Gutenberg encourages the use of multiple printed sources to create an eBook. For many historical works, including the US Declaration of Independence (the first Project Gutenberg eBook), there are variations among the printed sources. Another early example is the works of William Shakespeare. Project Gutenberg has several different versions of Shakespeare, including one based on the first edition folios. It has been typical, throughout the modern history of publishing, for different versions of a book to have variations.

In practice, the majority of Project Gutenberg eBooks rely on a single printed source. However, even those items might benefit from other sources — such as when some pages are missing, or illustrations come from a different version, or when typos/errata reports come from other sources.

It is a principle of Project Gutenberg that the eBooks in the collection are denoted as Project Gutenberg eBooks. Even if the publisher imprint and frontispiece from a printed work are included, there is no assurance that the content exactly matches that printed work. And, in fact, it will not match: minimally, the headers/footers will be removed, and paragraphs will flow together such that they span the pages of the printed source. Many other adjustments are typically made, as mentioned above.

For this reason, Project Gutenberg’s online catalog metadata does not include a citation to the source(s) used to create an eBook. Instead, Project Gutenberg should be cited as the publisher. For example, a bibliographic citation might have a form such as this:

Carroll, Lewis. “Alice’s Adventures in Wonderland.” Urbana, Illinois: Project Gutenberg. Available: www.gutenberg.org/ebooks/11

OTHER CONTENT TYPES

Project Gutenberg is, arguably, the oldest continuously operating online content project in the world. From 1971 until the mid-1990s, there were relatively few online resources for literary content. For this reason, and also due to a general willingness to experiment and reach out to broader audiences, Project Gutenberg has a great variety in the content types offered.

Among the first 100 items, there are mathematical constants and a musical performance. Government publications, notably the 1990 US Census and the CIA World Factbook from 1990 onward, were also included. The next few hundred items include movies, photographs of ancient cave paintings, and the first non-English items (Virgil’s Aeneid, Cicero’s Orations, and Caesar’s Commentaries, all in Latin).

Hundreds of audio eBooks are in the collection. Many were automatically generated via text-to-speech software. There are also a number of readings/performances by human readers, including from Project Gutenberg’s partner, LibriVox (www.librivox.org). Today, automated text-to-speech is accessible by most people with a computer or mobile phone, so there is less emphasis on that format. Human readings/performances continue to be of interest, especially when the performance, as well as the original Project Gutenberg source eBook, is granted to the public domain.

LANGUAGES OTHER THAN ENGLISH

Non-English languages have some additional characteristics that were not well-suited for the plain text ASCII of Project Gutenberg’s early days. By the early 1990s, it was necessary to display accented characters, to accommodate languages such as French and Spanish. Later, languages such as Chinese would require entirely separate character sets.

OCR software may be poorly suited for several non-English languages, or may fail due to older styles of typesetting (the old German “Fraktur” is notorious in this regard).

Also, it is necessary to have proofreaders who are fluent in the language, to assure the eBook is enjoyable and reasonably free of errors. Despite these challenges, nearly 20% of the collection is in a language other than English, with 65 separate languages or dialects other than English. This emphasis on language diversity continues today, and is limited only by the willingness of volunteers to submit copyright clearances and prepare items for distribution.

Table 1: Language counts as of August 1, 2016, for 52615 eBooks.

# of eBooks  Language code  Language or dialect
43095        en             English
2711         fr             French
1469         de             German
1421         fi             Finnish
739          nl             Dutch
678          it             Italian
540          pt             Portuguese
504          es             Spanish
427          zh             Chinese
219          el             Greek
128          sv             Swedish
112          hu             Hungarian
112          eo             Esperanto
102          la             Latin
66           da             Danish
60           tl             Tagalog
31           pl             Polish
31           ca             Catalan
22           ja             Japanese
17           no             Norwegian
11           cy             Welsh
10           cs             Czech
9            ru             Russian
7            is             Icelandic
7            fur            Friulian
6            te             Telugu
6            he             Hebrew
6            enm            Middle English
6            bg             Bulgarian
4            sr             Serbian
4            ang            Old English
4            af             Afrikaans
3            nai            North American Indian
3            nah            Nahuatl
3            ilo            Iloko
3            ceb            Cebuano
2            ro             Romanian
2            nav            Navajo
2            myn            Mayan Languages
2            mi             Maori
2            grc            Greek, Ancient
2            gla            Gaelic, Scottish
2            ga             Irish
2            fy             Frisian
2            arp            Arapaho
1            yi             Yiddish
1            sl             Slovenian
1            sa             Sanskrit
1            rmr            Calo
1            oji            Ojibwa
1            oc             Occitan
1            nap            Napoletano-Calabrese
1            lt             Lithuanian
1            ko             Korean
1            kld            Gamilaraay
1            kha            Khasi
1            iu             Inuktitut
1            ia             Interlingua
1            gl             Galician
1            fa             Farsi
1            et             Estonian
1            csb            Kashubian
1            br             Breton
1            bgi            Giangan
1            ar             Arabic
1            ale            Aleut
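The "nearly 20%" figure can be checked directly against the totals in Table 1. This short calculation uses only the two counts given in the table (all eBooks and English eBooks); the variable names are illustrative.

```python
total_ebooks = 52615    # all eBooks counted in Table 1
english_ebooks = 43095  # rows with language code "en"

non_english = total_ebooks - english_ebooks
share = non_english / total_ebooks

print(non_english)             # 9520
print(round(share * 100, 1))   # 18.1
```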

EVOLUTION OF MASTER SOURCE FORMATS

Plain text was the first master source type/format for Project Gutenberg, and remains important today. Plain text is readable on any device. Plain text is printable, and efficient to store (including for compression, or sharing by email). For decades, the International Organization for Standardization (ISO) has provided standard computerized encodings for the basic American Standard Code for Information Interchange (ASCII) and extensions for accents and other special characters (Latin1 or ISO 8859-1). Encodings exist for other languages, and Unicode (with 8- and 16-bit variations) provides encoding for larger groups of characters.
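The practical difference between these encodings shows up as soon as a text contains an accented character. The sketch below uses Python's built-in codecs to encode one short word; the helper name is hypothetical, and "utf-16-le" is chosen here only because it omits the byte order mark, which keeps the byte counts easy to compare.

```python
from typing import Optional


def encoded_length(text: str, encoding: str) -> Optional[int]:
    """Return the number of bytes needed, or None if the encoding
    cannot represent the text at all."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


word = "caf\u00e9"  # "café", with an accented final letter
for encoding in ("ascii", "latin-1", "utf-8", "utf-16-le"):
    print(encoding, encoded_length(word, encoding))
```

ASCII cannot represent the accented letter, Latin1 stores it in one byte, and the 8-bit Unicode variation needs two bytes for it while leaving the unaccented letters at one byte each.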

Within the first few hundred Project Gutenberg eBooks, some encoding was offered which seemed promising, but did not withstand the test of time. An early PostScript file was rendered unusable due to insertion of the Project Gutenberg standard header; a dictionary included markup that, today, might be reminiscent of XML or ReStructured Text, but without any sort of codebook for proper presentation; a few word processor native formats, including WordStar and WordPerfect, were used but are no longer readable with modern computers.

Even HTML (and other XML variants) was viewed with skepticism, since the longevity of formats is notoriously difficult to predict when they first become available.

For these reasons, Project Gutenberg still prefers to make plain text available for essentially every eBook. The only exceptions are those for which no plain text encoding is reasonable — such as Chinese, or mathematical texts, or music. In this way, the collection is “future proof,” so that even if all content cannot be fully represented as text, the files themselves will still be readable and enjoyable to read.

Figure 4: Typical text view, showing fixed-length lines and spacing among components.

A CONNECTICUT YANKEE IN KING ARTHUR’S COURT

by MARK TWAIN (Samuel L. Clemens)

PREFACE

The ungentle laws and customs touched upon in this tale are
historical, and the episodes which are used to illustrate them are
also historical. It is not pretended that these laws and customs
existed in England in the sixth century; no, it is only pretended
that inasmuch as they existed in the English and other civilizations
of far later times, it is safe to consider that it is no libel upon
the sixth century to suppose them to have been in practice in that
day also. One is quite justified in inferring that whatever one of
these laws or customs was lacking in that remote time, its place was
competently filled by a worse one.

Today, Project Gutenberg’s plain text offerings are most often derived automatically from another master format. The most common master format is HTML, which offers advantages of ubiquity and ease of authoring. LaTeX is also used as a master, mainly for mathematical texts. ReStructured Text (RST) was encouraged by Project Gutenberg, due to the ease of conversion to other formats. However, RST has not been widely adopted by eBook producers.

DERIVATIVE FORMATS

The ubiquity of reading devices — from mobile phones, to tablets, to electronic paper — was predicted by Project Gutenberg. Rather than creating separate master files for each native format for the devices, automatic conversion is applied to one of the master formats. For years, Java-format eBooks were automatically created, and these were usable on many mobile phones.

Today, EPUB and MOBI (also known as Kindle) formats are the most common. Free conversion software, called ebookmaker (previously called epubmaker), is used to create derivative formats. This helps to assure compatibility for different reader devices.

UPLOADING A NEW EBOOK

Volunteers upload the master format for their completed eBook to the Project Gutenberg server, where it undergoes automated and manual checks before the new eBook is posted and announced online. Prior to the upload, the copyright clearance must be completed.

Upon uploading, automated checks include:

HTML checks for validity of the HTML encoding (via the W3C validator);

HTML checks for internal link structure;

Spelling checks (English, with limited support for other languages);

Typo/scanno checks (seeking common scanner/OCR errors, such as “he” for “be” and vice-versa);

Conversion checks.
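To give a sense of what a typo/scanno check involves, here is a minimal sketch that flags words commonly confused by OCR so that a human can review them in context. It is not the checker Project Gutenberg runs: the function name is hypothetical, and the candidate list contains only the “he”/“be” pair mentioned above. Because both words are valid English, every occurrence is reported and the reviewer decides which ones are wrong.

```python
import re

# Word pairs that OCR commonly confuses; only the pair named in the text.
SCANNO_CANDIDATES = {"he", "be"}


def flag_scannos(text: str, candidates=SCANNO_CANDIDATES):
    """Return (line number, word) for each candidate word found."""
    flagged = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group().lower()
            if word in candidates:
                flagged.append((line_no, word))
    return flagged


sample = "It will he done.\nTo be sure."
for line_no, word in flag_scannos(sample):
    print(f"line {line_no}: check '{word}'")
```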

The conversion check consists of using the ebookmaker application to automatically generate derived formats. Ideally, resulting files will include:

Plain text in UTF8 encoding;

Automatically generated HTML (if HTML is not the master format);

EPUB and MOBI