Project Gutenberg (1971-2005)

Produced by Al Haines

PROJECT GUTENBERG (1971-2005)

MARIE LEBERT

NEF, University of Toronto, 2005

Copyright © 2005 Marie Lebert

Dated August 15, 2005, this long article (following a short version published in June 2004 [and copied at the end of this file]) is a paper for the third International Colloquium on ICT-enhanced French Studies: Dialogues across languages and cultures, October 2005, York University, Toronto, Canada. This article is dedicated to all Project Gutenberg and Distributed Proofreaders volunteers on the five continents, who offer us a free library of 16,000 high-quality eBooks, mainly classics of world literature, with a goal of one million eBooks in ten years.

With many thanks to Russon Wooldridge, who kindly edited this long article. The original version is available on the NEF, University of Toronto: http://www.etudes-francaises.net/dossiers/gutenberg_eng.htm

The French version is: Le Projet Gutenberg (1971-2005). The updated English version is: Project Gutenberg (1971-2008).

TABLE

1. Summary

2. History, From the Origins to Today

3. The Public Domain, an Endless Topic

4. The Method Adopted by Project Gutenberg

5. Distributed Proofreaders, to Handle Shared Proofreading

6. eBooks in More and More Languages

7. From the Past to the Future

8. Chronology [updated in 2006]

9. Links

10. Short Version [dated 2004]

1. SUMMARY

My fascination with Project Gutenberg is not new, and it has not waned. Nobody has done a better job of putting the world's literature at everyone's disposal, or of creating a vast network of volunteers all over the world without wasting people's skills or energy.

Here is the story in a few lines.

In July 1971, Michael Hart created Project Gutenberg with the goal of making literary works belonging to the public domain available electronically and for free. Its critics long considered the project impossible on a large scale. A pioneer site in a number of ways, Project Gutenberg was the first information provider on the internet and is the oldest digital library. Michael himself keyed in the first hundred books.

When the internet became popular, in the mid-1990s, the project got a boost and an international dimension. Michael still typed and scanned in books, but now coordinated the work of dozens and then hundreds of volunteers in many countries. The number of electronic books rose from 1,000 (in August 1997) to 2,000 (in May 1999), 3,000 (in December 2000) and 4,000 (in October 2001).

30 years after its birth, Project Gutenberg is running at full capacity. It had 5,000 books online in April 2002, 10,000 books online in October 2003, and 15,000 books online in January 2005, with 400 new books available per month, 40 mirror sites in a number of countries, and books downloaded by the tens of thousands every day.

Whether they were digitized 20 years ago or they are digitized now, all the books are captured in Plain Vanilla ASCII (the original 7-bit ASCII), with the same formatting rules, so they can be read easily by any machine, operating system or software, including on a PDA or an eBook reader. Any individual or organization is free to convert them to different formats, without any restriction except respect for copyright laws in the country involved.

In January 2004, Project Gutenberg had spread across the Atlantic with the creation of Project Gutenberg Europe. On top of its original mission, it also became a bridge between languages and cultures, with a goal of one million eBooks in 2015, and a number of national and linguistic sections. While adhering to the same principle: books for all and for free, through electronic versions that can be used and reproduced indefinitely. And, as a second step, the digitization of images and sound, in the same spirit.

2. HISTORY, FROM THE ORIGINS TO TODAY

= The Beginnings in 1971

Let us get back to the beginnings of the project. When he was a student at the University of Illinois (USA), Michael Hart was given $100,000,000 of computer time at the Materials Research Lab of his university. On July 4, 1971, Independence Day, Michael keyed The United States Declaration of Independence (signed on July 4, 1776) into the mainframe he was using. It was in upper case, because there was no lower case yet. But sending a 5 K file to the 100 users of the embryonic internet would have crashed the network. So Michael mentioned where the eText was stored (though without a hypertext link, because the web was still 20 years away). It was downloaded by six users. Project Gutenberg was born.

Michael decided to use this huge amount of computer time to search for the public domain books that were stored in our libraries, and to digitize these books. He also decided to store the electronic texts (eTexts) in the simplest way, using the plain text format called Plain Vanilla ASCII, so they could be read easily by any machine, operating system or software. A book would become a continuous text file instead of a set of pages, with capital letters replacing the italic, bold or underlined terms of the print version.

Soon afterwards he defined Project Gutenberg's mission: to put at everyone's disposal, in electronic versions, as many literary works of the public domain as possible for free. As he stated years later, in August 1998, "We consider eText to be a new medium, with no real relationship to paper, other than presenting the same material, but I don't see how paper can possibly compete once people each find their own comfortable way to eTexts, especially in schools."

= Persevering from 1972 to 1989

After he keyed in The United States Declaration of Independence in 1971, Michael went on in 1972 to type in a longer text, The United States Bill of Rights, which includes the first ten amendments added in 1789 to the Constitution (dated 1787) and defines the individual rights of the citizens and the distinct powers of the Federal Government and the States. In 1973, Michael typed in the full text of The United States Constitution.

From one year to the next, disk space was getting larger, by the standards of the time (there was no hard disk yet), so it was possible to plan bigger files. Michael began typing in the Bible, because the individual books of the Bible could be processed separately as different files. He also worked on the collected works of Shakespeare, one play at a time, with a file for each play. That edition of Shakespeare was never released, due to copyright changes. Although Shakespeare's works belong to the public domain, the comments and notes may be copyrighted, depending on the publication date. But other editions belonging to the public domain were posted a few years later.

In parallel, the internet, which was still embryonic in 1971, was born in 1974 with the launching of TCP/IP (Transmission Control Protocol / Internet Protocol). Its rapid expansion started in 1983.

In August 1989, Project Gutenberg celebrated the completion of its 10th eText, The King James Bible.

= 10 to 1,000 eBooks from 1990 to 1996

In 1990, there were 250,000 internet users, and the standard was 360 K disks. In January 1991, Michael keyed in Alice's Adventures in Wonderland, by Lewis Carroll (published in 1865). In July 1991, he typed in Peter Pan, by James M. Barrie (published in 1904). These two worldwide classics of childhood literature each fitted on one disk.

1991 was also the year the web became operational. The first browser, Mosaic, was released in November 1993. As the web was becoming a popular medium, it became easier to circulate eTexts and recruit volunteers. Project Gutenberg gradually got into its stride, with the digitization of one eText per month in 1991, two eTexts per month in 1992, four eTexts per month in 1993 and eight eTexts per month in 1994. In January 1994, Project Gutenberg celebrated its 100th eText by releasing The Complete Works of William Shakespeare. The steady growth went on, with an average of 8 eTexts per month in 1994, 16 eTexts per month in 1995, and 32 eTexts per month in 1996.

As we can see, from 1991 to 1996, the "output" doubled every year. While continuing to digitize books, Michael was also coordinating the work of dozens of volunteers. At the end of 1993, Project Gutenberg's eTexts were organized into three main sections: a) "Light Literature", such as Alice's Adventures in Wonderland, Peter Pan or Aesop's Fables; b) "Heavy Literature", such as the Bible, Shakespeare's works or Moby Dick; c) "Reference Literature", such as Roget's Thesaurus, and a set of encyclopaedias and dictionaries.

Project Gutenberg's goal is to be "universal" both for the literary works that are chosen and the audience who reads them. The goal is to put literature at everyone's disposal. With a focus on books that many people would use frequently, and not only students and teachers. For example, the "Light Literature" section is intended for pre-schoolers as well as their grandparents. The aim is that they will want to look up the eText of Peter Pan when they come back from watching Hook at the movies. Or that they will read the eText of Alice's Adventures in Wonderland after seeing it on TV. Or that they will look for the context of a quotation after hearing it in one of the Star Trek episodes; nearly every episode of Star Trek quotes from books which are in the Project Gutenberg collections.

The idea is that, whether they were avid readers of print books or not in the past, people should easily be able to look up quotations they hear in conversations, movies and music, or read in books, newspapers and magazines, within a library containing all these quotations in an easy-to-use format. eTexts don't take up much space in ASCII format. They can be easily downloaded with a standard phone line. Searching for a word or a phrase is simple too. People can easily search an entire eText by using the plain "search" menu available in any program.

= 1,000 eBooks in August 1997

In 1997, the "output" was still an average of 32 eTexts per month. In June 1997, Project Gutenberg released The Merry Adventures of Robin Hood, by Howard Pyle (published in 1883). In August 1997, it released its 1000th eText, La Divina Commedia di Dante (published in 1321), in Italian, its original language.

In August 1998, Michael wrote: "My own personal goal is to put 10,000 eTexts on the Net [editor's note: his goal was reached in October 2003] and if I can get some major support, I would like to expand that to 1,000,000 and to also expand our potential audience for the average eText from 1.x% of the world population to over 10%, thus changing our goal from giving away 1,000,000,000,000 eTexts to 1,000 times as many, a trillion and a quadrillion in US terminology."

= 1,000 to 5,000 eBooks from 1998 to 2002

From 1998 to 2000, there was a steady average of 36 new eTexts per month. In May 1999, there were 2,000 eTexts. The 2000th eText was Don Quijote, by Cervantes (published in 1605), in Spanish, its original language.

Around 40 eTexts per month were released during the first half of 2001, and 50 per month during the second half. Released in December 2000, the 3000th eText was the third volume of A l'ombre des jeunes filles en fleurs (In the Shadow of Young Girls in Flower), by Marcel Proust (published in 1919), in French, its original language.

Released in October 2001, the 4000th eText was The French Immortals Series, in English. Published in 1905 by Maison Mazarin, Paris, this book is an anthology of short fictions by authors belonging to the renowned French Academy (Académie française), notably Emile Souvestre, Pierre Loti, Hector Malot, Charles de Bernard and Alphonse Daudet.

Available in April 2002, the 5000th eText was The Notebooks of Leonardo da Vinci, which he wrote at the beginning of the 16th century. A text that is still in the Top 100 of downloaded texts in 2005.

In 1988, Michael Hart chose to digitize Alice's Adventures in Wonderland and Peter Pan because they each fitted on one 360 K disk, the standard of the time. Fifteen years later, in 2002, 1.44 M is the standard disk and ZIP is the standard compression. The practical file size is about 3 million characters, more than long enough for the average book. The digitized ASCII version of a 300-page novel is 1 M. A bulky book can fit in two ASCII files, that can be downloaded as is or in ZIP format.

An average of 50 hours is necessary to get an eText selected, copyright-cleared, scanned, proofread, formatted and assembled.

A few numbers are reserved for "special" books. For example, eText number 1984 is reserved for George Orwell's classic, published in 1949, and still a long way from falling into the public domain.

In 2002, around 100 eTexts were released per month. In Spring 2002, Project Gutenberg's eTexts represented 1/4 of all the public domain works freely available on the web and listed nearly exhaustively by The Internet Public Library (IPL). An impressive result thanks to the relentless work of 1,000 volunteers in several countries.

= 10,000 eBooks in October 2003

1,000 eTexts in August 1997, 2,000 eTexts in May 1999, 3,000 eTexts in December 2000, 4,000 eTexts in October 2001, 5,000 eTexts in April 2002, 10,000 eTexts in October 2003. eText number 10000 is The Magna Carta, the first English constitutional text, signed at the beginning of the 13th century.

From April 2002 to October 2003, in 18 months, the number of eTexts doubled, going from 5,000 to 10,000, with a monthly average of 300 new digitized books. In December 2003, most of the titles (9,400 eBooks) were also burned on a DVD to celebrate the landmark of 10,000 eTexts, renamed as eBooks, according to the latest terminology in the field. A few months before, in August 2003, a "Best of Gutenberg" CD was made available containing 600 eBooks (as a follow-up to other CDs in the past). People could request the CD and DVD for free, and were then encouraged to make copies for a friend, a library or a school. (In 2005, CD and DVD files are also periodically generated as ISO files. When downloaded, they can be used to make a CD or DVD using a CD or DVD writer.)

10,000 eBooks. An impressive number if we think about all the scanned and proofread pages this number represents. A fast growth thanks to Distributed Proofreaders, a website designed in 2000 by Charles Franks to share the proofreading of eBooks between many volunteers. Volunteers choose one of the eBooks listed on the site and proofread a given page. They don't have any quota to fulfill, but it is recommended they do a page per day if possible. It doesn't seem much, but with hundreds of volunteers it really adds up.

In December 2003, there were 11,000 eBooks digitized in several formats, most of them in ASCII, and some of them in HTML or XML. This represented 46,000 files, and 110 gigabytes. On 13 February 2004, the day of Michael Hart's presentation at UNESCO, in Paris (see below), there were exactly 11,340 eBooks in 25 languages. In May 2004, the 12,581 eBooks represented 100,000 files in 20 different formats, and 135 gigabytes. With 400 new eBooks added per month (and more in the years to come), the number of gigabytes is expected to double every year.

= 15,000 eBooks in January 2005

In January 2005, Project Gutenberg had 15,000 eBooks. eBook number 15000 is The Life of Reason, by George Santayana (published in 1906). On June 16, 2005 there were 16,481 eBooks in 42 languages. On August 3, 2005, besides English (14,590 eBooks), the six main languages were French (578 eBooks), German (349 eBooks), Finnish (225 eBooks), Dutch (130 eBooks), Spanish (105 eBooks) and Chinese (69 eBooks).

Michael hopes to reach 1,000,000 eBooks by 2015. Each email he sends includes the current number, and the next significant goal to reach. As of July 2005, the next goal is 20,000 eBooks. This goal should be reached in July 2006, for the 35th anniversary of Project Gutenberg.

Conceived in January 2004, at the same time as the launching of Distributed Proofreaders Europe (DP Europe) by Project Rastko, Project Gutenberg Europe went online in June 2005 and released the 100 first eBooks processed by DP Europe over the past several months. These eBooks are in several languages, a reflection of European linguistic diversity. 100 languages are planned for the long term.

In July 2005, Project Gutenberg of Australia (launched in 2001) reached 500 eBooks, and Project Gutenberg of Canada took its first steps (see the PGCanada List). Project Gutenberg Portugal and Project Gutenberg Philippines will be next. (For the latest news, check the News and Events of Project Gutenberg.)

3. THE PUBLIC DOMAIN, AN ENDLESS TOPIC

Despite the enthusiasm and the persistence of its hundreds of volunteers, the task of Project Gutenberg isn't made any easier by the increasing restrictions to the public domain. As stated in the FAQ, "the public domain is the set of cultural works that are free of copyright, and belong to everyone equally." In former times, 50% of works belonged to the public domain, and could be freely used by everybody. Nowadays, 99% of works are governed by copyright, and some people would like this percentage to reach 100%.

In the Copyright HowTo section, Project Gutenberg presents its own rules for confirming the public domain status of eBooks according to US copyright laws. Here is a summary. Works published before 1923 entered the public domain no later than 75 years from the copyright date. (All these works are now in the public domain.) Works published between 1923 and 1977 retain copyright for 95 years. (No such works will enter the public domain until 2019.) Works created from 1978 on enter the public domain 70 years after the death of the author if the author is a natural person. (Nothing will enter the public domain until 2049.) Works created from 1978 on enter the public domain 95 years after publication (or 120 years after creation) if the author is a corporate one. (Nothing will enter the public domain until 2074.) Other rules apply too.
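The four rules summarized above can be sketched as a small calculation. This is only an illustration of the summary given here, not Project Gutenberg's actual clearance procedure and not legal advice. The function name and its parameters are hypothetical, the publication year is used as a stand-in for the creation year, and the sketch assumes that a term ends on December 31, so that a work becomes free on January 1 of the following year (which matches the 2019, 2049 and 2074 dates quoted above). For corporate works it assumes that the shorter of the two terms applies. The "other rules" mentioned above, such as non-renewed copyrights, are ignored.

```python
from typing import Optional


def first_public_domain_year(published: int,
                             author_death: Optional[int] = None,
                             corporate: bool = False,
                             created: Optional[int] = None) -> Optional[int]:
    """Return the first calendar year in which a work is free of US copyright,
    following only the four rules summarized in the text.

    Returns None when the year cannot be computed yet (natural author
    with no known year of death).
    """
    if published < 1923:
        # Upper bound: "no later than 75 years from the copyright date".
        return published + 75 + 1
    if published <= 1977:
        # 95-year term for works published between 1923 and 1977.
        return published + 95 + 1
    if corporate:
        # 95 years after publication, or 120 years after creation.
        # Assumption: the term that ends first is the one that applies.
        expiry = published + 95
        if created is not None:
            expiry = min(expiry, created + 120)
        return expiry + 1
    if author_death is None:
        return None
    # Natural person: 70 years after the death of the author.
    return author_death + 70 + 1


print(first_public_domain_year(1923))                     # 2019
print(first_public_domain_year(1978, author_death=1978))  # 2049
print(first_public_domain_year(1978, corporate=True))     # 2074
```

The three printed values reproduce the earliest dates given in the summary: 2019 for a work published in 1923, 2049 for a 1978 work whose author died the same year, and 2074 for a corporate work published in 1978.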

Much more restrictive than the previous one, the current legislation became effective after the promulgation of amendments to the 1976 Copyright Act, dated October 27th, 1998. As explained by Michael Hart in July 1999: "Nothing will expire for another 20 years. We used to have to wait 75 years. Now it is 95 years. And it was 28 years (+ a possible 28 year extension, only on request) before that, and 14 years (+ a possible 14 year extension) before that. So, as you can see, this is a serious degrading of the public domain, as a matter of continuing policy."

The dates mentioned by Michael are: a) 1790, date of the stranglehold of the Stationers' Guild (the publishers of the time) on the Gutenberg printing press (hence the 14-year copyright); b) 1909, date of the copyright reinforcement to counter the re-publishing of large collections of the public domain by reprint houses using steam and electric presses (hence the 28-year copyright); c) 1976, date of a new tightening of the copyright following the introduction of the Xerox photocopying machine (hence the 50-year copyright after the author's life); d) 1998, date of a further tightening of the copyright following the development of the internet (hence the 70-year copyright after the author's life). These are only the main lines. The Copyright Act has been amended 11 times in the last 40 years.

As stated by Tom W. Bell in Trend of Maximum U.S. General Copyright Term (with a very useful chart): "The first federal copyright legislation, the 1790 Copyright Act, set the maximum term at fourteen years plus a renewal term of fourteen years. The 1831 Copyright Act doubled the initial term and retained the conditional renewal term, allowing a total of up to forty-two years of protection. Lawmakers doubled the renewal term in 1909, letting copyrights run for up to fifty-six years. The interim renewal acts of 1962 through 1974 ensured that the copyright in any work in its second term as of September 19, 1962, would not expire before Dec. 31, 1976. The 1976 Copyright Act changed the measure of the default copyright term to life of the author plus fifty years. Recent amendments to the Copyright Act [the ones in 1998] expanded the term yet again, letting it run for the life of the author plus seventy years."

The amendments of the Copyright Act, dated October 27, 1998, were a major blow for digital libraries and deeply shocked their founders, beginning with Michael Hart and John Mark Ockerbloom, founder of The Online Books Page. But how were they to measure up to the major publishing companies? Michael wrote in July 1999: "No one has said more against copyright extensions than I have, but Hollywood and the big publishers have seen to it that our Congress won't even mention it in public. The kind of copyright debate going on is totally impractical. It is run by and for the 'Landed Gentry of the Information Age.' 'Information Age'? For whom?"

True enough. The political authorities continually speak about an information age while tightening the laws relating to the dissemination of information. The contradiction is obvious. This problem has also affected Australia (forcing Project Gutenberg of Australia to withdraw dozens of books from its collections) and several European countries. In a number of countries, the rule is now life of the author plus 70 years, instead of life plus 50 years, following pressure from content owners, with the subsequent "harmonization" of national copyright laws as a response to the "globalization of the market". (The Online Books Page gives a summary of the various copyright regimes, with a number of useful links.)

Now, from the volunteer point of view, the wisest thing to do is to choose a book published before 1923. It is also required that copyright clearance be confirmed prior to working on any eBook by sending a photocopy of the title page and verso page (even if the latter is blank) to Michael. The pages should be sent as scans to be uploaded on the website. For people who cannot create scans, it is possible to send photocopies by postal mail. The pages will then be filed, either on paper or electronically, so that the proof will be available in the future, to demonstrate if necessary that the book is in the public domain under the US law. Project Gutenberg doesn't release any eBook until the book's copyright status has been confirmed.

There is nevertheless hope for some books published after 1923. According to Greg Newby, director of PGLAF (Project Gutenberg Literary Archive Foundation), one million books published between 1923 and 1964 could also belong to the public domain, because only 10% of copyrights were actually renewed. Project Gutenberg tries to locate these books. In April 2004, with the help of hundreds of volunteers at Distributed Proofreaders, all Copyright Renewal records were posted for books from 1950 through 1977. So, if a given book published during this period is not on the list, it means the copyright was not renewed, and the book fell into the public domain.

4. THE METHOD ADOPTED BY PROJECT GUTENBERG

Whether digitized years ago or now, all the books are digitized in 7-bit plain ASCII (American Standard Code for Information Interchange), called Plain Vanilla ASCII. Used since the beginnings of computing, it is the set of unaccented characters present on a standard English-language keyboard (A-Z, a-z, numbers, punctuation and other basic symbols). When 8-bit ASCII (also called ISO-8859 or ISO-Latin) is used for books in languages with accented characters, like French or German, Project Gutenberg also produces a 7-bit ASCII version with the accents stripped. (This doesn't apply to languages that are not "convertible" to ASCII, like Chinese, encoded in Big-5.)
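As an illustration of what "accents stripped" means in practice, here is a minimal sketch using Python's standard unicodedata module. It is not the tool Project Gutenberg volunteers actually use, and the function name is hypothetical. It decomposes each accented letter into a base letter plus a combining mark, drops the marks, and replaces anything that still falls outside 7-bit ASCII with a question mark.

```python
import unicodedata


def to_plain_vanilla_ascii(text: str) -> str:
    """Return a 7-bit ASCII approximation of text with the accents stripped."""
    # NFKD splits an accented letter into its base letter followed by
    # one or more combining marks (for example "é" becomes "e" + U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter (such as the German "ß")
    # survive the step above, so they are replaced by "?" here.
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(to_plain_vanilla_ascii("Académie française"))  # Academie francaise
print(to_plain_vanilla_ascii("Straße"))              # Stra?e
```

The second example shows the limit of this approach: letters that are not simply a base letter plus an accent need a hand-written transliteration (for instance "ss" for "ß"), which this sketch does not attempt.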