Chapter 1
Produced by Al Haines
PROJECT GUTENBERG (1971-2008)
MARIE LEBERT
NEF, University of Toronto & Project Gutenberg, 2008
Copyright © 2008 Marie Lebert
This long article is dated May 2008. With many thanks to the great people who helped me, especially Michael Hart, founder of Project Gutenberg, and Russon Wooldridge, founder of NEF. All the mistakes are mine - my mother tongue is not English, but French. This article is also available in French: Le Projet Gutenberg (1971-2008).
TABLE
1. Overview
2. A Bet Since 1971
3. The Method
4. Shared Proofreading
5. Becoming Multilingual
6. Public Domain vs. Copyright
7. From the Past to the Future
8. Chronology
9. Stats
10. Links
1. OVERVIEW
August 1997: 1,000 books; April 2002: 5,000 books; October 2003: 10,000 books; January 2005: 15,000 books; December 2006: 20,000 books; April 2008: 25,000 books.
In July 1971, Michael Hart created Project Gutenberg with the goal of making available for free, and electronically, literary works belonging to the public domain. A pioneer site in a number of ways, Project Gutenberg was the first information provider on the internet and is the oldest digital library. When the internet became popular, in the mid-1990s, the project got a boost and an international dimension. The number of electronic books rose from 1,000 (in August 1997) to 5,000 (in April 2002), 10,000 (in October 2003), 15,000 (in January 2005), 20,000 (in December 2006) and 25,000 (in April 2008), with a current production rate of around 340 new books each month. With 55 languages and 40 mirror sites around the world, books are being downloaded by the tens of thousands every day. Project Gutenberg promotes digitization in "text format", meaning that a book can be copied, indexed, searched, analyzed and compared with other books. Unlike other formats, the files are accessible for low-bandwidth use. The main source of new Project Gutenberg eBooks is Distributed Proofreaders, launched in October 2000 by Charles Franks to help in the digitizing of books from the public domain.
2. A BET SINCE 1971
= In a Few Words
If the print book is five and a half centuries old, the electronic book is only 37 years old. It was born with Project Gutenberg, created by Michael Hart in July 1971 to make available for free electronic versions of literary books belonging to the public domain. A pioneer site in a number of ways, Project Gutenberg was the first information provider on an embryonic internet and is the oldest digital library. Long considered by its critics as impossible on a large scale, Project Gutenberg counted 25,000 books in April 2008, with tens of thousands of downloads daily. To this day, nobody has done a better job of putting the world's literature at everyone's disposal, or of creating a vast network of volunteers all over the world without wasting people's skills or energy.
During the first twenty years, Michael Hart himself keyed in the first hundred books, with the occasional help of others. When the internet became popular, in the mid-1990s, the project got a boost and an international dimension. Michael still typed and scanned in books, but now coordinated the work of dozens and then hundreds of volunteers in many countries. The number of electronic books rose from 1,000 (in August 1997) to 2,000 (in May 1999), 3,000 (in December 2000) and 4,000 (in October 2001).
37 years after its birth, Project Gutenberg is running at full capacity. It had 5,000 books online in April 2002, 10,000 books in October 2003, 15,000 books in January 2005, 20,000 books in December 2006 and 25,000 books in April 2008, with 340 new books available per month, 40 mirror sites in a number of countries, books downloaded by the tens of thousands every day, and tens of thousands of volunteers in various teams.
Whether they were digitized 30 years ago or they are digitized now, all the books are captured in Plain Vanilla ASCII (the original 7-bit ASCII), with the same formatting rules, so they can be read easily by any machine, operating system or software, including on a PDA, a cell phone or an eBook reader. Any individual or organization is free to convert them to different formats, without any restriction except respect for copyright laws in the country involved.
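As a rough illustration of what "Plain Vanilla ASCII" means in practice, the following Python sketch checks whether a piece of text stays within the 7-bit range (byte values 0 to 127). The function name is_plain_vanilla_ascii and the sample strings are assumptions made for this example; they are not part of Project Gutenberg's own tools.

```python
def is_plain_vanilla_ascii(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (values 0 to 127)."""
    return all(byte < 128 for byte in data)


# Hypothetical samples: the second one contains accented characters.
plain = "WHEN IN THE COURSE OF HUMAN EVENTS".encode("utf-8")
accented = "Le Projet Gutenberg a été créé en 1971".encode("utf-8")

print(is_plain_vanilla_ascii(plain))     # True
print(is_plain_vanilla_ascii(accented))  # False
```

A file that passes this check contains only characters that any machine, operating system or software can be expected to display.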
In January 2004, Project Gutenberg spread across the Atlantic with the creation of Project Gutenberg Europe. On top of its original mission, it also became a bridge between languages and cultures, with a number of national and linguistic sections, while adhering to the same principle: books for all and for free, through electronic versions that can be used and reproduced indefinitely. The second step is the digitization of images and sound, in the same spirit.
= Beginning and Persevering
Let us get back to the beginnings of the project. When he was a student at the University of Illinois (USA), Michael Hart was given $100,000,000 of computer time at the Materials Research Lab of his university. On July 4, 1971, Independence Day, Michael keyed in The United States Declaration of Independence (signed on July 4, 1776) on the mainframe he was using, in upper case, because there was no lower case yet. But sending a 5 K file to the 100 users of the embryonic internet would have crashed the network. So Michael mentioned where the eText was stored (though without a hypertext link, because the web was still 20 years away). It was downloaded by six users. Project Gutenberg was born.
Michael decided to use this huge amount of computer time to search for the public domain books that were stored in our libraries, and to digitize these books. He also decided to store the electronic texts (eTexts) in the simplest way, using the plain text format called Plain Vanilla ASCII, so they could be read easily by any machine, operating system or software. A book would become a continuous text file instead of a set of pages, with caps for the terms that were in italic, bold or underlined in the print version.
Soon afterwards he defined Project Gutenberg's mission: to put at everyone's disposal, in electronic versions, as many literary works of the public domain as possible for free. As he stated years later, in August 1998, "We consider eText to be a new medium, with no real relationship to paper, other than presenting the same material, but I don't see how paper can possibly compete once people each find their own comfortable way to eTexts, especially in schools."
After he keyed in The United States Declaration of Independence in 1971, Michael went on in 1972 to type in a longer text, The United States Bill of Rights, which includes the first ten amendments added in 1789 to the Constitution (dated 1787) and defines the individual rights of the citizens and the distinct powers of the Federal Government and the States. In 1973, Michael typed in the full text of The United States Constitution.
From one year to the next, disk space was getting larger, by the standards of the time (there was no hard disk yet), so it was possible to plan bigger files. Michael began typing in the Bible, because the individual books of the Bible could be processed separately as different files. He also worked on the collected works of Shakespeare, one play at a time, with a file for each play. That edition of Shakespeare was never released, due to copyright changes. While Shakespeare's works belong to the public domain, the comments and notes may be copyrighted, depending on the publication date. But other editions belonging to the public domain were posted a few years later.
In parallel, the internet, which was still embryonic in 1971, was born in 1974 with the creation of TCP/IP (Transmission Control Protocol / Internet Protocol) by Vinton Cerf and Bob Kahn. Its rapid expansion started in 1983.
= 10 to 1,000 Books
In August 1989, Project Gutenberg completed its 10th book, The King James Bible, which was first published in 1611, with the standard text dated 1769. In 1990, there were 250,000 internet users, and the standard was 360 K disks. In January 1991, Michael typed in Alice's Adventures in Wonderland, by Lewis Carroll (published in 1865). In July 1991, he typed in Peter Pan, by James M. Barrie (published in 1904). These two worldwide classics of children's literature each fitted on one disk.
1991 was also the year the web became operational. Mosaic, the first browser to reach a wide audience, was released in November 1993. As the web was becoming a popular medium, it became easier to circulate eTexts and recruit volunteers. Project Gutenberg gradually got into its stride, with the digitization of one book per month in 1991, two books per month in 1992, four books per month in 1993 and eight books per month in 1994. In January 1994, Project Gutenberg celebrated its 100th book by releasing The Complete Works of William Shakespeare. Shakespeare wrote most of his work between 1590 and 1613. The steady growth went on, with an average of 16 books per month in 1995 and 32 books per month in 1996.
As we can see, from 1991 to 1996, the "output" doubled every year. While continuing to digitize books, Michael was also coordinating the work of dozens of volunteers. At the end of 1993, Project Gutenberg's eTexts were organized into three main sections: a) "Light Literature", such as Alice's Adventures in Wonderland, Peter Pan or Aesop's Fables; b) "Heavy Literature", such as the Bible, Shakespeare's works or Moby Dick; c) "Reference Literature", such as Roget's Thesaurus, and a set of encyclopaedias and dictionaries. This organization in three sections was abandoned later for a more detailed classification.
Project Gutenberg's goal is to be "universal" both for the literary works that are chosen and the audience who reads them. The goal is to put literature at everyone's disposal, with a focus on books that many people would use frequently, and not only students and teachers. For example, the "Light Literature" section is intended for pre-schoolers as well as their grandparents. The aim is that they will want to look up the eText of Peter Pan when they come back from watching Hook at the movies. Or that they will read the eText of Alice's Adventures in Wonderland after seeing it on TV. Or that they will look for the context of a quotation after hearing it in one of the Star Trek episodes; nearly every episode of Star Trek quotes from books which are in the Project Gutenberg collections.
The idea is that, whether they were avid readers of print books or not in the past, people should easily be able to look up quotations they hear in conversations, movies and music, or read in books, newspapers and magazines, within a library containing all these quotations in an easy-to-use format. eTexts don't take up much space in ASCII format. They can be easily downloaded with a standard phone line. Searching for a word or a phrase is simple too. People can easily search an entire eText by using the plain "search" menu available in any program.
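Because an eText is a plain text file, the same kind of search can also be done with a few lines of code. The sketch below is only an illustration: the function name find_phrase and the short excerpt standing in for a downloaded eText are assumptions made for this example.

```python
def find_phrase(text: str, phrase: str) -> list[int]:
    """Return the 1-based numbers of the lines containing the phrase.

    The comparison ignores upper and lower case.
    """
    needle = phrase.lower()
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


# Hypothetical three-line excerpt standing in for a downloaded eText.
sample = (
    "Alice was beginning to get very tired\n"
    "of sitting by her sister on the bank,\n"
    "and of having nothing to do."
)

print(find_phrase(sample, "her sister"))  # [2]
print(find_phrase(sample, "ALICE"))       # [1]
```

This simple version works line by line, so a quotation that runs across a line break would not be found; joining the lines before searching would be one way around that.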
In 1997, the "output" was still an average of 32 books per month. In June 1997, Project Gutenberg released The Merry Adventures of Robin Hood, by Howard Pyle (published in 1883). In August 1997, it released its 1000th book, La Divina Commedia di Dante (published in 1321), in Italian, its original language.
In August 1998, Michael wrote: "My own personal goal is to put 10,000 eTexts on the Net [editor's note: his goal was reached in October 2003] and if I can get some major support, I would like to expand that to 1,000,000 and to also expand our potential audience for the average eText from 1.x% of the world population to over 10%, thus changing our goal from giving away 1,000,000,000,000 eTexts to 1,000 times as many, a trillion and a quadrillion in US terminology."
= 1,000 to 10,000 Books
From 1998 to 2000, there was a steady average of 36 new books per month. In May 1999, there were 2,000 books. The 2000th book was Don Quijote, by Cervantes (published in 1605), in Spanish, its original language.
Released in December 2000, the 3000th book was the third volume of A l'ombre des jeunes filles en fleurs (In the Shadow of Young Girls in Flower), by Marcel Proust (published in 1919), in French, its original language. Around 104 books per month were released in 2001.
Released in October 2001, the 4000th book was The French Immortals Series, in English. Published in 1905 by Maison Mazarin, Paris, this book is an anthology of short fictions by authors belonging to the renowned French Academy (Académie française), notably Emile Souvestre, Pierre Loti, Hector Malot, Charles de Bernard and Alphonse Daudet.
Available in April 2002, the 5000th book was The Notebooks of Leonardo da Vinci, which he wrote at the beginning of the 16th century. This text is steadily in the Top 100 of downloaded texts.
In 1988, Michael Hart chose to digitize Alice's Adventures in Wonderland and Peter Pan because they each fitted on one 360 K disk, the standard of the time. Fifteen years later, in 2002, 1.44 M was the standard disk and ZIP was the standard compression. The practical file size was about 3 million characters, more than long enough for the average book. The digitized ASCII version of a 300-page novel is 1 M. A bulky book can fit in two ASCII files, which can be downloaded as is or in ZIP format.
An average of 50 hours is necessary to get an eText selected, copyright-cleared, scanned, proofread, formatted and assembled.
A few numbers are reserved for "special" books. For example, eBook number 1984 is reserved for George Orwell's classic, published in 1949, and still a long way from falling into the public domain.
In 2002, around 100 books were released per month. In Spring 2002, Project Gutenberg's books represented 1/4 of all the public domain works freely available on the web and listed nearly exhaustively by the Internet Public Library (IPL). An impressive result thanks to the relentless work of thousands of volunteers in several countries.
1,000 books in August 1997, 2,000 books in May 1999, 3,000 books in December 2000, 4,000 books in October 2001, 5,000 books in April 2002, 10,000 books in October 2003. eBook number 10000 is The Magna Carta, the first English constitutional text, signed in 1215. From April 2002 to October 2003, in 18 months, the number of books doubled, going from 5,000 to 10,000, with a monthly average of 300 new digitized books.
10,000 books. An impressive number if we think about all the scanned and proofread pages this number represents. A fast growth thanks to Distributed Proofreaders, a website launched in October 2000 by Charles Franks to share the proofreading of books between many volunteers. Volunteers choose one of the books listed on the site and proofread a given page. They don't have any quota to fulfill, but it is recommended they do a page per day if possible. It doesn't seem much, but with hundreds of volunteers it really adds up.
Books are also copied on CDs and DVDs. Blank CDs and DVDs cost next to nothing, as does their burning on a CD or DVD writer. Project Gutenberg sends a free CD or DVD to anyone who asks for it, and people are encouraged to make copies for a friend, a library or a school. Released in August 2003, the "Best of Gutenberg" CD contained over 600 books (as a follow-up to other CDs in the past). The first Project Gutenberg DVD was released in December 2003 to celebrate the landmark of 10,000 books, with most of the existing titles (9,400 books).
= 10,000 to 20,000 Books
In December 2003, there were 11,000 books digitized in several formats, most of them in ASCII, and some of them in HTML or XML. This represented 46,000 files, and 110 gigabytes. On 13 February 2004, the day of Michael Hart's presentation at UNESCO, in Paris, there were exactly 11,340 books in 25 languages. In May 2004, the 12,581 books represented 100,000 files in 20 different formats, and 135 gigabytes. With more than 300 new books added per month (338 books in 2004), the number of gigabytes is expected to double every year.
The Project Gutenberg Consortia Center (PGCC) was officially affiliated to Project Gutenberg in 2003. Since 1997, PGCC had been working on gathering collections of existing eBooks, as a complement to Project Gutenberg which was focusing on the production of eBooks.
In December 2003, Distributed Proofreaders Europe (DP Europe) was launched by Project Rastko, followed by Project Gutenberg Europe (PG Europe) in January 2004. Project Gutenberg Europe celebrated its first 100 books in June 2005. These books were in several languages, a reflection of European linguistic diversity, with 100 languages planned for the long term.
In January 2005, Project Gutenberg reached the landmark of 15,000 books. eBook number 15000 is The Life of Reason, by George Santayana (published in 1906). In July 2005, Project Gutenberg of Australia (launched in 2001) reached the landmark of 500 books. New teams were getting ready to launch Project Gutenberg Canada, Project Gutenberg Portugal and Project Gutenberg Philippines over the next years.
What about languages? While there were works in only 25 languages in February 2004, there were works in 42 languages in July 2005, including Iroquoian, Sanskrit and the Mayan languages. On July 27, 2005, out of a total of 16,800 books, the seven "main" languages were: English (with 14,548 books), French (577 books), German (349 books), Finnish (218 books), Dutch (130 books), Spanish (103 books) and Chinese (69 books). There were books in 50 languages in December 2006. On December 16, 2006, out of a total of 19,996 books, the main languages were English (17,377 books), French (966 books), German (412 books), Finnish (344 books), Dutch (244 books), Spanish (140 books), Italian (102 books), Chinese (69 books), Portuguese (68 books) and Tagalog (51 books).
In December 2006, Project Gutenberg reached the landmark of 20,000 books. eBook number 20000 was the audio book of Twenty Thousand Leagues Under the Sea (Vingt mille lieues sous les mers), by Jules Verne (published in 1869). Half of these 20,000 books were produced by Distributed Proofreaders since October 2000, with a monthly average of 346 new digitized books in 2006. If 32 years were necessary to digitize the first 10,000 books, between July 1971 and October 2003, 3 years and 2 months were necessary to digitize the following 10,000 books, between October 2003 and December 2006. Project Gutenberg of Australia was about to reach 1,500 books (this goal was achieved in April 2007) and Project Gutenberg Europe reached 500 books.
The section Project Gutenberg PrePrints was set up in January 2006 to collect items submitted to Project Gutenberg which for some reason were interesting enough to be available online, but not quite ready yet to be added to the main Project Gutenberg collection, the reason being for example missing data, low-quality files, formats which were not handy, etc. This new section had 379 files in December 2006.
= 20,000 to 25,000 Books
Project Gutenberg News began in November 2006 with Mike Cook as its editor and webmaster, as a complement to the weekly and monthly newsletters that had existed for a number of years. The website gives for example the weekly, monthly and yearly production stats since 2001. The weekly production was 24 books in 2001, 47 books in 2002, 79 books in 2003, 78 books in 2004, 58 books in 2005, 80 books in 2006 and 78 books in 2007. The monthly production was 104 books in 2001, 203 books in 2002, 348 books in 2003, 338 books in 2004, 252 books in 2005, 345 books in 2006 and 338 books in 2007. The yearly production was 1,244 books in 2001, 2,432 books in 2002, 4,176 books in 2003, 4,058 books in 2004, 3,019 books in 2005, 4,141 books in 2006 and 4,049 books in 2007.
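The weekly and monthly figures follow closely from the yearly totals. As a small sketch of that arithmetic, assuming a plain division by 12 months and 52 weeks (the method actually used by Project Gutenberg News is not stated here, and the function name averages is made up for this example), one could write:

```python
# Yearly totals as quoted in the text above.
yearly_production = {
    2001: 1244, 2002: 2432, 2003: 4176, 2004: 4058,
    2005: 3019, 2006: 4141, 2007: 4049,
}


def averages(total: int) -> tuple[float, float]:
    """Return (monthly, weekly) averages for one year.

    Assumption: 12 months and 52 weeks per year.
    """
    return total / 12, total / 52


for year, total in yearly_production.items():
    monthly, weekly = averages(total)
    print(f"{year}: {monthly:.1f} books per month, {weekly:.1f} books per week")
```

The results agree with the quoted figures to within about one book, the small differences presumably coming from rounding or from the way weeks were counted.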
Project Gutenberg of Canada (PGC) was founded on July 1st, 2007, on Canada Day, by Michael Shepard and David Jones, and Distributed Proofreaders of Canada (DPC) started production in December 2007. There were 100 books in March 2008, with books in English, French and Italian.
The combined Project Gutenberg projects had produced a total of 26,161 titles by 2007.
Project Gutenberg sent out 15 million books via snail mail in 2007, in the form of CDs and DVDs. Dated July 2006, the latest DVD included 17,000 books. Since 2005, CD and DVD files have also been periodically generated as ISO files to be downloaded and used to make a CD or DVD using a CD or DVD writer.
As for volunteers, Distributed Proofreaders (DP), which started production in October 2000, had over 52,000 volunteers in January 2008. DP had processed 11,934 books since its beginnings. Distributed Proofreaders Europe (DP Europe), which started production in December 2003, had over 1,500 volunteers in January 2008. Distributed Proofreaders Canada (DPC), which started production in December 2007, had over 250 volunteers in January 2008.
Project Gutenberg reached the landmark of 25,000 books in April 2008. eBook number 25000 was English Book Collectors, by William Younger Fletcher (published in 1902). On April 21, 2008, out of a total of 25,004 books, the main languages were English (21,475 books), French (1,168 books), German (530 books), Finnish (433 books), Dutch (326 books), Portuguese (217 books), Chinese (196 books), Spanish (180 books), Italian (128 books), Latin (55 books) and Tagalog (54 books). And there were books in Esperanto (45 books), Swedish (40 books), Danish (20 books), Catalan (19 books), Welsh (10 books), Norwegian (10 books), Russian (7 books), Icelandic (7 books), Hungarian (7 books), Middle English (6 books), Greek (6 books) and Bulgarian (6 books).
3. THE METHOD
Whether digitized years ago or now, all the books are digitized in 7-bit plain ASCII (American Standard Code for Information Interchange), called Plain Vanilla ASCII. Used since the beginnings of computing, it is the set of unaccented characters present on a standard English-language keyboard (A-Z, a-z, numbers, punctuation and other basic symbols). When 8-bit ASCII (also called ISO-8859 or ISO-Latin) is used for books in languages with accented characters, like French or German, Project Gutenberg also produces a 7-bit ASCII version with the accents stripped. (This doesn't apply to languages that are not "convertible" to ASCII, like Chinese, encoded in Big-5.)
Plain Vanilla ASCII is the best format by far. It is "the lowest common denominator". It can be read, written, copied and printed by any simple text editor or word processor on any electronic device. It is the only format compatible with 99% of hardware and software. It can be used as it is or to create versions in many other formats. It will still be in use when other formats have become obsolete (or are already obsolete, like the formats of a few short-lived reading devices launched since 1999). It is the assurance that collections will never be obsolete, and will survive future technological changes. The goal is to preserve the texts not only over decades but over centuries. No other standard is as widely used as ASCII right now, not even Unicode, a "universal" encoding system created in 1991.
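To give an idea of what stripping the accents involves, here is a minimal Python sketch. It is not the tool used by Project Gutenberg volunteers, which is not described here; the function name strip_accents is an assumption, and the approach (Unicode decomposition followed by dropping everything outside ASCII) is just one common way to do it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return a 7-bit ASCII approximation of text with the accents removed."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks,
    # and also any other character that has no ASCII decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Académie française"))  # Academie francaise
```

Letters that have no decomposition into an unaccented base letter (for example the German "ß" or the French "œ") are simply dropped by this sketch, so a real conversion would need an explicit substitution table for them.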
Project Gutenberg also publishes books in well-known formats like HTML, XML or RTF. There are Unicode files too. Any other format provided by volunteers (PDF, LIT, TeX and many others) is usually accepted, as long as they also supply an ASCII version where possible.