Chapter 2
Plain Vanilla ASCII is the best format by far. It is "the lowest common denominator". It can be read, written, copied and printed by any simple text editor or word processor on every computer in the world. It is the only format compatible with 99% of hardware and software. It can be used as it is or to create versions in many other formats. It will still be in use when other formats have become obsolete (some already are, like the formats of a few short-lived reading devices launched between 1999 and 2003). It is the assurance that the collections will never be obsolete and will survive future technological changes. The goal is to preserve the texts not only over decades but over centuries. No other standard is as widely used as ASCII right now, not even Unicode, a "universal" encoding system created in 1991.
Project Gutenberg also publishes eBooks in well-known formats like HTML, XML or RTF. There are Unicode files too. Any other format provided by volunteers (PDF, LIT, TeX and many others) is usually accepted, as long as they also supply an ASCII version where possible.
But large-scale conversion into other formats is handed over to other organizations. One example is Blackmask Online, which uses Project Gutenberg's collections to offer thousands of free eBooks in eight different formats based on the Open eBook (OeB) format. Another is Manybooks.net, which converts Project Gutenberg's eBooks into formats readable on PDAs. A third is Bookshare.org, the main digital library for the visually impaired community in the US, which converts books from Project Gutenberg into Braille format and DAISY (Digital Accessible Information System) format.
What is entailed exactly, once copyright clearance is received? Digitization is done by scanning the book page after page to get "image" files. Then volunteers run OCR (Optical Character Recognition) software to convert the "image" files into text files. Then each text file is proofread (i.e. re-read and corrected) by comparing it to the "image" file or the original page of the print version. A good OCR package makes an average of 10 mistakes per page, and there are many more if the quality of the scanner or the OCR package is not great.
The book is proofread twice on the computer screen by two different people, who make any corrections necessary. When the original is in poor condition, as with very old books, it is keyed in manually, word by word. Some volunteers themselves prefer to type short texts, or works they particularly like. But most books are scanned, "OCRized" and proofread.
Digitization in "text format" means a book can be copied, indexed, searched, analyzed and compared with other books. It is possible to search the content of the book with the "Find" button available in any browser and any software, without a specific search engine. Project Gutenberg provides a "Nearly Full Text" search (on the first 100 K of each file) using Google, with a database updated approximately monthly. It also provides a search of book metadata (author, title, brief description, keywords) as a participant in Yahoo!'s Content Acquisition Program, with a database updated weekly. (Please see the bottom of the Online Book Catalog.) In the Advanced Search, several fields can be filled: author, title, subject, language, category (any, audio book, music, pictures), LoCC (Library of Congress Catalog classification), filetype (text, PDF, HTML, XML, JPEG, etc.), and eText/eBook No. A field "Full Text" was recently added as an experimental feature.
The assets of digitization in "text format" are numerous. It produces a smaller computer file that is easier to send, unlike digitization in "image format", which produces a bulky "photo" file. Unlike other formats, the files are accessible for low-bandwidth use. They can be copied as much as needed to produce new digital or print versions for free. Typos pointed out after the text is released can be fixed at any time. Readers can change the font and size of characters, the margins or the number of lines per page. Visually impaired readers can increase the letter size. Blind readers can use speech synthesis software. All this is very difficult, if not impossible, with many other formats.
The eBooks released are meant to be 99.9% accurate in the eyes of the general reader. The goal is not to create authoritative editions, or to argue with a picky reader about whether a certain sentence should have a colon instead of a semi-colon between its clauses.
Project Gutenberg is convinced that proofreading by human beings is a very important step, and that this step makes all the difference. The use of scanned books as is --converted to text format by OCR software with no proofreading-- gives a much lower quality result. After running OCR software, the text is 99% reliable, in the best of cases. After proofreading, the text becomes 99.95% reliable (a high percentage which is also the standard at the Library of Congress).
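To make these percentages concrete, the following Python sketch converts an accuracy rate into an expected number of mistakes per page. The page length of 1,000 characters is an assumption chosen for illustration (the text does not give one), and the function name is hypothetical. Under that assumption, 99% accuracy gives 10 mistakes per page, which matches the OCR figure quoted earlier, while 99.95% gives about one mistake every two pages.

```python
def mistakes_per_page(accuracy_per_10000: int, chars_per_page: int = 1000) -> float:
    """Expected number of wrong characters on one page.

    accuracy_per_10000 expresses accuracy in hundredths of a percent
    (9900 = 99%, 9995 = 99.95%) so that the arithmetic stays exact.
    chars_per_page is an assumed page length, not a figure from the text.
    """
    wrong_per_10000 = 10000 - accuracy_per_10000
    return chars_per_page * wrong_per_10000 / 10000


if __name__ == "__main__":
    print(mistakes_per_page(9900))  # raw OCR output at 99%
    print(mistakes_per_page(9995))  # after proofreading at 99.95%
```

A longer assumed page scales both figures proportionally, so the ratio between unproofread and proofread text (a factor of 20) stays the same.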
For this reason, Project Gutenberg's perspective is rather different from that of the Million Book Project, which was launched by several professors from Carnegie Mellon University and whose collections (10,611 books on June 1st, 2005) are hosted by the Internet Archive (the Internet Archive is also the backup distribution site of Project Gutenberg). In the case of the Million Book Project, books are scanned and "OCRized", but they are not proofread. The main formats used are XML, TIF and DjVu.
On Project Gutenberg's website, a File Recode Service allows users to convert books in one format (ASCII, ISO-8859, Unicode and Big-5) into another, and vice versa. A much more powerful conversion program may be launched in the future, with a conversion into still more formats (XML, HTML, PDF, TeX, RTF), including Braille and voice. It will then also be possible to choose the font and size of characters and the background color. Another eagerly expected conversion is that of a book from one language to another by machine translation software. This may be possible in a few years, when machine translation is accurate to 99%.
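As an illustration of what converting a text from one character set to another involves, here is a minimal Python sketch. It is not the File Recode Service itself, whose implementation is not described here. The function name recode is hypothetical, and the codec names ("ascii", "iso-8859-1", "utf-8", "big5") are the ones used by Python's standard library.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source character set and re-encode them in the target one.

    Raises UnicodeDecodeError or UnicodeEncodeError when a character cannot
    be represented, rather than silently dropping it.
    """
    text = data.decode(source)
    return text.encode(target)


if __name__ == "__main__":
    latin1_bytes = "De la terre à la lune".encode("iso-8859-1")
    utf8_bytes = recode(latin1_bytes, "iso-8859-1", "utf-8")
    print(utf8_bytes.decode("utf-8"))
```

Conversion toward a smaller character set can fail: an accented letter has no ASCII equivalent, so the sketch raises an error in that case instead of guessing a substitute.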
5. DISTRIBUTED PROOFREADERS, TO HANDLE SHARED PROOFREADING
The main "leap forward" of Project Gutenberg in the last few years is due to Distributed Proofreaders.
Distributed Proofreaders was conceived in 2000 by Charles Franks to help in the digitizing of public domain books. Originally meant to assist Project Gutenberg in the handling of shared proofreading, Distributed Proofreaders became the main source of Project Gutenberg eBooks. In 2002, Distributed Proofreaders became an official Project Gutenberg site.
The number of eBooks that have been processed through Distributed Proofreaders has grown fast, with a total of 3,000 eBooks in February 2004, 5,000 eBooks in October 2004 and 7,000 eBooks in May 2005. On August 3, 2005, 7,639 books were complete (processed through the site and posted to Project Gutenberg), 1,250 books were in progress (processed through the site but not yet posted, because they were going through their final proofreading and assembly), and 831 books were being proofread (currently being processed).
From the website one can access a program that allows several proofreaders to work on the same book at the same time, each proofreading different pages. This significantly speeds up the proofreading process. Volunteers register and receive detailed instructions. For example, words in bold, italic or underlined type, and footnotes, are always treated the same way in every eBook. A discussion forum allows them to ask questions or seek help at any time. A project manager oversees the progress of a particular book through its different steps on the website.
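The idea of several people proofreading different pages of one book at the same time can be sketched in a few lines of Python. This is a simplified, hypothetical model and not the actual Distributed Proofreaders software: the class name BookProject and its methods are invented for illustration, and a single proofreading round is assumed.

```python
from collections import deque
from typing import Dict, List, Optional, Set


class BookProject:
    """Hypothetical model of one book split into pages for shared proofreading."""

    def __init__(self, ocr_pages: List[str]) -> None:
        self.pages: List[str] = list(ocr_pages)       # current text of each page
        self.waiting = deque(range(len(ocr_pages)))   # page numbers nobody has taken yet
        self.checked_out: Dict[int, str] = {}         # page number -> proofreader
        self.done: Set[int] = set()                   # page numbers already saved

    def check_out(self, proofreader: str) -> Optional[int]:
        """Give the proofreader the next free page, or None if none is left."""
        if not self.waiting:
            return None
        page = self.waiting.popleft()
        self.checked_out[page] = proofreader
        return page

    def save(self, page: int, proofreader: str, corrected: str) -> None:
        """Store the corrected text; only the holder of the page may save it."""
        if self.checked_out.get(page) != proofreader:
            raise ValueError("page is not checked out by this proofreader")
        self.pages[page] = corrected
        del self.checked_out[page]
        self.done.add(page)


if __name__ == "__main__":
    book = BookProject(["Tbe first page.", "The secand page."])
    first = book.check_out("volunteer A")
    second = book.check_out("volunteer B")   # a different page, at the same time
    book.save(first, "volunteer A", "The first page.")
    book.save(second, "volunteer B", "The second page.")
    print(book.pages)
```

Because each page is handed to only one person at a time, two volunteers never overwrite each other's corrections, which is what lets the work proceed in parallel.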
Each time proofreaders go to the website, they choose the book they want. One page of the book appears in two forms side by side: the scanned image of one page and the text from that image (as produced by OCR software). The proofreader can easily compare both versions, note the differences and fix them. OCR is usually 99% accurate, which makes for about 10 corrections a page. The proofreader saves each page as it is completed and can then either stop work or do another. The books are proofread twice, and the second time only by experienced proofreaders. All the pages of the book are then formatted, combined and assembled by post-processors to make an eBook. (For more detailed information, check the FAQ Central.) The eBook is now ready to be posted with an index entry (title, subtitle, author, eBook number and character set) for the database. Indexers go on with the cataloguing process (author's dates of birth and death, Library of Congress classification, etc.) after the release.
Volunteers don't have a quota to fill, but it is recommended they do a page a day if possible. It doesn't seem like much, but with hundreds of volunteers it really adds up. In 2003, about 250-300 people were working each day all over the world, producing a daily total of 2,500-3,000 pages, the equivalent of two pages a minute. In 2004, the average was 300-400 proofreaders participating each day, and finishing 4,000-7,000 pages per day, the equivalent of four pages a minute.
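The per-minute figures follow from dividing the daily totals by the 1,440 minutes in a day. The short Python sketch below (the function name is hypothetical) shows the calculation. It gives roughly 1.7 to 2.1 pages a minute for 2003 and 2.8 to 4.9 for 2004, so "two" and "four" pages a minute are rounded values.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def pages_per_minute(pages_per_day: int) -> float:
    """Convert a daily page count into an average number of pages per minute."""
    return pages_per_day / MINUTES_PER_DAY


if __name__ == "__main__":
    # Daily ranges quoted in the text for each year.
    for year, low, high in [(2003, 2500, 3000), (2004, 4000, 7000)]:
        print(year, round(pages_per_minute(low), 1), "to", round(pages_per_minute(high), 1))
```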
Volunteers can also work independently, after contacting Project Gutenberg directly, by keying in a book they particularly like using any text editor or word processor. They can also scan it and convert it into text using OCR software, and then make corrections by comparing it with the original. In each case, someone else will proofread it. They can use ASCII or any other format. Everybody is welcome, whatever the method and whatever the format.
New volunteers are most welcome too at Distributed Proofreaders (DP-INT) and Distributed Proofreaders Europe (DP Europe). Any volunteer anywhere is welcome, for any language. There is a lot to do. As stated on both websites, "Remember that there is no commitment expected on this site. Proofread as often or as seldom as you like, and as many or as few pages as you like. We encourage people to do 'a page a day', but it's entirely up to you! We hope you will join us in our mission of 'preserving the literary history of the world in a freely available form for everyone to use'."
6. EBOOKS IN MORE AND MORE LANGUAGES
What about languages?
Initially, the eBooks were mostly in English. As Project Gutenberg is based in the United States, it first focused on the English-speaking community in the country and worldwide.
In October 1997, Michael Hart expressed his intention to expand the publishing of eBooks in other languages. At the beginning of 1998, the catalog had a few titles in French (10 titles), German, Italian, Spanish and Latin. In July 1999, Michael wrote: "I am publishing in one new language per month right now, and will continue as long as possible."
In early 2004, there were works in 25 languages. In July 2005, there were works in 42 languages, including Iroquoian, Sanskrit and the Mayan languages. The seven "main" languages were: English (with 14,548 books on July 27, 2005), French (577 books), German (349 books), Finnish (218 books), Dutch (130 books), Spanish (103 books) and Chinese (69 books).
Let us take French as an example. On February 13, 2004, there were 181 eBooks in French (out of a total of 11,340 eBooks). On May 16, 2005, there were 547 eBooks in French (out of 15,505 eBooks). The number tripled in 15 months. This number should rise significantly during the next few years, notably with Project Gutenberg Europe (launched in June 2005).
What were the first eBooks posted in French? They were six novels by Stendhal and two novels by Jules Verne, all released in early 1997. The six novels by Stendhal were: L'Abbesse de Castro, Les Cenci, La Chartreuse de Parme, La Duchesse de Palliano, Le Rouge et le Noir and Vittoria Accoramboni. The two novels by Jules Verne were: De la terre à la lune and Le tour du monde en quatre-vingts jours. In early 1997, whereas Project Gutenberg offered no English version of any of Stendhal's writings (yet), three of Jules Verne's novels were available in English: 20,000 Leagues Under the Seas (original title: Vingt mille lieues sous les mers), posted in September 1994; Around the World in 80 Days (original title: Le tour du monde en quatre-vingts jours), posted in January 1994; and From the Earth to the Moon (original title: De la terre à la lune), posted in September 1993. Stendhal and Jules Verne were followed by Edmond Rostand with Cyrano de Bergerac, posted in March 1998.
In late 1999, the "Top 20" --the 20 most downloaded authors-- included Jules Verne at 11 and Emile Zola at 16. They still have a very good ranking in the present "Top 100".
As a side remark, the first "images" ever made available by Project Gutenberg were French Cave Paintings, posted in April 1995, with an XHTML version posted in November 2000. This eBook contains four photos of paleolithic paintings found in a grotto located in Ardèche, a region of south-eastern France. These photos, which are copyrighted, were made available to Project Gutenberg for everyone to enjoy, thanks to Jean Clottes, a French general curator for cultural heritage (conservateur général du patrimoine).
Multilingualism is now one of the priorities of Project Gutenberg, like internationalization. In early 2004, Michael Hart went off to Europe, with stops in Paris, Brussels and Belgrade. He gave a lecture on February 12, 2004 at UNESCO (United Nations Educational, Scientific and Cultural Organization) headquarters in Paris. He chaired a discussion at the French National Assembly on February 13. The following week, he addressed the European Parliament, in Brussels. He also met with the team of Project Rastko, in Belgrade, to support the creation of Distributed Proofreaders Europe (launched in January 2004) and Project Gutenberg Europe (conceived at the same time, and launched in June 2005).
The launching of Distributed Proofreaders Europe (DP Europe) by Project Rastko was indeed a very important step. DP Europe uses the software of the original Distributed Proofreaders and is dedicated to the proofreading of eBooks for Project Gutenberg Europe. Since its very beginnings, DP Europe has been a multilingual website, with its main pages translated into several European languages by volunteer translators. In April 2004, DP Europe was available in 12 languages. The long-term goal is 60 languages and 60 linguistic teams representing all the European languages. When it gets up to speed, DP Europe will provide eBooks for several national and/or linguistic digital libraries, for example Projet Gutenberg France for France. The goal is for every country to have its own digital library (according to the country's copyright limitations), within a continental network (for France, the European network) and a global network (for the whole planet).
A few lines now on Project Rastko, which had the boldness to launch such a difficult and exciting project for Europe, and catalysed volunteers' energy in both Eastern and Western Europe (and anywhere else: as the internet has no boundaries, there is no need to live in Europe to register). Founded in 1997, Project Rastko is a non-governmental cultural and educational project. One of its goals is the online publishing of Serbian culture. It is part of the Balkans Cultural Network Initiative, a regional cultural network for the Balkan peninsula in south-eastern Europe.
In May 2005, Distributed Proofreaders Europe finished processing its 100th eBook. In June 2005 Project Gutenberg Europe was launched with these first 100 eBooks. PG Europe operates under "life +50" copyright laws. On August 3, 2005, 137 books were complete (processed through the site and posted to Project Gutenberg Europe), 418 books were in progress (processed through the site but not yet posted, because they were going through their final proofreading and assembly), and 125 books were being proofread (currently being processed). DP Europe supports Unicode to be able to proofread eBooks in numerous languages. Unicode is an encoding system created in 1991 that gives a unique number for every character in any language.
From the Past to the Future
10 books online in August 1989; 100 books in January 1994; 1,000 books in August 1997; 2,000 books in May 1999; 3,000 books in December 2000; 4,000 books in October 2001; 5,000 books in April 2002; 10,000 books in October 2003; 15,000 books in January 2005; and 1 million books planned for 2015.
But Project Gutenberg's results are not only measured in numbers, which can't compete yet with the number of print books in the public domain. The results also include the major influence that the project has had. As the oldest producer of free eBooks on the internet, Project Gutenberg has inspired many other digital libraries, for example Projekt Gutenberg-DE for classic German literature and Projekt Runeberg for classic Nordic (Scandinavian) literature, to name only two.
Project Gutenberg keeps its administrative and financial structure to the bare minimum. Its motto fits into three words: "Less is more". The minimal rules give much space to volunteers and to new ideas. The goal is to ensure its independence from loans and other funding and from ephemeral cultural priorities, to avoid pressure from politicians or economic interests. The aim is also to ensure respect for the volunteers, who can be confident their work will be used not just for decades but for centuries. Volunteers can network through mailing lists and weekly or monthly newsletters. Donations are used to buy equipment and supplies, mostly computers and scanners. Founded in 2000, the PGLAF (Project Gutenberg Literary Archive Foundation) has only three part-time employees.
More generally, Michael should be given more credit as the real inventor of the eBook. If we consider the eBook in its etymological sense, that is to say a book that has been digitized to be distributed as an electronic file, it is now 34 years old and was born with Project Gutenberg in July 1971. This is a much more comforting paternity than the various commercial launchings in proprietary formats that peppered the early 2000s. There is no reason for the term "eBook" to be the monopoly of Amazon, Barnes & Noble, Gemstar and others. The non-commercial eBook is a full eBook, and not a "poor" version, just as non-commercial ePublishing is a fully-fledged way of publishing, and as valuable as commercial ePublishing. Project Gutenberg eTexts are now called eBooks, to use the recent terminology in the field.
In July 1971, sending a 5K file to 100 people would have crashed the network of the time. In November 2002, Project Gutenberg could post the 75 files of the Human Genome Project, with files of dozens or hundreds of megabytes, shortly after its initial release in February 2001, because it was public domain. In 2004, a computer hard disk costing US$140 could potentially hold the entire Library of Congress. And we probably are only a few years away from a storage disk capable of holding all the print media of our planet.
What about documents other than text?
In September 2003, Project Gutenberg launched Project Gutenberg Audio eBooks. As of 2005, there are 391 computer-generated audio books and a few human-read audio books. The number of human-read eBooks should greatly increase over the next few years. As for computer-generated eBooks, it seems they won't be stored in a specific section any more, but "converted" when requested from the existing electronic files in the main collections. Voice-activated requests will be possible, as a useful tool for visually impaired readers.
Launched at the same time, The Sheet Music Subproject is dedicated to digitized sheet music. It also contains a few music recordings. Some still pictures and moving pictures are also available. These new collections should take off in the future.
But digitizing books remains the priority, and there is a big demand, as confirmed by the tens of thousands of eBooks that are downloaded every day. For example, on July 31, 2005, there were 37,532 downloads for the day, 243,808 downloads for the week (July 24-31), and 1,154,765 downloads for the month. These figures cover only transfers from ibiblio.org (University of North Carolina at Chapel Hill), the main eBook distribution site (which also hosts the website). The Internet Archive is the backup distribution site and provides unlimited disk space for storage and processing. Project Gutenberg has 44 mirror sites in many countries and is looking for new ones. It also encourages the use of P2P for sharing its eBooks. The "Top 100" lists the top 100 eBooks and the top 100 authors for the previous day, the last 7 days and the last 30 days.
Project Gutenberg eBooks can also help bridge the "digital divide." They can be read on a computer or a secondhand PDA costing just a few dollars. Solar-powered PDAs offer a good solution in remote regions and developing countries.
eBooks are also copied on CDs and DVDs. Blank CDs and DVDs cost next to nothing, as does their burning on a CD or DVD writer. Project Gutenberg sends a free CD or DVD to anyone who asks for it, and people are encouraged to make copies for a friend, a library or a school. Released in August 2003, the "Best of Gutenberg" CD contains over 600 eBooks. Released in December 2003, the first Project Gutenberg DVD contains 9,400 eBooks. A new DVD is in preparation. The current prototype contains nearly 26,000 eBooks (with some titles in different versions and formats), and is about 3/4 full.
By the time the collections hit one million eBooks in 2015 or before, it is hoped machine translation software will be able to convert them from one to another of 100 languages. Ten years from now, it is possible that machine translation will be judged 99% satisfactory (research is very active on that front, but there is still a lot to do), allowing for the reading of literary classics in a choice of many languages. In 2004, Project Gutenberg was in touch with a European project studying how to combine translation software and human translators, somewhat as OCR software is now combined with the work of proofreaders.
34 years after the beginnings of Project Gutenberg, Michael Hart describes himself as a workaholic who devotes his entire life to his project, because he thinks eBooks will become the "killer ap(plication)" of the computer revolution. He considers himself a pragmatic and farsighted altruist. For years he was regarded as a nut but now he is respected. He wants to change the world through freely-available eBooks that can be used and copied endlessly. Reading and culture for everyone at minimal cost. Project Gutenberg's mission can be stated in eight words: "To encourage the creation and distribution of eBooks," by everybody, and by every possible means, while implementing new ideas, new methods and new software.
Let us give the last word to Michael, whom I asked in August 1998: "What is your best experience with the internet?" His answer was: "The notes I get that tell me people appreciate that I have spent my life putting books, etc., on the internet. Some are quite touching, and can make my whole day." Seven years later, he confirms that his answer would still be the same.
7. CHRONOLOGY [UPDATED IN 2006]
1971 (July): Michael Hart keyed in The United States Declaration of Independence (eBook # 1) and informed the first 100 internet users. Project Gutenberg was born.
1972: He keyed in The United States Bill of Rights (eBook # 2).
1973: He keyed in The United States Constitution (eBook # 5).
1974-1988: He keyed in parts of the Bible and several works by Shakespeare.
1989 (August): The King James Bible (eBook # 10).
1991 (January): Alice's Adventures in Wonderland (eBook # 11).
1991 (June): Peter Pan (eBook # 16).
1991: Digitization of one book per month.
1992: Digitization of two books per month.
1993: Digitization of four books per month.