Project Gutenberg (1971-2008)

Chapter 2

3,978 words. Public domain.

But large-scale conversion into other formats is handed over to other organizations. Examples include Blackmask Online, which uses Project Gutenberg's collections to offer thousands of free books in eight different formats based on the Open eBook (OeB) format; Manybooks.net, which converts Project Gutenberg's books into formats readable on PDAs; Mobilebooks, with 5,000 books in Java (.jar) format that can be downloaded from the website to be read on a cell phone; and Wattpad, a free service for reading and sharing stories on a mobile phone. Once downloaded to a phone, Wattpad gives instant access to works from Project Gutenberg.

For a volunteer, the wisest thing to do is to choose a book published before 1923. Copyright clearance must also be confirmed before working on any book, by sending a copy of the title page and verso page (even if the latter is blank) to Michael Hart. The pages should be sent as scans to be uploaded on the website. For people who cannot create scans, it is possible to send photocopies by postal mail. The pages will then be filed, either on paper or electronically, so that the proof will be available in the future, to demonstrate if necessary that the book is in the public domain under US law. Project Gutenberg doesn't release any book until the book's copyright status has been confirmed.

What exactly is entailed once copyright clearance is received? Digitization is done by scanning the book page by page to get "image" files. Then volunteers run OCR (Optical Character Recognition) software to convert the "image" files into text files. Then each text file is proofread (i.e. re-read and corrected) by comparing it to the "image" file or the original page of the print version. There is an average of 10 mistakes per page for a good OCR package, and many more mistakes if the quality of the scanner and the OCR package is not great.

The book is proofread twice on the computer screen by two different people, who make any corrections necessary. When the original is in poor condition, as with very old books, it is keyed in manually, word by word. Some volunteers themselves prefer to type short texts, or works they particularly like. But most books are scanned, "OCRized" and proofread.

Unlike digitization in "image format", which consists only of scanning the pages, digitization in "text format" adds the OCR step, with two benefits: a) the book can be copied, indexed, searched, analyzed and compared with other books; b) it is possible to search the content of the book with the "Find" button available in any browser and any software, without a specific search engine.

The assets of digitization in "text format" are numerous. It produces a smaller file that is easier to send, unlike digitization in "image format", which produces a bulky "photo" file. Unlike other formats, the files are accessible for low-bandwidth use. They can be copied as much as needed to produce new digital or print versions for free. The typos pointed out after the text is released can be fixed at any time. Readers can change the font and size of characters, the margins or the number of lines per page. Visually impaired readers can increase the letter size. Blind readers can use speech synthesis (text-to-speech) software. All this is very difficult, if not impossible, with many other formats.

Although the books released are 99.9% accurate in the eyes of the general reader, the goal is not to create authoritative editions, or to argue with a picky reader over whether a certain sentence should have a colon instead of a semi-colon between its clauses.

Project Gutenberg is convinced that proofreading by human beings is a very important step, and that this step makes all the difference. The use of scanned books as is --converted to text format by OCR software with no proofreading-- gives a much lower quality result. After running OCR software, the text is 99% reliable, in the best of cases. After proofreading, the text becomes 99.95% reliable (a high percentage which is also the standard at the Library of Congress).
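To make these percentages concrete, the short sketch below converts an accuracy rate into an expected number of errors. The page length of 1,000 units (characters or words) is an assumption chosen for illustration, not a figure given by Project Gutenberg, and the function name is hypothetical.

```python
def expected_errors(accuracy_percent: float, units: int) -> float:
    """Expected number of wrong units (characters or words) at a given accuracy."""
    return units * (100.0 - accuracy_percent) / 100.0


PAGE_UNITS = 1000  # assumed page length, for illustration only

after_ocr = expected_errors(99.0, PAGE_UNITS)
after_proofreading = expected_errors(99.95, PAGE_UNITS)

print(f"after OCR: about {after_ocr:.1f} errors per {PAGE_UNITS} units")
print(f"after proofreading: about {after_proofreading:.1f} errors per {PAGE_UNITS} units")
```

Under this assumption, 99% reliability corresponds to about ten errors on a page, and 99.95% to about one error every two pages.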

For this reason, Project Gutenberg's perspective is rather different from that of the Internet Archive. In its Text Archive, books are scanned and "OCRized", but they are not proofread. The main formats used are XML, TIF and DjVu. Nor are books proofread in the other main collections: Open Content Alliance (OCA), Google Book Search or Microsoft Live Search Books.

Project Gutenberg provides a "Nearly Full Text" search (on the first 100 K of each file) using Google, with a database updated approximately monthly. It also provides a search of book metadata (author, title, brief description, keywords) as a participant in Yahoo!'s Content Acquisition Program, with a database updated weekly. Both are available in the Online Book Catalog (at the bottom of the page). In the Advanced Search, several fields can be filled: author, title, subject, language, category (any, audio book, music, pictures), LoCC (Library of Congress Classification), filetype (text, PDF, HTML, XML, JPEG, etc.), and eText/eBook No. A field "Full Text" was also added as an experimental feature.

On Project Gutenberg's website, a File Recode Service allows users to convert books in one format (ASCII, ISO-8859, Unicode and others) into another, and vice versa. A much more powerful conversion program may be launched in the future, with a conversion into still more formats (XML, HTML, PDF, TeX, RTF), including Braille and voice. It will then also be possible to choose the font and size of characters and the background color. Another eagerly expected conversion is that of a book from one language to another by machine translation software. This may be possible in a few years, when machine translation is accurate to 99%. Still, these books will certainly need some proofreading too by human translators.
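As an illustration of what recoding a text file involves, here is a minimal sketch in Python. It is not the File Recode Service itself, whose implementation is not described here; the function name `recode` and the sample text are assumptions made for the example.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target encoding."""
    return data.decode(source).encode(target)


# Sample text: a French title containing one accented character.
latin1_bytes = "De la terre à la lune".encode("iso-8859-1")
utf8_bytes = recode(latin1_bytes, "iso-8859-1", "utf-8")

# The accented letter takes one byte in ISO-8859-1 and two bytes in UTF-8.
print(len(latin1_bytes), len(utf8_bytes))

# Converting back restores the original bytes.
print(recode(utf8_bytes, "utf-8", "iso-8859-1") == latin1_bytes)
```

Recoding towards a smaller character set such as ASCII raises an error in this sketch whenever the text contains a character that the target cannot represent.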

4. SHARED PROOFREADING

The main "leap forward" of Project Gutenberg in the last few years is due to Distributed Proofreaders. Distributed Proofreaders was launched in October 2000 by Charles Franks to help in the digitizing of public domain books. Originally meant to assist Project Gutenberg in the handling of shared proofreading, Distributed Proofreaders became the main source of Project Gutenberg books. In 2002, Distributed Proofreaders became an official Project Gutenberg site. In May 2006, Distributed Proofreaders became a separate entity and continues to maintain a strong relationship with Project Gutenberg.

Volunteers don't have a quota to fill, but it is recommended they do a page a day if possible. It doesn't seem like much, but with hundreds of volunteers it really adds up. In 2003, about 250-300 people were working each day all over the world, producing a daily total of 2,500-3,000 pages, the equivalent of two pages a minute. In 2004, the average was 300-400 proofreaders participating each day, and finishing 4,000-7,000 pages per day, the equivalent of four pages a minute. The number of books that have been processed through Distributed Proofreaders has grown fast, with a total of 3,000 books in February 2004, 5,000 books in October 2004, 7,000 books in May 2005, 8,000 books in February 2006 and 10,000 books in March 2007, with five books produced per day and 52,000 volunteers in December 2007.
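The "pages a minute" figures follow from dividing the daily totals by the 1,440 minutes in a day. A small sketch of that arithmetic, using only the daily totals quoted above (the function name is hypothetical):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def pages_per_minute(pages_per_day: int) -> float:
    """Average number of pages finished per minute over a full day."""
    return pages_per_day / MINUTES_PER_DAY


# (year, lowest daily total, highest daily total) as quoted in the text
for year, low, high in [(2003, 2500, 3000), (2004, 4000, 7000)]:
    print(f"{year}: {pages_per_minute(low):.1f} to {pages_per_minute(high):.1f} pages a minute")
```

The 2003 range works out to roughly two pages a minute, and the 2004 range to between about three and five, consistent with the rounded figures of two and four.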

From the website one can access a program that allows several proofreaders to work on the same book at the same time, each proofreading different pages. This significantly speeds up the proofreading process. Volunteers register and receive detailed instructions. For example, words in bold, italic or underlined, or footnotes are always treated the same way for any book. A discussion forum allows them to ask questions or seek help at any time. A project manager oversees the progress of a particular book through its different steps on the website.

The website gives a full list of the books that are: a) completed, i.e. processed through the site and posted to Project Gutenberg; b) in progress, i.e. processed through the site but not yet posted, because currently going through their final proofreading and assembly; c) being proofread, i.e. currently being processed. On August 3, 2005, 7,639 books were completed, 1,250 books were in progress and 831 books were being proofread. On May 1st, 2008, 13,039 books were completed, 1,840 books were in progress and 1,000 books were being proofread.

Each time a volunteer (proofreader) goes to the website, s/he chooses a book, any book. One page of the book appears in two forms side by side: the scanned image of one page and the text from that image (as produced by OCR software). The proofreader can easily compare both versions, note the differences and fix them. OCR is usually 99% accurate, which makes for about 10 corrections a page. The proofreader saves each page as it is completed and can then either stop work or do another. The books are proofread twice, and the second time only by experienced proofreaders. All the pages of the book are then formatted, combined and assembled by post-processors to make an eBook. The eBook is now ready to be posted with an index entry (title, subtitle, author, eBook number and character set) for the database. Indexers go on with the cataloging process (author's dates of birth and death, Library of Congress classification, etc.) after the release.
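The comparison step is done by eye on the website, but the idea of "noting the differences" between OCR output and the corrected page can be sketched with Python's standard `difflib` module. This is only an illustration, not the software used by Distributed Proofreaders; the function name and the sample sentence with its two OCR mistakes are invented for the example.

```python
import difflib


def word_corrections(ocr_text: str, page_text: str) -> list[tuple[str, str]]:
    """Return (OCR reading, corrected reading) pairs wherever the two texts differ."""
    ocr_words = ocr_text.split()
    page_words = page_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=page_words, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing words so insertions and deletions are reported too.
            corrections.append((" ".join(ocr_words[i1:i2]), " ".join(page_words[j1:j2])))
    return corrections


# Invented example: two typical OCR confusions (h/b letter shapes).
ocr_output = "Tbe quick hrown fox jumps"
corrected = "The quick brown fox jumps"

for wrong, right in word_corrections(ocr_output, corrected):
    print(f"{wrong!r} -> {right!r}")
```

A comparison like this only works once a corrected text exists; producing that corrected text from the scanned image is precisely the human proofreader's job.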

Volunteers can also work independently, after contacting Project Gutenberg directly, by keying in a book they particularly like using any text editor or word processor. They can also scan it and convert it into text using OCR software, and then make corrections by comparing it with the original. In each case, someone else will proofread it. They can use ASCII and any other format. Everybody is welcome, whatever the method and whatever the format.

New volunteers are most welcome too at Distributed Proofreaders (DP), Distributed Proofreaders Europe (DP Europe) and Distributed Proofreaders Canada (DPC). Any volunteer anywhere is welcome, for any language. There is a lot to do. As stated on these websites, "Remember that there is no commitment expected on this site. Proofread as often or as seldom as you like, and as many or as few pages as you like. We encourage people to do 'a page a day', but it's entirely up to you! We hope you will join us in our mission of 'preserving the literary history of the world in a freely available form for everyone to use'."

5. BECOMING MULTILINGUAL

What about languages? At first, Project Gutenberg's books were mostly in English. As it has been based in the United States since 1971, it has focused on the English-speaking community in the country and worldwide. Multilingualism started in 1997.

In October 1997, Michael Hart expressed his intention to include books in other languages. At the beginning of 1998, the catalog had a few titles in French (10 titles), German, Italian, Spanish and Latin. In July 1999, Michael wrote: "I am publishing in one new language per month right now, and will continue as long as possible."

In February 2004, there were works in 25 languages. In July 2005, there were works in 42 languages, including Iroquoian, Sanskrit and the Mayan languages. The seven main languages -- with more than 50 books -- were English, French, German, Finnish, Dutch, Spanish and Chinese. In December 2006, there were books in 50 languages. There were ten main languages, the above ones plus Italian, Portuguese and Tagalog. In April 2008, there were books in 55 languages, with eleven main languages, the above ones plus Latin. Esperanto was not far behind with 45 books, and Swedish followed with 40 books.

French is the second main language after English. On February 13, 2004, there were 181 books in French (out of a total of 11,340 books). On May 16, 2005, there were 547 books in French (out of a total of 15,505 books). The number tripled in 15 months. On July 27, 2005, there were 577 books in French (out of a total of 16,800 books). On December 16, 2006, there were 966 books in French (out of a total of 19,996 books). On April 21, 2008, there were 1,168 books in French (out of a total of 25,004 books). The number of French books is expected to rise significantly in a few years, when Project Gutenberg Europe is running at full speed.

What were the first books posted in French? They were six novels by Stendhal and two novels by Jules Verne, all released in early 1997. The six novels by Stendhal were: L'Abbesse de Castro, Les Cenci, La Chartreuse de Parme, La Duchesse de Palliano, Le Rouge et le Noir and Vittoria Accoramboni. The two novels by Jules Verne were: De la terre à la lune and Le tour du monde en quatre-vingts jours. In early 1997, whereas Project Gutenberg offered no English version of any of Stendhal's writings (yet), three of Jules Verne's novels were available in English: 20,000 Leagues Under the Seas (original title: Vingt mille lieues sous les mers), posted in September 1994; Around the World in 80 Days (original title: Le tour du monde en quatre-vingts jours), posted in January 1994 and From the Earth to the Moon (original title: De la terre à la lune), posted in September 1993. Stendhal and Jules Verne were followed by Edmond Rostand with Cyrano de Bergerac, posted in March 1998.

In late 1999, the "Top 20" --the 20 most downloaded authors-- included Jules Verne at 11 and Emile Zola at 16. They still have a very good ranking in the present "Top 100".

As a side remark, the first "images" ever made available by Project Gutenberg were French Cave Paintings, posted in April 1995, with an XHTML version posted in November 2000. This book contains four photos of paleolithic paintings found in a grotto located in Ardèche, a region of south-eastern France. These photos, which are copyrighted, were made available to Project Gutenberg thanks to Jean Clottes, a French general curator for cultural heritage (conservateur général du patrimoine), for everyone to enjoy.

In 2004, multilingualism became one of the priorities of Project Gutenberg, like internationalization. Michael Hart went off to Europe, with stops in Paris, Brussels and Belgrade. He gave a lecture on February 12, 2004 at UNESCO (United Nations Educational, Scientific and Cultural Organization) headquarters in Paris. He chaired a discussion at the French National Assembly on February 13. The following week, he addressed the European Parliament, in Brussels. He also met with the team of Project Rastko, in Belgrade, to support the creation of Distributed Proofreaders Europe (launched in December 2003) and Project Gutenberg Europe (launched in January 2004).

The launching of Distributed Proofreaders Europe (DP Europe) by Project Rastko was indeed a very important step. DP Europe uses the software of the original Distributed Proofreaders and is dedicated to the proofreading of books for Project Gutenberg Europe. Since its very beginnings, DP Europe has been a multilingual website, with its main pages translated into several European languages by volunteer translators. DP Europe was available in 12 languages in April 2004 and 22 languages in May 2008.

The long-term goal is 60 languages and 60 linguistic teams representing all the European languages. When it gets up to speed, DP Europe will provide books for several national and/or linguistic digital libraries, for example Projet Gutenberg France for France. The goal is for every country to have its own digital library (according to the country's copyright limitations), within a continental network (for France, the European network) and a global network (for the whole planet).

A few lines now on Project Rastko, which launched such a difficult and exciting project for Europe, and catalysed volunteers' energy in both Eastern and Western Europe (and anywhere else: as the internet has no boundaries, there is no need to live in Europe to register). Founded in 1997, Project Rastko is a non-governmental cultural and educational project. One of its goals is the online publishing of Serbian culture. It is part of the Balkans Cultural Network Initiative, a regional cultural network for the Balkan peninsula in south-eastern Europe.

In May 2005, Distributed Proofreaders Europe finished processing its 100th eBook. In June 2005 Project Gutenberg Europe was launched with these first 100 books. PG Europe operates under "life +50" copyright laws. DP Europe supports Unicode to be able to proofread books in numerous languages. Created in 1991 and widely used since 1998, Unicode is an encoding system that gives a unique number for every character in any language, contrary to the much older ASCII that was meant only for English and a few European languages.
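The difference between the two character sets can be seen in a few lines of Python, shown here as a minimal illustration. The helper name `fits_in_ascii` is hypothetical; the two titles are taken from the French books mentioned earlier.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of the text can be encoded in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Each character has one Unicode number (code point), conventionally written U+XXXX.
for char in "aà":
    print(char, ord(char), f"U+{ord(char):04X}")

for title in ["Cyrano de Bergerac", "De la terre à la lune"]:
    print(title, "->", "ASCII is enough" if fits_in_ascii(title) else "needs more than ASCII")
```

The unaccented title fits in ASCII, while the single "à" in the second title is enough to require a larger character set.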

On August 3, 2005, 137 books were completed (processed through the site and posted to Project Gutenberg Europe), 418 books were in progress (processed through the site but not yet posted, because currently going through their final proofreading and assembly), and 125 books were being proofread (currently being processed). On May 10th, 2008, 496 books were completed, 653 books were in progress and 91 books were being proofread.

6. PUBLIC DOMAIN VS. COPYRIGHT

As stated in the Project Gutenberg FAQ, "the public domain is the set of cultural works that are free of copyright, and belong to everyone equally", i.e. books that can be digitized to be freely available on the internet. But the task of Project Gutenberg isn't made any easier by the increasing restrictions to the public domain. In former times, 50% of works belonged to the public domain, and could be freely used by everybody. A much tougher legislation was set in place over the centuries, step by step, especially during the 20th century, despite our so-called "information society". In 2100, 99% of works might be governed by copyright, with a meager 1% for public domain.

In the Copyright HowTo section, Project Gutenberg presents its own rules for confirming the public domain status of books according to US copyright laws. Here is a summary. Works published before 1923 entered the public domain no later than 75 years from the copyright date. (All these works are now in the public domain.) Works published between 1923 and 1977 retain copyright for 95 years. (No such works will enter the public domain until 2019.) Works created from 1978 on enter the public domain 70 years after the death of the author if the author is a natural person. (Nothing will enter the public domain until 2049.) Works created from 1978 on enter the public domain 95 years after publication (or 120 years after creation) if the author is a corporate one. (Nothing will enter the public domain until 2074.) Other rules apply too. The copyright law was amended 11 times between 1976 and now.
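The dates in this summary can be reproduced with a small sketch. It encodes only the four rules listed above and nothing else (renewals and the "other rules" are ignored, as is the 120-years-after-creation alternative), so it is an illustration of the arithmetic, not a tool for copyright clearance. The function and parameter names are hypothetical. It assumes that a term of N years runs to the end of the calendar year, so that a work enters the public domain in year start + N + 1, which is what the dates 2019, 2049 and 2074 imply.

```python
from typing import Optional


def public_domain_year(year: int, corporate: bool = False,
                       author_death: Optional[int] = None) -> int:
    """First year in which a work is in the US public domain, per the summary above.

    year is the publication year (or creation year for works from 1978 on).
    """
    if year < 1923:
        return year + 75 + 1          # "no later than 75 years": an upper bound
    if year <= 1977:
        return year + 95 + 1          # copyright retained for 95 years
    if corporate:
        return year + 95 + 1          # corporate author: 95 years after publication
    if author_death is None:
        raise ValueError("author_death is required for a natural person's work from 1978 on")
    return author_death + 70 + 1      # natural person: 70 years after death


print(public_domain_year(1923))                      # earliest 1923-1977 work
print(public_domain_year(1978, author_death=1978))   # earliest possible life + 70 case
print(public_domain_year(1978, corporate=True))      # earliest corporate work
```

The three calls print 2019, 2049 and 2074, the years given in parentheses in the summary.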

Much more restrictive than the previous one, the current legislation became effective after the promulgation of amendments to the 1976 Copyright Act, dated October 27th, 1998. As explained by Michael Hart in July 1999: "Nothing will expire for another 20 years. We used to have to wait 75 years. Now it is 95 years. And it was 28 years (+ a possible 28 year extension, only on request) before that, and 14 years (+ a possible 14 year extension) before that. So, as you can see, this is a serious degrading of the public domain, as a matter of continuing policy."

These amendments were a major blow for digital libraries and deeply shocked their founders, beginning with Michael Hart, founder of Project Gutenberg in 1971, and John Mark Ockerbloom, founder of The Online Books Page in 1993. But how were they to measure up to the major publishing companies?

Michael wrote in July 1999: "No one has said more against copyright extensions than I have, but Hollywood and the big publishers have seen to it that our Congress won't even mention it in public. The kind of copyright debate going on is totally impractical. It is run by and for the 'Landed Gentry of the Information Age.' 'Information Age'? For whom?"

John wrote in August 1999: "I think it's important for people on the web to understand that copyright is a social contract that's designed for the public good -- where the public includes both authors and readers. This means that authors should have the right to exclusive use of their creative works for limited times, as is expressed in current copyright law. But it also means that their readers have the right to copy and reuse the work at will once copyright expires. In the US now, there are various efforts to take rights away from readers, by restricting fair use, lengthening copyright terms (even with some proposals to make them perpetual) and extending intellectual property to cover facts separate from creative works (such as found in the 'database copyright' proposals). There are even proposals to effectively replace copyright law altogether with potentially much more onerous contract law."

The political authorities continually speak about an information age while tightening the laws relating to the dissemination of information. The contradiction is obvious. This problem has also affected Australia (forcing Project Gutenberg of Australia to withdraw dozens of books from its collections) and several European countries. In a number of countries, the rule is now life of the author plus 70 years, instead of life plus 50 years, following pressure from content owners, with the subsequent "harmonization" of national copyright laws as a response to the "globalization of the market".

But there is still hope for some books published after 1923. According to Greg Newby, director of PGLAF (Project Gutenberg Literary Archive Foundation), one million books published between 1923 and 1964 could also belong to the public domain, because only 10% of copyrights were actually renewed. Project Gutenberg tries to locate these books. In April 2004, with the help of hundreds of volunteers at Distributed Proofreaders, all Copyright Renewal records were posted for books from 1950 through 1977. So, if a given book published during this period is not on the list, it means the copyright was not renewed, and the book fell into the public domain. In April 2007, Stanford University used this data to create a Copyright Renewal Database, searchable by title, author, copyright date and copyright renewal date.

7. FROM THE PAST TO THE FUTURE

The bet made by Michael Hart in 1971 succeeded. Project Gutenberg counted 10 books online in August 1989; 100 books in January 1994; 1,000 books in August 1997; 2,000 books in May 1999; 3,000 books in December 2000; 4,000 books in October 2001; 5,000 books in April 2002; 10,000 books in October 2003; 15,000 books in January 2005; 20,000 books in December 2006 and 25,000 books in April 2008.

But Project Gutenberg's results are not only measured in numbers, which can't compete yet with the number of print books in the public domain. The results also include the major influence that the project has had. As the oldest producer of free books on the internet, Project Gutenberg has inspired many other digital libraries, for example Projekt Gutenberg-DE for classic German literature and Projekt Runeberg for classic Nordic (Scandinavian) literature, to name only two, which started respectively in 1992 and 1994.

Project Gutenberg keeps its administrative and financial structure to the bare minimum. Its motto fits into three words: "Less is more". The minimal rules give much space to volunteers and to new ideas. The goal is to ensure its independence from loans and other funding and from ephemeral cultural priorities, to avoid pressure from politicians or economic interests. The aim is also to ensure respect for the volunteers, who can be confident their work will be used not just for decades but for centuries. Volunteers can network through mailing lists, weekly or monthly newsletters, discussion lists, wikis and forums.