The Project Gutenberg FAQ 2002

Chapter 1


ODYSSEY X, 24-56. 173

ODYSS.EY XI, %4-56. 173 lef white incal thereon, and entreated with many prayers strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fil the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad hesought the tribes of the d with vows and prayers, 1 took the sheep and cut their s over the trench. and the dark blood flowed forth, he spirits of the dead that he departed gathered from out of Erebus.

It's clear that we should tidy up the page headings and numbers that have been scanned in with the main text, and that we should separate the paragraphs and remove the spaces inserted by the scan at the start of some lines. We also need to restore some of the text that got lost in the scan. Since there isn't much of it, we just type it in. Having done this, we get to . . .

Stage Two--First pass through the scanned text

At this point, we have a complete text. All of the words are actually there, and we have eliminated page breaks and other extraneous artifacts of scanning. Again, mileage varies: some people like to preserve page breaks and numbering until much later, to make it easy to refer back from the e-text to the book.

Our job in this phase is to fix all of the obvious scanning errors and double-check that we really do have all the text. Our aim here is to create an e-text that is ready for First Proof. In fact, since it's fairly clear what all the words are, this text could be considered ready for first proof.

1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And 1 sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad besought the tribes of the dead with vows and prayers, 1 took the sheep and cut their throats over the trench. and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Now we convert those numeral 1s to capital Is or to quotes, as appropriate; we straighten up the quotes; and we deal with other obvious scanning errors, which brings us to . . .

Stage Three--The First Proof

At this point, we could hand over the text to an experienced proofer who doesn't have a copy of the book. This would be called a "first proof". An e-text is at first proof stage when there are still plenty of errors, but in each case it's pretty obvious what the correct word is. The excerpt now looks like normal text.
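The numeral-1 cleanup from stage two lends itself to a first automated pass. What follows is only a rough sketch in Python: the function name fix_numeral_one and its rule (a lone "1" followed by a lowercase word becomes "I") are illustrative assumptions, not a PG tool. A genuine numeral in the same position would be converted too, and a "1" that stands for an opening quote is left alone, so a human still has to review every change.

```python
import re

# A lone "1" followed by whitespace and a lowercase letter is probably a
# misread capital "I". This rule is an assumption made for illustration.
LONE_ONE = re.compile(r"\b1\b(?=\s+[a-z])")

def fix_numeral_one(text):
    """Replace a standalone numeral 1 with capital I when a lowercase word follows."""
    return LONE_ONE.sub("I", text)

sample = "But when 1 bad besought the tribes of the dead, 1 took the sheep."
print(fix_numeral_one(sample))
```

Numbers such as the page number "173" are untouched, because the pattern only matches a "1" that stands alone as a word.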

Unfortunately, in stage two above, we accidentally deleted a line.

'There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And I sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea I would offer in my halls a barren heifer, Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when I bad besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Stage Four--Corrections from First Proof

We receive the first proof back from the proofer, and find that it has been mostly corrected.

The corrections made were "l/I", "there after/thereafter", "prornised/promised", "bad/had", and "rarn/ram".

We have also wrapped the lines--at 60 characters in this case, but it is commonly as much as 70 characters per line. Sentences which look wrong, but where it isn't clear what the right text should be, have been marked with asterisks (*).

'There Periniedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white incal * thereon, and entreated with many prayers the strengthless beads of the dead, and promised that on my return to Ithaea I would offer in my halls a barren heifer, * Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
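The 60-character wrapping described above can be done by hand in an editor, or with a few lines of script. Here is a minimal sketch using Python's standard textwrap module; the helper name wrap_paragraph is hypothetical, and the width is simply the 60 used in this example.

```python
import textwrap

def wrap_paragraph(paragraph, width=60):
    """Re-wrap one paragraph so that no line exceeds `width` characters."""
    # Collapse existing line breaks and runs of spaces before wrapping.
    single_line = " ".join(paragraph.split())
    return textwrap.fill(single_line, width=width)

paragraph = ("'There Perimedes and Eurylochus held the victims, but I drew "
             "my sharp sword from my thigh, and dug a pit, as it were a "
             "cubit in length and breadth.")
wrapped = wrap_paragraph(paragraph)
print(wrapped)
```

Apply this one paragraph at a time, so that the blank lines between paragraphs are preserved.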

We look up the text where the first proofer has asterisked it, and make the corrections.

The text is now ready for second proofing. An e-text is ready for second proofing when you can skim through the text without noticing that there are errors.

We can either do a second proof ourselves, or send it out for second proofing.

Second proofing involves a very careful reading of the text, looking for small errors. In some ways, it's much harder than first proofing, since it's very easy to let your eyes run on auto-pilot and in doing so, miss subtle errors.

Having performed the second proof, which caught errors like "beads/heads", "Ithaea/Ithaca", "Periniedes/Perimedes" and "he/be", we now have our final e-text.

'There Perimedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white meal thereon, and entreated with many prayers the strengthless heads of the dead, and promised that on my return to Ithaca I would offer in my halls a barren heifer, the best I had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that be departed gathered them from out of Erebus.

Hooray! At long last we have an e-text to post, which can be downloaded, read and enjoyed by anyone in the world from now on.

About Net searching:

V.62. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?

You can submit it, but you can't "just" submit it.

We wish we could give a permanent home to all the etexts that people have produced and placed on the Net, but without proof of their public domain [C.10] status, we can't.

We need to be able to prove that the eBooks we publish are in the public domain, so, in order to use one of the many texts that are just floating around the Net, you need to find a matching paper edition that we can prove is eligible [V.18].

(By the way, please be sure that it isn't already in the PG archive. A lot of texts circulating on the Net originated at PG, and people quite often submit them back to us.)

Before you get into this, you should check whether the text you have found is likely to be in the public domain in the U.S. A quick way to verify this is to hit the Library of Congress Catalog site at <http://catalog.loc.gov> and search for the title or author. If you find no publications before 1923, then you should probably move on; the Library of Congress doesn't list every book, and in particular doesn't list all books published outside the U.S., but, if there isn't a pre-1923 copy there, it may be difficult to follow up on. If you're not dissuaded, do a search on the Net for used book shops that might have pre-1923 copies.

Sometimes, with a text on the Net, you know who typed it; it's on someone's website, or the transcriber is named in the text. Sometimes, the text has just been floating around Usenet or old gopher sites for years, with no attribution.

The first thing to remember is that we would like to give credit to the original transcriber if they want it, and if we can identify them.

The next thing to consider is that the original transcriber may well have an eligible copy of the book, and may be able to provide TP&V [V.25] for it.

So, if you can locate the original transcriber, it makes sense to e-mail them, explain what you propose to do, and ask them whether they can help with copyright clearance and whether they would like to be credited in the PG edition. Often, you will get no response, or a response but no prospect of material that will help with clearance, but sometimes you will get lucky.

If the transcriber can't help with TP&V, it's up to you to find a matching paper edition of the same book. This may not be as hard as it sounds. Libraries can help, and may get editions for you on interlibrary loan.

This is an ideal way for students, academics and librarians to contribute texts to PG, since you probably have access to a good library with stocks of old books to find matching paper editions.

If you find a matching paper edition, you then need to compare the etext you found with the book. Legally, what we're trying to prove here is that we have done "due diligence"--that we have done our best to prove that the etext is indeed a copy of a public domain work.

The minimum "due diligence" we can perform is to compare the first and last pages of each chapter, (or every 20 pages where the book is not neatly divided into chapters of about that size). You should list all of the differences between the book and the etext that you find on those pages. It is to be expected that there will be some minor differences of punctuation, spacing and spelling, and even perhaps of wording. Minor differences are OK, but we do need to list them, to prove that we did the comparison. When you have your lists, you can send in the TP&V as normal, accompanied by your lists, for clearance.
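If you have typed in or scanned the sample pages from the paper book, a script can draft the list of differences for you. The following is a sketch only, assuming Python and its standard difflib module; the function name list_differences is hypothetical, and it compares word by word, so differences in spacing alone are not reported.

```python
import difflib

def list_differences(book_page, etext_page):
    """Return (book wording, etext wording) pairs wherever the two texts differ."""
    book_words = book_page.split()
    etext_words = etext_page.split()
    matcher = difflib.SequenceMatcher(a=book_words, b=etext_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means words were added or dropped.
            differences.append((" ".join(book_words[i1:i2]),
                                " ".join(etext_words[j1:j2])))
    return differences

book = "the strengthless heads of the dead"
etext = "the strengthless beads of the dead"
for book_text, etext_text in list_differences(book, etext):
    print("book:", repr(book_text), " etext:", repr(etext_text))
```

The output is a starting point for your list, not a replacement for reading the pages yourself.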

Many texts floating round without attribution, and indeed many with attribution, could do with a thorough checking, and another option you have is "comparative retyping", where you go through the whole etext, proofing it carefully against the cleared paper book, and changing everything that is different in the etext to match the paper edition. If you do this, you don't need to produce a list of differences, since there won't be any by the time you've finished; you can just submit it as a normal text--_and_ it may well be a lot cleaner! However, if you do take this path, please do a very thorough job on the proofing and comparison.

If the etext you find has been marked up, in HTML for example, you should remove all HTML for the PG edition, because, even though the text itself has been proved to be in the public domain, the original transcribers may hold copyright on the HTML markup, even if you can't find them. If you do want to make an HTML edition of it for PG, strip out all of the original markup and then re-add your own markup.
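Removing the markup need not be done by hand. As a minimal sketch, assuming Python and its standard html.parser module, the following keeps only the text between tags; the class and function names are hypothetical. It makes no attempt to skip script or style sections or to rebuild paragraph breaks, so check the result carefully.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text content of an HTML document and discard every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; entities such as &amp; arrive decoded.
        self.parts.append(data)

def strip_markup(html_text):
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

print(strip_markup("<p>the <i>best</i> I had &amp; more</p>"))
```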

If you do find the producer and he or she wants to be identified, you may submit a double credits line like:

Transcribed by Sally Wright <[email protected]>
Produced for PG by You <[email protected]>

V.63. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Why should I submit it to PG?

The first reason is file safety.

Yes, we accept that the file is already available to everyone today, but it may not be safe in the long term. We've seen college students who put books on their personal site, and then lose that site when they graduate. We've seen individuals who transcribe several books, and later lose interest, or move, or die, and the work they've done is lost. We've seen small projects with a few volunteers who produce and post books for a few years, but then break up or run out of funds to maintain their site. We've seen large institutions drop their collections as part of a cost-cutting exercise. We've even seen organizations lock public domain works up behind licenses, requiring users to commit to registration and a "no copying" agreement before downloading them.

Whenever a set of etexts is published and distributed by only one person or organization, there is a danger that their etexts will disappear from the Net sometime. We want _all_ etexts to be spread as widely as possible, copied as much as possible, so that no one event or loss, or whim of a sponsor, can obliterate them.

We think that the PG collection is, for that reason, the safest place to put a text for its long-term survival. There are copies of the PG archives all over the world, on public servers and private CDs. PG publications are widely converted, collected and read on PDAs. Other text projects copy works from PG.

The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it. Even if PG itself decided to withdraw all our texts, we couldn't do it, because so many people have made copies.

The second reason is legal safety.

Unlike some other projects and individual efforts, PG retains documentary proof of the public domain status of its texts. This is more valuable than it might appear at first glance.

Publishers often claim a new copyright [C.17] on works that they republish, and as time goes on, it becomes harder and harder to prove that a particular book is in the public domain. Walk into your local bookstore and check out how many works by Shakespeare, Poe, Dickens, and Twain have copyright notices on them! People who want to translate these, or create derivative works like screenplays or lyrics or films must first prove that they are basing their work on a public domain edition, but the creeping copyright practices of commercial publishers make that difficult.

Here's a practical example: we were approached by a film student who wanted to make a short piece based on characters from James Joyce's "Ulysses". But before he could do that, he needed to confirm that the material on which he was basing his movie was in the public domain, and all the editions he could find were copyrighted. However, because PG had already established the public domain status of Ulysses, we could point him to our established PD version, and even tell him where to find a paper copy published in 1922. Without that evidence, he could not have made his project.

V.64. I have already scanned or typed a book; it's on my web site. How can I get it included in the Gutenberg archives?

Great! We get these a lot, but it's always nice to see another!

You need to send us the TP&V [V.25] so that we can prove that your edition is in the public domain. If you don't have the TP&V, you will need to find a matching paper book with eligible TP&V for us to be able to use it.

V.65. I have already scanned or typed a book; it's on my web site. The world can already access it. Why should I add it to the Gutenberg archives?

The Project Gutenberg archives are widely copied and searched, and much safer and more permanent than any individual website can possibly be. We aim to keep this collection together over not just years, but centuries. You took the trouble to transcribe this book. We can relate; that's what _we_ do, as well. We know you want this work to survive you and your ISP, and we believe we can do that. And it's not as if you have to take it off your website when we make a copy; you're just using your candle to light another!

If you want to let readers know that your site has other related material, you can put that information in the Credits Line [V.47]. Taking a real-world example, you could ask us to add this to the Credits line for a C. M. Yonge text:

A web page for Charlotte M. Yonge will be found at www.menorot.com/cmyonge.htm

V.66. I have already scanned or typed a book, but it's not in plain text format. Can I submit it to PG?

Yes, of course. We'll be happy to discuss format options with you, and we're quite experienced in converting between multiple formats and deciding which formats work best and will have the longest life. All you need is to get us a copy of your TP&V [V.25].

About author-submitted eBooks:

V.67. I've written a book. Will PG publish it?

Maybe.

PG gets submissions from young people, for example, who just want to get a story they wrote published in PG. We wish them well with their writing, but that's not really why we're here.

If you are a published author, or perhaps an academic who wants to put a textbook into the archives, it's quite likely that we will publish it.

V.68. I have translated a classic book from one language to another. Will PG publish my translation?

Yes, if we can.

The book that you translated needs to be in the public domain, and we will need the same proof of eligibility that we would use if you were contributing the book in its original language.

For example, if you were translating Hesse's Siddhartha (published pre-1923 in German, but no pre-1923 English translation available), we would need to copyright clear [V.25] the original German edition from which you worked--it needs to be a pre-1923 or otherwise public domain edition. (We actually did this one, thanks to the hard work and scholarship of some volunteers.)

V.69. OK, this is one of the cases where PG will publish it. What do I do next?

You need to decide about copyright issues. Do you want to release your work to the public domain, or do you want to retain copyright? If you want to retain copyright, what terms do you want to release it under? The next few questions deal with those issues.

Having decided that you want PG to publish it, and decided what restrictions (if any) you want to place on further distribution, you just need to write the appropriate letter and send the text to us. [V.46]

V.70. I hold the copyright on a book. Can I release it to the public domain?

You can. All you need to do is put a statement into the released version of the text saying that you have.

If you want to release it into the public domain and distribute it through Project Gutenberg, you should send us a letter to that effect.

To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to release this work into the public domain, and I invite Project Gutenberg to publish this public domain edition.

Sincerely,

Gregory B. Newby

Once you have released it into the public domain, neither we nor anyone else needs your permission to publish it, but for us to be sure that it _is_ a public domain version, we do need a signed letter.

V.71. I hold the copyright on a book. Do I have to release the book into the public domain for Project Gutenberg to publish it?

Absolutely not! For example, many contributors of copyrighted material want to share it with the world, but do not want it commercially republished by other companies.

You can grant Project Gutenberg perpetual, non-exclusive, world-wide rights to distribute your book on a royalty-free basis by sending a letter to Michael Hart. Your letter may be brief, but must be signed, and must include the name of the book and the assertion that you are the copyright holder or the agent for the copyright holder.

If you want some related information, like a link to your website, included in the text, we will be happy to oblige.

Once we have posted a text, many people will copy it. We have no effective mechanism for "recalling" texts that we have posted, so please be sure, before you commit to this, that you intend to follow through with it, because there is no way to change your mind later.

Here is a sample letter, including the address to send it to:

To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to grant Project Gutenberg perpetual, worldwide, non-exclusive rights to distribute this book in electronic form through Project Gutenberg Web sites, CDs or other current and future formats. No royalties are due for these rights.

Sincerely,

Gregory B. Newby

V.72. I hold the copyright on a book, and would like Project Gutenberg to publish it. Can I choose what rights to assign?

For PG to be in a position to copy it, we do need perpetual, worldwide, non-exclusive, royalty-free rights to distribute the book in electronic form. What rights you choose to assign to readers after that is a decision for you to make.

The Creative Commons site <http://www.creativecommons.org> may give you some ideas of what practical use you can make of your copyright to see that the work is used in the ways you intended.

About what goes into the texts:

V.73. Why does PG format texts the way it does?

PG texts are formatted as plain ASCII, with 60-70 characters per line, with a hard return [CR/LF] at end of line, and some people ask "Why do it _this_ way? You could omit the hard returns and let the reader's word processor or Reader software wrap the lines. You could use '8-bit' accented characters for non-English characters. You could use ' - ' instead of '--' for an em-dash." And so on, with a different choice we could make for every formatting feature. The answer, of course, is that we _could_ do it differently, and sometimes we do, but mostly we keep to one consistent style.

We'll be discussing each of the formatting decisions below, not only giving the summary PG answer, but also discussing the plusses and minuses of each, and the possible options.

Like any question beginning "Why does/doesn't PG . . . ?", the answer is "Because that's what the volunteers and readers want!". These conventions have been worked out over the years, largely by Michael Hart, our founder and chief volunteer, in conjunction with all of us volunteers, as the result of feedback from readers.

We are guided throughout by the principle that we want to produce texts in the simplest format that will adequately express the content. Quoting Michael Hart (1994):

Etext as developed and distributed by Project Gutenberg since 1971 was never intended to be a copy of a paper or a parchment [remember, first Project Gutenberg Etext was typed in from parchment replicas of the US Declaration of Independence].

The major purposes of Project Gutenberg have always been:

1. to encourage the creation and distribution of electronic texts for the general audience.

2. to provide these Etexts in a manner available to everyone in terms of price and accessibility [i.e. no special hardware or software], and no price tag attached to the Etexts themselves.

3. to make the Etexts as readily usable as possible, with no forms or other paperwork required, and as easily readable to the human eyes as to computer programs, and in fact, more readable than paper.

There is sometimes a conflict between "simplest format" and "adequately express the content"; further, different people have different views on what is "simple" or "adequate". You, the producer of the text, have spent the time and effort to make the eBook available to the world, you have thought more about it than anyone else, and we respect your informed judgment. However, please make sure that your judgment _has_ been informed, by studying the precedents and reasons behind our guidelines.

Where a simple, standard PG-ASCII layout does not, in your view, "adequately express the content", you should think of making your text in another open format, perhaps HTML or XML or TeX, that allows you to use more characters, more formatting options, and images. We are always happy to accept these kinds of files. In these cases, you should also provide a standard PG-ASCII version, even if you feel it is unacceptably degraded, for those who cannot use your preferred format.

Just ten years ago, presentation as plain ASCII was not only a universal standard, it was effectively the only way that most people could view the books. The first version of the HTML specification had been drafted, but was unknown among the general public. XML did not exist. SGML was (as it still is) the province of specialists. Specialized eBook readers and PDAs had not yet appeared.

In 2002, plain vanilla ASCII is still readable everywhere, but people also want to convert our texts into other formats for more convenient loading on readers and web sites. We therefore have to keep in mind that our works will be processed by automatic conversion programs, none of which is perfect, and we have evolved some "defensive formatting" practices, which, while retaining the universality of plain text, also supply clues to automatic converters about how they should treat the layout. These do help to keep converters from making at least the worst mistakes. The most significant "defensive formatting" practices are indenting unwrappable text like quotations, and using _underscores_ rather than CAPITALS for italics.

Different volunteers have different priorities: at one extreme, some people want to make the best plain text they can, giving no weight to conversion issues; at the other, some people emphasize the cues that will allow automatic reformatters to convert the texts well, even if that causes some ugliness in the plain text. Most of us operate somewhere between, making the choices we feel are best depending on the context. Getting a text on-line is the important thing; which choices you make in doing so is a matter of detail.
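To see why underscores are a useful cue, consider how an automatic converter might treat them. The following is a sketch only, assuming Python; the function name underscores_to_html is hypothetical and does not describe any particular converter. It turns _underscored_ runs into HTML italics, which would not be possible with CAPITALS, since a program cannot tell emphasis from a word that was simply printed in capitals.

```python
import html
import re

# Text between a pair of underscores on one line is treated as italic.
ITALIC = re.compile(r"_([^_\n]+)_")

def underscores_to_html(line):
    """Escape a line of plain text for HTML and mark _underscored_ runs as italic."""
    escaped = html.escape(line, quote=False)
    return ITALIC.sub(r"<i>\1</i>", escaped)

print(underscores_to_html("We want _all_ etexts to be spread as widely as possible."))
```

A rule this simple will also catch underscores used for other purposes, which is one reason no converter is perfect.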

About the characters you use:

V.74. What characters can I use?

a) You should use plain ASCII for straight English texts.

b) When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for the language, and specify which you are using, and also provide a 7-bit plain ASCII version with the accents stripped.

c) When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese--Big 5]

d) When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.

You should use plain ASCII wherever possible--that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.

There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor just shows as a black box in another editor, or even using a different font in the same editor. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.

Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.

We want to preserve these texts over _centuries_, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may perhaps be a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.

ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it--one with accents and one without.
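Producing the unaccented version can be partly automated. Here is a minimal sketch, assuming Python and its standard unicodedata module; the function names are hypothetical. It removes accents by splitting each accented letter into a base letter plus a combining mark and dropping the mark. Letters with no such decomposition (the German sharp s, for example) come out as "?", so they still need a human decision. Non-ASCII characters are written as escapes here so that the example itself stays plain ASCII.

```python
import unicodedata

def strip_accents(text):
    """Drop combining accent marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_7bit_ascii(text):
    """Return a 7-bit version of text; anything still non-ASCII becomes '?'."""
    return strip_accents(text).encode("ascii", errors="replace").decode("ascii")

# "deja vu" with e-acute and a-grave, and "Muller" with u-umlaut.
print(to_7bit_ascii("d\u00e9j\u00e0 vu"))
print(to_7bit_ascii("M\u00fcller"))
```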

V.75. What is ASCII?

Don't get scared by the computer jargon; ASCII (pronounced ASS-key) is just a name for the set of unaccented letters, numbers and other symbols on a standard U.S. keyboard.

ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.

Just about every computer in the world can show ASCII characters correctly, which makes it ideal for PG's purpose of providing texts that can be read by anyone, anywhere, but ASCII does not include accented characters, Greek letters, Arabic script and other non-English characters, which causes some problems when we produce texts that need non-ASCII characters.

V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252? What is MacRoman?

Today's computers mostly work on the basis of dealing with one "byte" at a time. A byte is a unit of storage that can contain any number from 0 through 255--256 values in all. It's very convenient for computers to associate one character with each of these numbers, so that we can have up to 256 "letters" viewable from the values stored in one byte. The first 128 values, zero through 127, are defined by ASCII--so, for example, in ASCII, the number 65 represents a capital "A", 97 represents a lowercase "a", 49 stands for the digit "1", 45 for the hyphen "-", and so on.
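You can check these numbers for yourself. The short Python sketch below prints the values for the four characters mentioned, and adds a hypothetical helper, is_plain_ascii, that reports whether every character in a piece of text falls in the ASCII range 0 through 127.

```python
def is_plain_ascii(text):
    """True if every character has a code from 0 through 127."""
    return all(ord(ch) < 128 for ch in text)

for ch in ["A", "a", "1", "-"]:
    # ord() gives the number that represents the character.
    print(repr(ch), ord(ch))

print(is_plain_ascii("Project Gutenberg"))
# The escape below is e-acute, which lies outside ASCII.
print(is_plain_ascii("caf\u00e9"))
```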

ASCII doesn't define characters for the values 128 through 255, and in early days computer manufacturers used these values to hold non-ASCII characters like accented letters and box-drawing lines. Of course, 128 wasn't nearly enough values to hold all of the characters that people needed to use for different languages, so they made the character sets switchable, so that a PC in France could use a different set of accented letters from a PC in Poland. Microsoft's version of this was called Codepages. Each Codepage held a different set of non-ASCII characters. Codepage 437, and later Codepage 850, were commonly used for English and some major Western European languages on MS-DOS.

MacRoman was Apple's first codepage, containing most of the accented letters in Latin-derived languages, and MacRoman is still in common use on Apple Macs today.

Later, the International Organization for Standardization (ISO) got around to looking at the problem, and defined ISO-8859-1, ISO-8859-2 and so on, as the standards for different language groups. These sets all define the characters 160 through 255 as accented letters and other symbols, and define the 32 characters from 128 through 159 as control characters.

Since Microsoft Windows has no use for the control characters 128 through 159, Windows fonts commonly use Codepage 1252, which has ASCII in the first 128 characters, ISO-8859-1 in characters 160 through 255, and other symbols in the characters 128 through 159. Just to make an already chaotic system worse, all characters can be defined differently in different fonts!

Of course, most of these codepages are incompatible with each other. For example, the byte value 232 shows as a lower-case "e" with a grave accent in ISO-8859-1 and CP1252, a capital letter "E" with diaeresis in MacRoman, a Latin capital letter "Thorn" in CP850, a Cyrillic lower-case "Sha" in ISO-8859-5, a Greek capital letter "Phi" in CP437, and so on. So if you view a text intended for one of these character sets with a program that assumes a different character set, you see gibberish.
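The byte-232 example can be reproduced with any language that ships conversion tables for these character sets. As a sketch, assuming Python and the codec names used by its standard library, the following decodes the same single byte under each character set and prints the official Unicode name of the result. The function name interpretations is hypothetical.

```python
import unicodedata

# Python's names for the character sets discussed above.
CODECS = ["latin_1", "cp1252", "mac_roman", "cp850", "iso8859_5", "cp437"]

def interpretations(byte_value):
    """Map each codec name to the Unicode name of the character it assigns."""
    raw = bytes([byte_value])
    return {codec: unicodedata.name(raw.decode(codec)) for codec in CODECS}

for codec, name in interpretations(232).items():
    print(f"{codec:10} {name}")
```

The names are printed instead of the characters themselves, since how the characters display depends on your own system's character set.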

The good news, for mostly-English texts at least, is that ISO-8859-1, Codepage 1252 and Unicode agree on the numerical values of the accented characters and symbols to be represented by the values 160 through 255. And everybody accepts ASCII--a pure ASCII file is valid ISO-8859-anything, valid Codepage-anything, and valid Unicode UTF-8.
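If you want to check this agreement for yourself, the following Python sketch compares the three for every value from 160 through 255, and confirms that a pure ASCII sample decodes identically everywhere. The function names are invented for the illustration, and the codec names are Python's.

```python
def high_range_agrees():
    """True if bytes 160-255 mean the same character in ISO-8859-1,
    Codepage 1252 and Unicode (where chr(n) is code point n)."""
    for value in range(160, 256):
        raw = bytes([value])
        if raw.decode("iso8859_1") != chr(value):
            return False
        if raw.decode("cp1252") != chr(value):
            return False
    return True

def ascii_reads_the_same(data):
    """True if pure-ASCII bytes decode to identical text under ASCII,
    ISO-8859-1, Codepage 1252 and UTF-8. Assumes data really is ASCII."""
    names = ("ascii", "iso8859_1", "cp1252", "utf-8")
    return len({data.decode(name) for name in names}) == 1
```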

For more detail about the mappings between Unicode and other formats, you can view Unicode<-->ISO-8859 mappings at ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/ Unicode<-->Windows mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/ and Unicode<-->Apple mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/

If you're not confused enough by now, please read the excellent guide to the whole "alphabet soup" problem at <http://czyborra.com>.

V.77. What is Unicode?

Recognizing that no single set of 256 characters can hold all of the symbols necessary for true multi-lingual texts, ISO 10646 was created. This defined the Universal Character Set (UCS) using 31 bits, which has the potential for a staggering _2 billion_ characters.

The Unicode Consortium is a group of computer industry companies who agree the Unicode standard. Unicode accepts the ISO 10646 standards, and adds some restrictions and implementation processes. It plans for a modest million or so characters; however, this is enough for all living and extinct languages, and imaginable future ones too.

Using 4 bytes for each character is wasteful, though, when most characters need only one or two, and there are programming problems with implementing 4-byte characters, so Unicode provides Transformation Formats (UTF) which allow the characters to be encoded using fewer bytes where possible. UTF-8 and UTF-16 are common.

UTF-8, which is the most practical of these from the PG point of view, allows ASCII to be encoded normally, one byte per character, and usually uses two or three bytes for non-ASCII characters.
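A small Python sketch makes the byte counts concrete. The sample characters are my own choices, written as escapes so that the example itself stays plain ASCII.

```python
def utf8_length(char):
    """Number of bytes UTF-8 uses to encode one character."""
    return len(char.encode("utf-8"))

SAMPLES = {
    "plain ASCII e": "e",
    "e with acute accent": "\u00e9",
    "Greek small delta": "\u03b4",
    "an Ogham letter": "\u1681",
    "a Devanagari letter": "\u0905",
}

if __name__ == "__main__":
    for label, char in SAMPLES.items():
        print(f"{label:22} {utf8_length(char)} byte(s)")
```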

Because of the extra work needed to support this extra space, and the fact that most people work mostly in one or maybe two languages, Unicode is being adopted only slowly, and most computer programs in 2002 do not fully support it. But when you need to mix Arabic, Greek, Ogham and Sanskrit in one text, it's the only possible answer!

For more about this, go straight to the source at <http://www.unicode.org>.

V.78. What is Big-5?

Big 5 is an encoding of a set of 13,000+ traditional Chinese characters.

V.79. What are "8-bit" and "7-bit" texts?

For practical purposes, 7-bit texts are plain ASCII; 8-bit texts have accented letters.

This comes from computer jargon. You can represent the 128 characters of ASCII using 7 bits--binary digits--but to represent the 256 characters of the various codepages and ISO-8859 standards, which add accented letters and other symbols, you need 8 bits. Hence, we call a text that uses non-ASCII characters in a character set like Codepage 850 or ISO-8859-1 an "8-bit" text.
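In practical terms, a file is 7-bit if every byte in it has a value below 128. A minimal Python check, with a function name made up for this sketch, might look like this:

```python
def is_seven_bit(data):
    """True if every byte is below 128, i.e. the data is plain ASCII."""
    return all(byte < 128 for byte in data)

def is_seven_bit_file(path):
    """Same check for a file on disk, read as raw bytes."""
    with open(path, "rb") as handle:
        return is_seven_bit(handle.read())
```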

When we post a text as both 8-bit and 7-bit, as we do when ASCII is not enough to render the text acceptably, we name the file with an "8" or a "7" at the start. So, for example, Crime and Punishment by Dostoevsky is named 8crmp10 for the 8-bit version with accents, and 7crmp10 for the 7-bit version without accents.

See also FAQ [R.35]: "What do the filenames of the texts mean?"

V.80. I have an English text with some quotations from a language that needs accents--what should I do about the accents?

If stripping the accents would unacceptably degrade the book, then submit two versions, one "8-bit" with the accents included and one "7-bit" plain ASCII, and we will post both.

This is a hard choice. What constitutes "unacceptable degradation"?

Clearly this is a decision that all of us in PG have to make. It's a very common problem, and different people have different views. For that matter, different print publishers have different views; you will see the words "debris", "facade" and "cafe" printed with and without accents in different books, and even in different editions of the same book.

We don't want to post two versions when we don't have to. It doubles the posting work, doubles the disk space needed, potentially confuses downloaders, and doubles the maintenance when we need to correct the text. On the other hand, we don't want to degrade the text.

There is no clear line, no definitive answer to what level of degradation is acceptable. Most producers feel that there is no point in making a separate version when dealing only with a few foreign words thrown in among the English, but when, for example, some significant dialog between the characters is in French or Spanish, it's harder to say that stripping the accents is acceptable. You, the producer, need to decide this on a case-by-case basis. If you're not sure, discuss it with one of the Directors of Production or one of the Posting Team.

If you have made the text with accents, you can choose to make your own 7-bit version and send it to us, or just send the 8-bit version and we'll make the 7-bit version from it. Some people prefer to make their own 7-bit editions; some don't. Whether you use a Microsoft Codepage, one of the ISO standards or MacRoman doesn't matter--we can convert any of them for you.
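If you are curious what making a 7-bit version involves, here is one possible approach sketched in Python. This is an assumption about how it could be done, not a description of the tools the Posting Team actually uses: it splits each accented letter into a base letter plus a separate accent mark, then drops everything that is not ASCII.

```python
import unicodedata

def to_seven_bit(text):
    """Strip accents by decomposing characters and discarding non-ASCII.

    Caution: characters with no decomposition (for example the German
    sharp s, or curly quotes) are silently dropped, so the result still
    needs checking by eye.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def convert_bytes(data, encoding):
    """Decode 8-bit data in the named character set, then strip accents."""
    return to_seven_bit(data.decode(encoding))
```

For example, convert_bytes(b"caf\xe9", "iso8859_1") gives "cafe".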

V.81. I have some Greek quotations in my book. How can I handle them?

There is no way to show Greek letters in ASCII. You have three options:

You can just replace the Greek words with [Greek] to indicate to the reader that you have omitted it.

You can "transliterate" the Greek to ASCII. Greek letters do have a correspondence to plain "Latin" letters--for example, the Greek letter "delta" can be represented by the letter "d". There is a simple PG guide to transliteration at <http://www.promo.net/pg/vol/greek.html>. This practice has had a long and honorable history: words like "amphora" and "hubris", for example, are straight transliteration from the Greek. This is usually the best option.

If there is enough Greek to warrant it, and no other accented characters, you may be able to use the ISO-8859-7 character set, and submit both 7-bit and 8-bit versions [V.79]. ISO-8859-7 is for modern rather than classical Greek, but, if necessary, you will surely be able to express the Greek fully in Unicode. However accurate your Greek, that still leaves the issue of what to do with the 7-bit ASCII version, where transliteration is probably still your best bet.

V.82. I want to produce a book in a language like Spanish or French with accented characters. What should I do?

Use the appropriate ISO-8859 character set [V.76] for your 8-bit version.

About the formatting of a text file:

This section of the FAQ goes into great detail about all kinds of formatting questions. However, looked at from a higher level, the only real issue is that we want to render texts clearly, with formatting that reflects the original, so that readers of the plain text format can read them easily, and people converting them to other formats can do so reliably. When you come across a case that is not covered by the detailed guidelines below, keep this ultimate aim in mind, and make the best decision you can. Don't get hung up for hours or days over a question of formatting--if you want advice, look at how other people have handled the same situation in previous texts, or ask other volunteers for their ideas.

V.83. How long should I make my lines of text?

For normal prose, such as you find in a novel, your lines should mostly be 60 to 70 characters long, not shorter than 55, not longer than 75 except where it can't be helped. Never, ever longer than 80, except where you're trying to render a non-text structure, like a family tree.
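A quick way to find lines that break these limits is a small script. The sketch below, in Python, only flags long lines, since the last line of any paragraph is naturally short; the function name and the wording of the labels are invented for the example.

```python
def long_lines(text, soft_limit=75, hard_limit=80):
    """Return (line number, length, label) for each over-long line."""
    report = []
    for number, line in enumerate(text.splitlines(), start=1):
        length = len(line)
        if length > hard_limit:
            report.append((number, length, "over hard limit"))
        elif length > soft_limit:
            report.append((number, length, "over soft limit"))
    return report
```

Lines flagged as over the hard limit are acceptable only for non-text structures such as a family tree, so they deserve a look by eye rather than an automatic fix.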

For poetry, make the text look as much like the book as possible. This also applies to some plays where the lines are clearly intended to be broken at specific points, whether blank verse or not.

V.84. Why should I break lines at all? Why not make the text as one line per paragraph, and let the reader wrap it?

We could either use 70-character lines and let readers unwrap them if they want to, or use infinite-length lines and let readers wrap them if they want to. We choose to wrap the lines so that they are readable on even the simplest of text editors and viewers.
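Either direction is easy to do by program, which is part of why the choice is not critical. As a sketch, Python's standard textwrap module will wrap a paragraph, and joining the words back together unwraps it. The function names are my own.

```python
import textwrap

def wrap_paragraph(paragraph, width=70):
    """Wrap one paragraph so that no line is longer than width."""
    return textwrap.fill(paragraph, width=width)

def unwrap_paragraph(paragraph):
    """Join a wrapped paragraph into one long line.

    Note that this collapses every run of whitespace to a single space,
    so two-space sentence endings are lost.
    """
    return " ".join(paragraph.split())
```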

V.85. Why use a CR/LF at end of line?

CR/LF can lead to double-spacing, notably on Mac and Unix, but at least there _is_ a CR in there for Mac users, and there _is_ an LF for *nix users.

If you don't know or care what this is about, please skip blithely on.

There are three differing standards for how to represent the end of a line of text. In brief, Apple Macs use the CR character. Unix and its variants use the LF character. Microsoft systems, from MS-DOS through Windows, use both together.

If you want the history behind these:

CR stands for Carriage Return, and comes from the old typewriter / teletype idea of a command to move the print head from the right of the page back to the left when it reaches the end;

LF stands for Line Feed, and comes from the old typewriter / teletype idea of a command to move the print head down a line;

CR/LF together indicate moving down a line and back to the left of the page.

The history is not relevant to today's computers in principle, but in practice they all use one of these legacy conventions, and there's nothing we can do about it but pick one.
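For those who do care, converting a text from any of the three conventions to CR/LF is a one-line substitution. Here is a minimal Python sketch; the function name is invented for the example.

```python
import re

def to_crlf(text):
    """Convert CR, LF or CR/LF line endings to CR/LF.

    The pattern tries CR/LF first, so an existing CR/LF pair is treated
    as one line ending rather than two.
    """
    return re.sub(r"\r\n|\r|\n", "\r\n", text)
```

If you use this on a file in Python, open the file with newline="" for both reading and writing, so that Python does not translate the line endings itself.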

V.86. One space or two at the end of a sentence?

Whichever you prefer, but if using two spaces, please use them only at the end of a sentence, not after abbreviations like "Dr." and "per cent.", and not after non-sentence-ending punctuation like the question-mark in the sentence: "Must you go? when the night is yet so black!"

Many people have strong views on either side of the "one space or two?" question, and we're not about to try and argue with them. Use whichever is most natural for you.

However, if using two, you take responsibility for deciding where the sentence ends. You can't just place two spaces after every period, question-mark and exclamation mark, since periods are also used for abbreviations and ellipses, and question-marks and exclamation-marks don't always end sentences.

V.87. How do I indicate paragraphs?

Just leave a blank line before each paragraph.

V.88. Should I indent the start of every paragraph?

No.

Printers do this when publishing paper books because they do not leave blank lines in the text, but there is no need for indenting in our eBooks.

V.89. Are there any places where I should indent text?

Yes. You should always make poetry look like the original, and that may mean indenting some lines, for example:

  I was a child and she was a child,
     In a kingdom by the sea;
  But we loved with a love that was more than love--
     I and my Annabel Lee;

Even when poetry doesn't have indented lines, it is a good idea to indent quotations embedded in prose. Remember, others will be converting your text later--to HTML, to PDA reader formats, to formats that don't even exist yet--and much of this conversion will be done automatically, by computer programs. It is very hard for a program to know when it can and can't re-wrap lines to fit a screen size unless it has a clear signal that _this_ line should not be wrapped. This is one of the biggest problems with auto-converting PG texts.

Just about all formatting programs "know" that lines that are indented shouldn't be wrapped, so by indenting lines just a space or two, you can prevent

  I think that I shall never see
  A poem lovely as a tree.

from turning into

I think that I shall never see A poem lovely as a tree.

in some future reader's eBook.

You don't really need to do this in texts where the whole book is poetry or blank verse, since these will probably be recognized as whole books that shouldn't be rewrapped, but when there are a few lines of quotation amid an acre of straight prose, a few spaces will be a life-saver. Even in the original plain text version, the extra spaces serve to set the quotation off from the main text.

You shouldn't get carried away and indent things 20 spaces for this reason, though. Anything up to four spaces is reasonable; more is excessive. If you're indenting many short verses in this way, keep your number of spaces for indentation consistent throughout the book.
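To see why the indent helps, here is a much-simplified Python sketch of the kind of rule a converter might apply: rewrap ordinary paragraphs, but leave alone any block that contains an indented line. Real conversion programs differ, and this one is only an assumption made for illustration.

```python
import textwrap

def rewrap(text, width=70):
    """Rewrap blank-line-separated paragraphs, preserving indented blocks."""
    blocks = []
    for block in text.split("\n\n"):
        lines = block.splitlines()
        if any(line.startswith(" ") for line in lines):
            # Indented: treat as verse or quotation and keep line breaks.
            blocks.append(block)
        else:
            # Plain prose: join the lines and wrap to the new width.
            blocks.append(textwrap.fill(" ".join(lines), width=width))
    return "\n\n".join(blocks)
```

Given the couplet indented by two spaces, this leaves both lines exactly as they were; given the same couplet with no indent, it runs them together into one line.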

There are some other times when you may judge it best to indent, where text is indented in the paper book, like newspaper headlines or pictures of handwritten notes.

V.90. Can I use tabs (the TAB key) to indent?

No.

The problem with tab characters is that they act differently in different applications. Typically a tab will move the text to the next tab stop, which might be four spaces on your PC, but 20, or none, on someone else's. The effects are unpredictable.
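You can see the effect with a few lines of Python, which expands the same tab-indented line using several different tab-stop settings. The function name and the particular widths are only examples.

```python
def show_tab_widths(line, widths=(4, 8, 20)):
    """Return {tab width: line with tabs expanded to that width}."""
    return {width: line.expandtabs(width) for width in widths}

if __name__ == "__main__":
    sample = "\tIn a kingdom by the sea;"
    for width, expanded in show_tab_widths(sample).items():
        print(f"{width:2}: |{expanded}")
```

The one tab character produces three different indents, while a line indented with real spaces would look the same in all three.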

V.91. How should I treat dashes (hyphens) between words?

In typography, there are four standard types of dashes: the hyphen, the en-dash, the em-dash, and the three-em-dash.

Originally, printers called these the "em-dash" because it was the same width as the capital letter M in whichever font they were using, the "en-dash" because it was the same width as the capital letter N, and the "three-em-dash" because it was as long as three capital Ms.

The hyphen is used for hyphenated words, like "en-dash" itself, or "to-day" or "drawing-room". For this, you just press the single dash or hyphen key on your keyboard.

In typography, the en-dash is a little longer than the hyphen, and is typically used for duration, where you could substitute the word "to". For example, if you were printing "1830-1874", or "9:00-5:30", you would use an en-dash instead of a hyphen. The en-dash is also sometimes used as hyphenation between words that are already hyphenated, for example, "bed-room-sitting-room" might use an en-dash as its central dash to emphasize that it is a different type of separator from the plain hyphens before "room". However, there is no ASCII character for an en-dash, and we use the hyphen in these cases. (HTML and some character sets do provide separate entities for en-dash and em-dash.)

The em-dash is shown in print as a longer dash, and for PG purposes, you should render it as two hyphens with no spaces around them.

You use the em-dash as a kind of parenthesis--as I am doing here--or to indicate a break in thought or subject within a sentence. There is no ASCII equivalent of the em-dash; there is no key on your keyboard that you can press to get one. For PG texts, we represent the em-dash as two dashes with no space between or around them--like this.

The em-dash can also be used at the end of a sentence or speech to indicate that the speaker stopped or trailed off. For example:

"When I saw you with Emily, I thought you were-- I thought she was--"

In a case like this, there may be a space following the em-dash, and the context may demand that there _should_ be a space following the em-dash, not because of the em-dash as such, but to make the break between the statements or sentences clear.

These two hyphens represent _one_ character, so you should never break them at line end, with one hyphen at the end of the first line and the other at the start of the second. If you have an em-dash near line end, you can break the line either before or after the em-dash, but never in the middle.
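A split em-dash is easy to miss by eye, but simple to find by program. The following Python sketch reports each line that ends in a single hyphen where the next line starts with a single hyphen. It is a rough heuristic under that one assumption, and the function name is invented.

```python
def split_em_dashes(text):
    """Return 1-based numbers of lines whose em-dash is split at line end."""
    lines = text.splitlines()
    found = []
    for index in range(len(lines) - 1):
        end = lines[index].rstrip()
        start = lines[index + 1].lstrip()
        ends_single = end.endswith("-") and not end.endswith("--")
        starts_single = start.startswith("-") and not start.startswith("--")
        if ends_single and starts_single:
            found.append(index + 1)
    return found
```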

The fourth type of dash, the three-em-dash, is used to represent a missing word, or an undetermined number of missing letters. You will often see it in a sentence like:

Dr. P------ was known for his honesty.

or

Dr. ------ was known for his honesty.

where there is a convention that the character's name has been redacted. Logically, we should represent the three-em-dash as six dashes, but you may reduce that to four. Whichever you choose, do use it consistently in the text you're producing.

Unlike with the em-dash, you should leave a space in such cases wherever a space would have been before the letters were replaced by dashes.

Here's a summary table of the dashes:

  Name            ASCII    Used for

  Hyphen          -        Hyphenated words
  En-dash         -        Durations, like "3:00-5:30"
  Em-dash         --       Break in sentence or parenthetical comment
  Three-em-dash   ------   Indicating a word that was edited out
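If your source text already contains typographic dashes, say from a word processor, the summary above can be applied mechanically. Here is a Python sketch; the Unicode code points are my addition, not part of the PG guidelines, and the six-hyphen choice for the three-em-dash is just one of the two lengths allowed.

```python
# Unicode dash characters mapped to the PG plain-ASCII renderings.
DASH_TO_ASCII = {
    "\u2010": "-",       # hyphen
    "\u2013": "-",       # en-dash
    "\u2014": "--",      # em-dash
    "\u2e3b": "------",  # three-em-dash
}

def ascii_dashes(text):
    """Replace typographic dashes with their plain-ASCII equivalents."""
    for dash, replacement in DASH_TO_ASCII.items():
        text = text.replace(dash, replacement)
    return text
```

This does not add or remove any spaces, so you still need to check that em-dashes have no spaces around them.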

V.92. How should I treat dashes replacing letters?

If the dashes obviously represent individual letters, use the same number of hyphens. Otherwise, you can use a three-em-dash (see above: 6 or 4 hyphens) in such places.

A common convention when a character in a novel is using bad language, or when reference is given to a character whose full name is not being used, is to replace the letters with dashes. For example,

"That D---l, Mr. C------s will regret his hasty actions!"

In this case, it is clear that "D---l" is meant to represent "Devil" and that there is a character whose name begins with "C" and ends in "s" whose name is not spelled out in full. Where the book makes it clear how many letters are represented by hyphens, just use that number of hyphens.

Where the number of letters omitted is not clear, you can decide how long you want to make your extended dash. Typographers often use the "three-em-dash" for this, so called because it is as wide as three capital Ms. Logically, since we represent an em-dash by two hyphens, we might represent a three-em-dash as six. If you feel that six hyphens is too long, you can choose a shorter length, like four, but keep it consistent within your text:

It was in the town of S----, walking on M---- Street, that Sowerby came upon Dr. T---- taking the morning air.

V.93. What about hyphens at end of line?

Remove the hyphens from single words that were wrapped by the printer at line-end on the paper copy. Where two words are joined with a hyphen, you can leave the hyphen at end of the text line.

Books are usually printed with words broken at end of line to make the right side of the text perfectly even. You should remove all such hyphens. For example, in the sentence:

  Mary's mouth tightened as she saw the marks on the car-
  pet, and her hands balled into fists.

you should remove the hyphen from "carpet".

Words which are strung together and hyphenated by the author pose a different question. It is perfectly OK from the point of view of a reader of the plain text version for such a hyphen to occur at end of line, for example:

  Now that the guns were silent, convoys brought badly-
  needed medical supplies and food.

However, be aware that if somebody later rewraps the text for use in a different format like HTML, it is possible that they will introduce a space where it should not be:

Now that the guns were silent, convoys brought badly- needed medical supplies and food.

so there is still a small disadvantage to having a hyphen at line-end.

Sometimes it's not entirely clear whether the hyphen is there because it has to be, or just because it happens to fall at the end of the line:

  Daisy rushed to the door, but there were no letters for her to-
  day, and she retreated sadly.

Sometimes "today" is written as "to-day", especially in older works. So which is this? Should we remove the hyphen or not? In this case, the best thing to do is search the rest of the text for the same word, and see whether it is consistently hyphenated or not in other places.
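That search can be done by hand in any editor, or with a short script. The Python sketch below counts how often the two halves appear hyphenated in mid-line and how often they appear joined; cases split across a line end are deliberately not counted, since those are the ones in doubt. The function name is made up for the example.

```python
import re

def hyphenation_counts(text, first, second):
    """Return (hyphenated, joined) counts, e.g. for "to" and "day"."""
    hyphenated_form = re.escape(first) + "-" + re.escape(second)
    joined_form = re.escape(first + second)
    hyphenated = re.findall(rf"\b{hyphenated_form}\b", text, re.IGNORECASE)
    joined = re.findall(rf"\b{joined_form}\b", text, re.IGNORECASE)
    return len(hyphenated), len(joined)
```

If the count comes back as, say, forty hyphenated and none joined, you can be fairly confident the hyphen at line end belongs there.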

V.94. What should I do with italics?

There are three different ways volunteers currently render italics: like THIS, like _this_ and like /this/. Pick one, and use it consistently in your text.

There are really two questions here: "How should I render italics?" and "When should I render italics?"

The original PG standard for italics was to render emphasis italics as CAPITALS, using underscores for an italicized _I_, and do nothing for non-emphasis italics like foreign words and names of ships, and this is still the most common usage. For reading a plain-text file in a plain text editor, it is still arguably the most reader-friendly usage as well.

It has two drawbacks:

1. if you do want to preserve italics for non-emphasis words, you may end up with a very ugly text where there are too many capitals.

2. it is impossible to convert CAPITALS reliably back into italics, since the original text might have had a capital letter, or even been all capitals in the first place. This is especially true of automatic conversion for people who want to read PG texts on eBook readers.

To overcome these problems, many volunteers now use _underscores_ or /slants/ to render italics. These allow you to preserve all italics without creating an ugly plain-text, and to remove the ambiguity of CAPITALS. Underscores are more popular than slants, but some people feel that underscores should properly be reserved for underlined text. Since printers tend to avoid underlines, however, there aren't many books where this causes a real conflict.
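As an illustration of why paired markers convert so easily, here is a Python sketch that turns underscore italics into HTML italic tags. It assumes the underscores are balanced and used only for italics. Nothing comparable can be written for CAPITALS, because a program cannot tell which capitals were italics in the book.

```python
import re

def underscores_to_html(text):
    """Convert _marked_ words to <i>marked</i>, assuming balanced pairs."""
    return re.sub(r"_([^_]+)_", r"<i>\1</i>", text)
```

The same idea works for /slants/, though a converter then has to avoid tripping over ordinary slashes, as in "and/or".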

V.95. Yes, but I have a long passage of my book in italics! I can't really CAPITALIZE or _otherwise_ /mark/ all that text, can I?

No, you really can't. On the other hand, if the author intended that section to stand out, you don't want to ignore that information and withhold it from future readers.

What you _can_ do is format it differently from the rest of the text. For example, if you're averaging a 68-character line throughout normal paragraphs, you could reasonably use shorter lines, like 58 characters, for the italicized section. Going a step further, you could shorten the lines and indent them a space or two as well. This will give a clear signal to future readers and converters that this section is to be treated specially.

V.96. Should I capitalize the first word in each chapter?

No.

Capitalization of the first word is often used in printed material to emphasize the break at the start of a section or chapter on the paper, but it is not necessary in an eBook, and leads to the same kind of ambiguity as does the capitalization of italics, and for far less reason.

If you feel you really _must_ capitalize the first word, we probably won't stop you, but if so, please do it consistently throughout the book, not just in one or two places, so that a future reader can be certain that these capitalized words were a chapter-head convention, and not otherwise intended for emphasis.

V.97. What is a Transcriber's Note? When should I add one?

A Transcriber's Note is a small section you can add to a text you produce to give the reader some information about changes you made to the book when rendering it into text.

A Transcriber's Note is not the same as a footnote--a footnote is part of the text you have transcribed; a Transcriber's Note is a note that _you_ add to the text, explaining something _you_ have done or omitted. If there is a Transcriber's Note, it may be at the top or the end of the text, and it should be clearly marked so that a reader cannot confuse it with the main text or an introduction.

The main thing is to ensure that a reader cannot confuse text that you have added with text that was in the original book.

Transcriber's Notes are rarely needed, but if, for example, you found misprints in the text, or things that might look like misprints even though they're not, you may note them here, if it seems relevant. If there is an image in the book that is important to the content, you may describe it in a note. If there was unusual typography that you had to represent in some uncommon way, you might well explain that here.

You don't need to add a Transcriber's Note just for common conversions like italics, and you should not use such a note to add your own comments or views about the text or the author. It's just there to let the reader know what decision you have made about rendering the text.

Here are some examples of Transcribers' Notes:

Transcriber's Note:

The irregular inclusion or omission of commas between repeated words ("well, well"; "there there", etc.) in this etext is reproduced faithfully from the 1914 edition . . .

Transcriber's Note:

Inserted music notation is represented like [MUSIC--2 bars, melody] or [MUSIC--4-part, 8 bars]

[Transcriber's Note: This letter was handwritten in the original.]

Transcriber's Note:

The spelling "Freindship" is thus in the original book.

Transcriber's Note: Some words which appear to be typos are printed thus in the original book. A list of these possible misprints follows:

If there is an image that is important to the content you may describe it at the point in the text where it appears, for example:

[Transcriber's Note: Here there is a map of three islands just West of and parallel to a coastline running SW to NE, with a big X marked on the North of the middle island. A spur of land extends from the mainland, sheltering the islands from the north-east.]

Transcriber's Notes that apply to the whole text should be placed at the start or end of the text--your choice. Notes that pertain to a specific point in the text, like the map example above, should be placed at the point in the text where they are relevant, but not interrupting a paragraph except where it cannot be avoided.

V.98. Should I keep page numbers in the e-text?

No. But there are exceptional cases . . .

In general, the page numbers of the original book are irrelevant when making a reader's edition for PG; they are annoying and intrusive for anyone trying to read it, and if you did keep them, they would probably be removed by anyone converting it. Get rid of them!

But there are a few books where page numbers are appropriate. Non-fiction books that use page numbers as internal cross-references are the prime example; if, on page 204, the text reads

"Our studies of plants (see pp. 141-145) show that this is true."

and this kind of cross-reference is frequent throughout the text, then it is probably best to keep the page numbers, since it is otherwise very difficult to honor the author's intent.

In the more common case where cross-references exist, but are not frequent, and not essential to the text, you have several choices: leave the cross-references in, meaningless though the page numbers are, remove the cross-references, change the cross-references to something relevant (like "Start of Chapter 12" instead of "pages 141-145"), or, if you can make it work in context, insert references in the text for the cross-references to point to, like [Reference: Plants] and then reformat the cross-reference like "Our studies of plants (see [Reference: Plants]) show that this is true."

There are a few other cases, where the text you create is likely to be the subject of study or reference, in which it may also be desirable to retain page numbering.

When there are pages at the end of the book with notes referring to page numbers, the simplest answer is to change the page number references to chapter numbers, and add a quote from the page referred to if it's not already in the book's end-notes. That way, a reader can search for the phrase.

V.99. In the exceptional cases where I keep page numbers, how should I format them?

Within brackets of your choice, with one space either side, simply added to the text at the exact point of the page break. Unless there is some [142] special reason, you shouldn't insert a line break or new paragraph when indicating a page number; just insert it in the text, as I did with "142" above.

You should use whichever of round brackets, (143) square brackets, [144] or curly brackets {145} is not used (or least used) within the main text itself, and then use it consistently. Try to make sure that your page numbers cannot be confused with anything else.

Don't run your[146]page[147]numbers right up against words with spaces omitted; this just makes the text hard to read. Use spaces before and after.
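A simple check can catch page numbers that are jammed against words. This Python sketch assumes you chose square brackets; it would need its pattern changed for round or curly ones, and it will also flag a page number that sits directly next to punctuation. The function name is invented.

```python
import re

def crowded_page_markers(text):
    """Return page numbers whose [123] marker lacks a space on either side."""
    crowded = []
    for match in re.finditer(r"\[(\d+)\]", text):
        # Treat the start and end of the text as if they were spaces.
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        if not before.isspace() or not after.isspace():
            crowded.append(int(match.group(1)))
    return crowded
```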

Where the page break is at the start of a chapter or headed section, you can put it on a line of its own, for example:

[148]