The Project Gutenberg FAQ 2002
Chapter 7
About problems with the printed books:
V.125. I found some distasteful or offensive passages in a book I'm producing. Should I omit them?
Please don't. Readers understand that books are works of their time and place, reflecting the opinions and prejudices of the people who wrote them, and the people they observed. We shouldn't try to pretend those prejudices out of existence. It may be, in a century or two, that our descendants are repulsed by _our_ prejudices.
It is perfectly normal, for all kinds of reasons, not to want to produce a particular book, but producing one while deliberately removing passages is censorship, and is unfair to our readers.
If you find it too disturbing to handle the content, you can of course abandon the book, or pass it along to some other volunteer.
V.126. Some paragraphs in my book, where a character is speaking, have quotes at the start, but not at the end. Should I close those quotes?
Probably not.
When one character is making a speech that spans more than one paragraph, it is usual _not_ to close the quotes until the speech is finished. This avoids confusion about whether the next paragraph is the same speaker or another--once a character has started speaking, there are no closequotes until the speech is finished. However, there are openquotes at the _start_ of each new paragraph during the speech. This makes the quotes unbalanced, but it isn't a misprint; it's deliberate.
If this is not the case, if the same character is not continuing the speech in the next paragraph, then you may have found a typo in the book. [R.26]
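If you want a rough automated way to find candidates for this check, the following is a minimal sketch in Python. It assumes plain straight double quotes and paragraphs separated by blank lines; the function name is made up for illustration. It only points you at paragraphs to look at by eye, since it cannot tell who is speaking.

```python
def unclosed_quote_paragraphs(text):
    """Return (index, paragraph) pairs worth checking by eye.

    A paragraph is flagged when it has an odd number of double quotes
    AND the next paragraph does not begin with a quote, so it cannot be
    the ordinary "speech continues in the next paragraph" case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    suspects = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance, nothing to check
        next_para = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
        if not next_para.startswith('"'):
            suspects.append((i, para))
    return suspects


sample = (
    '"First part of the speech.\n\n'
    '"Second part, and done."\n\n'
    '"An unclosed remark.\n\n'
    'He left the room.'
)
for index, para in unclosed_quote_paragraphs(sample):
    print(index, para)
```

A flagged paragraph is not necessarily a typo, and an unflagged one is not necessarily correct: a continued speech and a change of speaker both start the next paragraph with a quote, so you still have to read the passage.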
V.127. The spelling in my book is British English (colour, centre). Should I change these to American spellings?
No.
Stay true to the edition you have. And this applies the other way, as well: if you have an American edition of a work by an English author, please leave the spelling as it is.
V.128. I'm nearly sure that some words in my printed book are typos. Should I change them?
The first thing to be aware of is that typos in books are not as rare as most people think. You may never have noticed typos in your normal reading, but under the kind of scrutiny that a book gets while being produced for PG, they often do become noticeable. It's quite common to find anything up to ten typos in a book.
Before you decide it's a typo, though, check that the same word doesn't occur elsewhere in the book with the same spelling. Often, the words or spelling used by pre-20th Century authors may just not be familiar to you.
When you find something that you believe to be a typo, you have four options: pretend you didn't see it :-), change the typo and add a transcriber's note [V.97], change the typo without a transcriber's note, or leave the typo as it is and add a transcriber's note. If you are adding a note, do it at the top or bottom of the file; don't try to work it into the text, and don't use the [sic] convention, since the reader won't know whether the [sic] was added by you or an earlier publisher.
In general, it's safest to leave the typo in place and add a note at the end of the file, listing the words you believe to be typos; that is the least contaminating and intrusive method. When adding the note, you don't need to leave a mark in the main text. You can just say something like:
[Transcriber's Note: "haw" near the end of chapter 15 appears to be a misprint for "hawk".]
The danger in making changes is that you may be wrong, and we really don't want to corrupt the text. This is particularly so in some old books where archaic usages, now obsolete, may look downright wrong to modern eyes. Sometimes, though, a typo is just so blindingly obvious that it warrants immediate replacement. Even in these cases, conscientious people will sometimes add a note, something like:
[Transcriber's Note: in chapter 12, I have changed "he stood on the tock", to "he stood on the rock".]
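The check suggested above, seeing whether a suspect spelling occurs elsewhere in the book, is easy to do with your editor's Find function. If you prefer a script, here is a minimal sketch in Python; the function names and the sample sentence are invented for illustration, and the word-splitting rule (letters and apostrophes only) is an assumption that may not suit every book.

```python
import re
from collections import Counter


def word_counts(text):
    """Count each word in the text, ignoring case."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(words)


def occurrences(text, suspect):
    """Return how many times the suspect spelling appears in the text."""
    return word_counts(text)[suspect.lower()]


sample = "He stood on the tock. The rock was wet, and the rock was cold."
print(occurrences(sample, "tock"))  # prints 1
print(occurrences(sample, "rock"))  # prints 2
```

A word that appears only once is not proof of a typo, and one that appears many times is probably deliberate; the count just tells you where to look more closely.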
V.129. Having investigated what looks like a typo, I find it isn't. Do I need to do anything?
Often in PG work, you come across an odd word or usage. Might be a typo; might not. You check it out, and find that it is deliberate--perhaps a word from local dialect that just happens to resemble a different word, perhaps the author is using an odd word or spelling to make a point with the language. Especially if it's an isolated incident, and especially if it's not obvious, you can add a transcriber's note to the end noting that the word is thus in your edition, and that it is probably right. This may prevent some well-intentioned converter from changing it.
It's rare that you will need to do this; you may encounter such a case only once in a hundred PG books, but it is an option.
V.130. Aarrgh! Some pages are missing! Do I have to abandon the book?
No. It happens more often than you might think, and we're quite used to dealing with it.
Finish the book, and ask other volunteers to help by finding another copy of the book to fill in the missing section. For something like this, you can try asking on the WebBoard or gutvol-d [V.12], or ask Michael Hart to put a note in the Newsletter asking for assistance. We can post the book incomplete, and put a Transcriber's Note [V.97] in the header asking any future reader who has a copy to fill in the gap.
V.131. Some words are spelled inconsistently in my book (e.g. sometimes "surprise", sometimes "surprize"). Should I make them consistent?
No.
English spelling didn't really standardize until the start of the 20th Century (and even then it fractured; e.g. "standardize" vs. "standardise") and the further back you go, the more inconsistent it becomes. Shakespeare, for example, signed his own name with several different spellings.
Where your printed edition genuinely uses alternate spellings of the same word, you should preserve them.
Word Processor FAQ
W.1. What's the difference between an editor and a word processor?
An editor shows you the characters you type, exactly as you type them. It puts new-line characters in when you hit the Enter key, and only when you hit the Enter key. Its ultimate aim is to give you exact control of plain text. EDIT in DOS, Notepad in Windows, vi and emacs in *nix, Tex-Edit Plus and BBEdit Lite in Mac, are all editors.
A word processor, in addition to entering the characters, also lets you change the font, the size of individual words, and whether they are italic or bold. It doesn't generally want individual line-ends put in on each line; it just rewraps the text as you change it. Its ultimate aim is to print your document on paper with full formatting facilities. WordPerfect for MS-DOS and Windows, MS-Word for Windows and Mac, AbiWord for Windows and Linux, and Nisus Writer for Mac are all word processors.
W.2. Should I use an editor or a word processor?
For dealing with plain text, which is what PG is about, you might expect a text editor to have the edge, since the formatting features of word processors can get in the way of making a clean text.
However, if you use a word processor, and you ignore all of the layout and formatting that have to do with fonts and paper, it will work equally well. A few common problems associated with word processors are mentioned below.
W.3. Which editor or word processor should I use?
The one you like best!
Any of them will do the job. Even the most primitive editors of 1971 will do the job. The most feature-bloated word processor of tomorrow will do the job. No editor or word processor affects in the slightest the "quality" of the text produced.
For PG purposes, therefore, the only difference between them all is how easy you find them to use, and what facilities they have for helping you--and those are decisions that only you can make.
If you already have a favorite editor or word processor, stick to it. If you don't, there's a huge selection available for you to consider, on any type of computer.
Sometimes, using a word processor, you may encounter some problems in saving your book as plain text. You have to figure out how to get it right just once, and then use that same method thereafter. If you have problems with this, ask other volunteers or one of the Posting Team for help.
W.4. How can I make my word processor easier to work with for plain text?
First, switch off _everything_ called "Smart ------" or "Automatic". Modern word processors commonly offer lots of typical typing support features--"Smart Quotes", "Auto Correct", automatically capitalizing the first word in each sentence, anything like that. By all means, leave on any informative highlighting of misspelled words or other errors that it offers, but switch off any feature that changes what you type without asking you. Older books contain text that doesn't sit comfortably with modern rules, and we don't want your word processor deciding what Chaucer really wrote!
Now, choose a non-proportional font, and apply it to the whole document. It's important to work in a non-proportional font, because you may have to line words up underneath each other, and it is not possible to do this consistently in proportional fonts like Times or Arial.
If you work in Courier, size 10, 11 or 12, and your word processor is set for a normal page size, about 7 inches across excluding margins, then what you see in your WP is a pretty good approximation to how the text will look in PG plain text format. One formula, suggested by John Mamoun in the Volunteers' Voices section, is to Select All the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches, then Save As "Text with layout".
W.5. What is the difference between proportional and non-proportional fonts?
A non-proportional, or "monospaced", or "typewriter" font, is one where all of the letters take up exactly the same amount of space on screen: a capital "W", a lower-case "i" and a space are all equally wide. The Courier family of fonts is commonly used for this.
A proportional font is one where each letter takes up just the amount of space it needs, so that a capital "W" is much wider than a small "i".
Unfortunately, the different sizes of the letters in different proportional fonts means that it's not possible to line up letters consistently: a "W" may be equivalent to three "i"s in one proportional font, and to four "i"s in another. This means, for example, that it is not possible to use a proportional font to format plain text tables or poetry correctly--lining up the spaces and words using one proportional font will cause it to look skewed using another.
You should always look at PG texts in a non-proportional font, even if you prefer to work mostly using a proportional font, because readers and automatic converter programs will assume that you meant your text to be viewed using a non-proportional font.
W.6. I can't get words in a table or poem to line up under each other.
You are using a proportional font. You should always use a non-proportional font like Courier for PG work. Change the font of the entire document to Courier and try again.
About using Microsoft Word:
PG volunteers use many different word-processors, but Microsoft Word is the one we hear most queries and problems about.
W.7. I've edited my book in Word--how do I save it as plain text?
First, make sure that all text is using Courier or Courier New and is at the same point size (usually 10-12). Move your right margin so that you see roughly the right number of characters per line (usually 65-70). Then choose File / Save As and then choose the format "Text Only with Line Breaks". Save your file with the extension ".txt" to distinguish it from your Word format file.
After saving, open your text file using Notepad or some other simple text editor and look at the results. You should see a typical PG layout of the text--lines up to 70 characters long, a blank line between paragraphs and no indentation at the start of each paragraph. If so, you're done.
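Looking at the file by eye is usually enough, but for a long book you might want a quick automated check of the same points. The following is a minimal sketch in Python, not an official PG tool; the function name is made up, and the 70-character limit is simply the figure mentioned above.

```python
def check_layout(text, max_len=70):
    """Return a list of (line_number, problem) for lines that look wrong."""
    problems = []
    previous_blank = True  # treat the top of the file as following a blank line
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            problems.append((number, "longer than %d characters" % max_len))
        # A line straight after a blank line starts a new paragraph.
        if previous_blank and line[:1] in (" ", "\t"):
            problems.append((number, "paragraph starts with indentation"))
        previous_blank = (line.strip() == "")
    return problems


sample = "First paragraph line.\n\n  Indented start.\n" + "x" * 71
for number, problem in check_layout(sample):
    print(number, problem)
```

Treat the output as hints only: poetry, tables and block quotations are often indented on purpose, so an "indentation" report is not always an error.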
W.8. Quotes look wrong when I save a Word document as plain text.
You may have left "Smart Quotes" on in Word options. This tells Word to use left- and right-slanted quote marks at the beginning and end of a quote instead of the plain ASCII straight quotes. When you save a document that contains these angled quotes as plain text, they come out as non-ASCII characters that look wrong on most editors and viewers. The solution is to turn off Smart Quotes in Word and/or replace the ones it has already created.
W.9. Dashes look wrong when I save a Word document as plain text.
When Word recognizes an em-dash as such, it may try to use a special character for it. This may appear as a black square, an empty box, or a funny accented letter when you Save As text and look at it in a different editor.
You can usually do a Find and Replace on this character either in Word or in another editor after Saving As text to change it to two dashes.
For those interested, the "funny character" is character 151 (97H), and is specific to Codepage 1252 [V.76].
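If Find and Replace is awkward, the same clean-up for quotes and dashes can be scripted. This is a minimal sketch in Python that works on the raw bytes of a file saved from Word in Codepage 1252; the function name is invented, and it assumes the file really is in Codepage 1252 (in other encodings these byte values mean something else). In that codepage, characters 145 to 148 are the curly single and double quotes, and 151 is the em-dash.

```python
# Codepage 1252 byte values mapped to their plain ASCII replacements.
REPLACEMENTS = {
    0x91: b"'",   # character 145, left single quote
    0x92: b"'",   # character 146, right single quote or apostrophe
    0x93: b'"',   # character 147, left double quote
    0x94: b'"',   # character 148, right double quote
    0x97: b"--",  # character 151, em-dash, becomes two dashes
}


def plain_ascii_punctuation(data):
    """Replace Word's curly quotes and em-dashes in a bytes object."""
    out = bytearray()
    for byte in data:  # iterating over bytes yields integers
        out += REPLACEMENTS.get(byte, bytes([byte]))
    return bytes(out)


sample = b"\x93Hello\x94\x97she said, \x91it\x92s late.\x92"
print(plain_ascii_punctuation(sample))
```

To use it on a real file, read the file in binary mode (`open(name, "rb")`), pass the contents through the function, and write the result to a new file so that your original is untouched.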
W.10. I saved my Word document as HTML, but the HTML looks terrible.
Yes. Word is not unique in having this problem, but HTML saved from Word is the case we hear most about. Microsoft themselves offer a free plug-in to Word that saves the file in "Compact HTML", which is a bit better. You can fix it by hand, or you can use Tidy <http://tidy.sourceforge.net>, a handy utility, which will do some clean-up on the HTML. If you're working with HTML, you really need a copy of Tidy anyway, because it's such a great way to do a check on the correctness of your HTML.
Tidy is also embedded in some Windows GUI tools, like Tidy-GUI, HTML-Kit and NoteTab.
Scanning FAQ
S.1. What is a scanner?
A scanner is a machine that makes an image, a picture of the page that is fed to it, and sends that image to your computer. It only makes an image, like a camera does; it doesn't turn that image into text.
S.2. What types of scanners are there?
The most common type of scanner, the kind you're likely to find in your local computer store, is a flatbed scanner. It has a glass bed usually a bit bigger than Letter paper size (or A4 if you live in Europe! :-) and most of the common models are optimized for typical office correspondence. One of these may cost anything from under $100 to $400, depending on its features, or you can pick them up cheaper second-hand. You use this by placing the paper or book face-down flat onto the glass, and scanning from there. This is the kind of scanner most commonly used by PG volunteers.
Some stores will call sheetfed scanners a different category. These are flatbed scanners with Automatic Document Feed (ADF), but they are fundamentally the same machine, and the ADF sheetfeeder unit may often be bought as an accessory to the flatbed scanner. Recently, a few sheetfed scanners have appeared that are very small, without a full flatbed, just a narrow strip that the paper rolls through. Avoid these for PG work; you often need to be able to scan the book flat.
Hand scanners, as their name implies, are much smaller, and typically very cheap, or even thrown in free. You use these by holding them in your hand and running them along the text like a brush. These are really not intended for PG work; you need a very steady hand movement to get them to scan a page of text into a readable image, and they shouldn't be considered as an option for a 400-page book--scanning and OCR is tough enough without that!
You can think of production scanners as industrial-strength flatbed scanners. The basic mechanisms are the same, but a production scanner will certainly have ADF (sheetfeeder), more features and speed, and be rated for very high volume scanning. Production scanners are used by publishers, businesses with high-volume paper processing needs, and print shops. This last is useful, because you may be able to get some scanning done by a print shop. It can't hurt to ask. If you're thinking about buying one of these babies (and who among us hasn't? :-), be sure you have $2000 or more to spend.
Drum scanners are mostly used by publishers for professional, high-quality artwork. The paper is placed on the surface of a drum that rotates past a fixed scanning head. The drum can be very large. Because the sensors don't have to move, the electronics and optics can be of higher quality, and produce very accurate, high-definition images. They are exactly what you would want for making professional quality scans of old movie posters, but they're expensive, and not very useful for scanning War and Peace to OCR.
Planetary scanners are a different breed to all the others. They are really not scanners at all, but a very high-end digital camera on a stand. You place the book face-up with the pages open, with the camera looking straight down on it. It takes a picture, and passes it on to the connected computer. Planetary scanners are ideal for old, fragile, valuable books that can't be exposed to the stress of normal scanning. They typically come supplied with specialized software, sometimes even their own dedicated computer, and they are very, very expensive--$20,000+.
S.3. Which scanner should I get?
For most people, the answer is simple. Unless you have a lot of money and are sure you will be scanning a lot of books, you should get a normal, consumer-or-office type flatbed scanner, with or without an ADF sheetfeeder.
Having decided that, you're faced with the question of which scanner to buy. More good news! The market in scanners is very competitive, and there are many top-line vendors all watching each others' features like hawks, eager to deliver the highest-spec machine they can. There are only a couple of critical factors in this decision--most of it is about getting the best buy.
For PG work, you really _need_ an optical resolution no less than 300 by 300 dpi (dots per inch), and 600 by 600 is very desirable. Obviously, more is better, but it would be very rare to need more than 600 dpi for PG work. Pay no attention to the "interpolated" or "enhanced" resolution, where the software "guesses" what dots should fill in the gaps--you're only interested in the optical resolution. The good news is that it's very difficult to find modern scanners with a maximum optical resolution of less than 600 dpi, but if you're buying second-hand, you should check this out first.
You will also _need_ a scanning surface on the glass big enough to place your book with two facing pages flat. Again, the good news is that it's very hard to find a flatbed whose scanning surface is too small for PG work, since these scanners tend to be designed to handle office paper, which is about the right size. Most flatbed scanners have scanning surfaces of about 8.5" by 11.5", and this is standard for PG work. If you're working on books with very large pages, you may need to resign yourself to scanning one page at a time, but buying a scanner with a big flatbed for these rare occasions will be much more expensive.
You must make sure that you get a scanner that will connect correctly to your computer. There are currently (mid-2002) three main types of connections commonly available: SCSI, USB, and parallel.
SCSI (Small Computer Systems Interface) is the highest-quality option, but it means that you need a SCSI card in your computer, and must be willing to figure out how to install it. If you're already a SCSI enthusiast, you don't need to read further; if you're not, I suggest you avoid it unless you enjoy tinkering. Production scanners mostly require SCSI.
Parallel-port connections used to be common, as a cheaper, easier alternative to SCSI. Since the introduction of USB they have become rarer, but you will still see them for sale second-hand. These plug into your printer port, and don't require any further engineering skills.
Most new scanners hook up using a USB (Universal Serial Bus) interface, which is a no-muss, no-fuss "plug-in and go" option, but be sure, if you have an old PC, that it actually has a USB port and that your operating system supports it; some older Windows PCs and Macs may not. If your PC doesn't support USB, you should probably look at Parallel-port scanners.
By the time you read this FAQ, FireWire and USB 2.0 interfaces may also be common. For your purposes, these are like more advanced versions of USB. Just make sure that your computer has the right support to match the scanner.
If you're buying second-hand--and used scanners can be very cheap--make absolutely sure that you're getting the original software that came with the scanner, and that that software will work with your current operating system on your PC.
Having ensured that your choice of scanners passes these tests, you're now free to indulge your tastes for any extras you like. Color is nice, but rarely used, since we mostly transcribe older books that have no color printing. Higher resolutions are comforting to have, both since you may occasionally find them useful and because it shows that the optics are of higher quality than you actually need for your PG scans.
If you are nervous about your choice of scanner, or how easy it is to get one working, feel free to contact other PG volunteers for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.4. What is ADF?
ADF stands for Automatic Document Feed, and it's just a jargon term for a sheetfeeder, where you put in a stack of pages to be scanned and go away while that's happening instead of putting in each page manually.
S.5. Should I get ADF?
That depends. Yes, ADF is a great idea, and can be a huge work-saver, and if you have the cash to spend, it may well be worth it. But ADF has a dirty little secret: like any other gizmo with moving parts, it occasionally jams. The sheetfeeders built into these low-cost machines are aimed at handling typical office paper straight from the laser printer--large, smooth, good quality, with perfectly-cut, perfectly-aligned edges. In your PG work, you will be dealing with hundred-year-old pages of various thicknesses and textures, usually much smaller than the sheetfeeder was designed to work with. And you will have to have cut the pages, and may leave ragged edges in doing so.
Under these conditions, you may find that paper often jams in your sheetfeeder, and it defeats the purpose if you have to stand over the scanner while it works, or if you end up having to lift the cover and use your scanner as an ordinary flatbed, or, worse, if your paper gets scrunched up as if a dog had been playing with it.
And of course, in order to feed the pages through, you will have to cut them out of the book, destroying it. (It may be possible, with the help of a bookbinder, to have the pages professionally cut, and later re-bound.)
With ADF, you probably won't actually scan much faster than scanning flat, but you won't have to keep turning over the pages during that time.
So when you're making that choice, think carefully. If money isn't a problem, or you do expect to be working with cut sheets, then go ahead and get a sheetfeeder--it's great when it works! But don't be disappointed when it doesn't work all the time.
S.6. What's a "TWAIN driver" and why do I need one?
A TWAIN driver (see <http://www.twain.org>) is a piece of software that installs onto your Windows PC or Mac and controls your scanner from there. With any modern scanner, there will be a TWAIN driver included in its software package. Once installed, you shouldn't have to think about it again, or even know it's there.
A modern OCR package will usually find your TWAIN driver and use it to control the scanner. This is very handy. There may also be a small scanning package with your TWAIN driver, which will provide a screen where you can make fine adjustments to scanner settings, and start scans. You probably won't _need_ this, since your OCR package will probably do it for you, but it may be useful for semi-manual control of the scanner.
Unix-based systems like Linux use SANE <http://www.mostang.com/sane/> rather than TWAIN drivers.
S.7. How do I scan a book?
This depends on whether you have cut the pages out, or whether you are working with an intact book.
If you have cut the pages out, and you have an ADF, then you will obviously feed them through that.
If you don't have an ADF, there usually isn't much point in cutting the pages. Most modern OCR will recognize a "dual-page" or "two-up" scan, and, if yours does, then that's normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.
Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural angle. Most OCR software will recognize that the image has been rotated through a right-angle, and will correct it when it reads the text.
A common problem with scanning an opened book is "guttering", which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, is curved against the glass. There's more about this, and an example, scan3, in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold the spine down on each scan; others just press their hand against it.)
Another common problem is light scattering, when too much light gets into the scanner. The scanner head detects light, and you want the only light source to be the scanner itself, not ambient room light or sunlight. Scanners have covers that are intended to be closed while scanning, for a controlled light level, but when you're scanning a book held open and flat, you can't close the cover fully. In a bad case, this can leave the scan looking like overexposed film; you can see an example in scan4 of the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". If this happens, just make sure that your room is dim while you scan--don't have a ray of bright sunlight bouncing around the inside of the scanner!
Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.
Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don't have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.
By default, most scanners will always scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting on your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes the time for each scan and subsequent OCR recognition shorter, and in a really good case can cut your total scanning and OCR time in half.
Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package's editor, then scan the next. This is a more leisurely approach favored by some volunteers.
S.8. My book won't open flat enough for a good scan, and I don't want to cut the pages.
Well, then, you have a difficult choice to make, but you do still have several options:
You can accept a poor-quality scan, and spend a lot of time fixing up the guttering on the margins.
You can bite the bullet, and cut the pages.
You can type the book, or find a typist who will work on it for you.
You can find a print shop or bookbinder who will cut the pages professionally, and re-bind the book when you're done. You may even replace it with a fresh new binding that will give the book a new lease of life.
Take your choice.
Most books will open flat enough for an adequate scan, though you may have to put stress on the spine to do it.
If you have a really precious book, and you can't find a typist, you might consider the options of a digital camera [S.11] or finding someone with a planetary scanner [S.2] to scan it for you.
Michael Hart said: "I would give up every book I own, including my first edition of the OED, my Civil War edition of the Merriam Webster's Unabridged, etc., etc., etc., so everyone could use it any time they wanted rather than that only I or my friends could use it . . . and obviously _I_ could use it too."
Fortunately, it rarely comes to that.
S.9. How long does it take to scan a book?
Putting the book flat on the glass means that you scan two pages at a time. A reasonable modern scanner will scan the area of two typical pages at 400dpi in anywhere from 20 to 40 seconds--let's call it 30 seconds for two pages. That's four pages a minute, or 240 pages an hour. You could reasonably get through a 400 page book in two hours, even allowing for an occasional break or glitch.
Of course, you should also allow time for scanning a few trial pages with different settings before you start, to decide which settings to use. Ten minutes spent here can save you hours of proofreading time.
There are two big tips that can save you a lot of scanning time:
If your OCR or scanner control package has a timer setting, that automatically keeps scanning without user intervention, you can forget about the screen and just keep turning the pages as needed.
You should set your scanner just to scan the area the book covers on the glass. By default, your software will probably scan the full area of the glass, and usually, your book won't need that. By scanning only what you need, you may typically save anything from 20% to 70% of the time taken to scan the full area. If your book is small enough to open flat _across_ the scanner instead of "down" the side, 400 pages an hour is not out of the question with this trick.
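The arithmetic behind these figures is simple enough to try with your own scanner's timings. Here is a minimal sketch in Python; the function names are made up, and it assumes that the time per scan is the only cost, ignoring page-turning, breaks and glitches.

```python
def pages_per_hour(seconds_per_scan, pages_per_scan=2):
    """Pages scanned in an hour if each scan takes the given time."""
    return 3600 / seconds_per_scan * pages_per_scan


def hours_for_book(pages, seconds_per_scan, pages_per_scan=2):
    """Hours of pure scanning time for a book of the given length."""
    return pages / pages_per_hour(seconds_per_scan, pages_per_scan)


print(pages_per_hour(30))       # 240.0 pages an hour at 30 seconds per scan
print(hours_for_book(400, 30))  # about 1.67 hours for a 400-page book
print(pages_per_hour(18))       # 400.0 pages an hour at 18 seconds per scan
```

On these assumptions, reaching 400 pages an hour means getting each two-page scan down to 18 seconds, which is a 40% saving on the 30-second figure and so within the range quoted above.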
S.10. What scanner settings are best?
For a given book, scanner, PC and OCR software, there must be some "ideal" scanner settings, but if you change any of these components, the ideal scanner settings will change with them. Some OCR packages recognize greyscale better than black and white; some don't like greyscale at all. Some books have small print needing higher resolution; some are speckled so that higher resolution leads to more errors.
Obviously, the best settings also depend on the individual book, and some books will require you to get downright creative with the settings, but most PG books are scanned in Black and White or greyscale, somewhere between 300dpi and 600dpi.
This decision is a trade-off between speed and accuracy, and an illustration of the difference between principle and practice. In principle, a true-color, 9600dpi scan is a much better rendering of the page than a B&W 400dpi scan. In practice, all that extra information doesn't usually help the OCR make better distinctions between letters, and the larger and more detailed the scan, the longer it takes to make the scan, the more disk space the image file takes, and the more processing time and memory the OCR package needs to recognize it.
A further paradox emerges when considering higher vs. lower resolutions: depending on the paper and ink quality, you may see _more_ errors start to appear on very high resolution scans. These are caused by small imperfections in the paper or ink spots that show up on the high-res scan, and that the OCR tries to interpret as letters or punctuation.
So, in summary, bigger is better, but only up to a point.
Brightness is an often-neglected setting that can make quite a big difference to your results. Look at the scanned image: if you see lots of dark patches, make your scan lighter; if your letters appear thin and faded, make your scan darker.
See the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?" for some typical scans and results.
S.11. Can I use a digital camera in place of a scanner?
Digital cameras are getting better resolution all the time, and some volunteers have experimented with making a kind of home-made planetary scanner from a digital camera and a stand. So far, the results don't quite match a dedicated scanner, but as digital cameras improve, this may become a common option. One problem, which planetary scanners use specialized software to correct, is that the natural curve of the pages near the middle of the book tends to foreshorten the letters there, which, like guttering, can cause problems for OCR software.
Whatever the current problems, the prospect of using digital cameras is exciting, because it will mean that non-typists will be able to produce old books borrowed from libraries without worrying about scan quality vs. damage to the spine.
S.12. What is OCR?
OCR stands for Optical Character Recognition. This is very important software that looks at the picture of the page that your scanner has supplied, and turns it into text.
When the scanner delivers the image of the page, that image is only a picture. You can't, for example, search for text in it, or edit the text to add a blank line. Your editor or word processor can't work with it. The OCR program does the job of "reading" and "typing" the image for you. OCR packages call this "reading" or "recognizing".
S.13. What differences are there between OCR packages?
One word: huge. All OCR packages do the same job, but they do it in different ways, with different features, and with different levels of accuracy. OCR can save you a lot of time, or cost you a lot of time. It's really worth putting some effort into making sure you get the right OCR package, and, once you have it, into understanding how to use it. It'll save you time in the long run.
S.14. How accurate should OCR be?
OCR packages commonly say that they are "99%+" accurate, or something like that. Let's analyze what that actually means. Say there are 1,000 characters (letters) on each page: with 99.9% accuracy, you would expect to have to make 1 correction per page; with 99% accuracy, that would be up to 10 corrections per page. In a 400-page book, this all adds up.
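The arithmetic above can be sketched in a few lines. The function name is just for illustration, and the 1,000 characters per page is the same round figure used above, not a measurement of any real book.

```python
def expected_errors(chars_per_page, accuracy_percent, pages=1):
    """Expected number of misread characters at a given accuracy."""
    error_rate = (100 - accuracy_percent) / 100
    # Rounded to hide floating-point noise in the percentage arithmetic.
    return round(chars_per_page * error_rate * pages, 2)


for accuracy in (99.9, 99.0):
    per_page = expected_errors(1000, accuracy)
    per_book = expected_errors(1000, accuracy, pages=400)
    print(f"{accuracy}% accurate: {per_page} per page, {per_book} per 400-page book")
```

Plug in your own page count and a character count from a sample page to see what a quoted accuracy figure would mean for your book.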
But there's a "Your Mileage May Vary" clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans, and this is fair, since they are aiming their products primarily at businesses that process these kinds of materials. _You_ are not dealing with fresh print; you're dealing with old books, yellowed, spotted, marked, imperfectly printed in the first place, and possibly using unfamiliar fonts. And it's unlikely that you will have the patience to get a perfect scan on every page. The result is that the accuracy of OCR for typical PG work doesn't match the accuracy on images of perfect, fresh paper.
Apart from the scan quality, OCR also has to contend with different fonts and sizes for the letters.
However, if you're getting more than 10 errors per page, you should look at some examples of OCR in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?".
S.15. Which OCR package should I get?
The accuracy of OCR software has improved enormously in the last few years, and OCR technology looks likely to keep improving even faster than software in general. Further, there is competition in this area, and products leapfrog each other with new versions regularly. The brands most commonly mentioned by PG volunteers (mid-2002) are Abbyy, OmniPage and TextBridge [P.1], and trial versions of all three have been available for download over the Web, and may still be when you read this. [Warning: these are big downloads--40MB or more.]
Most common OCR packages will offer two main working options: to scan a page and view/edit the resulting text on the spot before saving, and to scan a whole batch of pages together and view/edit them all later. Some people like to fix up one page at a time; others prefer to get all of the OCR work done at once, then get the whole text into their editor. Most OCR software will cater for both, and if this is important to you, you should check that the OCR you're buying supports the way you want to work.
If you intend to work in a language other than English, make sure that the OCR you buy supports the characters in your language.
Some OCR software has a "training" or "learning" mode. Using this mode, it scans and "reads" or "recognizes" a page, then you correct that page, and the OCR "learns" from its mistakes and tries to do better on the letters it misread when it recognizes the next page. If you're dealing with a very rare font, this can make a difference to your OCR quality, but modern OCR packages come with enough inbuilt font knowledge for most languages, and you probably won't need this.
If possible, try a couple of OCR packages before you decide. If you want opinions on specific versions, contact other PG volunteers and ask for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.16. What types of mistakes do OCR packages typically make?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letters h and b, and e and c, are commonly misread, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.
For example:
" Hello1' caIled jirnmy breczily. 11Anyone home ? "
There seemed to he no-oneabout. Only tbe eat beard him."
should read:
"Hello!" called Jimmy breezily, "Anyone home?"
There seemed to be no-one about. Only the cat heard him.
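Some of these mistakes leave traces that a simple script can point out for a human to check. The following is only a sketch, not a tool PG provides: the patterns and the word list are my own guesses based on the error types listed above, and it will both miss real errors and flag good words (any word containing "rn", such as "turn", will show up).

```python
import re

# Word pairs from the list above that a spell-checker cannot catch.
CONFUSABLE = {"ear", "car", "eat", "cat", "he", "be",
              "hear", "bear", "heard", "beard"}

# Illustrative patterns only; each one marks a word as worth a look.
CHECKS = [
    ("digit inside word", re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
    ("capital inside word", re.compile(r"[a-z][A-Z]")),
    ("possible m read as rn", re.compile(r"rn")),
]


def flag_suspects(text):
    """Return (word, reason) pairs for a human to review; it fixes nothing."""
    suspects = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for reason, pattern in CHECKS:
            if pattern.search(word):
                suspects.append((word, reason))
        if word.lower() in CONFUSABLE:
            suspects.append((word, "confusable real word"))
    return suspects


sample = "Hello1' caIled jirnmy breczily. Only tbe eat beard him."
for word, reason in flag_suspects(sample):
    print(f"{word}: {reason}")
```

Note that "breczily" and "tbe" slip through untouched; those are the kind of error an ordinary spell-checker catches, so the two approaches complement each other.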
S.17. Why am I getting a lot of mistakes in my OCRed text?
If you're new to OCR, you may have come with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It's slightly amazing that OCR works at all, and when it does, it isn't perfect.
You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you're seeing more, then there is a problem with
a) your printed book
b) your scan, or
c) your OCR package.
Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing--even browning--of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.
There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.
You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.
The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the paper, and not setting the brightness correctly, or letting too much light get in. In your early scans, watch out for these problems.
First, if you haven't already, read the FAQ "How do I scan a book?" [S.7] and check that you're following the basic recommendations there.
Now let's look at some samples, and see the kinds of problems you might encounter.
A disclaimer about these samples: specific OCR packages are named, but you should _not_ take these as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. OCR packages have quite a range of variance within themselves, may work better on some texts than others, may improve with "training" or different settings, and I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.
* * * * *
Scan 1--A perfect Scan
Scan1 is as near to a perfect scan as you can expect in PG work. It comes from "The Founder of New France" by Charles W. Colby. It is only a 300 dpi image, but given the quality of the print and of the scan, 300dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don't have to have the latest equipment to get good results!
The actual scan is in the image file scan1-3.tif
It doesn't really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake "space" before the semicolon--if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] "My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?"
Champlain was now definitely committed to the task of gaining for France a foothold in North America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ; at other times they were most disheartening. Hence, if we are to understand his life and character, we must consider, however briefly, the conditions under which he worked.
gocr 0.3.6 converted this as:
Champtain was now definitely committed to the task of gaining for France a foothotd in _orth America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ., at other times they were most disheartening. _ence, if we are to understand his life and character, we must consider, however brieRy, the conditions under which he worked.
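If you want to put a number on how far an OCR result is from the correct text, a rough character-level comparison is enough. This sketch uses Python's standard difflib module; the function name is made up for illustration, and the count is an approximation based on difflib's matching, not a true minimum edit distance.

```python
import difflib


def char_errors(expected, actual):
    """Roughly count characters replaced, deleted or inserted."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A changed run counts as the longer of its two sides.
            errors += max(i2 - i1, j2 - j1)
    return errors


correct = "a foothold in North America"
ocr_out = "a foothotd in _orth America"
print(char_errors(correct, ocr_out), "differences in", len(correct), "characters")
```

Run over a whole proofread page and its raw OCR, this gives you an errors-per-page figure you can compare between scanner settings.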
* * * * *
Scan 2--A Typical Scan
Scan2 is a paragraph from Baroness Orczy's "Castles in the Air". Notice the ink-splotch above the capital "I" in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.
I have made two separate scans, one at 300dpi and one at 400dpi, both Black and White, named scan2-3.tif and scan2-4.tif respectively. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn't much difference between the text produced from the 300 dpi scan and the 400 dpi scan.
I actually cut this book to get the pages out so that I could feed it through my ADF, but the paper is so thick and textured that it sticks together, and jams when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300dpi.
Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as a special character in Codepage 1252 for em-dashes, which isn't available in ASCII, so I converted that to the PG standard 2 dashes.
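For anyone who wants to make the same conversion with a script rather than by hand, here is a minimal sketch. It assumes the OCR output was saved as Codepage 1252 text, where the em-dash is the single byte 0x97; the function name and the sample bytes are only for illustration.

```python
EM_DASH = "\u2014"  # what byte 0x97 becomes when decoded from Codepage 1252


def to_pg_dashes(text):
    """Replace each em-dash with the two-hyphen PG convention."""
    return text.replace(EM_DASH, "--")


# Simulated OCR output in Codepage 1252 (sample bytes, not a real file).
raw = b"five thousand francs\x97a goodly sum in those days, Sir\x97was"
print(to_pg_dashes(raw.decode("cp1252")))
```

Other non-ASCII characters, such as curly quotes, would need their own replacements; this handles the em-dash only.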
Abbyy FineReader 6:
Yes, indeed, I was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain %vas seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, Twas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
gocr 0.3.6:
__e_, indeed, f___as on_the track of h_. hristide Fournier, 3nd of one of the most im__ant hau1s of enem)_ goods ___hich had e__er been made in France. h?ot onl3_ that. I had a1so before me one of the most brUtish crimînat_s it h__4 e___er been m31 misfortune to co_me acro__3. A bu113_, a tiend oí cruelt__. In very truth m3_ fertiIe brain ___as s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun- i;__,i__gnt íor such a miscreanf. yes, in_i__ee3, fj_1e thou3and francî-a b_ood13_ sum in those days, _ir-_vas practica1l3_
a3_ured me. _ut o___er and above n_ere lucre there was the certaint_v that in a few_ da3_s' ti_e I shou1d see the lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of _ear and of sorrow from the s__eetest iace T had Seen fof man)_ a day.
Yes, indeed, f___as on the track of h__. Ariseide Fournier, and of one of the most important hau1s _f enemy goods ___hich had ever been made in France. NoEUR on1y that. I had also before me one of the most brutish crimina1s it h_ad ever been my misfo__tune to come acros__. A bu11y, a fiend of crue1ty. _n very truth my fertib brain _vas seeî3_:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e ru_an by the heels. hanging _____ou1d _ a merciful pun- iï_h_ment for such a miscreant. Yes, indeed, five thou__and f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly a3îured me. But over and above mere _ucre th.ere was th_e certainty that in a few days' ti_e _ shou1d see the 1i__t of gratjtude shining out of a pair o_, _userous b1ue b . e__es, and a __inning smi1e chasing away the l_k of _,ear and of sorrow from the s___,eetest face _ _ad _.een _o_ many a day. . .
Recognita Standard 3.2.7AK:
~'es, indeed, ~w-as on the track of ltT. Aristide Fournier, and of one of the most important hauls of enemy goods "=hich had ever been made in France. ~Tot only that. I ha~i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully-, a fiend of cruelty. In very truth my fertiIe brain was s; ething w-ith plans for eventually iaying that abominable ruffian by the heels : hanging ~-ould be a merciful pun- ishment for such a miscreant. ires, indeed, five thousand franes-a goodly sum in those days, Sir-was practically as~ured me. But over and above mere lucre there was thP certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous btue ey·es, and a winning smile chasing away the hk of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, l~was on the track of h~i. Aristide Fournier, and of one of the most important hauls of enemy goods w~hich had ever been made in France. lVot only that. I had also before mP one of the most brutish criminals it had ever been my misfortune to come acrass. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for ez~entually laying that abomin_ able ruffian by the heels : hanging ~~.-ould be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand f:ancs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should~ see the Iight of gratitude shining out of a pair of iEustrous blue eyes, and a w inning smile chasing away the Iook of fear and of sorrow from the s"-eetest face ~ had seen ~'or rr~any a day.
OmniPage Pro 10:
Yes, indeed, twas on the track of 11T. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I ha(i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
OmniPage Pro 11:
Yes, indeed, twas on the track of AT. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
TextBridge Millennium Pro:
Yes, indeed, rwas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I hail also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day. - - -
Yes, indeed, f was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for manyaday. -
* * * * *
Scan 3--Guttering and Smaller Print
Scan3 is a paragraph from "The Egoist" by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG--everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of "guttering" regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan. Because of the smaller size and the guttering problem, the 400dpi scan made for better quality text in this case.
Here's the output from the sample OCR:
Abbyy FineReader 6:
NEITHER Clara nor Vernon appeared at the mid-day table, n Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an uncdified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir \Villoughby was proud of her, and therefore anxious to soltlo her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended hia nrido.
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Bale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir "VVilloughby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his pride.
gocr 0.3.6:
__,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_ _, _ii_(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__ i,_i,;,_ .,, _(_u_-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll . tf e__Ul__b rU_l gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU o_ _ 8O .t _' t_ail u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)_OB_'_ U_Oll 8eelll6 lttr _,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self. _i__ _ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS to _(_(.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_ j__ l_()_)(_(l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_ _o__(),__ (li,_iIci._ Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_ )ii))),, lIL_Ll v_b__uely f_.ighteUe eVen _OTe kba_ lt OfEe_ded hi_ pi_i..(l_u- . _ , , --.___ _ _,- - -__-
________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_ D_. _id(lle_oi_ t_lked with Miss _ale ,on _ _Ssi__l __i tt_r_'_ iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np _'_.on_ _tune to _tone aGro_S a braWlin( __ inOU__taiß _foPd_ So t2_at a__ u__p,(_ified ___idiei_Ge _ni62it real y 8uppO.8e_ upon _seeii_6 l_e_ o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_ _i_ _viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to ___.tle li__i. i)u__inesS Whike he W_S î_ the hum'ou_ to_ lose her_ __e l_op(_d to finish it by 8hooting a wopd o_ tWo ak Verno__ _ _eforR_ _(in_icr_ Clara's petition to _ Set _free, releaSed fro_ )ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD pi.icle. -. - - - - - '
Recognita Standard 3.2.7AK:
~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table. Dr. bLidrlleton talkc;d wi.th Miss Dale vn elassieal matters, like a ~n~a-mZtured giant gi.ving a child th© jucnp frvm stonc to stone across a brawling mounta,in ford, so that au uiicilificd .ruciicucc mil;·ht really suppasc, upon seeixig hor ·n~er thc ciillicul.ty, she had clouo something for herself. Sir ~Villcm;;lrlry wvs proua of her, and therefors angiaus to sct.tla lrur tn~sincss while he was in the humoar to lose her. lle lu,hcot to iinish it by shooting a word ar two at Vernon bol'ore ~linncr. Clara's petition to bo set froe, released £rom JGGnt., hvd vagucly frighteued even more than it offended hia ri~le. p
NEITfi~R Clara nor Vernon appeareci at the xnid-day table. Dr. Middleton talked with Miss Dalo on classics,l rnatters', like a good-natured giant giving a child the jtimp from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon ~ seeing her over the difficulty, she had done something for herself. Sir yillon ;hby was proud of her, and therefore anxiotis to scttle luer business while he w~as in the hurxiour to lose her: He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from jcLm, had vaguely frighteued even more than it offended his pride.
OmniPage Pro 10:
NF r~rn,Px Clara nor Vernon appeared at the mid-dap table. Dr. Middleton talked with Miss Dale on classical matter, like .t good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an uneVified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir jV;llo,r;;lrl>y was proud of her, and therefore anxious to set.tlo lror Uusiness while he was in the humour to lose her. Ile. lropcol to finish it by shooting a word or two at Vernon bol'ore dinner. Clara's petition to beset free, released from )zinc, had vaguely frightened even more than it offended his pride.
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Bale on classical matters', like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon ~ seeing her over the difficulty, she had done something for herself. Sir yillou ;hby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clam's petition to be set free, released from him, had vaguely frightened even more than it offended his pride.
OmniPage Pro 11:
NF f,rnMR Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Dale on classical matters, like .t good-natared giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lifie(l audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir jVillon;hl)y was proud of her, and therefore anxious to setale leer business while he was in the humour to lose her. lle hoped to finish it by shooting a word or two at Vernon bofore dinner. Clara's petition to beset free, released from )lint, had vaguely frightened even more than it offended his pride. -.2 ..1_ - ____
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Dale on classical matters', like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon,seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle her business while he was in the huniour to lose her. Il"e hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hint, had vaguely frightened even more than it offended his pride. - -
TextBridge Millennium Pro:
NErr'!'~~ Clara nor Vernon appeared at the mid.day table. pr. ~1id(lIeto11 talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that au ~1edifi~ tLU(llCIlCC might really suppose, upon seeing her over the (hjiheulty, she had done something for herself. Sir wiflouighby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. lie ho1)ed to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his prú~t~.
NEITHER Clara nor Vernon appeared at the mid-day table. Pr. Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lified audience might really suppose, upon - seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle hier l)uSifleSS while he was in the humour to lose her. lie hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hirn~, had vaguely frightened even more than it offended his pri(le.
* * * * *
Scan 4--A Really Bad Case!
Scan4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 98/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.
A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semi-colon that was discussed in [V.104].
The results of the OCR are:
Abbyy FineReader 6:
" Ah me ! on what inhospitable coast, On Tvh.it new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men. whose bosom tender pity warms ? What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd Pryads of the shady wood ; Or azure daughters of the silver flood ; Or human voir-e? but issuing1 from the shades, AVhv cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men, whose bosom tender pity warms '? "What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd Dryads of the shady wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men, whose bosom tender pity warms ? "What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd*Dryads of the slrady wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
gocr 0.3.6:
[The 300 and 400 dpi scans produced nothing recognizable. The result of the 600 dpi scan is below.]
'' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_ On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ; _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _ Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ? ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ? '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_ 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _ Op az(_pe da_____litc__s of _tlie sil __?r t1ood ; Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _ __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_--li__t so_nd- in__ad_S___''
Recognita Standard 3.2.7AK:
.: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t, On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ; Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ; Or u.~u. w-Ln.e bossum tender pit~- warna'? ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ? 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5, 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood; Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ; C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~, 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"
" ~h me ! ou "-Mat iuMospita~le coast, On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ; Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ; Or m~ n, "-hose hosom tender pit~- warm5 ? ~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ? ~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers . Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ; Or aznre dau~liters of tMe sil~-~r fiood ; Or lmman ~-oi:~e'? but iauin~ frotn the shades, a lVly cea.~e I straibht to learn "-Mat souud in~ad°s?"
" Ah me ! on what inhospitable coast On ~~-hat new r e~ion is L;1 ~-sses toss'd ~ , Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ · Or men, whose hosom tender pit~l ~varn~s ? ~'G'l~at somnds are these tliat ~atl~er from the shores ? ~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers, Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ; Or azure daylltcrs of tlle silver flood ; Or lm:nan voice? uut issL~ing from the shades, ~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"
OmniPage Pro 10:
,. _lh in- ' on "-hat inh-slit al.:e coast, On "M.^t new reion is 1=1;-a:e~ to-s'd ; P"::e:~'d hw "ild Larba.:an~ fierce in arms ; Or inn. "-hnse bo.,om tender pity warms What <m-,n ds are thFSe that gather from the shores? '1-l.e vo_,e o2 u~vnhit: thm hn,,-,nt The sylvan bowers, The is ;r-ha;r'd h.-;-ads of the liz-Ay iNood Or azure dau_ht;- of tl:c o=1 cr flooj ; Or hnnmn wire? l,11t i--rii:g from the shadP3, Al-ly cease I straiAlit to learn what sound invades?"
'Wh me ! on what inhospitable coast, On what new region is L fusses toss'd ; Possess'd br wild barbaric ns fierce in arms ; Or men, whose bosom tender pith- warms AN-hat sounds are these that gather from the shores ? The voice of nymphs that Haunt the sylvan bowers, The fair-hair'd IWvads of the shady -wood ; Or azure daughters of the silver flood ; Or human voice? bat iauina from the shades, Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast, On what new region is Ll ysses toss'd ; Possess'd bv -wild barbarians fierce in arms ; Or men, whose bosom tender pity warnis ? AVlia± sounds are these that gatller from the shores The voice of nYI11pliS that haunt the -sylvan bowers, The fair -hair'd D.-yads of the shady wood ; Or azure daughters of the silver flood ; Or human voice? lout issuing from the shades, Why cease I straight to learn what sound invades?"
OmniPage Pro 11:
.` lh in-' on what inhospital,le co-st, On xclznt near region is t 1:-sse~ toss'(: ; Possess'd bY Mild barbarians fierce in aims ; Or inn. whose boson tender pity warms What <m-,n ds are tlipse that gather from the shores ? '_I-I.e 1-o=,- of nv:npii? that haunt the sylvan bowers, She ra;r-ha;r'd 1):, ads of the shad- wood ; Or az.ire dau_lit~- of tl:e silo-:-r flood ; Or human voice? l,,tt i?snina from the shadpq, Al-lry cease I straiAit to learn shat sound invades?"
''' :Ah me ! on what inhospitable coast, On iyhat new region is Ulysses toss'd ; Possess'd br wild barbarimis fierce in arms ; Or men, whose bosom tender pity warms AN-hat sounds are tliese that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd D~ yads of the shady -wood ; Or azure dau.L-hters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by -wild barbarians fierce in arms ; Or n1en, whose bosom tender pity warnis ? AVliat sounds are these that gather from the shores The voice of nyniplis that haunt the sylvan bowers, The fair-hair'd Dryads of the shady Wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
TextBridge Millennium Pro:
no on what inhe~ptaEie coast, On what new realun is hivs,e' to5sd ,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~ Or u~,-n. w'linse bo,uuiu tender pity warnls Wl at ~ are t1ie~e that ~atler from the shores ? 'n.e a oro of imvntpirs tint he~nt the sad van bowers, 'flie tah'-ha~r'd D~vahs ct the shady wood 1)1' az Ire dauul~t ~ of tl,e shvr flood Or liunian vi i 'I ? h'tt is- eng from the shades, \VIiv cea-~e I straight to learn w hat sound invades 1"
Ah me on what inhospitable coast, On what new region is U vases toss'd Possess'd by wild barbarians fierce in arms Or men, whose bosom tender pity warms ~ What sounds are these that gather from the shores? The voi'e of nymphs that haunt the sylvan bowers, The fair-baird Prvads of tl~e shady wood Or azure daughters of the silver flood Or human vuiae? but issuing fi'om the shades, Why cease I straigl~t to learn what sound invades?"
Ah me on what inhospitable coast, On what new region is Ulysses toss'd Possess'd by wild barbarians fierce in arms Or men, whose bosom tender pity warms? What sounds are these that gather from the shores? rfhe voice of nymphs that haunt the sylvan bowers, The fair-hair'd Dtyads of the shady wood; Or azure daughters of 'the silver flood Or human voice? but issuing from the shades, Why cease I straigl~t to learn what sOund invades?"
What can we conclude from this?
Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.
Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.
Generally, bigger is better within the range 300dpi-600dpi, but you only need higher resolution with more difficult material.
Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.
S.18. I got an OCR package bundled with my scanner. Is it good enough to use?
That depends on how well your package performs on the actual scans that you do, and how much you value your time vs. money. Most scanners are bundled with OCR software, but these OCR packages are often older or "brain-damaged" versions, with their functionality deliberately lowered. It's unlikely that you'll get a current-version, top-of-the-line OCR package thrown in for free.
You may have to pay extra for better OCR, but it means that you spend less time making corrections. The question is how much better you want your OCR to be.
Save the images from the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17] and try processing them with the OCR you have. Compare the quality of the text produced with the quality of the samples. This should give you some idea of how your OCR compares to others.
Try a few pages from your book with your OCR. How many mistakes do you see on each page? Do you find that acceptable?
S.19. I want to include some images with a HTML version. How should I scan them?
We don't often see color prints in our books, but if you do have one, then scan it in color. Otherwise, try both greyscale and B&W, and see which gives you the best image.
It's usually better to scan images in a higher resolution than you're going to use, and then use an image manipulation package to reduce them [H.10] to a size appropriate for your HTML file. An initial scan at 600dpi is often good. Image manipulation programs will also allow you to "clean up" the pictures, by increasing contrast, despeckling, or other filtering.
S.20. I want to include some images with a HTML version. What type of image should I use?
GIF, JPEG and PNG images are supported by current browsers, and you should stick with those unless you have a specific reason not to.
GIF and PNG tend to be more efficient--provide better quality at a given file size--for simple line-drawings; JPEG is usually better for photographic images.
S.21. Will PG store scanned page images of my book?
No. Or, at least, not yet.
The idea has been kicked around a bit. There's no question of replacing etexts with page images, but many volunteers who have already scanned the book anyway like the idea of saving page images as well--for general information, and as a means of checking future correction suggestions against the original. Some volunteers already keep their page images, stored for possible future use.
Working some back-of-the-napkin figures: a page of text might take up 1KB of space on a computer as plain text or HTML or XML. The same page might take 70KB if stored as a black-and-white image, of just enough quality to serve as a reliable guide to making corrections. Pages with pictures, or stored with enough resolution to allow some future researcher to write a paper on the changing shape of serifs in the 18th and 19th centuries, would start at around 350KB per page, and go up from there.
A 300 page book thus becomes
  about 300KB as plain text (and around 150KB zipped)
  about 20,000KB as minimal-quality images
  about 100,000KB as high-quality images
and with the images, we won't save much space on the zipping, because they're already compressed.
On a normal "56K" modem, getting about 4KB / second, it would take:
  75 seconds to download the text file (40 for the Zip)
  80 minutes to download the minimal images
  over 5 hours to download the high-res images.
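The figures above are simple multiplication and division, and a few lines of Python make it easy to redo them with your own numbers. This is only a sketch: the function names are made up for illustration, and the per-page sizes and the 4KB/second modem rate are the rough estimates quoted above, not measurements.

```python
# Back-of-the-napkin storage and download estimates for a scanned book.
# All sizes are the rough per-page guesses from the text, in KB.
PAGES = 300
KB_PER_PAGE = {
    "plain text": 1,
    "minimal-quality images": 70,
    "high-quality images": 350,
}
MODEM_KB_PER_SECOND = 4  # a "56K" modem on a normal day


def book_size_kb(pages, kb_per_page):
    """Total size of the book in KB."""
    return pages * kb_per_page


def download_seconds(size_kb, rate_kb_per_second=MODEM_KB_PER_SECOND):
    """Time needed to download size_kb at the given rate."""
    return size_kb / rate_kb_per_second


if __name__ == "__main__":
    for label, per_page in KB_PER_PAGE.items():
        size = book_size_kb(PAGES, per_page)
        minutes = download_seconds(size) / 60
        print(f"{label}: {size} KB, about {minutes:.1f} minutes")
```

Unrounded, the minimal images come to 21,000KB and the high-quality images to 105,000KB, a little above the rounded figures used above.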
Someday, the disk and bandwidth capacities that we will take for granted will be such that uploading images, when we have them, will be quite natural, just for the few people who will want them. But we're not quite there yet.
Late flash! As of late 2002, the Internet Archive is providing space to volunteers for storing page images. To see the images, and find out more, go to <http://texts01.archive.org/gutenberg-images/>
HTML FAQ
H.1. Can I submit a HTML version of my text?
Yes.
H.2. Why should I make a HTML version?
Well, you can make one just because you want to, but on some texts there is special reason to.
If you want to preserve the pictures that accompany the text, making a HTML version means that you can specify where and how those images appear.
If there is particular meaningful information in the layout of the text that can't be expressed in ASCII, like special characters or complex tables or fonts, HTML may offer an open format alternative.
H.3. Can I submit a HTML version without a plain ASCII version?
You can submit it, but the Posting Team will then consider whether we should also make an ASCII, or perhaps ISO-8859 or Unicode version of it. We really do want our texts to be viewable by everybody, under all circumstances, and we do not want to start posting texts that are in any way inaccessible to anyone.
See also the FAQ [G.17] "Why is PG so set on using Plain Vanilla ASCII?"
H.4. What are the PG rules for HTML texts?
1. The only absolute rule is that the HTML should be valid according to one of the W3C HTML standards.
You can verify that your HTML is valid at the W3C's HTML Validator at <http://validator.w3.org/>
For a more convenient and friendly, though less official, check of the correctness of your HTML, you should use Dave Raggett's Tidy program at <http://tidy.sourceforge.net>, which not only points out any messiness in your HTML code, but also has some neat modes to clean it up and standardize the formatting.
After that, we have some requirements and recommendations. Compliance with the requirements might be waived if there is a really good reason to make an exception in this case.
2. Requirement: File names and extensions
If you want your text to work within 8.3 filename conventions, you may use .htm as the extension for your HTML files; otherwise, use .html as the extension. If you are working to 8.3 conventions, all of your images as well as your HTML files should have 8.3-compliant filenames.
All file names and extensions should be in lower-case throughout. Yes, we know this is not strictly necessary, but we don't want to have to correct every file that comes with "image.gif" referenced in the HTML accompanied by a file IMAGE.GIF.
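If you have a lot of files, a quick check along these lines may save you some eyestrain. It is a sketch, not an official PG tool: the function name is invented, and the pattern used for "8.3-compliant" (up to 8 letters, digits, underscores or hyphens, a dot, and up to 3 letters or digits) is my assumption about what is safe.

```python
import re

# Assumption: a conservative reading of 8.3 naming, allowing only
# letters, digits, underscore and hyphen in the name part.
NAME_8_3 = re.compile(r"^[a-z0-9_-]{1,8}\.[a-z0-9]{1,3}$")


def filename_problems(name, require_8_3=False):
    """Return a list of complaints about one file name (empty if fine)."""
    problems = []
    if name != name.lower():
        problems.append("not lower-case")
    # Compare the lower-cased name, so that case is reported only once.
    if require_8_3 and not NAME_8_3.match(name.lower()):
        problems.append("not 8.3-compliant")
    return problems


if __name__ == "__main__":
    for name in ["image.gif", "IMAGE.GIF", "frontispiece.html"]:
        print(name, filename_problems(name, require_8_3=True))
```

Remember that the names referenced inside the HTML need to match the actual file names too; this check only looks at the names you give it.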
3. Requirement: HTML and plain-text
Project Gutenberg does publish well-formatted, standards compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic, or mathematical texts.
4. Requirement: Archive format for posting
If the HTML book contains more than one file (including images), create a ZIP (preferable) or TAR archive containing all of the files in the book. The ZIP file may, if you wish, unzip to a subdirectory named for the book. For example, a book called 'The Humour of Mark Twain' might unzip in a directory called 'mthumor'. Make sure directory names contain only alphabetic and numeric characters, no spaces, and are 8 characters or less, even if you're not sticking to 8.3 conventions for filenames.
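Any ZIP program will do the job, but if you prefer to script it, the packaging rule above might be sketched like this in Python. The function names are made up, and the archive is built in memory purely to keep the example self-contained; in practice you would write it to a file.

```python
import io
import zipfile


def valid_directory_name(name):
    """Letters and digits only, no spaces, 8 characters or fewer."""
    return name.isascii() and name.isalnum() and len(name) <= 8


def build_archive(dirname, files):
    """Build a ZIP that unzips into dirname.

    files maps each file name to its contents as bytes.
    Returns the bytes of the finished ZIP archive.
    """
    if not valid_directory_name(dirname):
        raise ValueError(f"unsuitable directory name: {dirname!r}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(f"{dirname}/{name}", data)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_archive("mthumor", {"mthumor.htm": b"<html></html>"})
    print(zipfile.ZipFile(io.BytesIO(data)).namelist())
```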
5. Recommendation: Simplicity
Make your HTML as simple as possible. HTML is an evolving standard, and one that may be completely obsolete in the long term. Use of advanced features may just mean that your version will be obsolete or unreadable that much faster.
6. Recommendation: Images
Images included with your HTML should be in a format that Web browsers can read: GIF, JPEG or PNG. Images should be edited for high quality in a reasonably small file size. Make the best decision you can concerning the image size and placement in the text. Every image included must be linked into (referenced by) the HTML.
7. Recommendation: Line lengths
If it is reasonable to do so, try to wrap paragraphs of text at around the normal PG margin of 70 characters. Ideally, your HTML should be as near as possible identical to your text version except for the HTML tags and entities. People who open your HTML won't all be using browsers, people will need to make corrections, not all editors can handle very long lines, and even with editors that can handle long lines, it's easier to work with short lines.
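Most editors can rewrap paragraphs for you, but as an illustration, here is a minimal sketch using Python's standard textwrap module. The function name is invented, and it assumes that paragraphs are separated by blank lines. It is only suitable for plain prose: it will happily mangle poetry, tables, or anything else where the line breaks matter.

```python
import textwrap


def rewrap(text, width=70):
    """Rewrap each blank-line-separated paragraph to at most width columns."""
    paragraphs = text.split("\n\n")
    wrapped = []
    for paragraph in paragraphs:
        # Collapse the existing line breaks and spacing, then refill.
        joined = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(joined, width=width))
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "It was a dark and stormy night. " * 5 + "\n\nThe End."
    print(rewrap(sample))
```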
Apart from these rules and recommendations, we also have a rule about the PG header, but that will normally be handled by the Posting Team. Where your HTML is all in one file, the header text will be inserted within PRE tags in that file. Where the HTML is split into multiple pages, the header will be put into a separate file named index.htm or index.html, and will link to the first page of your HTML.
H.5. Can I use Javascript or other scripting languages in my HTML?
No.
We don't want our readers to have to worry about any potential for malicious or just plain buggy code.
H.6. Should I make my HTML edition all on one page, or split it into multiple linked pages?
For a typical novel, one page or HTML file is appropriate, but when that single HTML file gets up around 2 megabytes in size, it may be worth considering a split because of the difficulty of loading it in some browsers.
In some other cases, where the content requires different styles on different pages, or different pages need different character sets, or the page, with images, just gets too heavy, you may need to split the HTML even if the HTML itself isn't technically too big.
When we post a HTML eBook containing multiple files, whether they contain text or images, we post them only in zipped format, so if you don't have images, and want your text to be directly accessible, you should stick to one file where possible.
H.7. How can I check that I haven't made mistakes in coding my HTML?
There are two kinds of mistakes you can make in coding HTML: you can produce invalid HTML, or you can produce HTML that doesn't do what you want.
Checking for invalid HTML is straightforward. The W3C site <http://validator.w3.org> will formally validate your file and point out any mistakes, and this is the official standard. However, it is not always convenient to use, especially when you're in a cycle of fix-and-retest. For this, you should try the program Tidy <http://tidy.sourceforge.net>, which runs on your computer, tells you about errors, and has other useful functions as well. Tidy is available for just about every operating system, and there are several Windows utilities that include Tidy. The links on the main Tidy page will lead you to the right version for you. Tidy is fast and friendly, compared to validation over the web, but it is not the last word. The W3C Validator may find formal errors, such as DOCTYPE mismatches with HTML tags or entities, that Tidy may not. The best solution is to complete your HTML tests using Tidy, and then, when Tidy finds nothing further to gripe about, submit it to <http://validator.w3.org> for the official seal of approval. Please run these checks before submitting your HTML; we can generally fix it for you, but it may take us a lot of work.
Producing HTML that actually does what you want is equally important. If you've converted the eBook from text, you may have created inconsistencies, or closed an italics tag in the wrong place, or used the wrong tag at some points. The only way to check this is by reading through the HTML in a browser.
H.8. Can I submit a HTML or other format of somebody else's text?
Maybe.
This question has several complications. First, you must understand that it is quite possible, even likely, that your HTML file will eventually be overwritten by better information.
The value of a HTML file, as opposed to a plain text file, lies in its ability to capture elements of the original that have been lost in the plain text. A plain text file, using extended character sets like ISO-8859 [V.76] or Unicode [V.77] and _underscores_ for italics, can capture all of the author's intent in almost all cases. Sometimes, images and other important features of the original cannot be captured in plain text alone, but can be captured in HTML, or other markup.
When Michael Hart stopped posting books, in September 2001, we had HTML formats of about 1.6% of all our eBooks. At the end of 2002, that has risen to nearly 11% of all our eBooks. If you have a clearable copy of an existing posted book, with extra features not included in the original plain text, we would encourage you to make a new edition, or version, or format, correcting any errors in the original, and adding any new information not included there.
If, on the other hand, you just want to make a "blind format change"--making your best guess at what the HTML, or other format, layout should be for a book you've never seen, based on the original producer's work--your best bet is to get in touch with the original producer, and ask whether they can supply more material for you to work with. Otherwise, you are at best just rearranging information rather than contributing something new.
A blind format conversion can be done in anything from 2 minutes [R.33] to an hour. It just doesn't make sense for us to keep posting these files when they contain nothing new, and especially when two people may want to convert the same text. It is likely that, at some time in the next couple of years, we will start on a large-scale conversion project, to add some form of markup to all of the existing text files for ease of serving, and having a mish-mash of existing markup styles to deal with at that point won't help either.
H.9. How big can the images be in a HTML file?
The images should be as big as necessary, and no bigger.
Sorry, but there is no clear number to give here. Web page designers sweat blood to save an extra 20K on a page; so should you. If you're an experienced HTML maker, you know this stuff; if you're not, take it as a guideline that you should generally aim to keep your images in the 30K to 50K size range, with occasional forays into 70-80K territory. That's generally big enough for a clear picture, unless you're reproducing fine artwork.
H.10. The images I've scanned are too big for inclusion in HTML. What can I do about it?
This is a common problem, where images from the book occupy a full or half page. Your images should be of an appropriate size for downloading, and 2 megabytes of high-quality scan per image is not really an appropriate size for most PG texts!
You should reduce the size, and maybe the quality, of the original scan for simple viewing purposes. There is lots of image-manipulation software to do this. For Windows, you might look at the freeware Irfanview, and for both *nix and Windows there is ImageMagick [P.1]. Look for the words "resize" and "resample" in the Help.
Apart from simple converters, which do enough for this purpose, you can also manipulate the images in full imaging creation and editing packages like Paint Shop Pro, Adobe Photoshop and The Gimp [P.1].
Different image encoding methods can make a huge difference to the filesize. Any of the packages mentioned above can encode images as GIF, JPEG or PNG, and, particularly for black and white line drawings, these can encode to very different sizes. So, for example, a 60K JPEG may save as a 30K GIF, because the GIF encoding works better for that particular image. Try your images out, and see what works.
When manipulating images, always work from your original. Don't convert your original to a JPEG, and then shrink that and convert it to a GIF. Depending on the format, images may lose definition as they are converted (search for "lossy compression" in your favorite search engine to find out more about this), and they certainly lose definition as they are resized, and you end up with the "imperfect copy of an imperfect copy of an . . ." effect. When you're experimenting, take your original, resize and Save As GIF, then go back to your original, resize and Save As JPG, and so on.
You can also use an image optimizer. These are specialist software programs that try to make image files smaller without sacrificing resolution or detail.
H.11. Can I include decorative images I've made or found?
No.
Please include only the images you got from the book. If you want to make an edition of the book for your own web site, you can of course use whatever you like there, but for PG purposes, we want the book, the whole book, and nothing but the book.
H.12. How can I make a plain text version from a HTML file?
You can edit out the HTML by hand, of course, but there are several easier ways to convert.
You can view the HTML in a browser, Select All text, and just Copy and Paste into your editor. This is easiest, but doesn't handle formatting like tables very well.
You can use the Lynx [P.1] browser to convert your text with the command

  lynx -dump myfile.html > myfile.txt
Bruce Guthrie's HTMSTRIP for MS-DOS [P.1] is very configurable.
<http://www.w3.org/Tools/html2things.html> has a list of other HTML to plain text converters.
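If you would rather see what such a converter does under the hood, here is a bare-bones sketch using the HTML parser in Python's standard library. It is not one of the tools listed above, and the class and function names are invented for this example. It keeps the text, turns entities back into characters, and puts a blank line between paragraphs and headings; like copy-and-paste, it makes no attempt to lay out tables, and it does not rewrap lines.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML file, breaking lines at block tags."""

    BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div", "tr"}

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS or tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of html, with blank lines between blocks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Never leave more than one blank line in a row.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    print(html_to_text("<p>Tom &amp; Jerry</p><p>Second <i>para</i></p>"))
```

Whichever method you use, read through the result: italics, accents and special layout usually need attention by hand.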
H.13. How can I make a HTML version from my plain text file?
This is not a course in HTML, but, for most books, you don't really need a course in HTML. Making a HTML format of most books is very easy, and doesn't take long, once you have mastered basic HTML. Let's assume you have your completed PG plain text file ready, and walk through the steps commonly needed to make a HTML version. We'll do this by successive approximation, doing the major things first, and then dealing more and more with the detail.
There are lots of specialized HTML editors out there, but you don't actually need any of them. The same editor that you used to create your text will also create your HTML. HTML is just text, with two types of special instructions added: tags and entities.
A _tag_ is an instruction to the browser, usually to display something with specific rules. Tags are shown within angled brackets: for example, <p> is the instruction to start a new paragraph.
An _entity_ is a named special character that might not be available in your character set. Entities are shown starting with an ampersand "&" and ending with a semi-colon ";" : for example, &mdash; is the representation of an em-dash.
I'm marking up a made-up short text as I write these steps, loosely based on the sample page from question [V.121]. You can see the changes made at each stage by looking at the files
  htmstep0.txt (text before starting)
  htmstep1.htm (after adding the HTML header and footer)
  htmstep2.htm (after adding paragraph marks)
  htmstep3.htm (after marking main headings)
  htmstep4.htm (after adding special line breaks and indents)
  htmstep5.htm (after adding italics and bold)
  htmstep6.htm (after adding accents and non-ASCII characters)
  htmstep7.htm (after adding an image)
  htmstep8.htm (showing some extra techniques)
Before you start, make sure that you can see these files both in your browser and in your editor. In your editor, you should see the HTML codes; in your browser, you should see the text as it is intended to be viewed.
Note for people who already know HTML: yes, this example omits lots of possible ways to do things, and lots of refinements. You already know how to do what you want to do--skip onwards, and give the beginners room to learn in peace! :-)
Step 1. Add the HTML header and footer information
Add the following lines at the top of your text file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
</head>
<body>
Let's explain these one by one:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
says that your file is HTML 4.01 Transitional, which is the latest version, allowing the widest range of tags and entities.
<html>
denotes the start of the HTML
<head>
denotes the start of the HTML header information.
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
says that the characters are text, using ISO-8859-1 encoding. If you need to use a different character set, you should change ISO-8859-1 to whatever you intend to use. ISO-8859-1 is good for lots of PG books in English that use French or German words.
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
You should obviously change this to the actual title and author you're producing. The
</head>
denotes the end of the HTML header information and
<body>
denotes the start of the actual text itself - the body of the book.
At the very end of the file, you should append these two lines
</body>
</html>
these denote the end of the body of the book, and the end of the HTML.
At this point, you actually have a valid HTML file! OK, if you view it with a browser, it doesn't look anything like the way it's supposed to, but it _is_ HTML. Save it with a name like MYFILE1.HTM or STEP1.HTM and get a copy of Tidy for your DOS, Unix, Mac or Windows system from <http://tidy.sourceforge.net>. Run Tidy on your file, telling it just to look for errors (tidy -e if running from a command-line; if you're using a GUI version, there should be a menu option or tickbox for showing errors only). Tidy should tell you that there are no errors. Yay!
If it does say that there are errors, deal with them now, before you continue. Make sure, at each step, that you have cleaned up any errors; it's a lot easier now than later. Also, when you've finished each step, save your file with a number in its name, so that if you run into problems later and get confused, you can, at worst, drop back to the correct version at the end of the previous step.
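Step 1 is easy enough to do by hand in your editor, but if you have several books to do, it might be sketched in Python like this. The function and variable names are invented for the example, and the meta line uses the conventional HTML 4.01 form for declaring the character set; change the title and character set to suit your own book.

```python
# The header lines that go before the text of the book.
HEADER = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset={charset}">
<title>{title}</title>
</head>
<body>
"""

# The two lines that go after the text of the book.
FOOTER = "</body>\n</html>\n"


def wrap_as_html(text, title, charset="ISO-8859-1"):
    """Put the HTML header before the text and the footer after it."""
    body = text.rstrip("\n") + "\n"  # make sure the footer starts a new line
    return HEADER.format(charset=charset, title=title) + body + FOOTER


if __name__ == "__main__":
    print(wrap_as_html("CHAPTER I\n\nIt was a dark and stormy night.",
                       "The Project Gutenberg eBook of My Book, by A. N. Author"))
```

The text itself is passed through untouched here, so any stray "<", ">" or "&" characters in it still need dealing with before Tidy will be happy.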
The most likely error you might have at this point relates to the characters "<", ">", or "&". These are the characters used by HTML to indicate tags and entities. If these characters are used in the text of your file, (and ampersand is likely to be), you should replace them with entities, so that HTML will know that they are to be displayed as characters, not interpreted as commands.
Replace & with &amp;
        < with &lt;
        > with &gt;
There is an example of this in the file htmstep1.htm
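If you'd rather not make these replacements by hand, a few lines of script can do it. What follows is only a sketch in Python, which this FAQ doesn't otherwise use; the function name escape_markup is made up for illustration. It relies on the standard library's html.escape, and it should be run on the plain text before you add the header lines, since it would escape the tags in the header too.

```python
import html


def escape_markup(text: str) -> str:
    """Replace &, < and > with HTML entities; quote marks are left alone."""
    # html.escape handles "&" first, so "&lt;" is never escaped twice.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_markup("Smith & Jones proved that x < y"))
```

Do check the result by eye afterwards: if your text already contains entities, this sketch will escape their ampersands as well.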
Step 2. Add paragraph marks.
For novels and general prose, paragraphs are the main logical and display unit. Paragraphs are marked in HTML with the <p> sign at the start, and </p> at the end. You don't actually need the </p> at the end, but adding these is a good habit to get into. You do, very much, need the <p> at the start.
The line-lengths within a <p> </p> pair are irrelevant; the browser in which the text is viewed will ignore extra spaces and line-ends, and will wrap text to fit the screen. This is bad for poetry and tables, but we will discuss those later. For this step, all you need to know is that you can leave your text exactly as it is, and just add the paragraph marks.
Put a <p> at the start of the line before the first letter of every paragraph, and a </p> just after the last letter or punctuation of every paragraph. If you can do macros in your editor, this will just take a minute; otherwise, it may be rather boring, but at least it is simple. For this step, put the paragraph marks around _everything_ that has a blank line after it, even poetry or chapter titles. We'll come back and change that later.
Now save your text as something like MYFILE2.HTM or STEP2.HTM. Again, run Tidy to check for errors, and fix them before continuing.
If you now look at the file htmstep2.htm in your browser, you will see that it is starting to take shape. Look at it in your editor, and you will see the paragraph marks.
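For those without editor macros, the paragraph-marking step can also be sketched in Python. This is a minimal sketch, assuming that paragraphs are separated by one or more blank lines and that you feed it only the body text, not the header and footer lines. The name add_paragraph_marks is hypothetical, not part of any PG tool.

```python
import re


def add_paragraph_marks(text: str) -> str:
    """Wrap every blank-line-separated block of text in <p> ... </p>."""
    # Split wherever one or more blank (or whitespace-only) lines occur.
    blocks = re.split(r"\n\s*\n", text.strip())
    marked = [f"<p>{block.strip()}</p>" for block in blocks if block.strip()]
    # Keep a blank line between paragraphs so the source stays readable.
    return "\n\n".join(marked)


if __name__ == "__main__":
    sample = "Chapter I\n\nIt was a dark\nand stormy night.\n\nThe end.\n"
    print(add_paragraph_marks(sample))
```

Line breaks inside a paragraph are left exactly as they were, which is fine, since the browser ignores them anyway.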
Step 3. Add marks for headings.
We want to indicate to the reader that certain lines are for chapter or other headings. HTML provides the tags <h1>, <h2>, <h3> and so on for this. <h1> is for the biggest heading, and usually, you will reserve this for the title, and use <h2> for chapter headings. If you find these too big, you could choose <h2> for main headings, and <h3> for chapters. Whenever you use one of these header tags, you must close it with its equivalent end tag. So a chapter heading might look like:
<h2>Chapter XI</h2>
Since there won't be many headers, and most headers are only on one line, this is usually not hard. Look at the file htmstep3.htm to see how our sample is improving, and if you're working along with me, don't forget to save your file under a new name and check it.
In our example, we have marked some lines with paragraph marks where we now want to put headings, so we will change those <p> tags into heading tags, since we don't need or want to mark a line as both.
Step 4. Line up verse, tables of contents, and other lists.
The HTML tag <br> tells the browser to force a line break without starting a new paragraph. We use this when we don't want text all wrapped together, but not separated with blank lines either, for example in verse and tables of contents.
In our sample, we add the <br> tag to the end of each line in the table of contents and the end of each line of the verse. If we were working on a whole book of poetry, the same principle would apply, but we'd be using the <br> tag a lot more.
Where we want to indent a line of poetry, we can use "&nbsp;" at the start of the line. Normally, however many spaces you leave between words, HTML condenses them to one space, so normal indentation doesn't work. But the "non-breaking space" entity will cause the browser to show one space for each entity, so that you can indent as much as you need.
The file htmstep4.htm shows the effect: this is now an entirely readable HTML text!
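The line-break and indentation rules of this step can be sketched in Python as well. This is a rough illustration only, assuming you pass it just the lines of a verse or a table of contents, and that indentation in the text was made with ordinary spaces. The helper name mark_verse is invented for this example.

```python
def mark_verse(lines):
    """Add <br> to the end of each line and turn leading spaces into &nbsp;."""
    marked = []
    for line in lines:
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        # One &nbsp; entity per leading space, so the browser keeps the indent.
        marked.append("&nbsp;" * indent + stripped + "<br>")
    return marked


if __name__ == "__main__":
    for out in mark_verse(["First line of the verse,", "    an indented second line"]):
        print(out)
```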
Step 5. Add back in italics and bold.
The HTML tag <i> tells the browser to start displaying italics, and the </i> tells it to stop. Similarly, the <b> tag tells it to display bold, and </b> marks the end of the bold text. See htmstep5.htm for the changes.
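If the italics in your plain text were marked with underscores, the way this FAQ marks its own emphasis (_like this_), a search and replace can put the tags back. The sketch below is in Python and rests on that assumption about your text; if your book marks italics some other way, or uses underscores for anything else, it will not do what you want. The function name italicize is made up.

```python
import re


def italicize(text: str) -> str:
    """Turn _underscored_ phrases into <i>...</i>, one line at a time."""
    # The phrase may not contain an underscore or a line break, so a stray
    # single underscore is left untouched.
    return re.sub(r"_([^_\n]+)_", r"<i>\1</i>", text)


if __name__ == "__main__":
    print(italicize("but it _is_ HTML"))
```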
Step 6. Restore accents and special characters.
Since we declared our HTML file to use ISO-8859-1 back at the start, we can use any of the common accented characters for Western European languages, but we may also use HTML entities. For example, for the "a circumflex" in "flaneur", we can use either the ISO-8859 character directly, or the HTML entity name "&acirc;" or number "&#226;".
There is a trade-off between characters and entities: entities do not limit you to any particular character set, but characters are directly readable when looking at the HTML source.
Within entities, there is also a trade-off between entity names and numbers: older browsers may not recognize some of the entity names, but the entities do make the text work in multiple character sets. Which you choose is entirely up to you, but it's best to be consistent; if you like entities, use them everywhere. Entities can be represented by their names--for example, &mdash;--or by their number, derived from their ISO-10646 (see Unicode) number--for example, &#8212;.
There are other special character entities you may choose, to replace the ASCII equivalents in the main text. Here are some of the common ones:
We've already seen
&amp;    &#38;    ampersand      replaces "&"
&lt;     &#60;    less than      replaces "<"
&gt;     &#62;    greater than   replaces ">"
&nbsp;   &#160;   space          replaces a space when you want to indent
and these are also very useful for many PG texts:
&mdash;  &#8212;  em-dash        replaces "--"
&deg;    &#176;   degree         replaces "deg." or "degrees"
&pound;  &#163;   British pound  replaces "L" or "l" or "pounds"
There are many others. <http://www.w3.org/TR/html4/sgml/entities.html> has a fuller list. Please note that you don't _have_ to use these entities in your HTML; if you're happy with the text reading "500 pounds", there is no need to make that "&pound;500".
I've made a couple of entity changes in htmstep6.htm.
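If you decide to use entities everywhere, the conversion from characters can be automated. Here is a sketch in Python showing both choices, names and numbers. The function names are invented for illustration; the name table comes from Python's standard html.entities module, and any character without a name in that table falls back to its number.

```python
from html.entities import codepoint2name


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with its numbered entity."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with entity names where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 127:
            out.append(ch)  # plain ASCII is left exactly as it is
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")  # no name known, so use the number
    return "".join(out)


if __name__ == "__main__":
    word = "fl\u00e2neur"  # "flaneur" with an a circumflex
    print(to_named_entities(word))    # fl&acirc;neur
    print(to_numeric_entities(word))  # fl&#226;neur
```

Neither function touches "&", "<" or ">", so those still need the treatment from Step 1.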
Step 7. Link Images into the text.
First, you need to have your image ready. You should already have resized your image to the size you want it to be viewed at. You should also have saved it as a GIF, JPG, or PNG image, since those are the formats most supported by current browsers.
If your image is named front.gif, and it is a picture of the frontispiece of the book, you should add the line
<img src="front.gif" alt="Frontispiece">
to your HTML at the place where you want it displayed.
The "alt" text gives a label to the image, and is displayed if the image can't be shown, or in the case of a browser for visually impaired people.
You don't _have_ to add images with your HTML file, unless you want to. In many older books, there are no images at all to be added.
My final HTML text is now in htmstep7.htm. You need to have the image front.gif in the same directory in order to see it. When your HTML text is posted, the images will be zipped with it, so that future readers can see them.
Step 8. Over to you!
This is enough to make a reasonable HTML format of most PG texts, but it doesn't begin to cover everything that can be done in HTML. If you've gone this far, I recommend the W3C's tutorials:
<http://www.w3.org/MarkUp/Guide/>
and
<http://www.w3.org/MarkUp/Guide/Advanced.html>
which cover the ground we've just crossed, and go a bit further.
Here are a few more things you might want to know, but don't go nuts adding tags just because you can! Use them only when you really need them. The file htmstep8.htm shows some of these techniques. Personally, I think that this is a bit overdone, and I prefer the effect of htmstep7, with left-aligned chapter headings, but that's a matter of taste.
Once you're used to the basic HTML needed for most PG eBooks, you'll probably be able to convert one in under an hour.
How do I force more space between specific paragraphs?
Insert a blank paragraph like this: <p>&nbsp;</p> or use an extra <br> tag.
How do I make text, or image, or headings centered?
Put the <center> and </center> tags around what you want centered, like: <center>Chapter 12</center>
How do I make some text bigger or smaller?
Put the <big> and </big>, or <small> and </small> tags around it.
How do I lay out tabular information?
The simplest way to do it is with the <pre> and </pre> tags. These will cause whatever is within them to be displayed as plain text, just as it was in the original, so that spaces separate the entries just as they did in the text version. You can also use this for poetry, though you usually won't need to. It's not entirely satisfactory, but it will work.
Making a full HTML table requires you to use the <table>, <tr> (table row), and <td> (table detail) tags, among others, and a full exposition of tables is beyond the scope of this FAQ.
Briefly, you start a table with the <table> tag.
For each row you want in the table, you open and close a table row tag, like:
<tr>
</tr>
and then for each cell within a row, you specify a <td> tag and the contents of that cell:
<table>
<tr>
<td>This is the Top Left cell</td> <td>This is the Top Right cell</td>
</tr>
<tr>
<td>This is the Bottom Left cell</td> <td>This is the Bottom Right cell</td>
</tr>
</table>
This only scratches the surface of tables. However, there are many guides available on the Web, and they're easy to find, once you know which tags you're looking for. A brief discussion of tables is provided by the W3C as part of the HTML 4.01 spec at <http://www.w3.org/TR/html4/struct/tables.html#h-11.5> and the tutorial at <http://www.w3.org/MarkUp/Guide/Advanced.html> also shows how to make HTML tables.
Step 9. Some common problems
When you're just starting to code HTML, it may seem that errors are coming at you from all sides. Tidy may spew out a stream of complaints that you don't recognize or understand. If it's any consolation, this is normal!
Just take the error list one line at a time, starting at the top. Often, one actual mistake, like not closing a tag, may cause many errors, since an unclosed tag can cause many subsequent tags to be reported as errors.
Common errors include:
1. Simple typos in tags, like <h2Chapter 3</h2> instead of <h2>Chapter 3</h2>
2. Unclosed tags, like forgetting to add the </table> in the sample above, or forgetting the slash in the closing tag so that you type <i>italics<i> instead of <i>italics</i>.
3. Not nesting tags correctly. Get used to thinking of tags as brackets; the first one opened should be the last one closed. For example, you should type: <center><i>This is centered.</i></center> instead of <center><i>This is centered.</center></i>
One option for making an HTML version is to use GutenMark <http://www.sandroid.com/GutenMark/> to create the basic HTML straight from your text, and then edit the resulting HTML to add the features you want. If you're having a lot of problems with your main conversion, this is worth a try.
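The "tags as brackets" rule is easy to picture as a stack: each opening tag goes on top, and each closing tag must match whatever is on top at that moment. The Python sketch below shows the idea using the standard html.parser module. It is an illustration of the nesting rule only, not a replacement for Tidy; the names NestingChecker and check_nesting are invented, and the short list of tags that never take a closing tag is my own assumption, not a complete one. It also treats a <p> with no </p> as unclosed, which is stricter than HTML itself.

```python
from html.parser import HTMLParser

# Tags that have no closing tag (an assumed, incomplete list).
VOID_TAGS = {"br", "img", "meta", "hr"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record closing tags that don't match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing forms such as <br/> open nothing

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_nesting(markup: str):
    """Return a list of nesting complaints; an empty list means none found."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]


if __name__ == "__main__":
    print(check_nesting("<center><i>This is centered.</i></center>"))
    print(check_nesting("<center><i>This is centered.</center></i>"))
```

Notice that one misplaced closing tag produces two complaints in the second case, which is the same snowball effect you will see in Tidy's error list.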
Programs and programmers FAQ
P.1. What useful programs are available for Project Gutenberg work?
These suggestions came largely from a poll of volunteers in June, 2002. The programs listed are a summary of the programs we actually use. There are many other programs out there that can do the same jobs, so don't limit your search just to these.
1. OCR
Abbyy <http://www.abbyy.com>
OmniPage <http://www.omnipage.com>
TextBridge <http://www.textbridge.com>
These are the three main commercial packages that volunteers bought specifically for the purpose. In a few cases, people had got older versions of these bundled with their scanners.
Clara OCR <http://www.claraocr.org/>
Gocr <http://jocr.sourceforge.net>
These are Free Software packages. Some people who responded to the survey had tried them, but nobody had actually used them to produce a text.
DocMorph -- a free, web-based OCR <http://docmorph.nlm.nih.gov/docmorph/>
This one is interesting--you can just submit your image through a web page, and the service will return OCRed text. However, the process of submission, waiting for your text, and then cutting and pasting into your document is slow.
Other volunteers use various OCR software that came bundled with their scanner.
2. Editing
The main answers, given by more than one person, were:
AbiWord <http://www.abiword.org>
emacs
Microsoft Word
vi
Windows WordPad
Word Perfect
Other editors mentioned included:
Crisp for Windows <http://www.crisp.demon.co.uk/>
EditPad <http://www.editpadpro.com>
Editplus for Windows <http://editplus.com/>
Foxpro 2.6 for DOS
Metapad <http://www.liquidninja.com/metapad/>
Windows Notepad
Programs recommended by Apple Macintosh users included:
AppleWorks
BBEdit Lite <http://www.barebones.com/products/bbedit_lite.html>
Microsoft Word
Nisus Writer <http://www.nisus.com/>
Text-Edit Plus <http://hometown.aol.com/tombb>
TextSpresso <http://www.taylor-design.com/textspresso/>
Add/Strip <ftp://mirrors.aol.com/pub/info-mac/_Text_Processing/>
3. Checking and proofing
For spelling, most people just use the spellchecker built into their editor or word-processor. The *nix users running emacs or vi tended to use variants of the standard Unix spell command, such as ispell or aspell. Mac users have the free spelling checker Excalibur, available from <http://www.eg.bucknell.edu/~excalibr/excalibur.html>.
Gutcheck <http://gutcheck.sourceforge.net> was used for format checking, and a few people had written some checking procedures of their own.
4. Working with HTML
In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.
Specific HTML editors that were mentioned for occasional use were:
Adobe PageMill (no longer available)
Mozilla Composer <http://www.mozilla.org>
HTMLKit <http://www.chami.com/html-kit/>
HTMLPad <http://www.intermania.com/htmlpad/>
However, not all HTML work is about editing, and the following packages were honorably mentioned for other functions. Especially important is Tidy, which is pretty much necessary for all but the most experienced people for quick HTML checking. <http://tidy.sourceforge.net> has the original, and links to versions of Tidy for Windows (Tidy-GUI) and just about all other platforms.
GutenMark: Converts Project Gutenberg texts to HTML and TeX. <http://www.sandroid.com/GutenMark/>
HTMSTRIP by Bruce Guthrie: MS-DOS. Converts HTML to text <http://users.erols.com/waynesof/bruce.htm>
Lynx (lynx --dump): Converts HTML to text <http://www.lynx.org>
Dave Raggett's HTML Tidy: Checks HTML for correctness, reformats and fixes <http://tidy.sourceforge.net>
W3C html2txt (web-based): Converts HTML to plain text. <http://cgi.w3.org/cgi-bin/html2txt>
W3C Validator (web-based): The Last Word on the correctness of HTML. <http://validator.w3.org>
wget: A very neat utility for getting web pages <http://www.wget.org/>
5. Working with images.
There are two main applications of images in PG--images to be used within texts, like illustrations in HTML, and the management of page images for scanning. These packages are used by volunteers variously for both of those purposes. Their typical use within PG is indicated. "Advanced image processing" packages will permit you to edit and restore damaged images, but for PG work, we mostly just need to manage, convert, resize and crop them.
ACDSEE for Windows For image reviewing <http://www.acdsystems.com>
Adobe Photoshop For advanced image processing <http://www.adobe.com/products/photoshop/main.html>
ImageMagick for *nix, Mac and Windows Resizing and format conversion <http://www.imagemagick.org/>
Irfanview for Windows Image viewing, conversion, cropping and resizing <http://www.irfanview.com>
The Gimp For advanced image processing <http://www.gimp.org/>
Picture Publisher For advanced image processing <http://www.micrografx.com/mgxproducts/picturepublisher.asp>
VuePrint Pro For viewing images <http://www.hamrick.com/>
Proofreaders' Toolkit (PRTK) For splitting batches of image files into individual pages <http://robertrowe.dns2go.com/>
P.2. What programs could I write to help with PG work?
Look at the programs listed above in [P.1]. Can you write a better version of any of them? Improving OCR and editors constitutes a major challenge, unless you're a world-class expert, but checking and reformatting texts is an area not addressed by large scale programs, and you might contribute there.
Formats FAQ
F.1. What formats does Project Gutenberg publish?
In principle, there's no format that we won't publish, but, in practice, we prefer formats that are open and editable.
An open format is one whose structure is publicly defined and documented, and not burdened with patent or trade secret or copy-protection (a.k.a. "DRM") restrictions. Anyone can write a reader or creator for an open format, and in 500 years' time, anyone interested will still be able to write a program to display the file. Closed formats, by contrast, will almost certainly be unreadable in just a few decades, when the companies now promoting them disappear, or lose interest, or decide to stop supporting them because they want to sell a replacement.
Being able to edit the file is also important. We make corrections to our editions constantly, and it is important to us that we should be able to update our files easily. If adding one word to a sentence involves a complete re-marking of the whole text and a complete rebuild of the file, we have to ask ourselves whether this format is really necessary for this text. Further, the people who re-use our texts should also be allowed to copy and reformat them freely, and non-editable formats restrict their ability to do this in various ways.
F.2. What is, and how do I make or use:
[Note: Character sets and formats are both listed here. Character sets refer to the characters you can use; formats describe how those characters are put together. For non-text formats such as music files, there is no exact equivalent to a character set.]
ASCII (Character Set)
ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.
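Since every ASCII character has a number from 0 through 127, checking whether a text is plain ASCII comes down to looking for any character with a higher number. A small Python sketch of that check follows; find_non_ascii is a made-up name, and the report format is just one convenient choice.

```python
def find_non_ascii(text: str):
    """Return (line number, column, character) for each non-ASCII character."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # ASCII stops at 127
                found.append((line_no, column, ch))
    return found


if __name__ == "__main__":
    sample = "plain text\ncaf\u00e9 au lait"
    for line_no, column, ch in find_non_ascii(sample):
        print(f"line {line_no}, column {column}: character number {ord(ch)}")
```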
You can view or edit ASCII text using just about every text editor or viewer in the world.
Big-5 (Character Set)
Big-5 is a set of 13,494 traditional Chinese characters. You will need to use an editor or viewer that supports the character set.
Codepage 437, 850, 1252, etc. (Character Sets)
These codepages are Microsoft-specific character sets which allow the display of accented characters and other symbols. To view a text that uses one of these, you will have to use a Microsoft application that supports them. Many of the fonts supplied with Word for Windows will display and edit CP-1252 correctly. For Codepages 437 and 850, you may have to open a Command Prompt and use a DOS editor like EDIT. A search from <http://www.microsoft.com> should bring up information about the codepage you're interested in, or you can read the excellent overview at <http://czyborra.com/charsets/codepages.html>. For Unix users, iconv and recode provide translation facilities from one character set to another, and support many or all of the MS codepages.
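If you have neither iconv nor recode to hand, the same kind of translation can be sketched in Python, whose standard codecs include cp437, cp850, cp1252, iso-8859-1 and mac_roman. This is a minimal sketch, not a description of how iconv or recode work inside; convert_charset is a hypothetical helper, and it will raise an error rather than guess if a character has no equivalent in the target set.

```python
def convert_charset(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character set into another."""
    # Decode to characters first, then encode those characters again.
    return data.decode(source).encode(target)


if __name__ == "__main__":
    dos_bytes = b"caf\x82"  # "cafe" with an e acute, as stored in Codepage 437
    latin1_bytes = convert_charset(dos_bytes, "cp437", "iso-8859-1")
    print(latin1_bytes)  # b'caf\xe9'
```

The e acute is byte 130 in Codepage 437 but byte 233 in ISO-8859-1, which is exactly why a file in one set looks wrong when viewed as the other.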
DVI
DVI stands for DeVice Independent, and is commonly used to store text and instructions for displaying it involving complex mathematical symbols and expressions, though it can be used for any content. Given a DVI file, you need a viewer to render it on the specific device you're using. Specifically, DVI is used as the standard output format for TeX, discussed below.
HTML/HTM (Format)
HyperText Markup Language defines the standard format of web pages. You should be able to view these with any web browser, and edit them with any text editor or a specialized HTML editor. <http://w3.org> is the definitive reference.
ISO-8859/ISO-Latin (Character Sets)
ISO-8859 is a series of character sets used to represent the accented characters most commonly used in European languages. There's ISO-8859-1, ISO-8859-2, and so on. ISO-Latin is just another name for the same thing. You can read the overview at <http://czyborra.com/charsets/iso8859.html>
LIT (Format for PDA-based eBooks)
This is a proprietary, closed format for files that can be displayed only by the Microsoft Reader. Search <http://www.microsoft.com> for more information. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.
MacRoman (Character Set)
MacRoman is an 8-bit Apple Mac-specific character set which allows the display of accented characters and other symbols. To view a text that uses MacRoman, you will have to use an application that supports it, and there are few outside the Apple fold. However, iconv and recode are programs that convert between many character sets, and MacRoman is supported by both.
MID/MIDI (Format for music)
Musical Instrument Digital Interface is a music description language, encompassing not only file formats but definitions of interfaces. A MIDI file contains instructions for sending messages to a musical instrument to recreate the sounds. <http://www.midi.org/> has much more on this.
MP3 (Format for any audio file)
MPEG-1 Audio Layer 3 was defined by the Moving Picture Experts Group as a means for encoding sounds. Many, many MP3 players exist for all platforms, and can be found easily with a Net search. The official home page of the MPEG is <http://mpeg.telecomitalialab.com/> and copies of the specification can be purchased from the ISO at <http://www.iso.ch>
MPEG/MPG (Format for moving pictures)
The Moving Picture Experts Group have released a series of formats for encoding video and audio. MPEG (pronounced EM-peg) formats are published and widely used. The official home page of the MPEG is <http://mpeg.telecomitalialab.com/> but you will find information about MPEG formats, and software to play MPEG files, all over the Net. You can also purchase specifications through <http://www.iso.ch>
MUS (Format for music)
MUS from Coda Music <http://www.codamusic.com/> is a proprietary, closed format for editing and replaying sheet music. However, we do post music files in this format because of its many features. We hope to be able to post these also in more open standards at some point in the future, but at the moment, there is no open format with similar capabilities. You can find out more about this at <http://www.ibiblio.org/gutenberg/music/music_helpex.html#what-software>
PDB (Format for PDA-based eBooks)
The Palm Data Base format can actually be used for purposes other than eBooks, and there are many possible variants of formats for Palm-based readers all using the extension PDB on PCs, and they're not all entirely compatible. Some of them are proprietary, and it may not be possible to edit them directly, or export files from these formats; they have to be made in another format and converted. Some can be converted back to text. The most common, though, is the "Palm-DOC" format, which is an open format and can be edited on the Palm itself.
PDF (Format for eBooks)
Portable Document Format is a format for storing texts, containing any fonts or graphics. It is copyrighted by Adobe, <http://www.adobe.com> but is well and publicly documented. It is sometimes referred to as a kind of compiled Postscript (see PS below). It is viewable using the Adobe Acrobat Reader. It is not possible to edit files in this format.
PRC (Format for PDA-based eBooks)
This is a proprietary format for files that can be displayed only by the MobiPocket Reader. See <http://www.mobipocket.com> for more information. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.
PS (Format for text and graphics)
Postscript is technically a programming language, not just a format. It has conditional statements, procedures and program flow control. However, it is commonly referred to as a format. Adobe <http://www.adobe.com> holds copyright on the Postscript specifications (there have been three "levels" published) but Postscript is well and publicly documented and has wide support, not only in printing, but in screen display as well. Apart from Adobe's official version, you can also render Postscript files with Ghostscript, a Free Software package. Postscript can be edited directly, but any complex editing may present difficulties.
RTF (Format for text)
Rich Text Format was originally a Microsoft specification, but it is an open format that is used by many word processors to exchange text and format information in an application-independent way. Nearly all current word processors will read and edit an RTF file, and, like HTML, it can also be edited as plain text.
TXT
TXT is a generic extension used for any plain text file, regardless of the character set. Thus, while most of our .TXT files contain ASCII, some contain ISO-8859 or Big-5 or Unicode.
TeX (Format for typesetting, printing and viewing)
TeX (pronounced "tech"--the "X" is actually the Greek letter chi) is a public domain format created by Donald Knuth for typesetting, though it can also be used for normal printing and viewing. TeX consists mostly of the plain text, with instructions for how it is to be displayed. This is compiled into DVI format (see above) which can be rendered onto any device, like a printer or screen, by a program that is aware of the device's capabilities. The Comprehensive TeX Archive Network <http://www.ctan.org/> is the best place to start looking for TeX-related programs for your platform.
Unicode/UTF-8, UTF-16, UTF-32 (Character Set)
Unicode is intended to be a single character set that can handle all of the characters in all of the languages that ever were, or ever will be. It accords with the ISO-10646 standard for the characters, but, in addition, imposes rules of implementation. UTF-8, UTF-16, UTF-32 and their variants are ways of expressing Unicode using different rules for transforming bytes into characters. Unicode is steadily gaining ground, with at least some support in every major operating system, but we're nowhere near the point where everyone can just open a text based on Unicode and read and edit it. Check <http://www.unicode.org> for more.
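To see how the different transformation rules turn the same characters into different bytes, here is a short Python sketch. The function name encoded_sizes is invented; the little-endian variants are used simply so that no byte-order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


if __name__ == "__main__":
    print(encoded_sizes("A"))       # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
    print(encoded_sizes("\u00e9"))  # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
```

A plain ASCII letter takes one byte in UTF-8, so an ASCII text is already valid UTF-8; an accented letter such as e acute takes two.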
XML (Format for . . . well, just about anything :-)
eXtensible Markup Language looks a bit like HTML, but whereas tags have a standard meaning in HTML, XML allows anyone to define their own set of tags and meanings using a Document Type Definition (DTD) file. Add a CSS (Cascading Style Sheets) file to that, and you have the ability to display the text according to predefined rules. In principle, this seems to make it ideal for the storage and processing of etexts, since a suitable DTD and CSS, together with the right programs, should make it possible to produce any format of eBook automatically from an XML original. Some PG volunteers have looked at, and are looking at, ways to convert the entire archive using a satisfactory DTD; however, in the meantime we aren't actually producing much XML, since most volunteers aren't working with it, and nobody wants to start producing many XML texts until we have agreed on a DTD. <http://www.w3.org/XML/> is the definitive source for more information about XML.
Volunteers' Voices
In this section, we asked volunteers to talk about their practical experiences with Project Gutenberg, how they joined, why they give up their hours to work for Free Etexts, how they get down to the nitty-gritty of producing texts.
Some people chose an interview format for their responses, with pre-set questions; others just wrote.
Amy Zelmer
I stumbled across Project Gutenberg a couple of years ago--can't remember just what I was looking for on the web but the idea of PG intrigued me. I was also looking for something to get me reading materials which I wouldn't ordinarily read, so didn't particularly want to find a book in which I was interested--and the whole process of finding a book, finding out if it was already "in progress" and then checking out copyright clearance seemed just a little daunting from what I was able to gather from the info on the web.
Furthermore, I live in a small regional city in Australia, so the possibility of finding something in either the local library or in a second-hand bookshop was next to nil.
Fortunately I also found Sue Asscher's name and figured that I'd ask a fellow Aussie how to get started. Sue seems to have an inexhaustible stock of books waiting to be entered -- and got me started on Thomas Huxley's "Essays and Lectures". I've now done five other books and am currently working on Darwin's "The Power of Movement in Plants"--quite a variety, but it's at least met my goal of reading something different.
Fortunately Sue was also patient about answering my beginner's questions about formatting dilemmas and has been able to co-ordinate other aspects of the process, like getting scans of diagrams and final proof-reading. That means all I have to do is put in the text.
I'm a reasonably good typist -- and the practice with PG is certainly improving both my speed and accuracy! (That's meant as a word of encouragement to others.) I generally type for about 20 minutes at a time, then take a break; both my concentration and desire to prevent RSI (repetitive strain injury or occupational overuse syndrome) mean that it's better to do shorter sessions more frequently than to carry on for too long a time. I generally use Microsoft Word 2001 for Macintosh for the first entry and spell check, then save the material in "text only" and do a final read through, removing page numbers and correcting errors which the spell-checker missed as I go.
I've also done some data input for another ebook collection. However, they separate the text and send out small batches of pages to many volunteers. I find that rather frustrating since it's impossible to see how your piece fits until the whole thing is finally posted.
I've done some scanning, OCR and proof-reading of material, but generally find the close proof-reading which is required very frustrating. To each his own method.
Ben Crowder
I've been a book lover ever since the day I learned to read. Several years ago I discovered Project Gutenberg while surfing the net and was delighted to find so many good books freely available. I downloaded all the etexts I was interested in and read quite a few of them. After a few years, I decided to get more involved, so I started proofing with Distributed Proofreaders. I liked that a lot -- I was a newspaper editor in high school for two years -- but I felt an itch to try to produce etexts on my own. I didn't have a scanner, however, so the only solution I could see at the time was to find a book and start typing it in by hand. I'm a relatively fast typist and I figured it wouldn't take that long.
So, I went to my university library, found a pre-1923 edition of G.K. Chesterton's _The Ball and the Cross_ (Chesterton is one of my favorite writers), and began typing. It took much longer than I expected -- certainly over 30 hours, perhaps even close to 50. When I finished, I came across a page on the PG site that mentioned there should be two spaces between sentences. I looked at the etext I'd just typed in and realized in horror that I'd used single spaces the whole way through. :) [1] I had been *sure* that PG used single spaces, convinced that I'd read it in one of the PG docs, which had taken a little while to get used to since I normally use two spaces. But all the PG etexts I checked had two spaces between sentences, so I began the monotonous task of adding an extra space between each sentence (and being very careful not to add spaces in where they shouldn't be). Several hours later the book was finally done. I'd gotten copyright clearance before I started, so I soon submitted it and within a few days I saw those lovely words in my inbox, "Posted (#5265, Chesterton)".
[1] Ben was right both times: people have posted advocating both one space and two. Either would have been accepted!--jt
Since then, I've been addicted to producing etexts. Languages interest me greatly, so I found an Old Icelandic primer that someone had scanned in, OCRed the images using DocMorph (it didn't take as long as I thought it would, and the output was decent enough to work with), and realized I would have a problem entering in the foreign characters (o's with hooks underneath, etc.). Thank heavens for Unicode. Vim (my editor of choice) has fairly good Unicode support and it didn't take long to make a list of the Unicode codes for the Icelandic characters.
As noted, I use Vim for all my editing. I can rewrap lines to 65 characters by typing "gq", I can use regular expressions for search and replaces (*very* handy), I can edit in Unicode when I need to, and I can speed things up greatly by making keyboard mappings for repetitive tasks. (On one text I was working on, I had to add a blank line between each paragraph. Each was numbered, but the blank lines had somehow been taken out before I got the text, so I started going through and adding them in by hand. The file was 30,000 lines long, however, and I quickly realized it would take a *long* time. I then noted which keys I was pressing to add the blank line between each paragraph, mapped them to a single key, and held the key down while Vim zipped through the rest of the file. It sped it up by a factor of over a hundred.)
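The same blank-line job could also be scripted. The sketch below is in Python rather than Vim and is not the mapping described above; it assumes, purely for illustration, that each paragraph begins with a number followed by a period, since the text says only that the paragraphs were numbered.

```python
import re

# Assumption: a paragraph starts on a line that begins with digits and a
# period, for example "12. ".
_PARA_START = re.compile(r'^\d+\.\s')


def add_blank_lines(lines):
    """Insert an empty line before each numbered paragraph that lacks one."""
    out = []
    for line in lines:
        # Only add a blank line if the previous line exists and is not blank.
        if _PARA_START.match(line) and out and out[-1].strip():
            out.append('')
        out.append(line)
    return out


if __name__ == "__main__":
    sample = ["1. First paragraph", "continues here", "2. Second paragraph"]
    print("\n".join(add_blank_lines(sample)))
```

Because the function leaves existing blank lines alone, running it twice over the same file does no harm.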
My university library is well-stocked and has lots of old books, so I usually rely on it when I need to get TP&V's for texts I'm not typing in myself. I still don't have a scanner, so I either find already-existing texts on the Internet and reformat them for Project Gutenberg (after getting permission, of course), or find page images on the net and OCR them myself, or type the books in by hand. Typing in by hand takes a long time and so I prefer the first two methods.
Volunteering with Project Gutenberg has been extremely satisfying. The people are wonderful to work with, the work is fun, and it feels very good to know that one is making a difference in the world.
Col Choat
How I got started
People sometimes ask me how I got started in preparing etexts for Project Gutenberg, and while they probably ARE interested in my story, often they are really more interested in finding out whether it is something that they might want to get involved with. Jim Tinsley, a colleague at PG, recently prepared a "questionnaire" as a way of stimulating existing volunteers to document their PG experiences. Answering the questionnaire seems as good a way as any to answer the question, "how did you get started?"
HOW DID YOU LEARN ABOUT PG?
I think it was probably from a newspaper or a computer magazine. I can't really recall, now.
WHAT WAS YOUR FIRST CONTACT LIKE?
Initially, I visited the site to search for books I was interested in, to see if they had been posted at PG. That was quite a straightforward process. I downloaded a few texts and either read them at my computer or, occasionally, printed them out to read later.
When I became interested in volunteering, I visited the site to get some information about how to go about it. I found it a bit daunting, really. There was a lot of information but it was difficult for me to get it sorted out in my mind. There were copyright issues, editing rules, and procedures for lodging etexts. There was a question and answer page and some background and information for those wanting to subscribe to the PG mailing lists. In the end, I just sent an e-mail to Michael Hart, whose e-mail address was listed on the site, and said "what can I do?" I notice that volunteers still sometimes do that.
WHAT WAS THE FIRST PG JOB YOU DID? HOW DID IT GO?
I decided to prepare an etext from a book I had in my home library, titled "UNDER THE NORTHERN LIGHTS". It is a series of short stories about the Canadian North by Alan Sullivan. I had a small "hand" scanner at home, which I hadn't used much before. I didn't know any better, so I would scan in about ten pages and save them as "tif" files. Then I would use the OCR (Optical Character Recognition) software supplied with the scanner to convert the image to text for subsequent editing. I recently purchased an A4 scanner with state-of-the-art OCR software and I can't believe how I persevered with that hand scanner for so long.
I tried to apply the editing rules outlined on the PG site, though they weren't as prescriptive as I would have liked. I wanted certainty, as I felt that I didn't know enough to apply my own editing rules. I didn't have a good text editor, either, so I probably made the job more difficult than it needed to be. More about the "tools of the trade" later, though.
When I submitted the title pages of the book to PG for copyright clearance it was rejected because the book was published in 1926. I don't know what I was thinking about when I chose it. It must have just LOOKED old enough. I had scanned and proofed about half of it, so I just abandoned it and looked for something else. Interestingly, Australians and residents of other countries with similar copyright laws can now read it, as it is in the public domain in Australia and is now on the Project Gutenberg of Australia site. I was able to finish it and post it at PG, after all.
HOW DID YOU DEVELOP YOUR PG EXPERIENCE FROM THERE?
I think that one of the most valuable things I did was to join the volunteer discussion group. I found that I didn't need to take part, but could just take note of all the different issues raised by other volunteers. Some days there was no activity by the group, but then a hot topic would be raised (e.g. whether some books, such as Mein Kampf by Adolf Hitler, should not be accepted by PG, even if eligible) and there would be plenty of comments. I realised also that I could ask for help on specific questions regarding preparation of texts and receive prompt informative answers. Once, when I thought that I was sending to ONE of the members of the group an e-mail with a large attachment, I was quickly made aware that EVERYONE had received it. Some weren't amused, but I am a quick learner--I didn't do it again.
Subscribing to the weekly newsletter is also worthwhile. There is a link on the main page of the PG web site to allow people to subscribe to the mailing list and discussion group. I also found a few people who I began to e-mail privately, outside the discussion group. That helped a lot, too. Perhaps there is merit in instigating a mentor scheme, whereby a new volunteer can refer to another more experienced one for help, guidance and encouragement. I would be interested in taking part in that.
CAN YOU TELL US ABOUT THE FIRST TEXT YOU PRODUCED?
As I mentioned earlier, my first attempt was abortive (initially, at least). However, as I had realised that there was not much Australian content on PG, I decided to go in that direction. Then I found that there were many eligible Australian titles already on the internet, mostly in HTML format. These can only be read using a web browser, so I decided that it would be worthwhile to download them, convert them to text files, compare them with a book of the same title which was eligible for PG copyright approval, and then have them posted at PG. I had learned my lesson, so from then on I always got the approval BEFORE I started work on the conversion.
I prepared a number of etexts using this method and quickly increased the amount of Australian content at PG. However, I still wanted to create an etext from a book. My sister had given me, as a gift, "Australia's Greatest Books" by Geoffrey Dutton, which reviewed approximately one hundred books and I decided to work my way through them. I had already converted a number from HTML, as outlined above, so the first on the list to be scanned turned out to be the journal of Charles Sturt who explored south-eastern Australia between 1828 and 1831. I was quite pleased with myself when the two volumes were finally posted at PG.
WHY DO YOU SPEND YOUR HOURS CONTRIBUTING TO PG?
The simple answer is "because it is FUN". It is easy to make up justifications, but since there is no necessity to do it, it must be because I enjoy it. I get a sense of achievement that the work I do will be "out there" for a long time. We haven't begun to realise where technology will lead us. The books I prepare will be able to be read by people anywhere on earth, and even beyond, by astronauts travelling to Mars. "Send up THE ODYSSEY will you Scottie, I have always meant to read it."
I have had some unexpected pleasures, too. I have "met" some wonderfully generous and interesting people and I have read some wonderful books that I would not have taken the trouble to read if I weren't preparing them for PG.
DO YOU SPECIALISE IN ANY PARTICULAR KIND OF WORK, OR TEXTS?
I started out thinking that I would stick to books with an Australian flavour. But I can't help myself. If I see something that I am interested in, and it is already on the internet, but not at PG, I have to do it. I have submitted etexts of James Joyce's "Ulysses", and works by D. H. Lawrence, and Norman Douglas. I also have a long list of books I would like to scan in myself, not all of which are about Australia--one day.
WHAT DO YOU LIKE ABOUT MAKING A PG ETEXT?
I think I have covered that already. I like the sense of achievement, the fun of reading the book, and the thought that it will be available to many people who would not otherwise have access to it, possibly in a form which has not yet been invented.
WHAT DO YOU DISLIKE ABOUT MAKING A PG ETEXT?
Sometimes the going is not easy. Occasionally I get impatient with the length of time it is taking and sometimes I get bored with the subject matter. I recently purchased a new scanner with excellent OCR software, which converts the page image to text, and that has given me a new lease of life because less proofing is required. I sometimes remind myself that I don't have to do it, then I find that I want to anyway.
WHERE DO YOU GET YOUR ELIGIBLE BOOKS?
Local libraries have a surprising amount of eligible material. The main difficulty is finding books with a publication date of 1922 or earlier, for PG in the US anyway. I have found a number of "facsimile" editions which are direct reprints of the original, and these are acceptable. I also look around second-hand bookshops. I recently found a battered copy of "A short history of Australia" published in about 1910, and bought it for $A1.50. For books eligible for posting at the PG Australian site, cheap paperbacks are readily available. I am working on one now, and have ripped all the pages out of it to make it easier to scan. It only cost a few dollars. There are also a number of sites on the internet which list second-hand books for sale.
DO YOU TYPE OR SCAN? WHAT SCANNER/OCR/EDITOR/WORD PROCESSOR DO YOU PREFER?
This section might as well cover all of the "tools of the trade". I have noticed that volunteers have many favourite tools, and from what I can make out most will do the job. The list below covers what _I_ have settled on. I should note that I work in the Windows environment, and tools are readily available for all the things I need to do.
Scanner
I recently purchased a Canon A4 flatbed scanner without a document feeder for under $A200. It has a hinged lid for scanning books and comes bundled with image enhancing software and OCR software for converting image to text.
OCR (Optical Character Recognition) Software
'Omnipage Version 9' came bundled with the scanner. I find that I don't need any of the other software which came with the scanner--Omnipage does it all for me. I can scan, proof, spellcheck and save the output to a text file with very little effort.
Editor
I use Editplus which is available as shareware on the internet. It enables me to read in the file produced by the Omnipage OCR software and reformat it to a line length suitable for PG texts (about 70 characters). It also allows one to display guide lines vertically on the page to help with checking for "long" lines. I have loaded James Joyce's "Ulysses" into Editplus and it handled it, so I presume that it will handle files of any size. As with everything one wants to do at PG, there is always someone more than willing to help with problems encountered, just by posing questions to the volunteer discussion.
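The check for "long" lines can also be done with a short script instead of guide lines in the editor. This is a sketch only; the 70-character limit is the approximate figure given above, and the function name is made up for the example.

```python
def long_lines(text, limit=70):
    """Return (line_number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    sample = "a short line\n" + "x" * 75 + "\nanother short line"
    for number, length in long_lines(sample):
        print(f"line {number} is {length} characters long")
```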
FTP (File Transfer Protocol) Software
Some volunteers e-mail their submissions to PG as an attachment to an e-mail. However, it is also possible to place them at the PG site for processing, using FTP. Microsoft Windows Explorer has an FTP facility which can handle this and that suits me. I know that there are many others and SmartFTP is an excellent freeware product for those who need Windows-based FTP software.
Other Tools
I use Microsoft Word to convert HTML files to text files. Firstly, I cut and paste the HTML document into Word, then I convert any italics to upper case, since italics are not supported in plain text files; then I save the document as a text file. Then I use Editplus, mentioned above, to reformat the line length. Sometimes it is necessary to add an extra "carriage return" at the end of each paragraph, to comply with the preferred style for PG texts. This can be done from within Word or Editplus by replacing characters. New volunteers may need to ask for information about this process.
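For those who would rather script the conversion than use Word, the steps above (italics to upper case, a blank line after each paragraph, plain text out) might look roughly like this in Python. It is a rough sketch using regular expressions, not a full HTML parser, and it assumes simple, well-formed markup with italics in i or em tags and paragraphs closed by a p end tag; the function name is made up.

```python
import html
import re

_ITALIC = re.compile(r'<(i|em)\b[^>]*>(.*?)</\1\s*>', re.IGNORECASE | re.DOTALL)
_PARA_END = re.compile(r'</p\s*>', re.IGNORECASE)
_TAG = re.compile(r'<[^>]+>')


def html_to_plain(source: str) -> str:
    """Convert simple HTML to plain text, showing italics as UPPER CASE."""
    # Decode entities inside italics before upper-casing, so that an entity
    # name is not mangled by upper().
    text = _ITALIC.sub(lambda m: html.unescape(m.group(2)).upper(), source)
    # The end of a paragraph becomes a blank line.
    text = _PARA_END.sub('\n\n', text)
    # Drop every remaining tag, then decode the remaining entities.
    text = html.unescape(_TAG.sub('', text))
    # Never leave more than one blank line between paragraphs.
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip() + '\n'


if __name__ == "__main__":
    print(html_to_plain("<p>He read <i>Ulysses</i> twice.</p><p>Then he slept.</p>"))
```

The line length still needs reformatting afterwards, as described above.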
HOW DO YOU CHECK YOUR TEXT? ANY SPECIAL TOOLS? SPELLCHECKER? DO YOU PRINT IT OUT AND READ IT? PUT IT ON YOUR PDA AND READ IT? HAVE A VOICE SYNTHESIS PROGRAM READ IT ALOUD TO YOU FROM YOUR PC?
I have tried a few different methods. I don't have a notebook computer or etext reader so I must either read it on a PC or print it out. There is a spellchecker with Editplus, which allows one to add new words, so I use that to begin with. I also use GUTCHECK, a program developed by Jim Tinsley, which picks up many errors. One would need to contact him via PG, if one wanted a copy. I travel by train to work, so I often make a printout and read that for the final proof, or co-opt my wife if it is something I can interest her in. I have a checklist, which I have developed over time, that I use to ensure that I have covered all that I need to--but then I AM one for lists.
DO YOU HAVE ANY TIPS 'N' TRICKS OR SPECIAL ROUTINES YOU GO THROUGH WHEN PREPARING A TEXT?
I think I have covered most of my methods already. I sometimes find that "dashes" within sentences need attention. I like to show them as "--" so I try to be consistent and not let them slip through as " - ". I think we at PG could get together a more or less prescriptive list of editing rules for new volunteers to follow. Once they gained experience they could change them if they wanted to. I do like to place an end marker ("THE END") at the end of my progressing work, so that I don't inadvertently lose any of it and I make several rotating backups of the file I am working on. I have "lost" computer files once or twice over the years and don't want to get that sick feeling in my stomach EVER again.
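The dash tidy-up mentioned above is a natural job for a search and replace. Here is a minimal Python sketch of one way to do it; the rule (a hyphen with spaces on both sides, between two letters) is an assumption chosen so that ordinary hyphenated words and sums such as "10 - 4" are left alone.

```python
import re

# Assumed rule: a lone hyphen with spaces either side, between two letters.
_SPACED_DASH = re.compile(r'(?<=[A-Za-z]) +- +(?=[A-Za-z])')


def normalise_dashes(text: str) -> str:
    """Replace ' - ' used as a dash within a sentence with '--'."""
    return _SPACED_DASH.sub('--', text)


if __name__ == "__main__":
    print(normalise_dashes("He paused - then spoke of a well-known book."))
```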
As I said earlier, I do have a checklist, and it could help if PG (that includes me, as PG is "us") provided a downloadable list of things which need to be done to get an etext posted e.g. copyright approval, scanning, editing, proofing, placing relevant information at the beginning of the etext, etc. All the information is there already, it just needs bringing together into one document.
HOW LONG DOES IT TAKE YOU TO MAKE A TEXT?
Obviously it depends on the number of pages, efficiency of the scanner and the number of hours one puts in. The two volumes of Sturt mentioned above probably took me six months, but I was doing many other things in the meantime. To scan in and edit, say, "The Prophet" by Kahlil Gibran would only take a fraction of that time as it is quite thin and easy to read. If one were concerned about getting an idea of the time it would take to complete an etext, I would suggest that he/she do a little casual proofing at the "Distributed Proofreaders" site first, to get an idea of what is involved.
DO YOU WORK ALONE, OR DO YOU SHARE THE WORK OF EACH TEXT? DOES ANYONE REGULARLY HELP YOU PROOF THE TEXT?
I generally work alone, however my wife will proof sometimes. She has become interested in the book that I am working on at present and is waiting for me to supply her with more pages. When I was getting started, a new volunteer agreed to proof something for me (she approached me) but then she never did any of it and didn't even e-mail me to advise that she had changed her mind. Editing and proofing is not for everybody and one needs to find out if one likes doing it. However, courtesy costs nothing.
DO YOU DO SOME PG WORK REGULARLY, OR DRIFT IN AND OUT AS OPPORTUNITY PERMITS, OR WHEN YOU FEEL LIKE IT?
All of the above at different times. I am not an avid television watcher and would rather do some "work" (or should I say "pleasure") for PG much of the time.
HOW MANY DIFFERENT KINDS OF WORK, OR DIFFERENT BOOKS, HAVE YOU DONE?
Because I have converted many books from work already on the internet, I have covered quite a range, though I haven't actually scanned and proofed too many books. Those that I have done have been Australian historical works. But I have rounded up books on philosophy, aboriginal legends, and several novels. Since many internet sites come and go, I am interested in "grabbing" etexts and posting them at PG in case the site disappears from the internet. It has become a pastime in itself. I recently discovered "South Wind" by Norman Douglas, a book which caused quite a sensation when it was first published because it portrayed a bohemian lifestyle. Ironically, I used to have the book in my home library, but dispensed with it when I needed space. Now it is at PG and I can get it whenever I want it.
WHAT DO YOU LIKE ABOUT THE PG PROCESS?
The democratic, helpful, friendly approach of all the people involved is one of the things I like best. I have "met" so many wonderful people, without having to "live" with them, if you know what I mean. Not long after I started associating with PG, Michael Hart posted an e-mail to the volunteer discussion group, advising of the death of a long-time volunteer. It seemed like she had been one of the "family".
One really needs to be indifferent to praise and the prospect of reward to start volunteering for PG. There is certainly no money in it. However, one quickly finds that there is a community of people out there with a common interest, and with the same outlook and the same interest in doing a job well, without tangible reward. There is no lack of praise though, and one soon finds that one is not indifferent to it.
WHAT DO YOU DISLIKE ABOUT THE PG PROCESS?
There isn't much that I don't like. Nothing worth mentioning, anyway.
IS THERE ANYTHING YOU'D LIKE TO SEE PG DOING DIFFERENTLY?
There are a few things, however since I don't know all the reasons for some things being done the way they are, and because everything is done by volunteers anyway, I wouldn't like to canvass them here. To have produced nearly 5,000 etexts over more than 30 years is testament to the fact that most things are being done "right".
IF ONE OF YOUR FRIENDS APPROACHED YOU TO ASK ADVICE ABOUT HOW TO GET STARTED CONTRIBUTING TO PG, WHAT WOULD YOU TELL THEM?
I would spend some time with him/her and work through some of the issues. I know that I would have benefited from that approach. I would gradually introduce her(him) to the different issues which need to be addressed and find out exactly what her expectations were, and try to help her in fulfilling them.
WHAT WOULD YOU EXPECT PG TO BE LIKE IN FIVE YEARS? TEN YEARS?
Much the same as it is now, I hope. After all, the goal will continue to be to provide "fine literature digitally re-published". Though I expect that, like other organisations, it will continue to evolve in response to new challenges and opportunities. Ten years ago, who would have thought that there would be 5,000 etexts posted; that there would be volunteers operating an online proofreading site; and that there would be a volunteer writing free software to read PG etexts? The rapid growth of PG over the last few years will present many challenges for the future.
Writing of etext readers, I am reminded that I recently joked to a volunteer that I wanted him to write software for reading etexts, whereby a hologram would appear on the inside of my eyelids so that I could read etexts with my eyes closed. Who knows, it might be possible. However, whatever advances in technology occur over the next ten years, one thing is certain: the work of all the volunteers to date will ensure that there is an amazing library of ebooks available covering creative works by some of the greatest minds who have ever lived. Future readers of PG ebooks will have been given a wonderful gift by the many volunteers who have contributed to PG over the decades.
Project Gutenberg of Australia
On the wall in a colleague's office was pinned a piece of paper on which was written a quotation. I don't recall now what it was and the colleague has been gone for some time and has taken the paper with him. However under the quotation the author was acknowledged as "Prince Machiavelli". I had a vague idea that the quote actually came from "The Prince" by Nicolo Machiavelli, and wondered how I could satisfy my curiosity. Then I remembered reading about Project Gutenberg and decided to see if the book was posted on the PG site, though I didn't really expect that it would be. Needless to say, the etext WAS there and I was able to download it and read it in its entirety, due to the time spent by John Bickers and Bonnie Sala (their names appear at the beginning of the etext) in preparing it for PG. Interestingly, there were other works by Machiavelli there, which I hope to get back to one day.
Later, when I e-mailed PG and expressed an interest in volunteering I was, because I said that I was Australian, referred to Sue Asscher, the Australian Production Director for PG. Sue asked me to proofread "A Vindication of the Rights of Woman" by Mary Wollstonecraft. Also, about this time, a journalist had contacted Sue with regard to a story being prepared for PG. He wanted to contact some volunteers to ask why they were interested in PG. Sue referred the journalist to me, with my permission of course, and one of his first questions was "Is there much Australian content on PG?" After I had checked the PG etext list I could only reply "not much".
So I decided to start creating etexts by Australian authors, for PG. Sue Asscher pointed out that there were many eligible Australian works already in the public domain as etexts, so I started rounding up etexts and matching them with books which had been published before 1923, so that they could be posted at PG. Then I started creating etexts myself, for works I could not find already on the internet. My sister had given me, many years ago, a book by Geoffrey Dutton titled "Australia's Greatest Books", so I decided to start working my way through the eligible titles from the list of about one hundred books reviewed by Dutton. I had already found a number of them on the internet and some were already at PG. But there were still a "few" to be done. There still ARE a few to be done, if anyone is interested in helping.
Then Sue Asscher again had a hand in setting the direction I would take by asking me to proof an etext of "Animal Farm" by George Orwell, whose work had recently entered the public domain in Australia. We didn't know where we would post it, as it is not in the public domain in the US, but I agreed to proof it as I had read it many years ago and enjoyed it.
About this time, I also decided to make up a personal web site. Being a software developer, people were always asking me about the internet and web sites, in the mistaken belief that I knew ALL about computers. I decided to get an idea of how web page design and web site management worked by creating a site that listed all of the "Australian" content at PG. When I couldn't find anywhere to put the Orwell, which I had recently proofed, I decided to create a page on my site for etexts in the public domain in Australia, so that Australians and internet users in other countries with similar copyright laws, could read and/or download them.
Michael Hart, the founder of PG, was quick to interest me in creating an "official" PG site in Australia. After registering a business name, getting a domain name and finding a sponsor to host the site, Project Gutenberg of Australia was up and running.
It all happened very quickly, and as with many things which happen in one's life, it all seems to have come about by serendipity. Even the site's motto "A treasure-trove of literature" was stumbled upon by chance when I looked up, in connection with another unrelated matter, the word "treasure-trove" in a dictionary, to ascertain if the word was hyphenated. Imagine my surprise to find treasure-trove defined as "treasure found hidden with no evidence of ownership". That EXACTLY defined the literature found on PG.
My own association with PG resulted from the culmination of a life-long interest in books and literature and an equally strong interest in computers. Every volunteer brings his/her own particular interests and skills to PG and that, together with the democratic approach taken by the small executive team, is what makes PG the strong, co-operative organisation that it is. My interests and skills, and a generous dose of serendipity, led to the creation of Project Gutenberg of Australia.
Dagny
I discovered Project Gutenberg in 1996 and immediately wanted to help because I love books and wanted everyone to have access to all the wonderful books that, even today with Internet searching, are difficult to find or very expensive when you do locate them.
I began by proofing a few works but what I really wanted to do was share my Balzac collection with other fans. I discovered Balzac in the 1970s and recall my frustrations in trying to find more than a dozen stories of the over one hundred Balzac wrote. It was over a decade before my husband discovered a complete set at a used bookstore while on vacation. Unfortunately, not everyone is so lucky.
With the first few stories I typed for Project Gutenberg I worried about everything: should I correct a type-setting error, leave it, footnote it, etc. This took a long time and involved a lot of correspondence. Now, my idea is to make the text as readable as possible. For me that means correcting type-setting errors I notice. Others prefer to leave them intact. In the end, I don't believe the readers care. I have found them generally to be very grateful to have found some treasure they had been seeking. In some cases of an author's more obscure works, they didn't even know the book existed, a rare find indeed for them.
It is so satisfying to receive an e-mail from someone thanking you for all your hard work. Most readers don't take the time to write but true fans often do and they make it all worthwhile. I have even met people in this way that went on to become a Project Gutenberg volunteer themselves because they wanted to give something back to the Project from which they had received so many pleasurable hours.
Gardner Buchanan
SOURCE MATERIAL
First of all, there is the issue of what texts I choose to do. For me, this is fairly simple. I'm a bit of a small-time book collector already, and have a personal theme: "Canadian English Literature" and "Canadian English-Language History". I have no trouble whatsoever in coming up with submissible editions of works that fit this theme somehow. Nevertheless there are specific authors and works that I'm not having luck with, so I'm still making the rounds of the used book shops regularly and picking up all sorts of stuff.
Eligible volumes have typically cost me $10.00-$150.00 for a collectable edition, or $0.50-$15.00 for a recent paperback edition or garage-sale item. I paid $0.50 for an eligible, but not very collectible, copy of Glengarry School Days by Ralph Connor at a garage sale. As it turns out someone has beaten me to it--it has been in the collection since 2001. Sometimes if I'm contemplating picking up a more expensive book that I don't already have a personal interest in, I'll go back and double-check The Online Books page to see if someone has already submitted the book.
Another way I obtain texts is from the Early Canadiana Online archive. They host page images of quite a large collection of old books written in or about Canada, or written by Canadians. The page images are reasonably well suited to OCR.
I tend to produce E-texts two different ways. One way is to submit page images to Charles Franks who runs Distributed Proofers and let him worry about bulk-OCR'ing. I then manage the distributed proofing, which is a fairly low-effort business. The other way is to scan, OCR and proof all by myself. I'm currently averaging two of my own projects to every Distributed Proofer one.
SCANNING AND OCR
I have a very slow parallel-port scanner, a UMAX Astra 2000P. It sucks mightily. I'd rate it a 2 out of 5, if it wasn't acting up--creating a black bar across the page, part way along--so I have to scan books a certain way around to avoid having the bar land in the text. As it sits now, it's in 0.5-1 territory. It is glacially slow at the best of times, and due to being a parallel port model, locks up my whole computer during the scan.
Nevertheless, it is completely adequate to my needs for PG work. I've scanned more than a dozen books on it, and it's done yeoman service--despite its warts. Scanners like this one can be picked up used for $30, and are worth the money.
The way I work when I'm producing a book myself, is scanning and proofing page by page. I do the scans two-pages-up, then OCR, proof and copy the pages to a working document, before going on to scan the next pair of pages.
My scanner came with two OCR "packages": Omnipage something-or-other which I was never able to install, and Recognita Standard 3.2.7. I use Recognita, and for the 300dpi scans I do, it is adequately fast and accurate. It is a no-frills package, and DOES make many mistakes, but it is entirely useable for my purposes. I rate it 2 of 5.
I've used the Abbyy FineReader 5.0 try & buy. This is a magnificent OCR system. It handles huge batches and is fast and astoundingly accurate. I rate it 5 out of 5. Unfortunately it costs about $million to patriate a web-bought item into Canada, and while priced at a very reasonable US$100.00, would cost me about CAN$600 after exchange-rate, brokerage fees, shipping, more fees, taxes, service charges and more taxes (on the fees).
I could buy Omnipage off-the-shelf here, but frankly if I can't get Abbyy, I'll stick with Recognita.
As I scan each page, I paste it into Windows-95 Wordpad. Sometimes I also do some proofing in Wordpad, but mainly I proof, fix quotes, M-dashes and paragraph breaks in the OCR program before copying to Wordpad. I like to keep the page boundaries intact, and I mark them in my Wordpad document like this:
: : kjdk ldjd ll;llkj dklj dklj kjdk ljd llllkj klj dklj
page 354
kjdk ldjd lll;;llkj dklj dklj kjdk ldd lll;;llkj dklj dklj kjdk ldjd ll;llkj dklj dklj kjdk ljd llllkj klj dklj
page 355
kjdk ldd lll;;llkj dklj dklj kjdk ldjd ll;llkj dklj dklj kjdk ldd lll;;llkj dklj dklj kjdk ljd llllkj klj dklj : :
At this point I also fix up hyphenated words that straddle page boundaries. I note paragraphs that start on a new page and mark them with a tag, and I note indented or block-quoted sections and mark these with a pair of tags. This helps when I go back to format it since I can easily see where the special cases are.
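Finding the words that straddle a page boundary can be helped along by a script. The Python sketch below assumes the working file uses marker lines of the form "page 354", as in the sample above, and simply reports each marker that directly follows a line of text ending in a hyphen. It is an illustration only, with a made-up function name; the actual fixing is still done by hand.

```python
import re

# Assumed marker format: a line holding only the word "page" and a number.
_PAGE_MARK = re.compile(r'^page (\d+)$')


def straddling_hyphens(lines):
    """Return the page-marker numbers that follow a line ending in a hyphen."""
    hits = []
    previous_text = ''
    for line in lines:
        marker = _PAGE_MARK.match(line.strip())
        if marker:
            if previous_text.rstrip().endswith('-'):
                hits.append(int(marker.group(1)))
            previous_text = ''          # start afresh after each marker
        elif line.strip():
            previous_text = line        # remember the last non-blank text line
    return hits


if __name__ == "__main__":
    sample = ["a knight without peer-", "page 354", "less in all the land"]
    print(straddling_hyphens(sample))
```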
Wordpad handles large documents reasonably well and will grok UNIX files (i.e. LF only, not CR LF). For this it rates 3.
PROOFING AND FORMATTING
When the whole text is assembled, whether by myself or by Distributed Proofers, I use about the same process for formatting and final proofing.
I use MS-Word 95 to do a spellcheck. This I rate 3 out of 5. I do a select-all, and set the language appropriately - for me, usually UK rather than American English. I wish I had a Canadian English dictionary for Word 95, but have not needed one badly enough to actually look. Word has a pretty good spell checker and the custom dictionaries are easy to muck around with. I use a custom dictionary for any big project - I have one for Chronicles of Canada, and a different one for all the John Richardson books I've done.
At this point in my personal process, I abandon Windows and go over to FreeBSD.
I use vi (rated 9 out of 5) to do a number of hacks. I search for and fix up hyphenations that were broken (peer- less) and such like. I also search for and fix some OCR special case errors like 'you'->'yon' and 'be'->'he'. This latter sometimes requires a while, just to step through all the be and he's to see if they're right.
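For readers who would rather script these searches than step through them in vi, here is a minimal Python sketch. The function names and the two patterns are assumptions made for illustration, not part of the workflow described above; the sketch only lists candidates, and the decision on each one is still made by eye.

```python
import re

# Assumed patterns: a hyphen followed by a space inside a word ("peer- less"),
# and the short words that OCR commonly confuses (you/yon, be/he).
BROKEN_HYPHEN = re.compile(r"\b(\w+)- (\w+)\b")
SUSPECT_WORDS = re.compile(r"\b(yon|he|be)\b", re.IGNORECASE)

def find_broken_hyphens(text):
    """Return (left, right) pairs such as ('peer', 'less') from 'peer- less'."""
    return BROKEN_HYPHEN.findall(text)

def list_suspects(text):
    """Return (line_number, word, line) for every word worth checking by eye."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT_WORDS.finditer(line):
            hits.append((number, match.group(1), line))
    return hits
```

Most hits from list_suspects will be perfectly good words, which is why the sketch reports them with their line instead of changing anything.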
Still in vi, I next use some incantations to run the UNIX 'fmt' command on each paragraph to get it reformatted. I use:
fmt -55 60
Fmt gets a 3 out-of 5 for what I need it for. It double spaces after sentences, which--although it is probably the right thing to do--is not the PG convention (for me at least). It also adds a space when joining lines with an M-dash. I go back and fix both of these using vi. I take into account the tags and manually format accordingly at this point.
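The two clean-ups after fmt can also be sketched in Python. This is an illustration under assumptions, not the vi commands used above: it assumes sentence ends are marked by ., ! or ? (optionally followed by a closing quote or bracket) and that every "--" in the text should be unspaced.

```python
import re

def tidy_after_fmt(text):
    """Undo fmt's double spacing after sentences and its spaces beside '--'."""
    # Collapse two or more spaces after sentence-ending punctuation to one.
    text = re.sub(r"([.!?][\"')]*) {2,}", r"\1 ", text)
    # Remove a single space on either side of a double-hyphen dash.
    text = re.sub(r" ?-- ?", "--", text)
    return text
```

A text that deliberately uses spaced dashes would need the second substitution left out.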
As I reformat, I give the text its final proofing. I'll have the original text in-hand at this point, and will use the page markers (remember them?) to figure out where I am. As I reformat, I delete the page markers and other markup. When I'm finished this step, the book is almost done.
Next, I use Gutcheck 0.2 (5 of 5, for intended purpose - way to go Jim!) to check for all the things it checks for. At this point I usually get something like 50 hits, of which 30 are real. I'm then back in vi, and fix up all those problems. Finally, I'm done.
As I go along, I tend to keep various versions of the document. I'm at version 27 of 'The Imperialist' right now. Each scanning, editing, spell-checking or whatever type of session gets a new version: imperialist_12.txt, imperialist_13.txt,... At various times I might find it useful to use 'wc', 'grep' and 'diff' to figure out what is going on, where a word appears or whether I deleted something I didn't mean to.
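The "did I delete something I didn't mean to" check can be sketched with Python's standard difflib module, as a rough stand-in for running diff on two saved versions. The function name is hypothetical; the file names in the usage note follow the numbering scheme described above.

```python
import difflib

def lines_lost(old_text, new_text):
    """Return lines that appear in the older version but not in the newer one.

    A line that was edited shows up here in its old form, so the result is a
    list of things to look at, not proof that anything was lost.
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]
```

Typical use would be to read imperialist_12.txt and imperialist_13.txt into two strings and print whatever lines_lost returns.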
HARVESTING PAGE IMAGES
I mentioned above that I sometimes work from page images that I obtain from the web. There are several archives around that hold eligible materials as page images that you can easily download and OCR. I personally have worked mainly with the Early Canadiana Online archive.
After a bit of poking around with the web interface to this collection, I have been able to work out how the individual pages are numbered and organized. I have written some shell scripts that I can use to fetch all the pages of a volume and convert them from GIF to TIFF format. Harvesting a 200 page book takes a few hours.
Once I have all the pages, I have to do some work with an image editor to get them ready for OCR. I use Corel PhotoPaint 7 to crop each image to just the text area and to remove the black bands at the sides due to the spine or whatever. The page images are often made from microfiche, and dust marks are common as well. These I can sometimes edit out with PhotoPaint.
Because some of the page images, or certain sections thereof, can be completely unreadable, I often find myself either tracking down a modern edition or visiting a local university library to find a copy of the book to look up a few paragraphs or passages that are not readable in the images. Even having to do this, I find that the capture of images from the archive is still a big time saver, and allows me access to an edition that would otherwise be totally inaccessible.
Having gathered the images and prepared them for OCR, I next submit them to Charles at Distributed Proofers, or handle them myself, using the same process as if I were scanning them.
DISTRIBUTED PROOFERS
I've done several books using Charles Franks' most excellent Distributed Proofers web application. I tend to choose DP when I don't have the personal time to read and proof a volume myself, or when the poor quality of the text defies the ability of my (not very good) OCR package.
When scanning for DP, I still scan images two-up. I then have a collection of shell scripts that cut the page images in half to produce single-page TIFF files. I then use a manual procedure with Corel PhotoPaint 7 - if required - to fix up skewed pages or ones with black margins. For the most part, page images that I scan myself are registered exactly enough in my scan area that the page images don't need to be edited.
Page images that I've harvested from a web archive do have to be fixed up before they can be used by DP.
Charles, I believe, prefers that as a project manager I would deal with my own OCR. He has, however, been kind enough to run several batches of page images through his OCR setup for me to good effect. I believe he uses Abbyy FineReader, and my procedure for submitting pages to Charles is to run a subset of the pages I intend to send him through a demo copy of FineReader to make sure that the results are vaguely acceptable. If everything looks good, off it goes.
When the project has run its course with DP, I download the completed text and proceed to format and re-proof it, for the most part, as if I'd scanned and OCR'd it myself.
Jim Tinsley
How I (eventually) got started.
Five years ago, I was the most clueless newbie ever to try volunteering for PG. If you're feeling lost about how to help PG, you can be sure that you're not alone! And if I can write PG's first complete FAQ after my bad start, you can surely do better! :-)
Back in 1997, the web site existed, but there were no FAQs, no Volunteers' Board, no gutvol-d, no Distributed Proofing sites. I started by making a donation and e-mailing Michael, suggesting that I could help out with small jobs, or programming. I didn't get any, and I had no idea what, if anything, I could usefully do by myself.
I looked up the in-progress list at the time, and e-mailed a few people who were listed as working on books, offering to help. None of them were still working on the books. (We no longer show people's e-mail addresses on the InProg list.) I still had no idea how to get eligible books, no scanner, and no idea how to approach producing an etext.
I subscribed to the monthly Newsletter, and just read it for a year. In a "Project Gutenberg Needs YOU" edition, Dianne Bean, the U.S. Director of Production at the time, was given as a contact. I e-mailed her, and finally things started happening.
She sent me a short piece to second-proof, and explained that I should just fix whatever needed fixing. I returned it, and she introduced me to Bill Brewer, who was, at the time, scanning Wisters like they were going out of style. He and I formed a scanning/proofing team for a while.
How I began producing, and my problems with scanning and OCR.
I had some ideas for books I wanted to produce, but I couldn't find them locally, so I turned to the Internet, and discovered how easy it is to find and buy used books on-line.
I bought an HP flatbed scanner. It came with freebie OCR software-- "PrecisionScan"--with images and OCR all in the same interface.
I scanned my first book, which fortunately had large, clear text, and the OCR made a reasonable job of it, according to my standards at the time, which were that getting any text at all without typing was a form of magic :-)
I now know that I could have made a better job of it if I had pressed the spine down hard, either closed the top to keep out ambient light or darkened the room, and made each scan a bit more exact. I'm much better at flatbed scanning now.
My PrecisionScan software _did_ recognize two facing pages, and dealt with them correctly, though IIRC it put some garbage characters between the pages that I had to remove by hand.
It did require a lot of editing, though, and recently I've gone back over my original text and found lots of mistakes, partly because of the scan, partly because of my inexperience.
Throughout the editing, I kept having to make formatting decisions in a vacuum, reinventing wheels and applying rules from a HowTo. Now, having read and formatted and proofed and produced so many texts, I just _know_ how to format a text without thinking, and just reading or even skimming a few texts before producing my own would have given me a lot of background and saved a lot of time. I had proofed several books, but never thought to look closely at formatting decisions.
That text took me a month of working most evenings, and a lot of sticktoitiveness. I can really appreciate the effort that a volunteer has to put in to produce their first text by casting my mind back to that month. I think it's the not-quite-knowing-what-you're-doing that's the worst part. I remember being soooo relieved when I sent it off for second proofing.
The guy who took it for second proofing didn't get back to me for a month, and then said that he wasn't going to do it. This was disappointing. I sent it to another guy for proofing. He came back after a few weeks asking some questions. I answered them. After a few more weeks, I followed up with another e-mail. No answer. A few weeks after that, I gave up, and just submitted the file for posting.
The next book I produced didn't have such nice, clear, large type, and the scan was what I would today call abysmal. I'd guess that I retyped a quarter of the book. The less said about that one, the better.
My third book just _would not_ OCR sensibly. The print was very small and faint, and the OCR produced gibberish. Even with my low standards, I couldn't kid myself that this was working. I tried 400dpi, 600dpi. No dice. I might get 10 complete words on a page.
It was at this point that I bought TextBridge. I really had no idea about the difference between the freebie OCR programs they give away with scanners and a genuine commercial product, but I was trying in desperation to get _something_ different that would read this image.
TextBridge was an eye-opener for me. It still didn't make a good job of the bad images, but it made a decent shot at maybe half of them, and having bought it, I tried it on the two books I had worked so hard at before--it gave hugely improved results. The book that had only been about 75% OCRed became 100%, but with some errors. I cursed the time I had wasted making up for the deficiencies of my freebie package.
Since then, I've kept upgrading my TextBridge (I think I started on version 8, now on Millennium) and bought OmniPage and Abbyy as well. I mostly use Abbyy 6 now.
Last time I looked, there were downloadable trials of Abbyy, TextBridge, and OmniPage. Big downloads though.
Last year, I got a new Epson Perfection 1640 scanner to replace my old HP Scanjet. I never had any complaint about the Scanjet itself--it served me well--but the new Epson is faster, has higher resolution, and ADF.
Even better, I now know how to scan. I know how to process 200+ pages an hour while scanning the book flat, two pages at a time. I know how to adjust the settings to scan only the area covered by the book. I try different settings for each new book to see what works.
So much for scanning and OCR. I was a _very_ slow learner in this area.
How I prepare a text now.
I was never quite so bad on the proofing end of things. As an editor, I use Brief in DOS and Crisp (a Brief clone) on Windows. (I mostly use vi on *nix, but I do very little-to-no PG work on *nix apart from an occasional scripting thing that I can do in one line of Perl, but would be annoying on MS).
Now, I'm all for tolerance and equality and respect for the faiths of other people, :-) but I gotta say that for someone who has used a powerful editor, editing with Word or any standard Windows editor is like scratching your nose with a rake.
When I first get the text off the OCR, I have many pages with breaks between them, and usually no line-spacing between paragraphs, but each paragraph indented.
I whip out Crisp, and run a macro to search and destroy all page-breaks and page-numbers and blank lines between, and then another to put line breaks between paragraphs and unindent them. Since I watch this process carefully to avoid messing up quotations, it takes me maybe 15 minutes.
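The two macros are not shown, but the idea can be sketched in Python. The sketch below is an assumption about what such a clean-up might look like, not the Crisp macros themselves: it treats a line holding only digits as a page number, drops blank lines left by page breaks, and starts a new paragraph wherever a line is indented. Indented quotations would be split by this rule, which is why the process needs watching.

```python
import re

def block_paragraphs(text):
    """Turn indented paragraphs into unindented ones separated by blank lines."""
    out = []
    for line in text.splitlines():
        if re.fullmatch(r"\s*\d+\s*", line):  # assumed: page number alone on a line
            continue
        if not line.strip():                  # blank line left by a page break
            continue
        if line[0] in " \t" and out:          # indentation marks a new paragraph
            out.append("")
        out.append(line.strip())
    return "\n".join(out) + "\n"
```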
Now I have a basically formatted text. The line-lengths are usually too short, and there are hyphenated words at line-ends that I will need to rejoin, and some that I need _not_ to rejoin. Another macro fixes up the hyphenation. At each hyphen, I just decide whether to rejoin or not. Say 20 minutes, max. Then I rewrap. Another 15 minutes.
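The hyphenation step, with its rejoin-or-not decision at each hyphen, might look like this in Python. It is a sketch under assumptions: keep_hyphen is a hypothetical caller-supplied function standing in for the human decision, and the two lines around each split word are merged, on the understanding that the text is rewrapped afterwards anyway.

```python
import re

# A word ending in a hyphen at the end of one line, continued on the next.
LINE_END_HYPHEN = re.compile(r"(\w+)-\n[ \t]*(\w+)")

def rejoin_hyphens(text, keep_hyphen):
    """Join words split across a line end.

    keep_hyphen(left, right) returns True to keep the hyphen ("well-known")
    and False to drop it ("peerless").
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        return left + ("-" if keep_hyphen(left, right) else "") + right
    return LINE_END_HYPHEN.sub(join, text)
```

In interactive use, keep_hyphen could simply print the two halves and ask for a y/n answer.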
So in maybe an hour I have a proofable text, and the really nice part about it is that I've had a flying tour of the text three times, so I've already noticed any peculiarities.
If I've noticed any unusual features like letters or poems that need special treatment, I do it at this point.
To prepare the text for proofing, I just flick through it in Crisp with spellquery on, in US or UK English as needed. This puts a red line under queried words, just as Word does. I spend maybe 5 or 10 seconds per 50-line screenful. I don't expect to catch them all; this is just a quick pass to thin 'em out. I may also catch some formatting issues, but I'm not looking for them.
Now I proofread.
I've tried lots of ways of proofreading. Often it's just sitting at the screen. Sometimes I print out the text or parts of it, and mark errata with a pen. Occasionally, I get the computer to read the text to me, and I follow along in the book, noting any errors. (This is good when you want very high accuracy - do a replace of ":" with "colon", "," with "comma" and so forth before you start the reader.) Recently, I've tried reading the text on a PDA, and bookmarking the problems.
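The replace-before-reading trick is easy to script. In this Python sketch only the colon and comma entries come from the text above; the other entries in the table, and the function name, are assumptions added to show how the list could be extended.

```python
# Spoken names for punctuation; everything except ":" and "," is an assumed extension.
SPOKEN = {
    ":": " colon ",
    ",": " comma ",
    ";": " semicolon ",
    ".": " period ",
    "?": " question mark ",
    "!": " exclamation mark ",
}

def spell_out_punctuation(text, names=SPOKEN):
    """Replace punctuation with words so a text-to-speech reader says it aloud."""
    spoken = "".join(names.get(ch, ch) for ch in text)
    # Collapse the extra spaces (and line breaks) the replacements leave behind.
    return " ".join(spoken.split())
```

The output is meant only for the reader program; the corrections still go into the original file.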
Whatever way I do it, it takes time. I'm better at it now than I was, but I still tend to miss things like he/be.
Some people swear by particular fonts for proofreading, saying that font X shows "1"/"l" differences more clearly than font Y. I just use Arial or Verdana for printouts and Courier or Fixedsys on screen; the special fonts don't seem to make a difference to me.
So I've finished proofing and made my corrections. Now I leave it sit for a few days. I need to get my mind off it, so that I won't miss the same errors I missed before.
When I come back to it, I'm looking at what software people would call a Release Candidate, and something changes in my head . . . I'm thinking of it in a different mode, not as a work-in-progress, but as a potential finished project. This makes me much more critical, and less willing to accept mistakes.
Usually there are dash-problems to fix up (emdashes as " - " instead of "--") and other minor stuff like that. I do global searches for " -" and "- " and "...".
I do a quick skim through it, sampling paragraphs here and there as a test of its quality. I make any formatting adjustments like chapter line spacing or indenting letters that I might notice.
Then I run gutcheck. Gutcheck is a little program I wrote / write / will-write over the years that complains about common problems in a PG text . . . bad line-lengths, common typos, numbers within words (like the "1" in "wor1d"), unbalanced quotations, spaced or unspaced punctuation, non-ASCII characters. I fix the problems that Gutcheck points out.
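To give a feel for the kind of complaint involved, here is a toy Python sketch of four such checks. It is not gutcheck and does not reproduce its rules; the function name, the 75-character limit and the patterns are all assumptions chosen for illustration.

```python
import re

def quick_checks(text, max_len=75):
    """Return (line_number, complaint) pairs for a few common problems."""
    complaints = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:                      # assumed line-length limit
            complaints.append((number, "long line"))
        if re.search(r"[A-Za-z]\d[A-Za-z]", line):   # e.g. the "1" in "wor1d"
            complaints.append((number, "digit inside word"))
        if re.search(r"\s[.,;:!?]", line):           # space before punctuation
            complaints.append((number, "spaced punctuation"))
        if any(ord(ch) > 127 for ch in line):
            complaints.append((number, "non-ASCII character"))
    return complaints
```

Like any checker of this sort, it will raise false alarms (a spaced ellipsis, for instance), so each complaint is a prompt to look, not an instruction to change.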
Again, I switch spellquery on in Crisp, and skim through, more slowly than the first time. This time, I'm looking for _anything_ that shouldn't be in a PG text.
I run gutcheck again, just to be sure.
And off it goes!
The Posting Team
For a couple of years, I churned out a text regularly every two months, spending about 40 hours on each, and took on some occasional proofing, but after I became moderator of the Volunteers' Board, people started referring texts to me for checking or reformatting. This took up more and more of my available PG time, and my own production slowed accordingly.
It was in response to these requests that I wrote gutcheck, which embodies all the standard non-spelling checks I would run on a file. Gutcheck allowed me to spend less time on each text, but still feel reasonably sure that there was nothing glaringly wrong with it.
When Michael formed the Posting Team last year, I volunteered, and it was a natural progression for me, since I was already used to doing a lot of last-minute work on texts.
I found posting to be disorienting and confusing at first; people bombard you with half-scraps of information about books to be posted; some texts need serious work; some texts haven't been cleared, and need to be referred back; some people want special treatment for their texts, which may conflict either with my views or with PG precedents, or both; there are lots of questions. But like every other new job, it just takes time to learn the ropes.
The actual process of posting now takes very little time: I can go through the necessary steps in 3-5 minutes. But posters are the last line of defense against errors, and even the most careful volunteers make them (and yes, we do too!). It takes a minimum of 15 minutes to run standard checks on a perfectly clean file, and it can take several hours to fix up a file that needs help. On average, it takes me about an hour to do my reasonable best for every text submitted.
Apart from posting proper, there are a lot of queries to be answered, many of which I hope I've dealt with in this FAQ, "special cases" that eat as much time as I'm willing to give them, corrections to be made to existing texts, and interminable debates about whether PG should do _this_ or _that_.
Now that the learning curve is past, the problem with posting is that it generates a lot of e-mail and discussion, and eats a lot of time, and is a 7-day-a-week commitment. Having posted over a thousand texts, I'm now particularly interested in ways to improve text quality.
John Mamoun
How to create an e-text efficiently or automatically is an interesting logistical problem. Here is my procedure, which I recently used to make an e-text in about a week, with maybe 6 man-hours of work on my part:
I take the book, and use an x-acto blade to cut out all of the pages. I then feed the pages into an HP 4C scanner with an automatic document feeder accessory attachment that I got from e-bay for $200. I feed it up to 50 pages at a time, and it automatically scans them in.
I work the scanner using software called scan2000, from www.informatik.com (30-day shareware trial period, $50 to register). This program automatically works with the scanner to save each image as a CCITT4 standard format TIFF file. Most importantly, it automatically numbers each page, starting with an initial value you specify (typically 001.tif) and increasing the number of the file name by an increment you specify (typically by 2 pages, since you scan double sided pages; you scan the odds first, then flip the pages over and scan the evens, but you want the page numbers in order, right?). So the scanner outputs, say, 001.tif, 003.tif, 005.tif, etc., then you flip the pages over and re-feed them into the scanner; the even pages are saved as 002.tif, 004.tif, etc., after you tell the program to begin the first of the even page files with 002.tif.
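The numbering scheme is simple enough to show in a few lines of Python. This is only an illustration of the start-value-plus-increment idea; the function name is hypothetical and it has nothing to do with how scan2000 is implemented.

```python
def scan_names(first_page, sheet_count, step=2, ext="tif"):
    """File names for one pass through the feeder: first_page, then every step-th page."""
    return ["%03d.%s" % (first_page + step * i, ext) for i in range(sheet_count)]

odd_pass = scan_names(1, 3)    # first pass, starting at 001.tif
even_pass = scan_names(2, 3)   # second pass, starting at 002.tif
in_order = sorted(odd_pass + even_pass)
```

Because the names are zero-padded, sorting them as plain text puts the pages back in reading order.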
So now I have a bunch of consecutively numbered CCITT4 TIFF files. At this point, I could use a freeware program called cc42 (search for it at www.pdfzone.com) to combine all of the sequentially numbered CCITT4 TIF files into a single PDF file with the pages in order.
Or, if making e-texts, not PDF files, I OCR the pages and save them as corresponding pages like 001.txt, 002.txt, etc. I also use Paint Shop Pro (shareware 30 day trial) to batch-convert the tiff files into GIF file format. I can then upload the GIF files and the correspondingly numbered text files to the Distributed Proofreaders page (http://texts01.archive.org/dp/) to have them rapidly proofread by numerous proofreaders, who finish the task at a rate of 50-100 pages a day per book, very roughly speaking. When done, I then download the text files as a single text file combining all of the files. The upload function on the DP site is tedious, requiring one to upload each file one-by-one, but I spoke to the webmaster recently, and he said there are, with special arrangements, ways to FTP them or even e-mail them to him on CD.
Now, hard returns. It was once a grave problem to fix hard returns so that the text outputted to 65 characters per line. Then I got a freeware program called Clipcase at www.shareware.com. With Clipcase, you select a body of text (about 20 pages or so; any more, and the program crashes) in your word processor, copy the text to the clipboard, then load up Clipcase, paste the text into the Clipcase window, then process the text.
When this happens, all of the hard carriage returns within the text are eliminated, EXCEPT for returns between paragraphs. Then, you select the text, copy it, and paste it into any word processor to process it. I use Microsoft Word. After pasting all of the text into it, I select all of the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches. With this setup, when the text is saved as "Text with layout," the resultant text is 65 characters per line, every line. Setting hard returns is automatic.
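The same two effects (hard returns removed inside paragraphs, then lines reset to at most 65 characters) can be had from Python's standard textwrap module. This is offered as an alternative sketch, not as a description of Clipcase or Word; it assumes paragraphs are separated by at least one blank line.

```python
import re
import textwrap

def rewrap(text, width=65):
    """Unwrap each paragraph, then wrap it again at the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        one_line = " ".join(paragraph.split())  # drop hard returns inside the paragraph
        wrapped.append(textwrap.fill(one_line, width=width,
                                     break_long_words=False,
                                     break_on_hyphens=False))
    return "\n\n".join(wrapped) + "\n"
```

Poetry, letters and other deliberately short lines would be run together by this, so such passages need to be set aside first.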
Then I spell-check the text, and also skim through it to look for typos and "categories" of errors that tend to occur repeatedly within the text. One common error is having a single dash instead of two dashes, for example:
He lingered-slowly. as opposed to: He lingered--slowly.
Another common error is a space between a period, exclamation mark or other punctuation mark, and the letter that came before it, such as:
Hey ! instead of Hey!
or " Hey, " instead of "Hey,"
I then use the "Find/Replace" command within Microsoft Word to efficiently get rid of these. For example, I might tell it to look for ^w", where ^w means "a white space" and " is a quote. This looks for white spaces before quotes. "^w looks for white spaces after quotes. ^w! means a white space before an exclamation mark. I can also have it look for "any letter"-"any letter," so that it finds single dashes between letters, and then I can decide if I want to replace these with double dashes. By using these kinds of find/replace tricks, it becomes easier to remove typos.
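Outside Word, the same kind of search can be written with regular expressions. The Python sketch below is an assumed equivalent, not a translation of Word's ^w syntax: one function removes spaces before punctuation (leaving a spaced ellipsis alone), and the other only lists single dashes between letters, since many of those are ordinary hyphenated words and the choice has to be made case by case. Spaces next to quotes are left out because a space before an opening quote is normal.

```python
import re

# Space(s) after a letter or digit and before punctuation; a period that
# begins a spaced ellipsis (". .") is deliberately not matched.
SPACED_PUNCT = re.compile(r"(?<=\w)[ \t]+([,;:!?]|\.(?! \.))")
SINGLE_DASH = re.compile(r"[A-Za-z]+-[A-Za-z]+")

def fix_spaced_punctuation(text):
    """Turn 'Hey !' into 'Hey!'."""
    return SPACED_PUNCT.sub(r"\1", text)

def find_single_dashes(text):
    """List letter-dash-letter runs for a human to accept or change to '--'."""
    return SINGLE_DASH.findall(text)
```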
When done, I save as "text with line breaks" and it is done.
That's basically my procedure. 1 week turnaround time and 6 man-hours on my part for a 190k text file...
Ken Reeder
The Story of My Life (as pertains to PG) by Ken Reeder June, 2002
I am currently finishing up my fourth etext, with two more etexts in process, another seven books sitting on the shelf waiting, and a lot of additional books that I would like to do when those are done.
Sixteen months ago I was blissfully unaware of PG and of the world of online books. A couple of things seemed to come together to lead to my involvement with PG. I spent some time helping one of my sons, for a school project, in an unsuccessful search for an online English translation of Pliny's Historia Naturalis. About a year before that I had been tinkering, for no particular reason, with trying to type one of my favorite older sci-fi books into a text file. And I had been thinking, occasionally over the course of a few years, about a series of books to which I was avidly devoted when I was about twelve or fourteen years old, which was widely available then but is relatively scarce now. It was a web search on the name of that author, Joseph Altsheler, which happened to lead me to some couple-year-old messages on the PG volunteers' bulletin board.
I poked around the PG web site a little and thought, hey, I think I could be interested in this. Only a few months before I had, for no particular reason, picked up a clearance-model parallel flatbed scanner (for which I paid $36, including shipping). The scanner package included some OCR software, so I already had the basics needed to scan a book to produce an etext.
So I rummaged around on the PG web site a good bit more, and lurked on the volunteers' board, and figured out that I could find the books that I wanted on Ebay or ABEbooks, and bought a couple of books for $10 or $15 each. I scanned a chapter or two and tried out the OCR, which worked very well. (The OCR software that came with my scanner is TextBridge Pro, which it turns out is one of the more highly-regarded OCR packages, so I was just lucky in that respect because I had no clue. I could see that the OCR software was clearly much better than some DOS software that I had used at work about 15 years ago.)
What appealed to me was that, firstly, it seemed like this was a worthwhile thing to do, with a big plus being that you can do the work from your own home, in your pajamas if you want, in whatever time you can spare. And I thought that, being a detail-oriented software-developer geek kind of guy, I would kind of enjoy it and also be pretty good at it - actually, I've always had an aptitude for proof-reading.
So I went ahead and mailed in a couple TP&V for copyright clearance, and set out to actually produce my first etext, a 348-page book which I completed in about 10 weeks, start to finish.
For a book with nice clear, good-sized print, I figure that it averages out to about 7 or 8 minutes per page to go through my complete production process. Some of the books that I am working on, with smaller or less-perfect print (and/or other complications) take a little (or a lot) longer.
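As a rough cross-check of these figures, the per-page estimate can be multiplied out for the 348-page first book. The Python below is plain arithmetic on numbers given above; the function name is hypothetical, and the result is only as good as the 7 to 8 minute estimate.

```python
def production_hours(pages, minutes_per_page):
    """Total hours for a book at a given average time per page."""
    return pages * minutes_per_page / 60.0

low = production_hours(348, 7)    # about 40.6 hours
high = production_hours(348, 8)   # about 46.4 hours

# Spread over the 10 weeks the first book took:
hours_per_week = (low / 10, high / 10)
```

That works out to a little over 4 hours a week, which is a plausible pace for a spare-time project.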
I feel that I've got my process pretty well set by now. I've put together several little home-made utility programs, written in FoxPro, which assist me. (I've put in some effort to try to adapt some of these for possible use by others, but the problems are that it takes a lot more work to polish software to the point that I feel comfortable letting somebody else pound on it, and the scope of what I think the software ought to do gets bigger every time I work on it, and it's not nearly as enjoyable - for somebody who develops software at work every day - as producing etexts.)
My complete production process, with rough time breakdown, is as follows:
1. Scan the book, 2 pages at a time, about 1 minute per scan (30 seconds per page). (I do not cut the pages out of the book, I just lay it flat on the scanner and press down on the spine.)
2. Run the BMP file through TextBridge Pro, about 30 seconds per page. (Again, when working with clear, good-sized print.) I save the output as text with no line breaks.
3. Run a little FoxPro utility that I wrote that massages and formats the file a little bit.
4. Do my first-pass proof-read, about 2 minutes per page, combining the pages into chapters.
5. Run another little FoxPro utility, which checks for some things that I might have missed during proof-reading.
6. Use MS Word to perform a spelling and grammar check, another 30 to 60 seconds per page.
7. Run another little FoxPro utility (number 3), which inserts line breaks, then run another one (number 4) which does some more exception-checking.
8. Do my second-pass proof-read, about 2 minutes per page.
9. Combine the chapters into one big file. Run a couple more little FoxPro utilities (numbers 5 and 6) which do some final formatting, checking and analysis.
10. Send the file to Jim Tinsley, who will graciously run it through his GUTCHECK program which scans for a lot of common errors.
11. Call it an etext and send it in for posting.
My primary goal is to produce a quality etext - I don't particularly care about trying to speed things up. I mean, I don't want to needlessly waste a lot of time, but I look at this as a hobby and I enjoy working on it, so I don't get out my stop watch to see if I can get 20 pages done faster today than yesterday. (When I go out running, then I'm concerned about whether I'm faster today than yesterday.) I generally put in maybe 5 hours a week on PG - actually, it's often easier for me to fit in some PG work on weekday evenings than on the weekend. And it is definitely gratifying when the etext is done and not only does it get posted on PG, but then links and copies pop up in different places like the "Online Books Page", and DMOZ.org, and Blackmask.com and Bookshare.org.
I have not encountered any real stumbling blocks so far. There were a few things that took some time to figure out. For example, when my first etext was ready, I was pretty sure that it was expected that I would put the PG header on myself, but I looked all over the web site and could not find a "master" copy. (Actually, I think the master, such as it was/is, is available on Lyris, but I was not subscribing to Lyris then.) So I just pulled the header from a very-recently posted etext, but then after I sent the etext in it was posted with a different header anyway. (Nowadays, my understanding is that the PG "staff" prefers to put the header on.) I also spent some time researching 8-bit code pages, but I expect that the new big-FAQ will provide easy access to all the answers that I had to hunt down then. There's a lot of good information buried in past messages on the volunteers' board, but no good way to search out information on a particular topic.
So far I've been able to fill all my book needs without spending much money. I find my books through ABEbooks, or from Ebay, plus I've gotten a few at Ohio Book Store downtown on Main Street. I've rarely paid as much as $20 for a book, even including shipping. There's one book that I've purchased (but not yet started work on) which costs $1000 or more for the original edition, but which is also available in paperback reprints for about $10. There are some other books in my future plans which look like they will be more expensive, but we'll worry about that when the time comes.
My wife still cannot understand why I spend my time scanning books, whereas my kids (and, I guess, most other people I know) seem to think it's a little eccentric but basically acceptable behavior. Personally, I definitely enjoy producing etexts and hope to keep doing so for a long time. My thanks to Michael Hart, Jim Tinsley, Greg Newby, and untold others who devote so much effort to nurture the project and grease the skids for the rest of us. Long live Project Gutenberg.
Lynn Hill
I have been involved with PG since 1994, when I first began reading texts on-line during slow times at the office where I worked. (I once got into trouble with a co-worker when she found me "processing" Little Women instead of the week's payroll report.) I was surprised to find, even then, such a wide variety of material in the PG archives. I found myself re-reading favorite books from my childhood, and delighting in finding "new" ones--Little Lord Fauntleroy, The Secret Garden, Heidi, the Oz stories. They were not at all like the sugary old films I had seen on television. They were funny, heartwarming, and utterly charming. After some years as a reader of the texts, I found myself thinking, "I'd like to try this."
When I first checked out the web page for volunteers, I felt overwhelmed. There were all sorts of FAQ's, but when I read them, I was baffled by all the information about file types, fonts, and other details. I didn't even know where to get books, let alone what to do about jagged right edges or indented lines. It was frustrating -- I had all this enthusiasm but didn't know where to apply it. I dawdled for some months, then came back and turned to the PG Volunteers' message board for help.
Help came from many sources. I found someone who needed a file proofread, so I offered to read it. This worked out well, and I even found a couple of typos in it. I proofed some more files for this person, and then some for other people on the board.
After a while, I was ready to try a whole book -- and from Dianne Bean came my first PG book, "The Golden Slipper" by Anna Katharine Green. When I opened the box, a stale smell floated out, and then I found a chunky book with the ugliest green cover I've ever seen on anything. The date was 1915, and the book was starting to crumble all around the edges. My first reaction was "Who would ever want to read this???" But since I had promised to do it, I dutifully started scanning and reading as I went along. The book was a collection of mystery/suspense stories about a teenage crime-stopper named Violet Strange. (I always felt as if Scooby Doo and his friends might turn up at any moment.) As I read, I began to like Violet, and to notice how different her world seemed from ours. By the time I reached the end of the book, I felt proud of myself for "saving" some good stories for the future, and ready to try another book.
My suggestion to new PG'ers is to jump in and not be shy about volunteering. PG is a big group of great people who care, but they do not know you are out there until you say something. Once you speak up, they will do anything short of triple backflips to help you.
There are many ways new folks can join in, from scavenging old books at yard sales all the way up to proofing files or scanning and typing in whole books. When you send in your first copy of title page and verso, be patient -- it takes time for your copyright research to be done. This is a great time to do proofing on-line at one of the distributed proofreading web sites.
I get my books from library sales, yard sales, friends I met on the PG Volunteer board, and even from elderly neighbors who wanted to lend me favorite books they have saved. When you want old books, tell everybody you know. They may come up with a lot of eligible books you wouldn't have expected.
When you find an old book, my second piece of advice is not to be too hasty in deciding whether you want to read it or not. Old books are dated, naturally, but they can show you things about life in the past which you can't pick up from an A&E documentary. I am especially interested in the way women and children are portrayed in these old books--every woman is not necessarily a lady, and every child is not a sweet little angel. (If you haven't read Little Lord Fauntleroy, you are missing a lot of laughs.) These insights and ideas can keep you going through a lot of long dark winter evenings, and they're handy to think over when you hit the occasional dull chapter or scene.
My hardest text to do was See America First, by Orville Hiestand. The author invites readers to join him on a trip from Ohio to Massachusetts, in which he visits several landmarks and historical sites and entertains you all the way with obscure poetry, proverbs, and little moral lectures about each rock and robin he encounters. I told my husband, Chris, that the author's (literally) rambling style was driving me crazy. Chris proofread some chapters for me, then commented, "Boy, you never see anybody these days have such a fun time going nowhere!"
By now, I've done nine complete texts, and have boxes of other books to do. I have found that children's books are my favorites, but I will try anything if it is clear enough to read. I don't work on PG every day, or even every week if I get too busy with other things, but I keep coming back. I find PG projects to be very relaxing, a way to use my computer and writing/proofing skills, and also a refreshing change from my daily work. It's also a great excuse and motivation to read lots of books!
Sandra Laythorpe
HOW I STARTED AS A GUTENBERG VOLUNTEER
I first learned about Project Gutenberg from a computer magazine, so I searched for it on the Internet, and found all these classic books I had wanted to read for years, and they were free! At that time, I read a paperback copy of The Heir of Redclyffe by Charlotte M Yonge. I thought it was a wonderful book--indeed I still think it is the best novel to come out of the nineteenth century. After reading the 'How To' files on the Gutenberg site, I thought maybe I could produce Miss Yonge's books with the equipment I had. I wrote to Michael Hart and asked him, and got a very positive reply and lots of information from him.
I jumped in the deep end! I bought a very old copy of The Heir of Redclyffe, sent the photocopies of the title pages to Michael, and sat down at the computer, learned to use my OCR facilities, and got on with it, learning by my mistakes. The Instruction files told me most of what I needed to know, and Michael gave me an introduction to David Price, an experienced Gutenberger, who would be able to help me. He has been invaluable in explaining things; I don't think I could have produced my first attempt without his guiding hand.
I buy my books off the Internet, or from local dealers. Most of Miss Yonge's work is still available from second-hand bookshops, and I am happily living in a location where they are not too scarce. I have Gutenberg colleagues, now, helping with CMY, and I post books to them snail-mail, if they can't buy them in their own countries.
THIS IS HOW _I_ DO IT.
I use the PrimaPage OCR program; it was on the disc which came with my Primax Colorado Direct scanner, and I do the work on my PC. Before I start, I open my scanner program, and adjust the settings to take black and white photos, and the brightness to about minus 35 or 40. This is crucial, as I won't even be able to _see_ the page until I get it right. When I first began, it took many adjustments to get it right. There should be as few mistakes as possible on the OCR result. If the photograph is too light, the OCR reads words wrongly. If the photograph is too dark, there are shadows which create black patches on the pages. If I can't get rid of these black patches, I have to tear the pages out of the book and do them one at a time. Important: don't buy first editions!
I use the scanner to take a photograph of two pages. The photograph appears on the screen. Then I close the photograph, which my computer calls 'untitl1'. Next I open my OCR program, and search for file 'untitl1', and open that. Then I ask the program to clean it, and then I click onto the button that 'reads' the photograph and converts it from pixels into letters = Optical Character Recognition!
When I get the OCR result (which takes only a few seconds), I save the 'read' text file into my own documents, numbering the file the same as the number of the page of the book. I have created a folder called 'Gutenberg', and I save it in there in a text-only format. So I go to my Gutenberg folder, open this new file, and visually correct the mistakes. I save the finished page, create a Chapter 1 file, and save it and subsequent pages that I have prepared, to build up the whole book. After I have proofed the OCR result, I paste the finished text into a Microsoft Word document, setting the font at Courier New size 10. This sets the lines at the right length for Gutenberg. When I have finished the whole book in Word, I save it as text-with-line-breaks, to get the final text file, which I send to be posted on the Gutenberg site. I proof my work two or three times, depending on the quality of the OCR result, and do a final spelling check with MS Word. I don't ask other people to proof my texts, because Miss Yonge's idiosyncrasies are liable to get edited out, unless the proofer has the book to hand.
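The line-length step above is done with Word and a fixed-width font, but the same kind of re-wrapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method described: the function name `rewrap` is made up for the example, and the 70-character width is an assumed target rather than a figure given in this account.

```python
import textwrap


def rewrap(text: str, width: int = 70) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns.

    The default width of 70 is an assumption made for this sketch.
    """
    paragraphs = text.split("\n\n")
    wrapped = [
        # Collapse existing line breaks and runs of spaces, then re-fill.
        textwrap.fill(" ".join(p.split()), width=width)
        for p in paragraphs
        if p.strip()
    ]
    # Keep one blank line between paragraphs, as in a plain-text eBook.
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = ("This is a corrected OCR page whose lines came out at "
              "uneven lengths.\nIt should be wrapped evenly.\n\n"
              "A second paragraph follows.")
    print(rewrap(sample))
```

Wrapping paragraph by paragraph keeps the blank lines that separate paragraphs, which a single pass over the whole file would lose.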
It took me 6 months to prepare my first text, The Heir of Redclyffe, but I can do 10 pages an hour now.
In my Gutenberg folder, I have other useful files for reference, mostly downloaded Gutenberg Instructions files. So if I need to find something out, I can look in these files--it is much easier than searching on the Internet. If I need to know something I can't find in these files, I may ask a question on the Volunteers WWW Board, although I try not to, because the answers are nearly always in the files.
I try to process 2 sheets of 16 octavo pages a day, taking about 3 or 4 hours. I do my housework & gardening in the morning, then settle down to an afternoon's happy Gutenberging :-).
WHY DO I GUTENBERG?
When I became semi-retired, I wanted to do some voluntary work on the Internet. Coincidentally I began reading the works of Charlotte M Yonge, and discovered that most of her works are out of print now. I felt that they deserved a much wider audience, so I decided that my voluntary job would be to do just that. Miss Yonge lived in a village only a couple of miles away from me, so I had a local interest, too. On my web page, http://www.menorot.com/cmyonge.htm, you will find out a little about her, and Otterbourne, the village she lived in all her life, and find links to other web sites about her.
I discovered the Charlotte M Yonge Fellowship http://www.cmyf.org.uk/ and am now in contact with other people who appreciate her work, including academics who write clever things about her. Her books are about families, their interactions with each other, and how they, in Christian terms, grow in grace. I don't think there is another writer who can write so well about families. She was a Tractarian, a Christian who, in the nineteenth century, believed that people could be influenced for good by what they read. For this reason, 20th century people found her characters too moralistic, and her prose too turgid. I think her novels are delightful, her characters lovable, and her prose is minutely descriptive. It was said about her that she was 'able to make goodness exciting'. This is a rare talent, perhaps only found in other Christian writers like John Bunyan or Charles Kingsley.
Through the Gutenberg site, Miss Yonge's works are more easily available than ever. She originally wrote for upper and middle class young women. Even though I live a century and a half later, I can recognise her characters in their 'descendants' who live around me, but I sometimes wonder what Chinese, African, or even modern American readers think of her, their own backgrounds so different from the English Victorians.
I enjoy making Gutenberg texts; the work is simple once you know how. I would prefer, however, to see them presented in HTML. The modern ebooks all need to be in HTML format to present nicely on their tiny pages. I believe Gutenberg is going to publish HTML files, and I would like to learn how to do it. Eventually, I think Gutenberg files will be available in a format that will work on all PCs, handhelds, palms, and ebooks--but I don't know what that format is yet; I don't think standards have even been worked out among the ebook publishers.
Finally, yes, I do find mistakes in my published texts. When I have finished all 200+ of Miss Yonge's books, I am going to go through them all for the second time, and remove the mistakes. So, my work is cut out for many years to come. . . .
Suzanne Shell
Over the past several years, I visited the Project Gutenberg website occasionally, looked at what was involved in making a significant contribution to the effort, and left after downloading a few books--PG was a project that would need to wait until I retired.
In the summer and fall of 2002, I was doing research on e-books (sources, devices, costs) for my library, and ran across Distributed Proofreaders. I discovered Blackmask.com at about this time, and also followed a link from there to Distributed Proofreaders. Serendipity! After backing away a few times, I took the plunge and registered on November 5, then began proofing. The however-many-pages-I-wanted-to-proof commitment was just right for letting me get a feel for the process, and to start me thinking of the ways I could exploit all this free labor to get the books _I_ wanted into PG.
I was feeling quite virtuous about proofing my 10-20 pages per day, when I visited the site on November 8, and NONE of the books I was working on were available. Also there was this perfectly absurd number listed for number of proofers having proofed at least one page (it had roughly quadrupled). I KNEW the site had been hacked. Actually the site had been slash dotted. The DP discussion forums were so active, it was hard to find time to read all the messages, questions, suggestions, and complaints; these rapidly led to new documentation and more detailed proofing guidelines. Books moved through the site so rapidly that they brought out the "hard stuff" from the bottom of the to-do stack, and were STILL desperate for content. I was a relative "veteran" after just a few days, and helped out a little by answering questions, but I was still a beginner. I had some PG dreams that DP could make reality, but I needed to learn the ropes first.
Some of my ambitions revolved around professional goals--there are some public domain titles, which, if available in electronic form, would be extremely useful to my library's patrons. There are also some standard reference books and indexes--Granger's Index to Poetry is one example--that have pre-1923 editions that could still be important resources. In order to learn what I needed to know about providing content, though, I decided to start with something less overwhelming (wanting to read it on my e-book reader was just a coincidence). I went to my bookshelves and pulled out my P. G. Wodehouse reprints. I downloaded and read the scanning and submitting FAQ from the DP site, requested and received clearance for the first book (_Uneasy Money_) in late December, and got to work mastering my scanner. I tried Omnipage Pro first, but decided that ABBYY Finereader Pro did a significantly better job of the OCR. I offered to be a "behind the scenes" manager for the book while it worked its way through the site, but was made an official "Project Manager" instead. Although the first frenzy following the slash dot invasion had calmed down, DP was still feeling a need for more content and more hands to manage projects.
On January 5, _Uneasy Money_ started proofing; it went through 2 rounds of proofing in less than 20 hours. I felt like a hick marveling at a traffic light changing colors, but I sat at my PC and watched the page count go down. By this time, I had also scanned and OCR'd a couple more Wodehouse reprints and a short book of poetry. I was hooked! Juliet Sutherland and the other admins had recruited some experienced DP'ers to help train new post-processors in the job of preparing final PG texts. I was handed over to one of them. After several projects, I "graduated" and was given permission to upload my own projects. My intent was to do 3 or 4 projects a month, no more than I could handle post-processing by myself. I planned to process an occasional reference book in addition to all the Wodehouse I could get my hands on. So much for plans...
One ongoing concern of many Distributed Proofreaders was how to train new volunteers in the DP style of proofreading. (It is somewhat idiosyncratic because of the distributed nature of the process.) We were still coping with the aftereffects of the massive influx of slash dotters--quantity benefited, but quality suffered. Super7, one of the highest volume proofreaders, suggested setting aside a project without complex formatting for "Beginners" and asking that the second round proofers (all of whom should be veterans) send feedback and encouragement to the newcomers. This was tried successfully, and with a couple of variations. Since I had been planning to start running a variety of genre fiction through the site, I then volunteered to manage these as beginners' projects for as long as the supply held out. All of a sudden, starting in February 2003, the amount of time I needed to spend locating, scanning, OCR'ing and managing books increased drastically, and the amount of time I could devote to post-processing decreased. Luckily, "veterans" stepped in to answer newcomers' questions, and to serve as "Mentors" in the second round of proofing. Recently, others have provided "beginners' projects", to help keep up with the demand of a steadily increasing flow of new volunteers. These projects are also useful for helping new post-processors learn the job.
I still have some ambitious projects planned: Granger's _Index to Poetry_, the unabridged edition of _The Golden Bough_, Curtis' _The North American Indian_, and the _Book Review Digest_ (volumes for 1905-1921). A couple of volumes are already waiting to be proofed; others are waiting to be scanned on the PG tabloid scanner. But, in the meantime, there are 23 new Wodehouse books in PG thanks to Distributed Proofreaders, not to mention such remnants of early 20th century popular culture as _The Sheik_.
I believe that a major accomplishment of Distributed Proofreaders has been the creation of a way to provide on-the-job training for PG volunteers. Steady improvement in the quantity and quality of training techniques and documentation, enhancements to the user-friendliness of the site, and ready access to the collective experience and advice of a wide range of volunteers in the Forums have resulted in a growing core of active and experienced volunteers in all the facets of e-book production. I'm sure that I could not have progressed from a total newbie to a regular PG contributor within a 5-month period without this support structure. Regular communication and collaboration with book-lovers from around the world has enriched my life. The fact that it is easier to get leave from my job than from DP is perhaps beside the point...
Tony Adam
How did you learn about PG?
It's been so long, I don't really remember! I probably read about it on a library listserv (I'm a librarian), and since making old texts accessible has always been a concern of mine, I jumped right in.
What was your first contact like?
Great! Mike Hart has always been easy to deal with via e-mail, although we've never talked. He and the "crew du jour" directed me to the FAQ and I took it from there.
What was the first PG job you did? How did it go?
My first job might have been Henry James' _Turn of the Screw_ (I just found a note from September 1993 on copyright clearance for it). Since in a former incarnation I was editorial assistant for the _Henry James Review_, I thought that would be a good start. I've always typed the files (I'm a fast typist), and I think we had few problems along the way.
How did you develop your PG experience from there?
Helter-skelter, much like my reading habits. I work at a historically black university, so getting 19th C African-American works posted is a central concern. I've done _Clotelle_ (the first African-American novel) and the autobiography of Henry O. Flipper, the West Point cadet, and I'm always looking for something new in that area. Somewhere along the way I got sidetracked into essays by Whittier and other U.S. poets, and I've collaborated on early American historical documents and Sir Walter Scott with a fellow PGer up in Ohio and Chinese documents with another contact in Japan. A couple of years ago, I saw that someone in San Francisco needed help with the Shakespeare Apocrypha, and that has occupied my time on and off since. It's always something!
Can you tell us about the first text you produced?
I think it was _The Turn of the Screw_, which was a good starting point--not too long, a good read, etc. Just plugging away at the text a few pages a day made the process go quickly.
Why do you spend your hours contributing to PG?
I love the idea of making all of this print knowledge available to anyone anywhere. Working in a library that has suffered budget problems over the years opened my eyes to the need for acquisition of as much free stuff as possible for our students and faculty. Besides, in a perverse way, it's fun!
Do you specialize in any particular kind of work? of texts?
I've probably focused more on plays, historical documents, and 19th C U.S. works than anything else.
What do you like about making a PG text?
Having a project come to fruition--finally seeing an almost forgotten text come to life again.
What do you dislike about making a PG text?
The work can be tedious at times, depending on the author. But sometimes you have to plow through to get something significant processed. For example, we probably should have more philosophers represented, but what a horrible thing it would be to scan Kant!
Where do you get your eligible books?
Mostly from my library's collection, although I finally purchased my own copy of the Shakespeare Apocrypha (it's very hard to find, which makes it very suitable for posting). I've interlibrary loaned some items, but that's also been unusual.
Do you type or scan? What Scanner / OCR / Editor / WP do you prefer?
I still type everything--it's easier when working with a play, I've discovered. But I'm purchasing a scanner in the very near future and will do more with that.
How do you check your text? Any special tools? spellchecker? Do you print it out and read it? Put it on your PDA and read it? Have a voice synthesis program read it aloud to you from your PC?
I usually run it through the spellchecker, although depending on the work, I read it line by line a second time.
Do you have any tips'n'tricks or special routines you go through when preparing a text?
The best thing to do is put yourself on a schedule--do a set amount of pages every day, and you'll be surprised how quickly you get to the end. I also make a pencil mark in the book at a stopping point and even read back a paragraph to double check what I last entered.
How long does it take you to make a text?
Depends on my work schedule, other assignments, time of year, etc. A play might take a couple of weeks, but a Walter Scott novel could take six months. I think my record is probably one day for an essay, but that's unusual.
Do you work alone, or do you share the work of each text? Does anyone regularly help you proof the text?
I've worked alone and on teams, depending on the text. No one regularly helps to proof the text, but occasionally someone else does.
Do you do some PG work regularly, or drift in and out as opportunity permits, or when you feel like it?
I consider myself a regular, as time permits. In other words, I haven't dropped out of the picture, but sometimes I might not enter anything for up to a month.
How many different kinds of work, or different books, have you done?
Not sure how many different books I've done, but it's been a wide variety: James' and Scott's novels, Whittier's essays, a whole collection of early American documents (mostly New Netherlands), Shakespeare (accepted canon and the apocryphal works), some odd works (_The Psychology of Beauty_ comes to mind)--the list goes on and on. I've even forgotten that I've done some titles!
What do you like about the PG process?
That it's open-ended--if I think I have something that should be posted, I don't have to jump through hoops and ladders to get permission (other than copyright clearance).
What do you dislike about the PG process?
Can't think of anything offhand.
Is there anything you'd like to see PG doing differently?
I know it's a bone of contention, but we probably need to explore moving away from ASCII.
If one of your friends approached you to ask advice about how to get started contributing to PG, what would you tell them?
Start with something fun, that's close to your heart, and keep plugging away a little bit at a time.
What do you expect Project Gutenberg to be like in 5 years? 10 years?
We'll probably be a whole lot bigger (texts and personnel), with a different look to the texts. Maybe we'll even have more audio versions of texts, using some of the new software that's coming out.
Tonya Allen
I discovered Project Gutenberg in about 1997. After several years of enjoying PG's texts, in June of 2002 I decided it was time to start contributing. Via the PG web site I learned that the easiest way to do this would be to help out with proofreading via Charles Franks' Distributed Proofreaders web site. The day I signed on I proofed nine whole pages of a children's book called _Curly and Floppy Twistytail_ and felt very proud to be contributing.
At that time, there were probably only about 40 active volunteers on the site each day. Often I proofed an entire book almost all by myself over the course of a week or so. Things moved at a leisurely pace; guidelines were few and simple; and I had fun reading old books and discovering new authors.
After a few months a request was made for volunteers to post-process texts in French. I volunteered to help with this, and that was how I became a post-processor (PPer). Shortly afterwards, the web page listing texts available for post-processing and sign-out was unveiled. I remember several times checking and being disappointed because there was nothing currently available (hard to imagine now when there are always at least 40 texts waiting).
One day in November, I picked out a likely-looking text from the proofing page, and settled down for an hour of reading. As I recall, it was _The Greek View of Life_, a sizeable text of which only a few pages had been proofed so far, and which I thought would last for several days at least. At about that time, someone emailed me to say that DP had been "/.ed." "What does that mean?" I replied. I soon found out.
I had been proofing away peacefully for a while when suddenly instead of the next page, I got a page about twenty pages further on. The same thing happened again and again, and suddenly all the pages were gone; the whole text had been completed. DP had indeed been slashdotted.
Since then, a lot of amazing things have happened. The number of active volunteers per day has increased almost 1000%. The number of texts that go through the site has increased exponentially. All kinds of proofing and processing tools have been developed. I now spend most of my time checking texts that others have PPed, and submitting them to PG, at an average rate of one to four per day--quite a leap from nine pages of _Curly and Floppy Twistytail_. And I'm looking forward to everything that lies ahead as DP continues to evolve.
Walter Debeuf
Quite by chance I became aware of PG when I was surfing and looking for interesting sites. I vaguely knew the name because I had heard of the Project a long time ago. After reading the "History and Philosophy of PG", I immediately became wildly enthusiastic about it. This was what I had been looking for for years, a meaningful use of my PC, and because I am a fervent lover of good literature, I didn't hesitate to contact the founders of the Project. I made a suggestion that I should work on French and Dutch e-texts. The very same day I received an answer from PG in which they told me they were very pleased with my contribution but that I had to keep in mind that all books must be free of copyright and published before 1923.
This wasn't so great. . . . After I browsed in the "Help And FAQ" of the PG site, I read that I didn't have to worry about all that, because they are willing to do all the clearance!
On my own bookshelf I found an old book by Jules Renard, "Poil de Carotte". It seemed old enough to me, but I couldn't find any copyright notations. So, I mailed to Mr Hart all the information I found on the title page and the verso, and asked him what he thought about it. The next day I received his answer; he wrote: "We still have to prove this edition was pre-1923, so I am forwarding to our authority on such copyright research." This authority is Ms. Dianne Bean, who mailed me a few days later, very pleasantly, that I could start typing, because the copyright issues had been resolved. She asked me to send a "TP&V" (a photocopy of the title page and verso) of the book to Mr. Hart, because they need that for legal reasons.
But something wasn't very clear to me concerning the format I had to use. In the "FAQ" they spoke about "plain vanilla ASCII", something I had never heard of in my life! In "How to Volunteer, PG Volunteers' Board" Mr. Jim Tinsley answered all kinds of questions about all kinds of problems people have when they start volunteering. So I did the same and sent him my question. I received an extensive answer about all kinds of formats in the "ISO 8859 Alphabet Soup" and he recommended that I use "Codepage 1252", which is very common in Windows. Here are the addresses which Jim sent to me:
"If you are interested in the differences, I recommend the excellent web page
http://czyborra.com/charsets/codepages.html
in the excellent reference site http://czyborra.com"
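To make the difference between these character sets concrete, here is a minimal Python sketch. It is an illustration only, not part of Jim's answer: the helper name `encodable` and the two sample strings are made up for the example. It checks whether a piece of French text can be stored in plain ASCII, in ISO-8859-1, and in Codepage 1252.

```python
def encodable(text: str, charset: str) -> bool:
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical samples, written with escapes so this listing stays ASCII.
samples = {
    "e with circumflex": "P\u00eacheur d'Islande",
    "oe ligature": "c\u0153ur",
}

if __name__ == "__main__":
    for label, text in samples.items():
        for charset in ("ascii", "iso-8859-1", "cp1252"):
            print(label, charset, encodable(text, charset))
```

Accented letters such as the e with circumflex fit in both ISO-8859-1 and Codepage 1252 but not in plain ASCII, while the French oe ligature is present in Codepage 1252 and absent from ISO-8859-1.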
I chose a French book, first because I had it already on my bookshelf, and secondly because I wanted to perfect my knowledge of the French language and typing seemed the right way to do it. When copying an author's text, you are very close to it. You also have to pay full attention to the spelling of the words. Gradually you come under the spell of the story and you forget that you are typing . . . Nevertheless, it is hard work, especially when it is not your native language, and therefore you shouldn't try to rush it. At first I started with two or three pages a day, which means that you would need about two months typing for an average book. But good typists can do it more quickly.
I can only applaud the aim of PG, to make as many books as possible available on the net, without cost, for everyone in the whole world. I love to co-operate with it.
In the meantime there are thousands and thousands of books in the PG-collection, and that makes it a little difficult to find other examples which are free of copyright, because they must be from before 1923. Since I've got the "PG-bug" it's a challenge for me to find suitable copies, and I look for them high and low. I can buy a few books for a song and I take them home as a trophy, looking forward to the work which is waiting for me . . .
In libraries you can find old publications which you can find nowhere else.
It's amazing how fascinating old books can be and how much you can learn from them. For the moment I'm working on "Pecheur d'Islande" by Pierre Loti, in which I get acquainted with an old tradition of fishermen, very interesting. Without PG I would probably never have read this. There must be still a lot of little treasures in some old and dusty attics, waiting to be born again by the magic touch of a PG-volunteer.
If you do it, no compensation or payment is waiting, but . . . doing something disinterested and unselfish gives you a good feeling.
Bookmarks:
B.1. Project Gutenberg:
Home Page and Search <https://www.gutenberg.org/>
Contact Information <https://www.gutenberg.org/contactinfo.html>
Donations <https://www.gutenberg.org/donation.html>
List of FTP sites <https://www.gutenberg.org/list.html>
Web Browse to texts <http://www.ibiblio.org/pub/docs/books/gutenberg/>
Mailing Lists <https://www.gutenberg.org/subs.html>
Volunteers' Board <https://www.gutenberg.org/vol/wwwboard/>
Copyright Rules <https://www.gutenberg.org/vol/pd.html>
Books In Progress <http://www.dprice48.freeserve.co.uk/GutIP.html> (The InProg List)
Greek Transliteration <https://www.gutenberg.org/vol/greek.html>
Music <http://www.ibiblio.org/gutenberg/music/music_helpex.html#what-software>
GUTINDEX.ALL <ftp://ibiblio.org/pub/docs/books/gutenberg/GUTINDEX.ALL> (Complete list of posted eBooks)
B.2. Distributed Proofing Sites:
Charles Franks <https://www.pgdp.net/>
JC Byers <http://www.wollamshram.ca/1001/index.htm>
Dewayne Cushman <http://www.metalbox.net/dcushman/pgroot.htm>
B.3. Other On-Line eBook Pages:
The On-Line Books Page <http://onlinebooks.library.upenn.edu/>
/In Progress List <http://onlinebooks.library.upenn.edu/in-progress.html>
Internet Public Library <http://www.ipl.org/>
B.4. Lists of Suggested Books to Transcribe:
PG Books In Progress <http://www.dprice48.freeserve.co.uk/GutIP.html>
On-Line Requested List <http://onlinebooks.library.upenn.edu/in-progress.html#requests>
Steve Harris' "To-do"s <http://www.steveharris.net/PGList.htm>
B.5. Finding Paper Books On-Line:
Advanced Book Exchange <http://www.abebooks.com>
Alibris <http://www.alibris.com>
Trussel BookSearch <http://www.trussel.com/f_books.htm>
Library of Congress Catalog <http://catalog.loc.gov>
B.6. Character Sets
Overviews <http://czyborra.com> <http://www.cs.tut.fi/~jkorpela/chars/index.html>
ISO-8859 <http://czyborra.com/charsets/iso8859.html>
Microsoft & Other Codepages <http://czyborra.com/charsets/codepages.html>
Unicode <http://www.unicode.org>