Distributed Proofreaders rock Project Gutenberg
September 21st, 2007
What is the Distributed Proofreaders project? (The driving force behind Project Gutenberg!)
There are many growing information archives available on the internet, but when it comes to book content, Project Gutenberg (PG) is the granddaddy of them all. Founded in 1971, Project Gutenberg seems to be the oldest digital library project in existence, and it deserves its name, which honors Johannes Gutenberg, whose movable-type printing press ushered in an era of mechanical reproduction of information for distribution.
To that end, Project Gutenberg scans pages of texts (and acquires scanned copies of texts), and publicly distributes any materials that are free of copyright protection. If you want to download, for example, a copy of Milton’s Paradise Lost, well, there it is… The material available is amazing for reading, but from a data-freak perspective, it is also incredibly valuable for building a usable text corpus for data indexes (search engines). Enough about Project Gutenberg, go look at their library yourself… I’m interested here in how the process works.
But mechanically reading scanned pages with OCR software is error-prone, as books are analog data, and digitizing them ‘correctly’ has fun challenges… In the case of books, there is a well-established method for maintaining content correctness: proofreading. Project Gutenberg has spawned a great internet project of massive proportions of its own, the Distributed Proofreaders project.
Distributed Proofreaders
“Preserving History One Page At a Time” (um, that’s hot)
I joined the Distributed Proofreaders Project (DP) shortly before I started this blog, for what reason, I don’t really know… I’d like to say it was out of some personal humanitarian drive inside me, but I dunno. Perhaps it was because I wanted to learn something about their process and learn from their ways, to apply to my own book scanning project- but I dunno. Regardless of why I started proofreading, I’m happily addicted.

This is what the proofreading editor looks like; it’s very easy to use!
Personally, I have limited time to spend on new projects, so I set a small goal for myself: to attempt to proofread 1 page daily, while I read my morning news on the internet. Coffee in hand, it’s like doing mental push-ups as I move out of my morning funk. I’ve failed at that daily goal, but I do get a few pages a week proofread, and it strangely feels great… The task is so gargantuan that it’s easier to not feel guilty if I miss a day. The proofreading will never be done, ever. Infinity is strangely reassuring like that… So every page proofread feels like a success.
The project basically has 3 rounds of proofreading. P1 corrects the majority of the proofing errors from raw OCR output, and beginners are limited to this round for a while. P2 is spellchecking and careful comparison to the scan. P3 is the hardcore final round; then the text moves on to the formatting rounds. Finally, the text moves to the Smooth Reading round (the hardest round), and then it is checked in Post-Processing before it is posted to Project Gutenberg. So that’s like 9 reads of the text by human proofreaders, which explains the high quality of Project Gutenberg texts!
Especially at the beginning, there are these fantastic mentors: proofreaders who have been on the project a long time and are working in the high rounds. These people check your work as it hits the upper rounds, and provide feedback down the chain. The mentors are what make DP work; their feedback helped me focus, and not feel lost in some big computer… Their criticisms are always accompanied by references to the formatting manuals, and they’re just downright nice people.
In the end, I’ve learned a tremendous amount about proofreading, and read a few pages from some interesting texts. But more importantly, I’m addicted to Distributed Proofreading. It’s fun, and I look forward to my morning coffee. It’s an intellectual ‘morning jog’ that gets the blood flowing to the brain, warming up my neural synapses for the infospace of my daily hacking…
If you are curious about Distributed Proofreaders after reading this article, go join and give it a try!
P.S.: A shout out and thanks to all the mentors who’ve helped guide me along as I began:
TheEileen, Tiga, viviane, spiegel428, and storm. These folks ROCK.




September 21st, 2007 at 1:36 pm
One of my mentors wrote me with some corrections, (of course! :)
Here’s more information, especially about the upper Formatting rounds, which I haven’t gotten to yet…
“P1 (not just for beginners, although beginners must start there) actually corrects the majority of the proofing errors from the raw OCR output. However, they’re not supposed to do any ‘formatting’ or what DP considers formatting, such as moving text around, adding formatting tags, etc. They’re just interested in the text part of it: that it matches the original and it follows the guidelines.”
“There are 2 formatting rounds, F1 and F2. F2 is the hardest round as they have to correct any remaining formatting or proofing errors. Then it goes to post-processing where all the pages are run through a series of checks, sometimes reproofed again by the PPer, and then assembled into the final ebook. This is where the html codes are generated for the HTML versions. Then, optionally, the post-processor can send the book for smoothreading, but it is not actually required of all books, nor is it the hardest round. It’s actually the most fun round as the smoothreader can read right through the book and just note anything that doesn’t quite read right, or if they notice punctuation errors, guideline violations, etc. When they send it back to the PPer, the PPer makes a comparison of those notes to the original, makes any final corrections, and then either directly uploads the project to PG, or sends it to PPV (post-processing verification) where all the post-processing work is checked by a more experienced PPVer. When the project is finally sent to PG, a person dubbed a “whitewasher” (WW) (from “Tom Sawyer”) does a quick final check, adds the beginning and ending material (legal notices and such), indexes it, and adds it to the database–or possibly another team over there handles the database, I’m not too certain about what happens to it after it leaves DP…”
Rocket-
.ike