Green Fire by Ian Cohen conversion to ebook via scan OCR PDF

A few weeks ago Ian and I started talking about the idea of digitizing his book. After I described the feasibility of the project and proposed making the book available free of charge as a PDF, he agreed that it was a good idea and gave the go-ahead.

The project followed these steps:
  • take a copy of the book and guillotine off the spine
  • scan: feed the book into an office multifunction scanner via the ADF, set for double-sided scanning, 300 dpi, TIF, b/w optimized for text, A5; the machine processed the pages in under 1 hour, in several batches
  • pages with photos were individually scanned as color JPG; this was slightly more fiddly than the above
  • photoshop photo pages; crop, optimize contrast, clean up, save as greyscale or color
  • OCR: I used Tesseract from Google; once I got the process down pat I made a custom shell script containing one tesseract command per page/TIF, so one line for every file to be processed; this took maybe half an hour on my slow PC
  • assemble the text files and photos into a document; this process took about one morning, not counting fiddling
  • edit: make the new text presentable, insert an automated table of contents with the use of headings, correct OCR errors, change indents and quotes for consistency; this took about 2 weeks part-time
  • save as PDF; I used OpenOffice Writer for the above step, and its PDF export lets you dial the photo compression up or down, which turned a native 20 MB file into 3.5 MB
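
For anyone wanting to repeat the OCR step, here is a minimal sketch of the one-command-per-page idea. It is written in Python rather than the shell script I actually used, and the folder name, the function names and the "eng" language setting are assumptions made for illustration. The only Tesseract usage it relies on is the basic command form: tesseract, the image file, the output base name, and -l for the language.

```python
from pathlib import Path
import shutil
import subprocess


def build_ocr_commands(scan_dir, lang="eng"):
    """Return one tesseract command per TIF in scan_dir, sorted by file name."""
    commands = []
    for tif in sorted(Path(scan_dir).glob("*.tif")):
        # Tesseract adds ".txt" to the output base name by itself,
        # so page_001.tif ends up as page_001.txt next to the scan.
        output_base = tif.with_suffix("")
        commands.append(["tesseract", str(tif), str(output_base), "-l", lang])
    return commands


def run_ocr(scan_dir, lang="eng"):
    """Run the commands one after another, stopping at the first failure."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    for command in build_ocr_commands(scan_dir, lang):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # "scans" is a hypothetical folder holding the TIF pages from the scanner.
    run_ocr("scans")
```

Sorting by file name keeps the text files in page order, provided the scanner numbers its output with leading zeros.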

Incidentally, this project came about after I started reading my autographed copy of the book.

Green Fire is a first-hand account of activism and the Australian green movement, which now spans decades. The chapter about The Politics of Poo was especially humorous. I am a total noob to all of it for reasons outside the scope of this blog entry, but the book is probably a must-read for any Australian interested in protest actions.

Download the book via iancohennsw.blogspot.com


