Linux OCR with Tesseract

I’m scanning old Flor y Fauna newsletters for my Dutch Hardwood Investment Wiki. I need to do this because most of these newsletters, although produced digitally, are available in the Sicirec archive only in paper form. The only graphical item these newsletters sport is a simple header, so I want to convert the scans to text and put the text in a wiki article for each newsletter; I don’t want to upload dozens of image-heavy PDFs just to show the original (crappy) layout.

The problem, of course, is that I’m on Linux and I don’t know of any good free, open source OCR programs. I don’t know much at all about OCR to be frank. 😕

Anyway, I’ve found this Linux.com article by Mathis Dirksen-Thedens about doing OCR the hardcore way. The downside of his process is that you have to preprocess each image to end up with square, borderless chunks of just text. He recommends Tesseract. The Tesseract project brags that their “engine was one of the top 3 engines in the 1995 UNLV Accuracy test”. Wow, impressive! But, wait, there’s more: “Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available.” They’ve actually made me embarrassed for trying to do this with open source software. 🙁

Tesseract (and similar programs like GOCR and Ocrad) only do line-by-line, word-by-word character recognition, so it helps to have a program that first breaks up a page into graphical elements and simple text blocks.

In that category, OCRopus (Wikipedia) seems very promising, but it’s still in the alpha stage of development. Maybe that’s why it isn’t in Portage yet. Either way, that means I’m not going to try it. Before the 0.4 release, OCRopus supported only Tesseract as a character recognition option, but now Tesseract has been replaced by their own system, although it’s still supported as a plug-in.

gscan2pdf is a GUI program that seems to be meant to pull many of these tools together, although it doesn’t seem able to break page layouts down into separate text blocks. I’ll have to try it out to judge this properly, though. First, I want to return to the command line.

In my case, I’m scanning the newsletters using a Xerox WorkCentre 7232. This machine has a network scanning feature that creates PDFs by default. It can also create (multi-page) TIFFs, which saves me one conversion step. I’m glad it does, because I have yet to find out how to convert PDF to TIFF with ImageMagick without losing so much image detail that I could no longer blame Tesseract for not producing anything useful.
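For what it’s worth, the detail loss on the PDF route is probably a resolution issue: ImageMagick renders PDFs at 72 dpi unless told otherwise. Below is a minimal sketch of a conversion that asks for a higher resolution. The function name pdf_to_tiff and the 300 dpi default are my own assumptions, not something I tested on these scans.

```shell
#!/bin/sh
# Hypothetical helper: rasterise a PDF into a (multi-page) TIFF for OCR.
# Usage: pdf_to_tiff input.pdf output.tif [dpi]
pdf_to_tiff() {
    dpi=${3:-300}   # assumption: 300 dpi as a starting point for OCR
    # -density has to come BEFORE the input file, so that it applies while
    # the PDF is being rendered instead of to the already-rendered image.
    # -compress none writes the TIFF without any compression scheme.
    convert -density "$dpi" "$1" -compress none "$2"
}
```

Something like pdf_to_tiff newsletter.pdf newsletter.tif 400 would then be the call to experiment with; the file names are placeholders.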

Quite early on, I noticed that Tesseract supports multi-page TIFFs. This is cool. I was less enthusiastic to discover that it doesn’t support the MMR compression used by the Xerox machine (even though I’ve compiled it with the tiff USE flag enabled in Gentoo). Luckily, a simple convert a.tif b.tif seems to produce a b.tif without bothersome compression schemes.
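If there are many scans to convert, the same step can be wrapped in a small loop. This is only a sketch: the function name uncompress_tiffs and the -plain.tif suffix are made up for illustration, and the explicit -compress none is my addition to make the intent unambiguous (the plain convert a.tif b.tif above was enough in my case).

```shell
#!/bin/sh
# Hypothetical helper: rewrite each given TIFF without compression.
# Usage: uncompress_tiffs Document005.TIF Document006.TIF ...
uncompress_tiffs() {
    for f in "$@"; do
        # Strip the extension and append a suffix, e.g. scan.TIF -> scan-plain.tif
        out="${f%.*}-plain.tif"
        # -compress none asks ImageMagick to write uncompressed image data
        convert "$f" -compress none "$out" || return 1
        # Print the name of the file that was written
        echo "$out"
    done
}
```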

Then, all of a sudden, the Xerox started delivering scans with the wrong rotation. I fixed this again with ImageMagick:

convert Document005.TIF -transpose teakwood-info-uncompressed-rotated.tif

Now, surely, I would get some kind of result.

tesseract teakwood-info-uncompressed-rotated.tif result -l nld
head result.txt

]AV.LQHVBEKODWB'A' ];I`OI5ABVfll/|V2'V` D2'.LVI`V\IV2IM(2EF=tI°ö52I EB BEKCOIN ·
 
$UIDp9DKGD°
_ GIDqbLOq¤K;GD SOSI2 AIOGLGD‘ KOSIJDGU‘ I¤IKGU‘ bGLöOIS\2 GD
. bLSCD;Iä DI; GD GL IB AGGI pGISDä2;GIIIDö AOOL qG
AGLMGLKGD° DI; $0JSLIö‘ ID MGqGLISDq äGqLOOäq‘ DON; SIG; GL
ssuqqud pong Agu sou suqsns bjsu;sds sqqu ms gsm ps;
MGQGLIQDQ ;G IS;GD MGDUGD SSD CO2;SLIGSSU2 ;GSKDO¤;' DG GGL2;G
§OSI2 MG H SI GGLqGL wGIqqGD SIJD MG pGSIö Ow qG NSLK; IU
MOLqGD'

Ok. Maybe not… At this point (ignoring all the other side-tracks), I noticed that although Gliv showed the image with the proper rotation, the GIMP showed it upside down when I imported a page from the multi-page TIFF. Then I realized that I was using -transpose just to please Gliv. Gliv simply doesn’t read the endianness of the file right! Instead of fixing my rotation problem, convert -transpose actually made it worse: it mirrors the image along its diagonal rather than rotating it. From Wikipedia: ‘Every TIFF begins with a 2-byte indicator of byte order: “II” for little endian and “MM” for big endian byte ordering.’ What I should have done is a convert -rotate 270.
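Put together, the corrected sequence looks roughly like the sketch below. The function name ocr_rotated is hypothetical, and the 270 degree rotation is specific to how my scans came out of the Xerox; other scans may need a different angle or none at all.

```shell
#!/bin/sh
# Hypothetical helper: rotate a scan, strip compression, run Tesseract.
# Usage: ocr_rotated Document005.TIF result [lang]
ocr_rotated() {
    in=$1
    base=$2
    lang=${3:-nld}   # nld = Dutch, as used for these newsletters
    # -rotate 270 turns the image (unlike -transpose, which mirrors it);
    # -compress none avoids compression schemes Tesseract can't read.
    convert "$in" -rotate 270 -compress none "$base.tif" &&
    # Tesseract appends .txt to the output base name by itself.
    tesseract "$base.tif" "$base" -l "$lang" &&
    # Show the first lines of the recognised text as a quick sanity check.
    head "$base.txt"
}
```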

This was starting to look a lot better, and I hadn’t even removed any borders or headers:

A'
E A K W C) O D M @
Bergum, november 1991
Geachte bosbouwer,
· Een maand eerder dan beloofd sturen we u een nieuwe Teakwood _
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
_ trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
sx nageplant en doen het uitstekend. Deze keer stuur ik u nog
ïïw eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "

(Tesseract processed two pages, by the way, but tried to convince me in its CLI output that it had only processed one.)

Now, I want to see what the program does if I give it a cleaner image, without scanning artifacts. I would like to use unpaper for this, but it’s masked in Portage, so for now I’ll use the GIMP to make a single-page TIFF, cropped from the original image. (When creating the new image in the GIMP, I had to change the image mode to be indexed, 1 bit black and white, and remove the alpha channel.)

Now, I was getting a better result:

Bergum, november 1991
Geachte bosbouwer,
Een maand eerder dan beloofd sturen we u een nieuwe Teakwood
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
nageplant en doen het uitstekend. Deze keer stuur ik u nog
eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "
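The cropping I did by hand in the GIMP could probably be done on the command line as well. The following is an untested sketch under a few assumptions: the function name crop_page is made up, the crop geometry has to be found by trial and error for each scan, and I’m assuming that ImageMagick’s -monochrome and -alpha off options give the same 1-bit, alpha-free result as the GIMP steps described above.

```shell
#!/bin/sh
# Hypothetical helper: cut one text block out of one page of a multi-page TIFF.
# Usage: crop_page input.tif page_index WIDTHxHEIGHT+X+Y output.tif
crop_page() {
    # "file[N]" selects page N (counting from 0) of a multi-page TIFF.
    # -crop cuts out the given region; +repage drops the leftover canvas offset.
    # -alpha off removes the alpha channel; -monochrome reduces the image
    # to black and white; -compress none writes it uncompressed.
    convert "${1}[${2}]" -crop "$3" +repage -alpha off -monochrome \
        -compress none "$4"
}
```

A call would look like crop_page scan.tif 0 1600x2200+100+300 page1.tif, where the geometry numbers are placeholders, not the values for my scan.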

My conclusion is that users of open source OCR software must suffer. I’m not going to clean up this post to make it more useful for people who want to do the same as I did, because you shouldn’t want to do the same. You should simply go out and buy or pirate some proprietary piece of OCR software. Really, you should.

Now, I want a massage; my shoulders are stiff.
