Smokes your problems, coughs fresh air.

Tag: Linux (Page 2 of 4)

Linux OCR with Tesseract

I’m scanning old Flor y Fauna newsletters for my Dutch Hardwood Investment Wiki. I need to do this because most of these newsletters, although produced digitally, are available in the Sicirec archive only in paper form. The only graphical item these newsletters sport is a simple header, so I want to convert the scans to text and put the text in a wiki article for each newsletter; I don’t want to upload dozens of image-heavy PDFs just to show the original (crappy) layout.

The problem, of course, is that I’m on Linux and I don’t know of any good free, open source OCR programs. I don’t know much at all about OCR to be frank. 😕

Anyway, I’ve found this Linux.com article by Mathis Dirksen-Thedens about doing OCR the hardcore way. The downside of his process is that you have to preprocess each image to end up with square, border-less chunks of just text. He recommends Tesseract. The Tesseract project brags that their “engine was one of the top 3 engines in the 1995 UNLV Accuracy test”. Wow, impressive! But, wait, there’s more: “Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available.” They’ve actually made me embarrassed for trying to do this with open source software. 🙁

Tesseract (and similar programs like GOCR and Ocrad) only do line-by-line, word-by-word character recognition, so it helps to have a program that first breaks up a page into graphical elements and simple text blocks.

In that category, OCRopus (Wikipedia) seems very promising, but it’s still in the alpha stage of development. Maybe that’s why it isn’t in Portage yet. Either way, that means I’m not going to try it. Before the 0.4 release, OCRopus supported only Tesseract as a character recognition option, but now Tesseract has been replaced by their own system, although it’s still supported as a plug-in.

gscan2pdf is a GUI program that seems to be meant to pull many of these tools together, although it doesn’t seem as if it can break down page layouts into separate text blocks. I’ll have to try it out to better judge this, though. First, I want to return to the command line.

In my case, I’m scanning the newsletters using a Xerox WorkCentre 7232. This machine has a network scanning feature that creates PDFs by default. It can also create (multi-page) TIFFs, which saves me one conversion step, and I’m glad it does because I have yet to find out how to convert PDF to TIFF with ImageMagick without losing so much image detail that I can no longer blame Tesseract for not producing anything useful.

Quite early on, I noticed that Tesseract supports multi-page TIFFs. This is cool. I was less enthusiastic to discover that it doesn’t support the MMR compression used by the Xerox machine (even though I’ve compiled it with the tiff USE flag enabled in Gentoo). Luckily, a simple convert a.tif b.tif seems to produce a b.tif without bothersome compression schemes.

Then, all of a sudden, the Xerox would start delivering scans with the wrong rotation. I fixed this again with ImageMagick:

convert Document005.TIF -transpose teakwood-info-uncompressed-rotated.tif

Now, surely, I would get some kind of result.

tesseract teakwood-info-uncompressed-rotated.tif result -l nld
head result.txt
]AV.LQHVBEKODWB'A' ];I`OI5ABVfll/|V2'V` D2'.LVI`V\IV2IM(2EF=tI°ö52I EB BEKCOIN ·
 
$UIDp9DKGD°
_ GIDqbLOq¤K;GD SOSI2 AIOGLGD‘ KOSIJDGU‘ I¤IKGU‘ bGLöOIS\2 GD
. bLSCD;Iä DI; GD GL IB AGGI pGISDä2;GIIIDö AOOL qG
AGLMGLKGD° DI; $0­JSLIö‘ ID MGqGLISDq äGqLOOäq‘ DON; SIG; GL
ssuqqud pong ­Agu sou suqsns bjsu;sds­ sqqu ms gsm ps;
MGQGLIQDQ ;G IS;GD MGDUGD SSD CO2;SLIGSSU2 ;GSKDO¤;' DG GGL2;G
§OSI2 MG H SI GGLqGL wGIqqGD SIJD MG pGSIö Ow qG NSLK; IU
MOLqGD'

Ok. Maybe not… At this point (ignoring all the other side-tracks), I noticed that although Gliv showed the image with the proper rotation, the GIMP would show the image upside down when importing a page from the multi-page TIFF. Then I realized that I was using -transpose just to please Gliv. Gliv simply doesn’t read the endian-ness of the file right! Instead of fixing my rotation problem, convert -transpose actually made it worse! From Wikipedia: ‘Every TIFF begins with a 2-byte indicator of byte order: “II” for little endian and “MM” for big endian byte ordering.’ What I should have done is a convert -rotate 270.
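As a rough sketch, the steps that ended up working could be wrapped in a small shell function. The names ocr_scan and DRY_RUN are made up for this illustration, and the sketch assumes that ImageMagick's convert and tesseract (with the Dutch nld language data) are installed. Whether the rewritten TIFF ends up without the troublesome compression is taken from the observation above, not guaranteed.

```shell
#!/bin/sh
# Sketch only: ocr_scan and DRY_RUN are illustrative names, not part of any tool.
# Usage: ocr_scan INPUT.tif OUTPUT_BASE [LANG]
ocr_scan() {
  src="$1"
  base="$2"
  lang="${3:-nld}"          # Dutch by default, as in the post
  run="${DRY_RUN:+echo}"    # with DRY_RUN set, print the commands instead of running them

  # Rewrite the scan with ImageMagick, rotating it 270 degrees on the way.
  $run convert "$src" -rotate 270 "$base.tif" || return 1
  # Tesseract writes its result to OUTPUT_BASE.txt.
  $run tesseract "$base.tif" "$base" -l "$lang"
}
```

Setting DRY_RUN=1 before calling ocr_scan Document005.TIF teakwood-info only prints the two commands, which is handy for checking them before a long run.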

This was starting to look a lot better, and I hadn’t even removed any borders or headers:

A'
E A K W C) O D M @
Bergum, november 1991
Geachte bosbouwer,
· Een maand eerder dan beloofd sturen we u een nieuwe Teakwood _
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
_ trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
sx nageplant en doen het uitstekend. Deze keer stuur ik u nog
ïïw eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "

(Tesseract processed two pages, by the way, but tried to convince me in its CLI output that it had only processed one.)

Now, I want to see what the program does if I give it a cleaner image, without scanning artifacts. I would like to use unpaper for this, but it’s masked in Portage, so for now I’ll use the GIMP to make a single-page TIFF, cropped from the original image. (When creating the new image in the GIMP, I had to change the image mode to be indexed, 1 bit black and white, and remove the alpha channel.)
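The GIMP steps can probably be done from the command line as well. The following is a sketch with ImageMagick, not what was actually used for this post: the function name crop_bilevel, the DRY_RUN switch, the page index and the crop geometry are all hypothetical and would have to be adjusted per scan.

```shell
#!/bin/sh
# Sketch only: crop_bilevel and DRY_RUN are illustrative names.
# Usage: crop_bilevel INPUT.tif PAGE GEOMETRY OUTPUT.tif
crop_bilevel() {
  src="$1"
  page="$2"
  geometry="$3"
  dest="$4"
  run="${DRY_RUN:+echo}"    # with DRY_RUN set, only print the command

  # [PAGE] selects one page of a multi-page TIFF, -crop cuts out the text
  # block, +repage resets the canvas, -alpha off drops the alpha channel,
  # and -colorspace Gray -threshold 50% leaves only black and white pixels.
  $run convert "${src}[${page}]" -crop "$geometry" +repage \
    -alpha off -colorspace Gray -threshold 50% "$dest"
}
```

For example, crop_bilevel scan.tif 0 1600x2000+100+300 page1.tif would cut a 1600 by 2000 pixel block out of the first page, starting 100 pixels from the left and 300 from the top.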

Now, I was getting a better result:

Bergum, november 1991
Geachte bosbouwer,
Een maand eerder dan beloofd sturen we u een nieuwe Teakwood
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
nageplant en doen het uitstekend. Deze keer stuur ik u nog
eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "

My conclusion is that users of open source OCR software must suffer. I’m not going to clean up this post to make it more useful for people who want to do the same as I did, because you shouldn’t want to do the same. You should simply go out and buy or pirate some proprietary piece of OCR software. Really, you should.

Now, I want a massage; my shoulders are stiff.

Matriux, a penetration testing and security analysis LiveCD

Last December, someone pointed me to Matriux. In their own words:

It is a fully featured security distribution consisting of a bunch of powerful, open source and free tools that can be used for various purposes including, but not limited to, penetration testing, ethical hacking, system and network administration, cyber forensics investigations, security testing, vulnerability analysis, and much more. It is a distribution designed for security enthusiasts and professionals, although it can be used normally as your default desktop system.

It comes with a wide arsenal of free software tools to do naughty things to your network. I think I should give it a swing and download it some time.

HP LaserJet 6P under Ubuntu

Because Arnold Pilon is migrating his workplace to Apple, I could get his old PC and peripherals for free. Among its peripherals was an old HP LaserJet 6P, still perfectly working.

My sister didn’t have a printer yet. I was surprised that installing it on her Ubuntu machine was simply a matter of selecting the printer type from a list. I wonder: is this thanks to CUPS? Can I expect this to work in all distros that include CUPS these days?

Anyway, the printer works, and so does the scanner (whose type I forgot to jot down). The scanner was supported by XSane without requiring any configuration. When it comes to hardware configuration, open source operating systems often beat those from Redmond.

My custom Linux environment

On every machine that I install, I need a custom environment. At the very least, I need screen and bash customizations. I will attempt to keep this blog post up-to-date with my most recent config.

/etc/bash.bashrc_halfgaar (naming scheme depends on distro):

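# Rebuild PS1 before every prompt so it reflects background jobs and root status.
prompt_command() {
  # Set the xterm window title; wrapped in \[ \] because it takes up no room on the line.
  XTERM_TITLE="\[\e]2;\u@\H:\w\a\]"

  # Show the number of background jobs, but only when there are any.
  BGJOBS_COLOR="\[\e[1;30m\]"
  BGJOBS=""
  if [ "$(jobs | head -c1)" ]; then BGJOBS=" $BGJOBS_COLOR(bg:\j)"; fi

  # Green prompt sign for normal users, red for root.
  DOLLAR_COLOR="\[\e[1;32m\]"
  if [[ ${EUID} == 0 ]]; then DOLLAR_COLOR="\[\e[1;31m\]"; fi
  DOLLAR="$DOLLAR_COLOR\\\$"

  # Root's user name additionally gets a red background.
  USER_COLOR="\[\e[1;32m\]"
  if [[ ${EUID} == 0 ]]; then USER_COLOR="\[\e[41;1;32m\]"; fi

  PS1="$XTERM_TITLE$USER_COLOR\u\[\e[1;32m\]@\H:\[\e[m\] \[\e[1;34m\]\w\[\e[m\]\n\
$DOLLAR$BGJOBS \[\e[m\]"
}
PROMPT_COMMAND=prompt_command

export EDITOR=vim

alias ls='ls --color=auto'
alias ll='ls -l'
alias lh='ls -lh'
alias grep='grep --color=auto'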

Don’t forget to source the file from ~/.bashrc.
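A minimal sketch of what that could look like in ~/.bashrc, guarding against the file being absent. The helper name source_if_readable is made up for this example; the path is the one used above and will differ per distro.

```shell
# Source an extra config file, but only if it exists and is readable.
source_if_readable() {
  if [ -r "$1" ]; then
    . "$1"
  fi
}

source_if_readable /etc/bash.bashrc_halfgaar
```

The guard keeps a freshly installed machine, where the custom file has not been copied over yet, from printing an error at every login.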

~/.screenrc:

caption always "%{= kB}%-Lw%{=s kB}%50>%n%f* %t %{-}%+Lw%<"
vbell off
startup_message off
term linux

Dumping a bunch of MP3s as WAV files

I just used the following command for converting a directory with a bunch of MP3s to WAV. Does someone know a command that is shorter? I find mine a bit convoluted to say the least.

mkdir wav
ls *mp3|while read i; do mpg123 --stdout "$i" > wav/`echo $i|sed -r -e 's/^([0-9]+) .*$/\1/'`.wav; done

The directory looks like this:

$ ls -1
01 - Meant to Be.mp3
02 - Reflections.mp3
03 - Semester Days.mp3
04 - Friend.mp3
05 - True Gemini.mp3
06 - Down the Road.mp3
07 - Tulip Trees.mp3
08 - Not Alone.mp3
09 - Woods of Chaos.mp3
10 - Twilight.mp3
[en] Readme - www.jamendo.com .txt
[es] Lee me - www.jamendo.com .txt
[fr] Lisez moi - www.jamendo.com .txt
[it] Readme - www.jamendo.com .txt
License.txt
Rob Costlow - Solo Piano - Woods of Chaos.1.0.jpg
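
One possible shorter variant, offered as a sketch and not as the definitive answer: let the shell glob the files and cut off the track number with parameter expansion instead of ls and sed. It assumes file names that start with the track number followed by a space, as in the listing above. It also uses mpg123's -w option, which as far as I know writes a proper WAV file, whereas --stdout emits raw, headerless audio by default. The function names wav_name and mp3s_to_wav are made up for this example.

```shell
#!/bin/sh
# Map "01 - Meant to Be.mp3" to "wav/01.wav": everything from the first space on is dropped.
wav_name() {
  echo "wav/${1%% *}.wav"
}

# Convert every MP3 in the current directory; -w makes mpg123 write a WAV file.
mp3s_to_wav() {
  mkdir -p wav
  for f in *.mp3; do
    [ -e "$f" ] || continue   # no MP3s present: the glob stays literal, so skip it
    mpg123 -w "$(wav_name "$f")" "$f"
  done
}
```

Calling mp3s_to_wav from within the album directory then does the whole conversion.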

XTerm configuration

I just created a gist for my XTerm configuration (separated from the rest of my X resources). Here’s a snapshot of the current version:

XTerm*background: black
XTerm*Foreground: Grey

XTerm*faceName: Liberation Mono
XTerm*faceSize: 10

XTerm*on2Clicks: regex [^  \n]+

XTerm*bellIsUrgent: true

! Make the terminal 127 by 44 characters in size
XTerm*geometry: 127x44+64+0

! By default, XTerm composes special chars with META. With this setting I can work my readline magic instead.
XTerm*metaSendsEscape: true

! Bracketed paste mode requires the allowWindowOps resource to be true 
XTerm*allowWindowOps: true

XTerm*saveLines: 1000

! Don't jump to the bottom when there's output
XTerm*VT100*scrollTtyOutput: false

XTerm*VT100.Translations: #override \
    Shift<Key>Insert: insert-selection(CLIPBOARD) \n\
    <Key>Insert: insert-selection(PRIMARY) \n\
    Shift: insert-selection(CLIPBOARD) \n\
    Shift<Key>Up: scroll-back(1) \n\
    Shift<Key>Down: scroll-forw(1)


! vim: set syntax=xdefaults expandtab tabstop=4 shiftwidth=4:

winepath

I have all these little scripts in my $HOME/bin directory to ease the execution of Windows programs. For instance, I don’t like to type the full path to Excel Viewer every time when I need to view an Excel sheet in Linux:

#!/bin/sh
"/home/bigsmoke/.wine/drive_c/Program Files/Microsoft Office/OFFICE11/XLVIEW.EXE" "$*"

This works pretty well, except when trying to view a document that’s not in the current directory:

xlview "/tmp/Some Excel sheet sent to me by someone.xls"

To make that work, you need to use a handy utility that comes with Wine, winepath. With winepath, I can modify the script to work (the example is for wordview):

#!/bin/bash
path=`winepath --windows "$*"`
"/home/bigsmoke/.wine/drive_c/Program Files/Microsoft Office/OFFICE11/WORDVIEW.EXE" "$path"

This, together with Linux’s binfmt_misc, makes executing Windows programs on Linux a breeze.
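
The two scripts above differ only in the viewer they start, so they could share one helper. What follows is a sketch with made-up names (to_windows_path, run_viewer, WINEPATH_CMD). It converts each argument separately, so several documents can be passed at once, which "$*" does not allow because it joins all arguments into a single string.

```shell
#!/bin/sh
# Sketch only: the function names and WINEPATH_CMD are illustrative.
WINEPATH_CMD="${WINEPATH_CMD:-winepath}"

# Translate one Unix path into the Windows path that Wine programs expect.
to_windows_path() {
  "$WINEPATH_CMD" --windows "$1"
}

# Usage: run_viewer VIEWER.EXE FILE...
# Starts the viewer once per file, passing the converted path.
run_viewer() {
  exe="$1"
  shift
  for f in "$@"; do
    "$exe" "$(to_windows_path "$f")"
  done
}
```

Like the scripts above, run_viewer starts the .EXE directly, so it relies on binfmt_misc being set up for Windows executables.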


© 2024 BigSmoke
