BigSmoke

Smokes your problems, coughs fresh air.


Blog competition

Wiebe’s posting of the two-hundred-and-first post made me think of this image of the front page I made a short while after the new design went live. The image clearly shows that we have a little competition going on: we both seem pretty determined to have our face dominate the home page. 😛

Preventing syntax errors with old shell scripts

I was trying to install Unreal Tournament GOTY on one of my Linux machines. I downloaded and ran the script ut-install-436-GOTY.run but I got this error:

cannot open `+6' for reading: No such file or directory 

This line caused it:

sum1=`tail +6 $0 | cksum | sed -e 's/ /Z/' -e 's/   /Z/' | cut -dZ -f1`

To fix it, I set this environment variable:

export _POSIX2_VERSION=199209

Apparently, this makes programs behave differently. Research is required to find out exactly what it does…
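As far as I can tell, GNU coreutils reads _POSIX2_VERSION to decide which POSIX revision to follow, and 199209 selects the older revision in which the obsolete tail +6 form is still accepted. If you would rather patch the script than change the environment, the same checksum can be written with tail -n +6. A minimal sketch, assuming the payload starts at line 6 as in the installer line above; payload_cksum is a made-up helper name, not something from the installer:

```shell
#!/bin/sh
# payload_cksum FILE FIRST_LINE  (hypothetical helper, not part of the installer)
# Prints the CRC of everything in FILE from line FIRST_LINE onward.
payload_cksum() {
    # "tail -n +N" is the current spelling of the obsolete "tail +N".
    # cksum prints "CRC SIZE" for stdin, so the first field is the CRC.
    tail -n +"$2" "$1" | cksum | cut -d' ' -f1
}

# Usage inside a self-extracting script whose payload starts at line 6:
#   sum1=$(payload_cksum "$0" 6)
```

This form does not depend on the environment variable, so it should behave the same whichever POSIX revision the installed tail follows.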

WordPress pretty pagination plugin

During the recent redesign of my blog, I decided that I wanted to have pretty pagination with numbers instead of the WordPress default Older/Newer Posts links. The plugin I decided to use was WP Page Numbers by Jens Törnell.

This is how the pagination for page one of my home looks now:

WP Page Numbers on page 1

For page six you can see more of what the plugin can do:

WP Page Numbers on page 6

And finally…

WP Page Numbers on page 20

Do I have twenty pages of posts already?

Exemplary web design: Qt

Often, I come across websites that have a beautiful design or even just interesting design elements. Instead of continuing to spread these URLs all over my $HOME and my (del.icio.us) bookmarks, I thought I’d start adding them here. Today, I want to start with an entry from ~/jot/exemplary-web-design.txt: Qt

Qt homepage


I’ve actually programmed in Qt a bit in a dark past. Even though I’m not too fond of it (or C++ in general), it’s a very decent toolkit as far as toolkits go. What’s really great, though, is their website (now at qt.nokia.com).

The website logo with integrated slogan is perfectly clear. At the top right, there’s a nice and clear Google Custom Search, below which there’s a language switcher and a cleverly placed contact link.

Then comes the horizontal navigation bar with the tabs. It clearly shows which section you’re in. The homepage tab has an icon instead of text, which is a nice touch. Also, the Developer Zone tab has a distinct layout with a big icon. I like this; it makes it clear that Nokia (formerly Trolltech) appreciates its developers (insert Ballmer monkey dance here).

The content area starts with a clear h1 heading and a one-sentence introductory paragraph. Then, four of the sections are highlighted again, each with a short summary below of what can be expected in that section.

Below that is another visually distinct area which highlights the latest news items, events and other recently featured items.

Testimonials by two high-profile projects are used to interrupt the flow of information at this point, before Qt in 2 minutes is presented. Qt in 2 minutes is clearly made to quickly point people who are new to Qt to the right information. This takes up 7 headings, and they use JavaScript to show only one subsection at a time, allowing you to switch subsections by clicking the headings in the menu at the left.

At the end of the content area, there’s a subtle reference to the KDE project and a list of the biggest-name customer logos.

The content area is closed by another horizontal navigation bar. This one has a link to the sitemap, an accessibility statement, and to the contact page again. At the right, it also contains a Nokia logo.

Then in almost invisible print (because it isn’t interesting), there’s the copyright statement and a link to the privacy policy.

Crimson Dark, a sci-fi webcomic

For a while now, I’ve been following Crimson Dark, which is a very cool, grimy sci-fi web comic. David Simon, the writer and artist, created quite the cynical bunch of characters. I love it. The story and dialog are a compliment to my intelligence, the artwork is nice (and getting better with each new chapter). Pure procrastinator’s poetry, that’s what it is.

To get a touch of the nice dark space atmosphere, below I placed the comic posted on Monday, the 14th of last December:

Crimson Dark comic for December 14, 2009

You’ll want to read from the beginning though.

jQuery plugin for auto-growing textareas

With my big blog redesign, I wanted an auto-growing comment box. In the past, I’ve written a nice auto-resize textarea JavaScript function which does just that, but with jQuery belonging to the standard equipment of WordPress these days, I thought it would be cooler to find a nice jQuery plugin to do this.

I added the Auto Growing Textareas plugin by Chrys Bader to my theme. In my header.php:

<?php wp_enqueue_script('jqueryautogrow', get_bloginfo('template_directory').'/jquery.autogrow.js', array('jquery'), '1.2.2') ?>
<script type="text/javascript">
$ = jQuery; /* FIXME: Ugly hack */
jQuery(document).ready(function(){
  jQuery('textarea[name=comment]').autogrow();
});
</script>

However, I noticed that the <textarea> shrank below the original number of rows defined in the rows attribute. (My own function used this attribute as the minimum number of rows.)

While looking for documentation on Chrys Bader’s plugin, I noticed that all the links on the plugin page now redirect to crysbader.com. (Sometimes, I really hate these catch-all redirects! :-x) I also found the Auto Growing Textareas Update plugin by daaz, which is the same with a few updates because the former project has not been updated since January 12, 2008, and had some issues that needed to be resolved. Sounds like a good idea to install the update.

Back to the minimum height problem: the plugin’s source file proved a good source of documentation. I learned that it has a minHeight option. I didn’t manage to actually pass that option in JavaScript, though; doing the following didn’t work:

jQuery('textarea[name=comment]').autogrow({minHeight: 8});

Luckily, it defaults to the min-height defined in the element’s CSS, so I could add the following to my stylesheet to stop the auto-shrinkage madness:

#comments textarea
{
  min-height: 8em;
  height: 8em;
}

Saving and loading iptables rules on Debian

For some reason, Debian can’t do “/etc/init.d/iptables save”. So, we have to fix something ourselves. I used this article as source, which also has some useful comments. Apparently, the iptables initscript used to exist…

To save, type:

iptables-save > /etc/iptables.rules

Make /etc/network/if-pre-up.d/iptables:

#!/bin/sh
iptables-restore < /etc/iptables.rules

Don’t forget to make it executable:

chmod +x /etc/network/if-pre-up.d/iptables
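The two manual steps above (create the hook, make it executable) can be wrapped in one small function. This is only a sketch: install_restore_hook and its two parameters are names I made up for illustration, and the real call has to be run as root.

```shell
#!/bin/sh
# install_restore_hook HOOK_DIR RULES_FILE  (hypothetical helper)
# Writes an executable "iptables" hook into HOOK_DIR that restores
# the rules saved in RULES_FILE.
install_restore_hook() {
    hook="$1/iptables"
    rules_file=$2
    # Same two lines as the hand-written hook script above.
    printf '#!/bin/sh\niptables-restore < %s\n' "$rules_file" > "$hook"
    chmod +x "$hook"
}

# Real-world usage (as root), with the paths from this post:
#   iptables-save > /etc/iptables.rules
#   install_restore_hook /etc/network/if-pre-up.d /etc/iptables.rules
```

Taking the directory as a parameter means you can try it on a scratch directory first and look at the generated file before touching /etc.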

Linux OCR with Tesseract

I’m scanning old Flor y Fauna newsletters for my Dutch Hardwood Investment Wiki. I need to do this because most of these newsletters, although produced digitally, are available in the Sicirec archive only in paper form. The only graphical item these newsletters sport is a simple header, so I want to convert the scans to text and put the text in a wiki article for each newsletter; I don’t want to upload dozens of image-heavy PDFs just to show the original (crappy) layout.

The problem, of course, is that I’m on Linux and I don’t know of any good free, open source OCR programs. I don’t know much at all about OCR to be frank. 😕

Anyway, I’ve found this Linux.com article by Mathis Dirksen-Thedens about doing OCR the hardcore way. The downside of his process is that you have to preprocess each image to end up with square, border-less chunks of just text. He recommends Tesseract. The Tesseract project brags that their “engine was one of the top 3 engines in the 1995 UNLV Accuracy test”. Wow, impressive! But, wait, there’s more: “Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available.” They’ve actually made me embarrassed for trying to do this with open source software. 🙁

Tesseract (and similar programs like GOCR and Ocrad) only do line-by-line, word-by-word character recognition, so it’s useful if you have a program that first breaks up a page in graphical elements and simple text blocks.

In that category, OCRopus (Wikipedia) seems very promising, but it’s still in the alpha stage of development. Maybe that’s why it isn’t in Portage yet. Either way, that means I’m not going to try it. Before the 0.4 release, OCRopus supported only Tesseract as a character recognition option, but now Tesseract has been replaced by their own system, although it’s still supported as a plug-in.

gscan2pdf is a GUI program that seems to be meant to pull many of these tools together, although it doesn’t seem as if it can break down page layouts into separate text blocks. I’ll have to try it out to better judge this, though. First, I want to return to the command line.

In my case, I’m scanning the newsletters using a Xerox WorkCentre 7232. This machine has a network scanning feature that creates PDFs by default. It can also create (multi-page) TIFFs, which saves me one conversion step. I’m glad it does, because I have yet to find out how to convert PDF to TIFF with ImageMagick without losing so much image detail that I can no longer blame Tesseract for not producing anything useful.

Quite early on, I noticed that Tesseract supports multi-page TIFFs. This is cool. I was less enthusiastic to discover that it doesn’t support the MMR compression used by the Xerox machine (even though I’ve compiled it with the tiff use-flag enabled in Gentoo). Luckily, a simple convert a.tif b.tif seems to produce a b.tif without bothersome compression schemes.

Then, all of a sudden, the Xerox would start delivering scans with the wrong rotation. I fixed this again with ImageMagick:

convert Document005.TIF -transpose teakwood-info-uncompressed-rotated.tif

Now, surely, I would get some kind of result.

tesseract teakwood-info-uncompressed-rotated.tif result -l nld
head result.txt
]AV.LQHVBEKODWB'A' ];I`OI5ABVfll/|V2'V` D2'.LVI`V\IV2IM(2EF=tI°ö52I EB BEKCOIN ·
 
$UIDp9DKGD°
_ GIDqbLOq¤K;GD SOSI2 AIOGLGD‘ KOSIJDGU‘ I¤IKGU‘ bGLöOIS\2 GD
. bLSCD;Iä DI; GD GL IB AGGI pGISDä2;GIIIDö AOOL qG
AGLMGLKGD° DI; $0­JSLIö‘ ID MGqGLISDq äGqLOOäq‘ DON; SIG; GL
ssuqqud pong ­Agu sou suqsns bjsu;sds­ sqqu ms gsm ps;
MGQGLIQDQ ;G IS;GD MGDUGD SSD CO2;SLIGSSU2 ;GSKDO¤;' DG GGL2;G
§OSI2 MG H SI GGLqGL wGIqqGD SIJD MG pGSIö Ow qG NSLK; IU
MOLqGD'

Ok. Maybe not… At this point (ignoring all the other side-tracks), I noticed that although Gliv showed the image with the proper rotation, the GIMP showed the image upside down when I imported a page from the multi-page TIFF. Then I realized that I was using -transpose just to please Gliv. Gliv simply doesn’t read the endianness of the file right! Instead of fixing my rotation problem, convert -transpose actually made it worse! From Wikipedia: ‘Every TIFF begins with a 2-byte indicator of byte order: “II” for little endian and “MM” for big endian byte ordering.’ What I should have done is a convert -rotate 270.

This was starting to look a lot better, and I hadn’t even removed any borders or headers:

A'
E A K W C) O D M @
Bergum, november 1991
Geachte bosbouwer,
· Een maand eerder dan beloofd sturen we u een nieuwe Teakwood _
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
_ trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
sx nageplant en doen het uitstekend. Deze keer stuur ik u nog
ïïw eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "

(Tesseract processed two pages, by the way, but tried to convince me in its CLI output that it had only processed one.)
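Put together, the steps that ended up working (rewrite the TIFF through convert, rotate by 270 degrees, run Tesseract with the Dutch language data) look roughly like this. It is a sketch, not a polished tool: ocr_scan, run and DRY_RUN are names I invented here, and it assumes that one convert call both drops the troublesome compression and applies the rotation, as the separate calls did for me.

```shell
#!/bin/sh
# run CMD ARGS...  (hypothetical wrapper)
# Executes the command, or only prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_scan INPUT_TIFF OUTPUT_BASE [LANGUAGE]  (hypothetical helper)
# Writes OUTPUT_BASE.tif (rotated copy) and OUTPUT_BASE.txt (recognized text).
ocr_scan() {
    src=$1
    base=$2
    lang=${3:-nld}
    # Rewrite the scanner's TIFF and fix the orientation in one go.
    run convert "$src" -rotate 270 "$base.tif" &&
    # Tesseract appends .txt to the output base name itself.
    run tesseract "$base.tif" "$base" -l "$lang"
}

# Example: DRY_RUN=1; ocr_scan Document005.TIF result
```

With DRY_RUN=1 set, the function only prints the two commands, which is handy for checking file names before letting it loose on a whole stack of scans.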

Now, I want to see what the program does if I give it a cleaner image, without scanning artifacts. I would like to use unpaper for this, but it’s masked in Portage, so for now I’ll use the GIMP to make a single-page TIFF, cropped from the original image. (When creating the new image in the GIMP, I had to change the image mode to be indexed, 1 bit black and white, and remove the alpha channel.)

Now, I was getting a better result:

Bergum, november 1991
Geachte bosbouwer,
Een maand eerder dan beloofd sturen we u een nieuwe Teakwood
Info. Ik was van 14 september tot en met 12 oktober weer op
onze plantages in Costa Rica en heb geconstateerd dat onze
bomen er goed bij staan. Op sommige heuvels blijft de groei
ietsje achter, maar door extra voeding (bemesting) te geven,
trekken we dat bij.
Teakwood IV is inmiddels nageplant. Dat wil zeggen dat we de
stekken die niet wilden aanslaan, hebben vervangen door
nieuwe. Teakwood II en III zijn al voor de tweede keer
nageplant en doen het uitstekend. Deze keer stuur ik u nog
eens een foto van Teakwood I, vanuit hetzelfde standpunt als
de vorige foto van jongstleden juni, bij nummerpaal 1. "

My conclusion is that users of open source OCR software must suffer. I’m not going to clean up this post to make it more useful for people who want to do the same as I did, because you shouldn’t want to do the same. You should simply go out and buy or pirate some proprietary piece of OCR software. Really, you should.

Now, I want a massage; my shoulders are stiff.


© 2024 BigSmoke
