Monday, February 12, 2007

OCR

A few days ago I played with several free OCR (Optical Character Recognition) programs. I would like to scan all my snail mail and OCR it, so that I could search my snail mail the way I search my e-mail. Everywhere I looked on the net, people said that Tesseract is the best one, but I only got crap out of version 1.03. The result was better with gocr, but still not good. Then I found out that Tesseract 1.03 has some problems when it is compiled in certain ways. Yesterday I downloaded 1.02 and it worked much better. Unfortunately, it does not support non-English characters such as the Swedish å, ä, and ö, which I need. If, for instance, å always became the same character, I could hide the problem within the search engine, but that is not the case.
Gocr is already included in Ubuntu and Tesseract will be included in Feisty.
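
To make the "hide the problem within the search engine" idea concrete, here is a minimal sketch in Python. It assumes the OCR engine always turns each Swedish letter into the same plain letter, which is exactly what did not hold for Tesseract 1.02, so treat it as an illustration only. The mapping and the names `FOLD`, `normalize`, and `search` are made up for this example.

```python
import unicodedata

# Hypothetical mapping: assumes the OCR output ALWAYS replaces each
# Swedish letter with the same plain letter (an assumption, not what
# Tesseract 1.02 actually does).
FOLD = str.maketrans({"å": "a", "ä": "a", "ö": "o"})


def normalize(text: str) -> str:
    """Fold text to a form where å/ä/ö and their assumed OCR output compare equal."""
    # NFC first, so that "a" + combining ring is treated as a single "å".
    composed = unicodedata.normalize("NFC", text)
    return composed.lower().translate(FOLD)


def search(query: str, documents: dict[str, str]) -> list[str]:
    """Return the names of documents whose folded text contains the folded query."""
    needle = normalize(query)
    return [name for name, body in documents.items() if needle in normalize(body)]


# Example: the first text imitates OCR output with the Swedish letters lost.
scanned = {
    "letter1.txt": "Hyran for mars ar betald",
    "letter2.txt": "Tack för senast",
}
print(search("för", scanned))
```

Folding both the query and the scanned text means a search for "för" also matches "for". The price is that genuinely different words can collide, and it only works if the misrecognition is consistent.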

1 comment:

Anonymous said...

Or HOCR - Just in case you need to scan Hebrew...
See the video here:
Hebrew optical character recognition updated