Hey, this is pretty cool. I actually tried something similar: keeping a list of shop names and matching it against Tesseract's results.
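A rough sketch of that shop-name matching idea, not the original code. It assumes Python's standard `difflib` for the fuzzy comparison; the shop list, the `match_shop` name, and the 0.7 cutoff are all made up for illustration.

```python
import difflib

# Hypothetical list of known shops; a real one would be much longer.
KNOWN_SHOPS = ["Walmart", "Tesco", "Aldi", "Carrefour"]


def match_shop(ocr_text, shops=KNOWN_SHOPS, cutoff=0.7):
    """Return the first known shop that fuzzily matches a line of OCR text.

    Each non-empty line is compared case-insensitively against the shop
    list; the first line with a close enough match wins. Returns None if
    nothing matches.
    """
    lowered = {shop.lower(): shop for shop in shops}
    for line in ocr_text.splitlines():
        candidate = line.strip().lower()
        if not candidate:
            continue
        # get_close_matches ranks by SequenceMatcher ratio, best first.
        hits = difflib.get_close_matches(candidate, list(lowered), n=1, cutoff=cutoff)
        if hits:
            return lowered[hits[0]]
    return None
```

Matching line by line keeps a misread character or two (say `WALMAR7`) from sinking the lookup, but the cutoff would need tuning against real OCR output.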
I was trying a Hough transform to correct slight image rotations. I wasn't aware of ImageMagick's textcleaner script. That could have saved me a lot of trouble :)
I got roadblocked by the problem of having various kinds of receipts with absolutely no layout in common. I figured the system would need a lot of training to reach decent accuracy, so I left it for another day.
Cool. We did a similar thing at a hack, integrated with Dropbox and automatic monthly receipt generation. I think most if not all of the code should still be in pieces on our GitHub accounts.
Argh. I should not comment from the phone. We had similar problems with the layout. In the end we mainly focused on getting the date and the total correct by checking parsable pieces of text.
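Something along these lines, as a sketch rather than the code from the hack. It assumes day-first dates with a four-digit year and a line containing the word "total" followed by an amount with two decimals; the regexes and the `parse_receipt` name are illustrative guesses.

```python
import re
from datetime import datetime

# Assumed formats: DD/MM/YYYY (also with "." or "-") and "total ... 12.34".
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")
TOTAL_RE = re.compile(r"total[^\d\n]*(\d+[.,]\d{2})", re.IGNORECASE)


def parse_receipt(text):
    """Pull (date, total) out of raw OCR text, ignoring the layout.

    Either value is None when no parsable piece of text is found.
    """
    date = None
    m = DATE_RE.search(text)
    if m:
        day, month, year = map(int, m.groups())
        try:
            date = datetime(year, month, day).date()
        except ValueError:
            date = None  # digits looked like a date but were not a valid one

    total = None
    # "Subtotal" also contains "total", so keep the last match, which is
    # usually the grand total near the bottom of the receipt.
    for m in TOTAL_RE.finditer(text):
        total = float(m.group(1).replace(",", "."))
    return date, total
```

Anything that fails to parse simply comes back as `None`, so the receipt can be flagged for manual entry instead of guessing.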