Thứ Năm, 31 tháng 3, 2016

http://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy

up vote33down voteaccepted
  1. fix DPI (if needed) 300 DPI is minimum
  2. fix text size (e.g. 12 pt should be ok)
  3. try to fix text lines (deskew and dewarp text)
  4. try to fix illumination of image (e.g. no dark part of image
  5. binarize and de-noise image
There is no universal command line that would fit to all cases (sometimes you need to blur and sharpen image). But you can give a try to TEXTCLEANER from Fred's ImageMagick Scripts.
If you are not fan of command line, maybe you can try to use opensource scantailor.sourceforge.net or commercial bookrestorer.



I am by no means an OCR expert. But I this week had need to convert text out of a jpg.
I started with a colorized, RGB 445x747 pixel jpg. I immediately tried tesseract on this, and the program converted almost nothing. I then went into GIMP and did the following. image>mode>grayscale image>scale image>1191x2000 pixels filters>enhance>unsharp mask with values of radius = 6.8, amount = 2.69, threshold = 0 I then saved as a new jpg at 100% quality.
Tesseract then was able to extract all the text into a .txt file
Gimp is your friend.

Không có nhận xét nào:

Đăng nhận xét