võ hoàng chiêu: http://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy

Thursday, March 31, 2016

http://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy

fix the DPI if needed (300 DPI is the minimum)
fix the text size (e.g. 12 pt should be OK)
try to fix the text lines (deskew and dewarp the text)
try to fix the illumination of the image (e.g. no dark parts of the image)
binarize and de-noise the image
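
The last step in the list, binarization, can be sketched in plain Python. This is only an illustration: it uses Otsu's method, which is one common way to choose a threshold and is not named in the original answer, and the function names `otsu_threshold` and `binarize` are made up for this example. It works on a flat list of 8-bit gray values rather than on a real image file.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a flat list of 8-bit gray values."""
    # Histogram of gray levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map values at or below the threshold to 0 (black), the rest to 255 (white)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy data: dark "ink" pixels around 10, bright "paper" pixels around 200.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
clean = binarize(sample, t)
```

A global threshold like this tends to work only when the illumination is already even, which is why the illumination fix comes before binarization in the list.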

There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can give TEXTCLEANER from Fred's ImageMagick Scripts a try.

If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.

I am by no means an OCR expert, but this week I needed to extract text from a JPG.

I started with a color RGB JPG of 445x747 pixels. I immediately tried tesseract on it, and the program converted almost nothing. I then went into GIMP and did the following:

image > mode > grayscale
image > scale image > 1191x2000 pixels
filters > enhance > unsharp mask, with radius = 6.8, amount = 2.69, threshold = 0

I then saved the result as a new JPG at 100% quality.

Tesseract was then able to extract all the text into a .txt file.

GIMP is your friend.
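
For anyone who would rather script the GIMP steps above, here is a rough sketch in Python. The `scaled_size` helper only reproduces the arithmetic of the scale step (445x747 scaled up to a height of 2000). The `preprocess` function assumes the Pillow library is installed, and its unsharp mask settings are an approximation: GIMP and Pillow do not define radius and amount the same way, so treating amount = 2.69 as `percent=269` is a guess, not an exact equivalent. Both function names are made up for this example.

```python
def scaled_size(width, height, target_height):
    """Return (new_width, new_height) that keeps the aspect ratio."""
    factor = target_height / height
    return round(width * factor), target_height


def preprocess(src_path, dst_path, target_height=2000):
    """Grayscale, upscale, and sharpen an image before OCR (needs Pillow)."""
    # Imported lazily so scaled_size() works without Pillow installed.
    from PIL import Image, ImageFilter

    # image > mode > grayscale
    img = Image.open(src_path).convert("L")

    # image > scale image
    new_size = scaled_size(img.width, img.height, target_height)
    img = img.resize(new_size, Image.LANCZOS)

    # filters > enhance > unsharp mask
    # Assumption: percent=269 stands in for GIMP's amount of 2.69.
    img = img.filter(
        ImageFilter.UnsharpMask(radius=6.8, percent=269, threshold=0)
    )

    # Save at maximum JPEG quality, as in the original steps.
    img.save(dst_path, quality=100)
    return new_size


print(scaled_size(445, 747, 2000))  # prints (1191, 2000)
```

Calling `preprocess("input.jpg", "output.jpg")` with your own file names would write the processed image, which can then be passed to tesseract as before.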
