What is OCR?
Optical Character Recognition (OCR) is the process of electronically extracting text from images or from documents such as PDFs, so that the text can be reused in a variety of ways, for example in full-text search.
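To make the full-text search use case concrete, here is a minimal sketch of a word index built over text that has already been extracted. The function names (build_index, search) and the sample strings are made up for illustration; the strings simply stand in for whatever an OCR tool would return.

```python
import re


def build_index(documents):
    """Map each lowercase word to the set of document names that contain it."""
    index = {}
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search(index, query):
    """Return the names of the documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Hypothetical OCR results, keyed by the image they were read from.
extracted = {
    "invoice.png": "Total amount due: 42 USD",
    "letter.png": "Dear customer, your amount is overdue",
}
index = build_index(extracted)
print(sorted(search(index, "amount")))
print(sorted(search(index, "amount due")))
```

The index only matches whole words, so a query for "due" does not match "overdue". A real search setup would likely also need to cope with OCR misreads, which this sketch ignores.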
In this post, we will see how to use 'Python-tesseract' (pytesseract), an OCR tool for Python.
It recognizes and reads the text present in images. It can read the image types supported by the Python Imaging Library, including png, jpeg, gif, tiff and bmp. It is widely used to process scanned documents.
Installation:
$ sudo pip install pytesseract
* Requires Python 2.5 or later.
* Requires the Python Imaging Library (PIL).
* Requires the Tesseract OCR engine itself, which pytesseract calls to do the recognition.
From the shell:
$ ./pytesseract.py test.png
The above command prints the text recognized in the image 'test.png'.
$ ./pytesseract.py -l eng test-english.jpg
The above command recognizes English text; the -l option selects the language.
In a Python script:
from PIL import Image
from pytesseract import image_to_string

# Print the text recognized in test.png
print(image_to_string(Image.open('test.png')))
# Recognize English text explicitly
print(image_to_string(Image.open('test-english.jpg'), lang='eng'))
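When there are several images to process, the same call can be wrapped in a small helper. The sketch below is one possible way to do it, not part of pytesseract: the name ocr_many and its recognize parameter are assumptions made for this example. By default it uses image_to_string as shown above; passing your own recognize function lets you try the helper without Tesseract installed.

```python
def ocr_many(paths, lang=None, recognize=None):
    """Return (texts, errors): recognized text and error messages, keyed by path."""
    if recognize is None:
        # Imported lazily so the helper can be exercised without Tesseract installed.
        from PIL import Image
        from pytesseract import image_to_string

        def recognize(path, lang):
            with Image.open(path) as image:
                return image_to_string(image, lang=lang)

    texts, errors = {}, {}
    for path in paths:
        try:
            texts[path] = recognize(path, lang).strip()
        except Exception as exc:  # e.g. a missing file or a Tesseract failure
            errors[path] = str(exc)
    return texts, errors


def fake_recognize(path, lang):
    """Hypothetical stand-in for the real OCR call, used only for this demo."""
    if path == "bad.png":
        raise IOError("cannot open")
    return " text from %s (%s) \n" % (path, lang)


texts, errors = ocr_many(["a.png", "bad.png"], lang="eng", recognize=fake_recognize)
print(texts)
print(errors)
```

With the real libraries installed, calling ocr_many(['test.png', 'test-english.jpg'], lang='eng') would run OCR on both files and collect any per-file failures instead of stopping at the first one.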