Optical Character Recognition (OCR) is the process of electronically extracting text from images or from documents such as PDFs, so that the text can be reused in a variety of ways, such as full-text search.
In this blog, we will see how to use 'Python-tesseract' (pytesseract), an OCR tool for Python.
It recognizes and reads the text present in images. It can read common image types such as PNG, JPEG, GIF, TIFF, and BMP, and it is widely used to process scanned documents.
$ sudo pip install pytesseract
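* Requires Python 2.5 or later (recent releases of pytesseract target Python 3).
* Requires the Python Imaging Library (PIL) or its maintained fork, Pillow.
* Requires the Tesseract OCR engine itself to be installed, since pytesseract is a wrapper around it.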
From the shell:
$ ./pytesseract.py test.png
The above command prints the text recognized in the image 'test.png'.
$ ./pytesseract.py -l eng test-english.jpg
The above command recognizes English text; the -l option selects the language.
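In a Python script:
from PIL import Image
from pytesseract import image_to_string
# Print the text recognized in 'test.png' using the default language
print(image_to_string(Image.open('test.png')))
# Print the text recognized in 'test-english.jpg', treating it as English
print(image_to_string(Image.open('test-english.jpg'), lang='eng'))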
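When several images need to be processed, the same call can be wrapped in a small helper. The following is only a minimal sketch: the names ocr_files and the ocr parameter are hypothetical and not part of pytesseract, and it assumes that image_to_string accepts a file path as well as an opened image, as recent pytesseract releases do.

```python
def ocr_files(paths, lang=None, ocr=None):
    """Return a dict mapping each image path to its recognized text.

    `ocr_files` and the `ocr` parameter are illustrative names only.
    `ocr` is any callable with the same shape as
    pytesseract.image_to_string; it defaults to that function.
    """
    if ocr is None:
        # Imported lazily so the helper can be loaded (and tested with a
        # stand-in callable) on a machine without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string

    # Only pass `lang` when the caller asked for a specific language.
    kwargs = {"lang": lang} if lang else {}

    results = {}
    for path in paths:
        # Assumption: the OCR callable accepts a file path directly.
        results[path] = ocr(path, **kwargs).strip()
    return results
```

With pytesseract and Tesseract installed, calling ocr_files(['test.png', 'test-english.jpg'], lang='eng') would run both example images through the OCR engine. Passing a different callable as ocr makes it possible to exercise the helper without the engine.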