What is OCR?

Optical Character Recognition (OCR) is the process of electronically extracting text from images or from documents such as PDFs, so that the text can be reused in a variety of ways, for example in full-text search.

In this blog post, we will see how to use 'Python-tesseract' (pytesseract), an OCR tool for Python.

pytesseract:

pytesseract is a Python wrapper around the Tesseract OCR engine. It recognizes and reads the text present in images. It can read the image types supported by the Python Imaging Library, including png, jpeg, gif, tiff and bmp. It is widely used to process scanned documents.

Installation:

$ sudo pip install pytesseract

Requirements:

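* Requires Python. The original release supported Python 2.5 or later; current releases require Python 3.
* Requires the Python Imaging Library (PIL), today available as the Pillow package.
* Requires the Tesseract OCR engine itself to be installed, since pytesseract only calls it.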
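Because pytesseract calls the Tesseract executable rather than doing the recognition itself, it can help to check that the executable is reachable before running any OCR. Below is a minimal sketch of such a check using only the standard library. The helper names find_tesseract and describe_tesseract are made up for this example, and the sketch assumes the executable is called 'tesseract', which may differ on your system.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(command)


def describe_tesseract(command="tesseract"):
    """Return a short human-readable message about whether Tesseract was found."""
    path = find_tesseract(command)
    if path is None:
        return "Tesseract not found on PATH; install it before using pytesseract."
    return "Tesseract found at " + path


print(describe_tesseract())
```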

Usage:

From the shell:

$ ./pytesseract.py test.png 

The above command prints the text recognized in the image 'test.png'.

$ ./pytesseract.py -l eng test-english.jpg

The above command recognizes English text in 'test-english.jpg'. The -l option selects the language.

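In a Python script:

from PIL import Image
from pytesseract import image_to_string

# Print the text recognized in the image.
print(image_to_string(Image.open('test.png')))
# Same call, but tell Tesseract that the text is English.
print(image_to_string(Image.open('test-english.jpg'), lang='eng'))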
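Once text can be pulled out of an image, the full-text search use case mentioned at the start becomes straightforward. The following is a rough sketch, not a definitive implementation: the helpers ocr_image and search_images are hypothetical names introduced here, and the search is a plain case-insensitive substring match. The OCR function is passed in as a parameter so the search logic can be tried out without Tesseract installed.

```python
def ocr_image(path):
    """Return the text recognized in one image file, using pytesseract."""
    # Imported here so that search_images can also be used with another OCR function.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def search_images(paths, keyword, ocr=ocr_image):
    """Return the paths whose recognized text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    for path in paths:
        text = ocr(path)  # run OCR on each image in turn
        if needle in text.lower():
            matches.append(path)
    return matches
```

For example, search_images(['test.png', 'test-english.jpg'], 'hello') would return the subset of those two files in which the word 'hello' was recognized.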
