Extract Text from Images with OCR in Python Using pytesseract

April 22, 2025 · Updated June 9, 2026
Extract text from images in Python with pytesseract

If you need to pull text out of a PNG, JPG, TIFF, or scanned page in Python, the fastest path is pytesseract, a thin wrapper around the open-source Tesseract OCR engine (originally developed at HP and later sponsored by Google). In a few lines you can turn an image into a string; with a little OpenCV preprocessing you can push accuracy from "mostly readable" to "clean enough for a database".

This tutorial is current for Python 3.12+, Tesseract 5, pytesseract, and Pillow / OpenCV. By the end you'll be able to:

  • Install Tesseract and the Python bindings on Linux, macOS, and Windows.
  • Run the minimal image_to_string example and read multiple image formats.
  • Preprocess images (grayscale, threshold, denoise, deskew, upscale) for better accuracy.
  • Tune page-segmentation (--psm) and engine (--oem) modes.
  • Use language packs and extract structured data with bounding boxes and confidence scores.
  • Add an OCR endpoint to a Django app and know when to reach for cloud OCR instead.

Prerequisites: Python 3.12+, pip, and the ability to install a system package (Tesseract is a separate binary, not a pip-only library).

How OCR with Tesseract actually works

Optical Character Recognition (OCR) is the process of electronically extracting machine-readable text from images, scanned documents, and PDFs so the text can be searched, indexed, copied, or processed.

pytesseract does not do the recognition itself — it shells out to the Tesseract command-line engine and parses the result. So you always need two things installed: the Tesseract binary (the engine + trained language data) and the pytesseract Python package (the wrapper). Get the binary right first; most "OCR not working" problems are a missing or unconfigured Tesseract install.

Step 1: Install Tesseract (the engine)

Install the system binary first. Tesseract 5 is the current major version.

Ubuntu / Debian (apt):

# Engine + English language data
sudo apt update
sudo apt install -y tesseract-ocr

# Optional: extra language packs, e.g. French + German
sudo apt install -y tesseract-ocr-fra tesseract-ocr-deu

macOS (Homebrew):

brew install tesseract
# All language packs (large download):
brew install tesseract-lang

Windows: download the installer from the UB Mannheim Tesseract builds and run it. Either add the install folder (e.g. C:\Program Files\Tesseract-OCR) to your PATH, or point pytesseract at the binary explicitly in your code:

import pytesseract

# Windows only: tell pytesseract where the engine lives
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

Verify the engine and see which language packs are installed:

tesseract --version        # should report 5.x
tesseract --list-langs     # installed language data (eng, osd, ...)
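
The same check can be scripted, which is handy in a Docker build or CI job. The following is a minimal sketch using only the standard library; the helper name tesseract_version is ours (hypothetical) and is not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the banner to stderr rather than stdout, so accept either
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```

A None result means the binary is missing or not on PATH, which is the situation the tesseract_cmd override above is meant to fix on Windows.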

Step 2: Install the Python packages

Now install pytesseract plus the imaging libraries. Pillow handles basic image I/O; OpenCV does the heavy preprocessing.

Note: install Tesseract (the package above) and pytesseract (the package below) separately — pytesseract is just the wrapper. Pillow ships as pillow but is imported as PIL.

pip install pytesseract pillow opencv-python

Step 3: The minimal example

With the engine and packages installed, OCR is two lines. Pass a PIL.Image (or a path / NumPy array) to image_to_string:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("test.png"))
print(text)

# You can also pass a file path directly
print(pytesseract.image_to_string("receipt.jpg"))

That's the whole "hello world". On clean, high-resolution, high-contrast images you'll get excellent results immediately. On photos, low-DPI scans, or noisy backgrounds, accuracy drops — which is what the preprocessing section fixes.

From the shell you can do the same thing without writing Python:

# Print recognised text to stdout ("stdout" is the output target, not a file)
tesseract test.png stdout

# Force English and write to result.txt (Tesseract appends .txt)
tesseract test-english.jpg result -l eng

Handling multiple image formats (PNG, JPG, TIFF, BMP, PDF)

Tesseract reads the common raster formats out of the box: PNG, JPG/JPEG, TIFF (including multi-page), BMP, GIF, and WebP. Because pytesseract accepts a Pillow image, anything Pillow can open, Tesseract can OCR:

from PIL import Image
import pytesseract

for path in ["scan.png", "photo.jpg", "fax.tiff", "logo.bmp"]:
    text = pytesseract.image_to_string(Image.open(path))
    print(f"=== {path} ===")
    print(text)

# Multi-page TIFF: iterate frames
img = Image.open("multipage.tiff")
for i in range(getattr(img, "n_frames", 1)):
    img.seek(i)
    print(f"--- page {i + 1} ---")
    print(pytesseract.image_to_string(img))

PDFs are different. Tesseract does not read PDF files directly. Convert each page to an image first with pdf2image, which relies on the Poppler utilities (sudo apt install poppler-utils, or brew install poppler):

# pip install pdf2image  (and install Poppler on the system)
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("document.pdf", dpi=300)  # 300 DPI is a good OCR baseline
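full_text = []
for page_number, page_image in enumerate(pages, start=1):
    page_text = pytesseract.image_to_string(page_image)
    # Label each page so the joined output can be traced back to the PDF
    full_text.append(f"--- page {page_number} ---\n{page_text}")

print("\n\n".join(full_text))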

If your goal is broader document extraction (Word, Excel, native PDFs with a real text layer), OCR is often the wrong tool — see our companion guide on extracting data from PDF and Microsoft Office files in Python. Use OCR only when the text exists purely as pixels (scans, photos, image-only PDFs).

Preprocessing for accuracy with OpenCV

Tesseract's accuracy depends almost entirely on input quality. Feeding it a clean, high-contrast, properly oriented image matters far more than any config flag. The standard preprocessing pipeline — grayscale, scale up, threshold, denoise, deskew — is where most real-world accuracy gains come from.

Grayscale + Otsu thresholding (binarisation) is the single highest-impact step:

import cv2
import pytesseract

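image = cv2.imread("noisy_scan.jpg")
if image is None:  # imread returns None instead of raising on an unreadable path
    raise FileNotFoundError("noisy_scan.jpg could not be read")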

# 1. Grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 2. Upscale small text (Tesseract likes ~300 DPI / capital letters ~30px tall)
gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# 3. Denoise
gray = cv2.medianBlur(gray, 3)

# 4. Binarise with Otsu's automatic threshold
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(thresh)
print(text)

Deskewing straightens tilted scans, which dramatically improves line detection. Estimate the skew angle from the text pixels and rotate:

import cv2
import numpy as np

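def deskew(gray):
    # Binarise with text as white-on-black so only ink pixels are measured
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    # (row, col) pairs of every ink pixel, as float32 for minAreaRect
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    if coords.size == 0:
        return gray  # blank page: nothing to measure

    # minAreaRect reports [-90, 0) before OpenCV 4.5.1 and (0, 90] after;
    # fold either range into [-45, 45] before turning it into a correction
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    angle = -angle

    (h, w) = gray.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(
        gray, matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE,
    )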

Reach for the right step based on the symptom you're seeing:

Symptom | Preprocessing fix | OpenCV / approach
Low contrast, gray background | Grayscale + threshold | cvtColor + threshold (Otsu)
Tiny or low-DPI text | Upscale to ~300 DPI | cv2.resize(..., INTER_CUBIC)
Speckles / scanner noise | Denoise | medianBlur, fastNlMeansDenoising
Tilted / rotated page | Deskew | minAreaRect + warpAffine
Uneven lighting / shadows | Adaptive threshold | adaptiveThreshold
Faint or broken strokes | Morphology | dilate / erode

Don't over-process: too aggressive a blur or threshold erases thin strokes and lowers accuracy. Tune on a few representative samples and compare output.

Language packs with lang=

By default Tesseract uses English (eng). Install the relevant tesseract-ocr-<lang> package, then pass lang=. You can combine languages with + for mixed-language documents:

# Single language
pytesseract.image_to_string(img, lang="fra")

# Mixed English + French + German
pytesseract.image_to_string(img, lang="eng+fra+deu")

# List what's available from Python
print(pytesseract.get_languages(config=""))
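
A missing language pack tends to surface only when recognition runs, so it can help to compare the requested codes against the installed list first. This is a small sketch; missing_languages is a hypothetical helper, and the installed list is hard-coded here where real code would use the result of get_languages.

```python
def missing_languages(lang, installed):
    """Return the codes in an 'eng+fra' style string that are absent from `installed`."""
    requested = [code for code in lang.split("+") if code]
    available = set(installed)
    return [code for code in requested if code not in available]


# `installed` would normally come from pytesseract.get_languages(config="")
installed = ["eng", "fra", "osd"]
print(missing_languages("eng+fra+deu", installed))  # ['deu']
```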

Page segmentation and OCR engine modes (--psm / --oem)

Two config flags have an outsized effect on results. Pass them via the config argument.

--psm (page segmentation mode) tells Tesseract how the text is laid out. The default (3, fully automatic) works for full pages, but for a single line, a single word, or sparse text it often guesses wrong. The most useful modes:

--psm | Use it when the image is...
3 | A full page / document (default)
4 | A single column of variable-size text
6 | A single uniform block of text
7 | A single text line
8 | A single word
10 | A single character
11 | Sparse text, no particular order
12 | Sparse text with orientation detection

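--oem (OCR engine mode) selects the recognition engine. --oem 3 (the default) picks whichever engine the installed language data supports; on a typical Tesseract 5 install that is the LSTM neural-net engine, which you almost always want. Modes 0 and 2 only work with language data that still includes the legacy model (the tessdata repository, not tessdata_fast or tessdata_best):

--oem | Engine
0 | Legacy engine only
1 | Neural-net LSTM only
2 | Legacy + LSTM combined
3 | Default, based on what is available (LSTM on a typical install; recommended)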

Combine them in one config string:

config = r"--oem 3 --psm 6"
text = pytesseract.image_to_string(img, lang="eng", config=config)

# Restrict to digits (e.g. reading an amount or a code)
digits = pytesseract.image_to_string(
    img, config=r"--psm 7 -c tessedit_char_whitelist=0123456789"
)
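
Config strings are easy to mistype, so one option is to assemble them from validated values. The sketch below assumes the mode ranges Tesseract 4 and 5 document (--psm 0 to 13, --oem 0 to 3); build_config is a hypothetical helper, not a pytesseract function.

```python
def build_config(psm=3, oem=3, whitelist=None):
    """Assemble a Tesseract config string from validated options."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # Whitespace would split the option into separate arguments
        if any(ch.isspace() for ch in whitelist):
            raise ValueError("whitelist must not contain whitespace")
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(psm=7, whitelist="0123456789"))
# --oem 3 --psm 7 -c tessedit_char_whitelist=0123456789
```

The returned string can be passed straight to the config argument of image_to_string.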

Structured output: bounding boxes and confidence

For more than a plain string, use image_to_data to get every word with its position and a confidence score (0-100, or -1 for layout rows that carry no word). This is how you build searchable PDFs, highlight matches, or filter out low-confidence garbage. Asking for Output.DICT gives you parallel lists keyed by field:

import pytesseract
from pytesseract import Output
from PIL import Image

img = Image.open("invoice.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

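for i, word in enumerate(data["text"]):
    # conf may arrive as an int, a float, or a numeric string depending on the
    # pytesseract / Tesseract versions; layout rows without a word report -1
    conf = float(data["conf"][i])
    if word.strip() and conf > 60:          # keep only confident words
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        print(f"{word!r} conf={conf:.0f} box=({x},{y},{w},{h})")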

Related helpers: image_to_boxes returns character-level boxes, and image_to_pdf_or_hocr(img, extension="pdf") produces a searchable PDF with an invisible text layer over the original image — ideal for archiving scanned documents while keeping them searchable.
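
Word-level rows are often more useful once regrouped into lines. The sketch below rebuilds lines from the page_num, block_num, par_num and line_num columns that image_to_data returns and attaches a mean confidence to each. The helper name lines_with_confidence is hypothetical, and the sample dict is hand-written in the Output.DICT shape rather than real OCR output.

```python
from collections import defaultdict


def lines_with_confidence(data, min_conf=0.0):
    """Group image_to_data(Output.DICT) words into lines with a mean confidence."""
    grouped = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if not word.strip() or conf < min_conf:
            continue  # skip layout rows (conf == -1) and empty strings
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        grouped[key].append((word, conf))

    lines = []
    for key in sorted(grouped):
        words = grouped[key]
        text = " ".join(w for w, _ in words)
        mean_conf = sum(c for _, c in words) / len(words)
        lines.append((text, round(mean_conf, 1)))
    return lines


# Hand-written sample in the same shape as Output.DICT (not real OCR output)
sample = {
    "page_num": [1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "text": ["", "Invoice", "#1042", "Total", "42.00"],
    "conf": [-1, 96, 91, 95, 60],
}
print(lines_with_confidence(sample))
# [('Invoice #1042', 93.5), ('Total 42.00', 77.5)]
```

A per-line mean is a convenient unit for review queues: a line whose average drops below your chosen cut-off can be routed to a person instead of being stored as-is.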

Adding OCR to a Django app

Since this is a Django guide: a typical pattern is an upload → extract endpoint. The user uploads an image, you OCR it server-side and return (or store) the text. Run OCR off the request thread for anything non-trivial — Tesseract is CPU-bound and will block your worker — so a Celery task is the production-ready approach. Here's a minimal synchronous view to show the shape:

# views.py
import pytesseract
from PIL import Image, UnidentifiedImageError
from django.http import JsonResponse
from django.views.decorators.http import require_POST


@require_POST
def ocr_upload(request):
    upload = request.FILES.get("image")
    if not upload:
        return JsonResponse({"error": "No image provided"}, status=400)

    # Pillow reads directly from the uploaded file object
    try:
        image = Image.open(upload)
    except UnidentifiedImageError:
        return JsonResponse({"error": "Unsupported or corrupt image"}, status=400)

    text = pytesseract.image_to_string(image, lang="eng")

    return JsonResponse({"text": text.strip()})


# urls.py
# from django.urls import path
# from .views import ocr_upload
# urlpatterns = [path("api/ocr/", ocr_upload, name="ocr-upload")]

In production: validate the uploaded file type and size, run OCR in a background task (Celery/RQ), cache results, and store extracted text in a field you can full-text search. If you're building document-heavy features into a Django product, our team does this regularly — see our Django development services and broader Python development services.
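
As one possible shape for the "validate the uploaded file type and size" step, the sketch below checks the size and the leading magic bytes before any decoding happens. The 10 MB cap, the accepted formats, and the name sniff_image_type are all assumptions for illustration. It only catches mislabelled or oversized files; Pillow still has to decode the image afterwards.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed 10 MB cap; tune for your deployment

# Leading "magic" bytes of the raster formats this sketch accepts
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
    b"BM": "bmp",
}


def sniff_image_type(head, size):
    """Return a format name for an acceptable upload, or None to reject it."""
    if size <= 0 or size > MAX_UPLOAD_BYTES:
        return None
    for signature, name in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None
```

In the view above this could be called as sniff_image_type(upload.read(16), upload.size), followed by upload.seek(0) so Pillow reads the file from the start.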

When to use cloud OCR instead

Tesseract is free, runs locally (no data leaves your server), and is excellent for clean printed text. But it has real limits: it struggles with handwriting, complex multi-column layouts, dense tables, and low-quality phone photos, and it has no built-in understanding of what a field is.

Consider a managed service such as AWS Textract, Google Cloud Vision, or Azure Document Intelligence when you need:

  • Form and table extraction with key-value pairs (e.g. parsing invoices or IDs).
  • Reliable handwriting recognition.
  • Higher accuracy on messy, real-world documents at scale without maintaining preprocessing pipelines.

The trade-offs are per-call cost, network latency, and sending documents to a third party. A common hybrid: Tesseract for the easy, high-volume, privacy-sensitive cases and a cloud API for the hard documents. Increasingly, teams also pair OCR with an LLM to clean up output and extract structured fields — something we build as custom AI feature development and intelligent document-processing pipelines.

A note on accuracy

Be realistic: OCR is never 100% accurate on arbitrary input. Even with good preprocessing, expect occasional misreads — 0/O, 1/l/I, and rn/m are classic confusions. Always:

  • Measure accuracy on your real documents, not a clean sample.
  • Use confidence scores from image_to_data to flag uncertain output for human review.
  • Constrain the problem where you can (character whitelists, fixed --psm, cropping to regions of interest).

Clean inputs and a tight, well-tuned pipeline beat any single magic flag.
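
One common way to put a number on "measure accuracy" is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is a plain-Python version under that definition; the function names and the sample strings are ours, not taken from any OCR library.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edits needed to turn the OCR output into the reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Classic confusions: I read as l, and two zeros read as the letter O
print(character_error_rate("Invoice 1001", "lnvoice 1OO1"))  # 0.25
```

Tracking this figure over a fixed set of labelled samples shows whether a preprocessing or config change actually helped, rather than relying on eyeballing a few outputs.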

Frequently Asked Questions

Why is pytesseract returning an empty string?

Usually the image is too low-resolution, too low-contrast, or skewed, or Tesseract can't find the binary. First confirm tesseract --version works in your shell (on Windows, set pytesseract.pytesseract.tesseract_cmd). Then preprocess: grayscale, upscale to roughly 300 DPI, and apply Otsu thresholding before calling image_to_string.

Do I need to install Tesseract separately from pytesseract?

Yes. pytesseract is only a Python wrapper that calls the Tesseract command-line engine. You must install the Tesseract binary via your OS package manager (apt, Homebrew, or the Windows installer) in addition to pip install pytesseract.

How do I OCR a PDF in Python?

Tesseract doesn't read PDFs directly. Convert each page to an image with pdf2image (which needs the Poppler utilities), then run image_to_string on each page image. Render at around 300 DPI for the best balance of accuracy and speed.

How do I improve OCR accuracy?

Fix the input before touching config. Convert to grayscale, upscale small text, denoise, deskew, and binarise (Otsu or adaptive threshold) with OpenCV. Then set an appropriate --psm for the layout, keep --oem 3 (the LSTM engine), and use a character whitelist when the content is constrained (e.g. digits only).

What's the difference between --psm and --oem?

--psm (page segmentation mode) tells Tesseract how the text is laid out — a full page, a single line, a single word, or sparse text. --oem (OCR engine mode) chooses which recognition engine runs; on Tesseract 5 the default --oem 3 uses the neural-net LSTM engine, which you should keep in almost all cases.

Is Tesseract good enough for production, or should I use a cloud OCR service?

Tesseract is great for clean printed text, runs locally with no data leaving your server, and costs nothing. For handwriting, complex tables/forms, or noisy real-world documents, a managed service like AWS Textract, Google Cloud Vision, or Azure Document Intelligence is usually more accurate. Many teams run a hybrid: Tesseract for easy, high-volume, privacy-sensitive jobs and a cloud API for the hard documents.

Wrapping up

You now have a complete OCR workflow in Python: install Tesseract 5 and pytesseract, run image_to_string, handle PNG/JPG/TIFF/BMP (and PDFs via pdf2image), preprocess with OpenCV for accuracy, tune --psm/--oem, pull structured data with confidence scores, and expose it through a Django endpoint. For most clean documents Tesseract is all you need; for the messy ones, lean on preprocessing or a cloud API.

Further reading from our blog: extracting data from PDFs and Microsoft Office files in Python and web scraping with Beautiful Soup.
