Extract text from images in Python with pytesseract
If you need to pull text out of a PNG, JPG, TIFF, or scanned page in Python, the fastest path is pytesseract — a thin wrapper around Google's Tesseract OCR engine. In a few lines you can turn an image into a string; with a little OpenCV preprocessing you can push accuracy from "mostly readable" to "clean enough for a database".
This tutorial is current for Python 3.12+, Tesseract 5, pytesseract, and Pillow / OpenCV. By the end you'll be able to:
- Install Tesseract and the Python bindings on Linux, macOS, and Windows.
- Run the minimal image_to_string example and read multiple image formats.
- Preprocess images (grayscale, threshold, denoise, deskew, upscale) for better accuracy.
- Tune page-segmentation (--psm) and engine (--oem) modes.
- Use language packs and extract structured data with bounding boxes and confidence scores.
- Add an OCR endpoint to a Django app and know when to reach for cloud OCR instead.
Prerequisites: Python 3.12+, pip, and the ability to install a system package (Tesseract is a separate binary, not a pip-only library).
How OCR with Tesseract actually works
Optical Character Recognition (OCR) is the process of electronically extracting machine-readable text from images, scanned documents, and PDFs so the text can be searched, indexed, copied, or processed.
pytesseract does not do the recognition itself — it shells out to the Tesseract command-line engine and parses the result. So you always need two things installed: the Tesseract binary (the engine + trained language data) and the pytesseract Python package (the wrapper). Get the binary right first; most "OCR not working" problems are a missing or unconfigured Tesseract install.
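Because the wrapper only shells out, a quick pre-flight check can tell you whether the engine half is missing. The sketch below uses only the standard library. The helper name find_tesseract and the fallback path list are illustrative assumptions, not part of pytesseract.

```python
import shutil
from pathlib import Path

# Hypothetical fallback locations; adjust them for your own machines.
FALLBACK_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
]


def find_tesseract(name="tesseract", fallbacks=FALLBACK_PATHS):
    """Return the path to the Tesseract binary, or None if it cannot be found."""
    found = shutil.which(name)  # searches the directories on PATH
    if found:
        return found
    # Not on PATH: try a few well-known install locations.
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "Tesseract binary not found: install the engine first")
```

If the helper returns a path that is not on your PATH, you can assign it to pytesseract.pytesseract.tesseract_cmd, the same attribute used in the Windows instructions.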
Step 1: Install Tesseract (the engine)
Install the system binary first. Tesseract 5 is the current major version.
Ubuntu / Debian (apt):
# Engine + English language data
sudo apt update
sudo apt install -y tesseract-ocr
# Optional: extra language packs, e.g. French + German
sudo apt install -y tesseract-ocr-fra tesseract-ocr-deu
macOS (Homebrew):
brew install tesseract
# All language packs (large download):
brew install tesseract-lang
Windows: download the installer from the UB Mannheim Tesseract builds and run it. Either add the install folder (e.g. C:\Program Files\Tesseract-OCR) to your PATH, or point pytesseract at the binary explicitly in your code:
import pytesseract
# Windows only: tell pytesseract where the engine lives
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
Verify the engine and see which language packs are installed:
tesseract --version # should report 5.x
tesseract --list-langs # installed language data (eng, osd, ...)
Step 2: Install the Python packages
Now install pytesseract plus the imaging libraries. Pillow handles basic image I/O; OpenCV does the heavy preprocessing.
Note: install Tesseract (the system package above) and pytesseract (the pip package below) separately; pytesseract is just the wrapper. Pillow ships as pillow but is imported as PIL.
pip install pytesseract pillow opencv-python
Step 3: The minimal example
With the engine and packages installed, OCR is two lines. Pass a PIL.Image (or a path / NumPy array) to image_to_string:
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("test.png"))
print(text)
# You can also pass a file path directly
print(pytesseract.image_to_string("receipt.jpg"))
That's the whole "hello world". On clean, high-resolution, high-contrast images you'll get excellent results immediately. On photos, low-DPI scans, or noisy backgrounds, accuracy drops, which is what the preprocessing section fixes.
From the shell you can do the same thing without writing Python:
# Print recognised text to stdout ("stdout" is the output target, not a file)
tesseract test.png stdout
# Force English and write to result.txt (Tesseract appends .txt)
tesseract test-english.jpg result -l eng
Handling multiple image formats (PNG, JPG, TIFF, BMP, PDF)
Tesseract reads the common raster formats out of the box: PNG, JPG/JPEG, TIFF (including multi-page), BMP, GIF, and WebP. Because pytesseract accepts a Pillow image, anything Pillow can open, Tesseract can OCR:
from PIL import Image
import pytesseract
for path in ["scan.png", "photo.jpg", "fax.tiff", "logo.bmp"]:
    text = pytesseract.image_to_string(Image.open(path))
    print(f"=== {path} ===")
    print(text)
# Multi-page TIFF: iterate frames
img = Image.open("multipage.tiff")
for i in range(getattr(img, "n_frames", 1)):
    img.seek(i)
    print(f"--- page {i + 1} ---")
    print(pytesseract.image_to_string(img))
PDFs are different. Tesseract does not read PDF files directly. Convert each page to an image first with pdf2image, which relies on the Poppler utilities (sudo apt install poppler-utils, or brew install poppler):
# pip install pdf2image (and install Poppler on the system)
from pdf2image import convert_from_path
import pytesseract
pages = convert_from_path("document.pdf", dpi=300) # 300 DPI is a good OCR baseline
full_text = []
for page_number, page_image in enumerate(pages, start=1):
    full_text.append(pytesseract.image_to_string(page_image))
print("\n\n".join(full_text))
If your goal is broader document extraction (Word, Excel, native PDFs with a real text layer), OCR is often the wrong tool; see our companion guide on extracting data from PDF and Microsoft Office files in Python. Use OCR only when the text exists purely as pixels (scans, photos, image-only PDFs).
Preprocessing for accuracy with OpenCV
Tesseract's accuracy depends almost entirely on input quality. Feeding it a clean, high-contrast, properly oriented image matters far more than any config flag. The standard preprocessing pipeline — grayscale, scale up, threshold, denoise, deskew — is where most real-world accuracy gains come from.
Grayscale + Otsu thresholding (binarisation) is the single highest-impact step:
import cv2
import pytesseract
image = cv2.imread("noisy_scan.jpg")
# 1. Grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# 2. Upscale small text (Tesseract likes ~300 DPI / capital letters ~30px tall)
gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
# 3. Denoise
gray = cv2.medianBlur(gray, 3)
# 4. Binarise with Otsu's automatic threshold
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(thresh)
print(text)
Deskewing straightens tilted scans, which dramatically improves line detection. Estimate the skew angle from the text pixels and rotate:
import cv2
import numpy as np
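def deskew(gray):
    # Binarise so text pixels are white on black (Otsu picks the threshold);
    # without this step almost every pixel of a real scan counts as "text"
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # (row, col) coordinates of the text pixels; minAreaRect expects float32 or int32
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # The angle range differs between OpenCV releases (older ones return [-90, 0),
    # newer ones (0, 90]), so fold it into [-45, 45] before rotating.
    # Check the rotation direction on a sample scan.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    (h, w) = gray.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), -angle, 1.0)
    return cv2.warpAffine(
        gray, matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE,
    )
Reach for the right step based on the symptom you're seeing: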
| Symptom | Preprocessing fix | OpenCV / approach |
|---|---|---|
| Low contrast, gray background | Grayscale + threshold | cvtColor + threshold (Otsu) |
| Tiny or low-DPI text | Upscale to ~300 DPI | cv2.resize(..., INTER_CUBIC) |
| Speckles / scanner noise | Denoise | medianBlur, fastNlMeansDenoising |
| Tilted / rotated page | Deskew | minAreaRect + warpAffine |
| Uneven lighting / shadows | Adaptive threshold | adaptiveThreshold |
| Faint or broken strokes | Morphology | dilate / erode |
Don't over-process: too aggressive a blur or threshold erases thin strokes and lowers accuracy. Tune on a few representative samples and compare output.
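To make the thresholding step less of a black box, here is a pure-Python sketch of what Otsu's method computes: the cut-off that best separates dark and light pixels by maximising the between-class variance. It is meant for understanding only. The function names are illustrative, cv2.threshold is far faster, and OpenCV's tie-breaking may differ from this sketch.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no light pixels left
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Made-up sample: dark "ink" around 10, light "paper" around 200.
    sample = [10] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print(t, binarise(sample, t)[:3], binarise(sample, t)[-3:])
```

On a page with uneven lighting a single global cut-off like this is exactly what fails, which is why the table points to adaptiveThreshold for that symptom.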
Language packs with lang=
By default Tesseract uses English (eng). Install the relevant tesseract-ocr-<lang> package, then pass lang=. You can combine languages with + for mixed-language documents:
# Single language
pytesseract.image_to_string(img, lang="fra")
# Mixed English + French + German
pytesseract.image_to_string(img, lang="eng+fra+deu")
# List what's available from Python
print(pytesseract.get_languages(config=""))
Page segmentation and OCR engine modes (--psm / --oem)
Two config flags have an outsized effect on results. Pass them via the config argument.
--psm (page segmentation mode) tells Tesseract how the text is laid out. The default (3, fully automatic) works for full pages, but for a single line, a single word, or sparse text it often guesses wrong. The most useful modes:
| --psm | Use it when the image is... |
|---|---|
| 3 | A full page / document (default) |
| 4 | A single column of variable-size text |
| 6 | A single uniform block of text |
| 7 | A single text line |
| 8 | A single word |
| 10 | A single character |
| 11 | Sparse text, no particular order |
| 12 | Sparse text with orientation detection |
--oem (OCR engine mode) selects the recognition engine. On Tesseract 5, --oem 3 (default) uses the LSTM neural-net engine, which you almost always want:
| --oem | Engine |
|---|---|
| 0 | Legacy engine only |
| 1 | Neural-net LSTM only |
| 2 | Legacy + LSTM combined |
| 3 | Default (LSTM, recommended) |
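If you set these flags in several places, a tiny helper can keep the config strings consistent and catch typos early. This is a sketch under stated assumptions: build_config is a hypothetical name, and the accepted ranges (--psm 0 to 13, --oem 0 to 3) reflect the modes Tesseract lists in its help output.

```python
def build_config(psm=3, oem=3, whitelist=None):
    """Assemble a Tesseract config string for pytesseract's config= argument."""
    if psm not in range(0, 14):
        raise ValueError(f"--psm must be between 0 and 13, got {psm}")
    if oem not in range(0, 4):
        raise ValueError(f"--oem must be between 0 and 3, got {oem}")

    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # The config string is split on whitespace, so a whitelist
        # containing spaces would be cut in two.
        if any(ch.isspace() for ch in whitelist):
            raise ValueError("whitelist must not contain whitespace")
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_config(psm=6))
    print(build_config(psm=7, whitelist="0123456789"))
```

The returned string goes straight into the config argument, for example pytesseract.image_to_string(img, config=build_config(psm=7)).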
Combine them in one config string:
config = r"--oem 3 --psm 6"
text = pytesseract.image_to_string(img, lang="eng", config=config)
# Restrict to digits (e.g. reading an amount or a code)
digits = pytesseract.image_to_string(
    img, config=r"--psm 7 -c tessedit_char_whitelist=0123456789"
)
Structured output: bounding boxes and confidence
For more than a plain string, use image_to_data to get every word with its position and a confidence score (0-100). This is how you build searchable PDFs, highlight matches, or filter out low-confidence garbage. Asking for Output.DICT gives you parallel lists keyed by field:
import pytesseract
from pytesseract import Output
from PIL import Image
img = Image.open("invoice.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)
for i, word in enumerate(data["text"]):
    # conf can arrive as an int, a float, or a string such as "96.5"
    conf = int(float(data["conf"][i]))
    if word.strip() and conf > 60: # keep only confident words
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        print(f"{word!r} conf={conf} box=({x},{y},{w},{h})")
Related helpers: image_to_boxes returns character-level boxes, and image_to_pdf_or_hocr(img, extension="pdf") produces a searchable PDF with an invisible text layer over the original image, which is ideal for archiving scanned documents while keeping them searchable.
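The parallel lists from image_to_data also carry page_num, block_num, par_num, and line_num, so you can rebuild text lines from confident words only. The sketch below works on a plain dict shaped like the Output.DICT result, which means it runs without Tesseract. The function name lines_from_data and the sample values are made up for illustration.

```python
from collections import defaultdict


def lines_from_data(data, min_conf=60):
    """Group confident words from an image_to_data dict into text lines."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])  # -1 marks rows that are not words
        if not word.strip() or conf < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append((data["left"][i], word))

    # Order lines by their position keys, and words left to right.
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    ]


if __name__ == "__main__":
    # Hand-written sample; a real dict comes from
    # pytesseract.image_to_data(img, output_type=Output.DICT).
    sample = {
        "text": ["", "Invoice", "No.", "1234", "Total", "???"],
        "conf": ["-1", "96", "91", "88", "93", "12"],
        "page_num": [1, 1, 1, 1, 1, 1],
        "block_num": [0, 1, 1, 1, 1, 1],
        "par_num": [0, 1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 1, 2, 2],
        "left": [0, 10, 90, 130, 10, 80],
    }
    print(lines_from_data(sample))  # ['Invoice No. 1234', 'Total']
```

Dropping words below the cut-off trades recall for precision, so pick min_conf by checking a few of your own documents rather than trusting the default here.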
Adding OCR to a Django app
Since this is a Django guide: a typical pattern is an upload → extract endpoint. The user uploads an image, you OCR it server-side and return (or store) the text. Run OCR off the request thread for anything non-trivial — Tesseract is CPU-bound and will block your worker — so a Celery task is the production-ready approach. Here's a minimal synchronous view to show the shape:
# views.py
import pytesseract
from PIL import Image
from django.http import JsonResponse
from django.views.decorators.http import require_POST
@require_POST
def ocr_upload(request):
    upload = request.FILES.get("image")
    if not upload:
        return JsonResponse({"error": "No image provided"}, status=400)
    # Pillow reads directly from the in-memory upload
    image = Image.open(upload)
    text = pytesseract.image_to_string(image, lang="eng")
    return JsonResponse({"text": text.strip()})
# urls.py
# from django.urls import path
# from .views import ocr_upload
# urlpatterns = [path("api/ocr/", ocr_upload, name="ocr-upload")]
In production: validate the uploaded file type and size, run OCR in a background task (Celery/RQ), cache results, and store extracted text in a field you can full-text search. If you're building document-heavy features into a Django product, our team does this regularly; see our Django development services and broader Python development services.
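The "validate the uploaded file type and size" step can start as a small pure function that the view calls before opening the image. This is a minimal sketch: validate_upload, the extension list, and the 10 MiB limit are illustrative assumptions, not Django or pytesseract settings.

```python
from pathlib import PurePath

# Example policy values; tune them for your own application.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MiB


def validate_upload(filename, size):
    """Return an error message for a rejected upload, or None if it passes."""
    if size <= 0:
        return "Empty file"
    if size > MAX_UPLOAD_BYTES:
        return "File too large"
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "Unsupported file type"
    return None


if __name__ == "__main__":
    print(validate_upload("receipt.JPG", 250_000))
    print(validate_upload("notes.docx", 250_000))
```

In the view you could call validate_upload(upload.name, upload.size) and return a 400 response when it yields a message. An extension check says nothing about the actual bytes, so still be prepared for Image.open to raise on files that are not real images.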
When to use cloud OCR instead
Tesseract is free, runs locally (no data leaves your server), and is excellent for clean printed text. But it has real limits: it struggles with handwriting, complex multi-column layouts, dense tables, and low-quality phone photos, and it has no built-in understanding of what a field is.
Consider a managed service such as AWS Textract, Google Cloud Vision, or Azure Document Intelligence when you need:
- Form and table extraction with key-value pairs (e.g. parsing invoices or IDs).
- Reliable handwriting recognition.
- Higher accuracy on messy, real-world documents at scale without maintaining preprocessing pipelines.
The trade-offs are per-call cost, network latency, and sending documents to a third party. A common hybrid: Tesseract for the easy, high-volume, privacy-sensitive cases and a cloud API for the hard documents. Increasingly, teams also pair OCR with an LLM to clean up output and extract structured fields — something we build as custom AI feature development and intelligent document-processing pipelines.
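One way to implement the hybrid is to run Tesseract first and escalate only when its own confidence is low. The sketch below shows just the routing decision. The function names and the threshold of 75 are assumptions you would need to calibrate on your own documents, and the call to the cloud service is deliberately left out.

```python
def mean_confidence(confidences):
    """Average the word confidences, ignoring the -1 rows that are not words."""
    scores = [float(c) for c in confidences if float(c) >= 0]
    return sum(scores) / len(scores) if scores else 0.0


def choose_engine(confidences, threshold=75.0):
    """Return 'tesseract' when the local result looks trustworthy, else 'cloud'."""
    if mean_confidence(confidences) >= threshold:
        return "tesseract"
    return "cloud"


if __name__ == "__main__":
    # Values shaped like the "conf" list from image_to_data.
    print(choose_engine([96, 91, -1, 88]))
    print(choose_engine(["40", "-1", "55"]))
```

A page with no recognised words has a mean of 0.0 and is routed to the cloud path, which is usually the behaviour you want for blank-looking or unreadable scans.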
A note on accuracy
Be realistic: OCR is never 100% accurate on arbitrary input. Even with good preprocessing, expect occasional misreads — 0/O, 1/l/I, and rn/m are classic confusions. Always:
- Measure accuracy on your real documents, not a clean sample.
- Use confidence scores from image_to_data to flag uncertain output for human review.
- Constrain the problem where you can (character whitelists, fixed --psm, cropping to regions of interest).
Clean inputs and a tight, well-tuned pipeline beat any single magic flag.
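To measure accuracy rather than eyeball it, a common metric is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. Here is a standard-library sketch; the function names are illustrative and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalised by the reference length (0.0 is perfect)."""
    if not reference:
        return 0.0 if not ocr_output else 1.0
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "Invoice 1001"
    ocr_output = "lnvoice l0O1"  # typical I/l and 0/O confusions
    print(f"CER = {character_error_rate(reference, ocr_output):.2f}")
```

Tracking this number over a fixed set of real documents lets you tell whether a preprocessing or --psm change actually helped.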
Frequently Asked Questions
Why is pytesseract returning an empty string?
Usually the image is too low-resolution, too low-contrast, or skewed, or Tesseract can't find the binary. First confirm tesseract --version works in your shell (on Windows, set pytesseract.pytesseract.tesseract_cmd). Then preprocess: grayscale, upscale to roughly 300 DPI, and apply Otsu thresholding before calling image_to_string.
Do I need to install Tesseract separately from pytesseract?
Yes. pytesseract is only a Python wrapper that calls the Tesseract command-line engine. You must install the Tesseract binary via your OS package manager (apt, Homebrew, or the Windows installer) in addition to pip install pytesseract.
How do I OCR a PDF in Python?
Tesseract doesn't read PDFs directly. Convert each page to an image with pdf2image (which needs the Poppler utilities), then run image_to_string on each page image. Render at around 300 DPI for the best balance of accuracy and speed.
How do I improve OCR accuracy?
Fix the input before touching config. Convert to grayscale, upscale small text, denoise, deskew, and binarise (Otsu or adaptive threshold) with OpenCV. Then set an appropriate --psm for the layout, keep --oem 3 (the LSTM engine), and use a character whitelist when the content is constrained (e.g. digits only).
What's the difference between --psm and --oem?
--psm (page segmentation mode) tells Tesseract how the text is laid out — a full page, a single line, a single word, or sparse text. --oem (OCR engine mode) chooses which recognition engine runs; on Tesseract 5 the default --oem 3 uses the neural-net LSTM engine, which you should keep in almost all cases.
Is Tesseract good enough for production, or should I use a cloud OCR service?
Tesseract is great for clean printed text, runs locally with no data leaving your server, and costs nothing. For handwriting, complex tables/forms, or noisy real-world documents, a managed service like AWS Textract, Google Cloud Vision, or Azure Document Intelligence is usually more accurate. Many teams run a hybrid: Tesseract for easy, high-volume, privacy-sensitive jobs and a cloud API for the hard documents.
Wrapping up
You now have a complete OCR workflow in Python: install Tesseract 5 and pytesseract, run image_to_string, handle PNG/JPG/TIFF/BMP (and PDFs via pdf2image), preprocess with OpenCV for accuracy, tune --psm/--oem, pull structured data with confidence scores, and expose it through a Django endpoint. For most clean documents Tesseract is all you need; for the messy ones, lean on preprocessing or a cloud API.
Further reading from our blog: extracting data from PDFs and Microsoft Office files in Python and web scraping with Beautiful Soup.