Extract Text and Data from PDF and Microsoft Office Files in Python

August 21, 2023 · Updated June 10, 2026

To extract text and data from PDF and Microsoft Office files in Python, use a small set of focused, actively maintained libraries: pdfplumber or pypdf for PDFs, python-docx for Word (.docx), openpyxl (or pandas.read_excel) for Excel (.xlsx), and python-pptx for PowerPoint (.pptx). For scanned PDFs and images that contain no embedded text, fall back to OCR with pytesseract. When you need one tool that handles every format at once, unstructured or Apache Tika can parse them all through a single API.

This guide replaces older approaches that relied on the long-abandoned slate package and Python 2 syntax. Everything below targets Python 3.11+ and current library versions, and the patterns map directly onto the document-ingestion step of RAG and LLM pipelines.

Which library for which file format

The right tool depends on the format and whether the file contains real, selectable text or just a scanned image. This table maps each common format to the library we reach for first.

File type | Extension | Recommended library | Notes
PDF (text) | .pdf | pdfplumber, pypdf | pdfplumber for tables and layout; pypdf is lighter for plain text
PDF (scanned) | .pdf | pdf2image + pytesseract | Rasterise pages, then OCR each image
Word | .docx | python-docx | Paragraphs, tables, headers and footers
Excel | .xlsx, .xlsm | openpyxl, pandas | pandas.read_excel for data analysis; openpyxl for cell-level control
PowerPoint | .pptx | python-pptx | Iterate slides and shapes that have a text frame
Legacy Word | .doc | LibreOffice convert, textract | Convert to .docx first for reliable results
Legacy Excel | .xls | pandas + xlrd | xlrd reads the old binary .xls format only
Images | .png, .jpg, .tiff | pytesseract + pillow | Pure OCR
Everything mixed | (any) | unstructured, Apache Tika | One API across all formats

A quick rule of thumb: if a PDF lets you select text in a viewer, a text extractor will work; if selection highlights nothing, the page is an image and you need OCR.
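
The table above can be condensed into a lookup keyed by file extension. This is only a sketch: the names RECOMMENDED, FALLBACK and recommend() are hypothetical, and the mapping does nothing more than restate the table.

```python
from pathlib import Path

# Extension -> libraries to try first, restating the table above.
RECOMMENDED: dict[str, tuple[str, ...]] = {
    ".pdf": ("pdfplumber", "pypdf"),  # add pdf2image + pytesseract for scans
    ".docx": ("python-docx",),
    ".xlsx": ("openpyxl", "pandas"),
    ".xlsm": ("openpyxl", "pandas"),
    ".pptx": ("python-pptx",),
    ".doc": ("LibreOffice convert", "textract"),
    ".xls": ("pandas", "xlrd"),
    ".png": ("pytesseract", "pillow"),
    ".jpg": ("pytesseract", "pillow"),
    ".tiff": ("pytesseract", "pillow"),
}

# Catch-all parsers for anything not listed above.
FALLBACK: tuple[str, ...] = ("unstructured", "Apache Tika")


def recommend(path: str) -> tuple[str, ...]:
    """Return the libraries to try first for a file, based on its extension."""
    return RECOMMENDED.get(Path(path).suffix.lower(), FALLBACK)


print(recommend("Quarterly-Report.PDF"))
```

Lower-casing the suffix matters in practice, because uploaded files often arrive with extensions such as .PDF or .Docx.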

Install the libraries

Install only what you need. Each format is independent, so you can start with PDFs and add the Office parsers later.

# Core extractors (pick what you need)
pip install pdfplumber pypdf python-docx openpyxl python-pptx pandas

# OCR fallback for scanned PDFs and images
pip install pytesseract pdf2image pillow

# Optional: one library that handles every format
pip install "unstructured[all-docs]"

OCR also needs the system-level Tesseract binary, and pdf2image needs Poppler. On Debian or Ubuntu:

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils

# macOS (Homebrew)
# brew install tesseract poppler
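
Because the OCR path depends on system binaries rather than Python packages, it can help to check for them at start-up instead of failing on the first scanned file. A minimal sketch using only the standard library; check_ocr_binaries() is a hypothetical helper, and it assumes Poppler is detectable through its pdftoppm executable.

```python
import shutil


def check_ocr_binaries() -> dict[str, bool]:
    """Report whether the OCR system dependencies are on PATH."""
    # `tesseract` is the OCR engine; `pdftoppm` ships with Poppler and is
    # one of the tools pdf2image calls to rasterise pages.
    return {
        name: shutil.which(name) is not None
        for name in ("tesseract", "pdftoppm")
    }


for name, found in check_ocr_binaries().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```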

Extract text from a PDF

pdfplumber is a dependable first choice for text-based PDFs because it preserves layout and can pull out tables. Calling page.extract_text() on each page gives one string per page, so you keep page boundaries for citations and chunking.

import pdfplumber


def extract_pdf_text(path: str) -> list[str]:
    """Return a list of strings, one per page, from a text-based PDF."""
    pages: list[str] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    return pages


if __name__ == "__main__":
    pages = extract_pdf_text("sample.pdf")
    print(f"Extracted {len(pages)} pages")
    print(pages[0][:500])
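
Keeping page boundaries pays off once you attach page numbers for citations. The helper below is a hypothetical sketch, not part of pdfplumber: it takes the list returned by extract_pdf_text() and tags each non-empty page with a 1-based page number and a source name.

```python
def pages_to_records(pages: list[str], source: str) -> list[dict]:
    """Attach a 1-based page number and source name to each non-empty page."""
    return [
        {"source": source, "page": number, "text": text.strip()}
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]


# Stand-in for the output of extract_pdf_text("sample.pdf")
sample_pages = ["Intro text", "", "  Closing text  "]
for record in pages_to_records(sample_pages, "sample.pdf"):
    print(record)
```

Empty pages are skipped but the numbering is not compacted, so a citation still points at the page a reader would open in a viewer.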

If you only need raw text and want a lighter dependency, use pypdf (the maintained successor to the deprecated PyPDF2). It also reads document metadata and handles encrypted files; for AES-encrypted PDFs, install it with the crypto extra (pip install "pypdf[crypto]").

from pypdf import PdfReader

reader = PdfReader("sample.pdf")

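# Decrypt password-protected PDFs before reading
if reader.is_encrypted:
    # decrypt() returns a falsy result when the password does not match
    if not reader.decrypt("your-password-here"):
        raise ValueError("Wrong password for sample.pdf")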

all_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(all_text[:500])

# Document metadata (title, author, etc.)
print(reader.metadata)

Pull tables out of a PDF

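Financial statements, invoices and reports often store data in tables. pdfplumber's page.extract_tables() returns each table on a page as a list of rows, which drops straight into pandas. Empty cells come back as None, and the example below assumes the first row of each table is the header.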

import pandas as pd
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            df = pd.DataFrame(table[1:], columns=table[0])
            print(f"Page {page_number} table:")
            print(df.head())
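
Extracted tables are rarely tidy: header cells can be empty or repeated, and blank rows are common. The following is a sketch of one way to normalise a table before loading it anywhere. table_to_records() and the column_N naming are assumptions for illustration, not pdfplumber features.

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Turn a header row plus data rows into dicts, tolerating gaps."""
    if len(table) < 2:
        return []  # empty table, or a header with no data rows

    headers: list[str] = []
    for index, cell in enumerate(table[0]):
        # Empty header cells get a positional placeholder name
        name = (cell or "").strip() or f"column_{index + 1}"
        # Disambiguate repeated names so no column overwrites another
        if name in headers:
            name = f"{name}_{index + 1}"
        headers.append(name)

    return [
        {name: (cell or "").strip() for name, cell in zip(headers, row)}
        for row in table[1:]
        if any(cell and cell.strip() for cell in row)  # skip blank rows
    ]


# Stand-in for one table returned by page.extract_tables()
sample = [["Item", None, "Item"], ["Widget", "2", "x"], [None, "", None]]
print(table_to_records(sample))
```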

Extract text from Word documents (.docx)

python-docx reads paragraphs and tables from modern .docx files. The text lives in document.paragraphs, and tabular data sits in document.tables.

from docx import Document


def extract_docx(path: str) -> dict:
    doc = Document(path)

    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]

    tables = []
    for table in doc.tables:
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        tables.append(rows)

    return {"paragraphs": paragraphs, "tables": tables}


data = extract_docx("contract.docx")
print("\n".join(data["paragraphs"][:10]))

Note that python-docx only reads the modern .docx format. For the legacy binary .doc, convert it first with a headless LibreOffice command (libreoffice --headless --convert-to docx file.doc) and then parse the result.
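
The conversion step can be scripted so legacy files are handled without manual work. This is a minimal sketch under a few assumptions: the binary is on PATH as libreoffice (some installs expose it as soffice instead), the 120-second timeout is arbitrary, and both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(src: str, out_dir: str) -> list[str]:
    """Build the headless LibreOffice command that converts src to .docx."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "docx",
        "--outdir", out_dir,
        src,
    ]


def convert_doc_to_docx(src: str, out_dir: str = ".") -> Path:
    """Convert a legacy .doc file and return the path of the new .docx."""
    # check=True raises CalledProcessError if LibreOffice exits with an error
    subprocess.run(build_convert_command(src, out_dir), check=True, timeout=120)
    return Path(out_dir) / (Path(src).stem + ".docx")


print(build_convert_command("file.doc", "converted"))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.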

Extract data from Excel spreadsheets (.xlsx)

For analysis, pandas.read_excel is the quickest path: it loads each sheet into a DataFrame. Under the hood it uses openpyxl for .xlsx files, so make sure openpyxl is installed alongside pandas.

import pandas as pd

# Read every sheet into a dict of {sheet_name: DataFrame}
sheets = pd.read_excel("workbook.xlsx", sheet_name=None, engine="openpyxl")

for name, df in sheets.items():
    print(f"Sheet '{name}': {df.shape[0]} rows x {df.shape[1]} cols")
    print(df.head())

When you need cell-level control, formulas or formatting rather than a tidy DataFrame, use openpyxl directly.

from openpyxl import load_workbook

# data_only=True returns the value Excel last cached for each formula
# (None if the file was never calculated); omit it to read the formula text
wb = load_workbook("workbook.xlsx", data_only=True)

for sheet in wb.worksheets:
    print(f"--- {sheet.title} ---")
    for row in sheet.iter_rows(values_only=True):
        print(row)
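
The rows yielded by iter_rows(values_only=True) are plain tuples, so a small helper can turn them into dictionaries keyed by the header row. A sketch assuming the first row holds the column names; rows_to_records() is a hypothetical helper, not an openpyxl function.

```python
from typing import Any, Iterable


def rows_to_records(rows: Iterable[tuple]) -> list[dict[str, Any]]:
    """Use the first row as headers and return the remaining rows as dicts."""
    iterator = iter(rows)
    header = next(iterator, None)
    if header is None:
        return []  # the sheet had no rows at all
    names = [
        str(cell) if cell is not None else f"column_{index + 1}"
        for index, cell in enumerate(header)
    ]
    return [
        dict(zip(names, row))
        for row in iterator
        if any(value is not None for value in row)  # drop fully empty rows
    ]


# Stand-in for sheet.iter_rows(values_only=True)
sample_rows = [("sku", "qty", None), ("A-1", 3, "note"), (None, None, None)]
print(rows_to_records(sample_rows))
```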

Extract text from PowerPoint decks (.pptx)

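python-pptx walks every slide and every shape. Most slide text lives in shapes that have a text frame, so guard for that before reading; text inside tables and speaker notes is stored separately and is not covered by this check.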

from pptx import Presentation


def extract_pptx(path: str) -> list[str]:
    prs = Presentation(path)
    slides_text: list[str] = []
    for slide in prs.slides:
        parts = []
        for shape in slide.shapes:
            if shape.has_text_frame:
                for para in shape.text_frame.paragraphs:
                    text = "".join(run.text for run in para.runs)
                    if text.strip():
                        parts.append(text)
        slides_text.append("\n".join(parts))
    return slides_text


for i, text in enumerate(extract_pptx("deck.pptx"), start=1):
    print(f"Slide {i}:\n{text}\n")
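
Slide tables and speaker notes sit outside the text-frame check, so they need their own pass. The sketch below assumes the python-pptx attributes has_table, table.rows, has_notes_slide and notes_slide.notes_text_frame; extract_slide_extras() itself is a hypothetical helper. It uses getattr so that shapes without a has_table attribute are simply skipped.

```python
def extract_slide_extras(slide) -> dict:
    """Collect table cells and speaker notes from one python-pptx slide."""
    tables = []
    for shape in slide.shapes:
        # Tables live in graphic frames, which have no text frame of their own
        if getattr(shape, "has_table", False):
            tables.append(
                [[cell.text for cell in row.cells] for row in shape.table.rows]
            )

    notes = ""
    if slide.has_notes_slide:
        frame = slide.notes_slide.notes_text_frame
        # The notes placeholder can be missing, in which case frame is None
        if frame is not None:
            notes = frame.text.strip()

    return {"tables": tables, "notes": notes}
```

Call it inside the same loop as extract_pptx(), for example extract_slide_extras(slide) for each slide in prs.slides, and merge the result with the text you already collect.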

OCR fallback for scanned PDFs and images

When extract_text() returns empty strings, the page is most likely a scanned image with no embedded text layer. Rasterise each page with pdf2image, then run OCR with pytesseract. This is the single most common gap in document pipelines, so build it in from the start.

import pytesseract
from pdf2image import convert_from_path


def ocr_pdf(path: str, dpi: int = 300, lang: str = "eng") -> list[str]:
    """OCR every page of a scanned PDF. Higher dpi = better accuracy, slower run."""
    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(img, lang=lang) for img in images]


pages = ocr_pdf("scanned-invoice.pdf")
print(pages[0][:500])

A robust extractor tries the fast text path first and only pays the OCR cost when a page comes back empty:

import pdfplumber
import pytesseract
from pdf2image import convert_from_path


def extract_pdf_smart(path: str) -> list[str]:
    """Use embedded text where present, OCR only the pages that need it."""
    pages: list[str] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append((page.extract_text() or "").strip())

    for i, text in enumerate(pages):
        if text:
            continue
        # Rasterise just this page (pdf2image page numbers are 1-based)
        images = convert_from_path(path, dpi=300, first_page=i + 1, last_page=i + 1)
        if images:
            pages[i] = pytesseract.image_to_string(images[0]).strip()

    return pages
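
A strictly empty string is not the only sign of a scanned page: some scans carry a stamped page number or a short watermark as their only real text. One possible refinement is a threshold test. This is a heuristic sketch, and both needs_ocr() and the default of 20 characters are assumptions to tune against your own documents.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it has too few letters or digits."""
    meaningful = sum(ch.isalnum() for ch in text)
    return meaningful < min_chars


print(needs_ocr(""))                                          # blank page
print(needs_ocr("Page 3"))                                    # only a stamped page number
print(needs_ocr("Invoice 2024-001 for consulting services"))  # real text layer
```

Swapping this in for a plain emptiness check trades a little extra OCR time for fewer silently missed pages.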

One library for every format: unstructured and Apache Tika

If you are building an ingestion service that must accept whatever a user uploads, a per-format if/else ladder becomes tedious. Two tools normalise everything behind a single call.

unstructured is a Python-native library that detects the file type and returns a list of typed elements (titles, narrative text, tables, list items). It is a common front door for LLM and RAG ingestion because the element types help you chunk content sensibly.

from unstructured.partition.auto import partition

# Works for .pdf, .docx, .xlsx, .pptx, .html, .eml and more
elements = partition(filename="any-document.pdf")

for el in elements[:10]:
    print(f"[{type(el).__name__}] {el.text[:120]}")

full_text = "\n\n".join(el.text for el in elements if el.text)
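
To illustrate why typed elements help with chunking, the sketch below starts a new section at every Title element and gathers the text that follows it. It relies only on each element having a class name and a text attribute, as in the loop above; group_under_titles() is a hypothetical helper, and unstructured ships its own chunking utilities that are worth evaluating before writing your own.

```python
def group_under_titles(elements) -> list[dict]:
    """Start a new section at every Title element and collect the text below it."""
    sections: list[dict] = []
    for el in elements:
        text = (el.text or "").strip()
        if not text:
            continue
        is_title = type(el).__name__ == "Title"
        if is_title:
            sections.append({"title": text, "body": []})
            continue
        if not sections:
            # Text that appears before the first title gets an untitled section
            sections.append({"title": "", "body": []})
        sections[-1]["body"].append(text)
    return sections
```

Each resulting section can then be embedded as one chunk, with the title kept as metadata.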

Apache Tika is a mature Java content-detection toolkit with a Python client (tika). It runs a local Tika server and extracts text plus metadata from over a thousand file types, which makes it a strong choice when you must handle obscure or legacy formats.

from tika import parser

# Spins up a local Tika server on first call (needs Java installed)
result = parser.from_file("any-document.docx")

print((result["content"] or "")[:500])   # extracted text; content is None when no text was found
print(result["metadata"])         # author, created date, mime type, etc.

A unified extractor by file type

In production we usually wrap the format-specific functions behind one dispatcher so the rest of the application never cares what was uploaded. This pattern keeps each parser testable and makes the OCR fallback explicit.

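from pathlib import Path

# Assumes extract_pdf_smart, extract_docx and extract_pptx from the
# sections above are defined in (or imported into) this module.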


def extract_any(path: str) -> str:
    """Dispatch to the right extractor based on file extension."""
    suffix = Path(path).suffix.lower()

    if suffix == ".pdf":
        return "\n".join(extract_pdf_smart(path))
    if suffix == ".docx":
        data = extract_docx(path)
        return "\n".join(data["paragraphs"])
    if suffix in {".xlsx", ".xlsm"}:
        import pandas as pd
        sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")
        return "\n\n".join(df.to_csv(index=False) for df in sheets.values())
    if suffix == ".pptx":
        return "\n".join(extract_pptx(path))

    raise ValueError(f"Unsupported file type: {suffix}")


print(extract_any("report.pdf")[:500])
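
An alternative to the if ladder is a registry that maps extensions to extractor functions, which makes it easy to swap in stubs during tests. This is a sketch of that variation rather than the dispatcher above; make_dispatcher() and the Extractor alias are hypothetical names.

```python
from pathlib import Path
from typing import Callable

Extractor = Callable[[str], str]


def make_dispatcher(extractors: dict[str, Extractor]) -> Extractor:
    """Build an extract function from an {extension: extractor} mapping."""

    def extract(path: str) -> str:
        suffix = Path(path).suffix.lower()
        try:
            handler = extractors[suffix]
        except KeyError:
            raise ValueError(f"Unsupported file type: {suffix}") from None
        return handler(path)

    return extract


# Stub extractors stand in for the real parsers so the example runs anywhere
extract = make_dispatcher({
    ".pdf": lambda path: f"pdf text from {path}",
    ".docx": lambda path: f"docx text from {path}",
})
print(extract("report.PDF"))
```

In a real service the mapping would point at functions such as extract_pdf_smart, and adding a format becomes a one-line change.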

Where this fits in a document pipeline

Text extraction is step one. Real systems then clean the text, split it into overlapping chunks, attach metadata such as source filename and page number, and embed it into a vector store for retrieval. We build this exact flow for clients who need to search and chat over their documents; see our work on Python development and AI feature development. When the documents live inside a Django application, the same parsers run inside background tasks so large uploads never block a request.
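
The overlapping-chunk step can be sketched in a few lines. This version splits on raw character counts, which is an assumption made for brevity; production pipelines often split on sentences or tokens instead. The function name and default sizes are hypothetical.

```python
def split_with_overlap(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` characters sharing `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    chunks: list[str] = []
    step = size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a chunk reaches the end, so the tail is not repeated
        if start + size >= len(text):
            break
        start += step
    return chunks


print(split_with_overlap("abcdefghij", size=4, overlap=2))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.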

Across 12+ years and 50+ delivered projects, the most common failure we see is treating every PDF as text-based. Scanned contracts, faxed forms and photographed receipts all need the OCR fallback shown above, and skipping it silently drops content from search results.

Frequently Asked Questions

What is the best Python library to extract text from a PDF?

For text-based PDFs, pdfplumber is the best general choice because it preserves layout and extracts tables, returning one string per page. If you only need plain text with a lighter dependency, use pypdf, the maintained successor to the deprecated PyPDF2. For scanned PDFs that contain no embedded text, neither works on its own and you must add OCR with pytesseract.

Why should I not use the slate or PyPDF2 packages anymore?

The slate package is abandoned and was written for Python 2, so it fails on modern interpreters. PyPDF2 has been deprecated and merged back into pypdf, which is now the actively maintained project. Use pdfplumber or pypdf instead; both support Python 3.11+, receive regular updates, and handle encryption and metadata correctly.

How do I extract data from Excel files in Python?

The quickest path is pandas.read_excel, which loads each sheet into a DataFrame; pass sheet_name=None to read every sheet at once. It uses openpyxl as the engine for .xlsx files. When you need cell-level control, formulas, or formatting, use openpyxl directly with load_workbook. For the legacy binary .xls format, install xlrd as the engine.

How do I extract text from a scanned PDF or image?

Scanned PDFs are images, so a normal text extractor returns empty strings. Convert each page to an image with pdf2image, then run OCR with pytesseract, which wraps the Tesseract engine. You also need the system Tesseract binary and Poppler installed. For accuracy, rasterise at around 300 DPI and pass the correct language code to pytesseract.

Is there one Python library that handles PDF, Word, Excel and PowerPoint together?

Yes. The unstructured library detects the file type automatically and returns typed elements across PDF, DOCX, XLSX, PPTX, HTML, email and more through a single partition call, which makes it popular for RAG and LLM ingestion. Apache Tika, used through its Python client, is another option that extracts text and metadata from over a thousand file types but requires a running Java server.

How do I handle password-protected PDF files?

Use pypdf and check reader.is_encrypted before reading. If the file is encrypted, call reader.decrypt with the password, then iterate the pages as usual. This works for PDFs protected with a user password; documents secured with owner-only restrictions or strong DRM may still refuse extraction, which is expected behaviour.
