How to Index Binary Files (PDF, DOCX) in Django Haystack

August 17, 2015 · Updated June 10, 2026

To make binary files like PDFs, Word documents and PowerPoint slides full-text searchable in Django Haystack, you must extract their text first and index that plain string — a search engine indexes text, not raw bytes. There are two proven strategies:

  1. Let the search engine extract the text — Solr's ExtractingRequestHandler wraps Apache Tika and is exposed in Haystack through the Solr backend's extract_file_contents() method. You call it inside a SearchIndex.prepare() override and store the returned text on the index.
  2. Extract the text in Python first — use a library such as tika-python, pdfminer.six, python-docx or textract, then feed the resulting string to an ordinary indexed CharField. This route is backend-agnostic and works with the Elasticsearch backend too.

This guide shows both, with working Haystack 3.x / Django 4.2–5.x / Python 3 code, an honest comparison, and the modern alternatives worth knowing in 2026.

Key takeaways

  • Haystack cannot index a binary file as-is; text extraction always happens before indexing.
  • Strategy 1 (Solr + Tika) uses connections['default'].get_backend().extract_file_contents(file_obj), which returns {'metadata': {...}, 'contents': '...'}. It is Solr-only.
  • Strategy 2 (Python pre-extraction) is portable — the same plain-text function feeds the Elasticsearch, Solr or Whoosh backend, and runs in your own worker so you control errors and async.
  • Build the searchable string with prepare_text() (or a template) so the file text lands in the document=True field.
  • The old Haystack 2.x / Python 2 patterns (super(NewsIndex, self), Context(), self._get_backend()) are obsolete — use the snippets below.
  • For large document corpora, an explicit extract → index pipeline feeding Elasticsearch/OpenSearch is the durable pattern; consider it before stretching Haystack's dated backends.

Why can't Haystack index a binary file directly?

A FileField stores bytes — a PDF or .docx is a compressed container, not readable text. Search backends (Solr, Elasticsearch, Whoosh) build inverted indexes over tokens extracted from strings. So the job is always the same: turn the document into plain text, then hand that text to a Haystack index field. The only real choice is where that extraction runs — inside the search server, or inside your Django process.
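
To see why raw bytes are useless to a tokenizer, the sketch below builds a tiny DOCX-like container in memory with the standard library. A real .docx is a ZIP archive whose main part is word/document.xml; the single-part archive and the sample sentence used here are illustrative assumptions, not a valid Word file.

```python
import io
import re
import zipfile

SENTENCE = "quarterly revenue report"

# Build a minimal DOCX-like ZIP in memory (illustrative only, not a valid Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    body = "".join(f"<w:p><w:t>{SENTENCE}</w:t></w:p>" for _ in range(50))
    archive.writestr("word/document.xml", f"<w:document>{body}</w:document>")
raw_bytes = buffer.getvalue()

# What a search engine would see if handed the file as-is:
# compressed bytes, so a plain byte search for the sentence comes up empty.
found_in_raw = SENTENCE.encode() in raw_bytes

# What extraction does: open the container, read the XML part, drop the markup.
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
    xml = archive.read("word/document.xml").decode("utf-8")
extracted_text = re.sub(r"<[^>]+>", " ", xml)
found_after_extraction = SENTENCE in extracted_text

print(found_in_raw, found_after_extraction)
```

The same sentence is invisible in the stored bytes and plainly present once the container is unpacked, which is the step every strategy in this guide has to perform somewhere.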

Assume a simple model for the rest of this guide:

# models.py
from django.db import models


class Document(models.Model):
    title = models.CharField(max_length=255)
    upload = models.FileField(upload_to='documents/')   # PDF, DOCX, PPTX, ...
    created = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title

Strategy 1: Let Solr + Tika extract the text

If you run the Solr backend, Haystack exposes Solr's Tika-powered ExtractingRequestHandler through extract_file_contents(). Grab the backend, pass it an open file object, and merge the returned contents into your document field inside prepare():

# search_indexes.py  (Haystack 3.x, Solr backend)
from haystack import indexes, connections
from .models import Document


class DocumentIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    created = indexes.DateTimeField(model_attr='created')
    file_contents = indexes.CharField(stored=True, indexed=True, null=True)

    def get_model(self):
        return Document

    def prepare(self, obj):
        data = super().prepare(obj)            # Py3: no super(Cls, self)
        if obj.upload:
            backend = connections['default'].get_backend()
            with obj.upload.open('rb') as fh:
                extracted = backend.extract_file_contents(fh)
            if extracted and extracted.get('contents'):
                # extracted == {'metadata': {...}, 'contents': '...'}
                data['file_contents'] = extracted['contents']
                # append the file text to the main document field
                data['text'] = '%s\n%s' % (data['text'], extracted['contents'])
        return data

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

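The text field still renders its template (e.g. search/indexes/myapp/document_text.txt) for the model's own fields; prepare() then appends the extracted file text to that rendered output. If you prefer to place the file text precisely inside the template instead, note that Haystack's automatic rendering passes only object to the template, so you must render the template yourself in prepare() with file_contents added to the context, then reference it there: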

{# search/indexes/myapp/document_text.txt #}
{{ object.title }}
{{ object.created }}
{{ file_contents|striptags|safe }}
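
Because the automatic rendering supplies only object, file_contents stays empty unless the index renders the template itself. A minimal sketch of that step follows; template_context() and render_document_text() are hypothetical helper names, and the Django import is deferred into the function so the module can be loaded outside a configured project.

```python
def template_context(obj, extracted):
    """Build the context for the document template.

    `extracted` is whatever extract_file_contents() returned: a dict with a
    'contents' key, or a falsy value when extraction produced nothing.
    """
    contents = (extracted or {}).get("contents") or ""
    return {"object": obj, "file_contents": contents}


def render_document_text(obj, extracted,
                         template_name="search/indexes/myapp/document_text.txt"):
    """Render the index template with the extracted file text in its context."""
    # Deferred import: needs a configured Django project that can find the template.
    from django.template import loader

    return loader.get_template(template_name).render(template_context(obj, extracted))
```

Inside prepare(), after the call to extract_file_contents(), assigning data['text'] = render_document_text(obj, extracted) would replace the automatically rendered text with one that includes the file contents at the position the template chooses.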

Then generate and install the Solr schema for your indexes (modern Solr uses a managed schema / configset rather than hand-copying schema.xml into /etc/solr/conf):

# generate the schema Haystack derived from your SearchIndex classes
python manage.py build_solr_schema > schema.xml
# load schema.xml into your Solr core's configset, restart Solr, then:
python manage.py rebuild_index

Strategy 2: Extract the text in Python first (backend-agnostic)

The Solr-only method does not exist on the Elasticsearch backend, and running extraction inside the search server gives you little control over bad files. The portable answer is to extract text yourself and index the string. Put extraction in a reusable helper — Apache Tika via tika-python handles the widest range of formats:

# utils.py  -- one function, any backend
from tika import parser   # pip install tika   (talks to a Tika server / JVM)


def extract_text(file_field):
    """Return plain text from any file Tika understands: PDF, DOC/DOCX, PPTX, XLSX, ..."""
    with file_field.open('rb') as fh:
        parsed = parser.from_buffer(fh.read())
    return (parsed.get('content') or '').strip()

Prefer no JVM? Use format-specific pure-Python libraries and dispatch on the file extension:

# utils.py  -- pure-Python, no JVM
from pathlib import Path
from pdfminer.high_level import extract_text as pdf_text   # pip install pdfminer.six
from docx import Document as Docx                           # pip install python-docx


def extract_text(file_field):
    suffix = Path(file_field.name).suffix.lower()
    with file_field.open('rb') as fh:
        if suffix == '.pdf':
            return pdf_text(fh)
        if suffix == '.docx':
            return '\n'.join(p.text for p in Docx(fh).paragraphs)
    return ''   # unknown type -- skip or log
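
The "skip or log" branch is where pre-extraction pays off: a corrupt or unsupported file can be logged and skipped instead of aborting an indexing run. One way to sketch that is a thin wrapper around any extractor. safe_extract() and the logger name are hypothetical, and the wrapper assumes the extractor takes one argument and returns a string or None.

```python
import logging

logger = logging.getLogger("documents.extraction")  # hypothetical logger name


def safe_extract(extractor, file_field, default=""):
    """Call extractor(file_field); on any failure, log it and return `default`.

    `extractor` is any one-argument callable returning text, such as the
    extract_text() helpers in this guide.
    """
    name = getattr(file_field, "name", "<unnamed>")
    try:
        text = extractor(file_field)
    except Exception:
        # Parsers raise many different exception types on corrupt input, so this
        # catches broadly, records the traceback, and lets indexing continue.
        logger.exception("Text extraction failed for %s", name)
        return default
    return (text or default).strip()
```

In the index, parts.append(safe_extract(extract_text, obj.upload)) would then index the title alone whenever a file cannot be parsed.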

Now the SearchIndex is trivial and works with Elasticsearch or Solr. Build the document field with prepare_text() so the extracted text is part of what gets indexed:

# search_indexes.py  -- works with the Elasticsearch OR Solr backend
from haystack import indexes
from .models import Document
from .utils import extract_text


class DocumentIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True)        # no template -> use prepare_text
    title = indexes.CharField(model_attr='title')
    created = indexes.DateTimeField(model_attr='created')

    def get_model(self):
        return Document

    def prepare_text(self, obj):
        parts = [obj.title]
        if obj.upload:
            parts.append(extract_text(obj.upload))
        return '\n'.join(p for p in parts if p)

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

Extraction can be slow, so for production move extract_text() into a Celery task (or cache the result on the model) and re-extract only when a file changes rather than parsing every file on every rebuild_index. Haystack's RealtimeSignalProcessor updates the index synchronously on every save() and delete(), so if prepare_text() parses the file inline, each save waits for extraction; index the cached text instead so saves stay fast. For deeper analyzer and tokenizer tuning, see our guide on how to customize the Elasticsearch search engine.
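
One way to re-extract only when a file actually changes is to store a content hash next to the cached text and compare it before parsing. The sketch below assumes two extra model fields, text_hash and extracted_text, which are not part of the Document model above; a plain object stands in for the model instance and raw bytes stand in for the uploaded file.

```python
import hashlib
from types import SimpleNamespace


def content_hash(data: bytes) -> str:
    """Stable fingerprint of the uploaded file's bytes."""
    return hashlib.sha256(data).hexdigest()


def refresh_cached_text(doc, data: bytes, extractor) -> bool:
    """Re-run `extractor` only if the bytes changed since the last extraction.

    `doc` needs two attributes (hypothetical fields): `text_hash` and
    `extracted_text`. Returns True when extraction actually ran.
    """
    digest = content_hash(data)
    if doc.text_hash == digest:
        return False                      # same file as last time: keep cached text
    doc.extracted_text = extractor(data)
    doc.text_hash = digest
    return True


# Stand-ins for a model instance and a slow parser, to show the behaviour.
doc = SimpleNamespace(text_hash="", extracted_text="")
calls = []


def slow_extractor(data: bytes) -> str:
    calls.append(len(data))               # record each (expensive) parse
    return data.decode("utf-8")


first = refresh_cached_text(doc, b"annual report", slow_extractor)
second = refresh_cached_text(doc, b"annual report", slow_extractor)     # unchanged
third = refresh_cached_text(doc, b"annual report v2", slow_extractor)   # changed
```

Run from a background task after upload, a function like this keeps the parser out of the request cycle, and prepare_text() only has to read the cached extracted_text.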

Solr/Tika extraction vs Python pre-extraction: which should you pick?

| Factor | Solr + Tika (extract_file_contents) | Python pre-extraction (tika-python / pdfminer.six / python-docx) |
| --- | --- | --- |
| Backend support | Solr backend only | Any backend: Elasticsearch, Solr, Whoosh |
| Where extraction runs | Inside Solr (server-side, blocks the index request) | In your Django/Celery worker (async-friendly, controllable) |
| Setup | Solr + ExtractingRequestHandler (Tika JARs bundled) | pip install libraries; tika-python also needs a JVM/Tika server |
| Format coverage | Whatever Tika supports (PDF, DOC/DOCX, PPTX, XLSX, ...) | Tika = same breadth; pure-Python libs = one format each |
| Error handling | Limited: a bad file can fail the whole commit | Full control: skip/log per file, retry, cache the text |
| Portability | Tied to Solr + Haystack internals | Plain function, reusable anywhere |
| Best for | An existing Solr stack, quick wins | New builds, Elasticsearch users, large/async pipelines |

For most teams in 2026 — especially anyone on Elasticsearch — Strategy 2 wins on portability and control. Strategy 1 is the fastest path only if Solr is already in place.

Which extraction library handles which format?

| Library | Install | Handles | Needs a JVM? |
| --- | --- | --- | --- |
| Apache Tika | pip install tika | PDF, DOC/DOCX, PPTX, XLSX, RTF, HTML, images (OCR via Tesseract) | Yes (Tika server) |
| pdfminer.six | pip install pdfminer.six | PDF only | No |
| python-docx | pip install python-docx | DOCX only (not legacy .doc) | No |
| python-pptx | pip install python-pptx | PPTX only | No |
| textract | pip install textract | Many formats (wraps other tools) | No, but many system deps; less maintained |

What about the Elasticsearch backend and other modern options?

Haystack's Elasticsearch backend does not expose extract_file_contents(), so on Elasticsearch you use Strategy 2. If you talk to Elasticsearch/OpenSearch directly (outside Haystack), the engine has its own server-side extraction: the ingest-attachment processor, which is also Tika under the hood — define an ingest pipeline with an attachment processor and index the base64-encoded file. Because Haystack abstracts the raw engine away, the pragmatic choice when staying inside Haystack is still Python pre-extraction.
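
For readers who do talk to Elasticsearch directly, the two request bodies involved can be sketched with the standard library alone. The pipeline id (attachment), index name (documents) and source field (data) are illustrative choices, and the code only builds the JSON; it sends nothing.

```python
import base64
import json

# Body for: PUT _ingest/pipeline/attachment
# The attachment processor reads base64 from "data" and writes the extracted
# text and metadata under an "attachment" object on the indexed document.
pipeline_body = {
    "description": "Extract text from uploaded files",
    "processors": [{"attachment": {"field": "data"}}],
}


def build_document(title: str, file_bytes: bytes) -> dict:
    """Body for: PUT documents/_doc/<id>?pipeline=attachment"""
    return {
        "title": title,
        "data": base64.b64encode(file_bytes).decode("ascii"),
    }


doc_body = build_document("Q3 report", b"%PDF-1.7 example bytes")
print(json.dumps(pipeline_body, indent=2))
print(json.dumps(doc_body, indent=2))
```

After indexing through such a pipeline, the extracted text is searchable under attachment.content. None of this is reachable through Haystack's backend API, which is why the Python route remains the practical one inside Haystack.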

Be honest about the ecosystem in 2026:

  • django-haystack 3.x supports Django 4.2/5.x, but its backends are dated and there is no first-class vector/semantic search.
  • For document-heavy products, a dedicated extract → embed → index pipeline (Tika/Celery feeding Elasticsearch or OpenSearch) ages better than Haystack's abstractions, and unlocks semantic/vector search.
  • For simpler needs, PostgreSQL full-text search (SearchVector/SearchQuery) or managed engines like Typesense, Meilisearch or Algolia may remove the Solr/JVM operational burden entirely.

New to this stack? Start with implementing search with Django Haystack and Elasticsearch to get a basic index running before layering binary-file extraction on top. If you would rather hand off the search architecture, our Django development services team has shipped Haystack, Elasticsearch and OpenSearch search for clients since 2014.

Frequently Asked Questions

Can Django Haystack index PDFs and Word documents directly?

No. Haystack indexes text, so you must extract the document's text first. Either let Solr's Tika-backed extract_file_contents() do it (Solr backend only), or extract the text in Python with tika-python / pdfminer.six / python-docx and index the resulting string. There is no setting that makes Haystack read raw binary content on its own.

What does extract_file_contents() return?

It returns a dictionary with two keys: metadata (author, content-type, page count and similar attributes Tika found) and contents (a single string of all the text Tika could pull from the file). You typically store contents on a CharField and append it to your document=True field. The method lives only on the Solr backend.
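
Depending on how Solr's extract handler is configured, contents may come back as XHTML rather than bare text, which is why the template earlier pipes it through striptags. If you append it in prepare() instead, a standard-library sketch like the following can strip the markup first. TextOnly and strip_markup() are hypothetical names and the sample input is invented.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)   # decode &amp; and friends
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(contents):
    """Return whitespace-normalised plain text from an (X)HTML string."""
    parser = TextOnly()
    parser.feed(contents or "")
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = "<html><body><p>Annual report</p><p>Revenue &amp; costs</p></body></html>"
print(strip_markup(sample))
```

Passing extracted['contents'] through a helper like this before appending it to data['text'] keeps tag names out of the indexed tokens.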

Does binary-file extraction work with the Elasticsearch backend?

Yes, but not via extract_file_contents() — that method is Solr-only. With the Elasticsearch backend you extract text in Python first (Strategy 2) and index the string through a normal CharField. If you query Elasticsearch directly outside Haystack, you can instead use its server-side ingest-attachment pipeline, which is also Tika under the hood.

Which Python library should I use to extract document text?

For the widest format coverage in one call, use Apache Tika through tika-python — it handles PDF, DOC/DOCX, PPTX, XLSX and more, and can OCR images with Tesseract, but it needs a JVM/Tika server. If you want no JVM, use pdfminer.six for PDFs and python-docx for DOCX and dispatch on the file extension. Avoid relying on textract for new projects; it is broad but loosely maintained.

How do I keep the search index up to date when files change?

Re-index the affected object whenever its file changes. Haystack's RealtimeSignalProcessor updates the index on every save(); because extraction is slow, run extract_text() in a Celery task (or cache the parsed text on the model) so saves stay fast, then update that object in the index from the task. Scheduled rebuild_index and update_index --age <hours> runs cover bulk catch-up. Note that --age only narrows the queryset when the index defines get_updated_field(), so the example model would need a modification timestamp for it to take effect.

Is Django Haystack still a good choice in 2026?

For classic keyword search over Django models it still works — version 3.x supports Django 4.2/5.x. But its backends are dated and it has no native semantic/vector search. For document-heavy or AI-driven search, a dedicated extract-and-index pipeline feeding Elasticsearch/OpenSearch, or a managed engine like Typesense or Meilisearch, usually ages better. PostgreSQL full-text search is enough for lighter needs.
