How to index binary files in Django Haystack

In this post we will index text content stored in structured files such as PDFs, Microsoft Office documents, images, etc., using Haystack and Solr.

To read and store the data, we can use the SearchBackend.extract_file_contents(self, file_obj) method. It takes a file object and returns a dictionary containing two keys: metadata and contents. The contents value is a string containing all of the text the backend managed to extract from the file.
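As a minimal sketch of working with that return value: the helper name extracted_text and the sample dictionary below are made up for illustration, and the code assumes the backend may hand back None when extraction fails rather than a dictionary.

```python
def extracted_text(extracted):
    """Return the text from an extract_file_contents() result, or ""."""
    # Assumption: the backend may return None when extraction fails.
    if not extracted:
        return ""
    # "contents" holds the extracted text; fall back to "" if it is missing.
    return extracted.get("contents") or ""


# Hand-written sample shaped like the dictionary described above
# (the values are invented for illustration, not real backend output).
sample = {"metadata": {"title": ["Annual report"]}, "contents": "Revenue grew."}
print(extracted_text(sample))
print(repr(extracted_text(None)))
```

With a helper like this, the rest of the indexing code can treat the extracted text as a plain string without checking for missing keys.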

Here we override the prepare method of NewsIndex to include the extracted content along with the information retrieved from the database:

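from django.template import loader
from haystack import indexes


class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    # The original passed analyzer='synonym_analyzer' to an unqualified
    # CharField, presumably a custom field class; the stock Haystack
    # field used here does not accept that argument.
    text = indexes.CharField(document=True, use_template=True)
    content = indexes.CharField(model_attr='content')
    # get_model() is omitted here, as in the original.

    def prepare(self, obj):
        data = super(NewsIndex, self).prepare(obj)
        # Ask the backend to extract text and metadata from the attached file.
        file_data = self._get_backend(None).extract_file_contents(obj.new_file)
        template = loader.select_template(
            ("search/indexes/proj/new_text.txt", ),
        )
        # Django 1.8+ takes a plain dict here; older versions need Context({...}).
        data["text"] = template.render({
            "object": obj,
            "file_data": file_data,
        })
        return data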

This allows you to insert the extracted text at the appropriate place in your template, for example:

{{ file_data.contents|striptags|safe }}
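The extraction call inside prepare can also be made a little more defensive. The sketch below is only an illustration under stated assumptions: extract_or_empty and FakeBackend are hypothetical names, not part of Haystack, and FakeBackend merely stands in for a real backend so the snippet runs without Solr. It closes the file after extraction and substitutes an empty result, so that file_data.contents is always available to the template.

```python
import io


def extract_or_empty(backend, file_obj):
    """Call backend.extract_file_contents and always return a dict."""
    empty = {"metadata": {}, "contents": ""}
    try:
        extracted = backend.extract_file_contents(file_obj)
    finally:
        # Close the file whether or not extraction succeeded.
        file_obj.close()
    # Assumption: a failed extraction yields None instead of a dict.
    return extracted or empty


class FakeBackend:
    """Hypothetical stand-in for a Haystack backend, for demonstration only."""

    def extract_file_contents(self, file_obj):
        data = file_obj.read()
        if not data:
            return None  # mimic a failed extraction
        return {"metadata": {}, "contents": data.decode("utf-8")}


result = extract_or_empty(FakeBackend(), io.BytesIO(b"quarterly figures"))
print(result["contents"])
```

In a real index, the backend argument would be the object returned by self._get_backend(None) and the file object would come from the model's file field.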

To index the documents, we first need to generate a schema.xml for our models:

./manage.py build_solr_schema >schema.xml

To tell Solr about our models' schema, copy schema.xml into /etc/solr/conf:

sudo cp /home/git-projs/elasticproject/schema.xml  /etc/solr/conf

Then we can start Solr by running the following command from the folder where you downloaded it:

java -jar start.jar

Posted on 27 September 2013 by Micropyramid
