How to index binary files in django haystack

Reading Time : ~ .

Now we are going to index text content which is stored in structured files such as PDFs, Microsoft Office documents, images, etc using haystack and sorl's

In order to read and store the data, we can use SearchBackend.extract_file_contents(self, file_obj) method. It takes the file object, returns a dictionary containing two keys: metadata and contents. The contents value will be a string containing all of the text which the backend managed to extract the file contents.

Here we are overriding NewsIndex prepare method to include the extracted content along with information retrieved from the database:

class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    text = CharField(document=True, use_template=True, analyzer='synonym_analyzer')
    content = indexes.CharField(model_attr='content')

def prepare(self, obj):
        data = super(NewsIndex, self).prepare(obj)
        file_data = self._get_backend(None).extract_file_contents(obj.new_file)
        template = loader.select_template(
            ("search/indexes/proj/new_text.txt", ),
        )
        data["text"] = template.render(Context({
            "object": obj,
            "file_data": var,
        }))
        return data

This allows you to insert the extracted text at the appropriate place in your template, 

{{ file_data.contents|striptags|safe }}

To index the documents, we need to generate a schema.xml about our models

./manage.py build_solr_schema >schema.xml

In order to tell sorl about our models schema, just copy the schema.xml and put it in /etc/solr/conf

sudo cp /home/git-projs/elasticproject/schema.xml  /etc/solr/conf

and we can run solr using the following command in which folder you have downloaded 

java -jar start.jar
    By Posted On
SENIOR DEVELOPER at MICROPYRAMID

Need any Help in your Project?Let's Talk

Latest Comments
Related Articles
Integrate Django-Oscar-Accounts with Django-Oscar Shirisha Gaddi

This package uses double-entry bookkeeping where every transaction is recorded twice (once for the source and once for the destination). This ensures the books always ...

Continue Reading...
Setting Up Coveralls for Django Project Ravi Kumar Gadila

Coveraslls will check the code coverage for your test cases. To use coveralls.io your code must be hosted on GitHub or BitBucket.

install coveralls
...

Continue Reading...
Understanding 'GenericForeignKey' in Django Ashwin Kumar

In some cases the we might want to store generic model object, rather a particular specific model as 'ForeignKey'. Generic model object means adding a ...

Continue Reading...

Subscribe To our news letter

Subscribe to our news letter to receive latest blog posts into your inbox. Please fill your email address in the below form.
*We don't provide your email contact details to any third parties