Now we are going to index text content stored in structured files such as PDFs, Microsoft Office documents, and images, using Haystack and Solr.

To read and store the data, we can use the SearchBackend.extract_file_contents(self, file_obj) method. It takes a file object and returns a dictionary containing two keys: metadata and contents. The contents value is a string containing all of the text the backend managed to extract from the file.
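A minimal sketch of how the returned dictionary might be handled defensively. It assumes the backend can return None when nothing could be extracted, so both helpers fall back to empty values. The helper names `extracted_text` and `extracted_metadata` are hypothetical and not part of Haystack, and the sample dictionary is hand-written for illustration.

```python
def extracted_text(result):
    """Return the extracted text, or an empty string if extraction failed."""
    if not result:
        # Assumption: the backend returned None (or an empty dict).
        return ""
    return result.get("contents") or ""


def extracted_metadata(result):
    """Return the extracted metadata, or an empty dict if extraction failed."""
    if not result:
        return {}
    return result.get("metadata") or {}


# Hand-written stand-in for what extract_file_contents might return.
sample = {
    "contents": "Quarterly report text",
    "metadata": {"Content-Type": ["application/pdf"]},
}

text = extracted_text(sample)
meta = extracted_metadata(sample)
missing = extracted_text(None)
```

Helpers like these keep the prepare method from failing on a file the backend could not parse.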

Here we override the prepare method of NewsIndex to include the extracted content along with the information retrieved from the database:

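from django.template import Context, loader
from haystack import indexes


class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    # CharField is kept as in the original; it is assumed to be a field class
    # that accepts the `analyzer` argument.
    text = CharField(document=True, use_template=True, analyzer='synonym_analyzer')
    content = indexes.CharField(model_attr='content')

    def prepare(self, obj):
        # Start from the data Haystack prepares from the database fields.
        data = super(NewsIndex, self).prepare(obj)
        # Ask the search backend to extract text and metadata from the file.
        file_data = self._get_backend(None).extract_file_contents(obj.new_file)
        # Render the document template with the object and the extracted data.
        template = loader.select_template(
            ("search/indexes/proj/new_text.txt", ),
        )
        data["text"] = template.render(Context({
            "object": obj,
            "file_data": file_data,
        }))
        return data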

This allows you to insert the extracted text at the appropriate place in your template:

{{ file_data.contents|striptags|safe }}
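The striptags filter is useful here if the extracted contents contain markup. As a rough stand-alone illustration of what stripping tags means, here is a sketch using only the Python standard library. It is not Django's implementation of the filter, and `strip_markup` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML/XHTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_markup(fragment):
    """Return the fragment with tags removed and whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join("".join(collector.parts).split())


plain = strip_markup("<p>Hello <b>world</b></p>")
```

Only the text nodes survive, which is the form you want to feed into the search index.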

To index the documents, we need to generate a schema.xml for our models:

./manage.py build_solr_schema >schema.xml

To tell Solr about our models' schema, copy schema.xml into /etc/solr/conf:

sudo cp /home/git-projs/elasticproject/schema.xml  /etc/solr/conf

Then start Solr by running the following command from the folder where you downloaded it:

java -jar start.jar
