Extract data from PDF and all Microsoft Office files in Python

We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the "slate" library, a small wrapper around PDFMiner that simplifies text extraction. Note that slate reads the text embedded in the PDF; it does not perform OCR, so scanned, image-only pages yield no text.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

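import slate

# Open the PDF in binary mode; slate.PDF expects a file-like object.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)

# pdf_text behaves like a list with one string per page.
print(pdf_text)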

Output: ['Sample text...', '......', '......']

* slate's PDF class takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example:

import slate

# The second argument is the password of the protected PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")

print(pdf_text)

Output: ['Sample text...', '......', '......']
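
Since the result is a list with one string per page, it can be processed like any other Python list. The snippet below is a minimal sketch of that idea, not part of slate itself: summarize_pages is a hypothetical helper name, and the pages list is a stand-in for the value returned by slate.PDF(f), so the snippet runs without a PDF file.

```python
def summarize_pages(pages):
    """Return (page_number, character_count, first_line) for each page.

    `pages` is assumed to be a list of strings, one per page,
    such as the object returned by slate.PDF(f).
    """
    summary = []
    for number, text in enumerate(pages, start=1):
        stripped = text.strip()
        # Empty pages (e.g. scanned images without embedded text) have no first line.
        first_line = stripped.splitlines()[0] if stripped else ""
        summary.append((number, len(text), first_line))
    return summary


# Stand-in data; in real use this would be: pages = slate.PDF(f)
pages = ["Sample text...\nmore text", "", "Third page"]

for number, size, first_line in summarize_pages(pages):
    print("page %d: %d chars, starts with %r" % (number, size, first_line))

# Joining the pages gives the whole document as a single string.
full_text = "\n".join(pages)
```

Keeping the helper independent of slate means the same code should work with any extractor that returns one string per page.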

Posted On 21 September 2014 By MicroPyramid

