Extract data from PDF and all Microsoft Office files in Python

We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package that builds on PDFMiner and simplifies the process of extracting text from PDF files.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

import slate

# Open in binary mode so slate can read the raw PDF bytes.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
    print(pdf_text)

Output: ['Sample text...', '......', '......']

* Slate's PDF class takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example: 

import slate

# The second argument is the password for an encrypted PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")
    print(pdf_text)

Output: ['Sample text...', '......', '......']
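
Since slate returns one string per page, you will often want to merge those strings into a single block of text. The snippet below is a minimal sketch of one way to do that. The helper name pages_to_text and the sample list are made up for illustration and are not part of slate; the list only stands in for the kind of output shown above.

```python
import re


def pages_to_text(pages, separator="\n\n"):
    """Join per-page strings into one cleaned-up string (hypothetical helper)."""
    cleaned = []
    for page in pages:
        # Collapse runs of whitespace (spaces, newlines, form feeds) into one space.
        text = re.sub(r"\s+", " ", page).strip()
        if text:  # skip pages that contain no extractable text
            cleaned.append(text)
    return separator.join(cleaned)


# Stand-in for the per-page list; real extracted text will differ.
sample_pages = ["Sample  text\n on page one\x0c", "", "Page   three"]
print(pages_to_text(sample_pages))
```

The helper only needs an iterable of strings, so the pdf_text value from the examples above should be usable in place of sample_pages.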