We will see how to extract text from PDF files and from Microsoft Office documents.
Extracting text from PDF:
A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package, built on PDFMiner, that simplifies the process of extracting text from PDF files.
Installation:
$ pip install slate
$ pip install pdfminer
Usage:
import slate

# Open the PDF in binary mode and let slate extract the text of every page.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
    print(pdf_text)
Output: ['Sample text...', '......', '......']
* The PDF class in slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings, one for each page.
* NOTE: If the PDF file is password-protected, pass the password as the second parameter.
Example:
import slate

# The second argument is the password of the protected PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")
    print(pdf_text)
Output: ['Sample text...', '......', '......']
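Because the result is a list with one string per page, it can be post-processed like any other Python list. The sketch below is only an illustration: the helper names clean_pages and join_pages are hypothetical and not part of slate, and the sample list stands in for the value returned by slate.PDF(f). It assumes the page strings may contain stray line breaks, tabs, or form-feed characters that you want collapsed into single spaces.

```python
import re


def clean_pages(pages):
    """Collapse whitespace runs in each page string and drop empty pages."""
    cleaned = []
    for page in pages:
        # \s also matches form feeds (\x0c), newlines, and tabs.
        text = re.sub(r'\s+', ' ', page).strip()
        if text:
            cleaned.append(text)
    return cleaned


def join_pages(pages, separator='\n\n'):
    """Join the cleaned pages into one string, separated by blank lines."""
    return separator.join(clean_pages(pages))


# Stand-in for the list of page strings returned by slate.PDF(f).
pdf_text = ['Sample  text\non page one.\n\x0c', '   ', 'Page\ttwo text.\x0c']
print(join_pages(pdf_text))
```

Keeping the cleanup separate from the extraction step means the same helpers work on any list of page strings, whichever library produced them.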