This post shows how to extract text from PDF and Microsoft Office files.
Extracting text from PDF:
A quick way to extract text from PDFs in Python is the "slate" library. Slate is a Python package that wraps PDFMiner to simplify text extraction. It reads the text already embedded in the PDF; it does not perform OCR on scanned images.
Installation:
$ pip install slate
$ pip install pdfminer
Usage:
    import slate

    # Open the PDF in binary mode and let slate extract the text
    with open('sample.pdf', 'rb') as f:
        pdf_text = slate.PDF(f)

    print(pdf_text)

Output:
    ['Sample text...', '......', '......']
* The PDF class in slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one per page).
* NOTE: If the PDF file is password-protected, pass the password as the second parameter.
Example:
    import slate

    # The second argument is the password of the protected PDF
    with open('test_doc.pdf', 'rb') as f:
        pdf_text = slate.PDF(f, "pass the PDF file password here")

    print(pdf_text)

Output:
    ['Sample text...', '......', '......']
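Since the result is a list with one string per page, a common next step is to tidy each page and combine the pages into a single string. The sketch below is one possible way to do that. The helper names clean_pages and join_pages are hypothetical (they are not part of slate), and sample_pages is a hand-written stand-in for the list that slate.PDF(f) returns.

```python
import re


def clean_pages(pages):
    """Collapse runs of whitespace in each page string and drop empty pages.

    Hypothetical helper, not part of slate.
    """
    cleaned = []
    for page in pages:
        # \s+ also matches newlines, tabs and form feeds left by extraction
        text = re.sub(r"\s+", " ", page).strip()
        if text:
            cleaned.append(text)
    return cleaned


def join_pages(pages, separator="\n\n"):
    """Combine the cleaned per-page strings into one document string."""
    return separator.join(clean_pages(pages))


# Stand-in for the per-page list returned by slate.PDF(f) (assumed sample data)
sample_pages = ["Sample   text\non page one.\n\x0c", "   ", "Second page\ttext."]
print(join_pages(sample_pages))
```

In real use, pass the pdf_text value from the examples above in place of sample_pages.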