Python Web Scraping with Beautiful Soup

BeautifulSoup:

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It makes searching, navigating and parsing a document easy, and it takes far less code than doing the same work by hand.

Let's scrape and download all One Piece episodes from the kissanime.to website.

How we are going to do it with BeautifulSoup:

    1. Get the source (HTML code) of the One Piece page.

    2. Find the required episode links using BeautifulSoup. (Without BeautifulSoup we would use regular expressions, which work well, but BeautifulSoup needs less code and is easier to use.)

    3. Use webkit and gtk to grab the video source URL, and finally download the episode using this URL.
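
For comparison, the regular-expression route mentioned in step 2 might look roughly like the sketch below. The sample HTML and its link format are made up for illustration, not copied from the real page, and the pattern only handles double-quoted href attributes, which is exactly the kind of fragility BeautifulSoup saves us from.

```python
import re

# Hypothetical snippet standing in for the saved page source.
sample_html = '''
<a href="/Anime/One-Piece/Episode-001?id=1">Episode 1</a>
<a href="/Anime/One-Piece">Series page</a>
<a class="ep" href="/Anime/One-Piece/Episode-002?id=2">Episode 2</a>
'''

# Capture the value of every double-quoted href attribute.
HREF_PATTERN = re.compile(r'href="([^"]*)"')

hrefs = HREF_PATTERN.findall(sample_html)

# Keep only the links that point at an episode.
episode_urls = [href for href in hrefs if 'Episode' in href]
print(episode_urls)
```

A page that used single quotes or unquoted attributes would slip past this pattern, while an HTML parser handles all of those forms for us.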

Install Requirements: 

pip install beautifulsoup4
pip install pygtk

Step-1:

Get the source code (HTML) from the URL: https://kissanime.to/Anime/One-Piece/

Since the page runs JavaScript before showing the real source code, we have to get the source from the browser itself. Open the link, press Ctrl+U, and copy the HTML code to a file.

Let's name the file one_piece.html.

Step-2:

Using BeautifulSoup to get the list of links:

BeautifulSoup's find_all method is the most commonly used and the simplest way to search the whole document for the data you need.

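from bs4 import BeautifulSoup


# Read the page source that was saved from the browser in Step-1
with open('one_piece.html', 'r') as op_file:
    html_doc = op_file.read()


# Now we have all the source code in html_doc
# let's use BeautifulSoup to find all the links for episodes
soup = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the anchor tags that have an href attribute
anchor_tags_list = soup.find_all('a', href=True)


urls_list = []

for bs4_ele in anchor_tags_list:

    url = bs4_ele['href']

    # episode links are the ones with 'Episode' in the href
    if 'Episode' in url:

        urls_list.append(url)


print(urls_list)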

The above code displays all the episode URLs. As you can see, BeautifulSoup is very easy to use. It has many more features, such as the title attribute, which gives the title tag of the document, and the prettify method, which returns the HTML code in a nicely indented format.
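
As a quick illustration of title and prettify, here is a small sketch. It parses a tiny made-up HTML string rather than the real one_piece.html, so the markup is only an assumption for demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this demonstration.
sample_html = (
    '<html><head><title>One Piece</title></head>'
    '<body><a href="/Episode-001">Episode 1</a></body></html>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside the tag
print(soup.prettify())    # the document re-indented, one tag per line
```

Note that title is an attribute, while prettify is a method and needs the parentheses.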

Here we used the find_all method with href=True, which collects in a list all the anchor tags that have an href attribute.

After getting all the anchor tags, we keep only those with 'Episode' in the href link, as these are the only URLs we need.
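
The two steps described above (collect the anchors, then filter on 'Episode') can also be folded into a single find_all call by passing a compiled regular expression as the href filter. The sketch below compares both versions on a made-up HTML sample; the link format is an assumption, not taken from the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample; the real page will contain many more tags.
sample_html = '''
<a href="/Anime/One-Piece/Episode-001?id=1">Episode 1</a>
<a name="top">No href here</a>
<a href="/Anime/One-Piece">Series page</a>
<a href="/Anime/One-Piece/Episode-002?id=2">Episode 2</a>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# Two-step version: every anchor with an href, then a substring test.
two_step = [a['href'] for a in soup.find_all('a', href=True)
            if 'Episode' in a['href']]

# One-step version: the pattern is searched for in each href value,
# and tags without an href are skipped.
one_step = [a['href'] for a in soup.find_all('a', href=re.compile('Episode'))]

print(two_step == one_step)
```

Either form works; the two-step loop is easier to extend when the filter needs more than one condition.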

Step-3:

Using the gtk and webkit libraries, we are going to open a simple webkit browser and load the episode URL in it.

We use a webkit browser because the kissanime.to site only reveals the real video link when a real browser requests the page. So we let webkit request the page as a real browser would, and once the site responds with the real video link we grab it and start downloading.

download.py

import sys
import subprocess
import gtk, webkit


def update(view, frame, resource, request, response):
    # called for every resource the page requests
    url = request.get_uri()
    if 'itqe.googlevideo' in url:
        print('\n###################################################')
        print(url) # this is the one piece video link
        # downloading the video using wget utility; passing the arguments as a
        # list avoids the shell, which would cut the command at any '&' in the url
        subprocess.check_call(['wget', url])
        print('###################################################\n')
        sys.exit(0)


win = gtk.Window()
win.connect('destroy', lambda w: gtk.main_quit())
win.show()


box = gtk.HBox()
win.add(box)
web = webkit.WebView()
web.connect('resource-request-starting', update)
box.pack_start(web)


# sys.argv[1] is command line argument which will be our one piece video url
web.open(sys.argv[1])
box.show_all()


gtk.main()

Run download.py with a URL that we got by scraping as the argument:

python2.7 download.py <your_one_piece_episode_url_here>

You can combine the above two scripts to automate everything, instead of running download.py by hand for every episode URL.
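
One possible way to combine them is sketched below. It assumes the scraped links may be relative and so joins them onto the site root; BASE_URL, build_commands and download_all are hypothetical names introduced for this example, and the sample link is made up.

```python
import subprocess

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Site root, taken from the URL in Step-1 (assumed base for relative links).
BASE_URL = 'https://kissanime.to'


def build_commands(hrefs, script='download.py', interpreter='python2.7'):
    """Return one argument list per episode link, resolving relative links."""
    return [[interpreter, script, urljoin(BASE_URL, href)] for href in hrefs]


def download_all(hrefs):
    """Run download.py once per episode, one after another."""
    for command in build_commands(hrefs):
        subprocess.check_call(command)


# Show the command that would be run for one hypothetical episode link.
commands = build_commands(['/Anime/One-Piece/Episode-001?id=1'])
print(commands[0])
```

Calling download_all(urls_list) with the list built in Step-2 would then fetch the episodes one by one. Each download opens its own browser window and stops at the first failure, since check_call raises an error on a non-zero exit status.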

Posted On 05 November 2012 By MicroPyramid

