Python Web Scraping with BeautifulSoup


BeautifulSoup:

BeautifulSoup is a Python library for extracting data from HTML and XML files. It makes searching, navigating and parsing a document easy, and it needs far less code than doing the same work by hand.
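To give a feel for searching and navigating, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration and are not taken from the kissanime.to page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document (an assumption for this example only)
html = """
<html><body>
  <p class="intro">Pick an episode:</p>
  <a href="/ep/1">Episode 1</a>
  <a href="/ep/2">Episode 2</a>
</body></html>
"""

# html.parser ships with Python, so no extra parser needs to be installed
soup = BeautifulSoup(html, "html.parser")

# Searching: find the first <p> tag with class "intro"
intro_text = soup.find("p", class_="intro").get_text()

# Navigating: soup.a is the first <a> tag; attributes read like dict keys
first_link = soup.a["href"]

# find_all returns every matching tag as a list
link_texts = [a.get_text() for a in soup.find_all("a")]

print(intro_text)
print(first_link)
print(link_texts)
```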

Let's scrape and download all One Piece episodes from the kissanime.to website:

    How we are going to do it with BeautifulSoup:

             1. Get the source (HTML code) of the One Piece page.

             2. Find the required episode links using BeautifulSoup. (Without BeautifulSoup we would use regular expressions, which work well, but BeautifulSoup needs less code and is easier to use.)

             3. Use webkit and gtk to grab the video source URL, and finally download the episode using this URL.

Install Requirements: 

pip install beautifulsoup4
pip install pygtk

Step-1:

Get the source code (HTML) from the URL: https://kissanime.to/Anime/One-Piece/

Since the page runs JavaScript before showing the original source code, we have to get the source code from the browser itself. Open the link, press Ctrl+U and copy the HTML code to a file.

Let's name the file one_piece.html

Step-2:

Using BeautifulSoup to get the list of links:

     BeautifulSoup's find_all method is the most commonly used and the simplest way to search a whole document for the required data.

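from bs4 import BeautifulSoup


# Read the page source that was saved from the browser
with open('one_piece.html', 'r') as op_file:
    html_doc = op_file.read()


# Now we have all the source code in html_doc.
# Parse it, then use BeautifulSoup to find all the links for episodes.
soup = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the anchor tags that have an href attribute
anchor_tags_list = soup.find_all('a', href=True)

urls_list = []

for bs4_ele in anchor_tags_list:
    url = bs4_ele['href']
    # Episode pages are the only links with 'Episode' in the href
    if 'Episode' in url:
        urls_list.append(url)


print(urls_list)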

The above code displays all the episode URLs. As you can see, BeautifulSoup is very easy to use. It has many more features, such as title, which gives the title tag of the document, and the prettify method, which returns the HTML code in a nicely indented format.
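As a small illustration of title and prettify, here is a sketch on a made-up HTML string (the markup is an assumption, not the real kissanime.to page):

```python
from bs4 import BeautifulSoup

# Made-up, single-line HTML used only for this example
html = ("<html><head><title>One Piece</title></head>"
        "<body><a href='/Episode-1'>Ep 1</a></body></html>")

soup = BeautifulSoup(html, "html.parser")

# soup.title is the <title> tag; .string is the text inside it
page_title = soup.title.string

# prettify() returns the document as a string, one tag per line, indented
pretty = soup.prettify()

print(page_title)
print(pretty)
```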

Here we used the find_all method, which collects all the anchor tags that have an href attribute into a list.

After getting all the anchor tags, we keep only the ones that have 'Episode' in the href link, as these are the only URLs we need.
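The same filtering can also be done inside find_all itself by passing a compiled regular expression as the href filter. This is only an alternative sketch. The sample links below are invented for the example, and the real page may use different paths.

```python
import re
from bs4 import BeautifulSoup

# Invented sample links (assumption: episode hrefs contain 'Episode')
html = """
<a href="/Anime/One-Piece/Episode-001?id=1">Episode 1</a>
<a href="/Anime/One-Piece/Episode-002?id=2">Episode 2</a>
<a href="/Login">Login</a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

# A regex filter matches when the pattern is found anywhere in the href;
# tags without an href attribute are skipped automatically
episode_links = soup.find_all("a", href=re.compile("Episode"))
episode_urls = [a["href"] for a in episode_links]

print(episode_urls)
```

The explicit loop with the 'Episode' check is easier to read for beginners, while the regex filter keeps everything in one call.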

Step-3:

Using the gtk and webkit libraries, we are going to open a simple webkit browser window and load the episode URL in it.

We use the webkit browser because the kissanime.to site only reveals the real video link when a real browser requests the page. So we use webkit to request the page as a real browser would, and once the site responds with the real video link we grab it and start the download.

download.py

import sys
import subprocess
import gtk, webkit


def update(view, frame, resource, request, response):
    url = request.get_uri()
    if 'itqe.googlevideo' in url:
        print('\n###################################################')
        print(url) # this is the one piece video link
        subprocess.check_call(['wget', url]) # download the video with wget; the list form keeps '&' in the url away from the shell
        print('###################################################\n')
        sys.exit(0)


win = gtk.Window()
win.connect('destroy', lambda w: gtk.main_quit())
win.show()


box = gtk.HBox()
win.add(box)
web = webkit.WebView()
web.connect('resource-request-starting', update)
box.pack_start(web)


# sys.argv[1] is command line argument which will be our one piece video url
web.open(sys.argv[1])
box.show_all()


gtk.main()

Run download.py with a URL that we got by scraping as the argument:

python2.7 download.py <your_one_piece_episode_url_here>

You can combine the two scripts to automate everything, instead of running download.py by hand for every episode URL.
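One possible way to combine them is a small driver script that calls download.py once per scraped URL. This is only a sketch: BASE_URL, build_download_commands and download_all are hypothetical names, and it assumes the scraped hrefs may be relative to the site root. The driver itself is written for Python 3 and starts download.py with python2.7, as in the command above.

```python
import subprocess
from urllib.parse import urljoin

# Assumption: relative hrefs should be resolved against the site root
BASE_URL = "https://kissanime.to"


def build_download_commands(urls, script="download.py", python="python2.7"):
    """Return one command (as an argument list) per episode URL."""
    commands = []
    for url in urls:
        # urljoin leaves absolute URLs alone and completes relative ones
        absolute_url = urljoin(BASE_URL, url)
        commands.append([python, script, absolute_url])
    return commands


def download_all(urls, run=subprocess.check_call):
    """Run download.py for every URL, one episode after another."""
    for command in build_download_commands(urls):
        run(command)


# Usage, with urls_list coming from the scraping script:
# download_all(urls_list)
```

Passing each command as a list avoids shell quoting problems with characters such as '?' and '&' in the URLs.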

