Python Web Scraping with BeautifulSoup


BeautifulSoup:

BeautifulSoup is a Python library for extracting data from HTML and XML files. It makes searching, navigating, and parsing a document easy, and it takes far less code than doing the same work by hand.
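To make searching, navigation, and parsing concrete, here is a minimal sketch on a small hand-written document. The HTML snippet and the variable names are made up for illustration, and the code assumes Python 3 with beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div id="menu">
    <a href="/home">Home</a>
    <a href="/about">About</a>
  </div>
  <p class="note">Hello, <b>world</b>!</p>
</body></html>
"""

# html.parser is Python's built-in parser, so no extra package is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching: find() returns the first match, find_all() returns every match.
first_link = soup.find("a")
all_hrefs = [a["href"] for a in soup.find_all("a")]

# Navigation: move from a tag to its parent tag.
menu_id = first_link.parent["id"]

# Parsing: get_text() returns the text with the markup stripped.
note_text = soup.find("p", class_="note").get_text()

print(first_link.get_text())
print(all_hrefs)
print(menu_id)
print(note_text)
```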

Let's scrape and download all One Piece episodes from the kissanime.to website:

    How we are going to do it with BeautifulSoup:

              1. Get the source (HTML code) of the One Piece page.

              2. Find the required episode links using BeautifulSoup. (Without BeautifulSoup we would use regular expressions, which work well, but BeautifulSoup needs less code and is easier to use.)

              3. Use webkit and gtk to grab the video source URL, then download the episode from that URL.
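For comparison, the regular-expression route mentioned in step 2 could look roughly like the sketch below. The sample HTML and the link format are assumptions made up for illustration, not copied from the real page, and the pattern only handles double-quoted href attributes.

```python
import re

# Made-up sample markup; the real page's links may look different.
SAMPLE_HTML = """
<a href="/Anime/One-Piece/Episode-001?id=1">Episode 001</a>
<a href="/Anime/One-Piece/Episode-002?id=2">Episode 002</a>
<a href="/Login">Login</a>
"""

# Capture the value of every double-quoted href inside an <a ...> tag.
# Single-quoted or unquoted attributes are missed; an HTML parser handles those.
HREF_PATTERN = re.compile(r'<a\s[^>]*href="([^"]*)"', re.IGNORECASE)

episode_urls = [url for url in HREF_PATTERN.findall(SAMPLE_HTML) if "Episode" in url]
print(episode_urls)
```

The pattern has to anticipate how the markup is written, which is the main reason a parser such as BeautifulSoup tends to be the easier choice.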

Install Requirements: 

pip install beautifulsoup4
pip install pygtk

Step-1:

Get the source code (HTML) from the URL: https://kissanime.to/Anime/One-Piece/

Since the page runs JavaScript before showing the real source code, we have to get the source from the browser itself: open the link, press Ctrl+U, and copy the HTML code into a file.

Let's name the file one_piece.html.

Step-2:

Using BeautifulSoup to get the list of links:

     BeautifulSoup's find_all method is the simplest and most commonly used way to search the whole document for the data you need.

from bs4 import BeautifulSoup


with open('one_piece.html', 'r') as op_file:
    html_doc = op_file.read()


# Now we have all the source code in html_doc.
# Parse it so that BeautifulSoup can search it.
soup = BeautifulSoup(html_doc, 'html.parser')


# Find all the anchor tags that have an href attribute.
anchor_tags_list = soup.find_all('a', href=True)


# Keep only the links that point to episodes.
urls_list = []

for bs4_ele in anchor_tags_list:

    url = bs4_ele['href']

    if 'Episode' in url:

        urls_list.append(url)



print(urls_list)

The above code prints all the episode URLs. As you can see, BeautifulSoup is very easy to use. It offers much more, such as the title attribute, which gives the document's title tag, and the prettify method, which returns the HTML as a neatly indented string.

Here we used the find_all method, which collects into a list every anchor tag that has an href attribute.

After getting all the anchor tags, we keep only those with 'Episode' in the href, as these are the only URLs we need.
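To try title, prettify, and the href filter without the saved page, here is a small sketch on inline HTML. The markup is made up for illustration and is not taken from kissanime.to; the code assumes Python 3 and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = (
    "<html><head><title>One Piece</title></head><body>"
    '<a href="/Anime/One-Piece/Episode-001">Episode 001</a>'
    '<a name="top">No href here</a>'
    '<a href="/Login">Login</a>'
    "</body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.title is the <title> tag; .string is the text inside it.
page_title = soup.title.string

# href=True keeps only the anchors that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Same condition as in the script above: keep the episode links only.
episode_urls = [url for url in hrefs if "Episode" in url]

print(page_title)
print(episode_urls)

# prettify() returns the document as an indented string.
print(soup.prettify())
```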

Step-3:

Using the gtk and webkit libraries, we open a simple webkit browser window and load the episode URL in it.

We use a webkit browser because the kissanime.to site only reveals the real video link when a real browser requests the page. So we let webkit request the page as a real browser would, and once the site responds with the real video link, we grab it and start the download.
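The "grab the video link" part comes down to checking every URL the browser requests and picking the one served from the video host. A minimal sketch of such a check follows. The helper name is_video_url and the sample URLs are hypothetical, and the code is written for Python 3 (under Python 2.7 the import would be from urlparse import urlparse).

```python
from urllib.parse import urlparse


def is_video_url(url, marker="googlevideo"):
    """Return True if the host part of url contains marker."""
    # hostname is None for URLs without a network location.
    host = urlparse(url).hostname or ""
    return marker in host


# Made-up URLs for illustration only.
print(is_video_url("https://r3---sn-abc.googlevideo.com/videoplayback?id=1"))
print(is_video_url("https://kissanime.to/Anime/One-Piece/Episode-001"))
print(is_video_url("https://example.com/?next=googlevideo"))
```

Checking only the host, rather than the whole URL, avoids matching a page that merely mentions the marker in its query string.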

download.py

import sys
import subprocess
import gtk, webkit


def update(view, frame, resource, request, response):
    # called by webkit for every resource the page is about to request
    url = request.get_uri()
    if 'itqe.googlevideo' in url:
        print('\n###################################################')
        print(url) # this is the one piece video link
        # download the video using the wget utility; passing a list avoids
        # the shell splitting the url at characters like '&'
        subprocess.check_call(['wget', url])
        print('###################################################\n')
        sys.exit(0)


win = gtk.Window()
win.connect('destroy', lambda w: gtk.main_quit())
win.show()


box = gtk.HBox()
win.add(box)
web = webkit.WebView()
web.connect('resource-request-starting', update)
box.pack_start(web)


# sys.argv[1] is the command line argument, which will be our one piece episode url
web.open(sys.argv[1])
box.show_all()


gtk.main()

Run download.py with a URL that we got by scraping as its argument:

python2.7 download.py <your_one_piece_episode_url_here>

You can combine the above two scripts to automate everything, instead of running download.py by hand for every episode URL.
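One way to combine them, sketched under a few assumptions: the scraped hrefs may be relative to the site root, download.py sits in the current directory, and python2.7 is on the PATH. The helpers build_commands and download_all are hypothetical names, and this driver is written for Python 3.

```python
import subprocess
from urllib.parse import urljoin

BASE_URL = "https://kissanime.to/Anime/One-Piece/"


def build_commands(hrefs, base_url=BASE_URL, script="download.py"):
    """Turn scraped hrefs into one download.py command line per episode."""
    commands = []
    for href in hrefs:
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        episode_url = urljoin(base_url, href)
        commands.append(["python2.7", script, episode_url])
    return commands


def download_all(hrefs):
    """Run the commands one after another; stops at the first failure."""
    for command in build_commands(hrefs):
        subprocess.check_call(command)


# Made-up hrefs for illustration; in practice this would be urls_list from Step-2.
commands = build_commands([
    "/Anime/One-Piece/Episode-001?id=1",
    "https://kissanime.to/Anime/One-Piece/Episode-002?id=2",
])
for command in commands:
    print(" ".join(command))
```

Calling download_all(urls_list) would then start one download.py process per episode, each opening its own browser window and exiting once the video has been fetched.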

    By Posted On
SENIOR DEVELOPER at MICROPYRAMID
