Python

Python Web Scraping with Beautiful Soup

2022-07-19

BeautifulSoup:

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It makes searching, navigating, and parsing documents easy, with very little code.

Let's scrape and download all One Piece episodes from the kissanime.to website.

How are we going to do it with BeautifulSoup:

1. Get the source (HTML code) of the One Piece page.

2. Find the required episode links using BeautifulSoup. (Without BeautifulSoup we would be reaching for regular expressions, which work, but BeautifulSoup needs less code and is easier to use.)

3. Use webkit and gtk to grab the video source URL and finally download the episode from that URL.

Install Requirements:

pip install beautifulsoup4
pip install pygtk

Step-1:

Get the source code (HTML) from the URL: https://kissanime.to/Anime/One-Piece/

Since the page runs JavaScript before rendering its final source, we have to get the source code from the browser itself: open the link, press Ctrl+U to view the source, and copy the HTML into a file.

Let's name the file one_piece.html.

Step-2:

Using BeautifulSoup to get the list of links:

BeautifulSoup's find_all method is the most commonly used and simplest way to search the whole document for the data you need.

from bs4 import BeautifulSoup


with open('one_piece.html', 'r') as op_file:
    html_doc = op_file.read()


# Now we have all the source code in html_doc
# lets use BeautifulSoup to find all the links for episodes
soup = BeautifulSoup(html_doc, 'html.parser')
anchor_tags_list = soup.find_all('a', href=True)


urls_list = []

for bs4_ele in anchor_tags_list:
    url = bs4_ele['href']
    if 'Episode' in url:
        urls_list.append(url)


print(urls_list)

The above code prints all the episode URLs. As you can see, BeautifulSoup is very easy to use, and it has many more helpers: the title attribute gives you the document's title tag, and the prettify method returns the HTML re-indented in a readable format.
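As a quick illustration of those two helpers, here is a tiny made-up document (not the real kissanime page):

```python
from bs4 import BeautifulSoup

# A small, hypothetical document just to demonstrate title and prettify
html_doc = "<html><head><title>One Piece Episodes</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just the text: One Piece Episodes
print(soup.prettify())    # the same HTML re-indented, one tag per line
```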

Here we used the find_all method, which collects into a list all the anchor tags that have an href attribute.

After getting all the anchor tags, we keep only the ones whose href contains 'Episode', since those are the only URLs we need.
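The same filter can also be written as a one-line list comprehension; this sketch runs against a small inline sample standing in for one_piece.html:

```python
from bs4 import BeautifulSoup

# Inline sample HTML standing in for the saved one_piece.html page
sample = """
<a href="/Anime/One-Piece/Episode-001">Episode 001</a>
<a href="/Anime/One-Piece">Anime info</a>
<a>no href at all</a>
"""
soup = BeautifulSoup(sample, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute
urls_list = [a['href'] for a in soup.find_all('a', href=True)
             if 'Episode' in a['href']]
print(urls_list)  # ['/Anime/One-Piece/Episode-001']
```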

Step-3:

Using the gtk and webkit libraries, we are going to open a simple WebKit browser and load the URL.

We use a WebKit browser because kissanime.to only reveals the real video link when a real browser requests the page. WebKit makes our request look like a real browser, and once the response contains the real video link we grab it and start downloading.

download.py

import sys
import subprocess
import gtk, webkit


def update(view, frame, resource, request, response):
    url = request.get_uri()
    if 'itqe.googlevideo' in url:
        print('\n###################################################')
        print(url) # this is the one piece video link
        subprocess.check_call(['wget', url]) # download the video with the wget utility (an argument list avoids the shell, so characters like '&' in the url are safe)
        print('###################################################\n')
        sys.exit(0)


win = gtk.Window()
win.connect('destroy', lambda w: gtk.main_quit())
win.show()


box = gtk.HBox()
win.add(box)
web = webkit.WebView()
web.connect('resource-request-starting', update)
box.pack_start(web)


# sys.argv[1] is command line argument which will be our one piece video url
web.open(sys.argv[1])
box.show_all()


gtk.main()

Run download.py with a URL that we got from scraping as the argument:

python2.7 download.py <your_one_piece_episode_url_here>

You can combine the two scripts above to automate everything, instead of running download.py by hand for every episode URL.
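One way to glue the two steps together is a small driver script. This is only a sketch: it assumes one_piece.html is saved next to the scripts, reuses the Step-2 filter as a function, and calls download.py once per episode (since download.py exits after grabbing one video).

```python
import os
import subprocess
from bs4 import BeautifulSoup


def collect_episode_urls(html_doc):
    # Same filtering as Step-2: keep hrefs that mention 'Episode'
    soup = BeautifulSoup(html_doc, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if 'Episode' in a['href']]


# Only start downloading when the saved page is actually present
if os.path.exists('one_piece.html'):
    with open('one_piece.html', 'r') as op_file:
        episode_urls = collect_episode_urls(op_file.read())
    for url in episode_urls:
        # download.py exits after one download, so run it per episode url
        subprocess.check_call(['python2.7', 'download.py', url])
```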