Sarath's Web Log

The memories are written in INK,for just THINK…

Web scraping using Python

Today I'm covering some basics of web scraping using Python. I use Python 2.7. After some googling and exploring the Python docs, I got some interesting results.
(Web scraping is a computer software technique of extracting information from websites.)

A simple way to scrape information from the web is to use Python's built-in urllib2 library.

import urllib2

# Fetch the page and read the raw HTML into a string
response = urllib2.urlopen('http://www.sarathink.wordpress.com')
html = response.read()
print html

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
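To make the handlers and openers idea a bit more concrete, here is a minimal sketch of an opener that keeps cookies and can optionally go through a proxy. The helper name build_scraping_opener and the proxy_url parameter are my own placeholders, not part of urllib2, and the import fallback is only there so the same snippet also runs on Python 3.

```python
try:
    # Python 2.7, as used in this post
    import urllib2
    import cookielib
except ImportError:
    # Python 3 keeps the same classes under different module names
    import urllib.request as urllib2
    import http.cookiejar as cookielib


def build_scraping_opener(proxy_url=None):
    """Return (opener, cookie_jar). Hypothetical helper, for illustration only."""
    cookie_jar = cookielib.CookieJar()
    # Each handler adds one capability to the opener
    handlers = [urllib2.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        # Route plain HTTP requests through the given proxy
        handlers.append(urllib2.ProxyHandler({"http": proxy_url}))
    opener = urllib2.build_opener(*handlers)
    return opener, cookie_jar


# No request is sent here; the opener is only being configured
opener, cookie_jar = build_scraping_opener()
```

After that, opener.open(url) can be used in place of urllib2.urlopen(url), and any cookies the site sets are stored in cookie_jar and sent back on later requests.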

Now the real part starts: scraping data out of the pages behind those URLs. For that I used BeautifulSoup. Parsing data is very easy with it.

Before trying it, you have to install it.
Windows: download the tarball, extract it, and run python setup.py install
Ubuntu: sudo apt-get install python-beautifulsoup

Parsing is as simple as below:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://iamsarath.blogspot.com"
# Download the page, then parse the HTML into a soup object
html = urllib2.urlopen(url).read()
data = BeautifulSoup(html)
print data

It will print the whole parsed content of the URL. You can then navigate the soup and collect the HTML tag values you need 🙂.
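As a rough sketch of what that navigation can look like, the snippet below pulls the title, the first heading and all link targets out of a small page. The sample HTML and the make_soup helper are made up for illustration, and the bs4 fallback is an assumption for setups where the newer BeautifulSoup package is installed instead of the one used in this post.

```python
try:
    # BeautifulSoup 3, the version installed above (Python 2 only)
    from BeautifulSoup import BeautifulSoup

    def make_soup(html):
        return BeautifulSoup(html)
except ImportError:
    # Assumption: the newer bs4 package is available instead
    from bs4 import BeautifulSoup

    def make_soup(html):
        return BeautifulSoup(html, "html.parser")

# Made-up sample page, so the example does not depend on the network
html = """
<html><head><title>Sample page</title></head>
<body>
<h1>Posts</h1>
<a href="/first-post">First post</a>
<a href="/second-post">Second post</a>
</body></html>
"""

soup = make_soup(html)

# Tags can be reached as attributes of the soup
page_title = soup.title.string

# find returns the first matching tag, findAll returns all of them
first_heading = soup.find("h1").string
links = [(a["href"], a.string) for a in soup.findAll("a")]

print(page_title)
print(first_heading)
for href, text in links:
    print("%s -> %s" % (href, text))
```

With a real page you would pass the html string fetched with urllib2 to make_soup instead of the sample text.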