August 24, 2012
Today I'm covering some basics of web scraping with Python. I use Python 2.7. After some googling and exploring the Python docs, I got some interesting results.
(Web scraping is a software technique for extracting information from websites.)
A simple way to scrape information from the web is to use the built-in Python module urllib2:
import urllib2
response = urllib2.urlopen('http://www.sarathink.wordpress.com')
html = response.read()
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
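To make the handlers and openers idea a bit more concrete, here is a minimal sketch of an opener that keeps cookies between requests and sends a custom User-Agent header. The function names build_scraper_opener and fetch, and the User-Agent string, are made up for this illustration; the Python 3 fallback imports are only there so the same sketch also runs on newer interpreters.

```python
try:
    # Python 2.7, as used in this post
    import urllib2 as request_lib
    from urllib2 import URLError
    from cookielib import CookieJar
except ImportError:
    # Python 3: the same classes live in urllib.request, urllib.error
    # and http.cookiejar
    import urllib.request as request_lib
    from urllib.error import URLError
    from http.cookiejar import CookieJar


def build_scraper_opener(user_agent="example-scraper/0.1"):
    """Return (opener, cookie_jar). Both names are hypothetical helpers."""
    cookie_jar = CookieJar()
    # HTTPCookieProcessor is a handler; build_opener chains it with the
    # default handlers (HTTP, FTP, file, redirects, proxies).
    opener = request_lib.build_opener(
        request_lib.HTTPCookieProcessor(cookie_jar))
    # Headers in addheaders are sent with every request made by this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


def fetch(opener, url, timeout=10):
    """Return the response body, or None if the URL could not be opened."""
    try:
        response = opener.open(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except URLError:
        # HTTPError is a subclass of URLError, so HTTP status errors
        # end up here as well.
        return None
```

With this, something like html = fetch(opener, url) can stand in for the plain urlopen call, and the cookie jar fills up as sites set cookies. Returning None on failure is just one possible design; re-raising or logging the error may suit a real scraper better.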
Now the real part starts: scraping data from the fetched pages. For that I used BeautifulSoup, which makes parsing the data very easy.
Before trying it, you have to install it. On Windows, download the tarball and run its setup script; on Ubuntu, use the package manager:
Windows: python setup.py install
Ubuntu: sudo apt-get install python-beautifulsoup
Parsing is as simple as below:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://iamsarath.blogspot.com"
html = urllib2.urlopen(url).read()
data = BeautifulSoup(html)
print data.prettify()
This prints the whole parsed page for the URL. You can then navigate the soup and collect the HTML tag values you need 🙂.
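As a rough sketch of what navigating the soup can look like, the snippet below pulls the page title and all link targets out of a small HTML string. The sample HTML and the helper names make_soup, page_title and extract_links are invented for this example. It tries the BeautifulSoup 3 import used in this post first and, as an assumption, falls back to the newer bs4 package if that is what is installed.

```python
try:
    # BeautifulSoup 3, the version imported in this post (Python 2 only)
    from BeautifulSoup import BeautifulSoup

    def make_soup(html):
        return BeautifulSoup(html)
except ImportError:
    # Assumption: the newer bs4 package is installed instead
    from bs4 import BeautifulSoup

    def make_soup(html):
        return BeautifulSoup(html, "html.parser")

# Made-up sample page so the example does not need network access
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
<a href="http://example.com/one">One</a>
<a href="/two">Two</a>
<a name="anchor-only">No link here</a>
</body></html>
"""


def page_title(soup):
    """Return the text inside the <title> tag."""
    return soup.title.string


def extract_links(soup):
    """Return (text, href) pairs for every <a> tag that has an href."""
    # href=True keeps only the tags that actually carry that attribute
    return [(a.string, a["href"]) for a in soup.findAll("a", href=True)]


soup = make_soup(SAMPLE_HTML)
title = page_title(soup)
links = extract_links(soup)
```

The same two helpers should work on the soup built from a fetched page, since they only rely on the title attribute, findAll and tag["href"] lookups.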