The memories are written in INK, for just THINK…
Today I'm going to cover some basics of web scraping using Python. I use Python 2.7. After some googling and exploring the Python docs, I got some interesting results.
(Web scraping is a computer software technique of extracting information from websites.)
A simple way to scrape information from the web is to use the built-in Python library urllib2:
    import urllib2

    # fetch the page and read the raw HTML into a string
    response = urllib2.urlopen('http://www.sarathink.wordpress.com')
    html = response.read()
    print html
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations – like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
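To make the handlers and openers idea a bit more concrete, here is a minimal sketch of an opener that keeps cookies between requests and sends a custom User-Agent header. The helper name make_opener and the User-Agent string are my own assumptions for illustration. The try/except import is only there so the same sketch also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2 as request_lib          # Python 2.7, as used in this post
    import cookielib as cookie_lib
except ImportError:
    import urllib.request as request_lib   # Python 3 equivalent
    import http.cookiejar as cookie_lib


def make_opener(user_agent="my-scraper/0.1"):
    """Build an opener that remembers cookies and sends a custom User-Agent.

    make_opener and the default user_agent value are hypothetical names
    chosen for this example.
    """
    jar = cookie_lib.CookieJar()
    # HTTPCookieProcessor is a handler; build_opener chains it with the defaults
    opener = request_lib.build_opener(request_lib.HTTPCookieProcessor(jar))
    # replace the default "Python-urllib/x.y" header with our own
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_opener()
# Usage (needs a network connection, so it is left commented out):
# html = opener.open("http://www.sarathink.wordpress.com").read()
```

Calling opener.open(url).read() should then behave like urllib2.urlopen(url).read(), except that cookies set by the site are stored in the jar and sent back on later requests made through the same opener.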
Now the real part starts: scraping data out of the pages behind those URLs. For that I used BeautifulSoup. Parsing data is very easy with it.
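Before trying it, you have to install it. Download the tarball and run its setup script, or use the package manager:
Windows: python setup.py install (run inside the extracted tarball folder)
Ubuntu: sudo apt-get install python-beautifulsoup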
Parsing is as simple as below:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://iamsarath.blogspot.com"
    html = urllib2.urlopen(url).read()   # fetch the raw HTML
    data = BeautifulSoup(html)           # parse it into a soup object
    print data
It will print the whole parsed document for that URL. You can then navigate the soup and collect the HTML tag values you need 🙂.
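As a rough sketch of what navigating the soup can look like, the snippet below pulls the page title and all link targets out of a small hard-coded HTML string, so it runs without a network connection. The SAMPLE_HTML string and the helper names make_soup, page_title and extract_links are made up for this example. The fallback import of bs4 (BeautifulSoup 4) is an assumption for readers who do not have the older BeautifulSoup 3 package used in this post.

```python
try:
    from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, as used in this post

    def make_soup(html):
        return BeautifulSoup(html)
except ImportError:
    from bs4 import BeautifulSoup             # assumption: BeautifulSoup 4 fallback

    def make_soup(html):
        return BeautifulSoup(html, "html.parser")

# Hypothetical page content, standing in for urllib2.urlopen(url).read()
SAMPLE_HTML = """
<html><head><title>My blog</title></head>
<body>
  <a href="http://example.com/first">First</a>
  <a href="/second">Second</a>
  <a name="no-link">No href here</a>
</body></html>
"""


def page_title(soup):
    """Return the text of the <title> tag, or None if the page has none."""
    title = soup.title
    return title.string if title is not None else None


def extract_links(soup):
    """Return the href value of every <a> tag that has one."""
    links = []
    for a in soup.findAll("a"):
        href = a.get("href")   # None when the tag has no href attribute
        if href:
            links.append(href)
    return links


soup = make_soup(SAMPLE_HTML)
print(page_title(soup))
print(extract_links(soup))
```

The same two helpers should work on the data object from the example above, since it is also a soup built from fetched HTML.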