Hey People,
Crawlers are cool. I always thought writing a crawler was a tough job, until I had to write one. I am posting a small code snippet showing how to get started writing your own crawler in Perl.
The packages you will need are:
HTML::TreeBuilder
LWP::UserAgent
HTTP::Headers
URI::Escape
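Before running anything, it can help to confirm that these modules are actually installed. Here is a minimal sketch, assuming a standard Perl 5 setup; the helper name missing_modules is made up for this post and is not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the names of the modules that fail to load.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # String eval, because the module name comes from a variable
        push @missing, $module unless eval "require $module; 1";
    }
    return @missing;
}

my @needed  = qw(HTML::TreeBuilder LWP::UserAgent HTTP::Headers URI::Escape);
my @missing = missing_modules(@needed);
print @missing
    ? "Install from CPAN first: @missing\n"
    : "All required modules are available\n";
```

Anything reported as missing can usually be installed with the cpan client, for example cpan HTML::TreeBuilder.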
Here is how to start.
If you want to crawl the results for the search text "rohit d'souza", you can start with Google as below.
use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::UserAgent;
use HTTP::Request;
use URI::Escape;

# Escape the search text so it is safe to put in a URL
my $searchtext = uri_escape("rohit d'souza");

my $ua = LWP::UserAgent->new;    # your user agent
$ua->agent("Mozilla/5.0");
$ua->timeout(30);                # the timeout is given in seconds

# Create a request
my $req = HTTP::Request->new(GET => "http://www.google.co.in/search?hl=en&q=$searchtext&btnG=Google+Search&meta=");
my $res = $ua->request($req);

my @links;
if ($res->is_success) {
    my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);
    # In list context look_down returns every matching element,
    # so the loop simply does not run when there are no <a> tags
    for my $anchor ($tree->look_down('_tag' => 'a')) {
        my $href = $anchor->attr('href');
        # Skip empty hrefs, Google and Orkut links, and links starting with "/"
        if ($href and $href !~ /google|orkut|^\//i) {
            push(@links, $href);
            #####################################################################
            # Do repeated processing here to crawl other pages from the links extracted
            #####################################################################
        }
    }
    $tree->delete;    # free the parse tree
}
else {
    print "got failure: " . $res->status_line . "\n";
}
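The "repeated processing" mentioned in the comment above could look something like the sketch below. This is only one way to do it, not the only one: it keeps a queue of URLs still to visit and a hash of URLs already seen, so no page is fetched twice. The function name crawl, the $max_pages limit and the %fake_web link graph are all made up for illustration; the fake graph stands in for real pages so the loop can be tried without touching the network.

```perl
use strict;
use warnings;

# Breadth-first crawl: visit each URL once and stop after $max_pages pages.
# $fetch_links is a code ref that takes a URL and returns the links found on it.
sub crawl {
    my ($start, $fetch_links, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (@queue and @visited < $max_pages) {
        my $url = shift @queue;
        push @visited, $url;
        for my $link ($fetch_links->($url)) {
            next if $seen{$link}++;    # already queued or visited
            push @queue, $link;
        }
    }
    return @visited;
}

# Made-up link graph standing in for real pages (assumption, illustration only)
my %fake_web = (
    'http://a.example/' => ['http://b.example/', 'http://c.example/'],
    'http://b.example/' => ['http://a.example/', 'http://d.example/'],
    'http://c.example/' => ['http://d.example/'],
    'http://d.example/' => [],
);

my @order = crawl('http://a.example/', sub { @{ $fake_web{ $_[0] } || [] } }, 10);
print "$_\n" for @order;
```

In a real crawler the code ref would do the LWP::UserAgent request and the look_down extraction from the snippet above, and return the filtered hrefs. The page limit is there so the crawl cannot run forever.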
That's all. You can build a full-fledged crawler using this logic. Mind you, the code above is a simple one. To build a professional crawler, I suggest using an object-oriented structure and moving any code that is called repeatedly into functions.
If you have any queries, you can mail me at topindiancoder@gmail.com.
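To give a rough idea of what such a structure might look like, here is a small sketch of a class that holds the link filter and the collected links, so the same methods can be called for every page. The class name SimpleCrawler and its methods are hypothetical, invented for this example; the default filter is the same pattern used in the snippet above.

```perl
use strict;
use warnings;

package SimpleCrawler;    # hypothetical class name

sub new {
    my ($class, %args) = @_;
    my $self = {
        skip  => $args{skip} || qr/google|orkut|^\//i,    # same filter as the snippet
        links => [],
        seen  => {},
    };
    return bless $self, $class;
}

# True when an href is worth keeping
sub wanted {
    my ($self, $href) = @_;
    return 0 unless defined $href and length $href;
    return $href !~ $self->{skip} ? 1 : 0;
}

# Store each wanted href once; returns how many were newly added
sub add_links {
    my ($self, @hrefs) = @_;
    my $added = 0;
    for my $href (@hrefs) {
        next unless $self->wanted($href);
        next if $self->{seen}{$href}++;
        push @{ $self->{links} }, $href;
        $added++;
    }
    return $added;
}

sub links { return @{ $_[0]{links} } }

package main;

my $crawler = SimpleCrawler->new;
my $added   = $crawler->add_links(
    'http://www.google.co.in/preferences',    # dropped: matches "google"
    '/search?q=perl',                         # dropped: starts with "/"
    'http://example.com/page',
    'http://example.com/page',                # dropped: duplicate
    undef,                                    # dropped: no href at all
);
print "kept $added link(s): ", join(', ', $crawler->links), "\n";
```

A different filter can be passed in when needed, for example SimpleCrawler->new(skip => qr/\.pdf$/i), without touching the rest of the code.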
One who never fails never grows an inch
Wednesday, January 16, 2008

1 comment:
Hello Rohit,
I am Saurav. I need your help. Can you please drop a mail to dce.saurav@gmail.com?
Please, it's urgent. I can explain through mail.
Thanks
Saurav