Building a simple crawler - indexing the internet starting from one page

Is it possible to index the Internet starting from a single page? Yes, I have tested this, starting from iuliumaniu.ro, a site I built a long time ago for a Christian community (back when I wrote such a program for my college diploma). It comes down to a few small steps which, if followed, can collect the entire content of the Internet, or just the part of it that you need.

Steps for crawling the internet:

1. set u = starting url

2. load u

3. [?store data about page u]

4. process page u - extract links from u content

5. for each extracted link, set u = that link and go to step 2.
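
The five steps above can be sketched in Python roughly as follows. This is only a minimal sketch under my own assumptions, not a finished crawler: the names `crawl`, `fetch` and `LinkParser` are made up for illustration, the "storage" is a plain in-memory dictionary, and the `seen` set and the `max_pages` limit are additions that the steps do not mention (without them the loop would revisit the same pages forever).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Step 2: load the page over HTTP and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=50):
    """Run steps 1-5 and return {url: page_text} for the pages visited."""
    queue = deque([start_url])      # step 1: set u = starting url
    seen = {start_url}              # assumption: never queue the same URL twice
    store = {}                      # step 3: in-memory stand-in for a database
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            page = fetch(url)       # step 2: load u
        except (OSError, ValueError):
            continue                # unreachable page or unsupported URL: skip it
        store[url] = page           # step 3: store data about page u
        parser = LinkParser()
        parser.feed(page)           # step 4: extract links from u's content
        for href in parser.links:   # step 5: go to step 2 for each link
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

Calling `crawl("http://iuliumaniu.ro/")` would start from the site mentioned above. The `fetch` parameter exists so that the loop can be tried with a fake loader before it is pointed at real sites.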

The 1st step is simple: you have to select a page or site that contains some external links, so that the crawler can walk on to other sites as well. The 2nd step means that you have to get the content of that page, usually with an HTTP web request. The 3rd step can be placed before or after step 4, depending on what you need to collect (if you also need to collect the links, or you have to process the stored data, this step will probably come after step 4); it consists of implementing some kind of data storage (database, XML, ...). The page content in step 4 can be processed in several ways. I can give you 2 simple ones: parse it as XML/HTML, or process it as plain text, possibly using regular expressions. Parsing it as XML is harder to implement, but it can give you some advantages. In the end you follow all the URLs found on the page and jump back to step 2 for each of them; this ensures that, in principle, every page reachable by links from your starting point will be indexed by your application.
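
To make the two ways of processing a page a little more concrete, here is a small sketch that extracts links from the same HTML snippet once with an HTML parser and once with a regular expression. The function names, the regular expression and the sample snippet are my own illustrative choices, and the regular expression is deliberately naive (it only handles quoted `href` values).

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Way 1: let an HTML parser find the <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def links_via_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Way 2: treat the page as plain text and search it with a regular expression.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def links_via_regex(html):
    return HREF_RE.findall(html)


sample = (
    '<p><a href="/about">About</a> '
    '<A HREF="http://example.com/">Example</A> '
    '<!-- <a href="/hidden">commented out</a> --></p>'
)

print(links_via_parser(sample))  # ['/about', 'http://example.com/']
print(links_via_regex(sample))   # ['/about', 'http://example.com/', '/hidden']
```

The difference in the output shows one of the advantages of the parser: it understands that the third link sits inside an HTML comment and skips it, while the regular expression sees only text and reports it anyway.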

That is a little theory about crawlers; it is not very difficult to implement. Come back soon for a small implementation of this.


Posted by: admin
Posted on: 1/10/2010 at 5:06 PM
Categories: Articles | Crawlers | Google | Programing | SEO | WWW

Today we are stupid

In France. Remi Gaillard. Very nice :) "It is by doing just anything that you become just anyone!"

 

 

This is another way of being somebody on the web; not the simplest way, but a good one, and only if you dare to be somebody. On his web page he has posted a lot of videos like the one above. And now he is a "star", a web star. Like Britney Spears.

The question is: if we all start to act like him, what will happen on Earth? I believe that this would be the point where we forget thousands of years of civilization. Fortunately, many of us still have something in our heads. More than a brain: a working brain.

At all times and in all places, the human race has included this kind of people, with "no brain". This is sensation, this is something "new", even though this kind of event has happened before too, but accidentally, not on purpose as with Remi. So, nothing new?

Where does the success come from? Every day I see a lot of sites on this subject, "Today we are stupid": funny messages, funny jokes, funny images, but always the same "new" message/image/... in another format. And people smile. These are successful sites. Not my site, where I try to tell the world, "Yes, I know something". The Latins said "panem et circenses"; today there is more circus than ever. I am thinking: if I start my own circus, on the www of course, what should I call it? Pan Circus? Because I know the success is guaranteed.

So, today we are stupid!



Posted by: admin
Posted on: 9/18/2009 at 5:03 PM
Categories: Articles | blog | Community | WWW

Long day yesterday, long day tomorrow

But it is okay. Yesterday I made a lot of optimizations to my main site, pan-internet.com. I fixed the problems on the article page, I made it possible for registered users to add articles to my site (after I verify them; I will use this functionality myself too), and I implemented RSS syndication on the articles page.


Another great "achievement" of yesterday is that I wrote the documentation for two of my services: the web service where you can get the IP and IP-to-geolocation data, and the Little Box, which offers the same functionality but only on the client side. See the articles in the "Documentation" category.


For today I plan to solve the problem of automatically updating the proxy servers. This will be another fight with myself, but I am sure that I will win.


Also yesterday I uploaded a lot of pictures to my Picasa account! This is just a demo publication; the original pictures will stay on my computer until you ask me for them. Enjoy!


And another achievement of today is the blog: so far I have kept my promise to write an article every day in September. But here too there are many things to do!

 

It is 00:20 and I have to sleep.....


Posted by: admin
Posted on: 9/13/2009 at 12:25 AM
Categories: Articles | Community | Google | Proxy