Thanks to KermM for convincing me to make a topic for this :)
Over the last semester, I started a project with two friends of mine whose purpose was to scan the entire web recursively, following the links found on each webpage we scanned. Starting from a seed website, such as cemetech.net, the program would find every link on that page, then scan the linked pages for their links, and so on, until, theoretically, we had scanned every single page on the Internet and established how it connects to every other page. A completed setup would look very similar to a computer's folder hierarchy, with each webpage represented by its own directory, or "node", containing links to the node of every other webpage it linked to.
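The recursive scan described above can be sketched as a breadth-first traversal. This is only an illustration of the idea, not our actual code; the extract_links() helper (which would fetch a page and pull out its links) is hypothetical.

```python
# A minimal sketch of the recursive link-scanning idea as a breadth-first
# crawl. extract_links() is a hypothetical helper that fetches a page and
# returns the URLs it links to.
from collections import deque

def crawl(seed, extract_links, max_pages=100):
    """Visit pages breadth-first, recording which pages link to which."""
    graph = {}                # page -> list of pages it links to (its "node")
    queue = deque([seed])
    seen = {seed}
    while queue and len(graph) < max_pages:
        page = queue.popleft()
        links = extract_links(page)   # scan the page for all of its links
        graph[page] = links
        for link in links:
            if link not in seen:      # only queue pages we haven't visited
                seen.add(link)
                queue.append(link)
    return graph
```

The resulting graph dictionary is the folder-hierarchy picture: each key is a node, and its value is the list of nodes it links to.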
I first created a proof of concept that ran on a single computer: it started at www.google.com and logged the webpages it found until I told it to stop. Within a couple of minutes of running, the program had traversed to www.apple.com and discovered things such as the iTunes installer and the eBooks section of the iTunes Store.
However, I knew that a single computer and its Internet connection would never be sufficient to fully scan the web. Thus, the project grew into a crowdsourced program, designed so that tens, hundreds, or thousands of people could scan the web and relay the results back to a central server (or servers). A connection would look like this:
Code:
Client -> Server: Client Hello: Query
Server -> Client: Webpage to scan
Client: scans webpage and finds all links
Client -> Server: Sends links
Server -> Client: Server ACK, end connection
Server: Finds all non-duplicate links and creates nodes for said links
Server: Adds links to newly created nodes in the scanned webpage's node.
Server: Adds non-duplicate links to central database of webpages to be scanned.
Server: Repeat!
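One round of the server side of that exchange can be sketched as follows. This is a simplification under my own assumptions, not the real protocol code: the message format is abstracted away, and client_scan stands in for the client's scan-and-send step.

```python
# A rough sketch of one round of the client/server exchange above.
# client_scan is a stand-in for the client scanning a page and sending
# back its links; the real protocol runs over a network connection.

def server_round(database, nodes, client_scan):
    """Hand one pending URL to a client, then fold its results back in."""
    url = database.pop(0)              # "Webpage to scan"
    links = client_scan(url)           # client scans the page, sends links
    for link in links:
        if link not in nodes:          # non-duplicate: create a node for it
            nodes[link] = []
            database.append(link)      # queue it for a future scan
        nodes[url].append(link)        # record the link in the scanned node
    return url, links
```

Each round pops one page from the central database, creates nodes for any links not seen before, and queues those links to be scanned in later rounds, exactly as in the step list above.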
The server was built with a verification procedure: each link had to be scanned by two separate clients, and the two sets of results had to match. If they didn't, both results were thrown out and the webpage was rescanned. The program also had built-in robots.txt support, so as to avoid websites that didn't want to be scanned.
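The two checks described above could look something like this. The verify function is an illustrative take on the double-scan rule, and the robots.txt check leans on Python's standard-library robot parser; the example robots.txt text is made up.

```python
# A minimal sketch of the double-scan verification rule: two independent
# scans of the same page must agree before the result is accepted;
# otherwise both are discarded and the page goes back in the queue.
from urllib.robotparser import RobotFileParser

def verify(scan_a, scan_b):
    """Accept the link list only if both clients reported the same links."""
    if sorted(scan_a) == sorted(scan_b):   # link order shouldn't matter
        return sorted(scan_a)              # accepted result
    return None                            # mismatch: rescan the page

def allowed_to_scan(robots_txt, page_url, agent="*"):
    """Check a page against the site's robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the raw robots.txt text
    return parser.can_fetch(agent, page_url)
```

A real client would fetch robots.txt from the site before scanning; here the text is passed in directly so the check is self-contained.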
The hope was to eventually create a visualization of the web, which showed all the connections and intersections between each node in one enormous web.
The framework is essentially complete on both the server and the client; the final step is to finish debugging both and get the scanning part of the client up and running.
While I have been very busy with school recently, I hope to finish both programs by the end of June (earlier if help comes along) and to have the Internet scanned by the end of next summer.
Thoughts or offers of help appreciated!!