Look Ma, everyone's computing out there for me!

SETI@Home is probably the greatest example of low-cost distributed computing that became a big hit. After its tremendous success, many others followed the same strategy and used the power of distributed computing for other purposes, like cancer research. In this article I will show you how we can tap the same power at almost zero cost, especially for your web applications.

I am currently working on building an open source version of FriendFeed (not targeted to be an alternative, because the people at FriendFeed have done their job really well) and on scaling such a huge load effectively at low cost, so I will mainly talk about FriendFeed throughout this blog post and use it as an example for my proposal.

If you consider FriendFeed as a repository of feed URLs, a lot of people, and the relations among them, you can imagine how big it is, or could be, in the near future. Scaling such a service costs many developers many sleepless nights. So let's focus on where the problem actually is and how we can introduce distributed computing.

Besides optimizing the database to serve huge sets of data, one of the main problems of such a service is parsing millions of feeds at regular intervals. If you want to bear the whole load on your own servers, fine, if you can afford it. But what about a low-cost solution? Consider a simple scenario: if your application has one million users and each of them browses it for 10 minutes a day, you have roughly 10 million browser-minutes of computational power a day just going to waste out there, in the lanes and by-lanes of the internet, heh heh. So let's make use of that incredible CPU power. All you have to do is let the visitors' machines do some of the calculations for you and free your server from the gigantic load.

Since the users of your application and the relations among them are stored in your database, you can easily find the first-degree and second-degree friends of a specific user. If you don't know what that means, it's simple: if A is a friend of B and C is a friend of A, then A is B's first-degree friend and C is B's second-degree friend. For a huge social network, it may look like the following when you visualize the relationships.


image courtesy: http://prblog.typepad.com
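To make the idea concrete, here is a minimal sketch of how first- and second-degree friends could be derived from a plain friendship list. The data and the function names are purely illustrative; in the real application this would simply be a couple of queries against the friends table.

```javascript
// Friendships as an adjacency map: user -> array of direct (first-degree) friends.
// Illustrative data only; in practice this comes from the database.
var friends = {
  "B": ["A"],
  "A": ["B", "C"],
  "C": ["A"]
};

// First-degree friends of a user.
function firstDegree(user) {
  return friends[user] || [];
}

// Second-degree friends: friends of friends, excluding the user
// and anyone who is already a first-degree friend.
function secondDegree(user) {
  var direct = firstDegree(user);
  var result = [];
  direct.forEach(function (f) {
    (friends[f] || []).forEach(function (ff) {
      if (ff !== user && direct.indexOf(ff) === -1 && result.indexOf(ff) === -1) {
        result.push(ff);
      }
    });
  });
  return result;
}

console.log(secondDegree("B")); // ["C"]
```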

Now, what we want to do is this: when B is visiting our application, we parse most of his/her second-degree friends' feeds on the client, using his browser. While generating the page for B, we supply him with a bunch of feed URLs, a hash of their last known update time (or a hash of the latest item of each of those feeds), and a JavaScript-based parser (Google's AJAX Feed API would do fine, for example). While B is browsing our application, we parse those feeds of his second-degree friends using JavaScript, without bothering him for a single second, and post the parsed content (after checking it against the hash so that only genuinely updated content is sent) back to a server-side script, which then blindly (well, not totally blindly; after some validation or authentication for sure) inserts the results into our database. Now when A comes and visits his page (A is C's first-degree friend), he gets all the latest items from C's feeds, because B has already done the parsing job for us and we have those latest results from C's feeds stored in our database.
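Here is a rough sketch of what that client-side piece could look like, using Google's AJAX Feed API as mentioned above (the page would need to include Google's loader from http://www.google.com/jsapi). The `feedsToParse` array, the `itemHash()` helper, and the `/collect.php` endpoint are assumptions made up for illustration; in the real application the server would embed the URL list and the last-known hashes while rendering B's page.

```javascript
// Embedded by the server while rendering B's page: each job carries the feed URL
// and a hash of the latest item the server already has stored (names are hypothetical).
var feedsToParse = [
  { url: "http://example.com/c-blog/feed", lastHash: "a41f0c" }
];

// Tiny stand-in for a hash function; a real page might ship an MD5/SHA1 helper instead.
function itemHash(entry) {
  var s = entry.link + "|" + entry.publishedDate;
  var h = 0;
  for (var i = 0; i < s.length; i++) { h = (h * 31 + s.charCodeAt(i)) >>> 0; }
  return h.toString(16);
}

google.load("feeds", "1");
google.setOnLoadCallback(function () {
  feedsToParse.forEach(function (job) {
    var feed = new google.feeds.Feed(job.url);
    feed.setNumEntries(10);
    feed.load(function (result) {
      if (result.error) { return; }                 // quietly skip broken feeds
      var entries = result.feed.entries;
      if (entries.length === 0) { return; }
      // If the newest item matches what the server already knows, there is nothing new.
      if (itemHash(entries[0]) === job.lastHash) { return; }
      // Post the parsed items back; the server validates and de-duplicates before inserting.
      var xhr = new XMLHttpRequest();
      xhr.open("POST", "/collect.php", true);
      xhr.setRequestHeader("Content-Type", "application/json");
      xhr.send(JSON.stringify({ feed: job.url, entries: entries }));
    });
  });
});
```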

There are definitely more challenges than are explained here, for example: what if a person is a second-degree friend of multiple users? Since we already track the last update time of each feed while generating a page, we can decide on the server side which feeds we really want each visitor to parse, as roughly sketched below.
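As a sketch of that server-side decision (written in JavaScript here just to stay consistent with the other examples; the data shape and the 30-minute freshness window are arbitrary assumptions), the page-rendering code could simply skip any feed that was parsed recently enough, no matter how many users it belongs to:

```javascript
// Each candidate feed records when it was last parsed, by whichever visitor did it.
var FRESH_FOR_MS = 30 * 60 * 1000; // assumed freshness window: 30 minutes

function selectFeedsToAssign(candidateFeeds, now, limit) {
  return candidateFeeds
    .filter(function (f) { return now - f.lastParsedAt > FRESH_FOR_MS; })
    // Hand out the most stale feeds first.
    .sort(function (a, b) { return a.lastParsedAt - b.lastParsedAt; })
    .slice(0, limit);
}

// Example: only the feed parsed two hours ago gets handed to the visiting browser.
var jobs = selectFeedsToAssign([
  { url: "http://example.com/c-blog/feed", lastParsedAt: Date.now() - 2 * 60 * 60 * 1000 },
  { url: "http://example.com/d-blog/feed", lastParsedAt: Date.now() - 5 * 60 * 1000 }
], Date.now(), 20);
```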

Moreover, we can add more checks to our JavaScript parser than just blindly parsing those feeds. We can download only a chunk of the RSS or Atom feed (through a proxy script developed using curl's range options), read just up to the latest update time, and extract that time with a simple regex or string functions instead of downloading the full feed data. If we can tell that a feed has nothing new by downloading just 1-2 kilobytes instead of fetching the full feed and parsing the XML, that saves us even more computing resources for other jobs.
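A minimal sketch of that quick freshness check might look like the following. The `/head.php` proxy endpoint is hypothetical; it stands in for the server-side script that would use curl's range options to forward only the first couple of kilobytes of the feed, and the regexes are deliberately naive.

```javascript
// Ask a hypothetical same-origin proxy for just the first ~2 KB of the feed.
function fetchFeedHead(feedUrl, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/head.php?url=" + encodeURIComponent(feedUrl) + "&bytes=2048", true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status >= 200 && xhr.status < 300) {
      callback(xhr.responseText);
    }
  };
  xhr.send();
}

// Pull the first update timestamp out of the chunk with a simple regex:
// <pubDate> for RSS, <updated> for Atom.
function latestUpdateTime(chunk) {
  var m = chunk.match(/<pubDate>([^<]+)<\/pubDate>/) ||
          chunk.match(/<updated>([^<]+)<\/updated>/);
  return m ? new Date(m[1]) : null;
}

// Only hand the feed to the full parser when the head shows something newer than we know.
fetchFeedHead("http://example.com/c-blog/feed", function (chunk) {
  var updated = latestUpdateTime(chunk);
  var lastKnown = new Date("2009-03-01T12:00:00Z"); // supplied by the server (illustrative)
  if (updated && updated > lastKnown) {
    // ...queue this URL for the full client-side parser shown earlier...
  }
});
```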

But of course, you cannot rely entirely on your clients' machines to parse all your data. You must still have several cron'd scripts on the server side to parse the leftovers and other feeds. What I am saying is that, with a little help from JavaScript, you can harness a tremendous amount of distributed computing power in your application, all at almost no cost.

I will post example code once I am done developing my open source clone of FriendFeed, and then, I am sure, you will find it was worth writing a blog post about.

Distributed Computing
image courtesy: http://www.naccq.ac.nz/bacit/0203/2004Caukill_OffPeakGrid.htm

Have a nice RnDing time. 🙂