collecting data from streaming APIs in twitter

twitter’s streaming API is still in beta and is a good source of collecting public tweets. but unfortunately not all those methods are instantly usable by third parties (u need to provide written statements and so on). but for testing, three of these streaming APIs are usable by anyone at this moment which are spritzer, track and follow. spritzer streams a tiny part of public tweets to the collecting processes. in this blog post i’ll show you how to collect data from spritzer API.

as it is a stream data, so twitter keeps the HTTP connection “alive” infinitely (until any hiccup, by using Keep Alive). so when you write code, you must take care of that. and i would also suggest to make separate processes for collecting data+writing them (or sending them in queue to be written) – and for analyzing those data. and of course, to minimize the bandwidth consumption, use the json format. and json data is also easier to parse than XML as every tweet is separated by a new line (“\n”) character from twitter :) – so you can read these data line by line, dcode them using json_decode() and do whatever you want

here is how you can create the collector process in php

< ?php
//datacollector.php
$fp = fopen("http://username:password@stream.twitter.com/spritzer.json","r");
while($data = fgets($fp))
{
    $time = date("YmdH");
    if ($newTime!=$time)
    {
        @fclose($fp2);
        $fp2 = fopen("{$time}.txt","a");
    }
    fputs($fp2,$data);
    $newTime = $time;
}
?>

this script will write the data collected hourly from the spritzer streaming API in filen (with names like <YmdH>.txt ). so in the directory where you are runnign this script u will see hourly data files. like 2009062020.txt . there is a special advantage to keep collecting in this way – as the file will remain open for writing (hence LOCKED) you will process files only for previous hours. it will make analyzing the data more hassle free :)

now run this script in background via the following command in your terminal

php datacollector.php &

the reason for appending an “&’ at the end of the command is starting this process in background. so that you dont have to wait for the script to end to get access to your shell back. as it is a streaming data, the script will run infinitely. and it will consume very minimal bandwidth :) you can check yourself.

so i hope it will help those developers who are looking for a solution to collect data from twitter’s streaming API via PHP. If you want to track any specific keywords, use the “track” API instead :). and if you want to follow some particular person use the “follow“. Check out twitter’s documentation of streaming API for more :)