collecting data from streaming APIs in twitter

twitter’s streaming API is still in beta and is a good source for collecting public tweets. unfortunately, not all of its methods are instantly usable by third parties (you need to provide written statements and so on). but for testing, three of these streaming APIs are usable by anyone at this moment: spritzer, track and follow. spritzer streams a tiny part of the public tweets to the collecting process. in this blog post i’ll show you how to collect data from the spritzer API.

as this is streaming data, twitter keeps the HTTP connection “alive” indefinitely (using Keep-Alive, until some hiccup breaks it), so when you write code you must take care of that. i would also suggest making separate processes: one for collecting the data and writing it (or sending it to a queue to be written), and another for analyzing it. and of course, to minimize bandwidth consumption, use the json format. json data is also easier to parse than XML here, as twitter separates every tweet with a new line (“\n”) character :) so you can read the data line by line, decode each line using json_decode() and do whatever you want.
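to make the line by line idea concrete, here is a minimal sketch of the decoding step. the helper name decode_tweet_line() and the sample lines are made up for illustration (real tweets carry many more fields), and skipping blank lines is just a precaution in case the stream sends empty lines between tweets.

```php
<?php
// decode one line from the stream; returns an array, or null for
// blank lines and for anything that is not a JSON object/array
function decode_tweet_line($line)
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }
    $tweet = json_decode($line, true); // true = decode into arrays
    return is_array($tweet) ? $tweet : null;
}

// hypothetical sample lines standing in for what fgets() would return
$lines = array(
    '{"id":1,"text":"hello world"}' . "\n",
    "\n",
    '{"id":2,"text":"second tweet"}' . "\n",
);

foreach ($lines as $line) {
    $tweet = decode_tweet_line($line);
    if ($tweet !== null) {
        echo $tweet['id'], ': ', $tweet['text'], "\n";
    }
}
```

in a real analyzer you would feed it the lines of an hourly file instead of the sample array.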

here is how you can create the collector process in php

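<?php
//datacollector.php
$fp = fopen("http://username:password@stream.twitter.com/spritzer.json","r");
if (!$fp) die("could not connect to the stream\n");
$fp2 = null;     // handle of the hourly file that is open now
$newTime = null; // hour (YmdH) that the open file belongs to
while(($data = fgets($fp)) !== false)
{
    $time = date("YmdH");
    if ($newTime!=$time)
    {
        // a new hour has started: close the old file, open the next one
        if ($fp2) fclose($fp2);
        $fp2 = fopen("{$time}.txt","a");
    }
    fputs($fp2,$data);
    $newTime = $time;
}
?>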

this script writes the data collected from the spritzer streaming API into hourly files (with names like <YmdH>.txt). so in the directory where you are running this script you will see hourly data files, like 2009062020.txt. there is a special advantage to collecting in this way: as the file for the current hour is still open for writing, you process only the files for previous hours, which are already complete. it makes analyzing the data more hassle free :)
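the analyzing process then only needs to pick up the hours that are already finished. a rough sketch of that selection, assuming the <YmdH>.txt naming from above; the function name completed_hour_files() is hypothetical:

```php
<?php
// return the hourly files (named like 2009062020.txt) in $dir that are
// older than $currentHour (a "YmdH" string), oldest first
function completed_hour_files($dir, $currentHour)
{
    $done = array();
    $paths = glob($dir . '/*.txt');
    foreach ($paths ? $paths : array() as $path) {
        $name = basename($path, '.txt');
        // keep only names of exactly ten digits, and only past hours.
        // YmdH strings sort chronologically, so strcmp() is enough.
        if (preg_match('/^\d{10}$/', $name) && strcmp($name, $currentHour) < 0) {
            $done[] = $path;
        }
    }
    sort($done);
    return $done;
}

foreach (completed_hour_files('.', date('YmdH')) as $path) {
    echo "ready to analyze: $path\n";
}
```

the file for the current hour is skipped on purpose, since the collector is still appending to it.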

now run this script in background via the following command in your terminal

php datacollector.php &

the reason for appending an “&” at the end of the command is to start this process in the background, so that you don’t have to wait for the script to end to get your shell back. as it is streaming data, the script will run indefinitely. and it will consume very minimal bandwidth :) you can check yourself.
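in practice the connection can still drop, and then the read loop simply ends. one way to deal with that is to reconnect with a growing pause between attempts. this is only a sketch: the names next_delay(), collect() and handle_line, the doubling, and the 240 second cap are my assumptions for illustration, not limits published by twitter.

```php
<?php
// double the wait after every attempt, capped at $max seconds
// (the 240 second cap is an assumption, not a twitter rule)
function next_delay($delay, $max = 240)
{
    return min($delay * 2, $max);
}

// keep (re)connecting to $url and hand every line to $onLine.
// $maxAttempts only exists so the sketch can stop; 0 means forever.
function collect($url, $onLine, $maxAttempts = 0)
{
    $delay = 1;
    for ($attempt = 1; $maxAttempts === 0 || $attempt <= $maxAttempts; $attempt++) {
        $gotData = false;
        $fp = @fopen($url, 'r');
        if ($fp !== false) {
            while (($line = fgets($fp)) !== false) {
                $gotData = true;
                call_user_func($onLine, $line);
            }
            fclose($fp);
        }
        if ($gotData) {
            $delay = 1; // the stream worked, so start over with a short wait
        }
        sleep($delay); // never reconnect immediately
        $delay = next_delay($delay);
    }
}

// usage (commented out, because it would run until you kill it):
// collect('http://username:password@stream.twitter.com/spritzer.json', 'handle_line');
```

here handle_line would be your own function that receives one line at a time, for example the part that writes to the hourly files.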

so i hope it will help those developers who are looking for a solution to collect data from twitter’s streaming API via PHP. if you want to track any specific keywords, use the “track” API instead :) and if you want to follow some particular person, use “follow”. check out twitter’s documentation of the streaming API for more :)


21 thoughts on “collecting data from streaming APIs in twitter”

  1. @Ishtiaque – the spritzer, follow and track APIs are open for all :) others are not available for everyone.

    The tremendous “Firehose” API is available to friendfeed fyi :) that’s why they got all your tweets

  2. thanks for the info Hasin bhai. That’s really cool. One more question: aren’t they going to use oauth? I thought twitter has set it as the standard way to access their APIs.

  3. @Ishtiaque yes of course they use oAuth. but i’ve shown the shortcut way with HTTP BASIC AUTH, because this process will run in the background as a shell process and you must use your own account for that (which means you know the un/pw). and I don’t know how to use oAuth tokens from the CLI (i’m not even sure anything like that exists) :)

    But yes, you can do it with an oAuth token. Check my previous blog post to get an idea of how to implement oAuth using PHP in twitter :)

  4. Awesome!

    Can you provide an example in case the data stream is interrupted and you have to re-connect?

    Also, how would you do the “track” stream using parameters? Does fopen allow for sending of the parameters?

    Or do I need to use cURL for “track”?

    Thanks

  5. This runs great, and really appreciate the post. Been having problems with it this week though, keeps dropping off, anybody have a solution to auto restart it when it does?

  6. To those asking how to make it re-start should the twitter API fail… use this…
    while('1' == '1') {
        $fp = fopen("http://username:password@stream.twitter.com/spritzer.json","r");
        while($data = fgets($fp)) {
            $time = date("YmdH");
            if ($newTime!=$time) {
                @fclose($fp2);
                $fp2 = fopen("{$time}.txt","a");
            }
            fputs($fp2,$data);
            $newTime = $time;
        }
        sleep(1000);
    }

    It is a little hackish… but it’s simple, and it works.
    Yes the sleep is needed or else you could get a temp IP block from twitter…

  7. Thanks for posting this… to give some back — those wondering how to use track and follow, see below:

    $curl = curl_init();

    curl_setopt($curl, CURLOPT_POST, true);
    curl_setopt($curl, CURLOPT_POSTFIELDS, 'track=#NowPlaying');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, 'http://stream.twitter.com/1/statuses/filter.json');
    curl_setopt($curl, CURLOPT_USERPWD, $_CONFIG['twitter']['username'] . ':' . $_CONFIG['twitter']['password']);
    curl_setopt($curl, CURLOPT_WRITEFUNCTION, 'progress');
    curl_exec($curl);
    curl_close($curl);

    function progress($curl, $str)
    {
        print "$str\n\n";
        return strlen($str);
    }

  8. Can anybody tell me how to adapt the above datacollector.php code to work with the new twitter api? Please help. Sorry if the question is simple, but I am very new to this. Thanks in advance
    Nagy
