Social Media, The Future of News, and Data Mining

I went to a very good panel discussion yesterday hosted by the Center for Information Technology Policy at Princeton University. There has been a conference going on there that covers a lot of the overlap between technology, law, and journalism, and yesterday’s panel discussion, “Data Mining, Visualization, and Interactivity”, was even more enlightening than I had anticipated.

The panel members included Matt Hurst of Microsoft Live Labs, Kevin Anderson, blog editor for The Guardian, and David Blei, a professor in the Computer Science Department at Princeton University. This made for a very lively discussion, covering a wide range of perspectives about social media, “what is news?”, how technology is changing how people interact with information (including news), how the news game is changing as a result (which was far more fascinating than it sounds), and how this unfathomably enormous stream of bits, enabled by lots of open APIs, feeds, and other data streams, can be managed, mined, reduced, and presented in some value-added way (part of the value being the sheer reduction in noise).

Cool Tools for Finding News

Some of the tools presented by the panelists were new to me, and aside from being great tools for bloggers and other content publishers, there are some excellent examples of how to make effective use of the data you have access to through APIs like the Digg API.


This was presented by Matt Hurst. It’s pretty neat – it’s a tool that essentially charts the blog buzz of a given phrase over time, and it even lets you compare multiple phrases, which is really interesting as well. Check it out here.

I’d like to know more about how it derives the metrics, but in a couple of quick comparisons using the tool, the results seem to line up to some degree with simple comparisons of the number of search results for different phrases on sites like Technorati and Bloglines. Interestingly, even though there appears to be a lot more data available at Technorati, in my very limited experimenting the percent difference between search results for any two phrases appears to be similar on both sites, suggesting that Bloglines may be a representative sampling of Technorati data. More experimentation, of course, would be needed to lend any credibility whatsoever to that claim. It’s probably irrelevant, because you can’t ask either service for any kind of historical data regarding search results 🙂
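For the curious, here is a rough sketch in Python of the kind of back-of-the-envelope comparison I mean. It doesn’t talk to either service: the function names, the 5-point tolerance, and the result counts in the example are all made up for illustration. You would paste in whatever numbers the two search pages actually give you.

```python
def percent_difference(count_a, count_b):
    """Percent difference between two result counts, relative to their mean."""
    mean = (count_a + count_b) / 2
    if mean == 0:
        return 0.0  # both counts are zero, so there is no difference to report
    return abs(count_a - count_b) * 100 / mean


def looks_representative(big_site_counts, small_site_counts, tolerance=5.0):
    """True if the two sites show a similar percent difference for the same
    pair of phrases. The tolerance (in percentage points) is an arbitrary,
    hypothetical cutoff, not a statistical test."""
    big = percent_difference(*big_site_counts)
    small = percent_difference(*small_site_counts)
    return abs(big - small) <= tolerance


if __name__ == "__main__":
    # Placeholder counts for (phrase one, phrase two) on each site.
    # These are NOT real search results.
    larger_site = (120000, 80000)
    smaller_site = (1200, 800)
    print(percent_difference(*larger_site))
    print(percent_difference(*smaller_site))
    print(looks_representative(larger_site, smaller_site))
```

One pair of phrases proves nothing, of course. You would want to run this over a lot of phrase pairs before reading anything into it.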


This has the potential to be really interesting. Right now, it lets you pick from several different terms, like “love”, “wish”, “think” and “feel”, and after clicking one of those, it’ll start producing a constantly updating stream of twitters that contain those words. If this experiment is successful, I would imagine they’d eventually enable the same service for arbitrary keywords, which would be really powerful, and quite a lot of fun!


Oh, how boring my life is according to twitter. I’m still in the schizophrenic stage of settling on a live ‘update your friends on what you’re doing whether they care or not’ service. Facebook, myspace, twitter, jaiku… there are too many. I’m trying out the imified route now to consolidate all the cruft. According to tweetwheel, there are more places to update my status at any given moment than there are people who give a damn what my status is.

Anyway, tweetwheel shows how you’re connected to people through twitter. If you have lots of followers and follow lots of people, the wheel is really exciting to look at, as displayed by Kevin Anderson, who has a much more “robust” wheel than me — it’s actually interesting to look at. At some point I’d like to see this idea expanded to cover the other services like Facebook and even LinkedIn.

Digg Labs

You have to go to the Digg Labs site and see what people are doing with the Digg API. There are too many awesome utilities to cover them all here. It almost makes me wish I did fancy Flash UI stuff instead of back end data mining and infrastructure administration.

At a higher level…

Most of the discussion about social media seems to be about measuring buzz created by bloggers (at least where news/content publishing is concerned). However, although things have shifted dramatically in a ‘consumers are producers’ direction, causing people to start rethinking the definition of news, this shift is caused as much by consumers who are still *only* consuming as by anyone else, and I didn’t see much in the way of tools that measure the interest of those people in any meaningful way. Perhaps the consensus is that bloggers are a representative sampling of the wider internet readership? I don’t know. I would disagree with that if it were the case.

I work for AddThis, which seeks to provide publishers of news and all kinds of other content with statistics that help them figure out not just which pages people happen to be landing on, but which ones they have elected to take a greater interest in, whether by emailing them to a friend, adding them to their favorites, or posting them to digg, delicious, or some other service. Maybe some day there will be an AddThis API that’ll let you easily do even more interesting things with social media.