General Chairs

William W. Cohen
Carnegie Mellon/Google
Nicolas Nicolov
J.D. Power and Associates, McGraw-Hill

Program Chairs

Natalie Glance
Google Inc
Matthew Hurst
Live Labs, Microsoft

Data Chairs

Ian Soboroff
NIST
Akshay Java
Live Labs, Microsoft

Panel Chair

Kathy Gill
University of Washington

Local Chair

Cameron Marlow

Tutorials Chair

Chris Diehl
Johns Hopkins University



3rd Int'l AAAI Conference on Weblogs and Social Media

May 17 - 20, 2009, San Jose, California

Sponsored by the Association for the Advancement of Artificial Intelligence.

ICWSM 2009 Data Challenge

Continuing the ICWSM tradition, ICWSM 2009 is making a dataset available to researchers in the blog and social media fields. We invite you to download the dataset, explore it, learn something interesting about it, and submit a paper about it to ICWSM 2009.

Good research topics might include...

But you should feel free to explore any aspect of the data that you feel would be of interest to the ICWSM community.

List of papers accepted to the Data Challenge Workshop

Identifying Personal Stories in Millions of Weblog Entries
Andrew Gordon and Reid Swanson

SentiSearch: Exploring Mood on the Web
Sara Sood and Lucy Vasserman

Flash Floods and Ripples: The Spread of Media Content through the Blogosphere [Best Data workshop paper]
Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi

Event Intensity Tracking in Weblog Collections
Viet Ha Thuc, Yelena Mejova, Christopher Harris and Padmini Srinivasan

Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network
Ali Azimi Bolourian, Yashar Moshfeghi and C. J. van Rijsbergen

Authors are invited to submit papers to a special data challenge workshop, to be held on the last day of ICWSM. Papers for the workshop may be submitted here. The deadline for workshop submissions is March 1st. Submissions may be up to 8 pages in length, must be in PDF format, and must follow the ICWSM formatting guidelines. The workshop itself will feature presentations by authors as well as a broader discussion of data issues and opportunities confronting the social media community.
We also welcome authors to submit papers on the dataset to the main ICWSM conference. Time permitting, we will invite authors of accepted ICWSM papers on the dataset to also briefly present their work at the workshop.
The best paper (main conference or workshop) on the dataset will be selected by the data chairs and will receive a prize at the conference.
Please note that use of the datasets made available through ICWSM is not restricted to ICWSM 2009, or even to ICWSM in general. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.

ICWSM 2009 Spinn3r Blog Dataset

239 people have downloaded the dataset so far! (as of March 25th, 2009)

The dataset, provided by Spinn3r, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.
To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection.
Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site.
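
Because the collection is 142 GB of XML when uncompressed, loading a whole file into memory is unlikely to be practical. The following is a minimal sketch of streaming through posts one at a time with Python's standard library. The element names (`dataset`, `item`, `title`, `link`) and the sample data are assumptions made for illustration only, not the actual Spinn3r schema; consult the format description on the Spinn3r website for the real element names and any XML namespaces.

```python
import io
import xml.etree.ElementTree as ET


def iter_posts(source, post_tag="item"):
    """Yield one dict per post element while streaming through the XML.

    `post_tag` is a hypothetical element name, not the real Spinn3r one.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == post_tag:
            # Collect the text of each direct child, keyed by its tag name.
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Release the post's text and children to keep memory use low.
            elem.clear()


# Tiny made-up sample standing in for one dataset file.
SAMPLE = b"""<dataset>
  <item><title>First post</title><link>http://example.org/1</link></item>
  <item><title>Second post</title><link>http://example.org/2</link></item>
</dataset>"""

posts = list(iter_posts(io.BytesIO(SAMPLE)))
print(len(posts), posts[0]["title"])
```

In practice you would pass an open file (or a decompressing file object) instead of the in-memory sample, and process each post as it is yielded rather than collecting them all in a list.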

When citing this dataset in a paper, please use the following reference:
K. Burton, A. Java, and I. Soboroff. The ICWSM 2009 Spinn3r Dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA, May 2009.

We have a mailing list for discussing the datasets. Please join to talk about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, this is the forum for you. We also have a project at Google Code, where we can host tools and resources that you create to go along with the datasets.

Data Chairs
Ian Soboroff, NIST
Akshay Java, Live Labs, Microsoft

