International Conference on Weblogs and Social Media

Contact Information
For questions, please e-mail:

 

March 26-28, 2007

Conference Dataset

Update: As of Dec. 8, 2006 (the conference deadline), we are no longer taking requests for the dataset. This website will remain up for reference.

Continuing the tradition from the WWE2006 workshop, we will once again be offering a large blog dataset to conference participants.

The data release comprises a complete set of weblog posts collected by Nielsen BuzzMetrics for May 2006 (about 14M posts from 3M weblogs). The dataset includes the full content of the posts plus mark-up, and requires 10 GB of space to download in compressed form.

In addition to the data, we hope to release processing code and a shared repository for those making use of the dataset.

Obtaining the Data

The process for obtaining the data:
(1) download the data share agreement
(2) sign and fax the agreement as instructed in the document
(3) e-mail datashare@icwsm.org to let us know that the fax is on its way

Once we've received the fax, we will e-mail you a unique username/login that will permit you to download the data from the hosting site at Ebiquity UMBC. The username/login expires after one week.

Description of the Data

The dataset consists of about 14M weblog posts in XML format from 3M weblogs, collected by Nielsen BuzzMetrics for May 2006. The data is annotated with 1.7M blog-to-blog links. Please note that the complete dataset is over 10 GB.

The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog url, permalink, tags/categories, and outlinks classified by type.
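As a rough illustration of how the fields listed above might be read from a single post record, here is a minimal Python sketch. The XML element and attribute names (post, weblog_url, outlink, type, and so on) are assumptions made up for this example, as is the sample record; the actual schema is defined by the dataset's own documentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# NOTE: every element and attribute name in this sample is hypothetical.
# It only mirrors the list of marked-up fields; it is not the real schema.
SAMPLE = """
<post>
  <date>2006-05-01</date>
  <time>12:34:56</time>
  <author>example-author</author>
  <title>Example title</title>
  <weblog_url>http://example.org/</weblog_url>
  <permalink>http://example.org/2006/05/01/example</permalink>
  <tags><tag>news</tag><tag>tech</tag></tags>
  <outlinks>
    <outlink type="blog">http://example.net/</outlink>
    <outlink type="news">http://example.com/story</outlink>
  </outlinks>
</post>
"""


@dataclass
class Post:
    date: str
    time: str
    author: str
    title: str
    weblog_url: str
    permalink: str
    tags: list = field(default_factory=list)
    outlinks: list = field(default_factory=list)  # (type, url) pairs


def parse_post(xml_text: str) -> Post:
    """Parse one post record using the hypothetical element names above."""
    elem = ET.fromstring(xml_text.strip())

    def text(name: str) -> str:
        # Missing elements become empty strings rather than raising.
        return (elem.findtext(name) or "").strip()

    return Post(
        date=text("date"),
        time=text("time"),
        author=text("author"),
        title=text("title"),
        weblog_url=text("weblog_url"),
        permalink=text("permalink"),
        tags=[(t.text or "").strip() for t in elem.findall("tags/tag")],
        outlinks=[
            (o.get("type", ""), (o.text or "").strip())
            for o in elem.findall("outlinks/outlink")
        ],
    )


post = parse_post(SAMPLE)
```

For a download of this size, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading whole files into memory, though that depends on how the files are actually laid out.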

Breakdown (of posts) by language:

  • English 51%
  • Chinese 14%
  • Japanese 14%
  • Russian 6%
  • Spanish 3%
  • French 2%
  • Italian 2%
  • Unknown 3%
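To get a feel for scale, the percentages above can be turned into rough post counts. The sketch below assumes a round total of 14,000,000 posts (the page only says "about 14M"), so the results are estimates, not figures from the dataset. Note that the listed shares add up to 95%, so about 5% of posts are not accounted for in this breakdown.

```python
# Assumption: "about 14M posts" is treated as exactly 14,000,000.
TOTAL_POSTS = 14_000_000

# Shares (in percent) copied from the breakdown above.
SHARES = {
    "English": 51,
    "Chinese": 14,
    "Japanese": 14,
    "Russian": 6,
    "Spanish": 3,
    "French": 2,
    "Italian": 2,
    "Unknown": 3,
}


def estimated_posts(total: int, shares: dict) -> dict:
    """Return an approximate post count per language from percent shares."""
    return {lang: total * pct // 100 for lang, pct in shares.items()}


estimates = estimated_posts(TOTAL_POSTS, SHARES)
listed_share = sum(SHARES.values())  # percent of posts covered by the list

for lang, count in estimates.items():
    print(f"{lang:<9} ~{count:,} posts")
print(f"Listed shares cover {listed_share}% of posts")
```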

Additional details for the dataset are available here.