ICWSM-12

Datasets

Every year ICWSM strives to facilitate better and broader use of datasets in social media research. This year, ICWSM has two dataset initiatives: (1) we are providing some compelling ICWSM datasets for the community to use; and (2) we will provide a service whereby authors can share datasets associated with papers published at the ICWSM conference.

ICWSM-Provided Datasets

ICWSM is providing two datasets this year.

TREC Tweets2011 Dataset: This dataset consists of identifiers, provided by Twitter, for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere; that is, both important and spam tweets are included. To obtain the dataset, follow the instructions available online.
ICWSM boards.ie Forums Dataset: This dataset makes available ten years of discussions from the Irish forum site boards.ie. The top-level site document links to users (and to FOAF files) as well as to top-level forums. Forums link to subforums and threads, which in turn link to individual posts. The posts link to each other through replies and quotes. The FOAF files also link to each other, describing a social network based on the users' buddy lists. In total, the ten years of data comprise around 9 million documents and occupy about 50 gigabytes of disk space. To obtain this dataset, email a signed and completed copy of the usage agreement to dataset-request@icwsm.org. Once your request has been processed, you will receive download instructions.
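
The link structure described for the boards.ie dataset (forums containing subforums and threads, threads containing posts, posts pointing at other posts through replies and quotes) can be pictured with a small in-memory model. The sketch below is only an illustration: the class names, field names, and sample data are hypothetical and do not reflect the dataset's actual file format or schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Set, Tuple


# All names below are hypothetical, chosen for illustration only.
@dataclass
class Post:
    post_id: str
    author: str
    reply_to: Optional[str] = None                   # id of the post replied to
    quotes: List[str] = field(default_factory=list)  # ids of quoted posts


@dataclass
class Thread:
    thread_id: str
    posts: List[Post] = field(default_factory=list)


@dataclass
class Forum:
    name: str
    subforums: List["Forum"] = field(default_factory=list)
    threads: List[Thread] = field(default_factory=list)


def iter_posts(forum: Forum) -> Iterator[Post]:
    """Walk a forum, its threads, and all nested subforums."""
    for thread in forum.threads:
        yield from thread.posts
    for sub in forum.subforums:
        yield from iter_posts(sub)


def interaction_edges(forum: Forum) -> Set[Tuple[str, str]]:
    """Return (author, target_author) pairs implied by replies and quotes."""
    by_id = {p.post_id: p for p in iter_posts(forum)}
    edges = set()
    for post in by_id.values():
        for target in [post.reply_to, *post.quotes]:
            if target in by_id:  # skips None and ids outside this forum
                edges.add((post.author, by_id[target].author))
    return edges


# Made-up sample data, not taken from the dataset.
root = Forum("root", subforums=[Forum("sub", threads=[Thread("t1", [
    Post("p1", "alice"),
    Post("p2", "bob", reply_to="p1"),
    Post("p3", "carol", quotes=["p1", "p2"]),
])])])
edges = interaction_edges(root)
```

The resulting edge set is an interaction network derived from posts. It is separate from the buddy-list network that the FOAF files describe.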

Dataset Sharing Service

Update! The ICWSM Dataset Sharing Service is now available.

This year, for the first time, we have provided a service for hosting datasets pertaining to research presented at the conference. Authors of accepted papers have been encouraged to share the datasets on which their papers are based, while adhering to the terms and conditions of the data provider. One of these datasets is selected for an award based on its quality, scope, and timeliness.