Datasets made available through ICWSM are not restricted to use in the ICWSM conferences. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.
ICWSM 2011 Spinn3r Dataset
This dataset, provided by Spinn3r.com, is a continuation of the 2009 Spinn3r Dataset. It consists of over 386 million blog posts, news articles, classifieds, forum posts, and social media items posted between January 13th and February 14th, 2011. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset's time period).
The content includes the syndicated text, its original HTML as found on the web, annotations and metadata (e.g., author information, time of publication, and source URL), and extracted content with boilerplate/chrome removed.
The data is formatted as Spinn3r's protostreams, an extension of Google protocol buffers, which offers better performance in speed and size than XML encoding. The data is partitioned by date, content type, and language, allowing researchers to work with either the entire dataset or selected subsets of it. With over 133 million blog posts and 231 million social media publications (see table below for more details), the dataset comes to ~3 TB (2.1 TB compressed).
| Content Type | # of Elements |
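As a rough illustration of how length-delimited binary streams of this kind are typically read, the sketch below decodes varint length prefixes, the standard protocol-buffers framing convention. This is a generic, hedged example: Spinn3r's actual protostream message schema and framing details are defined in their own documentation and are not reproduced here.

```python
# Generic sketch of reading a varint-length-delimited binary stream,
# in the style of protocol-buffers framing. The real protostream
# schema is Spinn3r's; the payloads below are fake stand-ins.
import io

def read_varint(stream):
    """Decode one base-128 varint (protobuf-style) from a binary stream."""
    shift, result = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            return None  # clean end of stream
        byte = b[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear means last varint byte
            return result
        shift += 7

def iter_frames(stream):
    """Yield raw message payloads from a varint-length-delimited stream."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        yield stream.read(length)

# Demo: frame two fake payloads, then read them back.
buf = io.BytesIO()
for payload in (b"hello", b"protostream!"):
    buf.write(bytes([len(payload)]))  # one varint byte suffices here
    buf.write(payload)
buf.seek(0)
print([f.decode() for f in iter_frames(buf)])  # ['hello', 'protostream!']
```

Once individual frames are recovered, each payload would be handed to the generated protocol-buffers classes for the dataset's message types.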
Access to this dataset is provided just as it is for the 2009 dataset. In fact, if you've already signed the agreement, your access credentials will already permit you to download the collection with no additional paperwork. If you are planning on obtaining this and/or the 2009 datasets for the first time, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection.
Cite this dataset using the following reference:
K. Burton, N. Kasch, and I. Soboroff. The ICWSM 2011 Spinn3r Dataset. In Proceedings of the Fifth Annual Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain, July 2011.
ICWSM 2009 Spinn3r Blog Dataset
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that roughly approximate search-engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed). This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs. To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection.
Here is a sample of blog posts from the collection (to download the sample data, right-click the link and select "Save as" from the context menu). The XML format is described on the Spinn3r website. Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site, spinn3r.com.
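To give a feel for working with post XML of this kind, here is a minimal sketch using Python's standard library. The element names below (`item`, `title`, `link`, `description`) are assumptions in the style of syndicated feeds, not the authoritative schema; the actual format for the 2009 collection is documented on the Spinn3r website.

```python
# Hedged sketch: extract post fields from feed-style XML. The tag names
# are illustrative assumptions, not the official Spinn3r schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<dataset>
  <item>
    <title>First post</title>
    <link>http://example.blog/1</link>
    <description>Hello, blogosphere.</description>
  </item>
  <item>
    <title>Second post</title>
    <link>http://example.blog/2</link>
    <description>More thoughts.</description>
  </item>
</dataset>
"""

root = ET.fromstring(SAMPLE)
posts = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "text": item.findtext("description"),
    }
    for item in root.iter("item")
]
print(len(posts))         # 2
print(posts[0]["title"])  # First post
```

For the full 142 GB collection, an incremental parser (e.g., `ET.iterparse`) would be preferable to loading whole files into memory.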
Cite this dataset using the following reference:
K. Burton, A. Java, and I. Soboroff. The ICWSM 2009 Spinn3r Dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA, May 2009.
JDPA Sentiment Corpus
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. The posts have been manually annotated for named, nominal, and pronominal mentions of entities. Entities are marked with the aggregate sentiment expressed toward them in the document. Mentions of each entity are marked as co-referential. Mentions are assigned semantic types consisting of the Automatic Content Extraction (ACE) mention types and additional domain-specific types. Meronymy (part-of and feature-of) and instance relations are also annotated. Expressions which convey sentiment toward an entity are annotated with the polarity of their prior and contextual sentiments as well as the mentions they target. The following modifiers, which may target other modifiers or sentiment expressions, are annotated:
- negators (expressions which invert the polarity of a sentiment expression or modifier)
- neutralizers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier)
- committers (expressions which shift the commitment of the speaker toward the truth of a sentiment expression or modifier)
- intensifiers (expressions which shift the intensity of a sentiment expression or modifier)
Additionally, we have annotated when the opinion holder of a sentiment expression is someone other than the author of the blog by linking the expression to the holder. We also annotate when two entities are compared on a particular dimension.
The data, organized into training and testing sets, consists of 515 documents (blog posts) covering 330,762 tokens, which make up 19,322 sentences. In total, 87,532 mentions and 15,637 sentiment expressions are annotated. To get access to the JDPA collection, download and sign the usage agreement, and email it to ICWSM.JDPA.Corpus@gmail.com. Once your form is processed, you will be sent a URL and password where you can download the collection.
Wikipedia User Contribution Dataset
Sara Javanmardi and Yasser Ganjisaffar, University of California, Irvine
This dataset has been prepared for an ongoing study on user reputation and content quality in Wikipedia at the University of California, Irvine. This research is done mainly by two PhD candidates, Sara Javanmardi and Yasser Ganjisaffar, under the supervision of Prof. Lopes, Prof. Baldi, and Prof. Grant. One of the building blocks of this study was a software component that monitors changes in the content of wiki pages over time. We have developed this component and are pleased to share one of our datasets on English Wikipedia, which contains user contributions. For each article, we have modeled the evolution of the content through insert and delete events over time (up to September 2009). Since the dumps released after October 2007 for English Wikipedia do not contain the full text of revisions, and processing revision text is a complicated and time-consuming task, we hope that sharing this dataset helps expedite research on Wikipedia and social media in general.
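One way to picture the insert/delete event model described above is to diff consecutive revisions of a page at the word level. The sketch below uses Python's standard `difflib` for this; it is only an illustration of the idea, and the dataset's own event extraction may define and compute events differently.

```python
# Hedged sketch: derive insert/delete "events" between two revisions
# of a page via word-level diffing. Illustrative only; not the
# dataset's actual extraction pipeline.
import difflib

def revision_events(old, new):
    """Return (inserted_words, deleted_words) between two revision texts."""
    a, b = old.split(), new.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    inserted, deleted = [], []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):
            deleted.extend(a[i1:i2])   # words removed from the old revision
        if op in ("replace", "insert"):
            inserted.extend(b[j1:j2])  # words added in the new revision
    return inserted, deleted

# Two toy revisions of the same article:
r1 = "the quick brown fox"
r2 = "the slow brown fox jumps"
ins, dels = revision_events(r1, r2)
print(ins)   # ['slow', 'jumps']
print(dels)  # ['quick']
```

Attributing each inserted word to the editor of the revision that introduced it, and tracking how long it survives subsequent deletions, is the kind of signal such a contribution dataset can support.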
Community-Created Data Resources derived from the Spinn3r Collection
The following resources were created by ICWSM 2009 data challenge authors and are being made available for general use alongside the 2009 ICWSM Spinn3r dataset.
- Spinn3r collection metadata
- Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi, MPI-SWS: "A collection containing the data that we extracted from the Spinn3r dataset. It includes data about all the posts from major blogging domains that were included in the dataset, as well as the extracted social graph among their corresponding blog users. We also include metadata that we collected from the most popular YouTube videos shared in blogs."
- Large-scale personal story corpus
- Andrew Gordon and Reid Swanson, USC: "To facilitate the distribution of large-scale story corpora, our group has identified individual blog posts that contain personal stories within existing large-scale corpora of posts. Most recently, we identified nearly one million personal stories in the ICWSM 2009 Spinn3r Blog Dataset, which we call the ICWSM 2009 Story Subset."
- Lucene index of the ICWSM 2009 collection
- Dan Knights, JD Power & Assoc: Includes
- README (explains how to use the Lucene index and search)
- index-all.tar.gz (the Lucene index)
- jdpa.lucene.tar.gz (the JDPA Java package that does the indexing and searching)
- parsexml.py (a quick-and-dirty Python script to parse and strip the post text from the Spinn3r XML)
We have a mailing list for discussing the datasets at http://groups.google.com/group/icwsm-data. Please join to talk about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, here's a forum for you. We also have a project at Google Code, http://code.google.com/p/icwsm-data/, where we can host tools and resources that you create to go along with the datasets.
Data Challenge Chairs
- Ian Soboroff, NIST
- Niels Kasch, UMBC