ICWSM-2020

ICWSM Expresses Solidarity with the Black Community

This is an ICWSM like no other. As we endeavor to hold our conference against the dual backdrop of the COVID-19 pandemic and widespread protests over the murders of George Floyd, Ahmaud Arbery, Breonna Taylor, and so many other Black citizens who have been killed by police officers in the United States, we reaffirm now and always: Black Lives Matter.

To that end, we will work to include more Black and marginalized people in our community starting now. We are offering 20 conference fee waivers so that Black scholars and scholars who are members of other marginalized groups from around the globe may join us at ICWSM 2020. You may apply for one of the waivers here.

We will also dedicate a Slack channel to collecting anti-racist ideas for ICWSM and commit to convenining a steering group to implement these ideas. Please visit this channel during the conference and leave us your suggestions.

In Solidarity,
The ICWSM Organizing Committee

Introducing ICWSM-2020: Virtual Edition

It was inevitable that a conference about social media would host a virtual conference.

ICWSM-2020 Conference Proceedings: here

ICWSM-2020 Workshop Proceedings: here

ICWSM-2020 Technical Program Schedule: here

ICWSM-2020 Virtual Conference URL: here (Need Sign-In Access Token)

In response to the ongoing health crisis, your organizing committee has been working to put together an engaging, accessible, online program for ICWSM. Thanks to the dedicated program committee, who were amazingly responsive even amidst the chaos, we have an excellent program of 72 full papers, 12 posters, 12 datasets, 3 demos, and 3 wonderful keynote speakers. Please register early to get access to pre-recorded talks!

After much spirited debate among the organizing and steering committees, we have arrived at the following mix of synchronous and asynchronous activities that we hope will make ICWSM as engaging and equitable as possible:

Pre-recorded talks

All papers and keynote presentations will be pre-recorded, accessible to registrants prior to the conference. Full papers will have 10 minute videos, while other papers will have 3 minute videos. Each paper additionally has synchronous and asynchronous interaction options, described below.

Fireside Chats with Keynote Speakers

We’re happy to share that we have three wonderful keynote speakers! The pre-recorded talks will be available before the conference, and we will have a live, synchronous fireside chat with each speaker during the conference. Registrants can ask questions before the session via Slack and during the session via Zoom.

Spotlight Paper Q&A

Several of the top papers will be highlighted by having their own dedicated, synchronous Q&A sessions on Zoom.

Zoom Poster Sessions

Every paper will have a one-hour slot in which the presenter will host their own Zoom poster session. Registrants can hop between sessions to visit with each presenter in an interactive session. BYOWOC (Bring Your Own Wine Or Coffee, depending on time zone!) We’ve worked to limit the number of parallel sessions so that attendees have time to visit each paper.

Asynchronous Chat

Every paper will have a dedicated Slack channel session where registrants can ask questions via text. Presenters will visit the session throughout the conference.

Timezone accommodations

We have allocated six hours of synchronous activities per day, split into two, three-hour time slots (8a-11a; 4p-7p EDT/GMT-4). The two time slots are divided across the day to give as many people as possible a chance to interact, while also recognizing that a typical 12 hour day of conference activities is not feasible for most people. Of course, these times will not please everyone, but we hope this mix of synchronous and asynchronous activities will result in an engaging and interactive conference.

Workshops/Tutorials

Workshops will take place via Zoom -- details forthcoming!

We appreciate your patience and support as we figure out all the technical details. We’ll be posting the full schedule soon. In the meantime, please register and take a Zoom training session if you haven’t already been using it non-stop. Send questions/threats to pc.chairs@icwsm.org.

ICWSM COVID-19 Information

THE PHYSICAL MEETING OF THE 2020 INTERNATIONAL AAAI CONFERENCE ON WEB AND SOCIAL MEDIA HAS BEEN CANCELED
VIRTUAL MEETING PLANNED

In accordance with the current recommendations of the Centers for Disease Control and prevention (CDC) and the United States Government to slow the spread of the COVID-19, the physical meeting of ICWSM-2020 in Atlanta, Georgia has been cancelled. However, plans are underway to convert ICWSM-20 to a virtual meeting during the same time period of June 8-11, 2020. More information will be posted here in the next month with information on how to register and participate.

Registration will be available in late April, and fees will reflect the significantly lower cost associated with the a virtual meeting, while still covering the expenses that have and will be incurred to produce a successful event for all participants.

The production of the proceedings will continue as planned, and papers will be posted in the AAAI Digital Library by the time of the virtual meeting.

We appreciate your patience while organizers investigate possible solutions. We will post developments as soon as more information is available.

Thank you for your support of ICWSM-2020!

The International AAAI Conference on Web and Social Media (ICWSM) is a forum for researchers from multiple disciplines to come together to share knowledge, discuss ideas, exchange information, and learn about cutting-edge research in diverse fields with the common theme of online social media. This overall theme includes research in new perspectives in social theories, as well as computational algorithms for analyzing social media. ICWSM is a singularly fitting venue for research that blends social science and computational approaches to answer important and challenging questions about human social behavior through social media while advancing computational tools for vast and unstructured data.

ICWSM, now in its fourteenth year, has become one of the premier venues for computational social science, and previous years of ICWSM have featured papers, posters, and demos that draw upon network science, machine learning, computational linguistics, sociology, communication, and political science. The uniqueness of the venue and the quality of submissions have contributed to a fast growth of the conference, and a competitive acceptance rate around 20% for full-length research papers published in the proceedings by the Association for the Advancement of Artificial Intelligence (AAAI).

For ICWSM-2020, in addition to the usual program of contributed talks, posters and demos, the main conference will include a selection of keynote talks from prominent scientists and technologists. Building on successes in previous years, ICWSM-2020 will also hold a day of workshops and tutorials in addition to the main conference.

Disciplines

Computational approaches to social media research including

Natural Language Processing

Text / Data Mining

Machine Learning

Image / Multimedia Processing

Graphics and Visualization

Distributed Computing

Graph Theory and Network Science

Human-computer Interaction

Social science approaches to social media research including

Psychology

Sociology and social network analysis

Communication

Political Science

Economics

Anthropology

Media Studies and Journalism

Digital Humanities

Interdisciplinary approaches to social media research combining computational algorithms and social science methodologies

Topics Include (But Not Limited To)

Studies of digital humanities (culture, history, arts) using social media

Psychological, personality-based and ethnographic studies of social media

Analysis of the relationship between social media and mainstream media

Qualitative and quantitative studies of social media

Centrality/influence of social media publications and authors

Ranking/relevance of social media content and users

Credibility of online content

Social network analysis; communities identification; expertise and authority discovery

Trust; reputation; recommendation systems

Human computer interaction; social media tools; navigation and visualization

Subjectivity in textual data; sentiment analysis; polarity/opinion identification and extraction, linguistic analyses of social media behavior

Text categorization; topic recognition; demographic/gender/age identification

Trend identification and tracking; time series forecasting

Measuring predictability of real world phenomena based on social media, e.g., spanning politics, finance, and health

New social media applications; interfaces; interaction techniques

Engagement, motivations, incentives, and gamification.

Social innovation and effecting change through social media

Social media usage on mobile devices; location, human mobility, and behavior

Organizational and group behavior mediated by social media; interpersonal communication mediated by social media

Example Data Sources (Web and Social Media)

Social networking sites (e.g., Facebook, LinkedIn)
Microblogs (e.g., Twitter, Tumblr)
Wiki-based knowledge sharing sites (e.g., Wikipedia)
Social news sites and websites of news media (e.g., Huffington Post)
Forums, mailing lists, newsgroups
Community media sites (e.g., YouTube, Flickr, Instagram)
Social Q & A sites (e.g., Quora, Yahoo Answers)
User reviews (e.g., Yelp, Amazon.com)
Social curation sites (e.g., Reddit, Pinterest)
Location-based social networks (e.g., Foursquare)

Call For Submissions

Full Papers

Abstract submission: Not required.

Submission Information:
For September 15, 2020 submission deadline (For ICWSM-2021):
Full Paper Submission Site

After registering/logging in, click on the Submissions tab, and make the following selections from the dropdown menus:
Society: AAAI. Conference/Journal: ICWSM 21. Track: ICWSM 21 Sept Submissions.

After clicking "Go" a placeholder submission will appear in the table below. Click on "Edit Submission" to add your paper details.

The submission will be marked complete when you have entered all the required fields, but you will still be able to edit it until submissions are closed.

There are three submission deadlines before ICWSM-2021:

1st Full-paper Deadline: May 15, 2019 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from ICWSM-19) [Notifications: July 15, 2020]
2nd Full-paper Deadline: September 15, 2019 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from Jan & May 2019 submissions) [Notifications: Nov 15, 2019]
~~3rd Full-paper Deadline~~~~: January 15, 2020 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from Sep 2019 submissions) [Notifications: Mar 16, 2020]~~

Submission deadlines before ICWSM-2021:

1st Full-paper Deadline: May 15, 2020 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from Jan* 2020 submissions) [Notifications: July 15, 2020]
2nd Full-paper Deadline: September 15, 2020 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from Jan* & May 2020 submissions) [Notifications: Nov 15, 2020]
3rd Full-paper Deadline: January 15, 2021 (by 23:59 PM Anywhere on Earth) (New submissions, and R&Rs from Sep 2020 submissions) [Notifications: Mar 16, 2021]

* Given the current situation due to COVID-19 and its ramification in people's lives, their communities, their workplaces, and societies, ICWSM is offering a later revision option. So authors (of papers currently in R&R) may decide to resubmit a revision either by May 15, 2020, or they may decide to resubmit by September 15, 2020. In either case, the next notification will include a final accept/reject decision without the possibility for additional rounds of revisions.

The 2020 reviewing process will be similar to that in the 2019 edition. Papers to be considered for publication in the ICWSM proceedings and presentation at the ICWSM 2020 conference must be submitted by one of the three submission deadlines. Authors who receive the "Accept" recommendation will have the opportunity to respond to reviewer suggestions by making minor edits when preparing the camera-ready version. Authors who receive the "Revise and Resubmit" recommendation will have the opportunity to address reviewer suggestions and resubmit an improved manuscript in the next submission deadline. Authors who receive "Revise & Resubmit" in January 2020 will likely be presenting during the 2021 conference if their papers get accepted during the next submission round. Papers accepted to this track will be presented as full-length presentations integrated with the conference, and they will be published as journal articles in the ICWSM proceedings. For submission guidelines, please refer to the guidelines.

Social Science and Sociophysics Track We will be continuing the 'social science and sociophysics' track at ICWSM-2020 following its successful debut in 2013. This option is for researchers in social science and sociophysics who wish to submit works without publication in the conference proceedings. While papers in this track will not be published, we expect these submissions to describe the same high-quality and complete work as the main track submissions. Papers accepted to this track will be presented either as full or poster presentations integrated with the conference, but they will be published only as abstracts in the conference proceedings. Papers submitted to this track will be reviewed through the same reviewing mechanism as full papers.

Rumi Chunara, Aron Culotta, and Brooke Foucault Welles
(ICWSM-2020 PC Chairs | pc.chairs@icwsm.org)

Posters and Demos

Poster and Demos Abstract submission: Not required.
Poster and Demo Paper submission: January 15, 2020 (by 23:59 Anywhere on Earth Time)

Submission Information:
Posters, Demos, and Datasets Site

After registering/logging in, click on the Submissions tab, and make the following selections from the dropdown menus:
Society: AAAI. Conference/Journal: ICWSM 20. Track: ICWSM 20 Posters, Demos, and Datasets.

After clicking "Go" a placeholder submission will appear in the table below. Click on "Edit Submission" to add your paper details.

The submission will be marked complete when you have entered all the required fields, but you will still be able to edit it until submissions are closed.

Poster/Demo Format: Poster papers must be no longer than 5 pages, with page 5 containing nothing but references, and demo descriptions must be no longer than 3 pages, with page 3 containing nothing but references, and all must be submitted by the deadlines given above.

The reviewing process for posters and demos will follow the same pattern as in previous years. Papers will either be accepted or rejected. Authors of accepted papers will have the opportunity to respond to reviewer suggestions by making minor edits when preparing the camera-ready version. Poster and demo papers will not have a revise and resubmit phase. For submission guidelines, please refer to the guidelines.

Rumi Chunara, Aron Culotta, and Brooke Foucault Welles
(ICWSM-2020 PC Chairs | pc.chairs@icwsm.org)

Datasets

Poster and Demos Abstract submission: Not required.
Poster and Demo Paper submission: January 15, 2020 (by 23:59 Anywhere on Earth Time)

Submission Information:
Posters, Demos, and Datasets Site

After clicking "Go" a placeholder submission will appear in the table below. Click on "Edit Submission" to add your paper details.

The submission will be marked complete when you have entered all the required fields, but you will still be able to edit it until submissions are closed.

Dataset Paper Format: Dataset paper submissions must comprise two parts: a dataset or group of datasets, and metadata describing the content, quality, structure, potential uses of the dataset(s), and methods employed for data collection. Descriptive statistics may be included in the metadata (more sophisticated analyses should be part of a regular paper submission). Authors are encouraged to include a description of how they intend to make their datasets FAIR. Datasets and metadata must be published using a dataset sharing service (e.g. Zenodo, datorium, dataverse, or any other dataset sharing services that indexes your dataset and metadata and increases the re-findability of the data) that provides a DOI for the dataset, which should be included in the dataset paper submission. Dataset paper review will be single blind, and all datasets have to be identified and uploaded at the time of submission. Dataset paper submissions must be between 2-10 pages long and will be part of the full proceedings. All papers must follow AAAI formatting guidelines. For submission guidelines, please refer to the guidelines.

Fabricio Benevenuto, Yelena Mejova, and Claudia Wagner
(ICWSM-2020 Data Chairs | data.chairs@icwsm.org)

Workshops

ICWSM-2020 invites proposals for Workshops Day at the 14th International AAAI Conference on Web and Social Media (ICWSM). Workshop participants will have the opportunity to meet and discuss issues with a selected focus -- providing an informal setting for active exchange among researchers and developers from a wide range of disciplines, including the social science and computer science. Workshops are an excellent forum for exploring emerging approaches and task areas, bridging gaps between the social sciences and computing, and elucidating results of exploratory research.

Members of all segments of the social media research community are encouraged to submit proposals. To foster interaction and exchange of ideas, the workshops will be kept small, with up to 40 participants.

The format of workshops will be determined by their organizers. The two main criteria for the selection of the workshops will be the following:

The organizers encourage workshops that promote different types of activities, including challenges, games, and brainstorming and networking sessions. The organizers discourage workshops that are structured as “mini-conferences” dominated by long talks and short discussions. Workshops should leave ample time for discussions and interaction between the participants and should also encourage the submission and presentation of position papers that discuss new research ideas.

The workshop should have the potential to attract the interest of researchers in computer science and social science. Proposals involving people of different backgrounds in the organizing committee and addressing topics at the intersection of different disciplines will have a higher chance of acceptance.

Workshop organizers who want to publish the papers from their workshop (or significant portions of it) will have the opportunity to do so through a venue TBD. For a list of last year's workshops see hhttp://www.icwsm.org/2019/program/workshop/.

Workshop proposal content

Proposals for workshops should be no more than five (5) pages in length (10pt, single column, with reasonable margins), written in English, and should contain the following:

A concise title

The names, affiliations, and contact information of the organizing committee. A main contact author should be specified. A typical proposal should include no more than four co-chairs.

An indication as to whether the workshop should be considered for a half-day or full-day meeting.

A short abstract describing the scope and main objective of the workshop. Identify the specific issues and research questions the workshop will focus on, with a brief discussion of why the topic is of particular interest at this time and for which research communities.

A two/three paragraph description of the workshop topic and themes.

A description of the proposed workshop format and a detailed list of proposed activities, with special emphasis on those activities that distinguish it from a mini-conference (e.g., games, brainstorming sessions, challenges, group activities).

An approximate timeline of the activities, including submission deadlines and evaluation. Include whether you will have an early-bird deadline for submissions to allow extra time for visa applications.

A description of how workshop submissions will be evaluated and selected (invited contributions, peer review, etc.). In case a PC is needed, provide a tentative list of the members.

Historical information about the workshop, if available. Short description of the previous editions reporting highlights and details about the approximate number of attendees and number of submissions.

A list of other related workshops held previously at related conferences, if any (list does not have to be exhaustive), together with a brief statement on how the proposed workshop differs from or how it follows-up on work presented at previous workshops.

A short bio for each member of the organizing committee, including a description of their relevant expertise. Strong proposals include organizers who bring differing perspectives to the workshop topic and who are actively connected to the communities of potential participants.

A tentative website address (workshop organizers will have to provide their own workshop websites, see schedule below).

Please email your proposal in a single file to the workshop chairs at workshops@icwsm.org before the deadline. For additional information please contact the workshop chairs at the same address.

Important Dates

(All deadlines are on 23:59:59 anywhere in the world)

~~Workshop proposal submission deadline: February 1, 2020~~
Workshop acceptance notification: February 15, 2020
Workshop websites online: March 1, 2020
Submissions due to individual workshops (suggested deadline): April 1, 2020
Submissions acceptance notifications for individual workshops (on or before): May 1, 2020
Camera Ready Submissions: TBA
ICWSM-2020 Workshops Day: June 8th, 2020

Our schedule has been determined by a number of factors, including allowing ease between paper notification deadlines and visa application timelines. Please be mindful that unless stated otherwise these are not suggested dates.

Stevie Chancellor, Kiran Garimella, and Katrin Weller
(ICWSM-2020 Workshop Chairs | workshops@icwsm.org)

Tutorials

Submission Deadline: February 1, 2020 (by 23:59 PM Anywhere on Earth)
Proposal Submission Email: tutorials@icwsm.org

ICWSM-2020 invites proposals for Tutorials Day at the 14th International AAAI Conference on Web and Social Media (ICWSM). ICWSM-2020 is seeking proposals for tutorials on topics related to the analysis and understanding of social phenomena in the following themes:

Course: Traditional tutorials to teach concepts, methodologies, tools, and software packages. We encourage hands on tutorials that promote interactivity.
Translation: Tutorials that aim to translate concepts between disciplines. For example, such a tutorial could introduce social science concepts to computer scientists, or computational concepts to social scientists. Thus, these tutorials should be geared towards a beginner audience.
Case study: Focused tutorials that emphasize on real world applications of ICWSM work. These tutorials should walk the audience through how research insights and tools were applied in practice. We welcome submissions from practitioners from the industry, government, and NGO in addition to academics.

We welcome tutorials of various lengths (including 45 minutes, 90 minutes, or a half day). We are looking for contributions from experts in both the social and computational sciences, in industry, academia, and beyond. For a list of 2018 and 2019 tutorials from last years, visit here and here.

Acceptance criteria

The tutorial format will be determined by the tutorial organizers. The proposals will be selected considering the following criteria:

Cross-pollination potential. Tutorials that attract an interdisciplinary audience will be given preference. The proposals should highlight, when applicable, the tutorial potential to transfer knowledge from one discipline/area to another.
Interactivity. We will favor tutorials that aim to include hands-on experiences, collaborative approaches, and interactivity.

Proposals of tutorials presented at past events are allowed, although novelty is a plus.

Tutorial proposal content and format

Proposals for tutorials should be no more than three (3) pages in length (as much as possible, please try to use AAAI Author Kit to format your submission; the author kit is available here). The submissions should include the following:

Tutorial title and summary. A short description (200 words) of the main objective of the tutorial to be published on the main ICWSM website. Please indicate the type of tutorial you are proposing: course, translation, or case study.
Names, affiliations, emails, and personal websites of the tutorial organizers. A main contact author should be specified. A typical proposal should include no more than than three presenters (more people can be involved in the organization).
Duration. A short timeline description of how you plan to break down the material over 45 mins, 90 mins, or half-day. Please mention here the proposed duration, but keep in mind that the Tutorial Chairs might conditionally accept a proposal and suggest a different duration to best fit the organization of the whole event.
Tutorial schedule and activities. A description of the proposed tutorial format, a schedule of the proposed activities (e.g., presentations, interactive sessions) along with a *detailed* description for each of them.
Target audience, prerequisites and outcomes. A description of the target audience, the prerequisite skill set for the attendee (if any) as well as a brief list of goals for the tutors to accomplish by the end of the tutorial..
Materials. The organizers of accepted tutorials will be required to set up a web page containing all the information for the tutorial attendees before the tutorial day (roughly 2 weeks before the tutorial day). The proposal should contain the list of materials that will be made available on the website.
Precedent [when available]. A list of other tutorials held previously at related conferences, if any, together with a brief statement on how it follows-up on previous events. If the authors of the proposal have organized other tutorials in the past, pointers to the relevant material (e.g., slides, videos, web pages, code) should be provided.

Submissions must be in PDF to the submission email (tutorials@icwsm.org). Pre-submission questions can be sent to the chairs at the same address (tutorials@icwsm.org)

Important Dates

Tutorials Day: June 8th, 2020

Tanushree Mitra, Alexandra Olteanu, and Chenhao Tan
(ICWSM-2020 Tutorial Chairs | tutorials@icwsm.org)

Data Challenge

ICWSM0-2020 is hosting the first ICWSM data challenge to bring together researchers from across disciplines to solve societally-relevant problems together as a community. This will be enabled by fostering collaboration and exchange of ideas in a structured setting. This year’s data challenge theme is Safety. To achieve this, we invite participants to work on two pertinent datasets in the areas of Misinformation and Abusive behavior in social media.

For more details, please visit: ICWSM-2020 Data Challenge Website

Important Dates

Data Challenge Opens: February 21, 2020
Submission Deadline: ~~April 25, 2020~~ May 15, 2020
Notification of acceptance: ~~May 1, 2020~~ May 25, 2020
Data Challenge Full Day Workshop: June 8, 2020

Surya Kallumadi, Srijan Kumar, and Diyi Yang
(ICWSM-2020 Data Challenge Chairs)

Detailed Guidelines for Paper Submission

Content Guidelines

Format: Papers must be in high resolution PDF format, formatted for US Letter (8.5" x 11") paper, using Type 1 or TrueType fonts. Full papers are recommended to be 8 pages and must be at most 10 pages long, including references and any appendix. Revision papers and final camera-ready full papers can be up to 12 pages. Poster papers must be no longer than 4 pages, and demo descriptions must be no longer than 2 pages, and all must be submitted by the deadlines given above, and formatted in AAAI two-column, camera-ready style. No source files (Word or LaTeX) are required at the time of submission for review; only the PDF file is permitted. Finally, the copyright slug may be omitted in the initial submission phase and no copyright form is required until a paper is accepted for publication.

Anonymity: ICWSM-2020 review is double-blind. Therefore, please anonymize your submission: do not put the author(s) names or affiliation(s) at the start of the paper, and do not include funding or other acknowledgments in papers submitted for review. References to authors' own prior relevant work should be included, but should not specify that this is the authors' own work. Citations to the author's own work should be anonymized, if possible, or can be added later to the final camera-ready version for publication. It is up to the authors' discretion how much to further modify the body of the paper to preserve anonymity. The requirement for anonymity does not extend outside of the review process, e.g. the authors can decide how widely to distribute their papers over the Internet before the program committee meeting. Even in cases where the author's identity is known to a reviewer, the double-blind process will serve as a symbolic reminder of the importance of evaluating the submitted work on its own merits without regard to authors' reputation. Note that 2-page demo submissions and the dataset paper submissions, and only these, are exempt from the anonymization requirement as they often contain system URLs or URLs to data sharing services.

Language: All submissions must be in English.

Revisions: Papers that were previously submitted to ICWSM and received a "Revise and Resubmit " decision should be accompanied by a copy of the previous reviews and an author response statement. The response statement may be in any format, but many reviewers appreciate a response that begins with an overall summary and then includes a table, with each row containing a reviewer comment in the left cell, and author's response in the right cell. The response cell may explain why no changes were made, or may describe changes and direct the reviewer to a particular page, section, or figure, where the revised content appears. At the discretion of the Senior PC member handling the paper, the revised version may be sent back to some or all of the original reviewers for comment and evaluation, and may also be sent to additional reviewers.

Optional publication for social sciences and sociophysics papers

Researchers who wish to submit full papers without publication in the conference proceedings, may designate their submission as 'social sciences and sociophysics (not for publication)'. Submissions must adhere to the formatting and content guidelines above. They will be reviewed according to the same process and criteria as all other full paper submissions. While we will not accept previously published papers, papers submitted as social sciences and sociophysics (not for publication) may be under review concurrently at a journal. Papers accepted to this track will be full presentations, integrated with the conference, but will be published only as abstracts in the ICWSM conference proceedings.

Submissions originally designated as not for publication cannot be converted at the end to publication in the ICWSM conference proceedings, because that would provide a mechanism enabling simultaneous consideration of the same paper for publication in two venues. Researchers who do wish to publish their papers in the ICWSM proceedings should submit to the regular track. All submitted papers, whether targeted for publication or not, will be judged according to the same acceptance criteria.

Duplicate Submission

ICWSM-2020 will not accept any paper that, at the time of submission, is under review for or has already been published or accepted for publication in a journal or conference. This restriction does not apply to submissions for non-archival workshops.

If duplicate submissions are identified during the review process then:

All submissions from that author will be disqualified from the current ICWSM conference;

And authors will not be permitted to submit papers to the ICWSM conference in the following year.

Conference Registration

At least one author must register for the conference by the deadline for camera-ready copy submission. In addition, the registered author must attend the conference to present the paper in person.

Publication

All accepted papers and extended abstracts will be published in the conference proceedings, except for those submitted to the 'social sciences and sociophysics (not for publication)'; only abstracts will be published for those. Though initial submissions of full papers must not exceed ten (10) pages, full papers accepted for publication will be allocated up to twelve (12) pages in the conference proceedings to facilitate to address comments raised by the reviewers. Authors will be required to transfer copyright to AAAI.

Datasets

ICWSM provides a service for hosting datasets pertaining to research presented at the conference. Authors of accepted papers will be encouraged to share the datasets on which their papers are based, while adhering to the terms and conditions of the data provider. Of these datasets, one will be selected for an award which will be based on the quality, scope, and timeliness of each dataset. More information will be available on our website.

Venue

Conference Venue

~~Georgia Tech Hotel and Conference Center~~ ~~will be hosting ICWSM-2020~~.
ICWSM-2020 is planned to be a virtual meeting. More details here.

Registration

ICWSM-20 Registration

Online registration is now available at https://aaaiconf.cvent.com/icwsm20.
The ICWSM-2020 technical conference registration fee includes admission to the Workshop/Tutorial Day, all technical sessions, and access to the electronic version of the ICWSM-2020 Conference Proceedings.
Registration Deadline: ~~May 22, 2020~~ June 07, 2020

ICWSM-2020 is committed to anti-racist practices. We are offering 20 registration fee waivers for Black scholars and scholars who are members of other marginalized communities around the world. Apply here

ICWSM-20 Registration Fees:

AAAI Member: $175
AAAI Student Member: $100
Nonmember: $195
Nonmember Student: $115

Silver:
Includes Discounted conference registration, plus a one-year online new or renewed membership in AAAI.
Silver Regular: $274
Silver Student: $149

Workshop/Tutorial Day

Tutorial Information: Link.
Workshop Information: Link.
ICWSM-20 workshops and tutorials will be held June 8, just prior to the technical conference. Technical registrants may sign up for any combination of workshops and/or tutorials on June 8 as part of their technical registration. For those wishing to attend only the Workshop/Tutorial Day, a Workshop/Tutorial Day Only registration is offered. PARTICIPANTS SHOULD NOT SIGN UP FOR CONCURRENT EVENTS, so please consult the schedule carefully before making your selections.

Workshop/Tutorial Day Only Fee:

Regular: $45
Student: $25

Registration / Proof of Student Status

To register online, please complete this form. Students will be required to submit proof of student status during the registration process.

Refund Requests

The deadline for refund requests is May 29, 2020. All refund requests must be made in writing to AAAI at icwsm20@aaai.org. A $50.00 processing fee will be assessed for all refunds.

Schedule

Download the ICWSM-2020 Technical Program Schedule here

Keynotes

Should you believe Wikipedia?
How social epistemology can help us be better internet researchers

Social computing researchers increasingly need to question the nature of "truth" in our day-to-day work. In this talk, I’ll review ideas from philosophy, especially social epistemology, to give us practical, working definitions of "truth" and "knowledge." I'll apply these ideas to explain why a popular Wikipedia page is arguably the most reliable form of information ever created. Finally, I’ll explain how these ideas help us to be better social computing researchers, and more thoughtfully address controversies about what kind of content is appropriate.

Amy Bruckman is Professor and Senior Associate Chair in the School of Interactive Computing at the Georgia Institute of Technology. Her research focuses on social computing, with interests in collaboration, social movements, content moderation, and internet research ethics. Bruckman is chair of the ACM CSCW Steering Committee. She is an ACM Fellow and a member of the ACM CHI Academy. Bruckman received her Ph.D. from the MIT Media in 1997, and a B.A. in physics from Harvard University in 1987. Her book “Should You Believe Wikipedia?” is forthcoming from Cambridge University Press in 2021.

How Does it Feel to be a Problem?: Race, Blackness & Our Technological Past, Present, Future

A century ago, Sociologist W.E.B. DuBois posed the question, "How does it feel to be a problem?" speaking then of the Negro's place in the world. But what does it mean to be a problem in the context of computing technology? I argue that the relationship between African Americans and other people of color to computing technology has been mediated by "the problem." In this talk I tell the story of how Black people became the nation's (U.S.) first problem for computers to solve, the consequences of that relationship, and the lessons it provides about how we frame today's technological solutions for the future.

Author of the new book Black Software: The Internet & Racial Justice, From the Afronet to Black Lives Matter, Charlton McIlwain is Vice Provost for Faculty Engagement & Development at New York University and Professor of Media, Culture, and Communication. His work focuses on the intersections of computing technology, race, inequality, and racial justice activism. In addition to Black Software, McIlwain has authored Racial Formation, Inequality & the Political Economy of Web Traffic, in the journal Information, Communication & Society, and co-authored, with Deen Freelon and Meredith Clark, the recent report Beyond the Hashtags: Ferguson, #BlackLivesMatter, and the Online Struggle for Offline Justice. He recently testified before the U.S. House Committee on Financial Services about the impacts of automation and artificial intelligence on the financial services sector.

Papers

ICWSM-2020 Conference Proceedings: here

Full Papers

A Quantitative Approach to Understanding Online Antisemitism

Savvas Zannettou, Joel Finkelstein, Barry Bradlyn, Jeremy Blackburn

A new wave of growing antisemitism, driven by fringe Web communities, is an increasingly worrying presence in the socio-political realm.The ubiquitous and global nature of the Web has provided tools used by these groups to spread their ideology to the rest of the Internet.Although the study of antisemitism and hate is not new, the scale and rate of change of online data has impacted the efficacy of traditional approaches to measure and understand these troubling trends.In this paper, we present a large-scale, quantitative study of online antisemitism.We collect hundreds of million posts and images from alt-right Web communities like 4chan's Politically Incorrect board (/pol/) and Gab.Using scientifically grounded methods, we quantify the escalation and spread of antisemitic memes and rhetoric across the Web.We find the frequency of antisemitic content greatly increases (in some cases more than doubling) after major political events such as the 2016 US Presidential Election and the "Unite the Right" rally in Charlottesville.We extract semantic embeddings from our corpus of posts and demonstrate how automated techniques can discover and categorize the use of antisemitic terminology.We additionally examine the prevalence and spread of the antisemitic "Happy Merchant'" meme, and in particular how these fringe communities influence its propagation to more mainstream communities like Twitter and Reddit.Taken together, our results provide a data-driven, quantitative framework for understanding online antisemitism.Our methods serve as a framework to augment current qualitative efforts by anti-hate groups, providing new insights into the growth and spread of hate online.

Aggressive, Repetitive, Intentional, Visible, and Imbalanced: Refining Representations for Cyberbullying Classification

Caleb Ziems, Fred Morstatter

Cyberbullying is a pervasive problem in online communities. To identify cyberbullying cases in large-scale social networks, content moderators depend on machine learning classifiers for automatic cyberbullying detection. However, existing models remain unfit for real-world applications, largely due to a shortage of publicly available training data and a lack of standard criteria for assigning ground truth labels. In this study, we address the need for reliable data using an original annotation framework. Inspired by social sciences research into bullying behavior, we characterize the nuanced problem of cyberbullying using five explicit factors to represent its social and linguistic aspects. We model this behavior using social network and language-based features, which improves classifier performance. These results demonstrate the importance of representing and modeling cyberbullying as a social phenomenon.

An Experimental Study of Structural Diversity in Social Networks

Jessica Tsu-Yun Su, Krishna Kamath, Aneesh Sharma, Johan Ugander, Sharad Goel

Several recent studies of online social networking platforms have found that adoption rates and engagement levels are positively correlated with structural diversity, the degree of heterogeneity among an individual's contacts as measured by network ties. One common theory for this observation is that structural diversity increases utility, in part because there is value to interacting with people from different network components on the same platform. While compelling, evidence for this causal theory comes from observational studies, making it difficult to rule out non-causal explanations. We investigate the role of structural diversity on retention by conducting a large-scale randomized controlled study on the Twitter platform. We first show that structural diversity correlates with user retention on Twitter, corroborating results from past observational studies. We then exogenously vary structural diversity by altering the set of network recommendations new users see when joining the platform; we confirm that this design induces the desired changes to network topology. We find, however, that low, medium, and high structural diversity treatment groups in our experiment have comparable retention rates. Thus, at least in this case, the observed correlation between structural diversity and retention does not appear to result from a causal relationship, challenging theories based on past observational studies.

Attention-based Explanations Don't Help Humans Recognize Online Toxicity

Samuel Carton, Qiaozhu Mei, Paul Resnick

We present an experimental assessment of the impact of feature attribution-style explanations on human performance in predicting the consensus toxicity of social media posts with advice from an error-prone machine learning model. By doing so we add to a small but growing body of literature inspecting the true utility of interpretable machine learning in terms of human outcomes. We also evaluate interpretable machine learning for the first time in the important domain of online toxicity, where fully-automated methods have faced criticism as being inadequate as a measure of toxic behavior.We find that, contrary to expectations, explanations have no significant impact on accuracy or agreement with model predictions, through they do change the distribution of subject error somewhat while reducing the cognitive burden of the task for subjects. Our results contribute to the recognition of an intriguing expectation gap in the field of interpretable machine learning between the general excitement the field has engendered and the ambiguous results of recent experimental work, including this study.

Auditing News Curation Systems: A Case Study of Apple News

Jack Bandy, Nicholas Diakopoulos

This work presents an audit study of Apple News as a sociotechnical news curation system that exercises gatekeeping power in the media. We examine the mechanisms behind Apple News as well as the content presented in the app, outlining the social, political, and economic implications of both aspects. We focus on the Trending Stories section, which is algorithmically curated, and the Top Stories section, which is human-curated. Results from a crowdsourced audit showed minimal content personalization in the Trending Stories section, and a sock-puppet audit showed no location-based content adaptation. Finally, we perform an extended two-month data collection to compare the human-curated Top Stories section with the algorithmically-curated Trending Stories section. Within these two sections, human curation outperforms algorithmic curation in several measures of source diversity, concentration, and evenness. Furthermore, algorithmic curation featured more ``soft news'' about celebrities and entertainment, while editorial curation featured more news about policy and international events. To our knowledge, this study provides the first dat

Auditing Race and Gender Discrimination in Online Housing

Joshua T Asplund, Motahhare Eslami, Professor Hari Sundaram, Christian Sandvig, Karrie Karahalios

While researchers have developed rigorous practices for offline housing audits to enforce the US Fair Housing Act, the online world lacks similar practices. In this work we lay out principles for developing and performing online fairness audits. We demonstrate a controlled sock-puppet audit technique for building online profiles associated with a specific demographic profile or intersection of profiles, and describe the requirements to train and verify profiles of other demographics. We also present two audits using these sock-puppet profiles. The first audit explores the number and content of housing-related ads served to a user. The second compares the ordering of personalized recommendations on major housing and real-estate sites. We examine whether the results of each of these audits exhibit indirect discrimination: whether there is correlation between the content served and users' protected features, even if the system does not know or use these features explicitly. Our results show differential treatment in the number and type of housing ads served based on the user's race, as well as bias in property recommendations based on the user's gender. We believe this framework provides a compelling foundation for further exploration of housing fairness online.

Behind the Mask: A Computational Study of Anonymous' Presence on Twitter

Keenan Jones, Jason R.C.Nurse, Shujun Li

The hacktivist group Anonymous is unusual in its public-facing nature. Unlike other cybercriminal groups, which rely on secrecy and privacy for protection, Anonymous is prevalent on the social media site Twitter. In this paper we re-examine some key findings reported in past small-scale qualitative studies of the group via a large-scale computational analysis of Anonymous on Twitter. We specifically refer to reports which reject the group's claims of leaderlessness, and indicate a fracturing of the group after the arrests of key members in 2011-2013. In our research, we present the first attempts to use machine learning to identify and analyse the presence of a network of over 20,000 Anonymous accounts spanning from 2008-2019 on the Twitter platform. In turn, this research utilises social network analysis (SNA) and centrality measures to examine influence within this large network, thus helping to provide a computational perspective on the findings of smaller-scale, more qualitative studies. Moreover, we present the first study of tweets from some of the identified `key' influencer accounts, through the use of topic modelling, finding a similarity in overarching topics of discussion between these influential accounts. Findings which further support the claims of smaller-scale, qualitative studies of the Anonymous collective.

Beyond Positive Emotion: Deconstructing Happy Moments based on Writing Prompts

Kokil Jaidka, Niyati Chhaya, Saran Mumick, Matthew Killingsworth, Alon Halevy, Lyle Ungar

The widespread adoption of social media has improved researchers' access to unsolicited expressions and behaviors. However, most of the work in analyzing these expressions is collected artificially in an keyword search, and focuses on predicting sentiment or emotional content rather than understanding a deeper psychological state, such as happiness. This study looks beyond positive emotion in modeling descriptions of happy moments collected through writing prompts. It is the first effort to distinguish the personal agency and social interaction in writings about happiness, which do not yet have an exact equivalent concept in existing text-based approaches. We report that state of the art approaches for emotion detection have different topical characteristics, and do not generalize well to detect happiness in our dataset. Language models trained on the happy moments' dataset, on the other hand, generalize to social media writing and are a valid approach for downstream tasks, such as predicting life satisfaction from social media posts.

Bridging Qualitative and Quantitative Methods for User Modeling: Tracing Cancer Patient Behavior in an Online Health Community

Zachary Levonian, Drew Richard Erikson, Wenqi Luo, Saumik Narayanan, Sabirat Rubya, Prateek Vachher, Loren Terveen, Svetlana Yarosh

Researchers construct models of social media users to understand human behavior and deliver improved digital services. Such models use conceptual categories arranged in a taxonomy to classify unstructured user text data. In many contexts, useful taxonomies can be defined via the incorporation of qualitative findings, a mixed-methods approach that offers the ability to create qualitatively-informed user models. But operationalizing taxonomies from the themes described in qualitative work is non-trivial and has received little explicit focus. We propose a process and explore challenges bridging qualitative themes to user models, for both operationalization of themes to taxonomies and the use of these taxonomies in constructing classification models. For classification of new data, we compare common keyword-based approaches to machine learning models. We demonstrate our process through an example in the health domain, constructing two user models tracing cancer patient experience over time in an online health community. We identify patterns in the model outputs for describing the longitudinal experience of cancer patients and reflect on the use of this process in future research.

Call to Action: Needing a Better Performance Evaluation Framework for Fake News Classification Benchmarking

Lia Bozarth, Ceren Budak

The rising prevalence of fake news and its alarming downstream impact have motivated both the industry and academia to build a substantial number of fake news classification models, each with its unique architecture. Yet, the research community currently lacks a comprehensive model evaluation framework that can provide multifaceted comparisons between these models beyond the simple evaluation metrics such as accuracy or f1 scores. In our work, we examine a representative subset of classifiers using a very simple set of performance evaluation and error analysis steps. We demonstrate that model performance vary considerably based on i) dataset, ii) evaluation archetype, and iii) performance metrics. Additionally, classifiers also demonstrate a potential bias against small and conservative-leaning credible news sites. Finally, models' performance vary based on external shocks and article topic. In sum, our results highlight the {\it need} to move towards systematic benchmarking to build towards more accurate and better understood fake news classifiers.

Causal Factors of Effective Psychosocial Outcomes in Online Mental Health Communities

Koustuv Saha, Amit Sharma

Online mental health communities enable people to seek and provide support, and there is growing evidence showing the efficacy of participation in these communities to help cope with mental health distress. However, what factors of peer support lead to favorable psychosocial outcomes for individuals is less clear. Using a dataset of over 300K posts by ~39K individuals on an online community TalkLife, we present a longitudinal causal inference study to investigate the effect of several factors, such as adaptability, diversity, immediacy, and nature of support. Unlike typical causal inference studies that focus on the effect of each treatment, we focus on the outcome and address the reverse causal question of identifying treatments that may have led to the outcome, drawing on case-control studies in epidemiology. Specifically, we define the outcome as an aggregate of affective, behavioral, and cognitive psychosocial change and identify Case (most improved) and Control (least improved) cohorts of individuals. Considering supportive responses from peers as treatments, we evaluate the differences in the responses received by Case and Control individuals, per matched clusters of similar individuals. We find that effective support include complex language factors such as diversity, adaptability and language style, but simple indicators such as the quantity, immediacy or emotionality of support are not causally relevant. Our work bears methodological and design implications for online mental health platforms, and has the potential to guide suggestive interventions for peer supporters on these platforms.

Characterizing Collective Attention via Descriptor Context: A Case Study of Public Discussions of Crisis Events

Ian Stewart, Diyi Yang, Jacob Eisenstein

Collective attention, i.e. the public attention paid to a particular topic, is a key factor in understanding how emerging topics and breaking news spread in online discussions. In most research, collective attention on social media is measured via aggregate metrics, such as the number of posts that mention a given name. However, collective attention is expressed not only in frequency, but also in content: linguistic features of the description of events and names reflect how writers expect their readers to perceive the information being discussed. In this work, we conduct a large-scale language analysis of public online discussions of breaking news events on Facebook and Twitter, focusing on five recent crisis events. We examine how people refer to locations, focusing specifically on contextual descriptors, such as ``San Juan'' versus ``San Juan, Puerto Rico.'' We find that the use of such descriptors is associated with proxies for social and informational expectations, including macro-level factors like the location's global importance and micro-level factors like audience engagement. We also find a consistent decrease in descriptor context use over time at a collective level, particularly for less active authors. These insights provide evidence for theories about information expectations in public discussions, and they inform how researchers and crisis response organizations can better understand public perception of crisis events as they unfold.

Characterizing the Social News Sphere through User Co-Sharing Practices

Mattia Samory, Vartan Kesiz Abnousi, Tanushree Mitra

We describe the landscape of news sources which share social media audience. We focus on 639 news sources, both credible and questionable, and characterize them according to the audience that shares their articles on Twitter. Based on user co-sharing practices, what communities of news sources emerge? We find four groups: one is home to mainstream, high-circulation sources from all sides of the political spectrum; one to satirical, left-leaning sources; one to bipartisan conspiratorial, pseudo-scientific sources; and one to right-leaning, deliberate misinformation sources. Next, we measure which assessments of credibility, impartiality, and journalistic integrity correspond to social media readers’ choices of news sources, and uncover the multifaceted structure of the social news sphere. We show how news articles shared on Twitter differ across the four groups along linguistic and psycholinguistics measures. Further, we find that with a high degree of accuracy (~80%), we can classify in what news community an article belongs to. Our data-driven categorization of news sources will help to navigate the complex landscape of online news and has implications for social media platforms as well as for journalism scholars.

Characterizing the Use of Images by State-Sponsored Troll Accounts on Twitter

Savvas Zannettou, Barry Bradlyn, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

State-sponsored organizations are increasingly linked to efforts aimed to exploit social media for information warfare and manipulating public opinion. Typically, their activities rely on a number of social network accounts they control, aka trolls, that post and interact with other users disguised as “regular” users. These accounts often use images and memes, along with textual content, in order to increase the engagement and the credibility of their posts.In this paper, we present the first study of images shared by state-sponsored accounts by analyzing a ground truth dataset of 1.8M images posted to Twitter by accounts controlled by the Russian Internet Research Agency. First, we analyze the content of the images as well as their posting activity. Then, using Hawkes Processes, we quantify their influence on popular Web communities like Twitter, Reddit, 4chan’s Politically Incorrect board (/pol/), and Gab, with respect to the dissemination of images. We find that the extensive image posting activity of Russian trolls coincides with real-world events (e.g., the Unite the Right rally in Charlottesville), and shed light on their targets as well as the content disseminated via images. Finally, we show that the trolls were more effective in disseminating politics-related imagery than other images.

Characterizing User Content on a Multi-lingual Social Network

Pushkal Agarwal, Kiran Garimella, Sagar Joglekar, Nishanth Sastry, Gareth Tyson

Social media has been on the vanguard of political information diffusion in the 21st century. Most studies that look into disinformation, political influence and fake-news focus on mainstream social media platforms. This has inevitably made English an important factor in our current understanding of political activity on social media. As a result, there have been a very limited number of representative studies on a large section of the democratic world, including the largest, multilingual and multicultural democracy: India. In this paper we present our characterisation of a multilingual social network in India called ShareChat. We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019, across 14 languages. We investigate the cross lingual dynamics by clustering visually similar images together, and exploring how they move across language barriers. We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political memes and images, and posts from Hindi (and images having text in English) have the largest cross-lingual diffusion across ShareChat. In the case of memes that cross language barriers, we see that language translation is used to widen the accessibility. That said, we find cases where the same image is associated with very different text (and therefore meanings). This initial characterisation paves the way for more advanced pipelines to understand the dynamics of fake and political content in a multi-lingual and non-textual setting.

Communal quirks and circlejerks: Processes contributing to insularity in online communities

Kimberley R Allison, Kay Bussey

Online communication offers the potential for bridging connections, exposing users to new views and experiences by fostering socially heterogenous communities. However, in the absence of deliberate attempts to promote diversity, communities may tend towards insularity: a state where members and content are similar or homogenous, and where deviation from these norms is discouraged. This paper presents a taxonomy of processes contributing to insularity, synthesizing findings from a broader longitudinal interview study on engagement with online communities over time with previous literature. Using thematic analysis, sixteen processes were identified which were associated with four broad stages: formation (selective connections, network homophily, shared interests, audience segmentation); propagation (circlejerking, upholding community standards, avoiding conflict, tailoring content); reaction (individual avoidance, collective reaction, mocking deviance, derogating outsiders); and perpetuation (modelling, prior feedback, echo chambers, gatekeeping). These findings highlight the need to consider more diverse mechanisms by which communities become insular, and the role that platform features play in facilitating these processes.

Confidence Boost in Dyadic Online Teamwork: An Individual-Focused Perspective

Liye Fu, Andrew Wang, Cristian Danescu-Niculescu-Mizil

Individuals are often more confident in their solutions when working in teams than when working on their own. This confidence boost is observed even when it is not accompanied by a corresponding gain in performance, raising the question of what other factors might be responsible. We address this question by developing a large-scale experimental setting in the form of a two-player online game that allows us to track the confidence of individuals in naturally-occurring online collaborative tasks. This setting enables us to disentangle and compare the effects of different components of the collaborative process on the confidence of each team member. We show that confidence evaluations are subject to social influence: a low-confidence individual receives a confidence boost as a direct consequence of interacting with their teammate, and the extent of the boost depends more on the confidence, rather than on the competence, of the teammate. The resulting framework can enhance our understanding of confidence boost as an often overlooked byproduct of online teamwork, and has implications for designing better platforms for online teamwork that meet diverse collaborative objectives.

Detecting Troll Behavior via Inverse Reinforcement Learning: A Case Study of Russian Trolls in the 2016 US Election

Luca Luceri, Silvia Giordano, Emilio Ferrara

Since the 2016 US Presidential election, social media abuse has been eliciting massive concern in the academic community and beyond. Preventing and limiting the malicious activity of users, such as trolls and bots, in their manipulation campaigns is of paramount importance for the integrity of democracy, public health, and more. However, the automated detection of troll accounts is an open challenge. In this work, we propose an approach based on Inverse Reinforcement Learning (IRL) to capture troll behavior and identify troll accounts. We employ IRL to infer a set of online incentives that may steer user behavior, which in turn highlights behavioral differences between troll and non-troll accounts, enabling their accurate classification. We report promising results: the IRL-based approach is able to accurately detect troll accounts (AUC=89.1%). The differences in the predictive features between the two classes of accounts enables a principled understanding of the distinctive behaviors reflecting the incentives trolls and non-trolls respond to.

Diffusion of Scientific Articles across Online Media

Igor Zakhlebin, Emoke Agnes Horvat

Online platforms have become the primary source of information about scientific advances for the wider public. As the online dissemination of scientific findings increasingly influences personal decision-making and government action, there is a growing necessity and interest in studying how people disseminate research findings online beyond one individual platform. In this paper, we study the simultaneous diffusion of scientific articles across major online platforms based on 63 million mentions of about 7.2 million articles spanning a 7-year period. First, we find commonalities between people sharing science and other content such as news articles and memes. Specifically, we find recurring bursts in the coverage of individual articles with initial bursts co-occurring in time across platforms. This allows for a ranking of individual platforms based on the speed at which they pick up scientific information. Second, we explore specifics of sharing science. We reconstruct the likely underlying structure of information diffusion and investigate the transfer of information about scientific articles within and across different platforms. In particular, we (i) study the role of different users in the dissemination of information to better understand who are the prime sharers of knowledge, (ii) explore the propagation of articles between platforms, and (iii) analyze the structural virality of individual information cascades to place science sharing on the spectrum between pure broadcasting and actual peer-to-peer diffusion. Our work provides the broadest study to date about the sharing of science online and builds the basis for an informed model of the dynamics of research coverage across platforms.

Disturbed YouTube for Kids: Characterizing and Detecting Inappropriate Videos Targeting Young Children

Kostantinos Papadamou, Antonis Papasavva, Savvas Zannettou, Jeremy Blackburn, Nicolas Kourtellis, Ilias Leontiadis, Gianluca Stringhini, Michael Sirivianos

A large number of the most-subscribed YouTube channels target children of very young age. Hundreds of toddler-oriented channels on YouTube feature inoffensive, well produced, and educational videos. Unfortunately, inappropriate content that targets this demographic is also common. YouTube's algorithmic recommendation system regrettably suggests inappropriate content because some of it mimics or is derived from otherwise appropriate content. Considering the risk for early childhood development, and an increasing trend in toddler's consumption of YouTube media, this is a worrisome problem.In this work, we build a classifier able to discern inappropriate content that targets toddlers on YouTube with 84.3% accuracy, and leverage it to perform a first-of-its-kind, large-scale, quantitative characterization that reveals some of the risks of YouTube media consumption by young children. Our analysis reveals that YouTube is still plagued by such disturbing videos and its currently deployed counter-measures are ineffective in terms of detecting them in a timely manner. Alarmingly, using our classifier we show that young children are not only able, but likely to encounter disturbing videos when they randomly browse the platform starting from benign videos.

Driving the Last Mile: Characterizing and Understanding Distracted Driving Posts on Social Networks

Hemank Lamba, Shwetanshu Singh, Dheeraj Reddy Pailla, Shashank Srikanth, Karandeep Singh Juneja, Ponnurangam Kumaraguru

In 2015, 391,000 people were injured due to distracted driving in the US. One of the major reasons behind distracted driving is the use of cell-phones, accounting for 14% of fatal crashes. Social media applications have enabled users to stay connected, however, the use of such applications while driving could have serious repercussions - often leading the user to be distracted from the road and ending up in an accident. In the context of impression management, it has been discovered that individuals often take a risk (such as teens smoking cigarettes, indulging in narcotics, and participating in unsafe sex) to improve their social standing. Therefore, viewing the phenomena of posting distracted driving posts under the lens of self-presentation, it can be hypothesized that users often indulge in risk-taking behavior on social media to improve their impression among their peers. In this paper, we first try to understand the severity of such social-media-based distractions by analyzing the content posted on a popular social media site where the user is driving and is also simultaneously creating content. To this end, we build a deep learning classifier to identify publicly posted content on social media that involves the user driving. Furthermore, a framework proposed to understand factors behind voluntary risk-taking activity observes that younger individuals are more willing to perform such activities, and men (as opposed to women) are more inclined to take risks. Grounding our observations in this framework, we test these hypotheses on 173 cities across the world. We conduct spatial and temporal analysis on a city-level and understand how distracted driving content posting behavior changes due to varied demographics. We discover that the factors put forth by the framework are significant in estimating the extent of such behavior.

Empirical Analysis of Multi-Task Learning for Reducing Identity Bias in Toxic Comment Detection

Ameya Vaidya, Feng Mai, Yue Ning

With the recent rise of toxicity in online conversations on social media platforms, using modern machine learning algorithms for toxic comment detection has become a central focus of many online applications. Researchers and companies have developed a variety of shallow and deep learning models to identify toxicity in online conversations, reviews, or comments with mixed successes. However, these existing approaches have learned to incorrectly associate non-toxic comments that have certain trigger-words (e.g. gay, lesbian, black, muslim) as a potential source of toxicity. In this paper, we evaluate dozens of state-of-the-art models with the specific focus of reducing model bias towards these commonly-attacked identity groups. We propose a multi-task learning model with an attention layer that jointly learns to predict the toxicity of a comment as well as the identities present in the comments in order to reduce this bias. We then compare our model to an array of shallow and deep-learning models using metrics designed especially to test for unintended model bias within these identity groups.

Engagement Patterns of Peer-to-Peer Interactions on Mental Health Platforms

Ashish Sharma, Monojit Choudhury, Tim Althoff, Amit Sharma

Mental illness is a global health problem, but access to mental health care resources remain poor worldwide. Online peer-to-peer support platforms attempt to alleviate this fundamental gap by enabling those that struggle with mental illness to provide and receive social support from their peers. However, successful social support requires users to engage with each other and failures may have serious consequences for users in need. Our understanding of engagement patterns on mental health platforms is limited but critical to inform the role, limitations, and design of these platforms. Here, we present a large-scale analysis of engagement patterns of two popular online mental health platforms, TalkLife and Reddit. We leverage communication models in human-computer interaction and communication theory to operationalize a set of four engagement indicators based on attention and interaction. We then propose a generative model to jointly model the indicators of engagement, the output of which is synthesized into a novel set of 11 distinct, interpretable patterns. We demonstrate that this framework of engagement patterns enables informative evaluations and analysis of online support platforms. We show that mutual, back-and-forth, interactions are associated with significantly higher user retention rates on TalkLife. Further investigating the mutual interactions, we find that early response and post sentiment are important factors in bringing about mutual interactions.

Examining Peer-to-Peer and Patient-Provider Interactions on a Social Media Community Facilitating Ask the Doctor Services

Alicia Lynn Nobles, Eric Leas, Mark Dredze, John Ayers

Ask the Doctor (AtD) services provide patients the opportunity to seek medical advice using online platforms. While these services represent a new mode of healthcare delivery, study of these online health communities and how they are used is limited. In particular, it is unknown if these platforms replicate existing barriers and biases in traditional healthcare delivery across demographic groups. We present an analysis of AskDocs, a subreddit that functions as a public AtD plat- form. We examine the demographics of users, the health top- ics discussed, if biases present in offline healthcare settings exist on this platform, and how empathy is expressed in inter- actions between users and physicians. Our findings suggest a number of implications to enhance and support peer-to-peer and patient-provider interactions on online platforms.

Falling into the Echo Chamber: the Italian Vaccination Debate on Twitter

Alessandro Cossard, Gianmarco De Francisci Morales, Kyriaki Kalimeri, Yelena Mejova, Daniela Paolotti, Michele Starnini

The reappearance of measles in the US and Europe, a disease considered eliminated in early 2000s, has been accompanied by a growing debate on the merits of vaccination on social media. In this study we examine the extent to which the vaccination debate on Twitter is conductive to potential outreach to the vaccination hesitant. We focus on Italy, one of the countries most affected by the latest measles outbreaks. We discover that the vaccination skeptics, as well as the advocates, reside in their own distinct "echo chambers". The structure of these communities differs as well, with skeptics arranged in a tightly connected cluster, and advocates organizing themselves around few authoritative hubs. At the center of these echo chambers we find the ardent supporters, for which we build highly accurate network- and content-based classifiers (attaining 95% cross-validated accuracy). Insights of this study provide several avenues for potential future interventions, including network-guided targeting, accounting for the political context, and monitoring of alternative sources of information.

Generalized Euclidean Measure to Estimate Network Distances

Michele Coscia

Estimating the distance covered by a propagation phenomenon on a network is an important task: it can help us estimating the infectiousness of a disease or the effectiveness of an online viral marketing campaign. However, so far the only way to make such an estimate relies on solving the optimal transportation problem, or by adapting graph signal processing techniques. Such solutions are either inefficient, because they require solving a complex optimization problem; or fragile, because they were not designed with this problem in mind. In this paper, we propose a new generalized Euclidean approach to estimate distances between weighted groups of nodes in a network. We do so by adapting the Mahalanobis distance, incorporating the graph's topology via the pseudoinverse of its Laplacian. In experiments we see that this measure returns intuitive distances which agree with the ones a human would estimate. We also show that the measure is able to recover the infection parameter in an epidemic model, or the activation threshold in a cascade model. We conclude by showing that the measure can be used in online social media settings to identify fast-spreading behaviors. Our measure is also less computationally expensive.

Generating realistic interest-driven information cascades

Federico Cinus, Francesco Bonchi, Corrado Monti, André Panisson

We propose a model for the synthetic generation of information cascades in social media. In our model the information ``memes" propagating in the social network are characterized by a probability distribution in a topic space, accompanied by a textual description, i.e., a bag of keywords coherent with the topic distribution. Similarly, every person is described by a vector of interests defined over the same topic space. Information cascades are governed by the topic of the meme, its level of virality, the interests of each person, community pressure, and social influence. The main technical challenge we face towards our goal is the generation of realistic interest vectors, given a known network structure and a tunable level of homophily. We tackle this problem by means of a method based on non-negative matrix factorization, which is shown experimentally to outperform non-trivial baselines based on label propagation and random-walk-based graph embedding. As we showcase in our experiments, our model offers a small set of simple and easily interpretable ``knobs'' which allow to study, \emph{in vitro}, how each set of assumptions affects the resulting propagations. Finally, we show how to generate synthetic cascades that have similar macro-statistics to the real-world cascades for a dataset containing both the network and the cascades.

Gossip and Attend: Context-Sensitive Graph Representation Learning

Zekarias Tilahun Kefato, Sarunas Girdzijauskas

Graph (Network) representation learning (NRL) is a powerful technique for learning low-dimensional vector representation of high-dimensional and sparse graphs. Most studies explore the structure and metadata associated with the graph using random walks and employ an unsupervised or semi-supervised learning schemes. Learning in these methods is context-free, resulting in only a single representation per node. Recently studies have argued on the adequacy of a single representation and proposed context-sensitive approaches, which are capable of extracting multiple node representations for different contexts. This proved to be highly effective in applications such as link prediction and ranking. However, most of these methods rely on additional textual features that require complex and expensive RNNs or CNNs to capture high-level features or rely on a community detection algorithm to identify multiple contexts of a node. In this study we show that in-order to extract high-quality context-sensitive node representations it is not needed to rely on supplementary node feature, to employ computationally heavy and complex models. We propose GOAT, a context-sensitive algorithm inspired by gossip communication and a mutual attention mechanism simply over the structure of the graph. We show the efficacy of GOAT using 6 real-world datasets on link prediction and node clustering tasks and compare it against 12 popular and state-of-the-art (SOTA) base lines. GOAT consistently outperforms them and achieves up to ≈ 12% and ≈ 19% gain over the best performing methods on link prediction and clustering tasks, respectively.

Gravity of Location-based Service: Analyzing the Effects for Mobility Pattern and Location Prediction

Keiichi Ochiai, Yusuke Fukazawa, Wataru Yamada, Hiroyuki Manabe, Yutaka Matsuo

Predicting user location is one of the most important topics in data mining. Although human mobility is reasonably predictable for frequently visited places, novel location prediction is much more difficult. However, location-based services (LBSs) can influence users' choice of destination and can be exploited to more accurately predict user location even for new locations. In this study, we assessed the behavior difference for specific LBS users and non-users by using large-scale check-in data. We found a remarkable difference between specific LBS users and non-users (e.g., check-in locations) that had previously not been revealed. Then, we proposed a location prediction method exploiting the characteristics of check-in locations and analyzed how specific LBS usage influences location predictability. We assumed that users who use the same LBS tend to visit similar locations. The results showed that the novel location predictability of specific LBS users is up to 43.9% higher than that of non-users.

Hierarchical Propagation Networks for Fake News Detection: Investigation and Exploitation

Kai Shu, Deepak Mahudeswaran, Suhang Wang, Huan Liu

Consuming news from social media is becoming increasingly popular. However, social media also enables the wide dissemination of fake news. Because of the detrimental effects of fake news, fake news detection has attracted increasing attention. However, the performance of detecting fake news only from news content is generally limited as fake news pieces are written to mimic true news. In the real world, news pieces spread through propagation networks on social media. The news propagation networks usually involve multi-levels. In this paper, we study the challenging problem of investigating and exploiting news hierarchical propagation network on social media for fake news detection. In an attempt to understand the correlations between news propagation networks and fake news, first, we build hierarchical propagation networks for fake news and true news pieces; second, we perform a comparative analysis of the propagation network features from structural, temporal, and linguistic perspectives between fake and real news, which demonstrates the potential of utilizing these features to detect fake news; third, we show the effectiveness of these propagation network features for fake news detection. We further validate the effectiveness of these features from feature important analysis. We conduct extensive experiments on real-world datasets and demonstrate the proposed features can significantly outperform state-of-the-art fake news detection methods by at least 1.7% with an average F1>0.84. Altogether, this work presents a data-driven view of hierarchical propagation network and fake news and paves the way towards a healthier online news ecosystem.

Higher Ground? How Groundtruth Labeling Impacts Our Understanding of the Spread of Fake News During the 2016 Election

Lia Bozarth, Aparajita Saraf, Ceren Budak

The spread of fake news in online social media platforms has garnered much public attention and apprehension. Consequently, both the industry and academia alike are investing increased effort to understand, detect, and curb fake news. Yet, researchers differ in what they consider to be fake news sites. In this paper, we first aggregate 5 distinct lists of fake news sites, and 3 lists of mainstream news sites published by experts and reputable organizations. Then, using each pair of fake and mainstream news lists as an independent groundtruth, we examine i) the prevalence and ii) temporal characteristics of fake news as well as iii) the agenda-setting differences between fake and mainstream news sites. We observe that depending on the groundtruth, the prevalence of fake news varies significantly. However, the temporal trends and agenda-setting differences between fake and mainstream news sites remain moderately consistent across different groundtruth lists.

Hyperpartisanship, Disinformation and Political Conversations on Twitter: The Brazilian Presidential Election of 2018

Raquel Recuero, Felipe Bonow Soares, Anatoliy Gruzd

This paper examines the role of hyperpartisanship and polarization on Twitter during the 2018 Brazilian Presidential Election. Based on a mixed-methods approach, we collected and analyzed a dataset of over 8 million tweets about Jair Bolsonaro, a far-right candi-date from the Social Liberty Party. Our results show that there is a strong connection between polarization, hyperpartisanship and disinformation. As the centrality of hyperpartisan outlets on Twitter grew, more traditional media outlets became less central and conversations became more polarized. We also confirmed that hyperpartisan outlets often shared disinformation or biased information, presented as a “truth-telling” alterna-tive to journalistic outlets. And while disinformation was more frequently observed in the far-right group, it was also present in the anti-Bolsonaro cluster, especially towards the runoff period.

Identifying and Quantifying Coordinated Manipulation of Upvotes and Downvotes in Naver News Comments

Jiwan Jeong, Jeong-han Kang, Sue Moon

Today, many news sites let users write comments on news articles, rate others' comments by upvoting and downvoting, and order the comments by the rating. Top-rated comments are placed right below the news article and read widely, reaching a large audience and wielding great influence. As their importance grew, upvotes and downvotes are increasingly manipulated by coordinated efforts in order to push certain comments to the top. Boosted comments are often well-written, but they represent biased opinions, making the manipulated consensus to be seen as public opinion.In this paper, we analyze comment sections of articles targeted by coordinated efforts and identify a trace of vote manipulation. Based on the findings, we propose a parameterized classifier that distinguishes comment threads affected by coordinated voting. Using the classifier and our choice of parameters, we have examined six years of the entire commenting history on a leading news portal in South Korea. Manual inspection with side-channel information could only identify hundreds of targeted articles. With our classifier, we have identified more than ten thousands of comment threads with a high likelihood of manipulation. We report that this type of coordinated manipulation increased significantly in recent years.

Identity-Based Roles in Rhizomatic Social Justice Movements on Twitter

Judeth Oden Choi, James D. Herbsleb, Jessica Hammer, Jodi Forlizzi

Contemporary social movements can be understood as rhizomatic,growing laterally without a central structure. In thismixed methods study, we investigate the stable roles that activistsdevelop based on their personal and professional identitiesand carry with them through the dynamic landscape ofrhizomatic social movements on Twitter. We conduct interviewswith self-identified social justice activists and analyzeseven weeks of their Twitter timeline and retweets. We findthree activist roles–organizer, storyteller and advocate–anddescribe the professional identities, approaches to activism,behaviors on Twitter, and the relationship to social movementsfor each role. We use these roles as a lens to betterunderstand how movement identities are constructed, lay outan agenda for future research on roles in rhizomatic movementsand suggest design directions.

Influence Maximization using Influence and Susceptibility Embeddings

George Panagopoulos, Fragkiskos Malliaros, Michalis Vazirgiannis

Finding a set of users that can maximize the spread of information in a social networkis an important problem in social media analysis -- being a critical part of several real-world applications such as viral marketing, political advertising and epidemiology. Although influence maximization has been studied extensively in the past,the majority of works focus on the algorithmic aspect of the problem, overlooking several practical improvements that can be derived by data-driven observations or the inclusion of machine learning.The main challenges of realistic influence maximization is on the one hand the computational demand of the diffusion models' repetitive simulations, and on the other the accuracy of the estimated influence spread.In this work, we propose L-CELFIE, an influence maximization method that utilizes learnt influence representations from diffusion cascades to overcome the use of diffusion models.It comprises of two parts. The first is based on inf2vec, an unsupervised learning model that embeds influence relationships between nodes from a set of diffusion cascades. We create a new version of the model,based on observations from influence analysis on a large scale dataset, to match the scalability needs and the purpose of influence maximization.The second part capitalizes on the learned representations to redefine the traditional live-edge model sampling for the computation of the marginal gain.For evaluation, we apply our method in the Sina Weibo and MAG-CS datasets, two large scale networks acompanied by diffusion cascades.We observe that our algorithm outperforms various baseline methods in terms of seed set quality and speed. In addition, the proposed inf2vec modification for influence maximization provides substantial computational advantages in the price of a minuscule loss in the influence spread.

Learning Cross-lingual Word Embeddings from Twitter via Distant Supervision

Jose Camacho-Collados, Yerai Doval, Eugenio Martínez Cámara, Luis Espinosa-Anke, Francesco Barbieri, Steven Schockaert

Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that it is possible to construct such representations by aligning independently learned monolingual embedding spaces, and that accurate alignments can be obtained even without external bilingual data. In this paper we explore a research direction which has been surprisingly neglected in the literature: leveraging noisy user-generated text to learn cross-lingual embeddings particularly tailored towards social media applications.While the noisiness and informal nature of the social media genre poses additional challenges to cross-lingual embedding methods, we find that it also provides key opportunities due to the abundance of code-switching and the existence of a shared vocabulary of emoji and named entities. Our contribution consists in a very simple post-processing step that exploits these phenomena to significantly improve the performance of state-of-the-art alignment methods.

Leveraging mobility flows from location-based services to test crime pattern theory

Cristina Kadar, Stefan Feuerriegel, Anastasios Noulas, Cecilia Mascolo

Crime has been previously explained by social characteristics of the residential population and, as stipulated by crime pattern theory, might also be linked to human movements of non-residential visitors. Yet a full empirical validation of the latter is lacking. The prime reason is that prior studies are limited to aggregated statistics of human visitors rather than mobility flows and, because of that, neglect the temporal dynamics of individual human movements. As a remedy, we provide the first work which studies the ability of granular human mobility in describing and predicting crime concentrations at an hourly scale. For this purpose, we propose the use of data from location-based services. This type of data allows us to trace individual transitions and, therefore, we succeed in distinguishing different mobility flows that (i) are incoming or outgoing from a neighborhood, (ii) remain within it, or (iii) refer to transitions where people only pass through the neighborhood. Our evaluation infers mobility flows by leveraging an anonymized dataset from Foursquare that includes almost 14.8 million consecutive check-ins in three major U.S. cities. According to our empirical results, mobility flows are significantly and positively linked to crime. These findings advance our theoretical understanding, as they provide confirmatory evidence for crime pattern theory. Furthermore, our novel use of digital location services data proves to be an effective tool for crime forecasting. It also offers unprecedented granularity when studying the connection between human mobility and crime.

Linking the Social and Academic Profiles of Researchers

Asmelash Teka Hadgu

People have presence across different information networks on the social web. The problem of user identity linking, is the task of establishing a connection between accounts of the same user across different networks. Solving this problem is useful for: personalized recommendations, cross platform data enrichment and verifying online information among others. In this paper, we propose a deep learning based approach that jointly models heterogeneous data: text content, network structure as well as profile names and images, in order to solve the user identity linking problem. We perform experiments on a real world problem of connecting the social profile (Twitter) and academic profile (DBLP) of researchers. Our experimental results show that our joint model outperforms state-of-the-art results that consider profile, content or network features only.

Local Trends in Global Music Streaming

Samuel Frederick Way, Jean Garcia-Gathright, Henriette Cramer

Audio streaming services have made it easier for countries around the world to listen to each other's music. This expansion in listeners' access to global content, however, has raised questions about streaming's impact on the import and export flows of music between countries and their preferences for local or global content. Here, we analyze five and a half years of all streaming data from Spotify, a global music streaming service, and find that preferences for local content have increased from 2014 through 2019, reversing previously noted trends. Perhaps correspondingly, the roles of both common official language and geographic proximity between countries have expanded during this period, particularly for younger audiences. Further, we show that these trends persist across different genres, listener age groups, and early- and late-adopters of streaming, providing new insights into this newest phase in the continued evolution of music and its impact on listeners around the world.

Measuring Edge Sparsity on Large Social Networks

Johnathan David Smith, My Thai

How strong are the connections between individuals? This is a fundamental question in the study of social networks. In this work, we take a topological view rooted in the idea of local sparsity to answer this question on large social networks to which we have only incomplete access. Prior approaches to measuring network structure are not applicable to this setting due to the strict limits on data availability. Therefore, we propose a new metric, the Edgecut Weight}, for this task. This metric can be calculated efficiently in an online fashion, and we empirically show that it captures important elements of communities. Further, we demonstrate that the distribution of these weights characterizes connectivity on a network. Subsequently, we estimate the distribution of weights on Twitter and show both a lack of strong connections and a corresponding lack of community structure.

MimicProp: Learning to Incorporate Lexicon Knowledge into Distributed Word Representation for Social Media Analysis

Muheng Yan, Yu-Ru Lin, Rebecca Hwa, Ali Mert Ertugrul, Meiqi Guo, Wen-Ting Chung

Lexicon-based methods and word embeddings are the two widely used approaches for analyzing texts in social media. The choice of an approach can have a significant impact on the reliability of the text analysis. For example, lexicons provide manually curated, domain-specific attributes about a limited set of words, while word embeddings learn to encode some loose semantic interpretations for a much broader set of words. Text analysis can benefit from a representation that offers both the broad coverage of word embeddings and the domain knowledge of lexicons. This paper presents MimicProp, a new graph-mode method that learns a lexicon-aligned word embedding. Our approach improves over prior graph-based methods in terms of its interpretability (i.e., lexicon attributes can be recovered) and generalizability (i.e., new words can be learned to incorporate lexicon knowledge). It also effectively improves the performance of downstream analysis applications, such as text classification.

Minimizing Interference and Selection Bias in Network Experiment Design

Zahra Fatemi, Elena Zheleva

Current approaches to A/B testing in networks focus on limiting interference, the concern that treatment effects can "spill over" from treatment nodes to control nodes and lead to biased causal effect estimation. Prominent methods for network experiment design rely on two-stage randomization, in which sparsely-connected clusters are identified and cluster randomization dictates the node assignment to treatment and control. Here, we show that cluster randomization does not ensure sufficient node randomization and it can lead to selection bias in which treatment and control nodes represent different populations of users. To address this problem, we propose a principled framework for network experiment design which jointly minimizes interference and selection bias. We introduce the concepts of edge spillover probability and cluster matching and demonstrate their importance for designing network A/B testing. Our experiments on a number of real-world datasets show that our proposed framework leads to significantly lower error in causal effect estimation than existing solutions.

Modeling and Measuring Expressed (Dis)belief in (Mis)information

Shan Jiang, Miriam Metzger, Andrew Flanagin, Christo Wilson

The proliferation of online misinformation has been raising increasing societal concerns about its potential consequences, e.g., polarizing the public, eroding trust in institutions. These consequences are framed under the public's susceptibility to such misinformation -- a narrative that needs further investigation and quantification. To this end, our paper proposes an observational approach to model and measure expressed (dis)beliefs in (mis)information by leveraging social media comments as a proxy. We collect a sample of tweets in response to misinformation and annotate them with (dis)belief labels, explore the dataset using lexicon-based methods, and finally build classifiers based on the state-of-the-art neural transfer-learning models. Under a domain-specific thresholding strategy, the best-performing unbiased classifier archives macro-F1 scores around 0.86 for disbelief and 0.80 for belief. Applying the classifier, we conduct a large-scale measurement study and show that, overall, 12%--15% social media comments express disbelief and 26%--20% express belief, with the left-bounds representing comments in response to true claims and the right-bounds to false ones. Our results also suggest an extremely slight time effect of falsehood awareness, a positive effect of fact-checks to false claims, and a difference in (dis)belief across social media platforms.

No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service

Casey Fiesler, Nate Beard, Brian C Keegan

Researchers from many different disciplines rely on social media data as a resource. Whereas some platforms explicitly allow data collection, even facilitating it through an API, others explicitly forbid automated or manual collection processes. A current topic of debate within the social computing research community involves the ethical (or even legal) implications of collecting data in ways that violate Terms of Service (TOS). Using a sample of TOS from over one hundred social media sites from around the world, we analyze TOS language and content in order to better understand the landscape of prohibitions on this practice. Our findings show that though these provisions are very common, they are also ambiguous, inconsistent, and lack context. By considering our analysis of the nature of these provisions alongside legal and ethical analysis, we propose that ethical decision-making for data collection should extend beyond TOS and consider contextual factors of the data source and research.

Pie Chart or Pizza: Identifying Chart Types and their Virality on Twitter

Pavlos Vougiouklis, Leslie Carr, Elena Simperl

We aim to understand how data, rendered visually as charts or infographics, "travels" on social media. To do so we propose a neural network architecture that is trained to distinguish among different types of charts, for instance line graphs or scatter plots, and predict how much they will be shared. This poses significant challenges because of the varying format and quality of the charts that are posted, and the limitations in existing training data. To start with, our proposed system outperforms related work in chart type classification on the ReVision corpus. Furthermore, we use crowdsourcing to build a new corpus, more suitable to our aims, consisting of chart images shared by data journalists on Twitter. We evaluate our system on the second corpus with respect to both chart identification and virality prediction, with promising results.

Quasi-experimental Designs for Assessing Response on Social Media to Policy Changes

Yijun Tian, Rumi Chunara

Regulation of tobacco products is rapidly evolving. Understanding public sentiment in response to changes is very important as authorities assess how to effectively protect population health. Social media systems are widely recognized to be useful for collecting data about human preferences and perceptions. However, how social media data may be used, in rapid policy change settings, given challenges of narrow time periods and specific locations and non-representative the population using social media is an open question. In this paper we apply quasi-experimental designs, which have been used previously in observational data such as social media, to control for time and location confounders on social media, and then use content analysis of Twitter and Reddit posts to illustrate the content of reactions to tobacco flavor bans and the effect of taxation on e-cigarettes. Conclusions distill the potential role of social media in settings of rapidly changing regulation, in complement to what is learned by traditional denominator-based representative surveys.

Quick, Community-Specific Learning: How Distinctive Toxicity Norms are Maintained in Political Subreddits

Ashwin Rajadesingan, Paul Resnick, Ceren Budak

Online communities about similar topics may maintain very different norms of interaction. Past research identifies many processes that contribute to maintaining stable norms, including self-selection, pre-entry learning, post-entry learning, and retention. We analyzed political subreddits that had distinctive, stable levels of toxic comments on Reddit, in order to identify the relative contribution of these four processes. Surprisingly, we find that the largest source of norm stability is pre-entry learning. That is, newcomers' first comments in these distinctive subreddits differ from those same people's prior behavior in other subreddits. Through this adjustment, they nearly match the toxicity level of the subreddit they are joining. We also show that behavior adjustments are community-specific and not broadly transformative. That is, people continue to post toxic comments at their previous rates in other political subreddits. Thus, we conclude that in political subreddits, compatible newcomers are neither born nor made-- they make local adjustments on their own.

REST: A thread embedding approach for identifying and classifying user-specified information in security forums

Joobin Gharibshah, Vagelis Papalexakis, Michalis Faloutsos

How can we extract useful information from a security forum?We focus on identifying threads of interest to a securityprofessional: (a) alerts of worrisome events, such asattacks, (b) offering of malicious services and products, (c)hacking information to perform malicious acts, and (d) usefulsecurity-related experiences. The analysis of security forumsis in its infancy despite several promising recent works.Novel approaches are needed to address the challenges in thisdomain: (a) the difficulty in specifying the “topics” of interestefficiently, and (b) the unstructured and informal natureof the text. We propose, REST, a systematic methodology to:(a) identify threads of interest based on a, possibly incomplete,bag of words, and (b) classify them into one of the fourclasses above. The key novelty of the work is a multi-stepweighted embedding approach: we project words, threads andclasses in appropriate embedding spaces and establish relevanceand similarity there. We evaluate our method with realdata from three security forums with a total of 164k posts and21K threads. First, REST is robustness to initial keyword selectioncan extend the user-provided keyword set and thus,it can recover from missing keywords. Second, REST categorizesthe threads into the classes of interest with superioraccuracy compared to five other methods: REST exhibits anaccuracy between 63.3-76.9%. We see our approach as a firststep for harnessing the wealth of information of online forumsin a user-friendly way, since the user can loosely specify herkeywords of interest.

See and Read: Detecting Depression Symptoms in Higher Education Students from Pictures and Captions Posted in Social Media

Paulo Mann, D.Sc. Aline Paes, Elton Hiroshi Matsushima Matsushima

Mental disorders such as depression and anxiety have been increasing at alarming rates in the worldwide population. Notably, the major depressive disorder has become a common problem among higher education students, aggravated, and maybe even occasioned, by the academic pressures they must face. While the reasons for this alarming situation remain unclear (although widely investigated), the student already facing this problem must receive treatment. To that, it is first necessary to screen the symptoms. The traditional way for that is relying on clinical consultations or answering questionnaires. However, nowadays, the data shared at social media is a ubiquitous source that can be used to detect the depression symptoms even when the student is not able to afford or search for professional care. Previous works have already relied on social media data to detect depression on the general population, usually focusing on either posted images or texts or relying on metadata. In this work, we focus on detecting the severity of the depression symptoms in higher education students, by comparing deep learning with feature engineering models induced from both the pictures and their captions posted on Instagram. The experimental results show that students presenting a BDI score higher than 20 can be detected with 0.92 of recall and 0.69 of precision in the best case, reached by a fusion model. Our findings show a potential of help on further investigation of depression, by bringing students at-risk to light, to guide them to access adequate treatment.

Semantic Representations of Purchase Intentions on Social Media as Predictors of Consumer Spending

Viktor Pekar

The paper addresses the problem of forecasting consumer expenditure from social media data. Previous research of the topic exploited the intuition that search engine traffic reflects purchase intentions and constructed predictive models of consumer behavior from search query volumes. In contrast, we derive predictors from explicit expressions of purchase intentions found in social media posts. Two types of predictors created from these expressions are explored: those based on word embeddings and those based on topical word clusters. We introduce a new clustering method, which takes into account temporal co-occurrence of words, in addition to their semantic similarity, in order to create predictors relevant to the forecasting problem. The predictors are evaluated against baselines that use only macroeconomic variables, and against models trained on search traffic data. Conducting experiments with three different regression methods on Facebook and Twitter data, we find that both word embeddings and word clusters help to reduce forecasting errors in comparison to purely macroeconomic models. In most experimental settings, the error reduction is statistically significant, and is comparable to error reduction achieved with search traffic variables.

Sentiment Paradoxes in Social Networks: Why Your Friends are More Positive Than You?

Xinyi Zhou, Shengmin Jin, Reza Zafarani

Most individuals consider their friends to be more positive than themselves, exhibiting a sentiment paradox. Psychological research attributes this paradox to human cognition bias. With the goal to understand this phenomenon, we study sentiment paradoxes in social networks. Our work shows that social connections (friends, followees, or followers) of users are indeed generally (not illusively) more positive than the users themselves. Five existing sentiment paradoxes are identified at different network levels ranging from triads to large-scale communities. Empirical and theoretical evidence are provided to verify the observed and expected existence of such sentiment paradoxes. By investigating the relationships between the sentiment paradox and other well-developed network paradoxes, i.e., friendship paradox and activity paradox, we found that user sentiments are positively correlated to their number of social connections while hardly to their social activity. Finally, we demonstrate how the validated sentiment paradoxes can be used in turn to predict user sentiments.

Social Media Relevance Filtering using Perplexity-based Positive-Unlabelled Learning

Sunghwan Mac Kim, Stephen Wan, Cecile Paris, Andreas Duenser

Internet user-generated data, like Twitter, offers data scientistsa public real-time data source that can provide insights,supplementing traditional data sources. However, identifyingrelevant data for such analyses can be time consuming. In thispaper, we introduce the Perplexity variant of our Positive-Unlabelled Learning (PPUL) framework as a means to performsocial media relevance filtering. We note that this taskis particularly well suited to a PU Learning approach. Wedemonstrate how perplexity can identify candidate examplesof the negative class, using language models. To learn suchmodels, we experiment with both statistical methods and aVariational Autoencoder. Our PPUL method generally outperformsstrong PU Learning baselines, which we demonstrateon five different datasets: the Hazardous Product Reviewdataset, two well known social media datasets, and tworeal case studies in relevance filtering. All datasets have manualannotations for evaluation and in each case, PPUL attainsstate-of-the-art performance, with gains ranging from 4 to17% improvement over competitive baselines. We show thatthe PPUL framework is effective when the amount of positiveannotated data is small, and it is appropriate for both contentthat is triggered by an event or a general topic of interest.

Source Attribution: Recovering the Press Releases Behind Health Science News

Ansel MacLaughlin, John Wihbey, David Smith

We explore the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an \textit{unobserved} source document. Specifically, we model the relationship between news articles and their press release sources using a dataset of 64,784 health science news articles and 23,068 press releases. We approach the problem at the sentence level and work with science journalism professors to develop a four point Likert scale describing the extent to which a news article sentence is derived from the content in the corresponding press release. Because manual annotation of news article - press release pairs is time-consuming, we turn to a mix of expert, non-expert, and heuristic-based annotation to label our dataset. After a small pilot study, which found that humans, when only able to view the text of the news article, struggle to identify which content is derived or not, we compare four different sentence regression models on the task. We find that modeling a sentence's context in the entire document is important, with the best performing model, a sequence regression model with BERT token representations, achieving a spearman's $\rho$ of 0.49 and $NDCG@1$ of 0.60 on the expert-labeled test set. Examining the model's predictions, we find that it successfully identifies copied or closely paraphrased sentences in articles with a mix of derived and original content, but struggles to differentiate between loosely paraphrased and original sentences in articles with mostly original writing.

Style Matters!: Investigating Linguistic Style in Online Communities

Padmini Srinivasan, Osama Khalid

Content has historically been the primary lens used to study the language of online communities. This paper instead focuses on style of communication. While we know that individuals have distinguishable styles, here we ask whether communities have distinguishable styles. Additionally, while prior work has relied on a narrow definition of style, we employ a broad definition involving 262 features to analyze the language style of 9 online communities from 3 social media platforms discussing politics, television and travel. We find that communities indeed have distinct styles. Also, style is an excellent predictor of group membership and is on average better than prediction using content while also being more resilient to reductions in training data.

The Effect of Homophily on Disparate Visibility of Minorities in People Recommender Systems

Francesco Fabbri, Francesco Bonchi, Ludovico Boratto, Carlos Castillo

Evaluating (and mitigating) the potential negative effects of algorithms has become a central issue in computer science. While research on algorithmic bias in ranking systems has dealt with disparate exposure of products or individuals, little attention has been devoted to the analysis of the disparate exposure of subgroups of the population. In this paper, we investigate the visibility of minorities in people recommender systems in social networks. Specifically, we consider a bi-populated social network, i.e., a graph where the nodes belong to two different groups (majority and minority) and, by applying state-of-the-art people recommenders, we analyze how disparate visibility can be amplified or mitigated by different levels of homophily within each subgroup. We start our analysis on real-world social graphs, where the two subgroups are defined by sensitive demographic attributes such as gender or age. Our findings suggest that the way and the extent in which people recommenders can produce disparate visibility on the two subgroups, might depend in large part on the level of homophily within the subgroups. In order to verify our preliminary findings, we then move our analysis to syntectic datasets, where we can control characteristics of the input social graph, such as the size of the minority and the level of homophily. Our results show that homophily plays a key role in promoting or reducing visibility for different subgroups under various combinations of dataset characteristics and recommendations algorithms.

The structure of U.S. college networks on Facebook

Jan Overgoor, Bogdan State, Lada A. Adamic

Anecdotally, social connections made in college have life-long impact. Yet knowledge of social networks formed in college remains episodic, due in large part to the difficulty and expense involved in collecting a suitable dataset for comprehensive analysis. To advance and systematize insight into college social networks, we describe a dataset of the largest online social network platform used by college students in the United States. We combine anonymized and aggregated Facebook data with College Scorecard data, campus-level information provided by U.S. Department of Education, to produce a dataset covering the 2008-2015 entry year cohorts for 1,156 U.S. colleges and universities, spanning 7.4 million students. To perform the difficult task of comparing these networks of different sizes we develop a new methodology. We compute features over sampled ego-graphs, train binary classifiers for every pair of graphs, and operationalize distance between graphs as predictive accuracy. Social networks of different year cohorts at the same school are structurally more similar to one another than to cohorts at other schools. Networks from similar schools have similar structures, with the public/private and graduation rate dimensions being the most distinguishable. We also relate school types to specific outcomes. For example, students at private schools have larger networks that are more clustered and with higher homophily by year. Our findings may help illuminate the role that colleges play in shaping social networks which partly persist throughout people's lives.

Top Comment or Flop Comment? Predicting and Explaining User Engagement in Online News Discussions

Julian Risch, Ralf Krestel

Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order, which neglects that some of them are more relevant to users and are better conversation starters. In this paper, we systematically analyze user engagement in the form of the upvotes and replies that a comment receives. Based on comment texts, we train a model to distinguish comments that have either a high or low chance of receiving many upvotes and replies. Our evaluation on user comments from TheGuardian.com compares recurrent and convolutional neural network models, and a traditional feature-based classifier. Further, we investigate what makes some comments more engaging than others. To this end, we identify engagement triggers and arrange them in a taxonomy. Explanation methods for neural networks reveal which input words have the strongest influence on our model's predictions. In addition, we evaluate on a dataset of product reviews, which exhibit similar properties as user comments, such as featuring upvotes for helpfulness.

Towards Automated Sexual Violence Report Tracking

Naeemul Hassan, Amrit Poudel, Jason Hale, Claire Hubacek, Khandakar Tasnim Huq, Shubhra Kanti Karmaker Santu, Syed Ishtiaque Ahmed

Tracking sexual violence is a challenging task. In this paper, we present a supervised learning-based automated sexual violence report tracking model that is more scalable, and reliable than its crowdsource based counterparts. We define the sexual violence report tracking problem by considering victim, perpetrator contexts and the nature of the violence. We find that our model could identify sexual violence reports with a precision and recall of 80.4% and 83.4%, respectively. Moreover, we also applied the model during and after the #MeToo movement. Several interesting findings are discovered which are not easily identifiable from a shallow analysis.

Towards Measuring Adversarial Twitter Interactions against Candidates in the US Midterm Elections

Yiqing Hua, Thomas Ristenpart, Mor Naaman

Adversarial interactions against politicians on social media such as Twitter have significant impact on society, and in particular both discourage people from seeking office and disrupt substantive political discussions online. In this study, we measure the adversarial interactions towards candidates during the run-up to the 2018 US general election. We gather a new dataset consisting of 1.7 million tweets involving candidates, one of the largest corpora focusing on political discourse. We then develop new techniques for detecting tweets with toxic content and the target of its hostility, which allows us to quantify adversarial interactions towards political candidates at scale. We go on to design a new algorithm to induce candidate-specific adversarial terms to capture more nuanced adversarial interactions that are in most other contexts not considered toxic. Together our techniques enable us to categorize the breadth of adversarial interactions seen in the election, including offensive name-calling, threats of violence, posting discrediting information, attacks on identity, and adversarial message repetition.

Towards Quantifying the Distance between Opinions

Saket Gurukar, Deepak Ajwani, Sourav Dutta, Juho Lauri, Srinivasan Parthasarathy, Alessandra Sala

Increasingly, critical decisions in public policy, governance, and business strategy rely on a deeper understanding of the needs and opinions of constituent members (e.g. citizens, shareholders). While it has become easier to collect a large number of opinions on a topic, there is a necessity for automated tools to help navigate the space of opinions. In such contexts understanding and quantifying the similarity between opinions is key. We find that measures based solely on text similarity or on overall sentiment often fail to effectively capture the distance between opinions. Thus, we propose a new distance measure for capturing the similarity between opinions that leverages the nuanced observation – similar opinions express similar sentiment polarity on specific relevant entities-of-interest. Specifically, in an unsupervised setting, our distance measure achieves significantly better Adjusted Rand Index scores (up to 56x) and Silhouette coefficients (up to 21x) compared to existing approaches. Similarly, in a supervised setting, our opinion distance measure achieves considerably better accuracy (up to 20% increase) compared to extant approaches that rely on text similarity, stance similarity, and sentiment similarity.

Trust me, I have a Ph.D.: A propensity score analysis on the halo effect of disclosing one’s offline social status in online communities

Kunwoo Park, Haewoon Kwak, Hyunho Song, Meeyoung Cha

Online communities adopt various reputation schemes to measure content quality. This study analyzes the effect of a new reputation scheme that exposes one's offline social status, such as an education degree, within an online community. We study two Reddit communities that adopted this scheme, whereby posts include tags identifying education status referred to as flairs, and we examine how the "transferred" social status affects the interactions among the users. We computed propensity scores to test whether flairs give ad-hoc authority to the adopters while minimizing the effects of confounding variables such as topics of content. The results show that exposing academic degrees is likely to lead to higher audience votes as well as larger discussion size, compared to the users without the disclosed identities, in a community that covers peer-reviewed scientific articles. In another community with a focus on casual science topics, exposing mere academic degrees did not obtain such benefits. Still, the users with the highest degree (e.g., Ph.D. or M.D.) were likely to receive more feedback from the audience. These findings suggest that reputation schemes that link the offline and online worlds could induce halo effects on feedback behaviors differently depending upon the community culture. We discuss the implications of this research for the design of future reputation mechanisms.

Two Computational Models for Analyzing Political Attention in Social Media

Libby Hemphill, Angela Mariana Schopke

Understanding how political attention is divided and over what subjects is crucial for research on areas such as agenda setting, framing, and political rhetoric. Existing methods for measuring attention, such as manual labeling according to established codebooks, are expensive and can be restrictive. We describe two computational models that automatically distinguish topics in politicians' social media content. Our models---one supervised classifier and one unsupervised topic model---provide different benefits. The supervised classifier reduces the labor required to classify content according to pre-determined topic list. However, tweets do more than communicate policy positions. Our unsupervised model uncovers both political topics and other Twitter uses (e.g., constituent service). These models are effective, inexpensive computational tools for political communication and social media research. We demonstrate their utility and discuss the different analyses they afford by applying both models to the tweets posted by members of the 115th U.S. Congress.

Understanding the Political Ideology of Legislators from Social Media Images

Nan Xi, Di Ma, Marcus Liou, Zachary Steinert-Threlkeld, Lefteris Anastasopoulos, Jungseock Joo

In this paper, we seek to understand how politicians use images to express ideological rhetoric through Facebook images posted by members of the U.S. House and Senate. In the era of social media, politics has become saturated with imagery, a potent and emotionally salient form of political rhetoric which has been used by politicians and political organizations to influence public sentiment and voting behavior for well over a century. To date, however, little is known about how images are used as form of political rhetoric. Using deep learning techniques to automatically predict Republican or Democratic party affiliation solely from the Facebook photographs of the members of the 114th U.S. Congress, we demonstrate that predicted class probabilities from our model function as an accurate proxy of the political ideology of images along a left--right (liberal--conservative) dimension. After controlling for the gender and race of politicians, our method achieves an accuracy of 59.28% from single photographs and 82.35% when aggregating scores from multiple photographs (up to 150) of the same person. To better understand image content distinguishing liberal from conservative images, we also performed in-depth content analyses on the photographs. Our findings suggest that conservatives tend to use more images supporting for status quo political institutions and hierarchy maintenance, featuring individuals from dominant social groups, and displaying greater happiness than liberals.

Understanding visual memes: an empirical analysis of text superimposed on memes shared on Twitter

Yuhao Du, Muhammad Aamir Masood, Kenneth Joseph

Visual memes have become an important mechanism through which ideologically potent and hateful content spreads on today's social media platforms. At the same time, they are also a mechanism through which we convey much more mundane things, like pictures of cats with strange accents. Little is known, however, about the relative percentage of visual memes shared by real people that fall into these, or other, thematic categories. The present work focuses on visual memes that contain superimposed text. We carry out the first large-scale study on the themes contained in the text of these memes, which we refer to as \emph{image-with-text} memes. We find that 30\% of the image-with-text memes in our sample which have identifiable themes are politically relevant and that these politically relevant memes are shared more often by Democrats than Republicans. We also find disparities in who expresses themselves via image-with-text memes, and images in general, versus other forms of expression on Twitter. The fact that some individuals use images with text to express themselves, instead of sending a plain text tweet, suggests potential consequences for the representativeness of analyses that ignore text contained in images.

Unsupervised User Stance Detection on Twitter

Kareem Darwish, Michaël Aupetit, Peter Stefanov, Preslav Ivanov Nakov

We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our framework has three major advantages over pre-existing methods, which are based on supervised or semi-supervised classification. First, we do not require any prior labeling of users: instead, we create clusters, which are much easier to label manually afterwards, e.g., in a matter of seconds or minutes instead of hours. Second, there is no need for domain- or topic-level knowledge either to specify the relevant stances (labels) or to conduct the actual labeling. Third, our framework is robust in the face of data skewness, e.g., when some users or some stances have greater representation in the data. We experiment with different combinations of user similarity features, dataset sizes, dimensionality reduction methods, and clustering algorithms to ascertain the most effective and most computationally efficient combinations across three different datasets (in English and Turkish). We further verified our results on additional tweet sets covering six different controversial topics. Our best combination in terms of effectiveness and efficiency uses retweeted accounts as features, UMAP for dimensionality reduction, and Mean Shift for clustering, and yields a small number of high-quality user clusters, typically just 2--3, with more than 98% purity. The resultant user clusters can be used to train downstream classification. Moreover, our framework is robust to variations in the hyper-parameter values and also with respect to random initialization.

Variation across Scales: Measurement Fidelity under Twitter Data Sampling

Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

A comprehensive understanding of data bias is the cornerstone of mitigating biases in social media research. This paper presents in-depth measurements of the effects of Twitter data sampling across different timescales and different subjects (entities, networks, and cascades). By constructing two complete tweet streams, we show that Twitter rate limit message is an accurate measure for the volume of missing tweets. Despite sampling rates having clear temporal variations, we find that the Bernoulli process with a uniform rate well approximates Twitter data sampling, and it allows to estimate the ground-truth entity frequency and ranking with the observed sample data. In terms of network analysis, we observe significant structure changes in both the user-hashtag bipartite graph and the retweet network. Finally, we measure the retweet cascades. We identify risks for information diffusion models that rely on tweet inter-arrival times and user influence. This work calls attention to the social data bias caused by data collection, and proposes methods to measure the systematic biases introduced by sampling.

What Makes People Feel Close to Online Groups?

Robert E Kraut, John M Levine, Marisol Martinez Escobar, Amaç Herdağdelen

Most research assumes that the determinants of members’ feelings of connection to groups are constant across types of groups. The current paper challenges this assumption by assessing members' feelings of affinity toward a large, di-verse sample of online groups. 10,567 members of 6,458 Facebook groups reported on their feelings of connection to these groups. Objectively measured group characteristics and features of members' relationship to the groups explained over 16% of the variance in members’ affinity. Be-ing an administrator and being in groups with fewer mem-bers, more even communication, and more close friends were the strongest predictors. Half of the independent varia-bles significantly interacted with group type in predicting affinity (e.g., large group size was negatively associated with affinity in task groups and positively associated with affinity in topical groups).

When Does Trust in Online Social Groups Grow?

Shankar Iyer, Justin Cheng, Nick Brown, Xiuhua Wang

The trust that people feel in their social groups is linked to important social outcomes such as member satisfaction and collective task performance. To understand the behaviors and conditions linked to trust, past studies of trust in groups have typically relied on cross-sectional surveys, but these are limited in their ability to identify causation. To better test the potential causal pathways between trust and behaviors or group properties, we paired a two-wave longitudinal survey of 2358 participants in Facebook Groups with logged activity on Facebook. We find evidence for a positive feedback loop related to active engagement and trust: people who contribute written content to a group tend to trust the group more over time, and people who trust a group tend to write more over time. In other cases, we found unidirectional relationships: people tend to trust a group more over time if the group is well-connected and active overall, but tend to trust a group less over time if they are also actively involved in multiple other groups. However, greater trust is not associated with changes in overall group activity and is only weakly associated with changes in connectedness. And while groups that are more trusted tend to add more administrators and moderators over time, adding more administrators and moderators does not tend to increase trust over time. Overall, our findings suggest that trust is best promoted by increasing individual- level active engagement in groups with certain group-level properties (e.g., high friendship density or overall activity).

When Your Friends Become Sellers: An Empirical Study of Social Commerce Site Beidian

Hancheng Cao, Zhilong Chen, Fengli Xu, Tao Wang, Yujian Xu, Lianglun Zhang, Yong Li

Past few years have witnessed the emergence and phenomenal success of intimacy-based social commerce. Embedded in social networking sites, these E-Commerce platforms transform ordinary people into sellers, where they advertise and sell products to their friends and family in online social networks. These sites can acquire millions of users within a short time, and are growing fast at an accelerated rate. However, little is known about how these social commerce develop as a blend of intimacy and economic transactions. In this paper we present the first measurement study on the full-scale data of Beidian, one of the fastest growing WeChat based social commerce sites in China, which involves 11.8 million users. We first analyzed the topological structure of the Beidian platform and highlighted its decentralized nature. We then studied the site's rapid growth and its growth mechanism via invitation cascade. Finally, we investigate purchasing behavior on Beidian, where we focus on user proximity and loyalty, which contributed to the site's high conversion rate. As the consequences of interactions between intimacy and economic logics, emerging social commerce demonstrates significant property deviations from all known social networks and E-Commerce. To the best of our knowledge, this work is the first quantitative study on the network characteristics and dynamics of emerging intimacy based social commerce platforms.

"And We Will Fight For Our Race!'' A Measurement Study of Genetic Testing Conversations on Reddit and 4chan

Alexandros Mittos, Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro

Recent progress in genomics has enabled an emerging market for ``direct-to-consumer'' genetic testing. Nowadays, companies like 23andMe and AncestryDNA provide affordable health, genealogy, and ancestry reports, and have already tested tens of millions of customers. At the same time, far-right groups have also reportedly taken an interest in genetic testing, using them to attack minorities and prove their genetic ``purity.'' In this paper, we present a quantitative measurement study shedding light on how genetic testing is being discussed on Web communities in Reddit and 4chan. We collect 1.3M comments from both platforms, posted over 27 months, using a set of 280 keywords related to genetic testing. We then use NLP and computer vision tools to identify trends, themes, and topics of discussion.Our analysis shows that genetic testing attracts a lot of attention on Reddit and 4chan, with discussions often including highly toxic language expressed through hateful, racist, and misogynistic comments. In particular, on 4chan's politically incorrect board (/pol/), content from genetic testing conversations involves several alt-right personalities and openly antisemitic rhetoric, often conveyed through memes. Finally, we find that discussions build around user groups, from technology enthusiasts to communities promoting fringe political views.

“Musicalization of the Culture”: Is Music Becoming Louder, More Repetitive, Monotonous and Simpler?

Yukun Yang

“Musicalization of the Culture” is the social science concept proposed by American philosopher George Stainer. He depicted the glooming future of music—it would be-come omnipresent while having increasing volume, repetitiveness, and monotony, which are ascribed to the de-base of literal aesthetics. Although research that relates to one or some of the predictions exists, neither of them en-compassing all these “musicalization” manifestations, nor do they study the trend over time. Therefore, this preliminary research tries to validate whether music has gained acoustic loudness, and lyrical repetitiveness, monotony, and simplicity in a computational fashion. Conducting time-series analysis with trend detection, we confirmed all these trends for music from 1970 to 2016 using the MetroLyrics dataset and Spotify API. To investigate the simultaneity of these trends, we further conducted synchrony analysis and found that there is little evidence they would influence each other in a lagged fashion. In addition, we briefly discussed the result by relating to the music industry change. Our research makes the first attempt to answer this music sociological preposition. On top of that, we also proposed novel metrics to quantify repetitiveness using closed frequent itemset mining, which could be luminous for future research.

#MeTooMA: Multi-Aspect Annotations of Tweets Related to the MeToo movement

Akash Gautam, Puneet Mathur, Rakesh Gosangi, Debanjan Mahata, Ramit Sawhney, Rajiv Ratn Shah

In this paper, we present a dataset that contains 9,973 tweets related to the MeToo movement that were manually annotated for five different linguistic aspects: relevance, stance, hate speech, sarcasm, and dialogue acts. We present a detailed account of the data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.79 to 0.93 k-alpha) due to well laid out annotator guidelines and data extraction procedure. We analyze the data in terms of geographical distribution, label correlations, and keywords. Lastly, we present some potential use cases of this dataset. We expect this dataset would be of great interest to psycholinguists, socio-linguists and computational linguists in general to study the discursive space of digitally mobilized social movements on sensitive issues like sexual harassment. The dataset can be found at \url{ https://doi.org/10.7910/DVN/JN4EYU}.

Posters

A Framework for Political Portmanteau Decomposition

Nabil Hossain, Minh Tran, Henry Kautz

Portmanteaus are new words formed by combining the sounds and meanings of two words. Given their sticky nature, portmanteaus are often used to create political and personal attacks by combining a target entity with derogatory terms, which can then be spread online for promoting hate speech and defamation. In this paper, we present a framework to decompose political portmanteaus used online into their component words. Using our annotated dataset of political portmanteaus, we train a system that decomposes 76.2% of the political portmanteaus into their component words. Furthermore, for 93.4% of the political portmanteaus, our system finds the correct component words in its top 10 results, suggesting that using better ranking methods can lead to stronger results. This work provides a framework for both understanding an intriguing linguistic phenomena and for building hate-speech filters that could catch novel words that would bypass traditional hate speech detection approaches.

Aligning Public Feedback To Requests For Comments On Regulations.gov

Manya Wadhwa, Silvio Amir, Mark Dredze

In an effort to democratize the regulatory process, the United States Federal government created regulations.gov, a portal through which federal agencies can share proposed regulations and solicit feedback from the public. A proposed regulation will contain several requests for feedback on specific topics, and the public can then submit comments in response. While this reduces barriers to soliciting feedback, it still leaves regulators with a challenge: how to produce a summary and incorporate feedback from the sometimes tens of thousands of submitted comments. We propose an information retrieval system by which comments are aligned to specific regulatory requests. We evaluate several measures of semantic similarity for matching comments to information requests. We evaluate our proposed system over several

Analysing the Extent of Misinformation in Cancer Related Tweets

Rakesh Bal, Sayan Sinha, Swastika Dutta, Rishabh Joshi, Sayan Ghosh, Ritam Dutt

Twitter has become one of the most sought after places to discuss a wide variety of topics, including medically relevant issues such as cancer. This helps spread awareness regarding the various causes, cures and prevention methods of cancer. However, no proper analysis has been performed, which discusses the validity of such claims. In this work, we aim to tackle the misinformation spread in such platforms. We collect and present a dataset regarding tweets which talk specifically about cancer and propose an attention-based deep learning model for automated detection of misinformation along with its spread. We then do a comparative analysis of the linguistic variation in the text corresponding to misinformation and truth. This analysis helps us gather relevant insights on various social aspects related to misinformed tweets.

Can Badges Foster a More Welcoming Culture on Q&A Boards?

Tiago Santos, Keith Burghardt, Kristina Lerman, Denis Helic

Thriving online communities rely on a steady stream of newcomers to contribute new content. However, retaining newcomers has proven challenging. In this paper, we measure the success of an intervention used by Stack Exchange question-answering communities to create a more welcoming environment for newcomers. That intervention consisted in highlighting contributions by new users with a special indicator. We hypothesize that Stack Exchange's new policy would reduce negative reactions to new users and, ultimately, increase new user retention. We leverage causal modeling to assess the introduction of the so-called "new contributor indicator", and we find it did not counter user retention decline in the short- and long-terms. However, our results indicate it did reduce unwelcoming reactions towards newcomers in the short-term. Our work has practical implications for online community managers aiming to improve their onboarding processes.

Characterizing variation in toxic language by social context

Bahar Radfar, Karthik Shivaram, Aron Culotta

How two people speak to one another depends heavily on the nature of their relationship. For example, the same phrase said to a friend in jest may be offensive to a stranger. In this paper, we apply this simple observation to study toxic comments in online social networks. We curate a collection of 6.7K tweets containing potentially toxic terms from users with different relationship types, as determined by the nature of their follower-friend connection. We find that such tweets between users with no connection are nearly three times as likely to be toxic as those between users who are mutual friends, and that taking into account this relationship type improves toxicity detection methods by about 5% on average. Furthermore, we provide a descriptive analysis of how toxic language varies by relationship type, finding for example that mildly offensive terms are used to express hostility more commonly between users with no social connection than users who are mutual friends.

Empirical Evaluation of Three Common Assumptions in Building Political Media Bias Datasets

Soumen Ganguly, Juhi Kulshrestha, Haewoon Kwak, Jisun An

In this work, we empirically validate three common assumptions in building political media bias datasets, which are (i) labelers' political leanings do not affect labeling tasks; (ii) news articles follow their source outlet's political leaning; and (iii) political leaning of a news outlet is stable across different topics. We build a ground-truth dataset of manually annotated article-level political leaning and validate the three assumptions. Our findings warn that the three assumptions could be invalid even for a small dataset. We hope that our work calls attention to the (in)validity of common assumptions in building political media bias datasets.

External Information Sharing on Health Forums: An Exploration

Dana Nguyen, Alexandra Olteanu, Emre Kiciman

Online health forums are an important avenue for receiving social support and learning about fellow patients' experiences with similar diagnoses. We seek to understand what kinds of external information (i.e., web links) are shared on online health forums as a proxy to participants' information needs. For this purpose, we collect a dataset of web links shared publicly on a lung cancer forum over a period of 16 years and perform a comparative analysis with three different website typologies, uncovering a diverse ecosystem of websites. To understand changes in the role health forums play for patients, our study also investigates typological variations as this forum gains and then loses popularity over time.

On the Splitting Dynamics of Meetup Social Groups

Ayan Kumar Bhowmick, Soumajit Pramanik, Sayan Pathak, Bivas Mitra

Groups in online social networks witness continuous evolution by loss of existing members and gain of new members. In this paper, we present a study of group split in Meetup, where a major fraction of members leave the existing group together and join a newly formed group. We identify pivotal group members, called 'splitters', playing key roles in the group split by influencing the existing members to leave the group. We provide an in-depth analysis of the empirical data to reveal key motivating factors leading to a group split and its subsequent effects. Finally, we develop a prediction model for early detection of 'splitters', as well as the group members likely to be influenced by the 'splitter' to leave the group.

Realtime predictive patrolling and routing with mobility and emergency calls data

Shakila Khan Rumi, Flora D Salim, Wei Shao, Ke Deng

A well-planned patrol route plays a crucial role in increasing public security. Most of the existing studies designed the patrol route in a static manner. Situations when rerouting of patrol path are required due to the emergencies, e.g., an accident or ongoing homicide, are not considered. In this paper, we formulate the crime patrol routing problem jointly with dynamic crime event prediction, utilising crowdsourced check-in and real-time emergency call data. The extensive experiment on real-world datasets verifies the effectiveness of the proposed dynamic crime patrol route using different evaluation metrics.

The Effects of an Informational Intervention on Attention to Anti-Vaccination Content on YouTube

Sangyeon Kim, Omer Yalcin, Samuel Bestvater, Kevin Munger, Burt Monroe, Bruce Desmarais

The spread of misinformation related to health, especially vaccination, is a potential contributor to myriad public health problems. This misinformation is frequently spread through social media. Recently, social media companies have intervened in the dissemination of misinformation regarding vaccinations. In the current study we focus on YouTube. Recognizing the extent of the problem, YouTube implemented an informational modification that affected many videos related to vaccination beginning in February 2019. We collect original data and analyze the effects of this intervention on video viewership. We find that this informational intervention reduced traffic to the affected videos, both overall, and in comparison to a carefully-matched set of control videos that did not receive the informational modification.

The Relative Value of Facebook Advertising Data for Poverty Mapping

Masoomali Fatehkia, Benjamin L Coles, Ferda Ofli, Ingmar Weber

Having reliable and up-to-date poverty data is a prerequisite for monitoring the United Nations Sustainable Development Goals (SDGs) and for planning effective poverty reduction interventions. Unfortunately, traditional data sources are often outdated or lacking appropriate disaggregation. As a remedy, satellite imagery has recently become prominent in obtaining geographically-fine-grained and up-to-date poverty estimates. Satellite data can pick up signals of economic activity by detecting light at night, it can pick up development status by detecting infrastructure such as roads, and it can pick up signals for individual household wealth by detecting different building footprints and roof types. It can, however, not look inside the households and pick up signals from individuals. On the other hand, alternative data sources such as audience estimates from Facebook's advertising platform provide insights into the devices and internet connection types used by individuals in different locations. Previous work has shown the value of such anonymous, publicly-accessible advertising data from Facebook for studying migration, gender gaps, crime rates, and health, among others. In this work, we evaluate the added value of using Facebook data over satellite data for mapping socioeconomic development in two low and middle income countries -- the Philippines and India. We show that Facebook features perform roughly similar to satellite data in the Philippines with value added for urban locations. In India, however, where Facebook penetration is lower, satellite data perform better.

Towards Using Word Embedding Vector Space for Better Cohort Analysis

Mohamed Bahgat, Steven Wilson, Walid Magdy

Social media platforms can provide a place for users to express their opinions, interact with others and reflect on their personal experiences. On websites like Reddit, users join communities where they discuss specific topics which cluster them into possible groups of cohorts. These cohorts provide opportunity to analyse individuals with specific tendencies. The authors within these cohorts have the opportunity to post more openly under the blanket of anonymity, and such openness provides a more accurate signal on the real issues individuals are facing. Some communities within Reddit contain discussions about mental health struggles such as depression and suicidal ideation. To better understand and analyse these individuals, we propose to exploit properties of word embeddings that group related concepts close to each other in the embeddings space. For the posts from each topically situated sub-community, we build a word embedding model and use handcrafted lexicons to identify emotions, values and psycholinguistically relevant concepts. We then extract insights into the way that users perceive these concepts by measuring the distance between them and references made by users either to themselves, others or other things around them. We put our tool to the test and see if we can extract meaningful signals.

Datasets

A Benchmark Dataset of Check-worthy Factual Claims

Fatma Arslan, Naeemul Hassan, Chengkai Li, Mark Tremayne

In this paper we present the ClaimBuster dataset of 23,533 statements extracted from all U.S. general election presidential debates and annotated by human coders. The ClaimBuster dataset can be leveraged in building computational methods to identify claims worth fact-checking from the myriad of sources of digital or traditional media. The ClaimBuster dataset is publicly available to the research community, and it can be found at http://doi.org/10.5281/zenodo.3609356.

A Dataset of Fact-Checked Stories Shared in WhatsApp during the Brazilian and India Elections

Julio C. S. Reis, Philipe De Freitas Melo, Kiran Garimella, Jussara M. Almeida, Dean Eckles, Fabricio Benevenuto

Recently, messaging applications, such as WhatsApp, have been reportedly abused by misinformation campaigns, especially in Brazil and India. A notable form of abuse in WhatsApp relies on several manipulated images and memes containing all kinds of fake stories. In this work, we performed an extensive data collection from a large set of WhatsApp public groups and the websites of fact-checking agencies. This paper opens a novel dataset to the research community containing fact-checked fake images shared through WhatsApp for two distinct scenarios known for the spread of fake news on WhatsApp: the 2018 Brazilian elections and the 2019 Indian elections.

Generally Curious: Thematically Distinct Datasets of General Threads on 4chan/pol/ Forum

Emilija Jokubauskaitė, Stijn Peeters

Over the second half of the 2010s, the /pol/ (‘politically incorrect’) forum on the 4chan image board has emerged as a space within which various extreme political ideologies are discussed and cultivated, occasionally informing off-site acts of political extremism. While previous research has often studied this space as a unified whole, it is relevant to more specifically demarcate different publics within 4chan’s /pol/ board, apart from studying it as an ‘amorphous blob’. This paper focuses specifically on ‘generals’ — recurring threads with a specific thematic focus identified by a particular vernacular phrase or tag. By identifying them it is possible to partition the board’s archive into multiple distinct datasets comprising discussions about a particular topic, such as Donald Trump, the Syria war, or British politics. We provide a dataset containing 58,841 opening posts and 13,697,738 replies to those, divided over 329 thematically distinct general thread collections. In this paper we outline our data collection and query protocol, the structure of the data and its rationale, as well as a number of suggested research uses for this new data.

Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository

Enyan Dai, Yiwei Sun, Suhang Wang

Nowadays, Internet is a primary source of attaining health information. Massive fake health news which is spreading over the Internet, has become a severe threat to public health. Numerous studies and research works have been done in fake news detection domain, however, few of them are designed to cope with the challenges in health news. For instance, the development of explainable is required for fake health news detection. To mitigate these problems, we construct a comprehensive repository, FakeHealth, which includes news contents with rich features, news reviews with detailed explanations, social engagements and a user-user social network. Moreover, exploratory analyses are conducted to understand the characteristics of the datasets, analyze useful patterns and validate the quality of the datasets for health fake news detection. We also discuss the novel and potential future research directions for the health fake news detection.

Mining Archive.org’s Twitter Stream Grab for Pharmacovigilance Research Gold

Ramya Tekumalla, Juan M Banda

In the last few years, Twitter has become an important re-source for the identification of Adverse Drug Reactions (ADRs), monitoring flu trends, and other pharmacovigi-lance and general research applications. Most researchers spend their time crawling Twitter, buying expensive pre-mined datasets, or tediously and slowly building datasets using the limited Twitter API. However, there are a large number of datasets that are publicly available to research-ers that are underutilized or unused. In this work, we demonstrate how we mined over 9.4 billion Tweets from archive.org’s Twitter stream grab using a drug-term dic-tionary and plenty of computing power. Knowing that not everything that shines is gold, we used pre-existing drug-related datasets to build machine learning models to filter our findings for relevance. In this work, we present our methodology and the 3,346,758 identified tweets for pub-lic use in future research.

P4KxSpotify: A Dataset of Pitchfork Music Reviews and Spotify Musical Features

Anthony T Pinter, Jacob M Paul, Jessie Smith, Jed R. Brubaker

While algorithmically driven curation and recommendation systems like Spotify have become more ubiquitous for surfacing content that people might want hear, expert reviews continue to have a measurable impact what people choose to listen to and subsequently the commercial success and cultural staying power of those artists. One such site, Pitchfork, is particularly known in the music community for its ability to catapult artists to stardom based on the reviews that an album receives. In this paper, we present a dataset of Pitchfork album reviews with the corresponding Spotify audio features for those albums. We describe our data collection and dataset creation process. We present basic information and descriptive statistics about the dataset. Finally, we offer several possible avenues for research that might utilize this new dataset.

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

Antonis Papasavva, Savvas Zannettou, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a few set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the level of toxicity in each post. Overall, we are confident that our work will further motivate and assist researchers in studying and understanding 4chan as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.

The Long-Running Debate about Brexit on Social Media

Emre Calisir, Marco Brambilla

Online social media platforms have become a major place where people also discuss their opinions and express their feelings about socio-political phenomena such as elections and referendums. Human-generated online content is a fruitful resource for a deeper understanding of these happenings. In this study, we present a dataset comprising 45 months (from January 2016 until September 2019) of long-running discussions on Twitter about the Brexit referendum, which can be used by social scientists and journalists for understanding the evolution of the public debate about the phenomenon. This dataset comprises 50.8 million tweets and 3.97 million users, and is also enriched with additional meta-data attributes: bot score of users, sentiment information detected by our sentiment analyzer, political stance information predicted by our stance classifier. Considering all Brexit related tweets of users during our time period, we also determine their overall stance and sentiment.

The Media Coverage of the 2020 US Presidential Election Candidates through the Lens of Google's Top Stories

Anna Kawakami, Khonzodakhon Umarova, Eni Mustafaraj

Choosing the nominee of a political party who will appear on the ballot for the US presidency is a long process that starts two years before the general election. The news media plays a particular role in this process by continuously covering the state of the race. How can this news coverage be characterized? Given that there are thousands of news organizations, but each of us is exposed to only a few of them, we might be missing most of it. Online news aggregators, which aggregate news stories from a multitude of news sources and perspectives could provide an important lens for the analysis. One such aggregator is Google's Top Stories, a recent addition to Google's search result page. For the entire duration of 2019, we have collected the news headlines that Google Top Stories has displayed for 30 candidates of both UP political parties. Our dataset contains 79,903 news story URLs published by 2,168 unique news sources. Our analysis indicates that despite this large number of news sources, there is a very skewed distribution of where the Top stories are originating, with a very small number of sources contributing the majority of stories. We are sharing our dataset so that other researchers can answer questions related to algorithmic curation of news as well as media agenda setting in the context of political elections.

The Pushshift Reddit Dataset

Jason Baumgartner, Savvas Zannettou, Brian C Keegan, Megan Squire, Jeremy Blackburn

Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data involves a high degree of engineering skill set and computational resources. In fact, research is often times gated by data engineering problems that must be overcome before analysis can proceed. This has resulted recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so called “front page of the Internet,” in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, the technical barriers to acquisition still remain. Thus, Reddit’s millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are at the same time relatively accessible, but time consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift’s Reddit dataset is updated in real-time, and includes historical data back to Reddit’s inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entirety of the dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.

The Pushshift Telegram Dataset

Jason Baumgartner, Savvas Zannettou, Megan Squire, Jeremy Blackburn

Messaging platforms, especially those with a mobile focus, have become increasingly ubiquitous in society. These mobile messaging platforms can have deceivingly large user bases, and in addition to being a way for people to stay in touch, are often used to organize social movements, as well as a place for extremists and other ne’er-do-well to congregate. In this paper, we present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. To the best of our knowledge, our dataset comprises the largest and most complete of its kind. In addition to the raw data, we also provide the source code used to collect it, allowing researchers to run their own data collection instance. We believe the Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.

WikiHist.html: English Wikipedia’s Full Revision History in HTML Format

Blagoj Mitrevski, Tiziano Piccardi, Robert West

Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia’s full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia’s hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext, and that the missing links are important for user navigation. Data and code are publicly available at https://doi.org/10.5281/zenodo.3605388.

Demos

BotSlayer: DIY Real-Time Influence Campaign Detection

Pik-Mai Hui, Kaicheng Yang, Christopher Torres-Lugo, Marc McCarty, Benjamin Serrette, Valentin Pentchev, Filippo Menczer

BotSlayer is an application that helps track and detect potential manipulation of information spreading on Twitter. It can be used by journalists, corporations, political candidates, and civil society organizations to discover online coordinated campaigns in real time. BotSlayer uses an anomaly detection algorithm to flag hashtags, links, accounts, phrases, and media that are trending and amplified in a coordinated fashion by likely bots. A Web dashboard lets users explore the tweets and accounts associated with suspicious campaigns, visualize their spread, and search related content on multiple search engines and social media platforms. BotSlayer is easily installed and configured in the cloud. It will aid in the study and early detection of social media manipulation phenomena.

GeoSiteSearch: A Tool to Map Vietnamese Diaspora by Deducing Geographical Information of Webpages about Our Lady of LaVang

Madison G. Masten, Thien-Huong Ninh, Nicholas Tran

We construct a web tool to extract geographical locations from web pages returned by the Google search engine for an arbitrary query and display those locations on an interactive map. The tool was used to track the worldwide Vietnamese diaspora using Our Lady of LaVang as proxy for presence of a Vietnamese community, but it could potentially have other applications.

The Political Dashboard A Tool for Online Political Transparency

Juan Carlos Medina Serrano, Orestis Papakyriakopoulos, Morteza Shahrezaye, Simon Hegelich

Contemporary political communication is a multi- and cross-platform process. Because of its complexity, new tools are necessary to monitor and understand it. We present a system that ingests, stores and processes political data from Twitter, Facebook and online news articles. We visualize the data in the form of a freely accessible online dashboard. The political dashboard (https://political-dashboard.com/) aims to provide online political transparency and assist researchers, journalists and the general public to understand the German online political landscape.

Tutorials, Workshops, and Data Challenge

Tutorials Schedule

June 08, 2020

T1: Quali-Quantitative Research With 4CAT: Capturing and Analysis Toolkit

Sal Hagen, Emilija Jokubauskaitė, Stijn Peeters

As part of the so-called “computational turn” in the social sciences and humanities, tools that offer computational, data-driven, and otherwise quantitative methods seem to be increasingly entangled with the research practices of traditionally qualitative fields. This tutorial engages with broader challenges of quali-quantitative social research by explaining 4CAT, our modular tool that allows to fetch and analyse data from multiple Web sources. At the moment, these include Reddit, 4chan, 8chan, Tumblr, Instagram, Telegram, as well as self-made datasets. 4CAT allows downloading and qualitatively exploring textual posts, as well as quantitatively processing these with a number of analysis modules. Some analysis modules calculate simple yet insightful metrics, while others execute more advanced computational methods, like those from Natural Language Processing. This multi-purpose setup thus facilitates combining approaches from an array of academic fields. In the tutorial, we show how 4CAT can aid in combining data science and ethnography in a fruitful and methodologically solid manner. Next to this, we explain how how participants can customise 4CAT for their specific research interest by installing it on their laptop to develop their own modules.

Sal Hagen is a PhD candidate at the University of Amsterdam as part of OILab and affiliated with the Digital Methods Initative. His research focuses on the political collectivisation of Internet subcultures. Methodologically, he aims to combine media theory with computational methods.

Stijn Peeters is a post-doctoral researcher at the Department of Media Studies, University of Amsterdam, and a member of OILab. His current research interests focus on the development of sound research tools and methods for analysis of fringe and historical online platforms.

Emilija Jokubauskaitė is a lecturer in New Media & Digital Culture at the University of Amsterdam and co-founder of Open Intelligence Lab. Her main research interests consist of online political subcultures, the lesser-known online spaces, and platforms as well as scrutiny of research tools and techniques.

URL: www.oilab.eu/4cat-icwsm.html

T2: Detection and Characterization of Stance on Social Media

Abeer AlDayel, Kareem Darwish, Walid Magdy

Stance detection involves the identification of the positions of a piece of text or a user towards a target such as a topic, entity, or claim. A growing body of research in the ICWSM and Social Computing community on performing and using stance detection shows its importance for a variety of applications including properly analyzing the attitudes of online users. This tutorial aims to teach participants how to perform and use stance detection. Specifically, we provide a general introduction to the concept of stance and how it differs from sentiment analysis; present recent methodologies for stance detection on social media including supervised, semi-supervised, and unsupervised methods; and introduce various applications of stance detection on social media including how it can be used to support analytical studies. The tutorial concludes with an exploration of open challenges and future directions for stance detection on social media.

URL: http://smash.inf.ed.ac.uk/tutorials/stance/

~~T3: Introduction to Collecting and Analyzing Satellite Imagery~~

~~Ingmar Weber, Achira Bhattacharyya~~

The ICWSM community studies how online social media and web data can be used to understand different aspects of society and human behavior. While these online data sources are valuable for sensing and quantifying the ‘social fabric’, they are not made for sensing the physical world. This tutorial will provide attendees with another data source with which to complement their analysis: satellite imagery.
Thanks to the proliferation of tools, ranging from Google Earth, to Planet’s imagery, to Landsat’s imagery, to cloud-hosted Jupyter Notebooks, the collection and analysis of satellite imagery is gradually being commodified with a steadily shrinking barrier of entry. This makes it feasible for ‘general purpose computational social scientists’, without prior expertise in image processing, to start exploring such data sources as an additional layer in their analysis. This holds a lot of promise for different analyses with a GIS component, such as looking at air quality, looking at areas of high risk due to climate-change induced flooding, identifying deforestation, looking at the impact of disasters, or looking at the societal changes brought about by electrification.
The first part (90 min) of this tutorial will give a high level overview of data and tools, as well as look at some case studies to show the ‘what is possible’ element. The second part (90 min) will walk through a number of notebooks hosted on Google Colab using the Earth Engine Python API, to show the ‘how to do it’ element.

Ingmar is the Research Director for Social Computing at the Qatar computing Research Institute. His interdisciplinary research looks at how digital data and computational methods can be applied to address research questions from demography, the development sector, or from other domains. He was one of the ICWSM’18 PC chairs and currently serves as one of the ICWSM Editors-in-Chief. Web: https://ingmarweber.de/publications/

Achira is a Research Assistant at University College London, Qatar. With a background in communications and interest in computation, her research focuses on analysis of visual and social media using a mixed methods approach to inform understanding of topics within the domains of social sciences and digital humanities, often concerning issues in gender representation.

URL: https://sites.google.com/view/icwsm2020-satellite-tutorial/home

T4: Investigating Attention and Influence Online with Media Cloud

Rahul Bhargava, Aashka Dave, Orestis Papakyriakopoulos

Today’s media universe encompasses traditional media, digital platforms, social media and myriad tools responsible for story creation and distribution. The complexity of our media ecosystems presents significant challenges for anyone interested in studying information, particularly across platforms and dissemination methods. In this tutorial, we present Media Cloud, an open source research platform that offers easy, unparalleled access to information from the open web. This tutorial will train attendees to use Media Cloud’s suite of tools for their own media research and analysis purposes through a combination of case studies and exercises.

Rahul Bhargava is a researcher and technologist specializing in civic technology and data literacy. He creates interactive websites, playful educational experiences, and award- winning visualizations for museum settings. As a Research Scientist at the MIT Center for Civic Media, Rahul serves as Chief Technology Officer for Media Cloud.

Aashka Dave is a researcher for Media Cloud, based at the MIT Media Lab. Aashka studies media ecosystems and public health and supports the Media Cloud community in their research. She previously worked at the Harvard Kennedy School and The Associated Press, and holds an MS in Comparative Media Studies from MIT.

Orestis Papakyriakopoulos is researcher at the Technical University of Munich and Visiting Scholar at the Center for Civic Media at the MIT Media Lab. Orestis studies new and old media by the application of data-intensive algorithms. He also studies the political impact of the use of data-intensive algorithms in society.

Workshops Schedule

ICWSM-2020 Workshop Proceedings: here

June 08, 2020 Full Day

W1: Emoji Understanding and Applications in Social Media (Emoji 2020)

Sanjaya Wijeratne, Horacio Saggion, Amit Sheth

URL: http://emoji2020.aiisc.ai/

W2: Social Sensing (SocialSens 2020): Special Edition on Narrative Analysis on Social Media

Jiawei Han, Ning Yu, Emilio Ferrara, Tarek Abdelzaher

URL: http://socialsens.web.illinois.edu/

June 08, 2020 Half Day Morning

NEws and publiC Opinion (NECO) 2020

Jisun An, Haewoon Kwak, Fabrício Benevenuto

URL: http://neco.io/

Cyber Social Threats (CySoc 2020).

Ugur Kursuncu, Yelena Mejova, Jeremy Blackburn, Amit Sheth

URL: http://cysoc.aiisc.ai/

June 08, 2020 Half Day Afternoon

Mediate: Social and News Media Misinformation Workshop.

Panayiotis Smeros, Jérémie Rappaz, Marya Bazzi, Karl Aberer

URL: https://mediateworkshop.github.io/

W6: Data Is People: New Challenges and Tensions in Social Media Research.

Jessica Pater, Alicia Nobles, Sarah Gilbert, Michael Zimmer, Casey Fiesler

Data Challenge Schedule

June 08, 2020 Full Day

URL: here

An RNN-based Classifier to detect Misinformation in News Articles

Brendan Cunha and Lydia Manikonda

On Analyzing Annotation Consistency in Online Abusive Behavior Datasets

Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee and Sandra Mitrovic

Do All Good Actors Look The Same? Exploring News Veracity Detection Across The U.S. and The U.K.

Benjamin Horne, Maurício Gruppi and Sibel Adali

Enhanced Offensive Language Detection Through Data Augmentation

Ruibo Liu, Guangxuan Xu and Soroush Vosoughi

Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling

Thomas Davidson and Debasmita Bhattacharya

"To Target or Not to Target": Identification and Analysis of Abusive Text Using Ensemble of Classifiers

Gaurav Verma, Niyati Chhaya, and Vishwa Vinay

Intersectional Bias in Hate Speech and Abusive Language Datasets

Jae Yeon Kim, Carlos Ortiz, Sarah Nam, Sarah Santiago, and Vivek Datta

Implicit Crowdsourcing for Identifying Abusive Behavior in Online Social Networks

Abiola Osho, Ethan Tucker, and George Amariucai

Organizing Committee

ICWSM Editors-in-Chief

Carol Hamilton
AAAI Executive Director

Stephanie Le
AAAI Conference Coordinator

Steering Committee

Program Committees

Senior Program Committee

Palakorn Achananuparp; Eytan Adar; Jussara M. Almeida; Daniel Archambault; Saeideh Bakhshi; Francesco Barbieri; Fabricio Benevenuto; Jeremy Blackburn; Cody Buntain; Ángel Cuevas; Aron Culotta; Nicholas Diakopoulos; Rosta Farzan; Emilio Ferrara; Sabrina Gaito; David Garcia; Kiran Garimella; R. Kelly Garrett; Nir Grinberg; Alex Hanna; Jake Hofman; Bernie Hogan; Kokil Jaidka; Kenneth Joseph; Andreas Kaltenbrunner; Brian Keegan; Emre Kiciman; Haewoon Kwak; Renaud Lambiotte; Dongwon Lee; Kyumin Lee; Kristina Lerman; David Liben-Nowell; Jimmy Lin; Jiebo Luo; Jalal Mahmud; Winter Mason; Tanushree Mitra; Andrés Monroy-Hernández; Fred Morstatter; Mirco Musolesi; Eni Mustafaraj; Jahna Otterbacher; Hemant Purohit; Daniele Quercia; Paul Resnick; Daniel Romero; Nishanth Sastry; Rossano Schifanella; Sarita Schoenebeck; Amit Sheth; Emma Spiro; Markus Strohmaier; Jiliang Tang; Hanghang Tong; Panayiotis Tsaparas; Oren Tsur; Lyle Ungar; Onur Varol; Katrin Weller; Brooke Foucault Welles; Tim Weninger; Shaomei Wu; Xing Xie; Arkaitz Zubiaga

Program Committee and Reviewers

Andres Abeliuk; Mohit Agarwal; Pushkal Agarwal; Wei Ai; Luca Maria Aiello; Abeer Aldayel; Nikos Aletras; Majid Alfifi; Essa Alhazmi; Hamed Alhoori; Kristen Altenburger; Tawfiq Ammari; Jisun An; Panagiotis Andriotis; Pablo Aragón; David Arbour; Ahmer Arif; Mariam Asad; Zahra Ashktorab; Pepa Atanasova; Michaël Aupetit; Adam Badawy; Ramesh Baral; Pablo Barbera; Valerio Basile; Adam Bates; George Berry; Shreyansh Bhatt; Md Momen Bhuiyan; Su Lin Blodgett; Chiara Boldrini; Ludovico Boratto; Alexandre Bovet; Shannon Bowen; Lia Bozarth; Matteo Bruno; Ceren Budak; Cody Buntain; Keith Burghardt; Jie Cai; Jose Camacho-Collados; Hancheng Cao; Sam Carton; Tristan Caulfield; Abhijnan Chakraborty; George Chalhoub; Stevie Chancellor; Eshwar Chandrasekharan; Jonathan Chang; Despoina Chatzakou; Sneha Chaudhari; Nitesh V Chawla; Charalampos Chelmis; Chen Chen; Haochen Chen; Haohan Chen; Xi Chen; Justin Cheng; Giovanni Luca Ciampaglia; Riki Conrey; Denzil Correa; Michele Coscia; Stefano Cresci; Tiago Cunha; Henry Dambanemuya; Srayan Datta Datta; Thomas Davidson; Gianmarco De Francisci Morales; Jean-Charles Delvenne; Emily Denton; Tyler Derr; Himel Dev; Nitish Devadiga; Sanorita Dey; Paramveer Dhillon; Yifan Ding; Alex Dow; Yuhao Du; Asif Ekbal; Bhushan Ekbote; Mai ElSherief; Ziv Epstein; Sindhu Kiranmai Ernala; Motahhare Eslami; Dhivya Eswaran; Chao Fan; Flavio Figueiredo; Fabian Fl√∂ck; alessandro flammini; Claudia Flores-Saviaga; Paula Fortuna; Adam Fourney; Diana Freed; Simona Frenda; Lisa Friedland; Hao Fu; Liye Fu; Brian Gallagher; Ryan Gallagher; Peng Gao; Wei Gao; Nikhil Garg; Kiran Garimella; Joobin Gharibshah; Saptarshi Ghosh; Salvatore Giorgi; Maria Glenski; Pawan Goyal; Przemyslaw Grabowicz; Erhardt Graeff; Derek Greene; Nir Grinberg; Akshay Gugnani; Francesco Gullo; Sharath Chandra Guntuku; Cheng Guo; Abram Handler; Alex Hanna; Ehsan ul Haq; Libby Hemphill; Danula Hettiachchi; Tuan-Anh Hoang; Christopher Homan; Lingzi Hong; Benjamin Horne; Benjamin D. Horne; Manoel Horta Ribeiro; Emoke Agnes Horvat; Geert-Jan Houben; Hsun-Ping Hsieh; Tianran Hu; Yiqing Hua; Binxuan Huang; Xiaolei Huang; Xin Huang; Amanda Hughes; Muhammad Imran; Shankar Iyer; Kokil Jaidka; Kasthuri Jayarajah; Aaron Jiang; Shan Jiang; Fang Jin; Sagar Joglekar; Isaac Johnson; Kenneth Joseph; Mouna Kacimi; parisa kaghazgaran; Wang-Cheng Kang; Katherine Keith; Os Keyes; Madian Khabsa; Jayden Khakurel; Masahiro Kimura; Katharina Kinder-Kurlanda; Vivek Kothari; Nicolas Kourtellis; P Krafft; Dmitry Kravchenko; Jitin Krishnan; Rachel Krohn; Gokhan Kul; Ramnath Kumar; Srijan Kumar; Angela Lai; Hemank Lamba; Jack LaViolette; Hayden Le; Min Hun Lee; Zachary Levonian; Ang Li; Jundong Li; Defu lian; Q. Vera Liao; Yiming Liao; Tzu-Heng Lin; Marina Litvak Dr; Han Liu; Haochen Liu; Renju Liu; Tong Liu; Tony Liu; Xiaochen Liu; Zhe Liu; Ismini Lourentzou; Cristian Lumezanu; Emma Lurie; Xiao Ma; Yao Ma; Piyush Madan; Sanjay Madria Madria; Walid Magdy; Matteo Magnani; Jamie Mahoney; Suman Kalyan Maity; Danaja Maldeniya; Momin Malik; Radhika Mamidi; Lydia Manikonda; Enrico Mariconti; Alice Marwick; Binny Mathew; Abhinav Maurya; Grant McKenzie; Alexey Medvedev; Ninareh Mehrabi; Florian Meier; Yelena Mejova; Weizhi Meng; Hannah Metzler; Christian Meurisch; Mehrnoosh Mirtaheri; Amita Misra; Prasenjit Mitra; Sina Mohseni; Mainack Mondal; Animesh Mukherjee; Kevin Munger; Fabricio Murai; Matthieu Nadini; Kanika Narang; Vidya Narayanan; Laura Nelson; Leonie Neuhäuser; Dong Nguyen; Joel Nishimura; Behnaz Nojavanasghari; Katherine Ognyanova; Lucas Oliveira; Claudia Orellana-Rodriguez; Deepak Padmanabhan; Orestis Papakyriakopoulos; Vagelis Papalexakis; Kunwoo Park; Minsu Park; Jasabanta Patro; Umashanthi Pavalanathan; Leto Peel; Isidoros Perikos; Shruti Phadke; Ed Platt; Kashyap Popat; Emily Porter; Oliver Posegga; Vinodkumar Prabhakaran; Hemant Purohit; Apostolos Pyrgelis; ashwin rajadesingan; Gilad Ravid; Afsaneh Razi; Raquel Recuero; Kunal Relia; Lindsay Reynolds; Rezvaneh (Shadi) Rezapour; Georgios Rizos; Lionel Robert; Ronald Robertson; Lorenzo Rossi; Luca Rossi; Camille Roth; Jeffrey Rzeszotarski; Koustuv Saha; Mattia Samory; Mary Sanford; Shruti Sannon; Tiago Santos; Anna Sapienza; Jari Saramäki; Saiph Savage; Martin Saveski; Yann Savoye; Sanja Scepanovic; Andrew Schwartz; Jo√£o Sedoc; Joseph Seering; Mridul Seth; Neil Shah; Amit Sharma; Karishma Sharma; Saeedeh Shekarpour; Kai Shu; Satyaki Sikdar; Priyanka Sinha; Tanmay Sinha; Michael Sirivianos; Mauro Sozio; Megan Squire; Padmini Srinivasan; Kate Starbird; Leo Stewart; Arkadiusz Stopczynski; Gianluca Stringhini; Guillermo Suarez-Tangil; Lionel Tabourier; Shuai Tang; Nazgol Tavabi; Mohini Tellakat; Zhanna Terechshenko; Jacob Thebault-Spieker; Daniel Thomas; Thanh Tran; Christoph Trattner; Sho Tsugawa; Gareth Tyson; Catalina Vajiac; Onur Varol; Pedro Vaz-de-Melo; Saurabh Verma; Bertie Vidgen; Marco Viviani; Nguyen Vo; Svitlana Volkova; Soroush Vosoughi; Mengting Wan; Stephen Wan; Suhang Wang; Yi Wang; Yixue Wang; Yong Wang; Yu Wang; Zhiwei Wang; Zijian Wang; Zeerak Waseem; Mark Whiting; Sanjaya Wijeratne; Thilini Wijesiriwardene; Tom Wilson; Stefan Wojcik; Yi-Fang WU WU; Feng Xia; Kevin Xu; Shweta Yadav; Diyi Yang; Jie Yang; Kaicheng Yang; Min Yang; Sean Yang; Yuan Yao; Junting Ye; Zhijun Yin; Jiaxuan You; Arjumand Younus; Reza Zafarani; Savvas Zannettou; Yilei Zeng; Amy Zhang; Chenwei Zhang; Justine Zhang; Rui Zhang; Si Zhang; Yang Zhang; Xiangyu Zhao; Changtao Zhong; Ke Zhou; Matteo Zignani; Zubiaga, Arkaitz

Sponsorship

The 14th annual International Conference on Web and Social Media (AAAI ICWSM) will be held in June 2019 in Atlanta, USA. ICWSM is one of the world's premier conferences and publication venues in computational social science. From election interference to online harassment to increasingly social AI systems, the past years have highlighted the critical importance of computational social science for the success of the world's largest online platforms. The ICWSM conference will feature a rigorous dialogue about urgent and high-stakes computational social science challenges at the intersection of industry and academia.

Over the past 13 years, ICWSM has emerged as a unique forum bringing together researchers working at the nexus of computer science and the social sciences, with work drawing upon network science, machine learning, human-computer interaction, psychology, sociology, political science, economics, statistics, multimedia, and communication.

Your participation in the ICWSM-2020 Sponsor Program will give your company instant visibility and access to this diverse group of researchers, students, and professionals. This year we are expecting even greater participation and involvement at the conference. This will give your company access to recruiting opportunities and high-level sponsors will be able to set up recruiting and/or demo booths during the conference. ICWSM-2020 is sponsored by the Association for the Advancement of Artificial Intelligence (AAAI).

The details of the various sponsorship levels and what they encompass are shown below:

Benefits

All ICWSM-2020 Sponsors will receive the following benefits:

Company name and/or logo on conference website homepage

Company name and/or logo displayed in conference registration area

Listing on the conference proceedings sponsor page and brochure

Complimentary exhibit tabletop in conference center foyer

Complimentary one-page insert placement in conference bags

Sponsors at Silver or higher levels will also receive:

Complimentary technical conference registration(s)/banquet ticket(s)

Complimentary quarter-page black & white ad in AI Magazine

Listing in AI Magazine masthead for one year

Sponsors at Gold or higher levels will also receive:

Banquet/meal sponsorship

An additional complimentary technical conference registration(s)/banquet ticket(s) (2 total)

Logo on conference bag (as appropriate)

Recruiting/demo booths in conference center foyer

The Sponsor at the Platinum level will exclusively receive exclusively:

An additional complimentary technical conference registration(s)/banquet ticket(s) (3 total)

Sponsor one of our keynote speakers

Sponsor poster session

Sponsorship Levels

Platinum $22,000
Gold $10,000
Silver $5,000
Bronze $2,500

We're also excited about exploring additional means by which we can help conference attendees engage with your company. Ideas we have discussed in the organization committee include a sponsored data competition, sponsored receptions, and sponsored activities at the conference.

If you have any questions, please feel free to contact sponsorships@icwsm.org or 2020@icwsm.org. We look forward to seeing you in Atlanta in June 2020.