Despite many efforts to automatically identify toxic comments online (including sexual harassment, threats, and identity attacks), modern systems fail to generalize to the diverse concerns of Internet users. This dataset consists of 107,620 social media comments annotated by 17,280 unique participants, and was collected to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. The dataset is encrypted – please contact Deepak Kumar for the password.

Study Details

Designing Toxic Content Classification for a Diversity of Perspectives
USENIX Symposium on Usable Privacy and Security (SOUPS) 2021
Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, Michael Bailey
Deepak Kumar

Dataset Details

107,620 social media comments labeled by five annotators each.

File Download

File Name    MetaData       SHA-1 Fingerprint    Size        Updated At
             unavailable    unavailable          14.98 MB    2021-06-09