SML – Cyberbullying

Cyberbullying identification resources

(roughly ordered by date)

  • Kaggle Formspring data for Cyberbullying Detection
    A large labeled Formspring dataset, from a Summer 2010 crawl
    Samurai found errors in the annotations (made by Mechanical Turk workers) and re-annotated the data for their technical report Cyberbullying Detection, by M. Ptaszyński, G. Leliwa, M. Piech, and A. Smywiński-Pohl (contact the teachers for details).
  • The website of ChatCoder project
    Contains links to labeled datasets, notably the MySpace Group Data Labeled for Cyberbullying (possibly the dataset cited in Samurai’s technical report).
  • Kaggle turkish-cyberbullying
    Contains resources for the paper “Detection of cyberbullying on social media messages in Turkish”,
    S. A. Özel, E. Saraç, S. Akdemir and H. Aksu, International Conference on Computer Science and Engineering (UBMK), Antalya, 2017.
  • Data and code for the study of bullying at University of Wisconsin-Madison
    Dataset: Bullying Traces Data Set version 3.0 (size 534950, released in June 2015), containing 7321 tweets with tweet ID, bullying, author role, teasing, type, form, and emotion labels. Use is subject to an agreement.
    Reference: Junming Sui. Understanding and Fighting Bullying with Machine Learning. PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison, 2015.
  • Automatic detection of cyberbullying in social media text, by Van Hee C, Jacobs G, Emmery C, Desmet B, Lefever E, et al. PLOS ONE (2018)
    Abstract:  We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.
    Data Availability: “Because the actual posts in our corpus could contain names or other identifying information, we cannot share them publicly in a repository. They can, however be obtained upon request, for academic purposes solely and via or”
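    The F1 scores quoted in the abstract combine precision and recall into one number. As a minimal sketch of how such a score is computed for a binary task, the helper below (the name f1_score and the toy labels are illustrative, not taken from the paper) assumes that 1 marks a cyberbullying post and that the score is computed on that positive class only:

```python
def f1_score(gold, predicted, positive=1):
    """F1 on the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels): 3 bullying posts among 8.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(gold, predicted)
```

    Because bullying posts are usually a small minority, accuracy alone can look high for a classifier that never flags anything; F1 on the positive class gives that classifier a score of 0.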

A new dataset (added on Sep. 25, 2019):

  • The Offensive Language Identification Dataset (OLID)
    Contains a collection of 14,200 English tweets annotated with a scheme that encompasses the following three levels:
    • A) Offensive Language Detection: Not Offensive (NOT), Offensive (OFF)
    • B) Categorization of Offensive Language: Targeted Insult (TIN), Untargeted (UNT)
    • C) Offensive Language Target Identification: Individual (IND), Group (GRP), Other (OTH)
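
    The three levels can be read as a hierarchy. The sketch below encodes it as a small Python structure, assuming (as the level names suggest, but check the dataset documentation) that level B is only filled in for offensive tweets and level C only for targeted insults. The names OlidLabel and is_consistent are hypothetical and not part of any OLID tooling.

```python
from dataclasses import dataclass
from typing import Optional

LEVEL_A = {"NOT", "OFF"}         # offensive language detection
LEVEL_B = {"TIN", "UNT"}         # targeted insult vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}  # target of the insult


@dataclass(frozen=True)
class OlidLabel:
    a: str
    b: Optional[str] = None  # assumed to be set only when a == "OFF"
    c: Optional[str] = None  # assumed to be set only when b == "TIN"


def is_consistent(label: OlidLabel) -> bool:
    """Check that a label respects the assumed A -> B -> C hierarchy."""
    if label.a not in LEVEL_A:
        return False
    if label.a == "NOT":
        return label.b is None and label.c is None
    if label.b not in LEVEL_B:
        return False
    if label.b == "UNT":
        return label.c is None
    return label.c in LEVEL_C
```

    For example, OlidLabel("OFF", "TIN", "GRP") passes the check, while OlidLabel("NOT", "TIN") does not, because a non-offensive tweet cannot carry a level B label under this assumption.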

    OLID was the official dataset used in the OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 – Task 6) shared task.

    Download: The complete OLID v1.0 dataset (train, test, and gold labels) is available here.
    More information about OLID can be found in the NAACL 2019 paper.
    If you use OLID, please cite this paper:
        title={{Predicting the Type and Target of Offensive Posts in Social Media}}, 
        author={Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh}, 
        booktitle={Proceedings of NAACL}, 

