SML – Project tasks

Machine learning tasks for SML Projects

Polarity/Sentiment classification tasks
Sentiment classification can change the world! If you have time, watch the Arte video “Comment Trump a manipulé l’Amérique” (in French). Most of the technical content is in the third part (from 29:00), but the confession at 7:00 is also worth hearing for data science students…

If you can’t access the video above, here is an introductory article on this subject.

  • Polarity prediction on movie reviews 
    Task: Predict if the review of a movie is positive or negative

      • Polarity dataset v2.0 (1000 positive and 1000 negative reviews)
        [Bo Pang and Lillian Lee, ACL 2004]
      • Large Movie Review Dataset (12500 positive and 12500 negative reviews)
        [Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts,  ACL 2011].
  • Predict the happiness (of customers)
    Task: Predict whether the author of a review was happy or not
    Datasets (roughly ordered by size):

    • French Amazon dataset available at FLUE GitHub
      Binary classification task. The goal is to classify Amazon reviews for three product categories: books, DVD, and music. Each sample contains a review text and the associated rating from 1 to 5 stars. Reviews rated above 3 stars are labeled as positive, and those rated below 3 stars are labeled as negative.
      The train and test sets are balanced, including around 1k positive and 1k negative reviews for a total of 2k reviews in each dataset.
    • TripAdvisor dataset (+ a notebook) available on GitHub
      The labeled data contains 38932 rows (there is also a test data file containing 29404 unlabeled rows, let us know if you find the labels for this file on the web).
    • English Amazon dataset available on Kaggle
      The source corpus, from the Stanford Network Analysis Project (SNAP), contains 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products. The subset used here contains 1,800,000 training samples and 200,000 testing samples for each polarity.
      Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
    • Yelp dataset (need to register to access data)
      Huge dataset!!!
      The Yelp dataset uses a 1* to 5* rating, but it can be transformed into a “happy” vs “not happy” classification task by treating 1* or 2* ratings as “not happy” and 4* or 5* ratings as “happy”.
      The task for the project is to predict the polarity of the reviews (on two or five levels), but this dataset contains more than the reviews: it also includes information on businesses and users, including each user’s friend mapping. The project can focus on the reviews alone, or also use the other available information…
      Some code is available to handle/transform the Yelp dataset (see also documentation)
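Several of the datasets above derive a binary label from a star rating. The sketch below shows one way to do this in plain Python, following the thresholds given for the Yelp dataset (1* or 2* is “not happy”, 4* or 5* is “happy”). The function name `stars_to_label`, the example reviews, and the choice to drop 3* reviews are assumptions made for illustration; they are not part of any dataset's tooling.

```python
from typing import List, Optional, Tuple


def stars_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a binary label.

    Returns None for 3-star reviews: dropping them is an assumption of
    this sketch, since the text only defines labels for 1-2 and 4-5 stars.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {stars}")
    if stars <= 2:
        return "not happy"
    if stars >= 4:
        return "happy"
    return None


# Hypothetical (text, stars) pairs, made up for illustration only.
reviews: List[Tuple[str, int]] = [
    ("Great food and friendly staff", 5),
    ("Average, nothing special", 3),
    ("Cold meal and a long wait", 1),
    ("Nice place, will come back", 4),
]

# Keep only the reviews that receive a label (3-star reviews are skipped).
labeled = [
    (text, label)
    for text, stars in reviews
    if (label := stars_to_label(stars)) is not None
]

for text, label in labeled:
    print(f"{label:>9}: {text}")
```

For a five-level polarity task, the star rating can be used directly as the label and no such mapping is needed.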
  • Multilingual Emoji Prediction
    SemEval-2018 Task 2 (overview, data details)
    Task: Predict the emoji contained in a tweet

    • 500k tweets in English
    • 100k tweets in Spanish
      The tweets were retrieved with the Twitter APIs, from October 2015 to February 2017, and geolocalized in the United States and Spain. The dataset includes tweets that contain one and only one emoji, of the 20 most frequent emojis. As labels, we will use the 20 most frequent emojis of each language. They are different across the English and Spanish corpora. In the following, we show the distribution of the emojis for each language (numbers refer to the percentage of occurrence of each emoji).
      Note that, due to an issue, only 19 emojis are considered in the Spanish task (labels 0 to 18; the “top” emoji is omitted).


    • Barbieri F., Ballesteros M., Saggion H., Are Emojis Predictable?, European Chapter of the Association for Computational Linguistics (EACL), Valencia, 3-7 April 2017.
  • Emotion Recognition
    Task: Recognize the emotion expressed in a tweet

    • The emotion dataset comes from the paper by Saravia et al. The authors constructed a set of hashtags and used them to collect, through the Twitter API, a dataset of English tweets covering eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

      To get the dataset (remove “!” if you run the command in a terminal rather than in a Jupyter notebook):

      This dataset is also available on the Hugging Face 🤗 Dataset Hub. To use it, type in a notebook:

      !pip install datasets 

      Then, you can load the dataset in python with:

      from datasets import load_dataset
      emotion_dataset = load_dataset("emotion")   
Symbolic music classification  
Other resources that could be used to define a classification task studied in a project

(subject to prior approval by teachers)

  • Music classification from WAV files (requires a signal transformation…)
  • Cyberbullying detection: an interesting and important task, but we are not sure of the quality of the datasets that we have found…
    New: Offensive Language Identification Dataset added
