Machine learning tasks for SML Projects
Polarity/Sentiment classification tasks
Sentiment classification can change the world! Watch the Arte video “Comment Trump a manipulé l’Amérique” (in French) if you have time. The technical content is mostly in the third part (from 29:00), but the confession at 7:00 is also worth hearing for data science students…
If you can’t access the video above, here is an introductory article at vice.com on this subject.
- Polarity prediction on movie reviews
Task: Predict if the review of a movie is positive or negative
Datasets:
- Polarity dataset v2.0 (1000 positive and 1000 negative reviews)
[Bo Pang and Lillian Lee, ACL 2004]
- Large Movie Review Dataset (12500 positive and 12500 negative reviews)
[Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, ACL 2011].
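As a starting point for the polarity task, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is one possible baseline, not a method prescribed by the datasets above. The toy reviews, the `pos`/`neg` label names, and the function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(documents, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(documents, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only.
train_texts = [
    "a wonderful and moving film",
    "great acting and a wonderful story",
    "a boring and predictable plot",
    "terrible acting and a boring script",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "a wonderful story"))
print(predict(model, "boring and terrible"))
```

On a real dataset, the training texts and labels would come from the review files, and accuracy should be measured on a held-out test split.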
- Predict the happiness (of customers)
Task: Predict whether the author of a review was happy or not
Datasets (roughly ordered by size):
- French Amazon dataset available at FLUE GitHub
Binary classification task. It consists of classifying Amazon reviews for three product categories: books, DVD, and music. Each sample contains a review text and the associated rating from 1 to 5 stars. Reviews rated above 3 are labeled as positive, and those rated below 3 are labeled as negative.
The train and test sets are balanced, including around 1k positive and 1k negative reviews for a total of 2k reviews in each dataset.
- TripAdvisor dataset (+ a notebook) available on GitHub
The labeled data contains 38932 rows (there is also a test data file containing 29404 unlabeled rows; let us know if you find the labels for this file on the web).
- English Amazon dataset available on Kaggle
34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples for each polarity.
Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
- Yelp dataset (registration required to access the data)
Huge dataset!!!
In the Yelp dataset, ratings use a 5-star scale, but the task can be turned into a “happy” vs “not happy” classification task by labeling 1* and 2* ratings as “not happy” and 4* and 5* ratings as “happy”.
The task for the project is to predict the polarity of the reviews (on two or five levels), but this dataset contains more than the reviews (information on businesses and users, including each user’s friend mapping). The project can focus on the reviews only, or also use the other available information…
Some code is available to handle/transform the Yelp dataset (see also the documentation)
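The star-to-label rule described above can be sketched as follows. This is only an illustration: it assumes that reviews are stored one JSON object per line with `stars` and `text` fields (check the dataset documentation for the actual schema), and the function names and sample reviews are made up. The same thresholds match the rule given for the French Amazon dataset (above 3 is positive, below 3 is negative).

```python
import json


def stars_to_label(stars):
    """Map a 1-5 star rating to a binary label; 3-star reviews return None."""
    if stars <= 2:
        return "not happy"
    if stars >= 4:
        return "happy"
    return None  # neutral ratings are left out of the binary task


def label_reviews(json_lines):
    """Parse one JSON review per line and keep (text, label) pairs."""
    pairs = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        label = stars_to_label(review["stars"])  # assumed field name
        if label is not None:
            pairs.append((review["text"], label))  # assumed field name
    return pairs


# Made-up sample lines, for illustration only.
sample = [
    '{"stars": 5, "text": "Great food and friendly staff"}',
    '{"stars": 3, "text": "It was okay"}',
    '{"stars": 1, "text": "Never again"}',
]
labeled = label_reviews(sample)
print(labeled)
```

Because the function iterates over lines, it can be given an open file handle directly, which avoids loading a very large file into memory at once.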
- Multilingual Emoji Prediction
SemEval-2018 Task 2 (overview, data details)
Task: Predict the emoji contained in a tweet
Datasets:
- 500k tweets in English
- 100k tweets in Spanish
The tweets were retrieved with the Twitter APIs, from October 2015 to February 2017, and geolocalized in the United States and Spain. The dataset includes tweets that contain one and only one emoji, of the 20 most frequent emojis. As labels, we will use the 20 most frequent emojis of each language. They are different across the English and Spanish corpora. In the following, we show the distribution of the emojis for each language (numbers refer to the percentage of occurrence of each emoji).
Note that due to an issue we only consider 19 emojis in the Spanish task (from 0 to 18 where “top” emoji is omitted)
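Since the emoji classes are not equally frequent, it can help to recompute the label distribution on your own copy of the data before training. A minimal sketch, assuming the labels have already been read into a list with one label per tweet (the sample labels below are invented):

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: percentage of occurrence}, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.most_common()}


# Invented labels, for illustration only.
labels = ["0", "1", "0", "2", "0", "1", "0", "3"]
dist = label_distribution(labels)
for label, pct in dist.items():
    print(f"{label}: {pct:.1f}%")
```

The percentage of the most frequent label is also the accuracy of a majority-class baseline, which gives a useful lower bound for any classifier.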
References:
- Barbieri F., Ballesteros M., Saggion H., Are Emojis Predictable?, European Chapter of the Association for Computational Linguistics Valencia, 3-7 April 2017.
- Emotion Recognition
Task: Recognize emotion in tweets
Dataset:
- The emotion dataset comes from the paper CARER: Contextualized Affect Representations for Emotion Recognition by Saravia et al. The authors constructed a set of hashtags to collect a separate dataset of English tweets from the Twitter API belonging to eight basic emotions, including anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
To get the dataset (remove “!” if you use it in a terminal rather than in a Jupyter notebook):
!wget https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt
!wget https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt
!wget https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt
This dataset is also available on the Hugging Face 🤗 Dataset Hub. To use it, type in a notebook:
!pip install datasets
Then, you can load the dataset in python with:
from datasets import load_dataset
emotion_dataset = load_dataset("emotion")
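If you work from the downloaded text files instead, they need to be parsed by hand. The sketch below assumes each line has the form `text;label` (please check this against your copy of `train.txt`). The function name and sample lines are invented for illustration.

```python
def parse_emotion_lines(lines):
    """Split each 'text;label' line into a (text, label) pair."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last semicolon so that semicolons inside the text survive.
        text, _, label = line.rpartition(";")
        examples.append((text, label))
    return examples


# Invented sample lines, for illustration only.
sample_lines = [
    "i feel so happy today;joy",
    "i am afraid of the dark;fear",
    "",
]
examples = parse_emotion_lines(sample_lines)
print(examples)

# With the real file (assumed to be in the working directory):
# with open("train.txt", encoding="utf-8") as f:
#     train_examples = parse_emotion_lines(f)
```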
Symbolic music classification 
- Classification of music genres based on repetitive basslines
Task: Predict music genre from midi file bassline track.
Dataset: 40 MIDI bassline tracks per genre (Blues, BossaNova, Forro, Funk, HipHop, MinimalTechno, Motown, Reggae, RockNineties, RockSeventies, SalsaMambo, Swing and Zouglou) built by J. Abeßer from Fraunhofer Institute for Digital Media Technology (IDMT).
The dataset is not public (for copyright reasons) and access to the data is subject to a non-disclosure agreement. Contact us if you are interested in this dataset.
References:
- Classification of music genres based on repetitive basslines, J. Abeßer, H. Lukashevich, P. Bräuer, Journal of New Music Research, 2012
- Automatic Transcription of Bass Guitar Tracks applied for Music Genre Classification and Sound Synthesis, PhD thesis of J. Abeßer, 2014
Helpful resources:
- Ircam note on Humdrum formats
- The Humdrum Toolkit: Software for Music Research contains tools to convert midi to **kern format
- Ideas for symbolic music encoding: Musical Style Identification Using Grammatical Inference: The Encoding Problem (see also Two grammatical inference applications in music processing)
Other resources that could be used to define a classification task studied in a project
(subject to prior approval by teachers)
- Music classification from wav files
! Need signal transformation…
- Cyberbullying detection: an interesting and important task, but we are not sure of the quality of the datasets that we have found…
New: Offensive Language Identification Dataset added