Supervised Machine Learning
Course of the SIF master's program
Instructors:
- François Coste, Dyliss, francois.coste@inria.fr
- Ewa Kijak, Linkmedia, ewa.kijak@irisa.fr
List of students
Tentative schedule
(Look also at ADE UR1 Planning for the actual room, search for M2 INFO-SIF)
1 | 11/09 15h00-16h30 | B02B-E209 | F. Coste | Introduction |
2 | 11/09 16h45-18h15 | B02B-E209 | F. Coste | Methodology (slides, 8pp) |
3 | 13/09 16h45-18h15 | B02B-E208 | E. Kijak | Decision trees (slides 6pp) |
4 | 18/09 15h00-16h30 | B02B-E208 | E. Kijak | Bayesian Learning (slides 6pp) |
5 | 20/09 16h45-18h15 | B02B-E208 | E. Kijak | Support Vector Machine (slides 6pp) |
6 | 25/09 15h00-16h30 | B02B-E208 | E. Kijak | Logistic regression (slides 6pp) |
7 | 25/09 16h45-18h15 | B02B-E208 | E. Kijak | Neural networks (slides 6pp) |
8 | 27/09 16h45-18h15 | B02B-E208 | E. Kijak | Model combination (slides 6pp) |
9 | 02/10 15h00-16h30 | B02B-E208 | F. Coste | Naive Bayes for text classification in practice with scikit-learn (notebook: skeleton) |
10 | 04/10 15h00-16h30 | B02B-E209 | F. Coste | Naive Bayes for text classification in practice with scikit-learn (cont'd) |
11 | 04/10 16h45-18h15 | B02B-E209 | F. Coste | Natural language preprocessing (notebook: complete; slides of lectures 9-11, 8pp); Introduction to grammatical inference (slides, 8pp) |
12 | 11/10 15h00-16h30 | B02B-E208 | F. Coste | Automata learning. The kTs.tgz file from this directory might contain C code for k-TSI (“Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition” by P. García and E. Vidal) |
13 | 11/10 16h45-18h15 | B02B-E208 | F. Coste | Grammar learning (slides of lectures 12-13, 8pp) |
14 | 25/10 15h00-16h30 | B12D-i52 | E. Kijak, F. Coste | Projects |
| 07/11 13h15-14h45 | B12D-i52 | E. Kijak, F. Coste | Exam |
We will use:
- Jupyter
- Python 3.5 or later with scikit-learn, nltk (book), and pandas modules
and the following datasets:
- UCI Mushroom Data Set
- Names Corpus, Version 1.3 (1994-03-29), by Mark Kantrowitz and Bill Ross
- SMS Spam Collection v.1 from Tiago A. Almeida and José María Gómez Hidalgo (see also UCI and Kaggle pages)
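As an illustration of how such a dataset can be read, here is a minimal sketch for the SMS Spam Collection. It assumes the usual layout of that file, one message per line with the label (ham or spam) and the raw text separated by a tab. The sample messages below are made up for the example and are not taken from the dataset; `read_sms` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed "label<TAB>message" layout (not real data).
SAMPLE = (
    "ham\tSee you at the lecture tomorrow\n"
    "spam\tWIN a \"free\" prize now, reply YES\n"
    "ham\tCan you send me the notebook?\n"
)


def read_sms(lines):
    """Return a list of (label, text) pairs from tab-separated lines."""
    # QUOTE_NONE: messages may contain quote characters that are plain text,
    # not CSV quoting, so quote handling is switched off.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]


messages = read_sms(io.StringIO(SAMPLE))
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To read the real file, pass an open file object instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to wherever you saved the downloaded collection.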
Run for instance (conda might be an alternative):
pip3 install -U numpy scipy scikit-learn nltk pandas
Then download this notebook and run it to check that the required packages are available on your computer and download useful resources in advance. “Introduction to NumPy and Matplotlib” by Sebastian Raschka and “10 minutes to pandas” are good introductions to those libraries that we won’t present during the lectures.
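If you want a quick check before opening the notebook, the following is a minimal sketch (it is not the course notebook) that tests whether the packages from the pip command can be found by Python. The helper name `missing_packages` is hypothetical; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages installed by the pip command.
# scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "nltk", "pandas"]


def missing_packages(names):
    """Return the names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that the packages are installed; it does not download the NLTK resources, which the course notebook is meant to take care of.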
Many other datasets are available to play with. See for instance UCI Machine Learning Repository, Kaggle, Wikipedia list of datasets for ML, or Google dataset search engine.
Projects:
- General goal: Implement a learning process using the methodology and the methods seen during the module
- Subject: Compare and study the influence of the different choices that can be made to tackle one of the machine learning tasks proposed here.
- Instructions: Perform a rigorous and reproducible comparative study of at least 3 learning approaches on the chosen task.
  The learning approaches should be evaluated. The study should explain the choice of representations, as well as the parameters of the learning approaches. You should provide a short analysis and discussion of the results. We expect conclusions on the pros and cons (learning quality, required amount of data or computing resources, …) of each learning approach on the chosen task. Master 2 students will be asked to go one step further by exploring one (or several) aspect(s) of their choice, such as the impact of the representation, noise in the data, the number of learning examples, the influence of some parameter, or the study of another learning strategy (such as semi-supervised learning)…
  To make it more pedagogical: don’t use GridSearch or similar tools!
- Deliverable: Jupyter notebook presenting the experiments and allowing them to be reproduced (notebook file or link to it + PDF export) + short report in PDF format + 15-minute presentation with slides.
  The report and presentation should cover the task, the methodology, the experiments and the results (in the form of a table), the short analysis, and the conclusions, in a concise, precise, and clear manner.
If you use code from others, don’t forget to acknowledge the source…
- 2-3 people per group
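To illustrate what a parameter study without GridSearch-like tools can look like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the data is synthetic, the k-nearest-neighbours classifier is a stand-in chosen for brevity (it is not one of the methods listed in the schedule), and the function names are hypothetical. In a project you would typically plug in scikit-learn estimators and a real dataset instead. The fixed seeds are what make the run reproducible.

```python
import random
from collections import Counter


def make_toy_data(n_per_class=30, seed=0):
    """Two noisy 2-D blobs; synthetic data for illustration only."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    return data


def k_fold_indices(n, n_folds, seed):
    """Shuffle indices with a fixed seed and split them into n_folds folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train, point, k):
    """Majority vote among the k nearest training examples."""
    nearest = sorted(
        train,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k, n_folds=5, seed=0):
    """Mean accuracy of k-NN over the folds, each fold held out once."""
    scores = []
    for held_out in k_fold_indices(len(data), n_folds, seed):
        held = set(held_out)
        train = [ex for i, ex in enumerate(data) if i not in held]
        correct = sum(
            knn_predict(train, data[i][0], k) == data[i][1] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Explicit loop over candidate values: every choice stays visible and reportable.
data = make_toy_data()
results = {k: cross_val_accuracy(data, k) for k in (1, 3, 5, 7)}
for k, acc in results.items():
    print(f"k={k}: mean CV accuracy = {acc:.3f}")
best_k = max(results, key=results.get)
print("Selected k:", best_k)
```

Writing the loop by hand keeps the full table of scores available for the report. Keep in mind that the score of the selected value is an optimistic estimate; a separate held-out test set is needed for the final evaluation.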
Assessment:
60% written exam (notes taken during lectures and slides allowed), 40% project