SML

Supervised Machine Learning

Course of the Master SIF

Instructors: F. Coste, E. Kijak

List of students

Tentative schedule

(Also check the ADE UR1 planning for the actual rooms; search for M2 INFO-SIF.)

 #   Date   Time         Room       Instructor           Topic and material
 1   11/09  15h00-16h30  B02B-E209  F. Coste             Introduction
 2   11/09  16h45-18h15  B02B-E209  F. Coste             Methodology (slides, 8pp; notebook: Name gender prediction)
 3   13/09  16h45-18h15  B02B-E208  E. Kijak             Decision trees (slides, 6pp)
 4   18/09  15h00-16h30  B02B-E208  E. Kijak             Bayesian learning (slides, 6pp)
 5   20/09  16h45-18h15  B02B-E208  E. Kijak             Support vector machines (slides, 6pp)
 6   25/09  15h00-16h30  B02B-E208  E. Kijak             Logistic regression (slides, 6pp)
 7   25/09  16h45-18h15  B02B-E208  E. Kijak             Neural networks (slides, 6pp)
 8   27/09  16h45-18h15  B02B-E208  E. Kijak             Model combination (slides, 6pp)
 9   02/10  15h00-16h30  B02B-E208  F. Coste             Naive Bayes for text classification in practice with scikit-learn (notebook: skeleton)
 10  04/10  15h00-16h30  B02B-E209  F. Coste             Naive Bayes for text classification in practice with scikit-learn (cont'd)
 11  04/10  16h45-18h15  B02B-E209  F. Coste             Natural language preprocessing (notebook: complete; slides of lectures 9-11, 8pp); Introduction to grammatical inference (slides, 8pp)
 12  11/10  15h00-16h30  B02B-E208  F. Coste             Automata learning (see note below)
 13  11/10  16h45-18h15  B02B-E208  F. Coste             Grammar learning (slides of lectures 12-13, 8pp)
 14  25/10  15h00-16h30  B12D-i52   E. Kijak, F. Coste   Projects
 -   07/11  13h15-14h45  B12D-i52   E. Kijak, F. Coste   Exam

Note (lecture 12): the kTs.tgz file from this directory may contain C code for k-TSI ("Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition" by P. García and E. Vidal).
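
As a foretaste of the scikit-learn practice sessions (lectures 9-11), here is a minimal sketch of Naive Bayes text classification; the toy documents and labels are made up for illustration and are not the corpus used in the course notebooks:

    # Bag-of-words features + multinomial Naive Bayes on a toy corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training documents with two classes (illustrative data only).
    train_docs = ["the match ended in a draw",
                  "the team scored a late goal",
                  "parliament passed the new budget",
                  "the minister announced a reform"]
    train_labels = ["sport", "sport", "politics", "politics"]

    # CountVectorizer turns each document into word counts;
    # MultinomialNB models the class-conditional word frequencies.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_docs, train_labels)

    print(model.predict(["the goal decided the match"]))       # expected: sport
    print(model.predict(["the budget reform was announced"]))  # expected: politics

This vectorizer-plus-estimator pipeline is the standard scikit-learn idiom for text classification.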

We will use Python with the NumPy, SciPy, scikit-learn, NLTK, and pandas libraries, together with a few datasets for the practical sessions.

To set up your environment, run for instance (conda might be an alternative):

    pip3 install -U numpy scipy scikit-learn nltk pandas

Then download this notebook and run it to check that the required packages are available on your computer and to download useful resources in advance. "Introduction to NumPy and Matplotlib" by Sebastian Raschka and "10 minutes to pandas" are good introductions to these libraries, which we won't present during the lectures.
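
If you want a quick sanity check outside the notebook, a minimal sketch along these lines verifies that the packages import and pre-downloads some NLTK resources (the resource names are only examples, not the exact list used in the course):

    # Check that the required packages are installed and print their versions.
    import numpy, scipy, sklearn, nltk, pandas

    for module in (numpy, scipy, sklearn, nltk, pandas):
        print(module.__name__, module.__version__)

    # Pre-download a few commonly used NLTK resources (adjust as needed).
    for resource in ("punkt", "stopwords", "names"):
        nltk.download(resource)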

Many other datasets are available to play with. See for instance the UCI Machine Learning Repository, Kaggle, Wikipedia's list of datasets for machine learning, or Google's dataset search engine.
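
For instance, many OpenML datasets can be fetched directly with scikit-learn (the dataset name below is only an example):

    # Fetch a public dataset from OpenML as a pandas DataFrame.
    from sklearn.datasets import fetch_openml

    # 'titanic' is a classic example; any other OpenML name or id works alike.
    titanic = fetch_openml("titanic", version=1, as_frame=True)
    X, y = titanic.data, titanic.target
    print(X.shape)
    print(y.value_counts())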

Projects:

  • General goal: Implement a learning process using the methodology and the methods seen during the module.
  • Subject: Compare and study the influence of the different choices that can be made to tackle one of the machine learning tasks proposed here.
  • Instructions: Perform a rigorous and reproducible comparative study of at least 3 learning approaches on the chosen task.
    The learning approaches should be evaluated. The study should explain the choice of representations, as well as the parameters of the learning approaches. You should provide a short analysis and discussion of the results. We expect conclusions on the pros and cons (learning quality, required amount of data or computing resources, …) of each learning approach on the chosen task. Master 2 students will be asked to go one step further by exploring one (or several) aspect(s) of their choice, such as the impact of the representation, of noise in the data, of the number of learning examples, or of some parameter, or the study of another learning strategy (such as semi-supervised learning).
    To make it more pedagogical: don't use GridSearch or similar tools! (A sketch of the expected manual exploration is given after this list.)
  • Deliverable: a Jupyter notebook presenting the experiments and enabling their reproduction (notebook file or a link to it, plus a PDF export), a short report in PDF format, and a 15-minute presentation with slides.
    The report and the presentation should cover the task, the methodology, the experiments and their results (in the form of a table), a short analysis, and conclusions, in a synthetic, precise, and clear manner.
    If you use code from others, don't forget to acknowledge the source.
  • Groups of 2-3 students.
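
Since parameter exploration must stay explicit rather than delegated to GridSearchCV, here is a minimal sketch of the kind of manual comparative loop we have in mind; the classifiers, parameter values, and dataset are illustrative choices, not requirements:

    # Manual comparative study: explicit loops over models and parameters,
    # evaluated with cross-validation (no GridSearchCV).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    candidates = []
    for depth in (2, 3, 5):  # explicit exploration of tree depth
        candidates.append((f"tree(max_depth={depth})",
                           DecisionTreeClassifier(max_depth=depth, random_state=0)))
    for C in (0.1, 1.0, 10.0):  # explicit exploration of regularization strength
        candidates.append((f"logreg(C={C})",
                           LogisticRegression(C=C, max_iter=1000)))

    # 5-fold cross-validation gives a mean accuracy per configuration;
    # collect such numbers in a results table, then analyse and discuss them.
    for name, clf in candidates:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name:22s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")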

 

Assessment:

60% written exam (notes taken during lectures and slides allowed), 40% project

Excerpt of the 2016 exam
