Learning linguistic models and application to biological sequences
Keywords: Grammatical Inference, Machine Learning, Functions and Structures of Protein Sequences, DNA…
Software
- Protomata Learner, inference of automata modeling families of protein sequences by partial local multiple alignment.
The current version, Protomata 2.0, can be used through a web interface on the Genouest Bioinformatics platform server.
We are working on version 3.0…
- PPalign, Potts to Potts alignment of protein sequences taking into account residue coevolution, with Hugo Talibart.
PhD Students
- Pablo Espana Gutierrez, Learning models with explicit dependencies between residues to predict protein functions, since Sep. 2023
Previous Ph.D. Students
- Nicolas Buton, Transformer models for interpretable and multilevel prediction of protein functions from sequences (with Yann Le Cunff and Olivier Dameron), Oct. 2023.
- Olivier Dennler, Characterization in functional modules of ADAMTS-TSL proteins, by phylogeny approaches (with Nathalie Théret, Samuel Blanquart, and Catherine Belleannée), Dec 2022.
- Hugo Talibart, Comparison of homologous protein sequences using direct coupling information by pairwise Potts model alignments, Feb 2021.
- Clovis Galiez, Structural fragments: comparison, predictability from the sequence and application to the identification of viral structural proteins, Dec 2015.
- Gaëlle Garet, Classification and Characterization of Enzymatic Families using Formal Methods (with Jacques Nicolas), Dec 2014.
- Matthias Gallé, Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem, Jan 2011.
- Goulven Kerbellec, Apprentissage d’automates modélisant des familles de séquences protéiques, Jun 2008.
- Marie Lahaye, Apprentissage de signatures topologiques de protéines. Marie left us too soon, but we are not forgetting her…
- Aurélien Leroux, Inférence grammaticale sur des alphabets ordonnés (main supervisor: Jacques Nicolas), Jun 2005.
- Daniel Fredouille, Inférence d’automates finis non déterministes par gestion de l’ambiguïté, en vue d’applications en bioinformatique, Oct 2003.
Funded projects
- Pepper: Vers la nouvelle génération de méthodes d’alignement protéiques avec les modèles de Potts, coordinated by Mathilde Carpentier, Émergence 2021-2022 from Alliance Sorbonne Université
- IDEALG Seaweed for the future, ANR Investissements d’avenir, Biotechnology and Bioresources
- Characterization of desaturases with Pleiade team, IPL Algae in silico
- Grammatical inference methods in classification of amyloidogenic proteins with Politechnika Wroclawska, Poland, funded by Polish National Science Center
- “Omics”-Line of the Chilean CIRIC-Inria Center
- PEPS project: Characterisation and identification of viral sequences in marine metagenomes
- ANR Biotempo: Languages, time representations and hybrid models for the analysis of incomplete models in molecular biology
- ANR LepidOLF: Microgénomique de la sensille phéromonale d’un lépidoptère : une approche novatrice pour comprendre les mécanismes olfactifs et leur modulation
- ANR Pelican: Competing for light in the ocean: An integrative genomic approach of the ecology, diversity and evolution of cyanobacterial pigment types in the marine environment
- Collaboration MINCyT (ex SECyT) – INRIA with the “Grupo de Procesamiento de Lenguaje Natural” of Gabriel Infante-Lopez: Modélisation linguistique de séquences génomiques par apprentissage de grammaires
- ANR Proteus: Reconnaissance de pli et repliement inverse : vers une prédiction à grande échelle des structures de protéines
- ANR Modulome: Deciphering and modelling the structural organization of genomes
Selected publications
Primers and reviews
- Learning the Language of Biological Sequences, François Coste. In Topics in Grammatical Inference, editors: J. Heinz, J.M. Sempere, Springer 2016 (also in open archive HAL)
- Le langage des molécules du vivant, Jacques Nicolas, Catherine Belleannée, François Coste. In Bibliothèque Tangente, Editions Pôle Paris, 2014.
- Bioinformatique, François Coste, Claire Nédellec, Thomas Schiex, Jean-Philippe Vert. In Panorama actuel de l’intelligence artificielle, volume 3: frontières et applications, editors: P. Marquis, O. Papini, and H. Prade, Cépaduès, 2014.
Looking at long-distance correlations
With Transformers’ attention
- Predicting enzymatic function of protein sequences with attention, Nicolas Buton, François Coste, Yann Le Cunff, Bioinformatics, 2023
Residue coevolution
- PPalign: optimal alignment of Potts models representing proteins with direct coupling information, Hugo Talibart, François Coste, BMC Bioinformatics, 2021
Protein sequences and structures
- VIRALpro: a tool to identify viral capsid and tail sequences, Clovis Galiez, Christophe Magnan, François Coste, Pierre Baldi, Bioinformatics, 2015
- Use VIRALpro web service to detect capsid and tail proteins in peptidic sequences.
- Download VIRALpro for offline use or supplementary material for paper.
- Amplitude Spectrum Distance: measuring the global shape divergence of protein fragments, Clovis Galiez, François Coste, BMC Bioinformatics 2015, 16:256
- Download ASD program or supplementary material by Clovis
- Structural conservation of remote homologs: better and further in contact fragments, Clovis Galiez, François Coste, poster at ISMB/ECCB 2015 – 3DSIG: Structural Bioinformatics and Computational Biophysics.
Learning context-free grammars
- Estimating probabilistic context-free grammars for proteins using contact map constraints, Witold Dyrka, Mateusz Pyzik, François Coste, Hugo Talibart, PeerJ, 2019.
- Learning local substitutable context-free languages from positive examples in polynomial time and data by reduction. François Coste, Jacques Nicolas, ICGI 2018. (slides)
- How to measure the topological quality of protein parse trees? Mateusz Pyzik, François Coste and Witold Dyrka. ICGI 2018.
- A refined parsing graph approach to learn smaller contextually substitutable grammars with less data, François Coste, Mikail Demirdelen, ICGI 2016.
- A bottom-up efficient algorithm learning substitutable languages from positive examples, François Coste, Gaëlle Garet, Jacques Nicolas, ICGI 2014. (slides)
- Local Substitutability for Sequence Generalization, François Coste, Gaëlle Garet, Jacques Nicolas. ICGI 2012. (slides by Gaëlle) Submitted extended version (first submission of ReGLiS and definition of prime and composite classes)
- Searching for Smallest Grammars on Large Sequences and Application to DNA, Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez, Journal of Discrete Algorithms, vol. 11, February 2012, pp 62-72. (preprint)
- The Smallest Grammar Problem as Constituents Choice and Minimal Grammar Parsing, Rafael Carrascosa, François Coste, Matthias Gallé, Gabriel Infante-Lopez, Algorithms, 4 (2011) 262-284
- In place update of suffix array while recoding words, Matthias Gallé, Pierre Peterlongo and François Coste, International Journal of Foundations of Computer Science, vol. 20, Issue 6, 2009, pp. 1025-1045.
Extended version of paper presented at PSC 2008 (abstract, paper, slides).
- Progressing the State-of-the-Art in Grammatical Inference by Competition, Brad Starkie, François Coste and Menno van Zaanen, AI Communications, vol. 18, no. 2, 2005, pp. 93-115.
Learning automata (and partial local multiple alignment applications)
- Phylogenetic inference of the emergence of sequence modules and protein-protein interactions in the ADAMTS-TSL family, Olivier Dennler, Samuel Blanquart, Catherine Belleannée, Nathalie Théret. PLOS Computational Biology, 2023
- Extracellular vesicles produced by human and animal Staphylococcus aureus strains share a highly conserved core proteome, Natayme Rocha Tartaglia, Aurélie Nicolas, Vinícius de Rezende Rodovalho, Brenda Silva Rosa da Luz, Valérie Briard-Bion, Zuzana Krupova, Anne Thierry, François Coste, Agnes Burel, Patrice Martin, Julien Jardin, Vasco Azevedo, Yves Le Loir, Eric Guédon. Scientific Reports, Nature Publishing Group, 2020
- CyanoLyase: a database of phycobilin lyase sequences, motifs and functions, Anthony Bretaudeau, François Coste, Florian Humily, Laurence Garczarek, Gildas Le Corguillé, Christophe Six, Morgane Ratin, Olivier Collin, Wendy M Schluchter, Frédéric Partensky. Nucleic Acids Research, Oxford University Press, 2012
- A Similar Fragments Merging Approach to Learn Automata on Proteins, François Coste and Goulven Kerbellec, ECML 2005. (abstract, paper, extended version, data sets). Some slides presenting this work and more at a grammatical inference workshop: slides, 4 per page for printing
- Introducing Domain and Typing Bias in Automata Inference, François Coste, Daniel Fredouille, Christopher Kermorvant and Colin de la Higuera. ICGI 2004. paper (.pdf), slides (.ppt, 2.2MB)
- Mutually compatible and incompatible merges for the search of the smallest consistent DFA, John Abela, François Coste and Sandro Spina. ICGI 2004. paper (.pdf), slides (.ppt)
- Unambiguous automata inference by means of state-merging methods. François Coste, Daniel Fredouille, ECML’03. paper (.ps.gz, .pdf) complementary experiments (.ps.gz, .pdf), benchmarks (.tar.gz), slides (.ppt).
Parsing ambiguity!
- What is the Search Space for the Inference of Non Deterministic, Unambiguous and Deterministic Automata? François Coste, Daniel Fredouille, Technical Report RR-4907, 2003
- Efficient ambiguity detection in C-NFA, a step toward inference of non deterministic automata, François Coste, Daniel Fredouille, ICGI 2000, Grammatical inference: algorithms and applications, Lisbon, 2000, paper (.ps.gz, .pdf), benchmark (.tar.gz).
Classification ambiguity!
- State merging inference of finite state classifiers, François Coste, INRIA/IRISA, May 1999, report (.ps.gz, .pdf)
- Regular Inference as a graph coloring problem, François Coste, Jacques Nicolas, ICML97, Grammatical Inference Workshop, Nashville TN, USA, 1997 (.ps.gz, .pdf)
Ph.D. Thesis
- Apprentissage d’automates classifieurs en inférence grammaticale, IRISA/Université de Rennes 1, January 27, 2000.
Advisor: Jacques Nicolas.
Abstract (English and French), thesis (.ps.gz, .pdf, errata), slides (.ps.gz, .pdf).
Grammatical Inference Benchmarks and Competitions
- I gathered classical grammatical inference benchmarks in this GIB repository. Don’t hesitate to contribute with your own data sets, especially real-world ones!
- I set up the Gowachin server, a continuation of the Abbadingo One DFA learning competition, allowing parametrized problems to be generated. I also co-organized Omphalos, the competition on learning context-free languages; it is now over, but the data sets are still available… If you are interested in grammatical inference competitions, look at this page.
More complete list of publications here.
This page is updated on an irregular basis: browse HAL for new publications.