I am interested in the development of new methods to study genome evolution and dynamics. At the moment, my interests are focused on methods dedicated to sequencing data from Next and Third Generation Sequencing technologies. I develop methods for, among other topics, de novo genome assembly, genomic variant detection, and the compression and correction of sequencing data.
In the past, I worked on classical comparative genomics methods based on the identification of homologous genes (Post-doc work on Mollicutes) and full genome comparisons (PhD work).
From a biological point of view, I am particularly interested in genome rearrangements and genome organisation, such as the analysis of rearrangement breakpoints in complete genomes (PhD work) or the detection of structural variants in population genomics data (current work).
Genomics has been profoundly transformed by the arrival of Next Generation Sequencers (NGS), and now Third Generation Sequencers (TGS). These new technologies make it possible to sequence biological material at a much higher throughput than before, at a price now affordable for most biology labs. However, they generate huge amounts of data of a new type (mainly short sequences called reads) that require new computational methods.
In the Genscale team, I am involved in several projects dealing with such NGS or TGS data. Their aim is to improve the extraction of biological information from the reads, with a particular concern for maintaining or improving the time and memory performance of the algorithms (which is actually the main bottleneck of current tools).
Notably, many of the tools we developed rely on the Genome Assembly Tool Box (GATB) library, which offers a very lightweight representation of the de Bruijn graph. As a result, their time and memory requirements are very low, making it possible, for instance, to call SNPs in several human datasets on a current laptop or desktop computer.
We proposed a new method for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer based method to detect insertion sites in a reference genome, and subsequently assembles the inserted sequences from the complete set of donor reads.
The method is implemented in the tool MindTheGap and showed high recall and precision on simulated datasets of various genome complexities. When applied to real C. elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions longer than 1 kb, using at most 14 GB of memory.
MindTheGap has been greatly improved since its publication in 2014, and we are currently applying it in the context of medical diagnosis for rare human diseases.
Publication : MindTheGap : integrated detection and assembly of short and long insertions
Software : MindTheGap, or go directly to the GitHub page
We extended MindTheGap to the task of genome assembly finishing: it can fill the gaps between a set of input contigs without any prior knowledge of their relative order and orientation.
We developed a full pipeline around this tool in order to assemble a genome of interest in a metagenomic sample, using a remote reference genome as a guide. It is particularly suited to datasets in which structural variation is present. Results are output in a GFA file to help visualize and understand the co-existing genome structures.
Results to appear...
We developed new tools, such as discoSnp and TakeABreak, to detect genomic variants in raw read sets without assembly or mapping to a reference genome. The method relies on the exploration of the de Bruijn graph generated from the combined read sets, looking for specific topological motifs. discoSnp focuses on Single Nucleotide Polymorphisms (SNPs), whereas TakeABreak detects inversion and translocation breakpoints. We also developed efficient filters and ranking criteria to discard false positives in complex and repetitive genomes, achieving better results than existing tools or the classical strategy of assembly+mapping.
We recently extended discoSnp to RAD-seq analyses.
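To give an idea of what a "topological motif" looks like, here is a minimal Python sketch of the simplest case: an isolated SNP creates a bubble in the de Bruijn graph, i.e. a node with two successors whose two branches each contain k nodes and then merge again. This is only an illustration under strong simplifying assumptions, not the discoSnp implementation: it ignores reverse complements, sequencing errors, coverage, and branching inside the bubble, and all function names are hypothetical.

```python
def build_graph(reads, k):
    """Node-centric de Bruijn graph: the set of all k-mers seen in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(node, nodes):
    """k-mers of the graph that overlap `node` by k-1 characters on its right."""
    return [node[1:] + base for base in "ACGT" if node[1:] + base in nodes]


def extend_branch(start, nodes, k):
    """Follow a non-branching path of exactly k nodes starting at `start`.

    Returns (path, node reached after the path), or None if the path
    stops or branches before that point.
    """
    path = [start]
    while len(path) < k:
        nxt = successors(path[-1], nodes)
        if len(nxt) != 1:
            return None
        path.append(nxt[0])
    nxt = successors(path[-1], nodes)
    return (path, nxt[0]) if len(nxt) == 1 else None


def find_snp_bubbles(nodes, k):
    """Return pairs of sequences (one per allele) for each simple SNP bubble."""
    bubbles = []
    for node in sorted(nodes):
        succ = successors(node, nodes)
        if len(succ) != 2:
            continue  # not the opening node of a two-branch bubble
        branches = [extend_branch(s, nodes, k) for s in succ]
        if None in branches:
            continue
        (path_a, end_a), (path_b, end_b) = branches
        if end_a == end_b:  # both branches merge on the same k-mer
            allele_a = node + "".join(n[-1] for n in path_a)
            allele_b = node + "".join(n[-1] for n in path_b)
            bubbles.append((allele_a, allele_b))
    return bubbles


# Two toy "read sets" differing by a single substitution (C/T) at index 5.
reads = ["ACGGTCATGCA", "ACGGTTATGCA"]
graph = build_graph(reads, k=4)
bubbles = find_snp_bubbles(graph, k=4)
```

On this toy input, `bubbles` contains one pair of 2k-long sequences that differ only at the variable position, which is the kind of output a real tool would then filter and rank.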
Publication : Mapping-Free and Assembly-Free Discovery of Inversion Breakpoints from Raw NGS Reads
Software : discoSnp, TakeABreak, or go directly to the discoSnp++ GitHub page
We developed several tools to compare raw read sets and extract similar reads between them in an efficient manner. The main focus was on time and memory performance, in order to be able to deal with potentially huge metagenomic datasets. This is achieved by using the k-mer (word of size k, with k around 20-40) as the comparison unit.
Simka computes various pairwise distances between metagenomic datasets based on their common and specific k-mer contents. It relies on an efficient k-mer counting algorithm and avoids storing the whole k-mer count matrix in memory.
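As a small illustration of what a k-mer based distance between two read sets means, the following Python sketch computes two classical ecological distances, a presence/absence Jaccard distance and an abundance-based Bray-Curtis distance, from k-mer counts. It is a naive in-memory sketch, not Simka's algorithm: it stores all counts in a dictionary, does not use canonical k-mers (a k-mer and its reverse complement are counted separately), and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (word of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |shared k-mers| / |all distinct k-mers|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def bray_curtis_distance(a, b):
    """Abundance-based distance: 1 - 2 * sum(min counts) / (sum of all counts)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


# Two toy read sets; real datasets use k around 20-40 and millions of reads.
counts_1 = kmer_counts(["ACGTAC", "GTACGT"], k=4)
counts_2 = kmer_counts(["ACGTAA", "TTACGT"], k=4)
jaccard = jaccard_distance(counts_1, counts_2)
bray_curtis = bray_curtis_distance(counts_1, counts_2)
```

Here the two toy sets share 3 of their 6 distinct 4-mers, so the Jaccard distance is 0.5, while the Bray-Curtis distance (1/3) is lower because the shared k-mers are also the most abundant ones.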
Compareads relies on a custom probabilistic data structure, based on the Bloom filter principle, to index all k-mers of the reads.
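For readers unfamiliar with the Bloom filter principle, here is a minimal Python sketch of a plain Bloom filter used to index the k-mers of one read set and to count how many k-mers of a query read it (probably) contains. This is a textbook structure, not the custom one used in Compareads; the class, function names, and parameter values are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several hash functions."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def index_reads(reads, k, n_bits=10_000, n_hashes=3):
    """Insert every k-mer of every read into a new Bloom filter."""
    bloom = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def shared_kmer_count(read, bloom, k):
    """Number of k-mers of `read` reported as present in the index."""
    return sum(read[i:i + k] in bloom for i in range(len(read) - k + 1))


indexed = index_reads(["ACGTACGTTAGC", "GGCATTACGGAT"], k=5)
hits = shared_kmer_count("ACGTACGTTAGC", indexed, k=5)
```

The memory used depends only on `n_bits`, not on the number or length of the k-mers. The price is a small, tunable false positive rate: a k-mer that was never inserted can occasionally be reported as present, whereas an inserted k-mer is never missed.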
Publications:
Software : Simka and Compareads, or go directly to the Simka GitHub page
We developed two tools with a very low memory footprint, based on the de Bruijn graph implementation in GATB, to correct (Bloocoo) and compress (Leon) raw Illumina short reads.
Publication : Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
Software : Bloocoo and Leon.
See also my PhD manuscript (in French) and the slides of my defense.
Chromosomal rearrangements are large scale mutations that alter the structure and organisation of genomes. I have studied them in the scope of the evolution of the mammalian genomes. The aim of this work was to characterise the genomic regions which have undergone such events; the latter are called breakpoints.
I developed Cassis, a new method to precisely localise rearrangement breakpoints on a genome by comparison with the genome of a related species.
The originality of the method is that it is divided into two steps :
The whole method was applied to localise breakpoints on the human genome compared with other fully sequenced mammalian genomes, and was shown to achieve a better precision than other published methods.
Publications : Precise detection of rearrangement breakpoints in mammalian genomes; Cassis: Detection of genomic rearrangement breakpoints.
Software : Cassis
In collaboration with the team of Alain Arnéodo of the Laboratoire Joliot-Curie at the ENS (Lyon), I analysed the distribution of a set of 622 mammalian rearrangement breakpoints (obtained with the method Cassis) along the human chromosomes.
We found that their distribution is highly heterogeneous and follows the organisation of the genome into isochores, with a high density of breakpoints in regions of high GC content and high gene density.
We then proposed the hypothesis that regions of high transcriptional activity, which are probably in an open chromatin state, may have an enhanced susceptibility to breakage.
In collaboration with Gabriel Marais of the LBBE laboratory (Lyon), I analysed the rearrangements between the human X and Y chromosomes. More precisely, we were interested in the breakpoints at the boundaries of the evolutionary strata (the Y chromosome is thought to have evolved from the X chromosome through several large inversions, each suppressing recombination in the stratum it defines).
Using the method Cassis, I could identify two sets of duplications clearly linked to the inversions responsible for the formation of stratum 4 and stratum 5. These were not only clear evidence of the existence of these inversions, but also made it possible to order them in time (see picture below : the progressive reduction of the PAR (pseudo-autosomal region) by two inversions on the Y chromosome associated with duplications).
Publication : Footprints of inversions at present and past pseudoautosomal boundaries in human sex chromosomes.
Publication : Close 3D proximity of evolutionary breakpoints argues for the notion of spatial synteny.
I am involved in the ANR project EvolMyco. The aim of the project is to understand the evolution and adaptation of ruminant mycoplasmas (small bacteria belonging to the Mollicutes group). It also includes the sequencing and annotation of 20 new genomes of these species.
Within this project, I focused mainly on the prediction of orthologous relationships between genes. Based on the observation that Mollicute genomes have a strong compositional bias (AT-richness), I proposed a general and simple methodology to build new substitution matrices fitted to the specific compositional bias of proteins. I designed such a new matrix, MOLLI60, for Mollicute genomes (to replace BLOSUM62). It improved the prediction of homologous and orthologous relationships thanks to a better estimation of protein alignment significance.
I developed a mixed strategy, based on pairwise alignment clustering and on the detection of species overlap patterns in protein family phylogenetic trees, to predict orthologous relationships between about 60 Mollicute gene sets. The resulting orthologous groups are available in Molligen (a database dedicated to the comparative genomics of Mollicutes, developed by the CBiB) and are currently being analysed in order to find correlations between gene sets and phenotypic traits such as host specificity or pathogenicity level.