Research interests :

I am interested in the development of new computational methods to study genome evolution and dynamics. My current work focuses on methods dedicated to sequencing data from Next and Third Generation Sequencing technologies. I develop methods for de novo genome assembly and for genomic variant detection and analysis, with a particular focus on Structural Variation.

In the past, I worked on classical comparative genomics methods based on the identification of homologous genes (Post-doc work on Mollicutes) and whole genome comparisons (PhD work).

From a biological point of view, I am particularly interested in genome rearrangements and genome organisation, such as the analysis of rearrangement breakpoints in complete genomes (PhD work) or the detection of structural variants in population genomics data (current work).


New methods for genome comparison with sequencing data (2011 - now)

Genomics has undergone an unprecedented change with the arrival of Next Generation Sequencers (NGS), and now Third Generation Sequencers (TGS). These new technologies make it possible to sequence biological material at a much higher throughput than before, for a price now accessible to most biology labs. However, they generate huge amounts of data of a new type (mainly short sequences called reads) that require new computational methods.

Notably, some of the tools that I developed for short read data rely on the Genome Assembly Tool Box (GATB) library, which offers a very light representation of the De Bruijn graph. As a result, their time and memory requirements are very low, making it possible, for instance, to call SNPs for several human datasets on current laptop or desktop computers.
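To illustrate the underlying data structure, here is a minimal sketch of a De Bruijn graph built from reads with a plain Python dictionary. It does not use GATB and does not reflect its light representation; the function names `build_de_bruijn` and `branching_nodes` and the toy reads are assumptions made for this example. A single-base difference between two sequences shows up as a node with two successors, which is the opening of a "bubble" in the graph.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Return the nodes with more than one successor (candidate bubble openings)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)

# Toy reads (hypothetical) that differ by a single base: T versus A.
reads = ["ACGTTGCA", "ACGTAGCA"]
graph = build_de_bruijn(reads, k=4)
branches = branching_nodes(graph)
```

In this toy case the only branching node is the (k-1)-mer located just before the variable base. Real data would also require handling reverse complements and sequencing errors, which this sketch ignores.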

More recently, I have focused my research on methods for Structural Variant detection and analysis. Advances in sequencing technologies (long reads) have revealed the prevalence and importance of structural variations (deletions, duplications, inversions or rearrangements of DNA segments), which cover 5 to 10 times more bases in the genome than the commonly analyzed point mutations. I work on methods to improve their detection, genotyping and analysis with various sequencing data types (short, linked and long reads) and in various organisms (human, insects, bacterial symbionts...).


Chromosomal rearrangements in mammals

See also my PhD manuscript (in French) and the slides of my defense.

Chromosomal rearrangements are large-scale mutations that alter the structure and organisation of genomes. I have studied them in the context of the evolution of mammalian genomes. The aim of this work was to characterise the genomic regions that have undergone such events; these regions are called breakpoints.

  1. Cassis

    I developed Cassis, a new method to precisely localise rearrangement breakpoints on a genome by comparison with the genome of a related species.
    The originality of the method is that it is divided into two steps :

    1. first detecting broadly the synteny blocks;
      We proposed a formal definition of synteny blocks between two genomes allowing some flexibility governed by one parameter, together with an algorithm to find them. (Blocks (Ar,Ao) and (Br,Bo) in the picture below)
    2. then refining each breakpoint separately.
      The refinement is done by aligning each breakpoint sequence against its specific orthologous sequences in the other species (sequences Sr against SoA and SoB in the picture below). We can then look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed.
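The idea of narrowing a breakpoint by segmentation can be sketched as follows. This is not the Cassis algorithm and it leaves out the statistical assessment: it is a brute-force least-squares fit of a three-level step function, under an encoding assumed for this example, where each position of the breakpoint sequence Sr is scored +1 if it aligns to SoA, -1 if it aligns to SoB, and 0 otherwise. The function name `segment_breakpoint` and the toy signal are hypothetical.

```python
def segment_breakpoint(signal):
    """Find change points (a, b) minimising the squared error of a step model.

    The model is +1 on [0, a), 0 on [a, b) and -1 on [b, n): the left part
    resembles SoA, the right part resembles SoB, and [a, b) is the narrowed
    breakpoint. The +1 / 0 / -1 encoding is an assumption of this sketch.
    """
    n = len(signal)

    def cost(lo, hi, level):
        return sum((x - level) ** 2 for x in signal[lo:hi])

    best = None
    for a in range(n + 1):
        for b in range(a, n + 1):
            err = cost(0, a, 1) + cost(a, b, 0) + cost(b, n, -1)
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

# Toy signal (made up): mostly SoA hits, a gap with no hits, then mostly SoB hits.
signal = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, -1, -1, 0, -1, -1, -1]
a, b = segment_breakpoint(signal)
```

On this toy signal the isolated zeros at positions 3 and 12 are absorbed into the flanking segments, and the narrowed breakpoint is the run of positions with no hit on either side.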

    schema

    The whole method was applied to localise breakpoints on the human genome compared with other fully sequenced mammalian genomes, and was shown to achieve a better precision than other published methods.

    Publications : Precise detection of rearrangement breakpoints in mammalian genomes; Cassis: Detection of genomic rearrangement breakpoints.

    Software : Cassis

  2. Distribution of rearrangement breakpoints in mammalian genomes

    In collaboration with the team of Alain Arnéodo of the Laboratoire Joliot-Curie at the ENS (Lyon), I analysed the distribution of a set of 622 mammalian rearrangement breakpoints (obtained with the Cassis method) along the human chromosomes.

    schema

    We found that their distribution is highly heterogeneous and follows the organisation of the genome into isochores, with a high density of breakpoints in regions of high GC content and high gene density.
    We then proposed the hypothesis that regions of high transcriptional activity, which are probably in an open chromatin state, may have an enhanced susceptibility to breakage.

    Publication : Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relations to genome organisation and open chromatin.

  3. Rearrangement and evolution of human X-Y chromosomes

    In collaboration with Gabriel Marais of the LBBE laboratory (Lyon), I analysed the rearrangements between the human X and Y chromosomes. More precisely, we were interested in the breakpoints at the boundaries of the evolutionary strata (the Y chromosome is thought to have evolved from the X chromosome through several large inversions that suppressed recombination in the strata thus defined).

    Using the Cassis method, I identified two sets of duplications clearly linked to the inversions responsible for the formation of stratum 4 and stratum 5. These not only provided clear evidence of the existence of these inversions, but also made it possible to order them in time (see picture below : the progressive reduction of the PAR (pseudo-autosomal region) by two inversions on the Y chromosome associated with duplications).

    schema

    Publication : Footprints of inversions at present and past pseudoautosomal boundaries in human sex chromosomes.

  4. Spatial synteny

    In this study, we analysed the correlation between 3D chromatin interaction data (public data obtained with the Hi-C method) and breakpoint regions resulting from evolutionary rearrangements in the human genome. We found that two loci distant in the human genome but adjacent in the mouse genome are significantly more often observed in close proximity in the human nucleus than expected. These findings strongly suggest that part of the 3D organisation of chromosomes may be conserved across very large evolutionary distances. To characterise this phenomenon, we proposed to use the notion of spatial synteny which generalises the notion of genomic synteny to the 3D case.
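The comparison "more often in close proximity than expected" can be illustrated with a simple permutation test. This is only a sketch and does not reproduce the statistical procedure of the study: the contact matrix is made up, the function names `mean_contact` and `permutation_p_value` are hypothetical, and confounding factors that a real analysis would have to handle are ignored.

```python
import random

def mean_contact(contacts, pairs):
    """Average contact count over a list of (i, j) locus pairs."""
    return sum(contacts[i][j] for i, j in pairs) / len(pairs)

def permutation_p_value(contacts, pairs, n_perm=1000, seed=0):
    """One-sided empirical p-value: are the given pairs in closer contact
    than the same number of randomly drawn pairs of distinct loci?"""
    rng = random.Random(seed)
    n = len(contacts)
    observed = mean_contact(contacts, pairs)
    hits = 0
    for _ in range(n_perm):
        random_pairs = [tuple(rng.sample(range(n), 2)) for _ in pairs]
        if mean_contact(contacts, random_pairs) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy symmetric contact matrix for 6 loci (made-up values).
n_loci = 6
contacts = [[1] * n_loci for _ in range(n_loci)]
for i, j in [(0, 5), (1, 4)]:
    contacts[i][j] = contacts[j][i] = 9

# Hypothetical pairs of loci: distant in one genome, adjacent in the other.
breakpoint_pairs = [(0, 5), (1, 4)]
observed = mean_contact(contacts, breakpoint_pairs)
p_value = permutation_p_value(contacts, breakpoint_pairs)
```

A small p-value here only says that the chosen pairs have unusually high contact counts in the toy matrix; it carries no biological meaning.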

    Publication : Close 3D proximity of evolutionary breakpoints argues for the notion of spatial synteny.


Comparative genomics of Mollicutes

I am involved in the ANR project EvolMyco. The aim of the project is to understand the evolution and adaptation of ruminant mycoplasmas (small bacteria belonging to the Mollicutes group). It also includes the sequencing and annotation of 20 new genomes of these species.

Within this project, I focused mainly on the prediction of orthologous relationships between genes. Based on the observation that Mollicute genomes have a strong compositional bias (AT-richness), I proposed a general and simple methodology to build new substitution matrices fitted to the specific compositional bias of proteins. I designed such a matrix, MOLLI60, for Mollicute genomes (to replace BLOSUM62); it improved the prediction of homologous and orthologous relationships thanks to a better estimation of protein alignment significance.

Publication : A novel substitution matrix fitted to the compositional bias in Mollicutes improves the prediction of homologous relationships.
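As background, a substitution matrix of the BLOSUM family is a table of log-odds scores comparing how often two residues are aligned with how often they would be aligned by chance. The sketch below shows this standard construction on toy data; it is not the procedure used to build MOLLI60. The function name `log_odds_matrix`, the `scale` parameter and the aligned pairs are assumptions of this example. Because the background frequencies are estimated from the aligned data themselves, a compositional bias in those data propagates into the scores.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs, scale=2.0):
    """BLOSUM-style scores: scale * log2(observed / expected), rounded."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Background frequency of each residue, estimated from the same pairs.
    freq = Counter()
    for (a, b), count in pair_counts.items():
        freq[a] += count / (2 * total)
        freq[b] += count / (2 * total)

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total
        # An unordered pair of distinct residues can arise in two orders.
        expected = freq[a] * freq[b] * (1 if a == b else 2)
        scores[(a, b)] = round(scale * math.log2(observed / expected))
    return scores

# Toy aligned residue pairs (made up), dominated by lysine (K).
pairs = [("K", "K")] * 6 + [("K", "R")] * 2 + [("R", "R")] * 2
scores = log_odds_matrix(pairs)
```

In this toy set, the frequent residue K gets a lower identity score than the rarer residue R, which is the kind of effect a composition-fitted matrix captures.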

I developed a mixed strategy, based on pairwise alignment clustering and on the detection of species overlap patterns in protein family phylogenetic trees, to predict orthologous relationships between about 60 Mollicute gene sets. The resulting orthologous groups are available in Molligen (a database dedicated to the comparative genomics of Mollicutes, developed by the CBiB) and are currently being analysed in order to find correlations between gene sets and phenotypic traits such as host specificity or pathogenicity level.
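The species overlap idea can be sketched on a rooted binary gene tree: an internal node whose two child subtrees share at least one species is treated as a duplication, otherwise as a speciation, and two genes are called orthologous when their last common ancestor is a speciation node. The code below is a minimal illustration of that general rule only, not the mixed strategy described above; the tree encoding (nested tuples, leaves named `species|gene`) and all names are assumptions of this example.

```python
def leaves(tree):
    """List the leaf labels of a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def species(leaf):
    """Extract the species part of a 'species|gene' leaf label."""
    return leaf.split("|")[0]

def orthologous_pairs(tree):
    """Gene pairs whose last common ancestor is a speciation node,
    i.e. a node whose two child subtrees share no species."""
    if isinstance(tree, str):
        return []
    left, right = tree
    pairs = orthologous_pairs(left) + orthologous_pairs(right)
    left_leaves, right_leaves = leaves(left), leaves(right)
    left_species = {species(x) for x in left_leaves}
    right_species = {species(x) for x in right_leaves}
    if not left_species & right_species:
        pairs += [(a, b) for a in left_leaves for b in right_leaves]
    return pairs

# Toy gene tree (hypothetical): a duplication at the root, followed by
# a speciation in each of the two copies.
tree = (("spA|g1", "spB|g1"), ("spA|g2", "spB|g2"))
pairs = orthologous_pairs(tree)
```

Here the root is inferred as a duplication because both subtrees contain spA and spB, so only the within-copy pairs are reported as orthologs.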