Second Annual French Complex Systems Summer School

Lyon and Paris, July 15-August 10, 2008

"Signal analysis applied to biological sequences"

Alain Arnéodo (alain DOT arneodo AT ens-lyon DOT fr)
Benjamin Audit

Laboratoire Transdisciplinaire Joliot Curie
Laboratoire de Physique
Ecole Normale Supérieure de Lyon
46, allée d’Italie, 69364 Lyon cedex 07

The slides of this course are available here (see syllabus below):
Wavelet Slides

Course Description

Recent technical progress in live cell imaging has confirmed that the structure and dynamics of chromatin play an essential role in regulating many biological processes, such as gene activity, DNA replication, recombination, and DNA damage repair. The main objective of these lectures is to show that much can be learned about these processes by using multiscale signal processing tools to analyze DNA sequences.

The first lecture is devoted to background knowledge about the wavelet transform, the multifractal formalism and its wavelet-based generalization. These theoretical notions will be illustrated by concrete applications to fully-developed turbulence data.
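For orientation, the central objects of this first lecture can be written down compactly. The block below is a sketch in the notation commonly used in the WTMM literature, not a transcription of the slides; the symbols ψ (analyzing wavelet), L(a) (set of maxima lines existing at scale a), τ(q) and D(h) are assumptions about notation. In the thermodynamic analogy, q is usually read as an inverse temperature, τ(q) as a free energy, h as an energy and D(h) as an entropy.

```latex
\begin{align}
  % Continuous wavelet transform of the signal f at position b and scale a > 0
  T_\psi[f](b,a) &= \frac{1}{a}\int_{-\infty}^{+\infty}
      \psi\!\left(\frac{x-b}{a}\right) f(x)\,dx \\
  % WTMM partition function, summed over the maxima lines l present at scale a
  Z(q,a) &= \sum_{l\in\mathcal{L}(a)}
      \Bigl(\sup_{(x,a')\in l,\; a'\le a}\bigl|T_\psi[f](x,a')\bigr|\Bigr)^{q}
      \;\sim\; a^{\tau(q)} \quad (a\to 0^{+}) \\
  % Singularity spectrum obtained by Legendre transform of tau(q)
  D(h) &= \min_{q}\bigl(qh-\tau(q)\bigr)
\end{align}
```

A signal is monofractal when τ(q) is linear in q, so that D(h) reduces to a single point; a nonlinear τ(q) is the signature of multifractality.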

The second lecture addresses the existence and the interpretation of long-range correlations (LRC) in DNA sequences. We use the space-scale decomposition provided by the continuous wavelet transform (WT) to characterize the scale invariance properties of genomic sequences. We show the existence of LRC over distances up to 20-30 kbp. To understand to what extent the observed LRC could influence the compaction and accessibility of genomic information in the cell, we perform a multifractal analysis of DNA structural profiles, e.g., DNA bending profiles based on nucleosome positioning data. We then investigate the thermodynamical properties of 2D elastic chains subjected locally to mechanical/topological constraints such as loops. We show that a possible key to understanding is that the LRC structural disorder induced by the sequence may favor the autonomous formation of small (a few hundred bp) DNA loops and, in turn, the propensity of eukaryotic DNA to interact with histones to form nucleosomes. We further compare the model predictions to genome-scale nucleosome positioning data recently obtained by Yuan et al. for S. cerevisiae chromosome III (Science 309, 2005). The statistical analysis of the experimental nucleosome occupancy profile displays striking similarities to the energy landscape of nucleosome formation computed from the sequence, including the nucleosome-free regions observed at gene promoters. The third part of this talk is devoted to recent experimental AFM imaging of nucleosome positioning by excluding genomic energy barriers. These results constitute the first experimental evidence of the influence of LRC on the nucleosomal organisation of eukaryotic chromatin.
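As a rough illustration of the wavelet-based scaling analysis described in this lecture, the sketch below builds a "DNA walk" from a sequence and estimates a scaling exponent from the growth of Mexican-hat wavelet coefficients with scale. It is not the WTMM method taught in the course: it uses the simpler rms of the wavelet coefficients, and the binary coding (+1 for one chosen base, -1 otherwise), the function names and the scale range are all assumptions made for this example. For an uncorrelated sequence the exponent should come out near 0.5; larger values would point to long-range correlations.

```python
import numpy as np

def dna_walk(seq, base="G"):
    """Cumulative sum of a +1/-1 indicator coding (hypothetical choice of coding)."""
    steps = np.array([1.0 if c == base else -1.0 for c in seq.upper()])
    steps -= steps.mean()          # remove the mean drift of the walk
    return np.cumsum(steps)

def mexican_hat(t):
    """Second derivative of a Gaussian, up to sign and normalization."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(signal, scales):
    """Continuous wavelet transform with the 1/a normalization, one row per scale."""
    rows = []
    for a in scales:
        half = int(5 * a)                          # truncate the wavelet at +/- 5a
        t = np.arange(-half, half + 1) / a
        kernel = mexican_hat(t) / a
        # The wavelet is symmetric, so convolution equals correlation here.
        rows.append(np.convolve(signal, kernel, mode="same"))
    return np.array(rows)

def scaling_exponent(signal, scales):
    """Slope of log(rms wavelet coefficient) versus log(scale)."""
    coeffs = cwt(signal, scales)
    margin = int(5 * max(scales))                  # drop samples affected by the borders
    core = coeffs[:, margin:-margin]
    rms = np.sqrt(np.mean(core**2, axis=1))
    slope, _ = np.polyfit(np.log(scales), np.log(rms), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_seq = "".join(rng.choice(list("ACGT"), size=40000))
    print(scaling_exponent(dna_walk(random_seq), [8, 16, 32, 64]))
```

Because the Mexican hat has vanishing zeroth and first moments, constant offsets and linear trends in the walk do not contribute to the coefficients, which is one reason wavelets are preferred over raw fluctuation measures on mosaic-like sequences.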

In the third lecture, we explore the large-scale compositional heterogeneity of human autosomal chromosomes through the optics of the WT microscope. We show that the GC content displays relaxational nonlinear oscillations with main frequencies corresponding to 100 kb, 400 kb and 900 kb, which are well-recognized characteristic sizes of chromatin loops and loop domains involved in the hierarchical folding of the chromatin fiber. These frequencies are also remarkably similar to the size of mammalian replicons. When further investigating deviations from intrastrand equimolarities between A and T and between G and C, we corroborate the existence of these fundamental frequencies as the footprints of the replication and/or transcription mutation bias, and we show that the observed nonlinear oscillations reveal a remarkable cooperative organization of gene location and orientation. Examining the intergenic and transcribed regions flanking experimentally identified human replication origins and the corresponding mouse and dog homologous regions, we reveal that for 7 of 9 of these known origins, the (TA+GC) skew displays rather sharp upward jumps, with a linearly decreasing profile between two successive jumps. We present a model of replication with well-positioned replication origins and random terminations that accounts for the observed characteristic serrated skew profiles. We further use the singularity tracking ability of the WT to detect the origins of replication. We report the discovery of 1024 putative origins of replication in the human genome. We then develop a wavelet-based multiscale pattern recognition methodology to disentangle the replication- from the transcription-associated compositional strand asymmetries.
Comparing replication skew profiles to recent high-resolution replication timing data reveals that most of the putative replication origins that border the so-identified replication domains are replicated earlier than their surroundings, whereas the central regions replicate late in the S phase. We discuss the implications of this first experimental confirmation of these replication origin predictions, which are likely to be early replicating and active in most tissues. The statistical analysis of the distribution of sense and anti-sense genes in these replication domains strongly suggests that these master origins of replication play a fundamental role in the organization of mammalian genomes. Taken together, these analyses show that replication and gene expression are likely to be regulated by the structure and dynamics of the chromatin fiber.
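The (TA+GC) skew and the upward jumps mentioned in this lecture can be illustrated with a small sketch. It assumes the common definition S = (T-A)/(T+A) + (G-C)/(G+C) computed in non-overlapping windows; the window size, the function names and the crude Haar-like jump detector (mean skew to the right minus mean skew to the left) are choices made for this example and stand in for the wavelet-based singularity tracking used in the course.

```python
import numpy as np

def skew_profile(seq, window=1000):
    """Total skew S = S_TA + S_GC in consecutive non-overlapping windows."""
    seq = seq.upper()
    n_windows = len(seq) // window
    skews = np.zeros(n_windows)
    for i in range(n_windows):
        chunk = seq[i * window:(i + 1) * window]
        a, c, g, t = (chunk.count(b) for b in "ACGT")
        s_ta = (t - a) / (t + a) if (t + a) else 0.0   # guard against empty counts
        s_gc = (g - c) / (g + c) if (g + c) else 0.0
        skews[i] = s_ta + s_gc
    return skews

def jump_strength(skews, half_width=5):
    """Haar-like contrast: mean of the next half_width windows minus the previous ones."""
    out = np.zeros(len(skews))
    for i in range(half_width, len(skews) - half_width + 1):
        right = skews[i:i + half_width].mean()
        left = skews[i - half_width:i].mean()
        out[i] = right - left
    return out

if __name__ == "__main__":
    # Synthetic sequence: A/C-rich half (negative skew) followed by T/G-rich half
    # (positive skew), mimicking a single upward jump in the middle.
    rng = np.random.default_rng(0)
    left = rng.choice(list("ACGT"), size=50000, p=[0.3, 0.3, 0.2, 0.2])
    right = rng.choice(list("ACGT"), size=50000, p=[0.2, 0.2, 0.3, 0.3])
    seq = "".join(left) + "".join(right)
    s = skew_profile(seq, window=1000)
    print(int(np.argmax(jump_strength(s))))   # window index of the strongest upward jump
```

On real genomic data, candidate jumps would need to be followed across several scales and compared with gene annotation, since transcription also contributes step-like changes to the skew.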

Course Syllabus

Lecture I

1. The multifractal formalism: a thermodynamics of fractals
  • 1.1 Historical introduction
  • 1.2 The analogy between the multifractal formalism and thermodynamics

2. The multifractal formalism revisited with wavelets
  • 2.1 The 1D wavelet transform modulus maxima (WTMM) method
  • 2.2 The 1D WTMM method versus the box-counting and structure function techniques
  • 2.3 Test applications (generalized devil's staircases, fractional Brownian motions, multifractal cascades)

3. Applications of the 1D WTMM method
  • 3.1 Experimental application to 1D velocity signals in fully developed turbulence
  • 3.2 Space-scale correlations from wavelet analysis: how to reveal the existence of an underlying multiplicative process

Lecture II

1. Introduction
  • 1.1 Background materials in genomics
  • 1.2 About the controversy concerning the existence of long-range correlations in DNA sequences

2. Wavelet based fractal analysis of DNA sequences
  • 2.1 About the necessity of using wavelet analysis to master the mosaic structure of DNA sequences
  • 2.2 Application of the wavelet transform modulus maxima (WTMM) method to the study of DNA sequences
    • Demonstration of the monofractality of DNA walks
    • About the Gaussian character of the fluctuations in DNA walk landscapes
  • 2.3 Uncovering long-range correlations in coding as well as in non-coding DNA sequences
  • 2.4 Nucleotide composition effects on the long-range correlation properties of human genes
  • 2.5 Long-range correlations in DNA sequences do not originate from genome plasticity

3. Towards a structural interpretation of the long-range correlations in DNA sequences
  • 3.1 Evidencing long-range correlations between DNA bending sites in eukaryotic organisms
  • 3.2 Long-range correlations in eukaryotic DNA: a signature of the nucleosomal structure
  • 3.3 Long-range correlations in genomic DNA: a necessity for condensation-decondensation processes?
  • 3.4 The thermodynamics of 2D DNA loops in the presence of long-range correlated structural disorder

4. Experimental studies of the effect of long-range correlations on the first step of compaction of DNA inside eukaryotic nuclei
  • 4.1 Probing persistence in DNA curvature properties with Atomic Force Microscopy
  • 4.2 Tiled microarray experimental data confirm the influence of genomic long-range correlations on nucleosome positioning
  • 4.3 AFM imaging of nucleosome positioning by excluding genomic energy barriers

Lecture III

1. Wavelet analysis of the large-scale GC fluctuations in the human genome
  • 1.1 Evidencing low-frequency rhythms in human DNA sequences: a possible signature of the chromatin loops structure
  • 1.2 Revisiting the concept of isochores

2. Wavelet analysis of the large-scale TA and GC strand asymmetry fluctuations in the human genome
  • 2.1 A wavelet-based methodology to perform gene clustering
  • 2.2 Transcription-coupled and splicing-coupled strand asymmetries in eukaryotic genomes: transcription-induced step-like skew profiles in mammalian genomes

3. Replication-associated strand asymmetries in mammalian genomes: toward detection of replication origins
  • 3.1 Replication induced factory-roof skew profiles in the human genome
  • 3.2 Conservation of replication-coupled strand asymmetries in mammalian genomes
  • 3.3 Detecting the origins of replication with the wavelet transform
  • 3.4 From DNA sequence analysis to modeling replication in the human genome

4. A wavelet-based methodology to disentangle transcription- and replication-associated strand asymmetries reveals a remarkable gene organization in the human genome
  • 4.1 Disentangling the replication and transcription contributions to the compositional strand asymmetries
  • 4.2 DNA replication timing data corroborate in silico human replication origin predictions
  • 4.3 Gene organization in the detected replication domains

5. Perspectives
  • 5.1 The sequence is highly predictive of open chromatin around putative human master replication origins
  • 5.2 Comparative studies of replication domains, expression domains and structural domains from the wavelet analysis of DNA sequences

Contributors to this page: Rene Doursat.
Page last modified on Monday 21 July, 2008 22:56:38 by Rene Doursat.