Analysis of genome variation using high-throughput sequencing
Nishant K. T.
In the post-genomic era, high-throughput DNA sequencing is commonly used to analyse genome variation. Next-generation sequencing (NGS) platforms such as Illumina HiSeq, ABI SOLiD and Roche 454 make it possible to sequence the entire genome of an individual at reasonable cost and in reasonable time. Comparison of genome sequences can identify single nucleotide polymorphisms (SNPs) and structural variations. This information is used to estimate mutation rates and disease risk, and to enhance our basic understanding of the mutation process. Analysis of the segregation patterns of these variations is used to build recombination maps, haplotypes, etc., which are useful for gene mapping and for understanding the basic mechanisms of genetic recombination. The data output from such NGS platforms is generally on the order of terabytes (1 TB = 1000 GB) of storage space. A typical mammalian genome takes close to 1 GB for storage alone at 1X coverage. Higher coverage and larger numbers of genomes increase the storage requirement. The hardware requirement in terms of processor speed and RAM often forms the bottleneck in the analysis of such genome sequences.
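The way the storage requirement scales with coverage and number of genomes can be illustrated with a rough back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: the function name `estimate_storage_gb` is hypothetical, bases are assumed to be packed at 2 bits each (0.25 bytes), a mammalian genome is taken as roughly 3 billion base pairs, and quality scores, read headers and compression are ignored, so real sequence files will differ.

```python
def estimate_storage_gb(genome_size_bp, coverage, n_genomes=1, bytes_per_base=0.25):
    """Rough storage estimate in GB (1 GB = 1e9 bytes).

    Assumes 2 bits per base by default and ignores quality scores,
    read names and compression, so this is only an order-of-magnitude guide.
    """
    total_bases = genome_size_bp * coverage * n_genomes
    return total_bases * bytes_per_base / 1e9


# Assumed genome sizes: ~3e9 bp for a mammal, 12e6 bp for yeast.
mammal_1x = estimate_storage_gb(3e9, coverage=1)
mammal_30x_100 = estimate_storage_gb(3e9, coverage=30, n_genomes=100)
yeast_40x = estimate_storage_gb(12e6, coverage=40)

print(f"Mammalian genome, 1X: {mammal_1x:.2f} GB")
print(f"100 mammalian genomes, 30X: {mammal_30x_100 / 1000:.2f} TB")
print(f"Yeast genome, 40X: {yeast_40x:.2f} GB")
```

Under these assumptions a single mammalian genome at 1X comes to a little under 1 GB, while a hundred such genomes at 30X already reach the terabyte range.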
In our laboratory, we routinely perform whole-genome sequence analysis of the yeast genome (12 Mb genome size) at high coverage (~30-40X) to identify SNPs. The analysis includes pre-processing of raw sequence data, alignment to a reference genome and calling of variants such as SNPs using a Bayesian statistical framework. The whole workflow takes around two to three hours on a typical workstation with 64 threads and 64 GB RAM. Analysis of the segregation of SNPs in hundreds of such yeast genome sequences is required to infer recombination rates. This process is computationally intensive and takes weeks to complete. We also plan to study mutation and recombination processes in mammalian genomes.
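To give a flavour of what Bayesian variant calling involves, the following toy sketch computes genotype posteriors at a single site from the counts of reads supporting the reference and alternate alleles. It is not the pipeline used in the laboratory: the function name `genotype_posteriors`, the diploid genotype model, the uniform sequencing error rate and the prior values are all assumptions chosen for illustration, and production callers additionally use per-base and mapping qualities.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype at one site.

    Uses a binomial likelihood for the number of alternate-allele reads.
    The default priors are hypothetical and strongly favour the reference.
    """
    if priors is None:
        priors = {"RR": 0.998, "RA": 0.001, "AA": 0.001}
    # Probability that a single read shows the alternate allele, per genotype.
    alt_prob = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    depth = ref_count + alt_count
    unnormalised = {}
    for genotype, p in alt_prob.items():
        likelihood = comb(depth, alt_count) * p**alt_count * (1 - p) ** ref_count
        unnormalised[genotype] = priors[genotype] * likelihood
    total = sum(unnormalised.values())
    return {g: value / total for g, value in unnormalised.items()}


# Example: 30X coverage with 15 reference reads and 15 alternate reads.
posteriors = genotype_posteriors(ref_count=15, alt_count=15)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

A site is reported as a SNP when a non-reference genotype has the highest posterior; in this sketch the prior keeps a handful of error reads from being called as variants.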