How to solve diverse, large-scale sequence comparison problems
Sequence comparison has been central to computational biology for many decades, and remains so today. Modern tasks include: comparing whole genomes; aligning bisulfite-converted DNA reads to a genome; aligning long, high-error sequences from single-molecule sequencers; aligning ancient or degraded DNA; and comparing metagenomic DNA to a protein database. Our main argument is that classic alignment techniques, with substitution score matrices and statistical models, can and should be used for modern data. We describe how to incorporate sequence quality data, and how to make the seed-and-extend method scale to large datasets by using "adaptive seeds". Finally, we build on the classic statistical model to accurately align paired sequences, and sequences that come from disjoint genomic loci due to rearrangements.