Transcriptome sequencing (sequencing of only the protein-coding genes of a genome) has multiplied our ability to understand the biology of life on Earth. While full genome sequencing is still prohibitively expensive for many species, sequencing only the genes provides direct access to the most functional elements of a genome for a fraction of the cost. This advance brings broad genetic resources to those studying species of even the most specialized interest.
We use these techniques to answer an important ecological question: How will species react to climate change? It has been assumed that as the climate warms, populations will simply shift poleward to compensate. Unfortunately, previous research shows that some populations of two butterfly species are adapted to local conditions and may not respond this way. To discover the genetic basis of these results, we first sequenced, assembled, annotated, and analyzed the butterfly transcriptomes. Because sequences were sampled from wild-caught populations, we developed novel methods to ensure high-quality results in this setting. We then designed custom microarrays to measure how strongly each gene is expressed in a given experimental setting. These, coupled with a robust experimental design, revealed a variety of genes and functional categories carrying the signature of local adaptation to climate.
The bioinformatic difficulties associated with such projects are many. In particular, transcriptome assembly presents unique challenges, and it is not yet clear how to quantitatively evaluate assemblies. By simulating sequencing and comparing assembler results to those of perfect assemblies, we evaluate a number of commonly used and novel quality metrics. This study reveals that some quality metrics reflect biological accuracy while others (such as contig N50 length) do not, and it provides vital information for researchers making use of transcriptome data.
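For readers unfamiliar with the metric, contig N50 can be sketched as follows. This is a minimal illustration of the standard definition (the contig length at which the longest contigs together cover at least half of the total assembled length), not the evaluation code used in this work. The function name `n50` and the contig lengths are hypothetical, chosen only to show how a single incorrect join can raise N50 even though the assembly has become less accurate.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled length.
    """
    if not contig_lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths: four correctly assembled transcripts.
correct = [500, 400, 300, 200]

# Same data, but the two longest transcripts were wrongly joined
# into one chimeric contig (an assembly error).
chimeric = [900, 300, 200]

n50_correct = n50(correct)
n50_chimeric = n50(chimeric)
```

In this toy case the erroneous assembly has the higher N50, which is one way to see why contig length statistics alone need not track biological accuracy.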
Finally, when sequences are sourced from many genetically diverse individuals, our tools would ideally reveal this diversity rather than produce a simple genetic consensus. To this end, we develop algorithms to separately assemble diverse sequences (haplotypes) accurately in the face of both sequencing error and data ambiguity. These methods will help reveal biodiversity in applications ranging from community ecology to epidemiology.