On the Genetic Information Processing of Metabarcodes: Adding Data Utility with Provenance and Metadata

Doctoral Dissertation


This body of work explores the information processing of metabarcodes. It seeks to answer the following questions: What does this string of metabarcode DNA mean? Where does it come from? And how can we utilize it in new and interesting ways? I address these questions by implementing a genomic informatics processing framework that uses provenance and metadata to increase the data utility of metabarcode data. Three terms are critical to understanding this body of work: provenance, the history of the data; metadata, information about the data; and data utility, the general reusability of the data. Under these definitions, provenance is a subset of metadata. I show an increase in data utility in three ways. First, I look at the different features of the metabarcode itself and explore how manipulating those features can begin to explain variance in the analysis of metabarcode data. In chapter two, agent-based simulations are used to analyze features, such as the relative abundance of barcodes, to show their effect on the resulting metabarcode datasets. Unsurprisingly, varying the abundance of metabarcode sequences results in variance in the similarity between samples. Other features, such as the addition of single nucleotide polymorphisms, can also result in variance in similarity. Second, I show how the metadata and provenance of previously published metadata can be used to further describe the environment of a species of interest. In chapter three, I use natural language processing techniques to draw conclusions about the environment of a particular species, the human pathogen Cryptococcus neoformans. Third, by using the metadata of various metabarcode datasets, I show that we can not only explore the intermixing of previously published metabarcode data but also derive new estimates of arthropod diversity and rarefaction. In chapter four, I implement a novel data framework called met, which uses the metadata from different metabarcode datasets to make comparisons across different projects. I conclude that by utilizing a genomic informatics and informatics processing framework we can increase the data utility of the metabarcode; this is useful because it allows us to get more “bang for our buck,” to use an adage.


Attribute Name Values
Author David C Molik
Contributor Stuart Jones, Committee Member
Contributor Michael Pfrender, Research Director
Contributor Scott Emrich, Committee Member
Contributor Natalie Meyers, Committee Member
Contributor Elizabeth Archie, Committee Member
Degree Level Doctoral Dissertation
Degree Discipline Integrated Biomedical Sciences
Degree Name Doctor of Philosophy
Defense Date 2021-04-30
Submission Date 2021-05-09
Subject Computational Biology; Bioinformatics; Metabarcoding Methodology
Language English
Record Visibility Public

