Improving the Reproducibility of Scientific Applications with Execution Environment Specifications

Meng, Haiyan

doi:10.7274/z029p269z90

MengH042017D.pdf (2.04 MB)

thesis

posted on 2017-04-08, 00:00 authored by Haiyan Meng

Reproducibility, a main principle of the scientific method, has historically depended on text and proofs in a publication. However, as computation pervades science and changes how research is conducted, relying only on the experimental results in a publication cannot guarantee reproducibility. The execution environment in which the results were generated is another important ingredient and must also be preserved to reproduce the results. Unfortunately, execution environments for scientific work are often fragile and too complex for researchers to understand well, let alone preserve.

This dissertation proposes two broad approaches for improving the reproducibility of scientific applications and explores their feasibility and applicability for both single-machine scientific applications and complex scientific workflows. The first approach wraps the minimal execution environment of an application into an all-in-one package. The second approach specifies the execution environment in an organized way, from hardware, kernel and OS all the way up to software, data and environment variables; preserves dependencies in units of basic OS image, software and data; and combines all the dependencies at runtime using mounting mechanisms.
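As a rough illustration of the second approach, the sketch below writes an execution environment specification as a plain Python dictionary and derives the order in which its dependencies would be mounted at runtime. The field names, file paths, package names and the `mount_plan` helper are all hypothetical choices made for this example; they are not the specification format or the prototype described in the dissertation.

```python
import json

# Hypothetical specification. Every field name and value here is an
# illustrative assumption, not the dissertation's actual format.
SPEC = {
    "hardware": {"arch": "x86_64", "cores": 1, "memory": "2GB"},
    "kernel": {"name": "linux", "version": ">=2.6.32"},
    "os": {"name": "ExampleOS", "version": "6.5",
           "image": "os/exampleos-6.5.tar.gz"},
    "software": {
        "example-sim-1.0": {"source": "software/example-sim-1.0.tar.gz",
                            "mountpoint": "/opt/example-sim"},
    },
    "data": {
        "input.dat": {"source": "data/input.dat",
                      "mountpoint": "/tmp/input.dat"},
    },
    "environ": {"PATH": "/opt/example-sim/bin:/usr/bin:/bin"},
    "cmd": "example-sim /tmp/input.dat",
}


def mount_plan(spec):
    """Return (source, mountpoint) pairs: the OS image at "/" first,
    followed by each software and data dependency in name order."""
    plan = [(spec["os"]["image"], "/")]
    for section in ("software", "data"):
        for name in sorted(spec.get(section, {})):
            dep = spec[section][name]
            plan.append((dep["source"], dep["mountpoint"]))
    return plan


if __name__ == "__main__":
    # The specification itself is small, human-readable text.
    print(json.dumps(SPEC, indent=2))
    for source, mountpoint in mount_plan(SPEC):
        print(f"{source} -> {mountpoint}")
```

The point of the sketch is that the specification only names the dependencies and where they belong; the OS image, software and data are stored separately and brought together when the application runs.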

For each approach, a prototype was implemented and three aspects were explored: what to preserve, how to preserve it and how to reproduce it. The time and space overheads of preserving and reproducing applications, and the correctness of the preserved artifacts, were evaluated using applications from high energy physics, bioinformatics, epidemiology and scene rendering. The evaluation results show that both approaches allow researchers to reproduce an application and verify its results. However, the second approach avoids storing shared dependencies repeatedly and makes it easier to extend the original work.
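The remark about shared dependencies can be made concrete with a small, simplified calculation. The sketch below compares the storage needed when every application carries its own copy of each dependency with the storage needed when each dependency unit is kept once and referenced by name. The application names, dependency names and sizes are invented for illustration and are not measurements from the dissertation; the comparison also ignores that an all-in-one package may contain only the minimal set of files an application actually uses.

```python
def storage_cost(apps, sizes):
    """Compare two ways of preserving dependencies.

    apps  maps an application name to the list of dependency names it needs.
    sizes maps a dependency name to its size (arbitrary units).
    Returns (per_application_copies, stored_once).
    """
    # One self-contained copy of every dependency per application.
    per_application_copies = sum(
        sizes[dep] for deps in apps.values() for dep in deps
    )
    # Each distinct dependency stored a single time and shared.
    distinct = set().union(*apps.values())
    stored_once = sum(sizes[dep] for dep in distinct)
    return per_application_copies, stored_once


if __name__ == "__main__":
    # Hypothetical applications and sizes, chosen only for the example.
    apps = {
        "app_a": ["os-image", "lib-x", "data-a"],
        "app_b": ["os-image", "lib-x", "data-b"],
    }
    sizes = {"os-image": 500, "lib-x": 50, "data-a": 10, "data-b": 20}
    copies, once = storage_cost(apps, sizes)
    print(f"per-application copies: {copies}, stored once: {once}")
```

The more dependencies two applications have in common, the larger the gap between the two totals becomes.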

This work contributes by demonstrating the importance of execution environments for the reproducibility of scientific applications and by differentiating execution environment specifications, which should be lightweight, persistent and deployable, from the various tools used to create execution environments, which may change frequently as technology evolves. It proposes two preservation approaches and prototypes for the purposes of both result verification and research extension, and provides recommendations on how to build reproducible scientific applications from the start.

History

Date Created

2017-04-08

Date Modified

2018-11-01

Defense Date

2017-03-22

Research Director(s)

Douglas Thain

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Language

English

Program Name

Computer Science and Engineering

Keywords

execution environment specifications, virtualization techniques, scientific applications, software preservation, reproducible research, scientific workflows
