University of Notre Dame

Methods Enabling Portability of Scientific Workflows

thesis
posted on 2019-12-02, 00:00 authored by Nicholas Hazekamp

Scientific workflows are common and powerful tools used to elevate small-scale analysis to large-scale distributed computation. They provide ease of use for domain scientists by supporting the use of applications as they are, partitioning the data for concurrency rather than the application. However, many of these workflows are written in a way that couples the scientific intention with the specifics of the execution environment. This coupling limits the flexibility and portability of the workflow, requiring it to be re-engineered for each new dataset or site.

I propose that workflows can be written for pure scientific intent, with the idiosyncrasies of execution resolved at runtime using workflow abstractions. These abstractions would allow workflows to be quickly transformed to handle new datasets, diverse sites, and different configurations. I examine three methods for developing workflow abstractions on static workflows, apply these methods to a dynamic workflow, and propose an approach that separates the user from the distributed environment.

In developing these methods for static workflows, I first explore Dynamic Workflow Expansion, which allows workflows to be quickly adapted for new and diverse datasets. Then I describe an algorithm for statically determining a workflow's storage needs, which is used at runtime to prevent storage deadlocks. Finally, I develop an algebra for transforming workflows, which isolates site- and configuration-specific designs so they can be applied to workflows as needed. These methods are combined and applied to a dynamic workflow, adapting a site-bound MPI application to a dynamic cloud workflow.

I combine these methods and formulate the Continuously Divisible Jobs abstraction to separate the domain scientist's application from the distributed logic of a dynamic workflow. This abstraction defines an API that applications can implement to allow for dynamic distributed computation, showcasing the flexibility and portability provided through workflow abstractions.

History

Date Modified

2020-01-23

Defense Date

2019-11-07

CIP Code

  • 40.0501

Research Director(s)

Douglas L. Thain

Committee Members

Jarek Nabrzyski, Aaron Striegel, Scott Emrich, Nirav Merchant

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1137159464

Library Record

5417326

OCLC Number

1137159464

Program Name

  • Computer Science and Engineering
