Methods Enabling Portability of Scientific Workflows

Doctoral Dissertation

Abstract

Scientific workflows are common and powerful tools for scaling small-scale analyses up to large-scale distributed computation. They are easy for domain scientists to use because applications run as they are: the data is partitioned for concurrency instead of the application. However, many of these workflows are written in a way that couples the scientific intent with the specifics of the execution environment. This coupling limits the flexibility and portability of the workflow, requiring it to be re-engineered for each new dataset or site.

I propose that workflows can be written for pure scientific intent, with the idiosyncrasies of execution resolved at runtime using workflow abstractions. These abstractions allow workflows to be quickly transformed to handle new datasets, diverse sites, and different configurations. I examine three methods for developing workflow abstractions on static workflows, apply these methods to a dynamic workflow, and propose an approach that separates the user from the distributed environment.

In developing these methods for static workflows, I first explore Dynamic Workflow Expansion, which allows workflows to be quickly adapted to new and diverse datasets. I then describe an algorithm for statically determining a workflow’s storage needs, which is used at runtime to prevent storage deadlocks. Finally, I develop an algebra for transforming workflows, which isolates site-specific and configuration-specific designs so that they can be applied to workflows as needed. These methods are then combined and applied to a dynamic workflow, adapting a site-bound MPI application into a dynamic cloud workflow.

I combine these methods and formulate the Continuously Divisible Jobs abstraction to separate the domain scientist’s application from the distributed logic of a dynamic workflow. This abstraction defines an API that applications can implement to enable dynamic distributed computation, showcasing the flexibility and portability provided by workflow abstractions.

Attributes

Attribute Name Values
Author Nicholas Hazekamp
Contributor Jarek Nabrzyski, Committee Member
Contributor Aaron Striegel, Committee Member
Contributor Douglas L. Thain, Research Director
Contributor Scott Emrich, Committee Member
Contributor Nirav Merchant, Committee Member
Degree Level Doctoral Dissertation
Degree Discipline Computer Science and Engineering
Degree Name Doctor of Philosophy
Banner Code PHD-CSE
Defense Date 2019-11-07

Submission Date 2019-12-02
Record Visibility Public
