A Compiler Toolchain for Distributed Data Intensive Scientific Workflows

Bui, Peter James

doi:10.7274/pk02c823v2f

BuiP062012D.pdf (1.54 MB)

thesis

posted on 2012-06-24, authored by Peter James Bui

With the growing amount of computational resources available to researchers today and the explosion of scientific data in modern research, it is imperative that scientists be able to construct data processing applications that harness these vast computing systems. To address this need, I propose applying concepts from traditional compilers, linkers, and profilers to the construction of distributed workflows and evaluate this approach by implementing a compiler toolchain that allows users to compose scientific workflows in a high-level programming language.

In this dissertation, I describe the execution and programming model of this compiler toolchain. Next, I examine four compiler optimizations and evaluate their effectiveness at improving the performance of various distributed workflows. Afterwards, I present a set of linking utilities for packaging workflows and a group of profiling tools for analyzing and debugging workflows. Finally, I discuss modifications made to the run-time system to support features such as enhanced provenance information and garbage collection. Altogether, these components form a compiler toolchain that demonstrates the effectiveness of applying traditional compiler techniques to the challenges of constructing distributed data intensive scientific workflows.

History

Date Modified

2017-06-05

Defense Date

2012-06-07

Research Director(s)

Douglas Thain

Committee Members

Patrick Flynn, Scott Emrich, Jesus Izaguirre

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Language

English

Alternate Identifier

etd-06242012-095705

Publisher

University of Notre Dame

Program Name

Computer Science and Engineering

Keywords

compiler, distributed systems, workflows, python
