posted on 2010-06-18, 00:00authored byChristopher M. Moretti
Scientific computing users often find it difficult to transform serial domain applications into workloads for large non-dedicated heterogeneous campus grids. Due to hardware and software bottlenecks, a workload that succeeds on 8 nodes can fail disastrously on 128; or even fail on 8 nodes for a different instance of the same problem. An abstraction is a flexible solution to a pattern of computation that can be used to harness distributed computing resources more easily for non-experts. The users provide the pieces, such as their datasets and serial function, and the workload is constructed and executed for them in an appropriate manner for the environment in order to prevent disastrous configurations and satisfy cost, policy, and performance constraints. This work presents the design, implementation, and evaluation of a 'toolbox' of abstractions: All-Pairs, Sparse-Pairs, and Data-Split-Join. These abstractions are used for several problems in bioinformatics, biometrics, and data mining. The discussion of the abstractions includes modeling of the problem, managing input data, organizing computation on the campus grid, and managing output data. Results include the largest known biometrics All-Pairs result of its kind, in which over two years' worth of computation was executed in 10 days, and a complete alignment of the Human genome using Sparse-Pairs, which completed in 2.5 hours on over 1000 hosts with 952x speedup.