High-throughput computing (HTC) uses large amounts of computing resources over long periods to complete many independent, parallel computational tasks. HTC workloads are often described as workflows and run on distributed systems through workflow systems. However, because most workflow systems are not responsible for managing the task execution environment, HTC workflows are usually confined to dedicated HTC facilities that provide the required settings.
Recently, container runtimes have been widely deployed across public clouds because they deliver execution environments with lower overhead than virtual machines. This trend gives users of HTC workflows an opportunity to use the practically unlimited computing power of the cloud. However, migrating complex workflow systems to a container environment is cumbersome.
To containerize HTC workflows and scale them up on the cloud, I synthesize my experience with container technologies and develop a methodology comprising seven design factors: i) Isolation Granularity – the granularity of isolation should be determined by the characteristics of the target workloads; ii) Container Management – container runtimes must be adapted to the distributed environment, and containers are best managed by the underlying distributed system; iii) Image Management – a cooperative mechanism can speed up image distribution and make it more efficient in a distributed environment; iv) Garbage Collection – timely garbage collection is necessary given the massive amount of intermediate data generated by an HTC workflow; v) Network Connection – excessive network connections should be avoided given the large number of small transfers; vi) Resource Management – customized resource management mechanisms that fully account for the characteristics of the target workflow are required; vii) Cross-layer Cooperation – implementing advanced features requires cooperation between the upper-layer workflow system and the lower-layer cluster manager.
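As a rough illustration of factor iv) only, the Python sketch below shows one common way to make garbage collection timely: count the remaining readers of each intermediate file and mark the file as deletable as soon as its last consumer finishes. This is an assumed, simplified scheme, not the mechanism developed in this work; the function name `plan_garbage_collection`, the `(name, inputs, outputs)` task format, and the example file names are all hypothetical.

```python
from collections import defaultdict


def plan_garbage_collection(tasks, final_outputs):
    """Return a list of (task_name, deletable_files) in execution order.

    tasks: list of (name, inputs, outputs) tuples, already in a valid
           execution order (hypothetical format for this sketch).
    final_outputs: set of files that must be kept for the user.
    """
    # Count how many tasks still need to read each file.
    remaining_readers = defaultdict(int)
    for _name, inputs, _outputs in tasks:
        for f in inputs:
            remaining_readers[f] += 1

    produced = set()  # files created by the workflow itself
    plan = []
    for name, inputs, outputs in tasks:
        produced.update(outputs)
        deletable = []
        for f in inputs:
            remaining_readers[f] -= 1
            # Delete only workflow-produced intermediates whose last
            # reader has just finished; never touch original inputs
            # or files the user asked to keep.
            if (remaining_readers[f] == 0
                    and f in produced
                    and f not in final_outputs):
                deletable.append(f)
        plan.append((name, deletable))
    return plan


if __name__ == "__main__":
    example = [
        ("split", ["input.dat"], ["part1", "part2"]),
        ("map1", ["part1"], ["m1"]),
        ("map2", ["part2"], ["m2"]),
        ("reduce", ["m1", "m2"], ["result"]),
    ]
    for task, files in plan_garbage_collection(example, {"result"}):
        print(task, files)
```

The sketch leaves outputs that no task ever reads in place; a real system would also need a policy for those, and for retrying failed tasks whose inputs have already been removed.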
Beyond HTC workflows, I validate these factors through my work on standardizing the resource provisioning process for extreme-scale online workloads, and observe that they apply equally to HTC workflows and to extreme-scale online workloads.