Log Discovery, Log Custody, and the Web Inspired Approach for Open Distributed Systems Troubleshooting
Troubleshooting distributed systems is difficult because of the inherent complexity of the runtime environments on the computing resources the system uses. The difficulty is compounded in open distributed systems, where resources join and leave the system over time, the placement of computations onto resources is not known until runtime, and resources may be shared across multiple administrative jurisdictions (e.g., multiple independent clusters, clouds, or grids). It is infeasible to expect a user to understand all of these complexities. To make troubleshooting open distributed systems approachable, the debug output of each individual component of the system (i.e., each process and service) must be discoverable and queryable. Contemporary approaches cannot provide these capabilities. This work provides them through a novel architecture called TLQ (Troubleshooting via Log Query), which handles log discovery and custody on the user's behalf and allows users to query their system's debug output directly. In addition, inspired by the architecture of the World Wide Web, TLQ links components together where possible. Through the lens of TLQ, this work presents a data model for debug output, a comparison of multiple querying approaches, a distributed querying architecture for troubleshooting open distributed systems, and considerations for more effective debug log design.
History
Date Modified: 2021-05-20
Defense Date: 2021-04-01
CIP Code: 40.0501
Research Director(s): Douglas L. Thain
Committee Members: Peter Kogge, Paul Brenner, Ronald Metoyer
Degree: Doctor of Philosophy
Degree Level: Doctoral Dissertation
Alternate Identifier: 1251516180
Library Record: 6022954
OCLC Number: 1251516180
Additional Groups: Computer Science and Engineering
Program Name: Computer Science and Engineering