CrystalBall

Predicting and Preventing Inconsistencies in Deployed Distributed Systems

Distributed systems form the foundation of our society’s infrastructure. Unfortunately, it is notoriously difficult to develop reliable high-performance distributed systems that run over asynchronous networks. Even if a distributed system is based on a well-understood distributed algorithm, its implementation can contain coding bugs and errors arising from complexities of realistic distributed environments. Many of these errors can only manifest after the system has been running for a long time, has developed a complex topology, and has experienced a particular sequence of low-probability events such as node resets. Consequently, it is difficult to detect such errors using testing and model checking, and many of such errors remain unfixed after the system is deployed.

Execution path coverage by a) classic model checking, b) replay-based or live predicate checking, c) CrystalBall in deep online debugging mode, and d) CrystalBall in execution steering mode.

We propose CrystalBall, a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation.

Using CrystalBall we identified new bugs in mature Mace implementations of a random overlay tree and the Chord distributed hash table implementation. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at run-time.

Publications/presentations

Faculty

Research staff

Student(s)

Acknowledgments

This project builds upon the Mace distributed systems framework (and the MaceMC model checker). Doing so tremendously reduced the time to realize the contributions of this project.

This research is sponsored by the Swiss National Science Foundation (grant FNS 200021-125140), as well as the Hasler Foundation (grant 2103).