Predicting and Preventing Inconsistencies in Deployed Distributed Systems
Distributed systems form the foundation of our society’s infrastructure. Unfortunately, it is notoriously difficult to develop reliable high-performance distributed systems that run over asynchronous networks. Even if a distributed system is based on a well-understood distributed algorithm, its implementation can contain coding bugs and errors arising from complexities of realistic distributed environments. Many of these errors can only manifest after the system has been running for a long time, has developed a complex topology, and has experienced a particular sequence of low-probability events such as node resets. Consequently, it is difficult to detect such errors using testing and model checking, and many of such errors remain unfixed after the system is deployed.

Execution path coverage by a) classic model checking, b) replay-based or live predicate checking, c) CrystalBall in deep online debugging mode, and d) CrystalBall in execution steering mode.
We propose CrystalBall, a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation.
Using CrystalBall we identified new bugs in mature Mace implementations of a random overlay tree and the Chord distributed hash table implementation. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at run-time.
Publications/presentations
- Finding Almost-Invariants in Distributed Systems, Maysam Yabandeh, Abhishek Anand, Marco Canini, and Dejan Kostic, Proceedings of the 30th IEEE Symposium on Reliable Distributed Systems (SRDS), October 2011.
- Model Checking of Distributed Algorithm Implementations, Maysam Yabandeh, Ph.D. Thesis, 2011.
- Predicting and Preventing Inconsistencies in Deployed Distributed Systems, Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak, ACM Transactions on Computer Systems (TOCS), March 2010.
- Almost-Invariants: From Bugs in Distributed Systems to Invariants, Maysam Yabandeh, Abhishek Anand, Marco Canini, and Dejan Kostic, EPFL Technical report, October 2009.
- Towards a Cost-Effective Networking Testbed, Nikola Knezevic, Simon Schubert, and Dejan Kostic, Proceedings of the 4th Workshop on Real Overlays and Distributed Systems (ROADS), October 2009, (Also appears in SIGOPS Operating Systems Review, Volume 43, Number 4, pages 66-71, December 2009).
- DPOR-DS: Dynamic Partial Order Reduction in Distributed Systems, Maysam Yabandeh, and Dejan Kostic, Technical Report, 2009.
- CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems, Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak, Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’09), April 2009.
- CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems, Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak, EPFL Technical report, May 2008.
- Online Debugging and Future Prediction for Reliable Large-Scale Distributed Systems, Dejan Kostic and Viktor Kuncak, NanoTera planning workshop, May 24-25, 2007.
Faculty
Research staff
Student(s)
Acknowledgments
This project builds upon the Mace distributed systems framework (and the MaceMC model checker). Doing so tremendously reduced the time to realize the contributions of this project.
This research is sponsored by the Swiss National Science Foundation (grant FNS 200021-125140), as well as the Hasler Foundation (grant 2103).