Fault tolerance in distributed systems

A distributed system is a collection of computers (called nodes) that communicate with one another over a communication medium. Under the control of system software, the nodes cooperate to carry out a task. One of the central challenges in such a system is guaranteeing that computation continues in the presence of failures without compromising correctness. For example, to enhance availability, objects (data or processes) can be replicated at different nodes; however, when failures occur, replication may leave the copies of an object inconsistent with one another. This calls for carefully designed replica control mechanisms. As another example, to maintain consistency, the failure of one machine may require aborting related processes on other, still-functioning machines. Different abort schemes may abort different numbers of processes, and a better protocol allows more application processes to proceed successfully while incurring less overhead. Our research interests in this area include the development of efficient fault-tolerance protocols, techniques for evaluating such protocols, and the exploration of the theoretical foundations of fault tolerance in distributed systems.
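
To make the replication trade-off above concrete, the following is a minimal sketch of one well-known replica control technique, quorum-based replication with version numbers. It is an illustration only and is not drawn from the protocols developed in this research. The names QuorumStore, Replica, n, r, and w are hypothetical, and the "nodes" are simulated in memory, with a failure modeled by setting a flag.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One simulated copy of the object (hypothetical, in-memory only)."""
    value: object = None
    version: int = 0
    up: bool = True  # False models a failed node


class QuorumStore:
    """Illustrative quorum-based replica control over n simulated replicas."""

    def __init__(self, n: int, r: int, w: int):
        # r + w > n: every read quorum overlaps every write quorum.
        # 2 * w > n: any two write quorums overlap, so versions keep increasing.
        if r + w <= n or 2 * w <= n:
            raise ValueError("quorums must intersect: need r + w > n and 2w > n")
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w

    def _quorum(self, size: int):
        # Pick the first `size` live replicas; fail if too few nodes are up.
        live = [rep for rep in self.replicas if rep.up]
        if len(live) < size:
            raise RuntimeError("quorum unavailable")
        return live[:size]

    def write(self, value) -> None:
        quorum = self._quorum(self.w)
        # The new version exceeds every version seen in the write quorum.
        version = max(rep.version for rep in quorum) + 1
        for rep in quorum:
            rep.value, rep.version = value, version

    def read(self):
        quorum = self._quorum(self.r)
        # The highest-versioned copy in the read quorum is the latest write.
        return max(quorum, key=lambda rep: rep.version).value
```

With n = 5 and r = w = 3, for instance, this sketch keeps serving reads and writes while any two replicas are down and refuses to proceed once a third fails. That is the availability-versus-consistency trade-off described above: operations are rejected rather than allowed to return or create inconsistent copies.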