Rx: Treating Bugs as Allergies---a Safe
Method to Survive Software Failures
By Feng Qin, Joseph Tucek, Jagadeesan Sundaresan and Yuanyuan Zhou
Published in Proceedings of the ACM Symposium on Operating Systems
Principles (SOSP'05), Oct. 2005 (13% acceptance rate, 20/155). [PDF]
Awarded Paper. Also to appear in ACM Transactions
on Computer Systems Special Issue on Best Papers from
SOSP-2005.
A preliminary version was published in Proceedings of the USENIX Workshop on
Hot Topics in Operating Systems (HotOS'05), 2005. [PDF]
Abstract:
Many applications demand availability. Unfortunately, software failures
greatly reduce system availability. Prior work on surviving software failures
suffers from one or more of the following limitations: required application
restructuring, inability to address deterministic software bugs, unsafe
speculation on program execution, and long recovery time.
This paper proposes an innovative safe technique, called Rx, which can
quickly recover programs from many types of software bugs, both deterministic
and non-deterministic. Our idea, inspired by allergy treatment in real life,
is to roll back the program to a recent checkpoint upon a software failure, and
then to re-execute the program in a modified environment. We base this
idea on the observation that many bugs are correlated with the execution
environment, and therefore can be avoided by removing the "allergen" from the
environment. Rx requires few to no modifications to applications and provides
programmers with additional feedback for bug
diagnosis.
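The rollback and re-execution idea above can be illustrated with a toy
simulation. This is a minimal sketch and not the Rx implementation: the
handler, the SimulatedCrash exception, the "buffer_padding" setting, and the
CANDIDATE_ENVS list are all hypothetical names chosen for illustration, and a
deep copy of a Python dictionary stands in for a real process checkpoint.

```python
import copy


class SimulatedCrash(Exception):
    """Stands in for a software failure detected at run time."""


def handle_request(state, request, env):
    # Mutate state first, so that a failure leaves it dirty and the
    # rollback below has something to undo.
    state["requests_seen"] += 1
    # Toy deterministic bug: the request is "copied" into an 8-byte buffer
    # with no bounds check. Extra padding in the environment hides the bug.
    capacity = 8 + env.get("buffer_padding", 0)
    if len(request) > capacity:
        raise SimulatedCrash(f"overflow while handling {request!r}")
    state["served"].append(request)


# Hypothetical environment changes, tried in order after a failure.
CANDIDATE_ENVS = [
    {},                      # plain re-execution, nothing changed
    {"buffer_padding": 16},  # pad buffers to absorb small overflows
]


def serve(requests):
    state = {"served": [], "requests_seen": 0}
    recoveries = []  # (request, environment that worked)
    for request in requests:
        checkpoint = copy.deepcopy(state)  # take a checkpoint
        try:
            handle_request(state, request, env={})
            continue  # normal execution succeeded
        except SimulatedCrash:
            pass
        for env in CANDIDATE_ENVS:
            state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
            try:
                handle_request(state, request, env)  # re-execute
            except SimulatedCrash:
                continue  # this environment did not help, try the next
            recoveries.append((request, env))
            break
        else:
            raise RuntimeError(f"could not recover from {request!r}")
    return state, recoveries


if __name__ == "__main__":
    final_state, recoveries = serve(["short", "a-much-longer-request"])
    print(final_state)
    print(recoveries)
```

In this sketch, plain re-execution fails again because the bug is
deterministic, while re-execution with padded buffers succeeds. The recorded
pair of request and working environment hints at the kind of diagnostic
feedback the abstract mentions: knowing that padding avoided the failure
suggests an overflow.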
We have implemented Rx on Linux. Our experiments with four server applications
that contain six bugs of various types show that Rx can survive all six
software failures and provide transparent, fast recovery within 0.017-0.16
seconds, 21-53 times faster than the whole program restart approach for all but
one case (CVS). In contrast, the two tested alternatives, a whole program
restart approach and a simple rollback and re-execution without environmental
changes, cannot successfully recover the three servers (Squid, Apache, and CVS)
that contain deterministic bugs, and have only a 40% recovery rate for the
server (MySQL) that contains a non-deterministic concurrency bug. Additionally,
Rx's checkpointing system is lightweight, imposing small time and space
overheads.