A brief discussion of the traces is available in our paper mentioned below.
If you decide to cite our work, please cite the paper below (bibtex here) rather than the link to this webpage.
Florin Dinu, T. S. Eugene Ng
RCMP: Enabling Efficient Re-computation Based Failure Resilience for Big Data Analytics
in the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), Phoenix, AZ, USA, May 2014
The trace files:
- stic-node-failures.dat (2304 KB)
- sugar-node-failures.dat (2555 KB)
Some details about the traces:
- The failure traces are from two different clusters: STIC and SUG@R. At the time we received the failure traces (Sept 2012), we counted 218 nodes in STIC and 121 nodes in SUG@R. The current size of the clusters may differ.
- The traces for STIC span the period between Sept 2009 and Sept 2012.
- The traces for SUG@R span the period between Jan 2009 and Sept 2012.
- The traces are based on daily, automated checks of node unavailability. In other words, an automated script checks once a day which nodes are down.
- Nodes that crash and are restarted between two consecutive unavailability checks are not captured in the traces. Such crashes are likely caused by bad jobs, not bad nodes; bad nodes tend to require repair and stay offline for longer than a day.
- Some (not all) of the failure events are accompanied by a description of the underlying cause.
- A few days are missing from the traces. This may mean the monitoring system was down or that reporting was disabled, for any number of reasons.
- Every node unavailability entry is accompanied by the state of the node. "Offline" means that the IT staff manually took the node offline. "Down" means that an automated component (the scheduler) took the node out of service. "Job-exclusive" means that the node went down while a job was running on it; whether the job caused the problem is not indicated. A sketch of how these entries might be read is shown after this list.
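Since this page does not document the exact column layout of the .dat files, the following is only a minimal sketch of how the entries might be read. It assumes each non-empty line carries a date, a node identifier, and a state (offline / down / job-exclusive), optionally followed by a free-text cause description; adjust the parsing to the actual layout of the files.

```python
#!/usr/bin/env python3
"""Minimal sketch for reading one of the failure trace files.

The column layout used here is an assumption, not the documented
format: date, node identifier, node state, optional cause description.
"""
from collections import Counter


def load_trace(path):
    """Yield (date, node, state, description) tuples from a trace file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Hypothetical column order: date, node, state; anything
            # after the third field is kept as the cause description.
            parts = line.split(None, 3)
            if len(parts) < 3:
                continue  # skip lines that do not match the assumed layout
            date, node, state = parts[:3]
            description = parts[3] if len(parts) == 4 else ""
            yield date, node, state, description


if __name__ == "__main__":
    # Tally unavailability entries by node state (offline / down /
    # job-exclusive, per the description above).
    states = Counter(
        state.lower() for _, _, state, _ in load_trace("stic-node-failures.dat")
    )
    for state, count in states.most_common():
        print(f"{state}: {count}")
```

Run against either trace file, this prints a per-state count of unavailability entries, which is a quick sanity check before any deeper analysis.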
For contact information go here.