Epidemic failure detection and consensus for extreme parallelism

Text
- Accepted Version


Katti, A., Di Fatta, G., Naughton, T. and Engelmann, C. (2018) Epidemic failure detection and consensus for extreme parallelism. International Journal of High Performance Computing Applications, 32 (5). pp. 729-743. ISSN 1094-3420 doi: 10.1177/1094342017690910

Abstract/Summary

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
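
The general idea of detecting failures by random pinging and spreading them by gossip can be illustrated with a small simulation. The sketch below is not one of the paper's three algorithms: it is a generic push-gossip model written for illustration, the function name and parameters are hypothetical, and agreement is checked by a global observer rather than by a distributed consensus-detection technique.

```python
import random


def gossip_cycles_to_consensus(n, failed, seed=0, max_cycles=10_000):
    """Count gossip cycles until every alive process knows all failures.

    Illustrative model only (an assumption, not the paper's method):
    in each cycle every alive process pings one uniformly random peer.
    A ping to a failed peer is treated as a detection; a ping to an
    alive peer pushes the sender's current failure list to that peer.
    """
    rng = random.Random(seed)
    failed = frozenset(failed)
    alive = [p for p in range(n) if p not in failed]
    known = {p: set() for p in alive}  # local failure list per alive process

    for cycle in range(1, max_cycles + 1):
        # Snapshot the lists so information moves at most one hop per cycle.
        snapshot = {p: set(s) for p, s in known.items()}
        for p in alive:
            target = rng.randrange(n - 1)
            if target >= p:
                target += 1  # uniform choice among peers other than p
            if target in failed:
                known[p].add(target)          # no reply: record the failure
            else:
                known[target] |= snapshot[p]  # push gossip to an alive peer
        # Global-observer check, used here only to stop the simulation.
        if all(known[p] == failed for p in alive):
            return cycle
    raise RuntimeError("no agreement within max_cycles")


if __name__ == "__main__":
    for n in (64, 256, 1024, 4096):
        cycles = gossip_cycles_to_consensus(n, failed={1, 2}, seed=42)
        print(f"n={n:5d}  cycles={cycles}")
```

Running the script for increasing values of n gives a rough feel for how the cycle count grows with system size in this simplified model; the paper's own measurements were obtained with the Extreme-scale Simulator and should be taken from the publisher's version.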

Item Type	Article
URI	https://reading-clone.eprints-hosting.org/id/eprint/71175
Identification Number/DOI	10.1177/1094342017690910
Refereed	Yes
Divisions	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Publisher	Sage

Deposit Details

Date Deposited:	17 Jul 2017 15:36
Last Modified:	23 Jun 2024 02:51
