Dissertation Announcement for Rui Jia
03/07/16 at 3:00 PM

February 29, 2016

Dear Faculty, Graduate and Undergraduate Students,

You are cordially invited to my dissertation oral defense.

Dissertation Title: Towards Model-based Fault Management for Computing Systems

When: Monday, March 7th, 2016, 03:00 PM

Where: Simrall 104

Candidate: Rui Jia

Degree: Doctor of Philosophy, Electrical and Computer Engineering

Committee:

Dr. Sherif Abdelwahed
Associate Professor of Electrical and Computer Engineering (Major Professor)

Dr. Derek Anderson
Assistant Professor of Electrical and Computer Engineering (Committee Member)

Dr. David A. Dampier
Professor of Computer Science and Engineering (Committee Member)

Dr. Bryan A. Jones
Associate Professor of Electrical and Computer Engineering (Committee Member)

Abstract:

Large-scale distributed computing systems are used extensively to host critical applications in fields such as national defense, finance, scientific research, and commerce. However, applications in distributed systems face the risk of service outages due to inevitable faults. Without proper fault management methods, faults can lead to significant revenue loss and degradation of Quality of Service (QoS). An ideal fault management solution should guarantee fast and accurate fault diagnosis, scalability in distributed systems, portability across a variety of systems, and the versatility to recover from different types of faults.

This dissertation presents a model-based fault management structure that automatically recovers computing systems from faults. The structure can recover a system from common faults while minimizing the impact on the system’s Quality of Service (QoS). It covers all stages of fault management, including fault detection, identification, and recovery, and it has the flexibility to incorporate various fault detection and diagnosis methods. When faults occur, the approach identifies their types and intensity, then uses a predictive control algorithm and a user-defined utility cost function, which defines the performance objectives, to compute the optimal recovery plan with minimum performance degradation. The fault management approach has been verified on a centralized Web application testbed and a distributed big data processing testbed with four types of simulated faults: memory leak, network congestion, CPU hog, and disk failure. The dissertation also verifies the feasibility of the fault recovery control algorithm. Simulation results show that the approach enables effective automatic recovery from these faults. Performance evaluation reveals that the CPU and memory overhead of the fault management process is negligible.

To allow domain engineers to conveniently apply the proposed fault management structure to their specific systems, a component-based modeling environment has been developed. The meta-model of the fault management structure is developed in the Unified Modeling Language (UML) as an abstraction of a general fault management solution for computing systems. It defines the fundamental reusable components that make up such a system, including the connections among them, the attributes of each component, and the constraints. The meta-model can be interpreted into a user-friendly graphical modeling environment for creating application models of practical domain-specific systems and generating code that can be executed on them.

Best regards,
Rui