Fault detection and diagnosis in distributed systems: an approach by Partially Stochastic Petri Nets


Eric Fabre, Armen Aghasaryan, Albert Benveniste
Renée Boubour, Claude Jard

We address the problem of alarm correlation and diagnosis in large, distributed systems.

Our approach consists in extending Hidden Markov Model (HMM) techniques to distributed systems. In HMM theory, one considers a stochastic automaton whose transitions are labelled by random observations, whereas the state of the automaton is hidden. The idea is that the hidden state encodes the safe state and the different faulty states, whereas the random observations correspond to noisy alarms. Faults are inferred from alarm observations by searching for the most likely hidden state sequence for a given alarm sequence. The corresponding algorithm is the Viterbi algorithm, which uses dynamic programming to compute the maximum likelihood sequence of hidden states.
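As an illustration of the centralized HMM setting described above, here is a minimal Python sketch of the Viterbi algorithm on a toy two-state model. The state names ("safe", "faulty"), the alarm symbols ("quiet", "alarm") and all probabilities are assumptions invented for this example and do not come from the paper. For brevity the sketch attaches observations to states rather than to transitions, a common simplification of the automaton described above.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations.

    Works in log space to avoid underflow; all probabilities must be > 0.
    """
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: predecessor of s on that best path (empty for t = 0)
    back = [{}]

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Dynamic programming step: pick the best predecessor of s.
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Backtrack from the most likely final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: one component that is either safe or faulty.
STATES = ["safe", "faulty"]
START_P = {"safe": 0.9, "faulty": 0.1}
TRANS_P = {
    "safe": {"safe": 0.9, "faulty": 0.1},
    "faulty": {"safe": 0.2, "faulty": 0.8},
}
# Noisy alarms: a safe component sometimes raises an alarm, a faulty one
# sometimes stays quiet.
EMIT_P = {
    "safe": {"quiet": 0.9, "alarm": 0.1},
    "faulty": {"quiet": 0.3, "alarm": 0.7},
}

if __name__ == "__main__":
    alarms = ["quiet", "alarm", "alarm"]
    print(viterbi(alarms, STATES, START_P, TRANS_P, EMIT_P))
    # prints ['safe', 'faulty', 'faulty']
```

In this toy run a single alarm would be explained as noise, but two consecutive alarms make a fault the most likely explanation. Note that the sketch relies on a global state and a global time index, which are exactly the assumptions that the distributed setting has to relax.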

To extend this approach to distributed systems, we must (1) relax the need for a global state, (2) relax the need for a global time, and (3) take advantage of possible concurrency in the distributed system. We propose Partially Stochastic Petri Nets (PSPN) as a new class of Petri nets (PN) with a partial order semantics, in which concurrent parts of the PN behave as independent stochastic systems, a feature unique to PSPN. We extend HMM theory to PSPN and propose the Viterbi puzzle as a new paradigm for alarm correlation and diagnosis.
