François LeGland

MAGDA - Alarm supervision in Telecommunication Networks

Eric Fabre, Albert Benveniste, Claude Jard, Laurie Ricker, Mark Smith

MAGDA (Modélisation et Apprentissage pour une Gestion Distribuéee des Alarmes) is a research project supported by the French National Research Network in Telecommunications programme RNRT (Réseau National de la Recherche en Télécommunications). The challenge for the MAGDA project is twofold: 1/ adapting and mixing different formalisms towards the general objective of fault diagnosis in telecommunication networks, using alarm correlation, and 2/ building an experimental platform to validate this approach. The MAGDA project is a collaboration involving two academic research centres (Irisa/Inria, Rennes, and Université de Paris-Nord), and three industrial companies (France-Telecom/CNET, Alcatel/CRC, and Ilog).

Today, there are no widely recognized standards for tools that could help alarm correlation. Thus it is worth mentioning some key points of this research that could be regarded as useful achievements for themselves:

Validation of black-box methods. Alarm logs are easily available in practice, but tools are lacking to exploit them properly. It is important to check the relevance of data mining techniques to discover structures in these logs, in particular, frequent chronicles of events.
Modelling methodology.It is quite easy to build a model of alarm production and fault propagation for toy examples, but the construction of a model for a real network is quite challenging, and requires some structuring and methodology. In particular, it must be object-oriented and based on generic elements that are instantiated and interconnected. This model will be described in UML. The direct use of this model (for simulation and fault diagnosis) is also an objective of MAGDA.
Model-based diagnosis. The effectiveness of model based approaches for alarm correlation will be evaluated and compared to (possibly associated with) black-box models. Key points to check are the relevance of a formalism for concurrency of events, the relevance of a stochastic model, the influence of an incomplete knowledge of the model and the possibility to cope with a changing model (reconfiguration).
Distributed diagnosis. As very large systems must be modelled, a centralised diagnosis algorithm handling the global state of the system is unaffordable. Therefore, a distributed diagnosis algorithm will be developed, composed of asynchronous local algorithms that supervise part of the network evolution and synchronize their results. This approach is well-suited to highly concurrent systems.
Real case experiments will be carried out on an experimental platform composed of a real network manager connected to a simulator of a simple network. Fault scenarios will be elaborated from alarm logs collected on a true network.

All the above mentioned points are very likely to lead to innovations in alarm correlation and network monitoring. We present focus now on the part of the project devoted to distributed diagnosis algorithms.

Modelling

We regard a telecommunication network as a network of asynchronously interacting finite-state machines. Such systems are subject to spontaneous faults, occurrences of which may trigger alarms. Also, network elements get services from other elements and, in turn, provide services to several alternative network elements. This causes
both fault and alarm propagation throughout the network. As a first idea, we have proposed modelling such a situation as follows:

The mechanism of a spontaneous fault occurrence and alarm propagation for a given element is modelled by a Petri Net of capacity one (PN). Places of the PN capture features of interest concerning the state of the network element, e.g., nominal, faulty, disabled. Having a token in a given place indicates that the associated feature is active.
Observations are attached to transitions. Each time a transition fires, it produces an alarm message that is eventually received by the supervisor of the network (losses of messages are possible). The location of tokens and their moves are not observable.
The interaction of network elements is modelled by common places between the PN models of each element. We consider a true-concurrency semantics for the PN: transition firings are only partially ordered in time, according to the causality structure induced by the underlying PN directed graph structure.

Mathematical framework

Alarm interpretation is then regarded as the problem of inferring, from the observation of alarms, the hidden state history of the PN. As some of the events (in particular, spontaneous faults) are typically random in nature, we consider some kind of probabilistic form for our PN. To prepare for distributed diagnostics, our requirement was that stochastic independence should match concurrency, implying that if two transitions of the PN are concurrent, all interleavings of their reception by the supervisor should have equal likelihood.

We have proposed a new class of stochastic PN that satisfy the above property: Partially Stochastic Petri Nets (PSPN). The well known Hidden Markov Models (HMM) theory has been extended to them. HMM are stochastic automata for which is it desired to infer the most likely hidden state (or transition) sequence from an observed sequence of transition signatures. This machinery is very popular in pattern recognition and in speech
recognition. The basic algorithm is the so-called Viterbi algorithm, which computes on-line the most likely state history. In our PSPN framework, transitions are associated with "tiles" that describe local state changes. The Viterbi algorithm then reconstructs hidden trajectories by concatenating tiles that match the observations. Therefore it was renamed the "Viterbi puzzle".

Further issues for current research

The following topics require further development for MAGDA and are currently under investigation:

Develop a theory of distributed, asynchronous deployment of the Viterbi puzzle, meaning that the puzzle is now played in co-operation by several players, where each has its own partial set of tiles.
Extend this theory to dynamically-evolving models of PN. This is necessary since some network elements are only logical, not physical, and thus are subject to reconfiguration. Our aim is to directly generalise Viterbi puzzles to puzzles with a dynamically-evolving set of tiles.
Further extend the theory to accommodate situations where the model may be incomplete. That is, it is not guaranteed that the observed alarm sequence can be produced by the hypothesised model.
Modelling methodology: the process of deriving the PN model from available information is itself a challenge, as this must be automated as part of the deployment and configuration of the fault management system.
Deployment of the algorithm: as the MAGDA deployment architecture will rely on OMG-CORBA model, our first objective is to use this model as a target model for the deployment of our abstract distributed Viterbi puzzle.