MAGDA - Alarm supervision in Telecommunication Networks
Eric Fabre, Albert Benveniste, Claude Jard, Laurie Ricker, Mark
Smith
MAGDA (Modélisation
et Apprentissage pour une Gestion Distribuéee des Alarmes) is
a research project supported by the French National Research Network in
Telecommunications programme RNRT (Réseau National de la Recherche
en Télécommunications). The challenge for the MAGDA project
is twofold: 1/ adapting and mixing different formalisms towards the general
objective of fault diagnosis in telecommunication networks, using alarm
correlation, and 2/ building an experimental platform to validate this
approach. The MAGDA project is a collaboration involving two academic research
centres (Irisa/Inria, Rennes, and Université de Paris-Nord), and
three industrial companies (France-Telecom/CNET, Alcatel/CRC, and Ilog).
Today, there are no widely recognized
standards for tools that could help alarm correlation. Thus it is worth
mentioning some key points of this research that could be regarded as useful
achievements for themselves:
-
Validation
of black-box methods. Alarm logs are easily available in practice,
but tools are lacking to exploit them properly. It is important to check
the relevance of data mining techniques to discover structures in these
logs, in particular, frequent chronicles of events.
-
Modelling methodology.It
is quite easy to build a model of alarm production and fault propagation
for toy examples, but the construction of a model for a real network is
quite challenging, and requires some structuring and methodology. In particular,
it must be object-oriented and based on generic elements that are instantiated
and interconnected. This model will be described in UML. The direct use
of this model (for simulation and fault diagnosis) is also an objective
of MAGDA.
-
Model-based
diagnosis. The effectiveness of model based approaches for alarm
correlation will be evaluated and compared to (possibly associated with)
black-box models. Key points to check are the relevance of a formalism
for concurrency of events, the relevance of a stochastic model, the influence
of an incomplete knowledge of the model and the possibility to cope with
a changing model (reconfiguration).
-
Distributed
diagnosis. As very
large systems must be modelled, a centralised diagnosis algorithm handling
the global state of the system is unaffordable. Therefore, a distributed
diagnosis algorithm will be developed, composed of asynchronous local algorithms
that supervise part of the network evolution and synchronize their results.
This approach is well-suited to highly concurrent systems.
-
Real case experiments will be carried
out on an experimental platform composed of a real network manager connected
to a simulator of a simple network. Fault scenarios will be elaborated
from alarm logs collected on a true network.
All the above mentioned points
are very likely to lead to innovations in alarm correlation and network
monitoring. We present focus now on the part of the project devoted to
distributed diagnosis algorithms.
Modelling
We regard a telecommunication network
as a network of asynchronously interacting finite-state machines. Such
systems are subject to spontaneous faults, occurrences of which may trigger
alarms. Also, network elements get services from other elements and, in
turn, provide services to several alternative network elements. This causes
both fault and alarm propagation
throughout the network. As a first idea, we have proposed modelling such
a situation as follows:
-
The mechanism of a spontaneous fault
occurrence and alarm propagation for a given element is modelled by a Petri
Net of capacity one (PN). Places of the PN capture features of interest
concerning the state of the network element, e.g., nominal, faulty, disabled.
Having a token in a given place indicates that the associated feature is
active.
-
Observations are attached to transitions.
Each time a transition fires, it produces an alarm message that is eventually
received by the supervisor of the network (losses of messages are possible).
The location of tokens and their moves are not observable.
-
The interaction of network elements
is modelled by common places between the PN models of each element. We
consider a true-concurrency semantics for the PN: transition firings are
only partially ordered in time, according to the causality structure induced
by the underlying PN directed graph structure.
Mathematical framework
Alarm interpretation is then regarded
as the problem of inferring, from the observation of alarms, the hidden
state history of the PN. As some of the events (in particular, spontaneous
faults) are typically random in nature, we consider some kind of probabilistic
form for our PN. To prepare for distributed diagnostics, our requirement
was that stochastic independence should match concurrency,
implying that if two transitions of the PN are concurrent, all interleavings
of their reception by the supervisor should have equal likelihood.
We have proposed a new class of stochastic
PN that satisfy the above property: Partially Stochastic Petri Nets (PSPN).
The well known Hidden Markov Models (HMM) theory has been extended to them.
HMM are stochastic automata for which is it desired to infer the most likely
hidden state (or transition) sequence from an observed sequence of transition
signatures. This machinery is very popular in pattern recognition and in
speech
recognition. The basic algorithm
is the so-called Viterbi algorithm, which computes on-line the most likely
state history. In our PSPN framework, transitions are associated with "tiles"
that describe local state changes. The Viterbi algorithm then reconstructs
hidden trajectories by concatenating tiles that match the observations.
Therefore it was renamed the "Viterbi puzzle".
Further issues for current research
The following topics require further
development for MAGDA and are currently under investigation:
-
Develop a theory of distributed,
asynchronous deployment of the Viterbi puzzle, meaning that the puzzle
is now played in co-operation by several players, where each has its own
partial set of tiles.
-
Extend this theory to dynamically-evolving
models of PN. This is necessary since some network elements are only logical,
not physical, and thus are subject to reconfiguration. Our aim is to directly
generalise Viterbi puzzles to puzzles with a dynamically-evolving set of
tiles.
-
Further extend the theory to accommodate
situations where the model may be incomplete. That is, it is not
guaranteed that the observed alarm sequence can be produced by the hypothesised
model.
-
Modelling methodology: the process
of deriving the PN model from available information is itself a challenge,
as this must be automated as part of the deployment and configuration of
the fault management system.
-
Deployment of the algorithm:
as the MAGDA deployment architecture will rely on OMG-CORBA model, our
first objective is to use this model as a target model for the deployment
of our abstract distributed Viterbi puzzle.