François LeGland

Diagnosis of distributed and asynchronous Discrete Event Systems

Application to fault management in telecommunications networks

Armen Aghasaryan (Alcatel), Albert Benveniste, Eric Fabre, Stefan Haar, and Claude Jard (Irisa)

Typical examples of distributed and asynchronous Discrete Event Systems are networked systems, such as telecommunications networks. In such systems, the sensor system is distributed, it involves several local sensors, attached to some nodes of the network. Each sensor has only a partial view of the overall system. The different sensors possess their own local time, but they are not synchronized. Alarms are reported to the global supervisor asynchronously, and this supervisor performs diagnosis. This is the typical architecture in telecommunications network management systems today, our motivating application (see the Magda project). Events may be correctly ordered by each individual sensor, but communicating alarm events via the network causes a loss of synchronization, and results in a nondeterministic and possibly unbounded interleaving at the supervisor. Hence, the right picture, for what the supervisor collects, is not a sequence of alarms, but rather a partially ordered set of alarms.

In addition, for future telecommunications networks, services, and systems, supervision will not be centralized, but rather distributed. This means that each supervisor will have only a partial view of the overall system, observes only its local alarms, and has only to report local diagnosis. But, clearly, the different local diagnoses delivered by the different supevisors need to be coherent, i.e., they must be local projections of a unique overall diagnosis, this is what we mean by distributed diagnosis. Clearly, the distributed diagnosers need to cooperate at establishing distributed diagnosis. See [MagdaPaper2002] for a description of this application.

Eric Fabre et al. [FBJ2001] have proposed to regard distributed and asynchronous Discrete Event Systems as Bayesian networks of dynamical systems. Eric Fabre has further elaborated this approach in [Fabre_2002_a] and [Fabre_2002_b] . Such models can be randomized or alternatively be simply nondeterministic. Diagnosis is then formulated as the distributed state trajectory reconstruction, from distributed and asynchronous sensor observations. This approach to distributed diagnosis can be decomposed into several subproblems : 1/ an abstract algorithmic setting for distributed constraint solving [Fabre2003-2] , and 2/ a factorized and modular representation for trajectories of distributed systems, using so-called net unfolding techniques [Fabre2003-1] [Fabre2004] . Putting everything together to solve the distributed diagnosis problem yields papers [BFHJ2003-a] and [Concur2003].

In 2004, Eric Fabre has proposed an alternative representation for runs of distributed systems, by means of trellis processes instead of unfoldings. Trellis processes generalize to distributed and concurrent systems the usual notion of trellis, as it is commonly used in dynamic programming or Viterbi decoding, for automata. They enjoy the same nice algebraic properties as unfoldings. See slides on trellis nets for a preliminary presentation.

Since 2004, Stefan Haar is leading a new research to extend our diagnosis techniques to dynamically reconfigurable systems. Static models such as Petri nets are not suitable any more. Graph Grammars can be conveniently used instead. See [HaaBFJ2005]

See slides on modeling and slides on distributed diagnosis for a comprehensive presentation.

[HaaBFJ2005] A. Benveniste, S. Haar, E. Fabre, and C. Jard. ``Fault diagnosis for distributed asynchronous dynamically reconfigured discrete event systems''. In Proc. of the 16th IFAC World Congress, Prague. July 2005.

[Fabre2004] E. Fabre. Factorization of Unfoldings for Distributed Tile Systems Part 2: General Case. INRIA Res., Rep. 5186. May 2004.

[Concur2003] A. Benveniste, S. Haar, E. Fabre, and C. Jard. ``Distributed monitoring of concurrent and asynchronous systems''. Plenary address at CONCUR'2003, Proceedings of CONCUR'2003.
Informal presentation published in Proc. of IEEE Control and Decision Conference, Dec. 2003.
Final and improved version published in Discrete Event Dynamic Systems, 15(1), 33-84, Mar. 2005.
Extended version available as Irisa Research Report 1636, July 2004 and INRIA Research Report 4842 V2, July 2004.

[Fabre2003-2] E. Fabre. Convergence of Turbo Algorithms for Systems Defined by Local Constraints. INRIA Res., Rep. 4860. July 2003.

[Fabre2003-1] E. Fabre. Factorization of Unfoldings for Distributed Tile Systems Part 1 : Reduced Interaction Case. INRIA Res., Rep. 4829. May 2003.

[MagdaPaper2002] A. Aghasaryan, C. Dousson, E. Fabre, Y. Pencolé, A. Osmani. Modeling Fault Propagation in Telecommunications Networks for Diagnosis Purposes. XVIII World Telecommunications Congress, 22-27 September 2002 - Paris, France.

[FBJ2001] E. Fabre, A. Benveniste, C. Jard. ``Distributed Diagnosis for Large Discrete Event Dynamic Systems''.
In Proc of the IFAC world congress, July 2002.

[Fabre_2002_a] E. Fabre. Compositional models of distributed and asynchronous dynamical
systems. In Proc of the 2002 IEEE Conf. on Decision and Control, 1--6, Dec. 2002, Las Vegas, 2002.

[Fabre_2002_b] E. Fabre. Monitoring distributed systems with distributed algorithms. In Proc of the 2002 IEEE Conf. on Decision and Control, 411--416, Dec. 2002, Las Vegas, 2002.

[BFHJ2003-a] A. Benveniste, E. Fabre, C. Jard, and S. Haar. ``Diagnosis of asynchronous discrete event systems, a net unfolding approach''. IEEE Trans. on Automatic Control, 48(5), 714-727, May 2003.