Monitoring or diagnosis of large-scale distributed Discrete Event
Systems with asynchronous communication is a demanding task. Ensuring
that the methods developed for Discrete Event Systems properly scale up
to such systems is a challenge. In this paper, we explain why the use of
partial orders cannot be avoided if this objective is to be achieved. To
support this claim, we try to push classical techniques (parallel
composition of automata and languages) to their limits. We focus on
on-line techniques, where a key difficulty is the choice of proper data
structures to represent the set of all runs of a distributed system. We
discuss the use of previously known structures such as execution trees
and unfoldings. We propose an alternative and more compact data
structure called a trellis. We study the apparatus needed to extend the
use of these data structures to represent distributed executions.
Finally, we show how such data structures can be used to perform
distributed monitoring and diagnosis.
The techniques reported here were used in an industrial context for
fault management and alarm correlation in telecommunications networks.
This report is an extended version of the plenary address that was
given by the second author at WODES'2006.
This work is partially supported by RNRT (National Research Network
in Telecommunication) through the SWAN project (Self aWare
mANagement).