Self-management and autonomic computing: distributed algorithms

Armen Aghasaryan (Alcatel-Lucent), Albert Benveniste, Eric Fabre, Stefan Haar, and Claude Jard (Irisa)

Telecommunications networks and services are becoming increasingly complex and heterogeneous. Mobility and ubiquity require the transparent integration of diverse technologies and architectures. Management is therefore evolving from a heavily manual and centralized activity to an automatic one, with high added value for operators and equipment suppliers. Cross-domain, and even cross-operator, management with end-to-end guarantees requires a distributed and adaptive approach. Distributed management is expected to play a strong role in complex networks, where distributing the various aspects of management (monitoring, configuration, provisioning, billing, fault, and performance) becomes imperative for greater reliability, scalability, and efficiency. The emergence of Web services extends the traditional scope of network and service management. The generic term self-management encompasses all aspects of flexible distributed management.

The following four aspects of self-management have been identified, in which the need for and the purpose of autonomic computing become apparent:

While significant effort has been devoted to architectures supporting self-management, little has been done to develop distributed algorithms for the different management functions; this is the purpose of our research.

Model-based approach

Our approach is illustrated in the following figure.

This figure shows the SDH/SONET optical ring in operation in the Paris area (the locations indicated are suburbs of Paris). A few ports and links are shown. The bottom-left diagram is a detailed view of the Montrouge node. The nested rectangles, shaded from light to mid gray, represent the different layers in the SDH hierarchy, the largest one being the physical layer. The boxes are the managed objects (MOs), and the links across the layers are the paths for upward and downward fault propagation. Each MO can be seen as an automaton that reacts to input events and messages, changes its state, and emits events and alarms to its neighbors, both co-located and distant.
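The view of a managed object as an automaton can be made concrete with a small sketch. The Python below is only an illustration under strong assumptions: the class `ManagedObject`, the function `propagate`, the two states, the single `"fault"` event, and the three layer names are all hypothetical, and the asynchronous network is simulated sequentially with a queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ManagedObject:
    """Hypothetical managed object: a two-state automaton with neighbors."""
    name: str
    state: str = "ok"
    neighbors: list = field(default_factory=list)  # MOs that receive our events

    def react(self, event):
        """Consume one event; return (alarm or None, list of (neighbor, event))."""
        if event == "fault" and self.state == "ok":
            self.state = "failed"
            # Emit a local alarm and forward the fault to every neighbor.
            return f"{self.name}:alarm", [(n, "fault") for n in self.neighbors]
        return None, []  # already failed, or unknown event: nothing to do


def propagate(root):
    """Inject a fault at `root` and collect alarms in order of emission."""
    alarms = []
    queue = deque([(root, "fault")])
    while queue:
        mo, event = queue.popleft()
        alarm, outgoing = mo.react(event)
        if alarm is not None:
            alarms.append(alarm)
        queue.extend(outgoing)
    return alarms


# Three layers of one node, lowest first; the names are illustrative only.
physical = ManagedObject("physical")
section = ManagedObject("section")
path = ManagedObject("path")
physical.neighbors.append(section)  # upward propagation link
section.neighbors.append(path)

alarms = propagate(physical)
print(alarms)
```

A single fault injected at the lowest layer yields one alarm per layer above it, which is the kind of correlated alarm pattern a supervisor has to interpret.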

To give an idea of system complexity: each network node is an asynchronous network of automata, each automaton has a handful of states, and there are hundreds to thousands of such automata in the network. Each root fault can cause hundreds of correlated alarms that travel throughout each subsystem and are collected by the corresponding local supervisor. Supervised domains may well be orders of magnitude larger in the future. Scalability is thus a major concern, and the algorithms we develop must take this context into account.

To correlate faults and alarms that are causally related, our distributed algorithms use behavioral models of the system. This holds for fault management, but is also valid for other functions requiring nontrivial algorithms. The first issue is therefore how to construct such models, since this cannot be done manually. This is the subject of what we call self-modeling.
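To illustrate what correlating causally related faults and alarms means, here is a deliberately simplified Python sketch. It is not the distributed algorithm of this project: it is centralized, assumes a single root fault, and uses a hand-written causal graph. The names `CAUSES`, `ROOT_FAULTS`, `consequences`, and `explain`, as well as the fault and alarm labels, are hypothetical.

```python
# Hypothetical behavioral model reduced to a causal graph:
# each event maps to the events it directly causes.
CAUSES = {
    "fiber_cut": ["LOS_A", "LOS_B"],
    "LOS_A": ["AIS_A"],
    "LOS_B": ["AIS_B"],
    "card_failure": ["AIS_B"],
}
ROOT_FAULTS = {"fiber_cut", "card_failure"}


def consequences(event, causes):
    """Return every event reachable from `event` in the causal graph."""
    seen = set()
    stack = [event]
    while stack:
        for effect in causes.get(stack.pop(), []):
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen


def explain(observed, causes, roots):
    """Return the root faults whose consequences cover all observed alarms."""
    observed = set(observed)
    return sorted(r for r in roots if observed <= consequences(r, causes))


both = explain({"AIS_A", "AIS_B"}, CAUSES, ROOT_FAULTS)
single = explain({"AIS_B"}, CAUSES, ROOT_FAULTS)
print(both, single)
```

With both alarms observed, only the fiber cut remains as an explanation; with `AIS_B` alone, both root faults are still candidates. The model is what makes this narrowing possible, which is why building it automatically matters.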

Self-modeling

Self-modeling is explained in the following figure.

The existence of standards allows us to build about a dozen classes of generic managed objects, with their associated behaviors. Network discovery reveals the structure of the management system. From these two inputs, a behavioral model of the entire system can be generated automatically, as well as the distributed algorithm for fault diagnosis.
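The combination of the two inputs can be sketched as follows. This is a minimal Python illustration, assuming a toy format for both inputs: the class library `GENERIC_CLASSES`, the `discovered` topology, the function `build_model`, and all instance names and event labels are hypothetical and not taken from any standard or tool.

```python
# Hypothetical library of generic MO classes. Each class gives a local
# behavior as {(state, input event): (next state, emitted alarm)}.
GENERIC_CLASSES = {
    "PhysicalPort": {("ok", "cut"): ("down", "LOS")},
    "Section": {("ok", "LOS"): ("down", "AIS")},
}

# Hypothetical result of network discovery: MO instances with their
# class, and the propagation links between instances.
discovered = {
    "instances": {"port1": "PhysicalPort", "sec1": "Section"},
    "links": [("port1", "sec1")],
}


def build_model(classes, topology):
    """Instantiate one automaton per discovered MO and wire the links."""
    model = {}
    for name, cls in topology["instances"].items():
        model[name] = {
            "behavior": classes[cls],  # shared generic behavior of the class
            "state": "ok",             # assumed initial state
            "sends_to": [],
        }
    for src, dst in topology["links"]:
        model[src]["sends_to"].append(dst)
    return model


model = build_model(GENERIC_CLASSES, discovered)
print(sorted(model))
```

The point of the design is that behaviors are written once per class, while the per-network part (instances and links) comes entirely from discovery, so no model has to be written by hand for a given network.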