Accelerating the Performance of Multi-Site Scientific Applications through Coordinated Data Management

Hermes is an associated team between the Myriads team at Inria Rennes - Bretagne Atlantique and Lawrence Berkeley National Laboratory (LBNL).

Principal investigator (Inria): Shadi Ibrahim, Myriads team.
Principal investigator (partner): Suren Byna, Lawrence Berkeley National Laboratory, USA.



Data and resource management in multi-site workflows.

Scientific workflows frequently contain concurrently executing operations, including circumstances where it would be optimal to have multiple processes writing to one storage container (traditionally, a container is a file on a parallel file system). Historically, this mode of storing data has been challenging to support without unusual, application-specific workarounds or database-like packages that coordinate modifications. Allowing multiple concurrent writers and multiple readers to interact with a data container without a central server is a challenging task because it requires consensus on modifications to the container's contents. Beyond local writers and readers, distributed data transfer between facilities is an important component of modern experimental and observational data (EOD) workflows. Multi-site workflow management makes data and resource management more challenging. This project aims to explore remote, serverless multiple-writer and multiple-reader (MWMR) technologies that allow concurrent access to shared data containers across multiple sites. The researchers will also explore efficient data movement and data placement strategies to improve the performance of accessing data for processing.
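One common way to sketch serverless MWMR coordination is optimistic concurrency: writers commit against a container version and retry on conflict, so no central lock server is needed. The sketch below is purely illustrative (all class and function names are hypothetical, and a local lock stands in for the storage system's atomic compare-and-swap primitive); it is not the project's actual design.

```python
import threading

class ServerlessContainer:
    """Illustrative shared data container. Writers coordinate through
    optimistic concurrency (compare-and-swap on a version number)
    rather than a central coordination server. Hypothetical names."""

    def __init__(self):
        # Stand-in for the storage system's atomic compare-and-swap.
        self._lock = threading.Lock()
        self._version = 0
        self._data = {}

    def read(self):
        # Readers take a consistent snapshot tagged with its version.
        with self._lock:
            return self._version, dict(self._data)

    def try_commit(self, base_version, updates):
        # A writer's commit succeeds only if nobody else committed
        # since it read `base_version`; otherwise it must retry.
        with self._lock:
            if self._version != base_version:
                return False  # conflict: another writer won
            self._data.update(updates)
            self._version += 1
            return True

def write_with_retry(container, updates):
    # Retry loop: re-read the snapshot and re-attempt the commit.
    while True:
        version, _snapshot = container.read()
        if container.try_commit(version, updates):
            return

c = ServerlessContainer()
write_with_retry(c, {"temperature": 300.0})
write_with_retry(c, {"pressure": 1.2})
version, data = c.read()
print(version, data)  # 2 {'temperature': 300.0, 'pressure': 1.2}
```

The design choice mirrors the consensus requirement noted above: correctness comes from detecting conflicting versions at commit time rather than from serializing all writers through one server.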

Distributed burst buffer management.

Burst Buffers (BBs) are an effective solution to accelerate I/O in HPC systems [5]. Traditionally, BBs in HPC systems are used to minimize I/O time by absorbing the checkpointing data of scientific applications. Recently, BBs have also been explored as temporary storage for intermediate data – in between jobs, iterations, and tasks. However, given the distributed nature of scientific workflows and the high correlation between scientific data, in this project we will extend data management in BBs to multiple sites: we will introduce global data management – including data prefetching, data eviction, etc. – that considers the data type (e.g., checkpointing data, intermediate data, stream data), the relations between those data, and the load and I/O interference at different HPC sites.
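A minimal sketch of what type-aware eviction could look like, assuming hypothetical per-type weights (checkpoint data retained more aggressively than stream data) combined with recency. The class, weights, and scoring rule are illustrative assumptions, not the policy the project will actually adopt.

```python
# Assumed priorities per data type: higher weight = evicted later.
TYPE_WEIGHT = {"checkpoint": 3.0, "intermediate": 2.0, "stream": 1.0}

class BurstBuffer:
    """Illustrative burst buffer with a type- and recency-aware
    eviction policy. All names and weights are hypothetical."""

    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.used = 0.0
        self.objects = {}  # name -> (size_gb, data_type, last_access)

    def put(self, name, size_gb, data_type, now):
        # Evict until the new object fits.
        while self.used + size_gb > self.capacity:
            self.evict(now)
        self.objects[name] = (size_gb, data_type, now)
        self.used += size_gb

    def evict(self, now):
        # Score = type weight / staleness: the lowest-priority,
        # stalest object is evicted first.
        def score(item):
            _size, dtype, last = item[1]
            return TYPE_WEIGHT.get(dtype, 1.0) / (1.0 + (now - last))
        victim = min(self.objects.items(), key=score)[0]
        size, _, _ = self.objects.pop(victim)
        self.used -= size

bb = BurstBuffer(capacity_gb=2.0)
bb.put("ckpt-001", 1.0, "checkpoint", now=0)
bb.put("stream-a", 1.0, "stream", now=0)
bb.put("stage2-out", 1.0, "intermediate", now=10)  # forces one eviction
print(sorted(bb.objects))  # stream data evicted before the checkpoint
```

A multi-site version would additionally fold cross-site load and I/O interference into the score, as described above; this sketch only shows the single-buffer mechanics.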

Metadata management and provenance management.

Capturing the scientific process and the decisions taken throughout that process provides a crucial basis for result analysis, evaluation, validation, and reproducibility. Metadata and provenance play critical roles in capturing information about the scientific process. With the rapidly increasing volume, complexity, and pace of data generation and consumption, a particular focus on metadata and provenance management is necessary to support science data curation. The collection, in-flight use, organization, maintenance, augmentation, versioning, search, and analysis of scientific data throughout the life cycle all require a systematic focus on metadata in order to meet science needs and to preserve data provenance. The nature, volume, and rates of scientific data production require concurrent, performant, and scalable metadata management techniques. We plan to design and develop metadata extraction from datasets, as well as storage and indexing data structures for faster metadata search. This project will also explore strategies for capturing provenance, such as which users generate and analyze data and the patterns of data accesses, and for analyzing the provenance data to proactively place data closer to the scientists analyzing it.
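The indexing idea can be illustrated with a small sketch: extracted key-value metadata is fed into an inverted index so attribute queries intersect posting lists instead of scanning datasets. The class, attribute names, and file names below are hypothetical examples, not the project's actual data structures.

```python
from collections import defaultdict

class MetadataIndex:
    """Illustrative inverted index over extracted dataset metadata.
    Maps (attribute, value) pairs to the datasets that carry them,
    so searches avoid scanning the datasets themselves."""

    def __init__(self):
        self.index = defaultdict(set)  # (key, value) -> dataset names

    def ingest(self, name, metadata):
        # "Extraction" here is simply recording supplied key-value pairs.
        for key, value in metadata.items():
            self.index[(key, str(value))].add(name)

    def search(self, **query):
        # Intersect posting lists: datasets matching every attribute.
        hits = [self.index[(k, str(v))] for k, v in query.items()]
        return set.intersection(*hits) if hits else set()

idx = MetadataIndex()
idx.ingest("run42.h5", {"instrument": "ALS", "year": 2019, "type": "image"})
idx.ingest("run43.h5", {"instrument": "ALS", "year": 2018, "type": "image"})
print(idx.search(instrument="ALS", year=2019))  # {'run42.h5'}
```

Provenance capture would layer on top of this, e.g. by indexing access records per user so that frequently co-accessed data can be placed near the analysts using it.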

Stream data processing in heterogeneous environments.

The evolution of Big Data, the advances in computation capacity of HPC systems, and the emergence of in-memory data processing engines like Spark and Flink have fostered the adoption of stream data processing applications in HPC systems, for example in-situ analysis and data filtering. These data streams may be generated at distributed locations (multiple sites). This poses several challenges when processing distributed data streams, including the heterogeneity of the infrastructure and the volatility of stream data. In this project, we will define new metrics that reflect the correctness of stream applications – exact results are not required when filtering data – and explore the tradeoff between correctness, locality, and performance, toward introducing new scheduling and placement policies for jobs and tasks across multiple HPC sites.
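The correctness-versus-performance tradeoff can be illustrated with a toy approximate filter: only a sample of the stream is evaluated, and the selectivity of the full stream is estimated from the sample. The function and parameters are hypothetical; an actual correctness metric would bound the estimation error formally rather than just report an estimate.

```python
import random

def approx_filter(stream, predicate, sample_rate):
    """Illustrative approximate stream filter (hypothetical sketch).
    Evaluates the predicate on a random sample of the stream, trading
    exactness for reduced work, and estimates the true match count."""
    kept = []
    for item in stream:
        if random.random() < sample_rate:
            if predicate(item):
                kept.append(item)
    # Scale the sampled matches up to estimate matches in the full
    # stream; a correctness metric would bound this estimate's error.
    estimate = len(kept) / sample_rate if sample_rate > 0 else 0.0
    return kept, estimate

random.seed(0)
data = list(range(10_000))
kept, est = approx_filter(data, lambda x: x % 2 == 0, sample_rate=0.1)
exact = sum(1 for x in data if x % 2 == 0)
print(exact, round(est))  # the estimate tracks the exact count
```

Lowering `sample_rate` cuts per-site processing cost (attractive on loaded or slow sites) at the price of a looser estimate, which is exactly the correctness/locality/performance tradeoff the scheduling policies above would navigate.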

© Shadi Ibrahim 2019