Inria Rennes - Bretagne Atlantique Research Center

Job Openings (Master Internships)

Several Master internships funded by the ANR KerStream project and the Hermes associate team are available. Please feel free to email me your resume.

Distributed Burst Buffer System in Collaborative Research
Scientific advisor: Shadi IBRAHIM (Myriads team)

Context:
Many collaborative research applications require data to be shared across computing centers (i.e., supercomputing facilities), including input and intermediate data. The typical route for transferring and analyzing data between multiple sites is to store the data on the file systems of a source facility and allow analysis applications at the destination to read from those file systems, which may result in high latency and low resource utilization. Meanwhile, Burst Buffers (BBs) are an effective solution for reducing data transfer time [1, 2] and I/O interference in HPC systems [3, 4]. Previous BB solutions have been used mainly within a single system or a single supercomputing facility, but not across supercomputing facilities.

Objectives:
The goal of this project is to investigate how to seamlessly aggregate the Burst Buffers of potentially many supercomputing facilities and enable fast data movement across these sites, considering both the applications' characteristics and the status of the sites (resource usage, energy consumption, etc.). After a review of the literature on using Burst Buffers in HPC systems, the student will introduce performance models for collaborative research applications when data are shared at the file-system level or across Burst Buffers. Based on these models, the student will design and evaluate a prefetching technique that proactively moves data closer to the destination sites to reduce the response time of collaborative applications and improve the resource utilization (and energy efficiency) of supercomputing facilities.
This project will be done in collaboration with Suren Byna from LBNL, USA.
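For intuition only (not a project deliverable), the prefetching idea can be sketched as a toy in-memory simulation: files an analysis job will read are staged into the destination site's buffer before the job starts, so reads hit locally instead of going to the remote file system. The dictionaries, file names, and thread pool are purely illustrative stand-ins for real burst buffers and network transfers:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the burst buffers of two facilities;
# a real system would move data over the network between BB nodes.
source_bb = {"input.h5": b"raw data", "checkpoint.h5": b"intermediate"}
dest_bb = {}

def prefetch(filename):
    """Proactively copy one file from the source BB to the destination BB."""
    dest_bb[filename] = source_bb[filename]
    return filename

def prefetch_all(needed, workers=4):
    """Stage every file an analysis job will read before the job starts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(prefetch, needed))

staged = prefetch_all(["input.h5", "checkpoint.h5"])
```

The research questions above (which data to move, when, and to which site) are exactly what this naive sketch leaves open.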

[1] N. Liu et al., “On the role of burst buffers in leadership-class storage systems,” 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012, pp. 1-11, doi: 10.1109/MSST.2012.6232369.
[2] B. Dong et al., “Data Elevator: Low-Contention Data Movement in Hierarchical Storage System,” 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016, pp. 152-161, doi: 10.1109/HiPC.2016.026.
[3] A. Kougkas, M. Dorier, R. Latham, R. Ross and X. Sun, “Leveraging burst buffer coordination to prevent I/O interference,” 2016 IEEE 12th International Conference on e-Science (e-Science), 2016, pp. 371-380, doi: 10.1109/eScience.2016.7870922.
[4] O. Yildiz, A. C. Zhou, S. Ibrahim, “Improving the Effectiveness of Burst Buffers for Big Data Processing in HPC Systems with Eley,” Future Generation Computer Systems, Volume 86, 2018.

Skew mitigation in massively distributed data analytics
Scientific advisor: Shadi IBRAHIM (Myriads team)

Context:
Data analytics frameworks such as Hadoop, Spark and Flink have recently been used to process a wide variety of Big Data applications on Fog and Edge infrastructures. Big Data applications (MapReduce, machine learning and deep learning applications) are often modelled as a directed acyclic graph (DAG): operators (functions) with data flows among them. Typically, tasks executing the same function are grouped into a stage. For example, MapReduce jobs consist of two consecutive stages (Map and Reduce), and the output of the Map stage is transferred to the input of the Reduce stage. The completion time of a stage strongly depends on the finishing time of the last task in that stage. Accordingly, much work has focused on balancing the execution of tasks of the same stage by evenly partitioning data across them [1][2][3][4]. However, existing data partitioning mechanisms are limited to two-stage applications and assume that compute and I/O resources are homogeneous across the platform, making them impractical for machine learning and deep learning applications and for heterogeneous infrastructures such as Fog computing.

Objectives:
In this internship, we want to investigate new techniques to mitigate skew in data analytics frameworks deployed in the Fog, and to introduce an effective data partitioning mechanism that balances the workload among tasks of the same stage while also reducing data transfer time between stages.
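To make the partitioning problem concrete, here is a minimal sketch of greedy skew-aware key partitioning in the spirit of LEEN [1] (locality and fairness aside): heavy keys are assigned first, each to the currently least-loaded partition. The key frequencies are invented for illustration:

```python
import heapq
from collections import Counter

def balanced_partition(key_counts, n_partitions):
    """Greedily assign keys (heaviest first) to the least-loaded partition,
    so no single reduce task receives a disproportionate share of records."""
    heap = [(0, p) for p in range(n_partitions)]  # (current load, partition)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))
    return assignment

# Skewed key distribution: one hot key ("a") dominates the others.
counts = Counter({"a": 90, "b": 30, "c": 25, "d": 20, "e": 15})
parts = balanced_partition(counts, 2)
```

With a naive hash partitioner, the hot key could land together with others on one partition; the greedy assignment above isolates it, yielding two partitions of equal load (90 records each). Handling multi-stage DAGs and heterogeneous resources, as targeted by this internship, requires going well beyond this two-stage, homogeneous sketch.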

[1] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He and L. Qi, "LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud," 2010 IEEE Second International Conference on Cloud Computing Technology and Science.
[2] B. Gufler, N. Augsten, A. Reiser and A. Kemper, "Load Balancing in MapReduce Based on Scalable Cardinality Estimates," 2012 IEEE 28th International Conference on Data Engineering.
[3] Q. Chen, J. Yao and Z. Xiao, "LIBRA: Lightweight Data Skew Mitigation in MapReduce," in IEEE Transactions on Parallel and Distributed Systems.
[4] Z. He, Z. Li, X. Peng and C. Weng, "DS2: Handling Data Skew Using Data Stealings over High-Speed Networks," 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021.

Accelerating the deployment of latency-critical services in shared Fog infrastructure: On the role of container image placement
Scientific advisor: Shadi IBRAHIM (Myriads team)

Context:
Fog computing, which promises to extend Clouds by moving computation close to data sources, has been successfully deployed and used in practice to support short-running and latency-critical applications and services [1]. Providing fast and predictable service provisioning time presents a new and mounting challenge as the scale of Fog servers grows and the heterogeneity of the networks between them increases [2].

Objectives:
In this internship, we want to review the state-of-the-art research on container image placement and retrieval in the Fog, and to model and evaluate its performance when multiple container images are retrieved concurrently. Based on these models and evaluation results, the student will design and evaluate new container image placement and retrieval strategies that account for both container image sharing and network path sharing. These strategies can be implemented and evaluated using a simulator (developed in our previous work [3]) and in Docker [4].
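As a starting intuition, consider a toy retrieval-time model: container images are made of layers, layers already cached on a node need not be downloaded again, and retrieval time is the missing bytes divided by the node's download bandwidth. The greedy placement below picks the node with the cheapest retrieval; all layer names, sizes, and bandwidths are made up for illustration, and the model deliberately ignores the network path sharing that the internship targets:

```python
# Illustrative layer sizes (MB) shared across container images.
layer_size = {"os": 200, "runtime": 80, "app": 20}

def retrieval_time(image_layers, cached_layers, bandwidth_mbps):
    """Seconds to fetch the layers the node is still missing."""
    missing = sum(layer_size[l] for l in image_layers if l not in cached_layers)
    return missing / bandwidth_mbps

def place(image_layers, nodes):
    """Greedy: run the container on the node with the smallest retrieval
    time, exploiting layers already cached there."""
    return min(nodes, key=lambda n: retrieval_time(
        image_layers, nodes[n]["layers"], nodes[n]["bw"]))

nodes = {"edgeA": {"layers": {"os"}, "bw": 10},
         "edgeB": {"layers": {"os", "runtime"}, "bw": 5}}
target = place(["os", "runtime", "app"], nodes)
```

Here the slower node ("edgeB") still wins because it already caches most layers (20 MB / 5 MB/s = 4 s versus 100 MB / 10 MB/s = 10 s on "edgeA"); when many images are retrieved concurrently over shared paths, this per-node view breaks down, which is precisely the gap to address.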

[1] E. G. Renart, J. Diaz-Montes, and M. Parashar, “Data-Driven Stream Processing at the Edge,” in ICFEC, 2017.
[2] T. Harter, B. Salmon, R. Liu, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Slacker: Fast distribution with lazy docker containers,” in FAST, 2016.
[3] J. Darrous, T. Lambert, and S. Ibrahim, “On the Importance of Container Image Placement for Service Provisioning in the Edge,” in ICCCN, 2019.
[4] Docker homepage. [Online].

Addressing resource and data dynamics for stream data applications in the Edge
Scientific advisor: Shadi IBRAHIM (Myriads team)

Context:
The mutual low-latency objective of both Data Stream Processing (DSP) and Edge environments has driven a continuous growth of DSP deployments on Edge and Fog environments. The success of DSP deployments in the Edge relies on operator placement and the ability to sustain low latency. Accordingly, much work has focused on placement strategies across Edge servers or across hybrid Cloud and Edge environments. Previous efforts have focused on reducing the volume of communication between nodes (inter-node communication) and on dividing the computation between Edge servers and Clouds. Unfortunately, they are oblivious to (1) the dynamic nature of data streams (i.e., data volatility and bursts) and (2) the bandwidth and resource heterogeneity of the Edge, which negatively affects the performance of stream data applications.
In recent work, we addressed the problem of data stream dynamicity. In particular, we showed that the Maximum Sustainable Throughput (MST) -- the amount of data that a DSP system can ingest while keeping stable performance -- should be considered as an optimization objective for operator placement in the Edge. Accordingly, we designed and evaluated an MST-driven operator placement strategy (based on constraint programming) for stream data applications [1].
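For intuition, the MST of a simple linear operator pipeline can be modelled as the bottleneck over node processing rates and inter-node link bandwidths. The brute-force search below is only a toy stand-in for the constraint-programming formulation of [1], with all node names, rates, and bandwidths invented for illustration:

```python
from itertools import product

# Toy Edge platform: per-node processing rate and per-link bandwidth,
# both in records/s; values are purely illustrative.
node_rate = {"edge1": 100, "edge2": 60, "cloud": 300}
bandwidth = {("edge1", "edge2"): 40, ("edge1", "cloud"): 20,
             ("edge2", "cloud"): 25}
bandwidth.update({(b, a): bw for (a, b), bw in list(bandwidth.items())})

def mst(placement, pipeline):
    """MST of a linear pipeline: the slowest node or cross-node link."""
    rates = [node_rate[placement[op]] for op in pipeline]
    links = [bandwidth[(placement[a], placement[b])]
             for a, b in zip(pipeline, pipeline[1:])
             if placement[a] != placement[b]]
    return min(rates + links)

def best_placement(pipeline, nodes):
    """Exhaustively try every operator-to-node assignment (toy scale only)."""
    return max((dict(zip(pipeline, assign))
                for assign in product(nodes, repeat=len(pipeline))),
               key=lambda pl: mst(pl, pipeline))

pipeline = ["source", "filter", "sink"]
placement = best_placement(pipeline, list(node_rate))
```

In this toy instance the weak inter-node links make colocating the whole pipeline on the fastest node optimal; with realistic data sources pinned to Edge nodes, bursty streams, and changing resources, the trade-off becomes far less trivial, which motivates the project below.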

Objectives:
The goal of this project is to extend our previous work to consider resource dynamics in the Edge. The student will investigate how to enable dynamic operator placement in the Edge at minimal cost: for example, how operators can be scaled out or replaced without frequently stopping and resuming the whole application. The student will start by learning how stream data applications are deployed at large scale and by studying the performance of current (and newly developed) state-of-the-art operator placements, both theoretically and experimentally (using Storm [2] and Grid'5000).

[1] Thomas Lambert, David Guyon, and Shadi Ibrahim, “Rethinking Operators Placement of Stream Data Application in the Edge,” in The 29th ACM International Conference on Information and Knowledge Management (CIKM ’20), October 19-23, 2020, Virtual Event, Ireland.
[2] Apache Storm. 2020.
