Job Opening (Master Internships)
Several Master internships are available. Please feel free to email me your resume.
Towards reliable and cost-efficient intermediate data management in data-intensive clusters
Scientific advisor: Shadi IBRAHIM (Myriads team)
Big Data applications (MapReduce, machine learning and deep learning applications) are often modelled as a directed acyclic graph (DAG): operators (functions) with data flows among them. Typically, tasks executing the same function are grouped into a stage. The intermediate data produced in one stage (iteration) are used as inputs for the next stage. Usually, intermediate data are distributed across the local disks of the nodes. Thus, in case of failure, all the tasks that ran on the failed machine will be re-executed to regenerate the lost intermediate data. This prolongs the starting time of the next stage and degrades the performance of Big Data applications. An alternative solution is to replicate the intermediate data across several nodes. However, replication can cause very high overhead in terms of storage space and writing time [1, 2].
Recently, erasure coding (EC) has been used in storage systems (e.g., for input data in data analytic frameworks and in caching systems) to provide the same fault-tolerance guarantee as replication while reducing storage cost. For example, under the Reed-Solomon code RS(6,3), a data block is split into 6 smaller blocks called data chunks, which are then used to compute 3 parity chunks. Any 6 out of the 9 chunks are sufficient to rebuild the original data block. Thus, RS(6,3) can tolerate 3 simultaneous node failures while reducing the storage overhead by 50% compared to three-way data replication. Previous efforts to explore EC for high availability of intermediate data [5, 6] have blindly adopted EC without considering the functionality of chunks (i.e., data or parity chunks) or the I/O contention when writing and reading data.
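As a quick sanity check on these numbers, the storage overhead of RS(k, m) versus n-way replication can be computed directly (an illustrative sketch; the helper names are ours, not from any framework):

```python
def ec_overhead(k: int, m: int) -> float:
    """Storage overhead of RS(k, m): total chunks stored per unit of data."""
    return (k + m) / k

def replication_overhead(copies: int) -> float:
    """Storage overhead of n-way replication."""
    return float(copies)

# RS(6,3): 9 chunks stored for 6 chunks of data -> 1.5x storage,
# i.e. 50% less than the 3x of three-way replication, while tolerating
# m = 3 simultaneous node failures (vs. 2 for three-way replication).
rs = ec_overhead(6, 3)          # 1.5
rep = replication_overhead(3)   # 3.0
print(rs, rep, 1 - rs / rep)    # 1.5 3.0 0.5
```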
The goal of this project is to investigate how to efficiently adopt EC for intermediate data in Big Data applications. After a review of the literature on using EC in data-intensive clusters, the student will study how to place data chunks and parity chunks across nodes to balance I/O load and network traffic and therefore ensure fast and cost-efficient (in terms of monetary cost and energy consumption) execution of Big Data applications. Depending on the progress achieved during the internship, this work could lead to the publication of a research paper and could be followed by a PhD thesis on a related topic.
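One simple baseline the internship could start from is a greedy least-loaded placement that treats data and parity chunks differently: data chunks, which are read on every access, are placed first on the least-contended nodes, while parity chunks, read only during recovery, take the remaining nodes (a hypothetical sketch of ours, not a prescribed solution):

```python
def place_chunks(data_chunks, parity_chunks, node_load):
    """Greedy least-loaded placement, one chunk per node (so a single
    node failure loses at most one chunk of the block). Data chunks are
    placed first so the frequently read chunks land on the least-loaded
    nodes. node_load maps node -> current pending I/O load; returns a
    dict mapping chunk -> node."""
    placement, load = {}, dict(node_load)
    for chunk in list(data_chunks) + list(parity_chunks):
        node = min(load, key=load.get)  # currently least-loaded node
        placement[chunk] = node
        del load[node]                  # at most one chunk per node
    return placement
```

This ignores network topology and rack awareness, which a real strategy would also have to balance against I/O load.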
[1] J. Darrous and S. Ibrahim. Understanding the Performance of Erasure Codes in Hadoop Distributed File System. In Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems (CHEOPS '22), 2022.
[2] Y. Taleb, S. Ibrahim, G. Antoniu, T. Cortes. Characterizing Performance and Energy-Efficiency of the RAMCloud Storage System. In ICDCS 2017: 37th IEEE International Conference on Distributed Computing Systems.
[3] J. Darrous, S. Ibrahim, C. Perez. Is it Time to Revisit Erasure Coding in Data-intensive Clusters? In MASCOTS 2019.
[4] K. V. Rashmi, M. Chowdhury, J. Kosaian, I. Stoica, K. Ramchandran. EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16), pages 401-417, 2016.
[5] Z. Zhang, A. Deshpande, X. Ma, E. Thereska, D. Narayanan. Does Erasure Coding Have a Role to Play in My Data Center? Technical report, Microsoft Research, 2010.
[6] X. Yao, et al. EC-Shuffle: Dynamic Erasure Coding Optimization for Efficient and Reliable Shuffle in Spark. In CCGrid 2019.
On the role of data placement in P2P distributed storage systems
Logistics: The internship will take place at Hive's headquarters in Cannes and will be carried out in collaboration between the MYRIADS team in Rennes (Shadi Ibrahim), the COAST team in Nancy (Thomas Lambert), and the Hive company (Amine Ismail).
The internship will be remunerated at €1,329.05 (SMIC). It will last 4-6 months, with the possibility of continuing with a PhD thesis on a related topic (more details on the PhD subject can be found here: https://recrutement.inria.fr/public/classic/fr/offres/2022-05223).
Hive (https://www.hivenet.com/) intends to play the role of a next-generation cloud provider in the context of Web 3.0. Hive aims to exploit the unused capacity of computers to offer the general public a greener and more sovereign alternative to existing clouds, where the true power lies in the hands of the users. It relies on distributed peer-to-peer networks, end-to-end encryption of data, and blockchain technology.
In the context of the Inria-Hive collaborative framework, which aims at delivering a reliable and secure peer-to-peer storage service, we plan to investigate a set of data placement strategies to ensure highly available and cost-efficient data services when using Erasure Coding (EC) to store the data.
Traditionally, data are replicated to ensure high availability: several copies (usually 3) of the same data block are stored across several storage nodes (one copy per node). Under EC, the original data block is divided into smaller chunks, which are encoded and distributed across the storage nodes. To regenerate the original data, a subset of the chunks must be retrieved (and decoded). Recently, EC has been deployed in hot storage systems to support data analytics systems [1, 2, 3, 4] and caching systems.
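To illustrate the availability trade-off, the probability that a block is readable can be computed for both schemes under an independent per-node availability p (a simplified textbook model, not part of the internship material): replication needs at least 1 of n copies online, while RS(k, m) needs at least k of the k + m chunks.

```python
from math import comb

def ec_availability(k: int, m: int, p: float) -> float:
    """P(block readable) under RS(k, m) when each of the k + m chunk
    holders is independently online with probability p: at least k of
    the n = k + m chunks must be retrievable."""
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def replication_availability(copies: int, p: float) -> float:
    """P(block readable) with n-way replication: at least one copy online."""
    return 1 - (1 - p)**copies

# With flaky P2P nodes (p = 0.9), RS(6,3) at 1.5x storage is slightly
# less available than 3-way replication at 3x storage, which is why
# placement and the choice of (k, m) matter in this setting.
print(ec_availability(6, 3, 0.9), replication_availability(3, 0.9))
```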
The goal of this project is to investigate a new data placement strategy that can ensure high data availability under frequent failures and node unavailabilities in P2P storage systems. After a review of the literature on using EC in distributed storage systems, the student will introduce a novel placement strategy that considers the upload/download bandwidth of storage nodes and the availability of data. Depending on the progress achieved during the internship, this work could lead to the publication of a research paper and could be followed by a PhD thesis on the same subject.
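As a starting point, a bandwidth- and availability-aware selection could score candidate peers before placing the k + m chunks of a block (a hypothetical heuristic of ours, not the strategy to be developed during the internship):

```python
def select_nodes(candidates, n_chunks, w_bw=0.5, w_avail=0.5):
    """Rank candidate peers by a weighted score combining normalized
    upload bandwidth and observed availability, then pick the top
    n_chunks peers (one chunk per peer).
    candidates: dict peer -> (upload_mbps, availability in [0, 1])."""
    max_bw = max(bw for bw, _ in candidates.values())
    def score(peer):
        bw, avail = candidates[peer]
        return w_bw * (bw / max_bw) + w_avail * avail
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:n_chunks]
```

The weights w_bw and w_avail are tuning knobs; a real strategy would also have to refresh availability estimates as peers churn.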
[1] Z. Zhang, A. Deshpande, X. Ma, E. Thereska, D. Narayanan. Does Erasure Coding Have a Role to Play in My Data Center? Technical report, Microsoft Research, 2010.
[2] Z. Zhang, A. Wang, K. Zheng, U. M. G., and V. B. Introduction to HDFS Erasure Coding in Apache Hadoop. https://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop, 2015.
[3] J. Darrous, S. Ibrahim, C. Perez. Is it Time to Revisit Erasure Coding in Data-intensive Clusters? In MASCOTS 2019.
[4] X. Yao, et al. EC-Shuffle: Dynamic Erasure Coding Optimization for Efficient and Reliable Shuffle in Spark. In CCGrid 2019.
[5] K. V. Rashmi, M. Chowdhury, J. Kosaian, I. Stoica, K. Ramchandran. EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16), pages 401-417, 2016.
Addressing resource and data dynamics for stream data applications in the Edge
Scientific advisor: Shadi IBRAHIM (Myriads team)
The mutual low-latency objective of both Data Stream Processing (DSP) and Edge environments has led to a continuous growth of DSP deployments on Edge or Fog environments. The success of DSP deployments in the Edge relies on operator placement and the ability to sustain low latency. Accordingly, much work has focused on placement strategies across Edge servers or across hybrid Cloud and Edge environments. Previous efforts have focused on reducing the volume of communication between nodes (inter-node communication) and on dividing the computation between Edge servers and Clouds. Unfortunately, they are oblivious to (1) the dynamic nature of data streams (i.e., data volatility and bursts) and (2) the bandwidth and resource heterogeneity of the Edge, which negatively affects the performance of stream data applications.
In recent work, we addressed the problem of data stream dynamicity. In particular, we showed that the Maximum Sustainable Throughput (MST), which refers to the amount of data that a DSP system can ingest while keeping stable performance, should be considered as an optimization objective for operator placement in the Edge. Accordingly, we designed and evaluated an MST-driven operator placement (based on constraint programming) for stream data applications.
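In practice, MST is often estimated empirically by probing ingestion rates and checking whether the deployment sustains each rate (e.g., queues and lag stay bounded). A binary-search sketch over a user-supplied `sustains(rate)` predicate (our illustration of the general idea, not the method of the cited work):

```python
def estimate_mst(sustains, low=0.0, high=1e6, tol=1.0):
    """Binary-search the highest ingestion rate (events/s) for which
    sustains(rate) holds, assuming sustainability is monotone in the
    rate. sustains is a caller-provided probe (hypothetical here),
    e.g. one that runs the topology briefly and checks backpressure."""
    if not sustains(low):
        return low
    while high - low > tol:
        mid = (low + high) / 2
        if sustains(mid):
            low = mid      # deployment keeps up: try a higher rate
        else:
            high = mid     # backpressure or growing lag: back off
    return low
```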
The goal of this project is to extend our previous work to consider resource dynamics in the Edge. The student will investigate how to enable dynamic operator placement in the Edge at minimal cost: for example, how can we scale out or relocate operators without frequently stopping and resuming the whole application? The student will start by learning how stream data applications are deployed at large scale, and will then study the performance of current (and newly developed during this project) state-of-the-art operator placements both theoretically and experimentally (using Storm and Grid'5000).
[1] Thomas Lambert, David Guyon, and Shadi Ibrahim. Rethinking Operators Placement of Stream Data Application in the Edge. In the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19-23, 2020, Virtual Event, Ireland.
[2] Apache Storm. 2020. https://storm.apache.org/
Distributed Burst Buffer System in Collaborative Research
Scientific advisor: Shadi IBRAHIM (Myriads team)
Many collaborative research applications require data, including input and intermediate data, to be shared across computing centers (i.e., supercomputing facilities). The typical route for transferring and analyzing data between multiple sites is to store the data on the file systems of the source facility and to let analysis applications at the destination read from those file systems, which may result in high latency and low resource utilization. Meanwhile, Burst Buffers (BBs) are an effective solution for reducing data transfer time [1, 2] and I/O interference in HPC systems [3, 4]. Previous BB solutions have been used mainly within a single system or a single supercomputing facility, but not across supercomputing facilities.
The goal of this project is to investigate how to seamlessly aggregate the Burst Buffers of potentially many supercomputing facilities and enable fast data movement across these sites, considering both the applications' characteristics and the status of the sites (resource usage, energy consumption, etc.). After a review of the literature on using Burst Buffers in HPC systems, the student will introduce performance models for collaborative research applications when data are shared at the file-system level or across Burst Buffers. Based on these models, the student will design and evaluate a prefetching technique that proactively moves data closer to the destination sites to reduce the response time of collaborative applications and improve the resource utilization (and energy efficiency) of supercomputing facilities.
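A first-cut performance model could compare the response time of reading directly from the source facility's file system over the WAN against reading from a destination burst buffer into which a fraction of the data was prefetched (a deliberately simplified model with hypothetical parameters, meant only as a starting point for the models the student will build):

```python
def response_time_fs(size_gb: float, wan_bw: float, fs_read_bw: float) -> float:
    """Read directly from the source file system over the WAN:
    bottlenecked by the slower of FS read and WAN bandwidth (GB/s)."""
    return size_gb / min(wan_bw, fs_read_bw)

def response_time_bb(size_gb: float, wan_bw: float, bb_read_bw: float,
                     prefetched_fraction: float) -> float:
    """The prefetched fraction is served at burst-buffer speed; the
    remainder still comes over the WAN (overlap is ignored here)."""
    hit = prefetched_fraction * size_gb / bb_read_bw
    miss = (1 - prefetched_fraction) * size_gb / wan_bw
    return hit + miss
```

Even this crude model shows the lever the prefetcher controls: response time degrades linearly as the prefetched fraction drops toward zero.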
This project will be done in collaboration with Suren Byna from LBNL, USA.
[1] N. Liu et al. "On the Role of Burst Buffers in Leadership-Class Storage Systems." 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012, pp. 1-11, doi: 10.1109/MSST.2012.6232369.
[2] B. Dong et al. "Data Elevator: Low-Contention Data Movement in Hierarchical Storage System." 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016, pp. 152-161, doi: 10.1109/HiPC.2016.026.
[3] A. Kougkas, M. Dorier, R. Latham, R. Ross and X. Sun. "Leveraging Burst Buffer Coordination to Prevent I/O Interference." 2016 IEEE 12th International Conference on e-Science (e-Science), 2016, pp. 371-380, doi: 10.1109/eScience.2016.7870922.
[4] O. Yildiz, A. C. Zhou, S. Ibrahim. "Improving the Effectiveness of Burst Buffers for Big Data Processing in HPC Systems with Eley." Future Generation Computer Systems, Volume 86, 2018.