KERSTREAM

Big Data Processing: Beyond Hadoop!

Inria

KERSTREAM is a JCJC project funded by the French National Research Agency (ANR). The project aims to address the limitations of Hadoop when running stream Big Data applications on large-scale clouds, and to go a step beyond Hadoop by proposing a new approach, called KERSTREAM, for scalable and resilient stream Big Data processing on clouds.

Duration: 60 months (2017-2022)

Budget: 238,000 EUR

Coordinator: Shadi Ibrahim (Inria Research Scientist)

Recent Highlights

  • One PhD position and one post-doc position funded by the KerStream project are available. Please feel free to email me your resume. Research Openings [Details]

  • [Paper] 2018: Our paper on Energy-Efficient Speculative Execution using Advanced Reservation has been accepted at ICPP 2018.

  • [Paper] 2018: Our paper on Dual-Paradigm Stream Processing has been accepted at ICPP 2018.

  • [Paper] 2018: Our paper on Low-Latency Data Stream Processing has been accepted at ICDCS 2018.

  • [Paper] 2018: Our paper on Network-Aware Virtual Machine Image Management in Geo-Distributed Clouds has been accepted at CCGrid 2018.

  • [Paper] 2018: Our paper on the Performance of Spark on HPC Systems has been accepted at SupercomputingAsia 2018 (SCA18). Best Paper Candidate.

  • [Paper] 2018: Our paper on Improving the effectiveness of burst buffers for big data processing in HPC systems has been accepted by the FGCS journal (2018).


The evolution of Big Data, the introduction of cloud computing and the success of the MapReduce model have fostered new types of data-intensive applications where obtaining fast and timely results is a must (i.e., stream data applications). Stream data applications are emerging as first-class citizens in large-scale production data centers (e.g., click-stream analysis, network-monitoring log analysis, abuse prevention, etc.).

Hadoop has recently emerged as by far the most popular middleware for Big Data processing on clouds. However, Hadoop cannot deal with low-latency stream processing because data needs to be stored in the distributed file system. While several systems have been introduced to process stream data applications, they still provide only best-effort guarantees when failures occur (failure is a natural consequence of the explosion of scale). Moreover, they are designed to run in dedicated, "controlled" environments and therefore suffer from unpredictable performance when running on large-scale clouds due to resource contention, performance variation and high failure rates.

The KerStream project aims to address the limitations of Hadoop, and to go a step beyond Hadoop through the development of a new approach, called KerStream, for reliable stream Big Data processing on clouds. KerStream keeps computation in memory to meet the low-latency requirements of stream data computations. Furthermore, KerStream will embrace a set of techniques that allow running applications to automatically adapt to performance variation and node failures/subfailures, and that enable a smart choice of failure handling techniques. Moreover, KerStream will provide a set of scheduling policies that allow multiple running applications to meet their QoS requirements (low latency for stream data processing) while achieving high resource utilization.


Project Participants

  • One PhD position is still available

  • One Post-doc position is still available

  • [Not Funded By ANR]

    • Amelie Chi Zhou, a former post-doctoral fellow funded by Inria and supervised by me. She is now at Shenzhen University in China.

    • Orcun Yildiz, a former PhD student funded by Inria and co-advised with Gabriel Antoniu (Inria, Irisa). Orcun defended his PhD in December 2017.

    • Tien-Dat Phan, a former PhD student funded by University of Rennes 1 and co-advised with Luc Bouge (ENS Rennes, Irisa). Tien-Dat defended his PhD in November 2017.

    • Yacine Taleb, a PhD student funded by the European Training Network (ETN) BigStorage project and co-advised with Gabriel Antoniu (ENS Rennes, Irisa) until April 2017.

  • External Collaborators:

    • Prof. Hai Jin and Prof. Song Wu from Huazhong University of Science and Technology (HUST) in China.
      In 2017, thanks to the 2017 programme Jeunes Talents France-Chine, Shadi Ibrahim visited HUST for 10 days. In addition to advertising the PhD and post-doc positions, the visit was very fruitful: we successfully resumed our collaboration after a long interruption and identified relevant research topics to work on together, mainly related to stream data processing.


Academic External Supporters


Case Studies

  • RENATER. French Research and Education Network

    • Real-life workloads provided by RENATER to analyze data network streams

  • Synthetic benchmarks:


Objectives [Current results and deliverables]

This project aims to address the limitations of Hadoop when running stream data applications on clouds and to go a step beyond Hadoop by proposing a new approach, called KERSTREAM, for fast, resilient, and scalable stream data processing on clouds. We expect this project to bring substantial innovative contributions with respect to the following aspects:

[Objective 1] An approach for fast and scalable stream data processing on clouds.

We plan to design a new approach for in-memory stream data processing, named KerStream. Unlike other stream data processing frameworks (e.g., Spark and Flink), which are designed for a controlled environment (i.e., they assume homogeneous resources), KerStream will be equipped with a performance-aware task scheduling framework to cope with the unpredictable performance variability in clouds (network, I/O, etc.).

[Objective 2] A smart configurable failure/subfailure recovery.

We will propose a set of techniques (e.g., machine learning techniques) and algorithms that allow stream data processing to automatically adapt to node failures and subfailures (i.e., stragglers) and that enable the “smart” choice of failure/subfailure handling techniques.


[Objective 3] An adaptive job scheduling framework for stream data applications.

An adaptive job scheduling framework will be developed on top of KERSTREAM. The framework will be equipped with several scheduling policies that can be adaptively tuned in response to the application’s behavior and requirements.

We have been working on scientific challenges related to Tasks 1 and 2 and have made progress on both. The research results will serve as a basis for the hired PhD student and post-doc, and will thus facilitate the work towards achieving the project objectives, in particular Objective 1 and Objective 2.


Towards the design of an architecture for dependable in-memory stream data processing (subtask in Task 1), we have performed the following studies:

• First, we evaluated the performance of Spark (a widely used Big Data analysis engine) in different environments (i.e., cloud, where the storage is attached to the compute nodes, and HPC, where the storage is separated from the compute nodes), and we explored how to use in-memory buffer systems to improve performance in the latter scenario.
This work was realized in collaboration with Orcun Yildiz and Amelie Chi Zhou and resulted in three publications in Cluster 2017, FGCS 2018 and SCA 2018.

• Second, we performed a performance characterization study of RAMCloud, a representative in-memory storage system that offers low latency and is a potential candidate for storing data when processing stream data applications. This study will help complete the picture and identify the most appropriate in-memory storage system for in-memory stream data processing engines.
This work was realized in collaboration with Yacine Taleb and resulted in one publication in ICDCS 2017.

• Third, we have introduced two approaches to reduce the latency of stream data processing by reducing memory-copy operations and efficiently handling data bursts.
This work was realized in collaboration with Hai Jin and Song Wu and resulted in two publications in ICDCS 2018 and ICPP 2018.

• Fourth, we introduced a new method for graph partitioning that addresses high latency and heterogeneity and thus improves the response time of graph processing applications. We plan to port the results to stream processing applications, especially as data stream processing applications are often modelled as a directed acyclic graph: operators with data streams among them.
This work was realized in collaboration with Amelie Chi Zhou and Bingsheng He and resulted in one publication in ICDCS 2017.

Towards effectively mitigating node failures and subfailures (i.e., stragglers) in stream data processing applications (subtask in Task 2), we have investigated a new technique to handle stragglers and introduced a new scheduling framework to handle stragglers in heterogeneous environments.
This work was realized in collaboration with Tien-Dat Phan, Amelie Chi Zhou and Bingsheng He and resulted in two publications in Euro-Par 2017 and ICPP 2018.

Missions, Visits, and other activities

  • [August 2018] Shadi Ibrahim is attending ICPP 2018

  • [November 2017] Shadi Ibrahim visited Huazhong University of Science and Technology

  • [September 2017] Orcun Yildiz presented the Eley paper at Cluster 2017

  • [August 2017] Shadi Ibrahim attended ICA3PP 2017

  • [August 2017] Tien-Dat Phan presented the paper on energy-driven straggler mitigation at Euro-Par 2017

  • [June 2017] Amelie Chi Zhou presented the paper on graph processing at ICDCS 2017


Dissemination and Exploitation of Results (Publications)

2018

  • [ICPP 2018-Stragglers] Amelie Chi Zhou^, Tien-Dat Phan*, Shadi Ibrahim, Bingsheng He, Energy-Efficient Speculative Execution using Advanced Reservation for Heterogeneous Clusters, In ICPP 2018.

  • [ICPP 2018-Stream] Song Wu, Zhiyi Liu, Shadi Ibrahim, Lin Gu, Hai Jin, and Fei Chen, Dual-Paradigm Stream Processing, In ICPP 2018.

  • [ICDCS 2018] Song Wu, Mi Liu, Shadi Ibrahim, Hai Jin, Lin Gu, Fei Chen and Zhiyi Liu, TurboStream: Towards Low-Latency Data Stream Processing, In ICDCS 2018.

  • [FGCS 2018] Orcun Yildiz*, Amelie Chi Zhou^, Shadi Ibrahim, Improving the effectiveness of burst buffers for big data processing in HPC systems with Eley, In FGCS 2018.

  • [SCA 2018] Orcun Yildiz*, Shadi Ibrahim, On the Performance of Spark on HPC Systems: Towards a Complete Picture, In SCA 2018 (Best Paper Candidate).

2017

  • [Cluster 2017] Orcun Yildiz*, Amelie Chi Zhou^, Shadi Ibrahim, Eley: On the Effectiveness of Burst Buffers for Big Data Processing in HPC systems, In Cluster 2017 (Short Paper).

  • [Euro-Par 2017] Tien-Dat Phan*, Shadi Ibrahim, Amelie Chi Zhou^, Guillaume Aupy, Gabriel Antoniu, Energy-Driven Straggler Mitigation in MapReduce, In Euro-Par 2017.

  • [ICDCS 2017-Storage] Amelie Chi Zhou^, Shadi Ibrahim, Bingsheng He, On Achieving Efficient Data Transfer for Graph Processing in Geo-Distributed Datacenters, In ICDCS 2017 (Applications and Experiences Track).

  • [ICDCS 2017-Scheduling] Yacine Taleb*, Shadi Ibrahim, Gabriel Antoniu, Toni Cortes, Characterizing Performance and Energy-efficiency of The RAMCloud Storage System, In ICDCS 2017 (Applications and Experiences Track).


© Shadi Ibrahim 2018