Rennes - Bretagne Atlantique Research Center
Cloud & Big data
Data volumes are ever growing, across a broad application spectrum ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this deluge of Big Data, we have recently witnessed a paradigm shift in computing infrastructure through Cloud Computing and in the way data is processed through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing on large-scale infrastructures. Meanwhile, cloud computing continues to serve as a prominent infrastructure for Big Data applications.
The goal of this course is to serve as a first step towards exploring data analytics models and technologies used to handle Big Data, such as MapReduce (and its successors), Hadoop, Spark, and Flink. An overview of Big Data, including definitions, the sources of Big Data, and the main challenges it introduces, will be presented. We will then present the MapReduce programming model as an important programming model for Big Data processing in the Cloud. The Hadoop ecosystem and some of its major features will then be discussed, along with several approaches and methods used to optimize the performance of Hadoop in the Cloud. Finally, we will discuss the limitations of Hadoop and introduce newer Big Data systems, including Spark.
Several hands-on sessions may be provided to study the operation of Hadoop along with the implementation of MapReduce applications.
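To fix the idea before the lectures, here is a minimal, self-contained Python sketch of the MapReduce model described above: a word count where the map phase emits (word, 1) pairs, a shuffle groups pairs by key, and the reduce phase aggregates them. All function names are illustrative, not part of any framework's API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts["the"] == 2
```

In a real deployment the map and reduce functions run in parallel on many machines; the shuffle is performed by the framework over the network.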
Course Schedule and Resources (a work in progress)
Lectures
- Introduction to Big data [pdf]
The goal of this lecture is to provide an overview of Big Data, including definitions, the sources of Big Data, and the main challenges it introduces.
- MapReduce System: Google File System (GFS) and MapReduce programming model [pdf] [pdf]
The goal of these two lectures is to introduce the main design goals and features of the Google File System (GFS) and the Google MapReduce programming model. For both systems we will cover issues such as fault tolerance and data access.
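As a toy illustration of the GFS design discussed in these lectures, the sketch below splits a file into fixed-size 64 MB chunks (GFS's chunk size) and assigns each chunk three replicas on distinct chunkservers, the default replication factor. The round-robin placement is a simplification; the real master also considers disk usage and rack topology, and the server names are hypothetical.

```python
import math
from itertools import cycle

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks
REPLICAS = 3                   # default replication factor

def place_chunks(file_size, servers):
    """Split a file into chunks and give each chunk REPLICAS replicas
    on distinct chunkservers (simplified round-robin placement)."""
    assert len(servers) >= REPLICAS, "need at least REPLICAS servers"
    n_chunks = math.ceil(file_size / CHUNK_SIZE)
    ring = cycle(servers)
    placement = {}
    for chunk_id in range(n_chunks):
        replicas = []
        while len(replicas) < REPLICAS:
            s = next(ring)
            if s not in replicas:
                replicas.append(s)
        placement[chunk_id] = replicas
    return placement

servers = ["cs1", "cs2", "cs3", "cs4"]
plan = place_chunks(200 * 1024 * 1024, servers)  # a 200 MB file -> 4 chunks
```

In GFS only the chunk-to-server mapping lives on the master; the data itself flows directly between clients and chunkservers.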
- Hadoop Ecosystem, Yarn, Spark and Flink [pdf][pdf][MR-Python][link]
The goal of these two lectures is to introduce Hadoop, the most widely used open-source implementation of MapReduce. We will discover the Hadoop ecosystem and discuss several optimizations in Hadoop, including straggler mitigation and job scheduling. We will also discuss the limitations of Hadoop and introduce frameworks that improve resource management in Hadoop, including Yarn, as well as the Spark framework, which targets iterative and streaming applications.
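As a rough illustration of the straggler mitigation discussed in these lectures, here is a minimal sketch of the progress-based heuristic classic Hadoop uses to select candidates for speculative (backup) execution: a task is suspect when its progress score falls well below the average. Task names are hypothetical, and real Hadoop also factors in how long each task has been running.

```python
def find_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress score is more than `threshold` below
    the average across all running tasks -- simplified from Hadoop's
    speculative-execution heuristic."""
    avg = sum(progress.values()) / len(progress)
    return [task for task, p in progress.items() if p < avg - threshold]

# Progress scores in [0, 1] for four map tasks of one job:
progress = {"map_0": 0.9, "map_1": 0.85, "map_2": 0.2, "map_3": 0.95}
stragglers = find_stragglers(progress)  # -> ["map_2"]
```

The scheduler would launch a backup copy of each flagged task on another node and keep whichever copy finishes first.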
- Papers for the Research Project:
- Data skew in MapReduce (Charles and Baptiste)
- Handling Data Skew in MapReduce. Gufler, B., Augsten, N., Reiser, A. and Kemper, A. In CLOSER 2011 (2011): 574-583.
- LIBRA: Lightweight Data Skew Mitigation in MapReduce. IEEE Transactions on Parallel and Distributed Systems 26, no. 9 (2014): 2520-2533.
- DS2: Handling Data Skew Using Data Stealings over High-Speed Networks. He, Z., Li, Z., Peng, X. and Weng, C. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021.
- Energy-Efficiency in MapReduce (Loic)
- Scheduling MapReduce applications (Dylan and Naïm)
- ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters. Ahmad, F., Chakradhar, S.T., Raghunathan, A. and Vijaykumar, T.N. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 1-13, 2014.
- Low Latency Geo-distributed Data Analytics. Pu, Q., Ananthanarayanan, G., Bodik, P., Kandula, S., Akella, A., Bahl, P. and Stoica, I. ACM SIGCOMM Computer Communication Review 45, no. 4 (2015): 421-434.
- A Network Cost-aware Geo-distributed Data Analytics System. Oh, K., Chandra, A. and Weissman, J. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 649-658.
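A small Python sketch can make the data-skew problem behind the first group of papers concrete. Hadoop's default partitioner assigns each key to a reducer by hashing it modulo the number of reducers, so when one key dominates the input, one reducer receives most of the records. The key names below are made up for illustration.

```python
from collections import Counter

def reducer_loads(keys, n_reducers):
    """Count how many records each reducer receives under default
    hash partitioning: reducer = hash(key) mod n_reducers."""
    loads = Counter()
    for k in keys:
        loads[hash(k) % n_reducers] += 1
    return loads

# A skewed key distribution: one "hot" key dominates the input.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
loads = reducer_loads(keys, 4)
# The reducer owning "hot" processes at least 90% of all records,
# while the others share the remainder -- the imbalance the
# skew-mitigation papers aim to correct.
```

Skew-aware partitioners sample the key distribution first and then split or rebalance the heavy keys across reducers.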
Practical Sessions (By Alessio Pagliari)
The goal of these practical sessions is to study the operation of Hadoop (and Yarn) and to see how to run single and multiple MapReduce applications on a Hadoop cluster. We will also learn how to configure the block size in HDFS and the Hadoop cluster. We will briefly discuss how to set up a Yarn cluster and how to write MapReduce applications.
- Getting started with Hadoop [pdf]
The goal of this TP is to study the implementation and the operation of the Hadoop platform. We will deploy Hadoop on a single node and then on multiple nodes. Finally, we will run simple examples using the MapReduce paradigm and test slot configurations.
- You can download Hadoop HERE
- Note: to unzip a tar file, run tar -xvf <archive>; to find the java path, run update-alternatives --config java
- Deploying and configuring Yarn on multiple nodes on Grid'5000 [pdf]
The goal of this TP is to study the implementation and the operation of the Hadoop 3 platform using Yarn as the resource manager. We will deploy the platform on multiple nodes, start using HDFS by adding and removing files, and finally run simple examples using the MapReduce paradigm.
- You can download the TP-resources HERE
- A skeleton for a script to automatically deploy Yarn on Grid'5000 can be found HERE
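One knob exercised in these TPs is the HDFS block size. Since by default each block becomes one input split (and thus one map task), the block size directly shapes job parallelism. The arithmetic is simple enough to sketch; 128 MB is the HDFS default block size, configurable via the dfs.blocksize property.

```python
import math

def num_blocks(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks for a file -- and, with default input
    splitting, the number of map tasks the job will launch.
    128 MB is the HDFS default; tunable via dfs.blocksize."""
    return math.ceil(file_size_mb / block_size_mb)

# A 1 GB input file:
print(num_blocks(1024))       # 8 blocks with the 128 MB default
print(num_blocks(1024, 64))   # 16 blocks with a 64 MB block size
```

Smaller blocks mean more, shorter map tasks (finer load balancing but more scheduling overhead); larger blocks mean the opposite, which is worth measuring during the TP.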
- Configuring and optimizing Hadoop [pdf]
The goal of this TP is to study how to configure the platform to enable important features of Hadoop (e.g., data replication, speculative execution). We will then examine how this tuning impacts the overall performance (e.g., execution time). We assume you already have a Hadoop cluster deployed.
- Developing MapReduce applications [pdf] [source]
After studying in the first TPs how to operate the Hadoop platform using examples, we will now implement MapReduce applications to solve several types of problems, including numerical calculation and analysis examples.
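One common way to write such applications in Python is Hadoop Streaming, where the mapper and reducer are plain programs exchanging tab-separated key/value lines on stdin/stdout, and the framework sorts the mapper output by key before the reducer sees it. The sketch below simulates that map, sort, reduce pipeline in memory; in a real job, mapper and reducer would be two scripts reading sys.stdin, passed to the hadoop streaming jar via its -mapper and -reducer options.

```python
from itertools import groupby

def mapper(lines):
    # Emit "word\t1" per word -- the line format Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so consecutive
    # lines sharing a word can be summed with groupby.
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Simulate the map -> sort -> reduce pipeline Hadoop Streaming runs:
mapped = sorted(mapper(["the quick fox", "the dog"]))
result = list(reducer(mapped))
# result contains "the\t2"
```

The same mapper/reducer pair works unchanged on a cluster because the contract is just sorted text lines.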
- Practical evaluation [PDF]
- From Batch processing to Stream data processing: Getting Started with Storm [pdf] [presentation]
The goal of this TP is to study the implementation and the operation of the Storm stream processing platform. We will deploy and configure a cluster and run a simple benchmark to understand the DAG application paradigm.
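To preview the DAG paradigm before the TP, here is a minimal in-memory sketch of a Storm-style topology: a spout that emits sentence tuples, a bolt that splits them into words, and a stateful bolt that keeps running counts. The class names mirror Storm's spout/bolt vocabulary but are purely illustrative, not the Storm API.

```python
from collections import Counter

class SentenceSpout:
    """Spout: the data source at the root of the topology."""
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        yield from self.sentences

class SplitBolt:
    """Bolt: splits each sentence tuple into word tuples."""
    def process(self, sentence):
        yield from sentence.split()

class CountBolt:
    """Bolt: keeps a running count per word (stateful)."""
    def __init__(self):
        self.counts = Counter()
    def process(self, word):
        self.counts[word] += 1

# Wire the DAG spout -> split -> count and push tuples through it:
spout = SentenceSpout(["to be or not", "to be"])
split, count = SplitBolt(), CountBolt()
for sentence in spout.emit():
    for word in split.process(sentence):
        count.process(word)
# count.counts["to"] == 2 and count.counts["be"] == 2
```

Unlike a MapReduce job, this pipeline never "finishes": in Storm the spout keeps emitting tuples and the counts are continuously updated, which is the batch-to-stream shift this TP explores.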
References:
- Cloud Types and Services. Hai Jin, Shadi Ibrahim, Tim Bell, Wei Gao, Dachuan Huang, Song Wu. Book chapter in the Handbook of Cloud Computing, Springer Press, 26 Sep 2010.
- Tools and Technologies for Building the Clouds. Hai Jin, Shadi Ibrahim, Tim Bell, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. Book chapter in Cloud Computing: Principles, Systems and Applications, Springer Press, 2 Aug 2010.
- A View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Communications of the ACM 53, no. 4 (April 2010).
- The Google file system. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In SOSP '03. [pdf]
- MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat, OSDI, 2004. [pdf]
- The MapReduce Programming Model and Implementations. Hai Jin, Shadi Ibrahim, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. Book Chapter in Cloud Computing: Principles and Paradigms.
- Apache Hadoop YARN: yet another resource negotiator. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. In SOCC '13. [pdf]