Inria Rennes - Bretagne Atlantique Research Center

Cloud & Big data

Data volumes are growing steadily across a wide spectrum of applications, from traditional databases and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this added weight of Big Data, we have recently witnessed a paradigm shift in computing infrastructure through Cloud Computing and in the way data is processed through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing on large-scale infrastructures. Meanwhile, cloud computing continues to serve as a prominent infrastructure for Big Data applications.

The goal of this course is to serve as a first step towards exploring the data analytics models and technologies used to handle Big Data, such as MapReduce (and what comes after it), Hadoop, Spark, and Flink. We will start with an overview of Big Data, including definitions, the sources of Big Data, and the main challenges it introduces. We will then present MapReduce as an important programming model for Big Data processing in the Cloud, followed by the Hadoop ecosystem and some of the major Hadoop features. We will then discuss several approaches and methods used to optimise the performance of Hadoop in the Cloud. Finally, we will discuss the limitations of Hadoop and introduce newer Big Data systems, including Spark.
Several hands-on sessions may be provided to study the operation of Hadoop along with the implementation of MapReduce applications.
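To give a first feel for the MapReduce programming model mentioned above, here is a minimal in-memory sketch of a word count in plain Python. It does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative choices, and the shuffle step only imitates what a real framework does across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data on the cloud", "big data with hadoop"]
    print(word_count(docs))
```

In a real deployment the map and reduce functions run in parallel on different nodes, and the grouping by key involves sorting and moving data over the network; the sketch only shows the shape of the two user-defined functions.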


Course Schedule and Resources (a work in progress)

Lectures

 

Practical Sessions (By Alessio Pagliari)

The goal of these practical sessions is to study the operation of Hadoop (and YARN) and to see how to run single and multiple MapReduce applications on a Hadoop cluster. We will also learn how to configure the HDFS block size and the Hadoop cluster. We will briefly discuss how to set up a YARN cluster and how to write MapReduce applications.
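As a rough illustration of why the HDFS block size matters, the sketch below computes how many blocks a file occupies for a given block size. It assumes the 128 MB default used by recent Hadoop releases (the dfs.blocksize property); the helper name block_count is hypothetical. For file-based inputs the number of map tasks usually follows the number of input splits, which often matches the block count, but that depends on the input format and job settings.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Return the number of HDFS blocks needed to store a file.

    The last block may be only partially filled; an empty file needs no blocks.
    The 128 MB default is an assumption matching recent Hadoop defaults.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Integer ceiling division avoids floating-point rounding on large files.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    one_gb = 1024 * MB
    for size_mb in (64, 128, 256):
        blocks = block_count(one_gb, size_mb * MB)
        print(f"1 GB file, {size_mb} MB blocks -> {blocks} blocks")
```

A smaller block size yields more blocks and therefore typically more, shorter map tasks; a larger one yields fewer, longer tasks. Which is better depends on the workload and the cluster.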

 

References:

  1. Cloud Types and Services. Hai Jin, Shadi Ibrahim, Tim Bell, Wei Gao, Dachuan Huang, Song Wu. Book Chapter in the Handbook of Cloud Computing, Springer Press, 26 Sep 2010.
  2. Tools and technologies for building the Clouds. Hai Jin, Shadi Ibrahim, Tim Bell, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. Book Chapter in Cloud Computing: Principles Systems and Applications, Springer Press, 2 Aug 2010.
  3. A view of cloud computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Commun. ACM 53, 4 (April 2010).
  4. The Google file system. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In SOSP '03.
  5. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. In OSDI '04.
  6. The MapReduce Programming Model and Implementations. Hai Jin, Shadi Ibrahim, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. Book Chapter in Cloud Computing: Principles and Paradigms.
  7. Apache Hadoop YARN: yet another resource negotiator. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. In SOCC '13.
