Big data algorithms, techniques and platforms
Overview
This course introduces the main technologies for tackling the challenges posed by Big Data.
Big Data refers to collections of data that are huge in volume and keep growing rapidly over time. In short, such data is so voluminous and complex that traditional data management tools cannot store or process it efficiently.
In the first part, the course introduces technologies that make it possible to process large volumes of data efficiently, namely Hadoop MapReduce and Apache Spark.
In the second part, we will study solutions for storing and querying such volumes of data, focusing on a variety of NoSQL databases (with MongoDB as a case study).
Prerequisites
Basic understanding of how computer systems work: processor, memory, disk operations, and the functions of the operating system.
Good knowledge of relational database management systems.
Teaching staff
Course summary
1. Introduction and MapReduce programming.
- Basic notions and motivations of Big Data.
- Overview of Hadoop.
- Introduction to MapReduce (a word-count sketch follows this outline).
2. Hadoop and its ecosystem: HDFS.
- In-depth description of the Hadoop Distributed File System (HDFS).
3. Introduction to Apache Spark.
- Apache Spark, its architecture and functionalities.
- Resilient Distributed Datasets: transformations and actions (illustrated in the PySpark sketch after this outline).
4. Spark SQL (previewed in a short sketch after this outline).
5. Spark Streaming.
6. Distributed databases and NoSQL.
- Data distribution (replication, sharding, the CAP theorem).
- Overview of NoSQL databases.
7. Document-oriented databases: MongoDB.
- Presentation of MongoDB (see the pymongo sketch after this outline).
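
To make the MapReduce model of item 1 concrete, here is a minimal sketch of the classic word count, written in the style of Hadoop Streaming, which lets mappers and reducers be plain Python scripts that read stdin and write stdout. The file names mapper.py and reducer.py are illustrative.

    # mapper.py -- emit a (word, 1) pair for every word in the input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sum the counts for each word; Hadoop delivers the
    # mapper output sorted by key, so equal words arrive consecutively
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(n)
    if current_word is not None:
        print(f"{current_word}\t{count}")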
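The distinction in item 3 between transformations (lazy) and actions (eager) can be seen in a few lines of PySpark; the local master URL, the application name and the sample data are illustrative.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    lines = sc.parallelize(["big data", "apache spark", "big compute"])
    words = lines.flatMap(lambda l: l.split())      # transformation: lazy
    pairs = words.map(lambda w: (w, 1))             # transformation: lazy
    counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation: lazy

    # collect() is an action: only now is the whole lineage executed
    print(counts.collect())
    sc.stop()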
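Item 4's interplay of DataFrames and SQL can likewise be previewed in a short PySpark sketch; the application name, sample rows and view name are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Build a small DataFrame and expose it as a temporary SQL view
    df = spark.createDataFrame([("Ada", 15), ("Bob", 17)], ["name", "grade"])
    df.createOrReplaceTempView("students")

    # Query the view with plain SQL
    spark.sql("SELECT name FROM students WHERE grade > 16").show()
    spark.stop()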
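Finally, as a preview of item 7, a short pymongo sketch shows how MongoDB stores schemaless, JSON-like documents and queries them with filter documents; the connection string, database name and collection name are illustrative.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["coursedb"]

    # Insert a document; no schema has to be declared beforehand
    db.students.insert_one({"name": "Ada", "grades": [15, 17]})

    # Filter: match documents where any grade exceeds 16;
    # the projection keeps only the name field
    for doc in db.students.find({"grades": {"$gt": 16}}, {"name": 1, "_id": 0}):
        print(doc)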