Big Data for IA — Tutorial 1

MapReduce


1 Computing averages

We are given a dataset that contains the average monthly temperature measurements over the course of some years. More precisely, the dataset is stored in a CSV file, where each row corresponds to a monthly measurement and the columns contain the following values: year, month, average temperature in the month.

1980,1,5
1980,2,2
1980,3,10
1980,4,14
1980,5,17
....
1981,1,2
1981,2,1
1981,3,3
1981,4,10
....

We intend to get the average monthly temperature for each year.

Exercise

Exercise 1.1 Write a MapReduce algorithm that generates key-value pairs \((year, average\_temperature)\).

Suppose now that we have a large CSV file stored in a distributed file system (e.g., HDFS), containing a series of measurements in the format “Year, Month, Day, Minute, Second, Temperature”. We can have up to one measurement per second in some years. Like before, we’d like to compute key-value pairs (year, average_temperature) by using a MapReduce algorithm.

Exercise

Exercise 1.2 What is the maximum number of measurements in a year?

Exercise

Exercise 1.3 Considering the answer to the previous question, discuss the efficiency of the first implementation of the algorithm.

Exercise

Exercise 1.4 Based on the answer to the previous question, propose a better implementation to handle the CSV file.

2 Common friends in a social network

Consider a social network described by a graph encoded in a text file. Each line of the file is a list of identifiers separated by commas. For instance, the line \(A,B,C,D\) means that \(A\) is friend with \(B\), \(C\) and \(D\). An excerpt of the file looks like as follows:

A,B,C,D
B,A,D
C,A,D
D,A,B,C
...

We suppose that the friendship relation is symmetric: \((A, B)\) implies \((B, A)\).

We want to obtain the list of the common friends for each pair of individuals:

(A, B), [D]
(A, C), [D] 
(A, D), [B, C] 
(B, C), [D] 
(B, D), [A] 
(C, D), [A]

As an additional constraint, we want to represent a couple only once and avoid to represent the symmetric couple. In other words, if we output \((A, B)\), we don’t want to output \((B, A)\).

Exercise

Exercise 2.1 Propose a MapReduce implementation to find the common friends in a social network satisfying the given constraints.

3 Computing average and standard deviation

We consider again the large CSV file with a series of measurements in the format “Year, Month, Day, Minute, Second, Temperature”. We now intend to generate a series of key-value pairs (year, (avg_temperature, std_deviation)).

We can express the standard deviation of \(n\) values \(x_i\) (\(1 \leq i \leq n\)) with two different equations.

The first equation is as follows:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n}} \]

The second equation is as follows:

\[ \sigma = \sqrt{\overline{x^2} - \overline{x}^2} = \sqrt{\frac{\sum_{i=1}^n (x_i)^2}{n} - \Bigg(\frac{\sum_{i=1}^n x_i}{n}\Bigg)^2} \]

Exercise

Exercise 3.1 Which equation of the standard deviation is more appropriate in a Map-Reduce algorithm? Why?

Exercise

Exercise 3.2 Propose a MapReduce implementation to compute the average and the standard deviation of the temperatures for each year.