Big Data IA

First Spark program on the DCE

The Data Centre d’Enseignement (DCE) is a set of computing resources financed by the Eurométropole de Metz, the Grand Est region, CentraleSupélec and its foundation, and the Conseil Départemental de Moselle.

This tutorial is a quick guide that will help you:

* Learn the basic commands for manipulating files stored in HDFS.
* Learn how to run a Spark program on the DCE.

More information is available on the official DCE website.

1 Overview of the DCE

The architecture of the DCE is presented in Figure 1.1.

Figure 1.1: Architecture of the cluster (source: DCE documentation, https://dce.pages.centralesupelec.fr/01_cluster_overview/)

In this course, we’ll be using CPU nodes only. These are divided into two groups: the Sarah cluster and the Kyle cluster.

2 Running a Spark program

Datasets are normally available in the /data/ folder stored in HDFS. Enter the following command to display the contents of the folder:

hdfs dfs -ls -h hdfs://sar01:9000/data

The aim here is to run Spark code that counts the number of occurrences of each word in the file /data/sherlock.txt.

Copy the file ~cpu_vialle/DCE-Spark/template_wc.py to your home directory under the name wc.py with the following command:

cp ~cpu_vialle/DCE-Spark/template_wc.py ./wc.py

Run the ls command to check that wc.py is in your home directory. This file contains the program’s Python code. It should also appear in the Visual Studio Code explorer; if not, click the explorer’s refresh button.

Open file wc.py in Visual Studio Code.
Locate the following instruction:

text_file = sc.textFile("hdfs://...")

and replace it with the following:

text_file = sc.textFile("hdfs://sar01:9000/data/sherlock.txt")

This instruction creates an RDD called text_file in which each element is one line of the file.

Similarly, locate the following instruction:

counts.saveAsTextFile("hdfs://...")

and replace it with the following:

counts.saveAsTextFile("hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out")

This instruction saves the contents of the RDD counts to a new HDFS folder named sherlock.out, which holds the output of your program.
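The contents of template_wc.py are not reproduced in this tutorial. As a rough illustration of what a word count computes, the plain-Python sketch below mirrors the three steps that such a program typically chains on an RDD (flatMap, map, reduceByKey), applied here to a small local list. The function name word_count, the whitespace tokenisation and the sample lines are assumptions made for illustration; the template may split words or normalise case differently.

```python
from collections import defaultdict


def word_count(lines):
    """Count word occurrences in an iterable of text lines (local sketch)."""
    # Step 1 (flatMap in Spark): split every line into words.
    words = [word for line in lines for word in line.split()]
    # Step 2 (map in Spark): pair each word with the count 1.
    pairs = [(word, 1) for word in words]
    # Step 3 (reduceByKey in Spark): sum the counts of identical words.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample input, not taken from sherlock.txt.
sample = ["the game is afoot", "the game"]
print(word_count(sample))  # {'the': 2, 'game': 2, 'is': 1, 'afoot': 1}
```

On the cluster, Spark applies the same three steps in parallel on the partitions of text_file instead of on a local list.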

Execute the program wc.py through the following command:

spark-submit --master spark://sar01:7077 wc.py

When the execution is completed, the output will be available in the folder sherlock.out. To verify it, type the following command:

hdfs dfs -ls -h hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out

As usual, don’t forget to replace bdiaspark2024_X with your username.
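One way to avoid leaving the placeholder in a path is to build the path from the username. The sketch below is only a convenience under stated assumptions: the helper name hdfs_output_path is hypothetical, the namenode address and the bdiaspark2024 folder are those used in this tutorial, and the fallback to the local login name assumes that it matches the name of your HDFS folder.

```python
import getpass


def hdfs_output_path(name, username=None,
                     namenode="hdfs://sar01:9000", course_dir="bdiaspark2024"):
    """Build the HDFS path of an output folder in a user's course directory."""
    if username is None:
        # Assumption: the local login name is also the HDFS folder name.
        username = getpass.getuser()
    return f"{namenode}/{course_dir}/{username}/{name}"


# bdiaspark2024_X is the tutorial's placeholder; pass your own username.
print(hdfs_output_path("sherlock.out", username="bdiaspark2024_X"))
# hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out
```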

Type the following command to print the result:

hdfs dfs -cat hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out/*
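The printed result can be long, so it may help to post-process it. The sketch below assumes that counts holds (word, count) tuples, in which case saveAsTextFile writes each element as its Python string form, for example ('holmes', 3). That output format, the function names parse_counts and top_words, and the sample lines are assumptions, not something this tutorial confirms.

```python
import ast


def parse_counts(lines):
    """Parse lines such as "('holmes', 3)" into a word -> count dictionary."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # literal_eval safely turns the text of a tuple back into a tuple.
        word, n = ast.literal_eval(line)
        counts[word] = counts.get(word, 0) + n
    return counts


def top_words(counts, k=10):
    """Return the k most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical sample lines, not actual program output.
sample = ["('holmes', 3)", "('watson', 2)", "('the', 5)"]
print(top_words(parse_counts(sample), k=2))  # [('the', 5), ('holmes', 3)]
```

On the DCE, the lines could come from the hdfs dfs -cat command above, for instance by piping its output into a script that passes sys.stdin to parse_counts.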

Output files

If you run the program with an output folder that already exists, the job fails with an error. If you really want to overwrite the output, you first need to remove the folder explicitly with the following command:

hdfs dfs -rm -r hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out
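When the program is run repeatedly, the removal and the submission can be chained. The following sketch builds the two commands used in this tutorial and, by default, only prints them. The function names are hypothetical, and the -f option is added so that hdfs dfs -rm does not report an error when the folder does not exist yet. Pass dry_run=False only on a machine where hdfs and spark-submit are available.

```python
import subprocess


def rm_command(output_path):
    """Command that removes an HDFS folder, ignoring a missing path (-f)."""
    return ["hdfs", "dfs", "-rm", "-r", "-f", output_path]


def submit_command(script, master="spark://sar01:7077"):
    """Command that submits a Python script to the Spark master."""
    return ["spark-submit", "--master", master, script]


def rerun(script, output_path, dry_run=True):
    """Remove the previous output, then submit the program again."""
    commands = [rm_command(output_path), submit_command(script)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # show the command without running it
        else:
            subprocess.run(cmd, check=True)  # stop if a command fails
    return commands


# bdiaspark2024_X is the tutorial's placeholder; use your own username.
rerun("wc.py", "hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out")
```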