Big Data — DCE tutorial

The Data Centre d’Enseignement (DCE) is a pool of computing resources financed by the Eurométropole de Metz, the Région Grand Est, CentraleSupélec and its foundation, and the Conseil Départemental de Moselle.

This tutorial is a quick guide explaining how to:

  • Connect to the DCE.

  • Use basic commands to manipulate files stored in HDFS.

  • Run a Spark program on the DCE.

More information is available on the official DCE website.

1 Overview

The architecture of the DCE is shown in Figure 1.1.

Figure 1.1: Cluster architecture (image credit: DCE documentation, https://dce.pages.centralesupelec.fr/01_cluster_overview/)

In this tutorial we only use the CPU nodes. These are divided into two groups: the Sarah cluster and the Kyle cluster.

To connect to the DCE you need:

  • A valid username and a password. These are provided by your lab supervisor before the first lab session.

  • Visual Studio Code with the Remote Development extension installed.

2 Connection

Watch this video to learn how to connect to the DCE with Visual Studio Code.

Some of the steps shown in the video are explained below:

  • When you connect to the DCE for the first time, or if you need to re-initialise the connection, you’ll be prompted to enter an SSH command. Type the following (replace your_username with the username that you received from your lab supervisor):

ssh your_username@chome.metz.supelec.fr

  • After executing the command, you’ll be prompted to enter your password.

  • Once you’re connected to chome, open your home folder in Visual Studio Code, as shown in the video.

3 Allocating resources with slurm

You need to allocate computing resources to run any job on the DCE. The command to do so depends on whether or not you have a reservation for the resources.

3.1 With a reservation

Resources are reserved by your lab supervisor before each lab session. Each reservation is identified by a code.

Good to know

The code is only valid for a single lab session. Once the lab session is over, the reservation code may no longer work.

The command to allocate the reserved resources is the following (replace [code] with the reservation code given by your lab supervisor):

srun -N 1 -c 2 --reservation [code] -t 04:00:00 --pty bash

After running the command, you should be connected to one of the Kyle machines. The options request one node (-N 1) with two CPU cores (-c 2) for at most four hours (-t 04:00:00), and --pty bash opens an interactive shell on the allocated node.
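
To check your allocation, two standard Slurm and Linux commands may be helpful (they are not specific to the DCE):

hostname                  # shows which machine you are connected to
squeue -u your_username   # lists your running and pending jobs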

3.2 Without a reservation

If you don’t have a reservation code, run the following command, which requests two CPU cores on one node of the cpu_inter partition for at most two hours:

srun -p cpu_inter -t 02:00:00 -N 1 --cpus-per-task=2 --pty bash

Good to know

If you don’t have a reservation, please use the DCE only between 6 PM and 8 AM on weekdays. During the weekend, you can allocate resources on the DCE all day.
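
When you are done working, whether you had a reservation or not, you can release the allocated resources by terminating the interactive shell:

exit   # ends the interactive shell and releases the allocation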

4 Running a Spark program

Datasets are usually available under the /data/ directory in HDFS.

Type the following command to list the content of the directory:

hdfs dfs -ls -h hdfs://sar01:9000/data
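
The hdfs dfs subcommands mirror the usual Unix file commands. A few other common ones are shown below; the paths are illustrative (myfile.txt and mydir are placeholders) and assume that you have write access to your own directory hdfs://sar01:9000/hpda/hpda_X, where hpda_X is your username:

hdfs dfs -get hdfs://sar01:9000/data/sherlock.txt .       # download a file from HDFS
hdfs dfs -put myfile.txt hdfs://sar01:9000/hpda/hpda_X/   # upload a local file to HDFS
hdfs dfs -mkdir hdfs://sar01:9000/hpda/hpda_X/mydir       # create a directory in HDFS
hdfs dfs -du -h hdfs://sar01:9000/data                    # show the size of each entry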

The objective is to run a Spark program that counts the number of occurrences of each word in the file /data/sherlock.txt.

  1. Copy the file ~cpu_vialle/DCE-Spark/template_wc.py to your home directory by typing the following command:

cp ~cpu_vialle/DCE-Spark/template_wc.py ./wc.py

Type the command ls to verify that the file wc.py is in your working directory. This file contains the Python code of the program, which reads the input text, splits it into words, and counts the occurrences of each word. The file should also appear in the Explorer view in Visual Studio Code; if not, click on the Explorer refresh button.

  2. Open the file wc.py in Visual Studio Code.

  3. Locate the following instruction:

text_file = sc.textFile("hdfs://...")

and replace it with the following:

text_file = sc.textFile("hdfs://sar01:9000/data/sherlock.txt")

This will create an RDD named text_file with the content of the file.

  4. Similarly, locate the following instruction:

counts.saveAsTextFile("hdfs://...")

and replace it with the following instruction:

counts.saveAsTextFile("hdfs://sar01:9000/hpda/hpda_X/sherlock.out")

This will create an output directory sherlock.out that will contain the files with the output of the program. Replace hpda_X with your username here and in all the commands below.

  5. Run the Python program wc.py with the following command:

spark-submit --master spark://sar01:7077 wc.py

  6. When the execution is over, the output will be available under the directory sherlock.out. To verify it, run the following command:

hdfs dfs -ls -h hdfs://sar01:9000/hpda/hpda_X/sherlock.out

As usual, remember to replace hpda_X with your username.

  7. To see the result, run the following command:

hdfs dfs -cat hdfs://sar01:9000/hpda/hpda_X/sherlock.out/*
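
Spark writes its output as several part-* files inside the output directory; the wildcard above concatenates all of them. If the output is long, you may prefer to display only its beginning, or to copy the whole directory from HDFS to your local home folder:

hdfs dfs -cat hdfs://sar01:9000/hpda/hpda_X/sherlock.out/* | head -n 20   # show the first 20 lines only
hdfs dfs -get hdfs://sar01:9000/hpda/hpda_X/sherlock.out ./sherlock.out   # copy the output directory locally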

Output files

If you rerun the program specifying an output directory that already exists, you will get an error. If you really want to overwrite the output, you first need to remove the existing directory explicitly by typing the following command:

hdfs dfs -rm -r hdfs://sar01:9000/hpda/hpda_X/sherlock.out