First Spark program on the DCE
The Data Centre d’Enseignement (DCE) is a set of computing resources financed by the Eurométropole de Metz, the Grand Est region, CentraleSupélec and its foundation, and the Conseil Départemental de Moselle.
This tutorial is a quick guide to:
- Learn the basic commands for manipulating files stored in HDFS.
- Learn how to run a Spark program on the DCE.
More information is available on the official DCE website.
1 Overview of the DCE
The architecture of the DCE is presented in Figure 1.1.
Figure 1.1: Architecture of the cluster (source: DCE documentation)
In this course, we’ll be using CPU nodes only. These are divided into two groups: the Sarah cluster and the Kyle cluster.
2 Running a Spark program
Datasets are normally available in the /data/ folder stored in HDFS.
Enter the following command to display the contents of the folder:
hdfs dfs -ls -h hdfs://sar01:9000/data
The aim here is to run Spark code that counts the number of occurrences of each word in the file /data/sherlock.txt.
- Copy the file ~cpu_vialle/DCE-Spark/template_wc.py to your home directory with the following command:
cp ~cpu_vialle/DCE-Spark/template_wc.py ./wc.py
Enter the ls command to check that the wc.py file is in your home directory.
This file contains the program’s Python code.
This file should also appear in the Visual Studio Code explorer; if not, click on the explorer’s refresh button.
- Open the file wc.py in Visual Studio Code. Locate the following instruction:
text_file = sc.textFile("hdfs://...")
and replace it with the following:
text_file = sc.textFile("hdfs://sar01:9000/data/sherlock.txt")
This instruction creates an RDD called text_file with the contents of the file.
- Similarly, locate the following instruction:
counts.saveAsTextFile("hdfs://...")
and replace it with the following:
counts.saveAsTextFile("hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out")
This instruction creates a folder sherlock.out in HDFS that holds the output of your program.
- Execute the program wc.py with the following command:
spark-submit --master spark://sar01:7077 wc.py
- When the execution is completed, the output will be available in the folder sherlock.out. To verify it, type the following command:
hdfs dfs -ls -h hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out
As usual, don’t forget to replace bdiaspark2024_X with your username.
- Type the following command to print the result:
hdfs dfs -cat hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out/*
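To make explicit what the word count computes, here is a plain-Python sketch that needs no Spark installation. It assumes that the template follows the usual Spark word-count pattern (split lines into words, pair each word with 1, sum per word); the actual contents of template_wc.py may differ. The function name word_count and the sample lines are hypothetical and only serve the illustration.

```python
from collections import Counter


def word_count(lines):
    """Count occurrences of each whitespace-separated word in an iterable of lines."""
    # Step 1 (what flatMap would do on an RDD): split every line into words
    # and flatten the result into a single stream of words.
    words = (word for line in lines for word in line.split())
    # Steps 2 and 3 (what map to (word, 1) followed by reduceByKey with
    # addition would do): sum the occurrences of each distinct word.
    return Counter(words)


# Hypothetical sample standing in for a few lines of the input file.
sample = ["the game is afoot", "the game is on"]
counts = word_count(sample)

# Print the two most frequent words with their counts.
for word, n in counts.most_common(2):
    print(word, n)
```

On the cluster, the same three steps run in parallel over the partitions of the text_file RDD, which is why the result is written as several part files inside sherlock.out rather than as a single file.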
Output files
If you run the program with an output folder that already exists,
you’ll get an error.
If you really want to overwrite the output,
you first need to remove the existing folder explicitly with the following command:
hdfs dfs -rm -r hdfs://sar01:9000/bdiaspark2024/bdiaspark2024_X/sherlock.out
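If you rerun the program often, the removal and submission steps can be wrapped in a small script. The following is a minimal sketch, not part of the course material: the helper names (output_uri, rm_command, submit_command, rerun) are hypothetical, and the username bdiaspark2024_7 in the usage note is only a placeholder. The commands themselves are the ones shown in this tutorial.

```python
import subprocess

# Namenode and Spark master addresses used throughout this tutorial.
HDFS_ROOT = "hdfs://sar01:9000"
SPARK_MASTER = "spark://sar01:7077"


def output_uri(username, name="sherlock.out"):
    """Build the HDFS URI of an output folder in the user's directory."""
    return f"{HDFS_ROOT}/bdiaspark2024/{username}/{name}"


def rm_command(username, name="sherlock.out"):
    """Command that removes the output folder recursively."""
    return ["hdfs", "dfs", "-rm", "-r", output_uri(username, name)]


def submit_command(script="wc.py"):
    """Command that submits the script to the Spark master."""
    return ["spark-submit", "--master", SPARK_MASTER, script]


def rerun(username, script="wc.py"):
    """Remove the previous output, then submit the program again."""
    # check=False: the removal exits with an error when the folder does not
    # exist yet, which is harmless on a first run.
    subprocess.run(rm_command(username), check=False)
    # check=True: raise an exception if spark-submit itself fails.
    subprocess.run(submit_command(script), check=True)
```

Calling rerun("bdiaspark2024_7") from a terminal session on the DCE would then delete sherlock.out and resubmit wc.py. Note that the removal is permanent as far as this sketch is concerned, so double-check the username before running it.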