Basic commands

This page summarizes the commands needed to manage data and run jobs.

Data management

Data in the cluster lives on HDFS, the Hadoop Distributed File System. Spark is already configured to read data from HDFS. The command used to interact with HDFS is called hdfs.

hdfs <subcommand> [options]

The hdfs command has several subcommands, most of which can be used only by the cluster's administrator. The most interesting subcommand for normal operation is hdfs dfs, which allows you to list, move, copy, remove, upload, and download files and directories in HDFS. Here is a brief synopsis of the most useful options; an example session follows the list.

  • -ls Lists the files in the given directory. If no directory is given, it lists the files under /user/groupXXX, where groupXXX is your username.
  • -rm Removes the given file. Paths not beginning with a / are relative to /user/groupXXX. If you need to remove a directory, you must also pass the -r option.
  • -mv Moves or renames a file or directory.
  • -get Downloads the given file to the local filesystem.
  • -put Uploads the given file from the local filesystem.
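For example, assuming your username is group42 and a local file named data.csv (both names are placeholders; adapt them to your own case), a typical session could look like this:

hdfs dfs -put data.csv data.csv
hdfs dfs -ls
hdfs dfs -get results/part-00000 results.txt
hdfs dfs -rm -r results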

Job submission

spark-submit [spark options] code [program arguments]

Submit Spark jobs to the cluster. For Java users, code is the path to the jar file containing all the bytecode of your application. For Python users, code is the path to the .py file containing the application code.

This command has many parameters, which can be shown by invoking:

spark-submit --help

Among the options that you can set, the following are of primary interest in our case (a complete example follows the list):

  • --class class_name: the class containing the main method you intend to run. This argument is mandatory when using Java.
  • --num-executors X: the number X of executors to be used, each with one core and 2 GB of memory.
  • --conf spark.pyspark.python=python3: for Python 3 users.
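
Putting things together, a submission could look like one of the following lines, where MyMainClass, my-app.jar, my_app.py, and the trailing program arguments are placeholders to adapt to your own application:

spark-submit --class MyMainClass --num-executors 4 my-app.jar input.txt
spark-submit --num-executors 4 --conf spark.pyspark.python=python3 my_app.py input.txt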

There is also a --master option that allows you to select the Spark master to which the job is sent. On the cluster this is already configured, so you don't need to use this option.