
Running Spark in Local Mode

Overview

In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading: this restricts parallelism to (at most) the number of cores in your machine. To run jobs in local mode, you first need to reserve a machine through SLURM in interactive mode and log in to it. You can subsequently set the level of parallelism to the number of cores available on this machine. The number of cores per machine in the Discovery cluster differs per partition: the specifications of each partition can be found here.
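For illustration, an interactive SLURM reservation typically takes a form like the one below; the partition name and core count are placeholders, so substitute the values recommended for the partition you intend to use:

srun --pty --nodes=1 --cpus-per-task=16 --partition=<partition-name> /bin/bash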

Configuration

Local mode runs on Discovery "out-of-the-box". As with all other software on Discovery, running Spark simply requires pre-loading several modules that Spark depends on. It is best to add these directly to your .bashrc file, so that you do not have to load them again every time you log in to a new machine. An example .bashrc file that contains all necessary module dependencies can be found here. Make sure that you have added the commands outlined in this file to your .bashrc file prior to running Spark.
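As a rough sketch, the relevant lines in .bashrc are module load commands; the module names below are hypothetical, so copy the actual ones from the example file linked above:

# hypothetical module names: use the ones from the example .bashrc
module load python
module load spark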

Tip. Configuring Spark to launch a cluster involving multiple machines is more complicated. If you intend to use Spark only in local mode, loading the appropriate modules in your .bashrc file suffices. If you intend to launch standalone clusters as well, use these instructions to configure your environment appropriately. Doing so is also compatible with running Spark in local mode (i.e., you will be able to run Spark both in local and cluster mode).

Running Spark

There are three different ways of running Spark in local mode. The first, and simplest, is to use the pyspark shell, which launches an interpreter allowing you to type Python and Spark commands. The second is to execute a .py program using python, as if it were a regular Python program. This requires specifying the Spark context master from within your Python program. The third way is to use spark-submit, which requires you to specify the Spark context master outside the program: this makes your code more portable, as it is easy to run the same code in different environments and to switch, e.g., between local and cluster mode.

All three settings are supported on Discovery and are described below. In all three cases, you should first reserve a machine in interactive mode and connect to it before launching Spark. Never run Spark in local or standalone mode from a gateway. All instructions below presume that you have first connected to such a machine.

Running the pyspark shell

To run the pyspark interpreter in local mode from an interactive node, type:

pyspark --master local[N]

where N is the number of cores you want to use. For example, local[10] uses 10 cores, local[40] uses 40 cores, etc. Use these instructions to find out how many cores are available in the machine you have reserved.
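Alternatively, once you are logged in to the reserved machine, the standard Linux utility nproc reports the number of cores visible to your session:

nproc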

The above command launches the pyspark interpreter, automatically creating a SparkContext variable called sc that you can use in the commands you type. You can exit the interpreter by typing Ctrl-D.
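For instance, assuming the shell was launched with pyspark --master local[10] as above, a minimal sanity check that exercises the predefined sc is:

>>> # sc is created for you; compute a parallel sum of squares
>>> sc.parallelize(range(1000)).map(lambda x: x * x).sum()
332833500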

Running a program using python

You can run a program as if it were a regular Python program by (a) importing the pyspark module in your code, and (b) creating a SparkContext with its master argument set to local. That is, your program must include lines such as the following:

from pyspark import SparkContext
sc = SparkContext(master='local[N]', appName='NameOfProgram', sparkHome='/shared/apps/spark/spark-1.4.1-bin-hadoop2.4')

For example, the program WordCounter.py below:

import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(master='local[10]', appName='WordCount')
    lines = sc.textFile(sys.argv[1])   # read the input file as an RDD of lines

    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])  # write (word, count) pairs, most frequent first

can be run on the file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):

python WordCounter.py WarAndPeace.txt output.txt

Running a program with spark-submit

To run a program called myprogram.py with spark-submit, type (on an interactive node):

spark-submit --master local[N] myprogram.py [arguments]

where N is the number of cores you want to use, and [arguments] are any arguments that you wish to pass to myprogram.py. The program myprogram.py must import pyspark and create a SparkContext in order to use Spark functionality, but it does not need to specify a master when doing so.

For example, we modify WordCounter.py as follows:

import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName='WordCount')   # master is supplied by spark-submit
    lines = sc.textFile(sys.argv[1])

    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])

We can then run it on the file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):

spark-submit --master local[10] WordCounter.py WarAndPeace.txt output.txt

Note that the master parameter of SparkContext does not need to be specified inside the program when running it through spark-submit. In fact, doing so will produce an error.

Additional Resources

Links

Back to the Spark page