
Running Spark in Local Mode

Overview

In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading: this restricts parallelism to (at most) the number of cores in your machine. To run jobs in local mode, you first need to reserve a machine through SLURM in interactive mode and log in to it. You can subsequently set the level of parallelism to the number of cores available on this machine. The number of cores per machine in the Discovery cluster differs per partition: the specifications of each partition can be found here.
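For illustration, an interactive SLURM reservation typically takes a form like the one below; the partition name and core count are placeholders, so substitute the values recommended for the partition you intend to use:

srun --pty --nodes=1 --cpus-per-task=16 --partition=<partition-name> /bin/bash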

Configuration

Local mode runs on Discovery "out-of-the-box". As with all other software on Discovery, running Spark simply requires pre-loading several modules that Spark depends on. It is best to add these directly to your .bashrc file, so that you do not have to load them again every time you log in to a new machine. An example .bashrc file that contains all necessary module dependencies can be found here. Make sure that you have added the commands outlined in this file to your .bashrc file prior to running Spark.
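As a rough sketch, the relevant lines in .bashrc are module load commands; the module names below are hypothetical, so copy the actual ones from the example file linked above:

# hypothetical module names: use the ones from the example .bashrc
module load python
module load spark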

Tip. Configuring Spark to launch a cluster involving multiple machines is more complicated. If you intend to use Spark only in local mode, loading the appropriate modules in your .bashrc file suffices. If you intend to launch standalone clusters as well, use these instructions to configure your environment appropriately. Doing so is also compatible with running Spark in local mode (i.e., you will be able to run Spark both in local and cluster mode).

Running Spark

There are three different ways of running Spark in local mode. The first, and simplest, is to use the pyspark shell, which launches an interpreter allowing you to type Python and Spark commands. The second is to execute a .py program using python, as if it were a regular Python program. This requires specifying the Spark context master from within your Python program. The third way is to use spark-submit, which requires you to specify the Spark context master outside the program: this makes your code more portable, as it is easy to run the same code in different environments and to switch, e.g., between local and cluster mode.

All three settings are supported on Discovery and are described below. In all three cases, you should first reserve a machine in interactive mode and connect to it before launching Spark. Never run Spark in local or standalone mode from a gateway. All instructions below presume that you have first connected to such a machine.

Running the pyspark shell

To run the pyspark interpreter in local mode from an interactive node, type:

pyspark --master local[N]

where N is the number of cores you want to use. For example, local[10] uses 10 cores, local[40] uses 40 cores, etc. Use these instructions to find out how many cores are available in the machine you have reserved.
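Alternatively, once you are logged in to the reserved machine, the standard Linux utility nproc reports the number of cores visible to your session:

nproc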

The above command launches the pyspark interpreter, automatically creating a SparkContext variable called sc that you can use in the commands you type. You can exit the interpreter by typing Ctrl-D.
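For instance, assuming the shell was launched with pyspark --master local[10] as above, a minimal sanity check that exercises the predefined sc is:

>>> # sc is created for you; compute a parallel sum of squares
>>> sc.parallelize(range(1000)).map(lambda x: x * x).sum()
332833500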

Running a program using python

You can run a program as if it were a regular Python program by (a) importing the pyspark module in your code, and (b) creating a SparkContext with its master argument set to local. That is, your program must include lines such as the following:

from pyspark import SparkContext
sc = SparkContext(master='local[N]', appName='NameOfProgram', sparkHome='/shared/apps/spark/spark-1.4.1-bin-hadoop2.4')

For example, the program WordCounter.py below:

import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(master='local[10]', appName='WordCount')
    lines = sc.textFile(sys.argv[1])   # read the input file as an RDD of lines

    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])  # write (word, count) pairs, most frequent first

can be run on the file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):

python WordCounter.py WarAndPeace.txt output.txt

Running a program with spark-submit

To run a program called myprogram.py with spark-submit, type (on an interactive node):

spark-submit --master local[N] myprogram.py [arguments]

where N is the number of cores you want to use, and [arguments] are any arguments that you wish to pass to myprogram.py. The program myprogram.py must import pyspark and create a SparkContext in order to use Spark functionality, but it does not need to specify a master when doing so.

For example, we modify WordCounter.py as follows:

import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName='WordCount')   # master is supplied by spark-submit
    lines = sc.textFile(sys.argv[1])

    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])

We can then run it on the file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):

spark-submit --master local[10] WordCounter.py WarAndPeace.txt output.txt

Note that the master parameter of SparkContext does not need to be specified inside the program when running it through spark-submit. In fact, doing so will produce an error.

Additional Resources

Links

Back to the Spark page