Spark local mode
In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading: this restricts parallelism to (at most) the number of cores in your machine. To run jobs in local mode, you first need to reserve a machine through SLURM in interactive mode and log in to it. You can then set the level of parallelism to the number of cores available on that machine. The number of cores available per machine in the Discovery cluster differs by partition: the specifications of each partition can be found here.
Local mode runs on Discovery "out of the box". As with all other software on Discovery, running Spark simply requires pre-loading several modules that Spark depends on. It is best to add these directly to your .bashrc file, so that you do not have to load them again every time you log in to a new machine. An example .bashrc file that contains all necessary module dependencies can be found here. Make sure that you have added the commands outlined in this file to your .bashrc file before running Spark.
Tip. Configuring Spark to launch a cluster involving multiple machines is more complicated. If you intend to use Spark only in local mode, loading the appropriate modules in your .bashrc file suffices. If you intend to launch standalone clusters as well, use these instructions to configure your environment appropriately. Doing so is also compatible with running Spark in local mode (i.e., you will be able to run Spark in both local and cluster mode).
There are three different ways of running Spark in local mode. The first, and simplest, is to use the pyspark shell, which launches an interpreter allowing you to type Python and Spark commands. The second is to execute a .py program using python, as if it were a regular Python program. This requires specifying the Spark context master you are using from within your Python program. The third way is to use spark-submit, which requires you to specify the Spark context master outside the program: this makes your code more portable, as it is easy to run the same code in different environments and to switch, e.g., between local and cluster mode.
All three settings are supported on Discovery and are described below. In all three cases, you should first reserve a machine in interactive mode and connect to it before launching Spark. Never run Spark in local or standalone mode from a gateway. All instructions below presume that you have first connected to such a machine.
To run the pyspark interpreter in local mode from an interactive node, type:
pyspark --master local[N]
where N is the number of cores you want to use. For example, local[10] uses 10 cores, local[40] uses 40 cores, and so on. Use these instructions to find out how many cores are available in the machine you have reserved.
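If you want to double-check how many cores your reservation actually gives you, a small sketch like the one below can be run with python on the reserved node. It uses only the Python standard library and assumes that SLURM exposes the allocation through the SLURM_CPUS_ON_NODE environment variable (as it does inside a job); treat it as a convenience check rather than part of Spark itself.
import multiprocessing
import os

# Total number of cores physically present on this machine.
print('cores on this machine: %d' % multiprocessing.cpu_count())

# Number of cores SLURM allocated to your job on this node (if set).
print('cores in this reservation: %s' % os.environ.get('SLURM_CPUS_ON_NODE', 'not set'))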
The above command launches the pyspark interpreter, automatically creating a SparkContext variable called sc that you can use in the commands you type. You can exit the interpreter by typing Ctrl-D.
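As a quick sanity check once the interpreter is up, you can type a few commands at the prompt, for instance the sketch below (assuming you launched with local[10]); defaultParallelism should reflect the N you passed, and the small job exercises the local threads:
sc.defaultParallelism                  # 10, i.e., the N in local[10]
rdd = sc.parallelize(range(1000))      # distribute a small dataset
rdd.map(lambda x: x * x).sum()         # 332833500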
You can run a program as if it were a regular Python program by (a) importing the pyspark module in your code, and (b) creating a Spark context with a master argument set to local[N]. That is, your program must include lines such as the following:
from pyspark import SparkContext
sc = SparkContext(master='local[N]', appName='NameOfProgram', sparkHome='/shared/apps/spark/spark-1.4.1-bin-hadoop2.4')
For example, the program WordCounter.py below:
import sys
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(master='local[10]', appName='WordCount')
    lines = sc.textFile(sys.argv[1])                       # input file
    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])                      # output directory
can be run on file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):
python WordCounter.py WarAndPeace.txt output.txt
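Keep in mind that saveAsTextFile writes a directory (here named output.txt) containing one part file per partition, not a single text file. A minimal sketch for peeking at the most frequent words afterwards, assuming the output directory name used above, could look like this:
import glob

# The pairs were sorted by count in descending order, so the most frequent
# words appear at the top of the first part file in the output directory.
first_part = sorted(glob.glob('output.txt/part-*'))[0]
with open(first_part) as f:
    for _ in range(10):
        print(f.readline().rstrip())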
To run a program called myprogram.py with spark-submit, type (on an interactive node):
spark-submit --master local[N] myprogram.py [arguments]
where N is the number of cores you want to use, and [arguments] are any arguments that you wish to pass to myprogram.py. Program myprogram.py must import pyspark and create a SparkContext to use Spark functionality, but does not need to specify a master when doing so.
For example, we modify WordCounter.py as follows:
import sys
from pyspark import SparkContext

if __name__ == '__main__':
    # No master is specified here; it is provided by spark-submit instead.
    sc = SparkContext(appName='WordCount')
    lines = sc.textFile(sys.argv[1])
    lines.flatMap(lambda s: s.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .sortBy(lambda pair: pair[1], ascending=False) \
         .saveAsTextFile(sys.argv[2])
We can then run it on file WarAndPeace.txt, writing its output to output.txt, by typing (on an interactive node):
spark-submit --master local[10] WordCounter.py WarAndPeace.txt output.txt
Note that the master parameter inside SparkContext does not need to be specified inside the program when running it through spark-submit. In fact, doing so can produce an error.
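If you prefer, the same portable pattern can also be written through a SparkConf object that sets only the application name and leaves the master to spark-submit; this is just a sketch of an equivalent way to create the context in myprogram.py:
from pyspark import SparkConf, SparkContext

# Set only the application name; the master (e.g., local[10]) is supplied
# on the command line by spark-submit, not hard-coded in the program.
conf = SparkConf().setAppName('WordCount')
sc = SparkContext(conf=conf)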
- Additional information on the pyspark shell
- Reserving a node in interactive mode on Discovery
- Launching a Spark standalone cluster on Discovery
Back to the Spark page
Back to main page.