
Build a Spark Standalone cluster on Kubernetes using MapR's packages


rcpbayindir/mapr-spark-kubernets-cluster

 
 


Deploy standalone Spark cluster on Kubernetes

Introduction

The artifacts in this repository help you deploy a standalone Spark cluster using MapR's packages. Each node in the cluster is configured as a MapR client. No data or sample scripts ship with the setup (yet). Once the cluster is up and running, you can open a shell on one of the nodes and run spark-shell or spark-submit with access to maprfs.

Architecture

Kubernetes decides where each Spark node runs, based on the resources requested and on what is available in your Kubernetes environment. Because the setup uses a custom Docker image, Kubernetes needs a local registry from which it can pull that image. Every node in the Kubernetes cluster must be able to reach the MapR cluster.

Kubernetes networking has some complexities that I will not address in this document. Make sure you select a Pod network add-on that supports communication and DNS resolution between pods and nodes (I am using Flannel).


Requirements

  • Kubernetes installed and working.
  • A Docker registry that is accessible from all Kubernetes nodes.
  • The artifacts downloaded to the Kubernetes master node.
  • A running Kubernetes Dashboard, with connectivity tested from a browser.

Configuration

Download and extract the artifacts.

Create and push the Docker image

Under the docker directory you will find the Dockerfile. Create the image following these steps.

  • cd docker
  • docker build -t mapr-spark:latest .
    • You can name the image whatever you want

It will take a few minutes to build.

Tag and push to your registry

  • docker tag mapr-spark:latest <registry-host>:<port>/mapr-spark:latest
  • docker push <registry-host>:<port>/mapr-spark:latest
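
The build, tag, and push steps can be scripted together. The following is a minimal sketch, not part of the repository: the registry host `registry.example.com` and port `5000` are placeholder assumptions, and `image_ref` is a hypothetical helper. The commands are printed rather than executed, so you can review them first.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your own registry host and port.
REGISTRY_HOST="${REGISTRY_HOST:-registry.example.com}"
REGISTRY_PORT="${REGISTRY_PORT:-5000}"
IMAGE_NAME="mapr-spark"
IMAGE_TAG="latest"

# Hypothetical helper: compose a reference of the form host:port/name:tag.
image_ref() {
  printf '%s:%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

REMOTE_IMAGE="$(image_ref "$REGISTRY_HOST" "$REGISTRY_PORT" "$IMAGE_NAME" "$IMAGE_TAG")"

# Dry run: print each command. Remove the leading 'echo' to execute it.
echo docker build -t "$IMAGE_NAME:$IMAGE_TAG" .
echo docker tag "$IMAGE_NAME:$IMAGE_TAG" "$REMOTE_IMAGE"
echo docker push "$REMOTE_IMAGE"
```

Run it from the docker directory so that the build context (`.`) contains the Dockerfile.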

Configure deployment

To customize the configuration, modify the variables in setup.sh. The file is self-documented. At a high level, this is what you need to configure:

  • The name of your Docker image
  • Information about the MapR cluster to connect to
    • Ticket, ssl_truststore, CLDB…

Run setup.sh to generate the YAML files.

Deployment

This configuration is meant to explore the Kubernetes functionality along with using MapR packages. Take the time to start components one at a time to make sure that services are operational.

NOTE: I append a random string to the pod names to facilitate the creation of multiple clusters with the same framework.

Deploy Spark Master

Deploy

kubectl apply -f spark_master.yaml

Confirm the pod/container is running (note the name of the pod)

kubectl --namespace=spark-cluster get pods

    NAME                         READY     STATUS    RESTARTS   AGE
    spark-mstr-bydkxg-rc-jmbhq   1/1       Running   0          1h

Open a command prompt

kubectl --namespace=spark-cluster exec -it spark-mstr-bydkxg-rc-jmbhq -- /bin/bash

Connect to the Spark UI

http://localhost:8001/api/v1/namespaces/spark-cluster/services/spark-mstr-bydkxg-rc-jmbhq:8080/proxy/

NOTE:

localhost assumes that you have tunneled to the Kubernetes master node.

Make sure you use the correct pod name.

You should get a response from the master, but no worker nodes.
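
Because the URL embeds the namespace, the randomized name, and the UI port, it is easy to mistype. Below is a small sketch of a hypothetical helper, `spark_ui_url`, that assembles it in the same shape as the URL above. It assumes the local endpoint listens on port 8001, which is the default port of `kubectl proxy`; pass a fourth argument if yours differs.

```shell
#!/bin/sh
# Hypothetical helper: build the API-server proxy URL for the Spark master UI.
# Arguments: namespace, resource name, UI port, local proxy port (default 8001).
spark_ui_url() {
  ns="$1"; name="$2"; ui_port="$3"; proxy_port="${4:-8001}"
  printf 'http://localhost:%s/api/v1/namespaces/%s/services/%s:%s/proxy/\n' \
    "$proxy_port" "$ns" "$name" "$ui_port"
}

# The name is copied from the example above; substitute your own.
spark_ui_url spark-cluster spark-mstr-bydkxg-rc-jmbhq 8080
```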


Deploy Spark Workers

Deploy (default config deploys 2 workers)

kubectl apply -f spark_worker.yaml

Confirm the pod/containers are running

kubectl --namespace=spark-cluster get pods

NAME                         READY     STATUS    RESTARTS   AGE
spark-mstr-jkchzi-rc-npvmd   1/1       Running   0          2m
spark-wrkr-jkchzi-rc-488nf   1/1       Running   0          13s
spark-wrkr-jkchzi-rc-r7k2w   1/1       Running   0          13s

Connect to the Spark UI

After a minute or so, you should see the worker nodes.


Use

Open a command prompt on the master node (get the name of the pod with get pods first)

kubectl --namespace=spark-cluster exec -it spark-mstr-jkchzi-rc-npvmd -- /bin/bash
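
Since the pod names carry a random suffix, looking the name up by prefix saves a copy-and-paste step. The sketch below defines a hypothetical helper, `first_running_pod`, that filters `kubectl get pods` output with awk. To stay runnable without a cluster, it reads a sample listing copied from this README; against a live cluster you would pipe `kubectl --namespace=spark-cluster get pods` into it instead.

```shell
#!/bin/sh
# Hypothetical helper: print the first Running pod whose name starts with a prefix.
# Reads 'kubectl get pods' style output on stdin and skips the header row.
first_running_pod() {
  awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 && $3 == "Running" { print $1; exit }'
}

# Sample listing copied from this README, standing in for live kubectl output.
sample_pods() {
  cat <<'EOF'
NAME                         READY     STATUS    RESTARTS   AGE
spark-mstr-jkchzi-rc-npvmd   1/1       Running   0          2m
spark-wrkr-jkchzi-rc-488nf   1/1       Running   0          13s
spark-wrkr-jkchzi-rc-r7k2w   1/1       Running   0          13s
EOF
}

MASTER_POD="$(sample_pods | first_running_pod spark-mstr)"

# Print the exec command for review instead of running it.
echo "kubectl --namespace=spark-cluster exec -it $MASTER_POD -- /bin/bash"
```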

Confirm you are connected to the MapR file system (maprfs)

root@spark-mstr-jkchzi-rc-npvmd:/# /opt/mapr/hadoop/hadoop-2.7.0/bin/hdfs dfs -ls /
Found 9 items
drwxr-xr-x   - mapr 5000          4 2019-01-09 21:24 /apps
drwxr-xr-x   - mapr 5000          0 2018-12-06 21:24 /hbase
drwxr-xr-x   - mapr 5000          0 2018-12-06 21:36 /opt
drwxr-xr-x   - mapr 5000          3 2019-01-11 18:12 /podvols
drwxr-xr-x   - mapr 5000          2 2019-01-09 18:37 /pv
drwxr-xr-x   - root root          0 2018-12-06 21:36 /tables
drwxrwxrwx   - mapr 5000          0 2018-12-06 21:24 /tmp
drwxr-xr-x   - root root          3 2018-12-07 01:12 /user
drwxr-xr-x   - mapr 5000          1 2018-12-06 21:24 /var

Run spark-shell

root@spark-mstr-jkchzi-rc-npvmd:/# /opt/mapr/spark/spark-2.3.1/bin/spark-shell
Warning: Unable to determine $DRILL_HOME
2019-01-16 22:19:47 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://spark-mstr-jkchzi-rc-npvmd:4040
Spark context available as 'sc' (master = local[*], app id = local-1547677196049).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1-mapr-1808
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
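
The banner above reports master = local[*], which means that this particular session runs entirely inside the pod where spark-shell was started. If you want the shell to use the workers, one option is to pass the standalone master's URL with the `--master` flag. The sketch below only prints such a command: the host name `spark-master.example` is a placeholder assumption (use whatever name or IP your master resolves to), `spark_master_url` is a hypothetical helper, and 7077 is Spark's default standalone master port, which may differ in your setup.

```shell
#!/bin/sh
# Hypothetical helper: compose a standalone master URL (default port 7077).
spark_master_url() {
  printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# Placeholder host (assumption): replace with the name or IP of your master.
MASTER_HOST="${MASTER_HOST:-spark-master.example}"
SPARK_HOME="/opt/mapr/spark/spark-2.3.1"

# Print the command for review instead of running it.
echo "$SPARK_HOME/bin/spark-shell --master $(spark_master_url "$MASTER_HOST")"
```

Once connected this way, the application should also appear in the master's web UI.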

Access data with Scala

scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1b1ea1d9

scala> val df = sqlcontext.read.json("/podvols/iotdata.json")
....
df: org.apache.spark.sql.DataFrame = [device: string, guid: string ... 7 more fields]

scala> df.show(1)
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
|device|                guid|metric|millis|sensor|      senstype|site|          timestamp|timezone|
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
|  dev1|6f3743c6-01e1-4fb...|1000.0|   588| sen/2|currentPresure|  S1|2019-01-11T10:04:03|   -0800|
+------+--------------------+------+------+------+--------------+----+-------------------+--------+
only showing top 1 row

scala> 

Delete deployment

The easiest way to remove all artifacts is to run

kubectl delete namespace spark-cluster

Use your own namespace name if you changed it in setup.sh.

Other fun stuff

Scale workers

Do you need more power? Add more workers…

Get the name of the worker replication controller. In this example it has 2 replicas running (2 Spark worker nodes)

kubectl --namespace=spark-cluster get rc

NAME                   DESIRED   CURRENT   READY     AGE
spark-mstr-jkchzi-rc   1         1         1         11m
spark-wrkr-jkchzi-rc   2         2         2         2m

Add a replica (one more worker node).

kubectl --namespace=spark-cluster scale --current-replicas=2 --replicas=3 rc/spark-wrkr-jkchzi-rc

replicationcontroller "spark-wrkr-jkchzi-rc" scaled

You should now see one more worker node

kubectl --namespace=spark-cluster get rc

NAME                   DESIRED   CURRENT   READY     AGE
spark-mstr-jkchzi-rc   1         1         1         18m
spark-wrkr-jkchzi-rc   3         3         3         9m
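
The scale command needs the current replica count for --current-replicas, so the two steps can be combined. The sketch below uses a hypothetical helper, `desired_replicas`, to read the DESIRED column from `kubectl get rc` output, then prints a scale command that adds one worker. It reads a sample listing copied from this README; against a live cluster you would pipe `kubectl --namespace=spark-cluster get rc` into the helper instead.

```shell
#!/bin/sh
# Hypothetical helper: print the DESIRED column for one replication controller.
# Reads 'kubectl get rc' style output on stdin and skips the header row.
desired_replicas() {
  awk -v rc="$1" 'NR > 1 && $1 == rc { print $2; exit }'
}

# Sample listing copied from this README, standing in for live kubectl output.
sample_rc() {
  cat <<'EOF'
NAME                   DESIRED   CURRENT   READY     AGE
spark-mstr-jkchzi-rc   1         1         1         11m
spark-wrkr-jkchzi-rc   2         2         2         2m
EOF
}

RC_NAME="spark-wrkr-jkchzi-rc"
CURRENT="$(sample_rc | desired_replicas "$RC_NAME")"
TARGET=$((CURRENT + 1))

# Print the scale command for review instead of running it.
echo "kubectl --namespace=spark-cluster scale --current-replicas=$CURRENT --replicas=$TARGET rc/$RC_NAME"
```

Passing --current-replicas makes the scale request a precondition check: it is applied only if the controller still has that many replicas.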


References

Install cluster with Kubeadm

https://kubernetes.io/docs/setup/independent/install-kubeadm/

https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
