michalrudko/jpmml-sparkml-package

JPMML-SparkML-Package

JPMML-SparkML as an Apache Spark Package.

Prerequisites

Installation

Clone the JPMML-SparkML-Package project and enter its directory:

git clone https://github.com/jpmml/jpmml-sparkml-package.git
cd jpmml-sparkml-package

When targeting Apache Spark 1.6.X, check out the spark-1.6.X development branch:

git checkout spark-1.6.X

Scala

Build the project:

mvn clean package

The build produces an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.

PySpark

Add the Python bindings of Apache Spark to the PYTHONPATH environment variable:

export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
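
Before building, it can help to confirm that the export above took effect. The following is a minimal sketch, not part of the project: the helper name `module_available` is hypothetical, and it only checks that a module can be located on the import path, not that it imports cleanly.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be located on the import path.

    find_spec only locates the module; it does not import or execute it.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("pyspark"):
        print("pyspark is on the import path")
    else:
        print("pyspark not found; check SPARK_HOME and PYTHONPATH")
```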

Build the project using the pyspark profile:

mvn -Ppyspark clean package

The build produces an EGG file target/jpmml_sparkml-1.1rc0.egg and an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.
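
To check that both artifacts were actually produced, a small script can look for them in the target directory. This is a sketch under the assumption that the file names follow the patterns quoted above; the helper name `find_artifacts` is hypothetical, and wildcards are used so the check does not depend on the exact version string.

```python
import glob
import os


def find_artifacts(target_dir="target"):
    """Return (jars, eggs): sorted lists of matching paths found in target_dir."""
    # Patterns are anchored at the start of the file name, so other JAR files
    # in the same directory with a different prefix are not picked up.
    jars = sorted(glob.glob(os.path.join(target_dir, "jpmml-sparkml-package-*.jar")))
    eggs = sorted(glob.glob(os.path.join(target_dir, "jpmml_sparkml-*.egg")))
    return jars, eggs


if __name__ == "__main__":
    jars, eggs = find_artifacts()
    print("uber-JAR files:", jars or "none found")
    print("EGG files:", eggs or "none found")
```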

Test the uber-JAR file:

cd src/main/python
nosetests

Usage

Scala

Launch the Spark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:

spark-shell --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar

Fitting an example pipeline model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Iris.csv")

val formula = new RFormula().setFormula("Species ~ .")
val classifier = new DecisionTreeClassifier()
val pipeline = new Pipeline().setStages(Array(formula, classifier))
val pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML byte array:

val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
println(new String(pmmlBytes, "UTF-8"))

PySpark

Add the EGG file to the PYTHONPATH environment variable:

export PYTHONPATH=$PYTHONPATH:/path/to/jpmml-sparkml-package/target/jpmml_sparkml-1.1rc0.egg

Launch the PySpark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:

pyspark --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar
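
When running a standalone script instead of the interactive shell, the `spark.jars` Spark property serves the same purpose as `--jars`. The sketch below assumes Apache Spark 2.0.X or later (`SparkSession` does not exist in 1.6.X) and that no Spark session has been started yet. The helper names `jpmml_spark_conf` and `create_session` and the application name are hypothetical.

```python
def jpmml_spark_conf(jar_path):
    """Spark properties that put the uber-JAR on the classpath, like --jars does."""
    return {"spark.jars": jar_path}


def create_session(jar_path, app_name="jpmml-sparkml-example"):
    """Create (or reuse) a SparkSession configured with the uber-JAR."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in jpmml_spark_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The EGG file still has to be on the PYTHONPATH, as described above, for `jpmml_sparkml` to be importable in the script.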

Fitting an example pipeline model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML byte array:

from jpmml_sparkml import toPMMLBytes

pmmlBytes = toPMMLBytes(sc, df, pipelineModel)
print(pmmlBytes.decode("UTF-8"))
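
Rather than printing the document, the byte array can be written to a file. The sketch below is not part of the package: it assumes the exported bytes are a well-formed XML document whose root element is `PMML`, and checks that before saving. The helper names `pmml_root_name` and `save_pmml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def pmml_root_name(pmml_bytes):
    """Parse the bytes as XML and return the local name of the root element."""
    root = ET.fromstring(bytes(pmml_bytes))
    # ElementTree renders namespaced tags as "{namespace}local"; keep the local part.
    return root.tag.rsplit("}", 1)[-1]


def save_pmml(pmml_bytes, path):
    """Write the bytes to `path` after checking that the root element is PMML."""
    name = pmml_root_name(pmml_bytes)
    if name != "PMML":
        raise ValueError("unexpected root element: " + name)
    with open(path, "wb") as f:
        f.write(bytes(pmml_bytes))
    return path
```

With the variables from the example above, a call such as `save_pmml(pmmlBytes, "DecisionTreeIris.pmml")` would store the document; the file name is only an illustration.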

License

JPMML-SparkML-Package is licensed under the GNU Affero General Public License (AGPL) version 3.0. Other licenses are available on request.

Additional information

Please contact [email protected]
