JPMML-SparkML as an Apache Spark Package.
- Apache Spark 1.6.X or 2.0.X.
Clone the JPMML-SparkML-Package project and enter its directory:
git clone https://github.com/jpmml/jpmml-sparkml-package.git
cd jpmml-sparkml-package
When targeting Apache Spark 1.6.X, check out the spark-1.6.X development branch:
git checkout spark-1.6.X
Build the project:
mvn clean package
The build produces an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.
Add the Python bindings of Apache Spark to the PYTHONPATH environment variable:
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
Build the project using the pyspark profile:
mvn -Ppyspark clean package
The build produces an EGG file target/jpmml_sparkml-1.1rc0.egg and an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.
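The artifact file names embed version numbers, so they may differ between branches and releases. The following is a minimal sketch, not part of the project, that locates both artifacts by pattern. The helper name `find_build_artifacts` and the glob patterns are assumptions derived from the file names shown above.

```python
import glob
import os


def find_build_artifacts(target_dir):
    """Return (egg_path, jar_path) found in target_dir; either may be None.

    Hypothetical helper: the patterns are inferred from the artifact names
    in this README and may need adjusting for other versions.
    """
    eggs = sorted(glob.glob(os.path.join(target_dir, "jpmml_sparkml-*.egg")))
    jars = sorted(glob.glob(os.path.join(target_dir, "jpmml-sparkml-package-*.jar")))
    # If several files match, take the last one in lexicographic order.
    egg = eggs[-1] if eggs else None
    jar = jars[-1] if jars else None
    return egg, jar
```

For example, `find_build_artifacts("target")` can be called from the project directory after the build has finished.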
Test the uber-JAR file:
cd src/main/python
nosetests
Launch the Spark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:
spark-shell --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar
Fitting an example pipeline model:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Iris.csv")
val formula = new RFormula().setFormula("Species ~ .")
val classifier = new DecisionTreeClassifier()
val pipeline = new Pipeline().setStages(Array(formula, classifier))
val pipelineModel = pipeline.fit(df)
Exporting the fitted example pipeline model to a PMML byte array:
val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
println(new String(pmmlBytes, "UTF-8"))
Add the EGG file to the PYTHONPATH environment variable:
export PYTHONPATH=$PYTHONPATH:/path/to/jpmml-sparkml-package/target/jpmml_sparkml-1.1rc0.egg
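When the environment variable cannot be set before the interpreter starts (for example, in a notebook), appending the EGG file to `sys.path` has a comparable effect for the current Python process only. This is a minimal sketch; the helper name `add_egg_to_path` is an assumption and is not part of JPMML-SparkML-Package.

```python
import os
import sys


def add_egg_to_path(egg_path):
    """Make the EGG file importable in the current interpreter.

    Hypothetical helper; it only changes sys.path of this process.
    """
    egg_path = os.path.abspath(egg_path)
    if not os.path.isfile(egg_path):
        raise IOError("EGG file not found: " + egg_path)
    # Avoid adding the same entry twice on repeated calls.
    if egg_path not in sys.path:
        sys.path.append(egg_path)
    return egg_path
```

After calling `add_egg_to_path("/path/to/jpmml-sparkml-package/target/jpmml_sparkml-1.1rc0.egg")`, the `jpmml_sparkml` module should be importable in that session.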
Launch the PySpark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:
pyspark --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar
Fitting an example pipeline model:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula
df = spark.read.csv("Iris.csv", header = True, inferSchema = True)
formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)
Exporting the fitted example pipeline model to a PMML byte array:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, df, pipelineModel)
print(pmmlBytes.decode("UTF-8"))
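Printing is convenient for inspection, but the byte array is usually saved to a file. The following is a minimal sketch that uses only the Python standard library; the helper name `save_pmml` and the output file name are assumptions. It parses the bytes first, so a malformed document is rejected before anything is written.

```python
import xml.etree.ElementTree as ET


def save_pmml(pmml_bytes, path):
    """Write PMML bytes to path; return the root tag and its version attribute.

    Hypothetical helper, not part of JPMML-SparkML-Package.
    """
    data = bytes(pmml_bytes)  # also accepts a bytearray
    root = ET.fromstring(data)  # raises ParseError if the XML is malformed
    with open(path, "wb") as f:
        f.write(data)
    # The tag includes the XML namespace in braces, e.g. "{namespace}PMML".
    return root.tag, root.get("version")
```

In the PySpark shell session above, this could be called as `save_pmml(pmmlBytes, "DecisionTreeIris.pmml")`.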
JPMML-SparkML-Package is licensed under the GNU Affero General Public License (AGPL) version 3.0. Other licenses are available on request.
Please contact [email protected]