Aster Analytics: Human Friendly Analytics - A Journey in Spark

Teradata Employee

I have been spending a great deal of time lately learning about Spark, and don't get me wrong, it is very neat. Coding in Python and/or Scala is neat, and I am not anti-Spark or anything of the sort. But when it comes time to prepare data for analytics, what would you like to use? I guess it all depends on personal preference and what you know best. From my perspective, I want to use the easiest interface and programming language I can. If you are using pure Apache Spark, this is what you are presented with:

spark.JPG

Now, what the Spark examples don't show you is all the work it takes to get your data into the LIBSVM format. It is not really that complicated, but it still takes some work. You have to decide: do I do that preparation in Spark or in some other place? Probably some other place.

With the data already in that format, this is one approach you can take to perform a decision tree analysis in Spark:

from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the data stored in LIBSVM format as a DataFrame.
# Note: sqlContext is predefined in the pyspark shell and in Databricks notebooks.
# ORIGINAL IS BELOW
# data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
data = sqlContext.read.format("libsvm").load("/FileStore/tables/xrivu6ya1460332730046/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(
    inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(120)

# Select (prediction, true label) and compute test error.
# Note: metricName="precision" works on Spark 1.x; on Spark 2.0+ use metricName="accuracy".
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

# The fitted decision tree is the third pipeline stage.
treeModel = model.stages[2]

# summary only
print("PRINT:  treeModel  ")
print(treeModel)
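To give a feel for the LIBSVM preparation step mentioned before the listing, here is a minimal sketch in plain Python. LIBSVM text format is one row per line, written as a label followed by `index:value` pairs with 1-based, ascending indices and zero-valued features left out. The helper name `to_libsvm_line` and the sample rows are hypothetical, made up for illustration; real data would also need categorical columns encoded as numbers first.

```python
def to_libsvm_line(label, features):
    """Format one row as a LIBSVM line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are omitted.
    """
    pairs = [
        "%d:%r" % (index, float(value))
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([repr(float(label))] + pairs)


# Hypothetical rows, each a (label, dense feature list) pair.
rows = [
    (1, [0.0, 2.5, 0.0, 7.0]),
    (0, [1.0, 0.0, 0.0, 0.0]),
]

libsvm_lines = [to_libsvm_line(label, feats) for label, feats in rows]
print("\n".join(libsvm_lines))
```

Writing `libsvm_lines` to a text file would give you something the `read.format("libsvm")` call above can load. It is not hard, but it is one more step you have to own before the modeling even starts.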

To perform this type of analytic in Aster, you would likewise have to prepare your data. You can do that preparation in Aster using ANSI SQL and SQL-MR functions. You can then use our simple SQL-MR interface (Forest_Drive and Forest_Predict):

SELECT * FROM Forest_Drive(
    ON (SELECT 1)
    PARTITION BY 1
    InputTable('carprices')
    OutputTable('my_model')
    TreeType('classification')
    Response('price_class')
    NumericInputs('cylinders','horsepower','city_mpg','highway_mpg',
                  'weight','wheelbase','length','width')
    CategoricalInputs('sports','suv','wagon','minivan','pickup',
                      'awd','rwd')
    MaxDepth(4)
    MinNodeSize(5)
    Variance(0.05)
    NumTrees(10)
);

--  APPLY THE MODEL FROM THE PREVIOUS FOREST DRIVE FUNCTION CALL
SELECT car_name, prediction AS predicted_class
FROM Forest_Predict(
    ON carprices
    Forest('my_model')
    NumericInputs('cylinders','horsepower','city_mpg','highway_mpg',
                  'weight','wheelbase','length','width')
    CategoricalInputs('sports','suv','wagon','minivan','pickup',
                      'awd','rwd')
    IDCol('car_name')
);

Output

Then, with Teradata AppCenter, I can make the whole experience interactive and HUMAN READY!

I could create an AppCenter application to wrap the entire solution up and visualize it:

tree.JPG