I have been spending a great deal of time lately learning about Spark, and don't get me wrong, it is very neat. Coding in Python and/or Scala is enjoyable, and I am not anti-Spark or anything of the sort. But when it comes time to prepare data for analytics, what would you rather use? It largely depends on personal preference and what you know best. From my perspective, I want to use the easiest interface and programming language I can. If you are using pure Apache Spark, this is what you are presented with:
To perform a decision tree analysis in Spark, this is one approach you can take:
What they don't show you is all the work it takes to get your data into the LIBSVM format in the first place. It is not especially complicated, but it still takes effort, and you have to decide: do I do that preparation in Spark or somewhere else? Probably somewhere else.
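To make that preparation step concrete, here is a minimal sketch of what writing LIBSVM data looks like. LIBSVM is just one text line per record of the form "label index:value index:value ...", with 1-based feature indices and zero values typically omitted. The names here (to_libsvm_line, rows) are illustrative, not part of any library:

```python
def to_libsvm_line(label, features):
    """Format one labeled feature vector as a LIBSVM text line.

    Feature indices are 1-based and zero-valued features are skipped,
    which is what makes the format sparse.
    """
    pairs = ["%d:%g" % (i + 1, v) for i, v in enumerate(features) if v != 0]
    return " ".join([str(label)] + pairs)

# Hypothetical raw rows: (label, dense feature vector)
rows = [(0, [0.0, 1.5, 0.0, 2.0]),
        (1, [3.0, 0.0, 0.0, 0.5])]
for label, features in rows:
    print(to_libsvm_line(label, features))
# -> 0 2:1.5 4:2
# -> 1 1:3 4:0.5
```

In practice you would write these lines to a file that sqlContext.read.format("libsvm") can then load.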
from pyspark import SparkContext
from pyspark.sql import SQLContext
# In a Databricks notebook, sc and sqlContext are already defined;
# elsewhere, create them explicitly:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load the data stored in LIBSVM format as a DataFrame.
# ORIGINAL IS BELOW
# data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
data = sqlContext.read.format("libsvm").load("/FileStore/tables/xrivu6ya1460332730046/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(120)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
treeModel = model.stages[2]
# summary only
print(treeModel)
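The evaluation at the end reduces to simple arithmetic: compare the predicted label to the true label row by row. A plain-Python sketch of what the evaluator computes, using made-up labels standing in for the "prediction" and "indexedLabel" columns:

```python
# Hypothetical predicted vs. true labels for eight test rows.
predicted = [0, 1, 1, 0, 1, 0, 0, 1]
actual    = [0, 1, 0, 0, 1, 1, 0, 1]

# Accuracy = fraction of rows where prediction matches the true label.
correct = sum(1 for p, a in zip(predicted, actual) if p == a)
accuracy = correct / float(len(actual))
print("Test Error = %g" % (1.0 - accuracy))  # 6 of 8 correct -> 0.25
```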
To perform this type of analysis in Aster, you would need to prepare your data in a similar format. You can do that preparation in Aster using ANSI SQL and SQL-MR functions, and then use the simple SQL-MR interface: Forest_Drive and Forest_Predict.
SELECT * FROM Forest_Drive(
  ON (SELECT 1)
  PARTITION BY 1
  -- ... (remaining function arguments omitted)
);

-- APPLY THE MODEL FROM THE PREVIOUS Forest_Drive CALL (Forest_Predict)
SELECT car_name, prediction AS predicted_class
-- ... (remainder of the Forest_Predict query omitted)
Then, with Teradata AppCenter, I can make the whole experience interactive and HUMAN READY: I could create an AppCenter application to wrap the entire solution up and visualize it.