Connecting Aster to Spark - An Example Workflow

Companies today use a variety of tools for their analytic needs. For example, some Teradata customers who use Aster also use Spark. They'll be happy to hear that Aster now has a new connector called RunOnSpark.

RunOnSpark

Several functions from Spark's MLlib and ML libraries can be called from RunOnSpark. An API and a set of pre-built wrappers are installed along with the RunOnSpark function, so it's easy to get started. What's especially nice for the Aster user is that the entire workflow is driven from Aster. For the Scala programmer, custom wrappers give access to additional Spark functionality.

Example with Spark ML's MultilayerPerceptronClassifier (MLP)

Process Overview:

[Figure: Process Overview]

To train a classification model using MLP, the RunOnSpark function calls MLPCTrainDF. The query must supply the following parameters:

 

  • modelLocation - HDFS path where the model will be saved
  • layers - Four integer values representing:
      • Number of input variables
      • Number of elements in first hidden layer
      • Number of elements in second hidden layer
      • Number of classes as output
  • labelCol - Name of column containing class labels (this is optional if the default "label" is the column name)
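
To make the layers parameter concrete, here is a small Python sketch that assembles the four values in the order listed above and counts the weights such a network would have. It is not part of RunOnSpark. The function names and the example sizes (4 inputs, hidden layers of 5 and 4 elements, 3 classes) are made up for illustration, and the count assumes a fully connected network with one bias per element.

```python
def build_layers(n_inputs, hidden1, hidden2, n_classes):
    """Return the four layer sizes in input-to-output order."""
    layers = [n_inputs, hidden1, hidden2, n_classes]
    if any((not isinstance(n, int)) or n < 1 for n in layers):
        raise ValueError("every layer size must be a positive integer")
    return layers


def count_parameters(layers):
    """Count weights and biases of a fully connected network (assumed topology)."""
    # Each element has one weight per element in the previous layer, plus one bias.
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# Hypothetical sizes: 4 input variables, hidden layers of 5 and 4, 3 classes.
layers = build_layers(n_inputs=4, hidden1=5, hidden2=4, n_classes=3)
print(layers)                    # [4, 5, 4, 3]
print(count_parameters(layers))  # 64
```

The first value should match the number of independent variables in the training table, and the last should match the number of distinct class labels.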

 

 

A number of other parameters can be specified either in a configuration file or in the query. They cover data transfer, Spark job management, and security. When the configuration file includes all of the mandatory information, the query syntax is short and simple:

 

[Figure: MLP train query syntax]

 

The training data contains the class-label column and the independent variables. Data from the Aster table is transferred to Spark and formatted as a data frame or an RDD for input, depending on the function. MLPCTrainDF formats the data as a data frame. While the job is running, its status can be viewed in Ambari like any regular Spark job:

 

[Figure: RunOnSpark in Ambari]

When the job finishes in Spark, MLPCTrainDF returns to Aster the number of training records that were predicted correctly and the number that were predicted incorrectly.
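
As a rough illustration of what those two counts mean, the Python sketch below compares a handful of made-up labels with made-up predictions. It is not how MLPCTrainDF computes its result; the function name and the sample data are hypothetical.

```python
def prediction_counts(labels, predictions):
    """Return (correct, incorrect) counts for paired labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct, len(labels) - correct


# Made-up class labels and predictions for six training records.
labels = [0, 1, 2, 1, 0, 2]
predictions = [0, 1, 1, 1, 0, 2]

correct, incorrect = prediction_counts(labels, predictions)
print(correct, incorrect)  # 5 1
```

Dividing the first count by the total number of records gives the accuracy on the training data.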

 

MLPCTrainDF saves the trained classification model in the provided HDFS location. The model can then be used to make predictions on new records by calling MLPCRunDF. In this query, only modelLocation is required. MLPCRunDF returns the predicted class for each record in the table to Aster.

 

[Figure: Predict MLP query syntax]

 

Just like other Aster functions, the query input for RunOnSpark can be parameterized in AppCenter for easier use. Here's an example with MLPCTrainDF:

 

[Figure: AppCenter example]

Custom wrappers

Functionality in Spark's machine learning libraries continues to grow, and there are other interesting libraries as well. The API installed with the function offers a set of classes and methods that allow a Scala programmer to write custom wrappers.

 

A simple example is shown here, in which the wrapper just returns the input data to Aster:

 

[Figure: UserEcho custom wrapper (Scala)]

 

Custom wrappers need to be built and packaged using sbt or a similar build tool. The resulting .jar file is then referenced in RunOnSpark with the app_resource parameter.

 

[Figure: Custom wrapper query syntax]

 

Whether using pre-built or custom wrappers, RunOnSpark offers expanded functionality in a familiar Aster environment.

1 Comment

In big data situations, be careful about transferring large volumes of data from Aster to Spark in this manner! To support this architecture, the network infrastructure bridging the two should be based on BYNET or InfiniBand.