Companies today use a variety of tools for their analytic needs. As an example, some of Teradata's customers using Aster also use Spark. They'll be happy to hear we have a new connector in Aster called RunOnSpark.
Several functions from Spark's MLlib and ML libraries can be called from RunOnSpark. An API and a set of pre-built wrappers are installed along with the RunOnSpark function, so it's easy to get started. What's especially nice for the Aster user is that the entire workflow is done in Aster. And for the Scala programmer, custom wrappers allow additional access to functionality on Spark.
Example with Spark ML's MultilayerPerceptronClassifier (MLP)
To train a classification model using MLP, the RunOnSpark function calls MLPCTrainDF. There are a few required bits of information to include in the query parameters:
Number of input variables
Number of elements in first hidden layer
Number of elements in second hidden layer
Number of classes as output
Additional parameters, covering data transfer, Spark job management, and security, can be specified either in a configuration file or directly in the query. When the configuration file supplies all of the mandatory information, the query itself stays short and simple:
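The original query isn't reproduced here, so the following is a minimal sketch of what a training call might look like. The SPARKCODE clause, the table name, and the individual parameter names are illustrative assumptions; check the RunOnSpark documentation for the exact spelling in your release.

```sql
-- Hypothetical sketch: parameter names and table are assumptions
SELECT *
FROM RunOnSpark(
  ON mlp_train_data                 -- Aster table with label + feature columns
  SPARKCODE('MLPCTrainDF
    inputCols=11                    -- number of input variables
    hiddenLayer1=10                 -- elements in first hidden layer
    hiddenLayer2=8                  -- elements in second hidden layer
    numClasses=3                    -- number of classes as output
    modelLocation=/models/mlp')     -- HDFS path where the model is saved
);
```

Everything else (connection details, Spark job settings, security) comes from the configuration file in this form of the call.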
The training data contains the column with the class labels and the independent variables. Data from the Aster table is transferred to Spark and formatted as a data frame or RDD for input, depending on the function. MLPCTrainDF formats the data as a data frame, and while the job runs, its status can be viewed in Ambari like any regular Spark job.
When the job finishes in Spark, MLPCTrainDF returns to Aster the counts of correctly and incorrectly predicted training records.
MLPCTrainDF saves the trained classification model to the specified HDFS location. The model can then be used to make predictions on new records by calling MLPCRunDF. In this query, only the modelLocation parameter is required. MLPCRunDF returns the predicted class for each record in the table back to Aster.
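A scoring call might look like the sketch below. As before, the exact clause and parameter spelling are assumptions; only modelLocation is required per the description above.

```sql
-- Hypothetical sketch: table name and SPARKCODE syntax are assumptions
SELECT *
FROM RunOnSpark(
  ON new_records                    -- Aster table of unscored records
  SPARKCODE('MLPCRunDF
    modelLocation=/models/mlp')     -- HDFS path of the trained model
);
```

The result set returned to Aster contains the predicted class for each input record, so it can be joined or inserted like any other query output.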
Just like other Aster functions, the query input for RunOnSpark can be parameterized in AppCenter for easier use, for example with MLPCTrainDF.
Functionality in Spark's machine learning libraries continues to grow, and there are other interesting libraries as well. The API installed with the function offers a set of classes and methods that allow a Scala programmer to write custom wrappers.
A simple example is shown here, where just the input data is returned back to Aster:
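The original listing isn't reproduced here, so this is a minimal sketch of such an identity wrapper. The object and method names are hypothetical; the actual entry-point signature comes from the API installed with RunOnSpark and may differ.

```scala
// Hypothetical sketch: class and method names are assumptions, not the
// actual RunOnSpark wrapper API.
import org.apache.spark.sql.{DataFrame, SparkSession}

object EchoWrapper {
  // RunOnSpark hands the wrapper the rows transferred from Aster as a
  // DataFrame; the DataFrame the wrapper returns is sent back to Aster.
  def run(spark: SparkSession, input: DataFrame): DataFrame = {
    input  // identity: return the input data unchanged
  }
}
```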
Custom wrappers need to be built and packaged with sbt or a similar build tool, and the resulting .jar file is referenced in RunOnSpark via the app_resource parameter.
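For packaging, a minimal build definition could look like the sketch below; the project name and the Spark and Scala versions are illustrative and should match your cluster.

```scala
// build.sbt -- minimal sketch; name and versions are assumptions
name := "echo-wrapper"
scalaVersion := "2.11.8"
libraryDependencies +=
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"  // supplied by the cluster
```

Running `sbt package` produces the .jar under target/, which is then passed to RunOnSpark through app_resource.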
Whether using pre-built or custom wrappers, RunOnSpark offers expanded functionality in a familiar Aster environment.