Develop SQL MR function using AsterData Developer Express, part 1: project setup

Aster
Teradata Aster is an analytic platform that embeds MapReduce analytic processing with data stores. •Embedded MapReduce analytic processing and unique SQL-MapReduce® framework •Massively parallel data store for multistructured data •Intuitive tools and SQL-MapReduce libraries for rapid analytic development
Enthusiast

Develop SQL MR function using AsterData Developer Express, part 1: project setup

We will try to parse Apache access log right now and see how badly structured data can be trasfromed into SQL-like table. I suppose you did already install  Aster Developer Express and ready to create your first Aster project.

Parsing Apache access log

Let's imagine the situation: we have a huge cluster of web servers and we need to collect and analyze web logs. We need to gather access logs from each server and upload them to our AsterData nCluster.

See the sample of Apache access log. It depends on logger settings and may vary. See two sample lines:

dyn-530.optus.com.au - - [22/Apr/2009:18:52:51 +1200] "GET /images/photos/5332.jpg HTTP/1.1" 200 244 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)" "-"

dhcp-465.telstra.com.au - - [22/Apr/2009:18:52:51 +1200] "GET /gallery.php?section=entertainment HTTP/1.0" 200 58636 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13" "-"

While reading apache logs documentation we can see that it's possible to extract these useful fields:

client address:  dyn-530.optus.com.au

client request timestamp:  [22/Apr/2009:18:52:51 +1200]

GET URL:  GET /images/photos/5332.jpg

response code:  200

content length:  58636 (bytes)

user agent info:  Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13

The idea is to load raw log data into staging table and then parse it (get these useful fileds) using custom SQL-MR function. 

Prepare DB for loading Apache access log data

I suggest you to use Eclipse Data source expolorer. It will be very convinient to write implementation of custom SQL-MR function and issue queries in the next window. Here is a good article which explains how to setup and use Eclipse built-in Data source explorer


--create schema for staging and error table
--staging table will get input data, error table will get rejected records
--we can see later these records and understand what’s wrong with them.
CREATE SCHEMA stage;

--Create staging table to hold Web Log data
DROP TABLE IF EXISTS stage.web_loglines;
CREATE FACT TABLE stage.web_loglines(
logline character varying)
DISTRIBUTE BY HASH (logline);

--connect as db_superuser/db_superuser and grant all on error base table
grant all on nc_system.nc_errortable_part to beehive

--Create Error table in Stage Schema to trap load errors
DROP TABLE IF EXISTS stage.loader_error_table;
CREATE TABLE stage.loader_error_table()
INHERITS (nc_errortable_part);
--grant access to error table to beehive
grant all on stage.loader_error_table to beehive

Now It's time to load Apache access log data. We'll use ACT (see all three articles if haven't done it before). I suppose that you have uploaded text file with data to Queen node.

aster@linux-qvsn:/home/datasets> ncluster_loader -d beehive -U beehive -w beehive --el-enabled --el-label LOG_LOAD --el-table stage.loader_error_table stage.web_loglines weblog.txt

The output will be:

Loading tuples using node 'localhost'.

3260 tuples were successfully loaded into table 'stage.web_loglines'.

aster@linux-qvsn:/home/datasets> 

Your dataset will consist of few hundred lines. Let's issue some SQL queires using Datasource explorer and check that data has been uploaded. 

select count(*) from stage.loader_error_table; --0
select count(*) from stage.web_loglines; --3260

Creating a project

Open Eclipse and select Aster SQL-MapReduce Project

Setup the project

See the right way to configure connection with the nCluster. You sohuld specify username, password, database. All these things have the same value by default: beehive. See that my ip address differs from default one. I did add one more network interface to Queen node. So I Queen node has two network interfaces with two addresses: 192.168.100.100 (default) and 192.168.1.38 You can use default address. You don't need to add network interface. 

Get Java libraries from server and add them

In order to write Aster SQL-MapReduce functions you have to get several Java libraries from Queen node. Here they are:

/home/beehive/bin/lib/sqlmr/ncluster-sqlmr-api.jar

/home/beehive/bin/lib/sqlmr/ncluster-sqlmr-runnercommon.jar      

/home/beehive/bin/lib/sqlmr/ncluster-api-util.jar 

/home/beehive/bin/lib/sqlmr/ncluster-sqlmr-testrunner.jar

You can use WinSCP and copy libraries from Queen node.

Now add these libraries to project. Right click project, select "Properties", then select "Java build path". Add libraries by clicking "Add external JARs.."

Add test folder

We need a test folder. There we'll write  tests for our SQL-MapReduce function

Right click on project, select "New" then select "Folder"

Now we need to add "test" folder to the project build path. 

Right click the project, select "Properties"

The select "Java build path", select "Source", click button "Add Folder" mark folder named "test"

Add junit4

Go to project "Properties". We need special library to run our test. Junit is a standard library for writing and running tests in Java.

We are going to start to develop our function in the next chapter!