We will now try to parse an Apache access log and see how badly structured data can be transformed into a SQL-like table. I assume you have already installed Aster Developer Express and are ready to create your first Aster project.
Let's imagine the situation: we have a huge cluster of web servers, and we need to collect and analyze their web logs. We need to gather the access logs from each server and upload them to our Aster Data nCluster.
Here is a sample line from an Apache access log; the exact format depends on the logger settings and may vary:
dyn-530.optus.com.au - - [22/Apr/2009:18:52:51 +1200] "GET /images/photos/5332.jpg HTTP/1.1" 200 244 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)" "-"
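Before writing the SQL-MR function, it helps to see how such a line breaks into fields. Here is a minimal sketch in Java (the language we will use for the SQL-MR function later) that dissects a combined-format line with a regular expression. The class name and field layout are my own illustration, not part of the Aster API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Combined Log Format: host, identity, user, timestamp, request,
    // status, bytes, referer, user-agent
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" " +
        "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\".*$");

    // Returns the nine captured fields, or null for a malformed line
    // (the SQL-MR function will likewise reject lines it cannot parse).
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }
}
```

For the sample line above, `parse` yields the host `dyn-530.optus.com.au`, the timestamp `22/Apr/2009:18:52:51 +1200`, the request `GET /images/photos/5332.jpg HTTP/1.1`, the status `200`, and so on.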
I suggest using the Eclipse Data Source Explorer. It is very convenient for writing the implementation of a custom SQL-MR function and issuing queries in the next window. Here is a good article which explains how to set up and use Eclipse's built-in Data Source Explorer.
--Create a schema for the staging and error tables.
--The staging table will receive the input data; the error table will collect rejected records,
--so we can inspect them later and understand what's wrong with them.
CREATE SCHEMA stage;
--Create staging table to hold Web Log data
DROP TABLE IF EXISTS stage.web_loglines;
CREATE FACT TABLE stage.web_loglines(
logline character varying)
DISTRIBUTE BY HASH (logline);
--connect as db_superuser/db_superuser and grant all on the error base table
grant all on nc_system.nc_errortable_part to beehive;
--Create Error table in Stage Schema to trap load errors
DROP TABLE IF EXISTS stage.loader_error_table;
CREATE TABLE stage.loader_error_table();
--grant access on the error table to beehive
grant all on stage.loader_error_table to beehive;
Now it's time to load the Apache access log data. We'll use ACT (see all three articles first if you haven't done so before). I assume you have uploaded the text file with the data to the Queen node.
aster@linux-qvsn:/home/datasets> ncluster_loader -d beehive -U beehive -w beehive --el-enabled --el-label LOG_LOAD --el-table stage.loader_error_table stage.web_loglines weblog.txt
The output will be:
Loading tuples using node 'localhost'.
3260 tuples were successfully loaded into table 'stage.web_loglines'.
Your dataset will consist of a few thousand lines. Let's issue some SQL queries using the Data Source Explorer and check that the data has been loaded.
select count(*) from stage.loader_error_table; --0
select count(*) from stage.web_loglines; --3260
Open Eclipse and create a new Aster SQL-MapReduce Project.
Here is the right way to configure the connection to the nCluster. You should specify the username, password, and database; by default, all three have the same value: beehive. Note that my IP address differs from the default one: I added one more network interface to the Queen node, so it now has two network interfaces with two addresses, 192.168.100.100 (the default) and 192.168.1.38. You can use the default address; you don't need to add a network interface.
In order to write Aster SQL-MapReduce functions you need several Java libraries from the Queen node. Here they are:
You can use WinSCP to copy the libraries from the Queen node.
Now add these libraries to the project. Right-click the project, select "Properties", then select "Java Build Path". Add the libraries by clicking "Add External JARs..."
We need a test folder, where we'll write tests for our SQL-MapReduce function.
Right-click the project, select "New", then select "Folder".
Now we need to add the "test" folder to the project build path.
Right-click the project and select "Properties".
Then select "Java Build Path", open the "Source" tab, click the "Add Folder" button, and check the folder named "test".
Go to the project "Properties" again. We need a special library to run our tests: JUnit, the standard library for writing and running tests in Java.
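To make the idea concrete, here is the kind of check we will later express as a JUnit test method. It is shown with a plain Java assertion so it runs standalone; the class name and the quick-and-dirty splitting logic are just an illustration, not the real SQL-MR code:

```java
public class WebLogSmokeTest {
    // In the Eclipse project this would be a JUnit @Test method;
    // a plain assertion is used here so the snippet is self-contained.
    public static void main(String[] args) {
        String line = "dyn-530.optus.com.au - - [22/Apr/2009:18:52:51 +1200] "
                + "\"GET /images/photos/5332.jpg HTTP/1.1\" 200 244 \"-\" \"-\" \"-\"";
        // The HTTP status code is the first token after the closing
        // quote of the request field.
        String afterRequest = line.substring(line.indexOf("\" ") + 2);
        String status = afterRequest.split(" ")[0];
        assert status.equals("200") : "unexpected status: " + status;
    }
}
```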
We'll start developing our function in the next chapter!