In this second part of our series on using Aster Express, we'll continue using the Aster ACT query tool and introduce a new tool for bulk loading data, Aster's ncluster_loader. (See Using Aster Express: Act 1 for part 1 of this series.)
With the Aster Express images for VMware Player, we've also included some sample data sets. These data sets will allow us to create analytic tutorials that showcase the power and flexibility of the Aster platform; they've been zipped and placed in a "demo" directory on the Queen node. As we saw in the first article of this series, the Aster client tools are also installed on the Queen. ACT is the tool we've been using to submit SQL queries to our Aster cluster. Another very useful tool is "ncluster_loader", Aster's tool for bulk loading data.
Let's start by logging into our Queen node. For simplicity with our virtual images, we'll log in directly to the Queen instance, although, as we showed previously, it is also good practice to use a remote SSH utility; both are equivalent for our purposes. Our user login/password is "aster/aster" for Aster Express. Once we're on the Queen, we open a terminal window by double-clicking the GNOME Terminal icon, and stretch it a bit to give ourselves some elbow room. Start the ACT query tool by typing "act" on the command line and entering the password "beehive" when prompted.
We are going to create a new table for our first dataset. This dataset contains just over 1 million records that simulate customer visits to a bank web site. Here is the SQL CREATE TABLE statement that we'll use:
CREATE TABLE bank_web_clicks
) DISTRIBUTE BY HASH (customer_id);
Type or copy/paste this CREATE TABLE statement into ACT. (See the screenshot below.) This now gives us a landing table for loading our data. Notice again the 'DISTRIBUTE BY HASH' syntax. This tells Aster how the data should be distributed across the Worker nodes in the cluster. We're choosing to distribute based on the customer_id. We'll cover this in more detail once we get to the analytic tutorials.
Let's exit ACT and find our dataset. Use the ACT quit command, "\q", to return to the shell command prompt.
The zipped datasets are in a folder named "demo" under the aster home directory. Let's first unzip the "bank_web_data.zip" file.
We're now ready to use the Aster load tool, ncluster_loader.
For help on the tool's syntax, run it with the "--help" parameter.
In our example, we'll specify all the required parameters to connect to our Aster cluster, plus our load table and the data file. We'll also tell the loader to skip the first row, since it contains the field names, and add the 'verbose' flag so that we can see all the processing details.
ncluster_loader --hostname 192.168.100.100 --username beehive --password beehive --dbname beehive --skip-rows 1 --verbose bank_web_clicks bank_web_data.txt
You'll see the processing details scroll by on your screen: connecting to Aster, reading the file format, and finally the summary output, showing just over 1 million records loaded in only a few seconds!
That's it. We now have some sample data to play with. Here's an example to show the distinct web pages visited by customers:
select page, count(*) from bank_web_clicks group by page;
Also, if you want to load and play with the other two data sets in the demo directory, here are the CREATE TABLE statements, along with the ncluster_loader commands:
CREATE TABLE retail_web_clicks
) DISTRIBUTE BY HASH (cookie_id);
ncluster_loader --hostname 192.168.100.100 --username beehive --password beehive --dbname beehive --skip-rows 1 --verbose retail_web_clicks retail_web_data.txt
CREATE TABLE ad_web_events
) DISTRIBUTE BY HASH (user_id);
ncluster_loader --hostname 192.168.100.100 --username beehive --password beehive --dbname beehive --skip-rows 1 --verbose ad_web_events ad_web_data.txt
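Since all three loads share the same connection flags, you could drive them from a small shell loop. This is just a convenience sketch of our own, not part of the Aster tooling; the `LOADER_OPTS` variable and the `echo` preview are ours (remove `echo` to actually execute the loads):

```shell
# Shared connection flags, taken from the ncluster_loader commands above.
LOADER_OPTS="--hostname 192.168.100.100 --username beehive --password beehive --dbname beehive --skip-rows 1 --verbose"

# table:file pairs for the three sample data sets
for pair in bank_web_clicks:bank_web_data.txt \
            retail_web_clicks:retail_web_data.txt \
            ad_web_events:ad_web_data.txt; do
  table=${pair%%:*}   # text before the colon
  file=${pair##*:}    # text after the colon
  # 'echo' prints each command for review; drop it to run the loads for real
  echo ncluster_loader $LOADER_OPTS "$table" "$file"
done
```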
So there you go: with these three data sets, you have 25 million rows of sample data to play with. In our next article, Using Aster Express: Act 3, we'll show some powerful Aster analytics that you can apply to these tables to yield new insights from your Big Data.
Is it possible to load data using ncluster_loader with the data coming from an input stream, say a pipe?
Storing a large amount of data in a text file and then loading it into the Aster database is a very expensive step.
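One possible approach, not confirmed against the ncluster_loader documentation: if the loader accepts any readable file path, a named pipe (FIFO) would let data stream into it without materializing the full data set on disk. A generic sketch, using `cat` as a stand-in for the loader:

```shell
# Create a named pipe; a writer and a reader can then stream data
# through it without a full intermediate file on disk.
mkfifo /tmp/clicks_pipe

# Writer: in practice, this would be whatever process generates your data.
printf 'customer_id,page\n101,home\n102,login\n' > /tmp/clicks_pipe &

# Reader: 'cat' stands in for a loader invocation such as
#   ncluster_loader ... bank_web_clicks /tmp/clicks_pipe   (hypothetical usage)
cat /tmp/clicks_pipe

rm /tmp/clicks_pipe
```

Whether ncluster_loader actually reads from a FIFO depends on how it opens its input file, so this is worth testing on a small data set first.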