VIDEO: HOW TO REDUCE LOAD TIME: Using nCluster_Loader in Parallel

Teradata Employee

I know, I know... Aster is not all about speeds and feeds but I thought I would share this anyway...

I know what you are thinking: big deal, loading data. Granted, it is not data science, but data loading is very important in the data science process. If I can speed up the data loads significantly, I can reduce the time and cost of performing that load. I can also exploit my data faster, prep my data faster, and perform analytics faster. It is just one of the components of how Aster Flips the 80/20 Rule for Advanced Analytics. Face it, who wants to spend time loading data? Especially Big Data.

Watch the video as I go over the load process and how I did it.  I am sure there are more elegant ways of scripting this but I wanted to show it the way I did to demonstrate the speed.

So here is the deal:

On my cluster, which is a system with 1 Queen and 3 Workers, I loaded 123 million records in roughly 45 seconds. There were 22 files in my data set, and I loaded them all in parallel into a table in Aster using ncluster_loader.

The ncluster_loader statement used looks like this:

ncluster_loader -h xx.xx.xx.xx -U jtxxxx -w xxxxxxx -d jt public.aa_airlines 2008.csv --csv --skip-rows 1

This file had roughly 7 million records in it, and by itself it took roughly 17 seconds to load. Remember there are 22 files. So if I loaded all 22 files back to back, it would have taken about 22 x 17 seconds, assuming every file had about the same number of records and the exact same structure.

The total load time for the 22 files back to back is roughly: 374 seconds, or 6 minutes 14 seconds

To load all 22 files at the same time took: 45 Seconds
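
For anyone who wants to try this, here is a minimal sketch of one way to start all the loads at once from a shell. It is not necessarily the script shown in the video. It reuses the same ncluster_loader flags as the single-file command, with the same masked host, user, and password placeholders, and the function name load_all_parallel is a hypothetical helper made up for this illustration.

```shell
#!/bin/sh
# Sketch only: start one ncluster_loader per file in the background, then
# wait for all of them. Host, user, password, database and table are the
# masked placeholders from the post; load_all_parallel is a hypothetical name.

load_all_parallel() {
    pids=""
    started=0
    failed=0

    for file in "$@"; do
        # Same flags as the single-file command; '&' runs it in the background.
        ncluster_loader -h xx.xx.xx.xx -U jtxxxx -w xxxxxxx -d jt \
            public.aa_airlines "$file" --csv --skip-rows 1 &
        pids="$pids $!"
        started=$((started + 1))
    done

    # Collect each loader's exit status so a failed file is not missed.
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done

    echo "loads started: $started, failed: $failed"
    [ "$failed" -eq 0 ]
}
```

A call such as load_all_parallel *.csv, run from the directory holding the data files, would start one loader per file. The function returns non-zero if any loader fails, so it can be checked from a wrapper script.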

This parallel process cuts my entire load time to about 12% of my total back to back load time. That is a significant improvement.
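
The numbers can be checked with a few lines of shell arithmetic. This is just a sketch using the timings quoted above. The variable names are made up for the illustration, and shell arithmetic is integer-only, so the percentage is truncated.

```shell
#!/bin/sh
# Timings reported in the post.
files=22
secs_per_file=17     # one file of roughly 7 million records, loaded alone
parallel_secs=45     # all 22 files loaded at the same time

sequential_secs=$((files * secs_per_file))
# Integer arithmetic, so the percentage is truncated to a whole number.
percent=$((parallel_secs * 100 / sequential_secs))

echo "sequential estimate: ${sequential_secs}s"
echo "parallel load is ${percent}% of the sequential estimate"
```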

Now, 123 million records is not a lot of data. However, when it is 123 billion records or more, this could be a big time savings.

Video Link : 1057