I know, I know... Aster is not all about speeds and feeds but I thought I would share this anyway...
I know what you are thinking: big deal, loading data. Granted, loading data is not data science, but it is a very important part of the data science process. If I can speed up my data loads significantly, I can reduce the time and cost of performing those loads. I can also exploit my data faster, prep my data faster, and perform analytics faster. It is just one of the components of how Aster Flips the 80/20 Rule for Advanced Analytics. Face it, who wants to spend time loading data? Especially Big Data.
Watch the video as I go over the load process and how I did it. I am sure there are more elegant ways of scripting this, but I wanted to show it the way I did it to demonstrate the speed.
So here is the deal:
On my cluster, which is a 1-queen/3-worker system, I loaded 123 million records in roughly 45 seconds. There were 22 files in my data set, and I loaded them all in parallel using ncluster_loader into a table in Aster.
The ncluster_loader statement used looks like this:
ncluster_loader -h xx.xx.xx.xx -U jtxxxx -w xxxxxxx -d jt public.aa_airlines 2008.csv --csv --skip-rows 1
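One way to run all 22 of those loads in parallel is simply to background each ncluster_loader invocation from the shell and wait for them all to finish. This is a hypothetical sketch, not the exact script from the video: the yearly file names 1987.csv through 2008.csv are my assumption for the 22 files, and the fallback to `echo` is only there so the sketch can be dry-run on a machine without the loader installed.

```shell
#!/bin/sh
# Sketch: one ncluster_loader process per file, all started in the
# background, then wait for every load to finish.
# File names 1987.csv .. 2008.csv are illustrative (22 files, one per year).
LOADER=ncluster_loader
# Fall back to printing the commands if the loader is not on this machine.
command -v "$LOADER" >/dev/null 2>&1 || LOADER="echo $LOADER"

for year in $(seq 1987 2008); do
  $LOADER -h xx.xx.xx.xx -U jtxxxx -w xxxxxxx -d jt \
    public.aa_airlines "${year}.csv" --csv --skip-rows 1 &
done
wait  # returns only after all 22 background loads have completed
```

The key pieces are the trailing `&`, which launches each load without waiting for it, and the final `wait`, which blocks until every background job is done.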
This file had roughly 7 million records in it, and by itself it took roughly 17 seconds to load. Remember, there are 22 files, so loading all 22 files back to back would have taken about 22 × 17 seconds, given that every file had about the same number of records and the exact same structure.
The total load time for the 22 files back to back would be roughly 374 seconds, or 6 minutes 14 seconds.
Loading all 22 files at the same time took 45 seconds.
This parallel process cuts my entire load time to roughly 12% of the back-to-back total. That is a significant improvement.
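The arithmetic behind that 12% figure, using the numbers from the post, can be checked in a couple of lines of shell:

```shell
# Back-of-the-envelope check of the speedup (numbers from the post).
seq_total=$((22 * 17))                 # sequential: 22 files x ~17 s each = 374 s
par_total=45                           # parallel wall-clock time in seconds
pct=$((100 * par_total / seq_total))   # integer percent: 4500 / 374 = 12
echo "sequential: ${seq_total}s, parallel: ${par_total}s, ratio: ${pct}%"
```

So the parallel run finishes in about 12% of the sequential time, an 8x speedup on a 3-worker system.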
Now, 123 million records is not a lot of data; however, when it is 123 billion records or more, the time savings could be substantial.