Linux Commands: Some of the best tools for Big Data Prep

Learn Aster
Teradata Employee

So if you are just getting started in big data analytics, or want to get started, you will definitely want to give Linux a serious look.  I started my career deeply committed to Microsoft technologies; I even worked for Microsoft in Redmond on the Windows Server team.  I never looked at Linux seriously until I began working in big data analytics.  Linux comes with tools that are very handy for preparing data.  I am sure there are similar tools on other platforms, but I really like what I can do with Linux and Bash.  This article covers some of the commands I love and use frequently.  It is not a complete list, and I welcome feedback.

The most important trait of any software developer is the ability to use the GOOGLE machine.  I love GOOGLE and the internet; I can find anything about anything, especially Bash and Linux commands.  Everything I am posting today was found because of my desire to never quit, never let the problem win, and a hunger to learn.  It is all about the question you ask GOOGLE, because someone out there has almost certainly already written the code or the command to accomplish what you want.

So here is a list of my favorite commands and what they are used for:

To get rid of Windows carriage returns '\r' and leave Linux '\n' line endings

     - dos2unix filename

     - tr -d '\r' < filename > newfilename

To find the number of records in a file

     - wc -l filename

To split a file into multiple files based on record count or size (great for parallel loads)

     - split command
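
     For example, here is a minimal sketch of how I use it (GNU coreutils split assumed; the chunk size and the chunk_ prefix are just placeholders):

     # split into files of 1,000,000 lines each, named chunk_aa, chunk_ab, ...
     split -l 1000000 filename chunk_

     # or split into pieces of at most 500 MB each, without breaking a record across files
     split -C 500m filename chunk_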

To split a file into multiple files based on a field value (one of my favorites):

     - awk -F"|" '{print > ($1".txt")}' filename

To add line numbers to records in a file (excellent if you want to create a unique key for distribution without skew):

     - awk '{print FNR "|" $0}' filename > newfilename

To wrap all records in a file with single quotes:

     - sed -e "s/\(.*\)/'\1'/" filename > newfilename

To start a virtual terminal session that you can come back to in the event of a disconnect or crash (big data means long-running jobs, so this is a big deal)

     - screen
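
     The typical workflow looks something like this (the session name "dataload" is just an example):

     # start a named session and kick off the long-running job inside it
     screen -S dataload

     # detach with Ctrl-a d; the job keeps running even if your connection drops

     # later, list sessions and reattach from any terminal
     screen -ls
     screen -r dataload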

To create a smaller sample file from the first records of a larger file:

     - head -n 10000 filename > newfilename

To get rid of invalid UTF-8 characters (the -c flag silently drops anything that cannot be converted)

     - iconv -f utf-8 -t utf-8 -c file.txt > newfile.txt

These are just some of my favorites, and I have more.  I am sure you have your own favorites and your own favorite platform.  Linux has really opened my eyes to how flexible, reliable, and fast the command-line experience can be.  There is also a wealth of experts out there who have done virtually everything you might want to do in Linux.  The Linux community is amazing.

Have fun!

Some more fun ones:

tail

wc -l filename

df -k

dos2unix

chmod

bash

du -h

du -csh

du -h | sort -h

ls -lh

sort -k1 -n -T /data/tmp filename.txt  (-k1 sorts on the first field, -n sorts numerically, -T points sort at a temp directory with enough space for its work files)

du -a
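
A few of these in context, the way I might run them before kicking off a big load (the paths are just placeholders):

# how much free space is left on each mounted filesystem
df -k

# total size of a staging directory, in human-readable units
du -csh /data/staging

# sizes of everything under the staging directory, biggest last
du -h /data/staging | sort -h

# follow a load log as it grows
tail -f /data/logs/load.log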

How to create a user for FTP:

groupadd FTPGROUP

useradd -g FTPGROUP -d /data/SOMEPATH USERID

passwd -e USERID  (expires the password so the user must set a new one at first login)

chmod 775 /data/SOMEPATH

chown USERID /data/SOMEPATH

Where the uppercase names are placeholders for your own values