Fast/Multi Load & Control Characters

Enthusiast

Our source data is loaded into our warehouse staging area via a combination of the Fast and Multi Load utilities.
From time to time, control characters can slip into the source systems' data.

For example, an email address such as Paul.O¢Hara@gmail.com appeared last week, where the cent character was meant to be an apostrophe.

We use the caret (^) symbol as the delimiter for the FastLoad files, so carets are what delimit the columns in the .txt file FastLoad reads.
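
A quick way to spot suspect rows on the Linux side is below (a minimal sketch; it assumes the file should contain nothing but printable ASCII, and stagefile.txt is a placeholder name):

# flag any line containing a byte outside printable ASCII
# (LC_ALL=C makes grep treat the file as raw bytes)
LC_ALL=C grep -n '[^ -~]' stagefile.txt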

Everyone has to deal with control characters, and we are wondering what approaches some of you have employed.

Marc.
Enthusiast

Re: Fast/Multi Load & Control Characters

Use fixed-width format; it's also much faster to load.

The issue of space is normally circumvented on Unix by using compression utilities (i.e., ask the source system to send zipped files) and loading them via named FIFOs.
Enthusiast

Re: Fast/Multi Load & Control Characters

Excuse my ignorance, but what do you mean by 'fixed width' and 'named FIFO'?
Can you give examples?

Marc
Enthusiast

Re: Fast/Multi Load & Control Characters

Fixed-width format is when you specify the maximum length of each field and pad the rest with zeros/spaces.

For example, if this is a variable-length text record with pipe delimiters:

123|shannera@grare.com|NY

and the maximum allowed size of the first field (assuming integer) is 5, the second is 30, and the last is 2, then your data record would look like this (there are no delimiters; all fields start and end at fixed positions):

00123shannera@grare.com<12 spaces>NY

In your load scripts, you can then specify the record format to be TEXT.
You will also have to change the input field specifications in the script from VARCHAR(n) to CHAR(n), and use NULLIF checks to translate spaces to null values, etc.; look up the manual for more details.
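
For instance, the DEFINE for the three-field record above might look something like this (a sketch only; the field names and NULLIF literals are made up for illustration):

SET RECORD TEXT;

DEFINE
  cust_id (CHAR(5),  NULLIF = '00000'),  /* all zeros -> null */
  email   (CHAR(30)),
  state   (CHAR(2),  NULLIF = '  ')      /* two spaces -> null */
FILE = mydata.txt;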

Named FIFOs are a feature of Unix where one process can write to a FIFO and another process can read from it (similar to using pipes in shell commands).

For example, if I had a zipped data file (call it mydata.gz), I would do something like this in my shell script:

#!/usr/bin/ksh

# create a fifo for use; only this userid will have read/write access to it
mkfifo -m 600 myfifo

# unzip the contents of my data file and write the output to the fifo, in the background
gunzip -c mydata.gz > myfifo &

# store the pid of the background process
unzippid=$!

# the usual fastload stuff ...
fastload <<END
/* define the record format as text, do the other fastload gizmos */
SET RECORD TEXT;

/* point fastload at the fifo via the DEFINE statement */
DEFINE ... FILE=myfifo;
END

# store the return code of fastload
rc=$?

# if fastload failed, kill the background process
if [ $rc -ne 0 ]
then
    kill $unzippid
fi

rm -f myfifo

# end of script
Enthusiast

Re: Fast/Multi Load & Control Characters

Thanks Joe.

One of the better ideas I've been presented with is below. Any thoughts would be very much appreciated.

The problem relates to what is a control character and what is not. The cent character is not a control character, nor is it a standard ASCII character; it's a special character, which differentiates it again from a non-printing character.

What could be used is a better filter than just stripping characters at the source system or Linux end: a translation table for problematic characters, i.e., a ^ in the text is translated to ', the cent character is translated to something else, and so on. The issue is where that should happen, i.e., all character conversion and stripping could happen on the source system, then the removal of all control characters on Linux, with a secondary filter to check and translate any remaining characters that might cause issues.
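
As a rough sketch of that secondary Linux-side filter (the byte values are assumptions: octal 242 is the Latin-1 cent sign, and the file names are placeholders):

#!/usr/bin/ksh
# stage 1: translate known problem characters (cent -> apostrophe)
# stage 2: strip any remaining control bytes, keeping tab and newline
tr '\242' "'" < source.txt | tr -d '\000-\010\013-\037\177' > clean.txt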

At this point in development we feel fixed width is not an option, and we are still exploring the named FIFO option.
Enthusiast

Re: Fast/Multi Load & Control Characters

Yes, it is more of a data quality issue that needs to be fixed at the source system. I am personally against any translation done at the receiving system, as it might produce the wrong results at times; you shouldn't end up translating something that shouldn't have been translated (and which might propagate downstream, further complicating things).

Maybe this is something the BAs and DAs should get together and decide upon. (Wash the linen only upstream! :-) )
Teradata Employee

Re: Fast/Multi Load & Control Characters

In an ideal world there would never be delimiters or control (unprintable) characters embedded in your data. Since we live in the real world, you have to deal with this issue. As already stated, you can try to correct the source system, but oftentimes this is not possible either.

Here are some solutions I've seen used:

1. Pick a delimiter that can "never" appear in the data. For example, ASCII 127 (0x7F) is the delete character: it can be typed but never appears in the files you work with. This works quite well but is not intuitive when reviewing the data files.

2. If the source system is Oracle you can cleanse the offending columns by writing a cleanse function to change the unprintable characters to spaces.

3. Use perl or awk to change the unprintable characters in the source file, e.g.: perl -ne 'chomp; s/[[:cntrl:]]//g; print "$_\n"' file.txt (the chomp and re-added newline keep the record boundaries intact, since line feeds are themselves control characters).

4. Use perl or awk to count the number of delimiters per row and reject the row or file if the count is incorrect - meaning that there were delimiters embedded in the data (see the sketch at the end of this post).

5. Use double quotes around each column in the source file - when using tools like Informatica you can specify delimited and quoted input files.

There are probably more solutions but you get the idea....
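
As an example of suggestion 4, a small awk sketch (it assumes a caret delimiter and 3 fields per row, so exactly 2 carets; the file names are placeholders):

# count carets per row; route bad rows to a reject file
awk '{ n = gsub(/\^/, "&") }             # gsub returns the number of carets
     n == 2 { print > "good.txt" }
     n != 2 { print NR ": " $0 > "rejects.txt" }' file.txt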

Enthusiast

Re: Fast/Multi Load & Control Characters

We are researching the idea of cleaning at the source.

The problem at the moment is that the main 'block-looking' control characters (they seem to be carriage returns and line feeds), which we receive most often, don't all have the same hex value. So we are trying to build up a library of EBCDIC hex values that we can scan our source files for before ETL.
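
One way to build that inventory is to dump the distinct byte values in a landed file (a sketch; suspect_file.txt is a placeholder, and the scan should run against the raw bytes before any code-page conversion):

# list every distinct byte value in the file, most frequent first
od -An -tx1 suspect_file.txt | tr -s ' ' '\n' | grep . | sort | uniq -c | sort -rn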

Also, has anyone used/researched Teradata Data Mirroring products for anything in this context?

Finally, if we were to do something as radical as use another product's utility (say, DTS in SQL Server, which is currently working for us) to get the files across cleanly, could we dynamically call something like the OleLoad utility for an array of files from a Unix/Linux script?

Re: Fast/Multi Load & Control Characters

You can FastLoad from a DTS package by using the DTSPackageDSO OLE DB provider and the OLE DB access module. Here is a small experiment that you can use to verify that this works:

* On a system having Microsoft SQL Server 2000 installed, click "Start", then "Programs", then "Microsoft SQL Server", then "Import and Export Data". Follow the wizard steps to create a "Microsoft OLE DB Provider for SQL Server" to "Microsoft OLE DB Provider for SQL Server" package. On the "Specify Table Copy or Query" step, choose "Copy table(s) and view(s) from the source database". On the "Save, Schedule, and replicate package" step, uncheck all of the check boxes in the "When" group box, then check the "Save DTS Package" check box and the "Structured Storage File" radio button in the "Save" group box.
* Start the "SQL Server Enterprise Manager" and find the "Data Transformation Services" icon. Right-click it and choose "Open package" to open the package that you created with the "Import and Export Data" wizard. Right-click the transform data task icon (it is an arrow) and click "Workflow Properties". On the "Options" tab, in the "Execution" group box, set the "DSO rowset provider" check box. Also, copy down the entire "Name" of the transform data task. It is important to get the entire name; it may not all be displayed, and you might have to click on it and then move the cursor all the way right and left over it to get the whole name.
* Save the package.
* Start OleLoad and, in the "Select a source" group box, select the "Microsoft OLE DB Provider for DTS Packages (DTSPackageDSO)" entry. On the "All" tab of the "Data Link Properties" dialog box, set the value of the Provider String property to /Ffilename (where filename is the full pathname of your DTS package).
* In the "Enter the name of the desired table" edit box, enter the name of the transform data task.

The rest is a normal FastLoad via OleLoad.