fixed-width format is when you specify the maximum length of each field and pad the rest with zeros/spaces.
if this is a variable-length text format with pipe delimiters, and the max allowed size of the first field (assuming integer) is 5, the second is 30, and the last field is 2, then in fixed-width form there are no delimiters: all fields start and end at fixed positions, so every record is exactly 5 + 30 + 2 = 37 bytes.
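To make the layout concrete, here is a small sketch of formatting one record under those assumed widths (5/30/2); the field values are made up for illustration:

```shell
# format one record as fixed-width 5/30/2:
# zero-pad the integer to 5, space-pad the middle field to 30, last field to 2
record=$(printf '%05d%-30s%2s' 123 "John Doe" "NY")
echo "$record"
echo "record length: ${#record}"   # every record is exactly 5+30+2 = 37 bytes
```

Because every field starts at a fixed offset, the loader never needs to look for a delimiter.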
and in your load scripts, you can specify the record format to be TEXT. you will also have to change your input field specifications in the script from VARCHAR(n) to CHAR(n), and use NULLIF checks to translate spaces to null values, etc. look up the manual for more details.
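A minimal sketch of what the FastLoad script side could look like under those assumptions (the database, table, and field names here are made up; check the FastLoad manual for the exact NULLIF syntax on your release):

```
SET RECORD TEXT;
DEFINE
  id    (CHAR(5),  NULLIF = '     '),  /* five spaces -> NULL */
  name  (CHAR(30)),
  state (CHAR(2))
FILE = mydata.txt;
BEGIN LOADING mydb.mytable ERRORFILES mydb.err1, mydb.err2;
INSERT INTO mydb.mytable (id, name, state) VALUES (:id, :name, :state);
END LOADING;
LOGOFF;
```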
named FIFOs are a feature in Unix where one process can write to a FIFO and another process can read from it (similar to using pipes in shell commands).
for example, if I had a zipped data file, I would do something like this in my shell script:
# create a fifo for use; only this userid will have read/write access to it
mkfifo -m 600 myfifo

# unzip the contents of the data file and write the output to the fifo, in the background
gunzip -c data.gz > myfifo &

# store the pid of the background process
unzippid=$!

# the usual fastload stuff: define the record format as TEXT, do the other
# fastload gizzmos, and point fastload at the fifo with FILE=myfifo
fastload <<'EOF'
...
SET RECORD TEXT;
...
FILE = myfifo;
...
EOF

# store the return code of fastload
rc=$?

# if fastload failed, kill the background process
if [ $rc -ne 0 ]
then
  kill $unzippid
fi
One of the better ideas I've been presented with is below. Any thoughts would be very much appreciated.
The problem relates to what is a control character and what is not. The cent character is not a control character, nor is it a standard ASCII character; it is a special character, which differentiates it again from a non-printing character.
What could be used is a better filter than just stripping at the source system or Linux end: a translation table for problematic characters, e.g. ^ in the text is translated to ', the cent character is translated to something else, and so on. The issue is where that should happen, i.e., all character conversion and stripping should happen on the source system, then the removal of all control characters on Linux, with a secondary filter to check and translate any remaining characters that might cause issues.
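As a rough sketch, such a translation-table filter on the Linux side could be a sed stage for the mappings followed by a tr stage for the stripping. The specific mappings below are illustrative assumptions; the real table would be built from the characters actually seen in your data:

```shell
# translation-table filter: map problematic characters to safe ones first,
# then strip whatever control characters remain (keeping tab and newline).
clean() {
  sed -e "s/\^/'/g" |
  # add more -e 's/old/new/g' mappings here (e.g. for the cent character)
  tr -d '\000-\010\013-\037\177'
}

printf 'abc^def\n' | clean
```

The order matters: translate first, strip last, so that a mapped character is never accidentally deleted before it can be translated.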
At this point in development we feel fixed width is not an option, and we are still exploring the named FIFO option.
Yes, it would be more of a data quality issue that needs to be fixed at the source system. I am personally against any translation done at the receiving system, as it might produce the wrong results at times: you could end up translating something that shouldn't have been translated (and which might propagate downstream, further complicating things).
Maybe this is something the BAs and DAs should get together and decide upon. (Wash the linen only on the upstream! :-) )
In the ideal world there would never be delimiters or control (unprintable) characters embedded in your data. Since we live in the real world, you have to deal with this issue. As already stated, you can try to correct the source system, but often this is not possible either.
Here are some solutions I've seen used:
1. Pick a delimiter that can "never" appear in the data. For example, ASCII 127 (0x7F) is the delete character; it can be typed but never appears in the files you are working on. This works quite well but is not intuitive when reviewing the data files.
2. If the source system is Oracle you can cleanse the offending columns by writing a cleanse function to change the unprintable characters to spaces.
3. Use perl or awk to change the unprintable characters in the source file, e.g.:
perl -pe 's/[\x00-\x08\x0B-\x1F\x7F]//g' file.txt
(stripping the whole [[:cntrl:]] class with -n and a bare print would also remove tabs and each line's trailing newline, so the class above deliberately preserves \t and \n.)
4. Use perl or awk to count the number of delimiters per row and reject the row or file if incorrect - meaning that there were delimiters embedded in the data.
5. Use double quotes around each column in the source file - when using tools like Informatica you can specify delimited and quoted input files.
There are probably more solutions but you get the idea....
We are researching the idea of cleaning at the source.
The problem at the moment is that the main "block-looking" control characters we are receiving most (they seem to be carriage returns and line feeds) don't all have the same hex value. So we are trying to build up a library of EBCDIC hex values that we can scan our source files for before ETL.
Also, has anyone used/researched Teradata Data Mirroring products for anything in this context?
Finally, if we were to do something as radical as use another product's utility (say DTS in SQL Server, which is currently working for us) to get the files cleanly across, could we dynamically call something like the OleLoad utility for an array of files from a Unix/Linux script?
You can FastLoad from a DTS package by using the DTSPackageDSO OLE DB provider and the OLE DB access module. Here is a small experiment that you can use to verify that this works:
* On a system having Microsoft SQL Server 2000 installed, click "Start", then "Programs", then "Microsoft SQL Server", then "Import and Export Data". Follow the wizard steps to create a "Microsoft OLE DB Provider for SQL Server" to "Microsoft OLE DB Provider for SQL Server" package. On the "Specify Table Copy or Query" step, choose "Copy table(s) and view(s) from the source database". On the "Save, Schedule, and replicate package" step, uncheck all of the check boxes in the "When" group box, then check the "Save DTS Package" check box and the "Structured Storage File" radio button in the "Save" group box.
* Start the "SQL Server Enterprise Manager" and find the "Data Transformation Services" icon. Right-click it and choose "Open package" to open the package that you created with the "Import and Export Data" wizard. Right-click the "transform data task" icon (it is an arrow) and click "Workflow Properties". On the "Options" tab, in the "Execution" group box, set the "DSO rowset provider" check box. Also, copy down the entire "Name" of the transform data task. It is important to get the entire name; it may not all be displayed, and you might have to click on it and then move the cursor all the way right and left over it to get the whole name.
* Save the package.
* Start OleLoad and, in the "Select a source" group box, select the "Microsoft OLE DB Provider for DTS Packages (DTSPackageDSO)" entry. On the "All" tab of the "Data Link Properties" dialog box, set the value of the Provider String property to /Ffilename (where filename is the full pathname of your DTS package).
* In the "Enter the name of the desired table" edit box, enter the name of the transform data task.

The rest is a normal FastLoad via OleLoad.