Character field Cleanup routines

Database
Enthusiast

Character field Cleanup routines

Hello .. We have many CHAR columns, name, address etc for example, that contains unwanted values, like @, $, %, ^, ( etc. We currently use INDEX to find the position of these characters and clean them up, and we run this repeatedly so that all unwanted characters in a single column can be cleaned up. This method works fine for most of the printable special characters that we need to remove.

But we are coming across unprintable characters with various HEX values, and the INDEX doesn't seem to work. I used CHAR2HEXINT, and that doesn't seem to help either. I am running into multiple data issues, and error 2621 - *** Failure 2621 Bad character in format or data.

Is there any standard cleanup routine that is available to remove any and all unprintable characters ?? If not, what would be the best approach to write one ? From what I have seen so far, it looks like one of the options is to write a Stored procedure that have a cursor loop, which would remove anything other than a set of allowable characters, like A - Z, a - z, 0 - 9 and may be some characters like comma, dash etc. This would need to look at the field character by character, and I am not sure whether this would affect the performance. The other option I saw is to use a UDF. I haven't done either, but willing to write a stored procedure, if needed. Any other options ?? Is there any Teradata functions that would do the trick ??

Thanks in advance for the group feedback ..
4 REPLIES
Enthusiast

Re: Character field Cleanup routines

A UDF would probably be the best option for an in-database solution. The Oracle UDF library (available for download from Teradata's main website) may contain some string manipulation functions that will help. We have simply created our own udf based on the UNIX (and C) functions retain_print, retain_alpha, and retain_alnum that afford us the same functionality.

The other option would be to pre-process the files using a third-party ETL tool or scripting in the environment from which you run your ETL.
Enthusiast

Re: Character field Cleanup routines

Thanks Rob .. I was able to get pipes out by POSITION('7C'XC IN column). Using the same method, I am able to get most of the '8B's also out, but not all. But it looks like POSITION and '1A' don't like each other, I get 6706 : The string contains an untranslatable character.
Enthusiast

Re: Character field Cleanup routines

What character are you trying to remove with '1A'XC? '0A'XC is CR whereas '1A'XC is SUB (substitute) and appears unacceptable in a LATIN character set.
Enthusiast

Re: Character field Cleanup routines

Just to add: Try to use the Translate_chk function which should help you identify the position of the untransalatble chars.You can may be then replace the chars on those positions as needed.Just the way you used index to find positions of translatable chars.

BTW, I would be glad if you could share how do you manage to identify all the unwanted values in a field. The INDEX function just gives the position of the first unwanted field and not all. I believe , it would require looping and am intrested to see how you do that.

Regards,
Ayush Jain