regexp_replace woes

Database
Enthusiast

regexp_replace woes

I want to replace all of the whitespace and non alphanumeric characters in a string. It seems like some form of the following should work:

SELECT REGEXP_REPLACE('00 - L* 123-4383', '[^a-zA-Z0-9]+', NULL)

However, that only replaces the initial find of ' - ' and leaves '* ' and '-' in place. I'm not interested in nesting REGEXP_REPLACE and, since I'm dealing in UNICODE, the block of characters that I would have to put into the TRANSLATE function would be unfathomably huge, so that's out.

Surely the usecase of REGEXP_REPLACE isn't so narrow that it can only find a single block of patterned text in a string. Right? Please tell me I'm missing something.

13 REPLIES
Senior Apprentice

Re: regexp_replace woes

The documentation on the manuals seems to be buggy (or the implementation doesn't match it).

By default REGEXP_REPLACE replaces only the first occurance, you have to add optional parameters:

REGEXP_REPLACE('00 -j L* 123-4383', -- sourcestring
'[^a-z0-9]+', -- regexp
'', --replacestring
1, --startposition
0, -- occurrence, 0 = all
'i' -- match -> case insensitive
)

Using 'i' as match argument you don't need both uppercase and lowercase a..z.

Dieter

Enthusiast

Re: regexp_replace woes

Nice.  I tried to add the 5th option parameter to have it continue searching, but the database yelled at me. Seems that you have to include the 6th parameter or it errors :/  Thanks for the pointer!

Teradata Employee

Re: regexp_replace woes

I have opened DR 168432 and we have done a recheck to see what is not correct with these functions.

 

Here are items where our documentation is wrong:

1) If the startSearch parameter is not specified, it defaults to 1 as stated. However, if it is specified as NULL, we return NULL. The documentations says if NULL, it defaults to 1.

2) If the occurrence arg is not specified, it defaults to 1 (not 0 as stated in the documentation.) If it is specified as NULL, the function returns NULL. The documentations says if NULL, it defaults to 0.

3) The documentation states: If match arg is omitted, an error is returned. This is not true.

Here is where we deviate from Oracle.

1. Oracle allows 3,4,5,or 6 parameters. We don't allow 5. – Will be fixed.

2. Oracle defaults the occurrence parameter to 0 if not specified, we set it to 1. - Will be fixed.

3. Oracle allows an x in the match arg meaning ignore whitespace. – Will investigate if we can implement.

4. If match arg is omitted. We default to case sensitive since there is no NLS_SORT.

5. We don't handle backreferencing correctly. Will be fixed.

Enthusiast

Re: regexp_replace woes

Hi there,

Thanks for the comments and suggestion. I am facing some problems with regexp_replace. I am trying to replace all instances of substring which matches the regexp. But it keeps throwing me an error. He are the details:

CREATE TABLE sample_split5 DISTRIBUTE BY HASH(text_without_htmltags) as
SELECT text_with_htmltags, regexp_replace(text_with_htmltags, '<([^>])*>',' ',1,1,'c') AS text_without_htmltags
FROM sample_split;

But i keep getting an error: 

Executed as Single statement.  Failed [34 : 42000] [AsterData][ASTERJDBCDSII](34) ERROR: function regexp_replace(character varying, unknown, unknown, integer, integer, unknown) does not exist () 

Elapsed time = 00:00:00.103 

STATEMENT 1: EXPLAIN Statement failed. 

I am not sure why this is happening. When I try removing the last 3 arguments it works perfectly fine. Can you help me out here? 

Thanks in advance,

Aniket

Senior Apprentice

Re: regexp_replace woes

Hi Aniket,

the REGEXP_REPLACE in Aster has different arguments:

regexp_replace(source, pattern, replacement [, flags ])

This should work:

regexp_replace(text_with_htmltags, '<([^>])*>',' ','g')

Enthusiast

Re: regexp_replace woes

Thanks Dieter - worked like magic! :)

Re: regexp_replace woes

I am attempting to use regext_replace with a string from a table that contains hex '1A'. The query returns error "the string contains untranslatable character. The string should contain alphanumerics only. How do I  find and replace non-unicode characters from this string?? My query is:

select REGEXP_REPLACE(profit_center_name, '[^a-zA-Z0-9]+',  '', 1, 0, 'i')

from scratch.FB_PROFIT_CTR_CORP_BUS_GRP_REL

Enthusiast

Re: regexp_replace woes

Hi all ,

I need some inputs for the following issue.

I am getting data as below this if I run a SQL

Dept No    Name    Salary     Region     Designation         Stream

1                A          1000       AO              Engineer               IT

1                    B              2000         AO                Engineer                IT

2                     C             3000         JO                    Sr. Engineer        IT

1                   D                 4000        FO                 Manager               IT

1                   E                 500           AO                Engineer               ?

1                  F                    400           AO                 Engineer              ?

But the required  output should be like this

Dept No    Name         Salary              Region        Designation           Stream 

1                   A,B         1000,2000            AO                  Engineer               IT

2                     C               3000                   JO                 Sr. Engineer          IT

1                   D                 4000                   FO                 Manager               IT

1                   E                 500                      AO                Engineer               ?

1                  F                    400                      AO               Engineer              ?

Conditions applied were :

- If Stream =IT  and data in all the columns(except Name , Salary) are same   then concatenate the  data in to single row with a comma seperator.

- If stream is not IT then display the data as it is.

To acheive the above I used the following TD function

cast(regexp_replace(tdstats.udfconcat(trim(A.NAME)),'"','',1,0,'i') as varchar(100)) as Name,

cast(regexp_replace(tdstats.udfconcat(trim(b.Salary)),'"','',1,0,'i') as varchar(100)) as Salary

After using this I am facing an error :

9134 : Results exceeds the output argument  size limit

Any inputs / suggestions  are appreciated

Regards

Enthusiast

Re: regexp_replace woes

Hi Guys,

I am asked to use some of the existing functions we have. Whats the use of this statement ?

TRIM(REGEXP_REPLACE(REGEXP_REPLACE(CAST(INOUT_STRING AS CHAR(3000)), '[\t\r\n\v\f|]',' ',1,0, 'I'), '[^[:print:]]','',1,0, 'I'))

Thanks.