Help with TD Regexp

Database
Fan

Help with TD Regexp

I have been struggling to convert a regex string that seems to work fine in all flavors of https://regex101.com/ to be used in

TD_SYSFNLIB.REGEXP_SUBSTR.  I have a column with a series of semicolon delimited alphanumric values.  Each column can have 0 to many values.  I need to pull out a subset of those values when they exist in the column. 

The regular expression I have come up with is:
(\b4\b;?)?(\b5\b;?)?(\b7\b;)?(\b11\b;)?(\b63\b;?)?(\b65\b)?(\b14A\b;?)?(\b8\/9\/10\b;?)?

The test string that seems to work elsewhere is:

3;4;5;7;24;27;52;53;55;63;65;14A;86;87;8/9/10

What the output I am expecting is: 4;5;7;63;65;14A;8/9/10

 

I have set up some test data and tried a few simple things that are not doing what I expected.

Using the below test_v:

select lic as lic, 
TD_SYSFNLIB.REGEXP_substr(lic,'(\b4\b;?)') version1, TD_SYSFNLIB.REGEXP_substr(lic,'(\b4\b;?)?') version2
from test_V; 

For this simple example, I do not understand the output. 

 

If I make the 4 optional (? at the end of version2) it only matches if 4 is the first word in the string. 

 

 

licversion1version2
3;4;5;7;11;63;65;8/9/104; 
4;994;4;
3;4;74; 
444
4;7;24;63;65;14A;86;87;8/9/104;4;
3;7;63;65;79;8/9/10  
3;4;5;7;24;27;33;52;53;55;63;65;14A;86;87;8/9/104; 

 

And if I get a little more complete it with

 select lic as lic, 
TD_SYSFNLIB.REGEXP_substr(lic,'(\b4\b;?)(\b5\b;?)?(\b7\b;?)?(\b11\b;?)?(\b63\b;?)?(\b65\b;?)?(\b14A\b;?)?(\b8\/9\/10\b;?)?') version3,
TD_SYSFNLIB.REGEXP_substr(lic,'(\b4\b;?)?(\b5\b;?)?(\b7\b;?)?(\b11\b;?)?(\b63\b;?)?(\b65\b;?)?(\b14A\b;?)?(\b8\/9\/10\b;?)?') version4 
from test_v;

 

 

licversion3version4
3;4;5;7;11;63;65;8/9/104;5;7;11;63;65;8/9/10 
4;994;4;
3;4;74;7 
444
4;7;24;63;65;14A;86;87;8/9/104;7;4;7;
3;7;63;65;79;8/9/10  
3;4;5;7;24;27;33;52;53;55;63;65;14A;86;87;8/9/104;5;7; 

I'm still stumped by why if I make the first word optional am I failing to get a match unless the 4 is the first word. 

select lic as lic, 
TD_SYSFNLIB.REGEXP_substr(lic,'(\b4\b;?)(\b5\b;?)?(\b7\b;?)?(\b11\b;?)?(\b63\b;?)?(\b65\b;?)?(\b14A\b;?)?(\b8\/9\/10\b;?)?') version3  
from test_v; 

 

 

replace view test_v
as
select * from (select cast('3;4;5;7;11;63;65;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select cast('4;99' as varchar(200)) as LIC) v
union all
select * from (select cast('3;4;7' as varchar(200)) as LIC) v
union all
select * from (select cast('4' as varchar(200)) as LIC) v
union all
select * from (select cast('4;7;24;63;65;14A;86;87;8/9/10' as varchar(200)) as LIC) v
union all 
select * from (select cast('3;7;63;65;79;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select cast('3;4;5;7;24;27;33;52;53;55;63;65;14A;86;87;8/9/10' as varchar(200)) as LIC) v; 

 

I use regex infrequently and I'm open to another approach altogether if it gets the job done. 

 

Thanks in advance for any insight you can provide.  


Accepted Solutions
Highlighted
Junior Contributor

Re: Help with TD Regexp

I don't think you can get what you want using a single REGEXP_SUBSTR, because this function returns only one of possibly multiple matches.

 

But instead of extracting the matching data you can remove the non-matching using REGEXP_REPLACE, following seems to work, after some trial & error :-)

SELECT lic
  ,TD_SYSFNLIB.RegExp_Replace(lic||';'  -- Make sure there's always a trailing semicolon
                             ,'(.*?)(\b(4|5|7|11|63|65|14A|8\/9\/10);|$)'
                             ,'\2')     -- remove the first capturing group
FROM test_v;

LIC                             
------------------------------  -----------------------
3;4;5;7;11;63;65;8/9/10         4;5;7;11;63;65;8/9/10;
4;99                            4;
3;4;7                           4;7;
4                               4;
4;7;24;63;65;14A;86;87;8/9/10   4;7;63;65;14A;8/9/10;
3;7;63;65;79;8/9/10             7;63;65;8/9/10;
3;4;5;7;24;27;33;52;53;55;63;6  4;5;7;63;65;14A;8/9/10;

This regex uses two capturing groups, the 1st group (.*?) simply looks for any string which is not in the 2nd group.

The 2nd group starts with a word boundary, followed by any of your search strings and a semicolon or the end of the string.

 

'\2' replaces the full match with the 2nd capturing group only, i.e. removes the 1st.

When you switch to '\1' you can see which data is removed.

1 ACCEPTED SOLUTION
6 REPLIES
Teradata Employee

Re: Help with TD Regexp

Try script table operators. Any grep tool, or Linux shell command, can be executed with the script table operator. See the reference: "SQL Functions, Operators, Expressions, and Predicates" for details. It is a very useful function.

 

-Dave

 

 

Fan

Re: Help with TD Regexp

Thanks for taking time to reply.  I read through that section of the manual you referenced for version 14.1.  Script table operators do look powerful but they seem like a totally different solution mindset than regexp_substr.  If I am close to a regexp solution I'd rather finish that.  If regexp won't work for this use case then I will have to look at something else.  

Do we have any regexp gurus who can tell me what I'm missing?  

Fan

Re: Help with TD Regexp

Right after the above reply I stumbled across strtok_split_to_table.  

 

New test data follows, note I added a key and an example with null in the LIC column: 

replace view test_v
as
select * from (select 1 mykey, cast('3;4;5;7;11;63;65;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select 2 mykey, cast('4;99' as varchar(200)) as LIC) v
union all
select * from (select 3 mykey, cast('3;4;7' as varchar(200)) as LIC) v
union all
select * from (select 4 mykey, cast('4' as varchar(200)) as LIC) v
union all
select * from (select 5 mykey, cast('4;7;24;63;65;14A;86;87;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select 6 mykey, cast('3;7;63;65;79;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select 7 mykey, cast('3;4;5;7;24;27;33;52;53;55;63;65;14A;86;87;8/9/10' as varchar(200)) as LIC) v
union all
select * from (select 8 mykey, cast(null as varchar(200)) as LIC) v;

with the above test data I was able to splite the data into another column and then filter it with a where clause: 

 

 

 

select d.* from table (strtok_split_to_table(test_v.mykey, test_v.lic,';') 
RETURNS (outkey integer, tokennum integer, token varchar(20) )) as d 
where d.token in ('4','5','7','63','65','14A','8/9/10/')
order by 1,2;  

But now I have a normalized data set with a row for each value in the LIC column, which filters out any records where LIC is null.  I would still need to concatenate each of the unfiltered values back into a single string for the final output.  

Any suggestions? 
And this is just a sample of the actual data.  I have other columns to add back into the final data set. 
assuming we can concat the values back into a single string, I could imagine making this an inline view and using an outer join to bring it all together.  However I'm thinking this will have significant performance impacts over perfecting the Regexp.   

 

Teradata Employee

Re: Help with TD Regexp

I don't think you will find a solution with REGEXP_SUBSTR. REGEXP_SUBSTR only returns one match, not all matches. By default it's the first match ("Match 1" on regex101.com); you can use the "occurrence" argument to pick a different match, but only one. And there's no good way to "skip over" some characters and then continue matching.

 

Highlighted
Junior Contributor

Re: Help with TD Regexp

I don't think you can get what you want using a single REGEXP_SUBSTR, because this function returns only one of possibly multiple matches.

 

But instead of extracting the matching data you can remove the non-matching using REGEXP_REPLACE, following seems to work, after some trial & error :-)

SELECT lic
  ,TD_SYSFNLIB.RegExp_Replace(lic||';'  -- Make sure there's always a trailing semicolon
                             ,'(.*?)(\b(4|5|7|11|63|65|14A|8\/9\/10);|$)'
                             ,'\2')     -- remove the first capturing group
FROM test_v;

LIC                             
------------------------------  -----------------------
3;4;5;7;11;63;65;8/9/10         4;5;7;11;63;65;8/9/10;
4;99                            4;
3;4;7                           4;7;
4                               4;
4;7;24;63;65;14A;86;87;8/9/10   4;7;63;65;14A;8/9/10;
3;7;63;65;79;8/9/10             7;63;65;8/9/10;
3;4;5;7;24;27;33;52;53;55;63;6  4;5;7;63;65;14A;8/9/10;

This regex uses two capturing groups, the 1st group (.*?) simply looks for any string which is not in the 2nd group.

The 2nd group starts with a word boundary, followed by any of your search strings and a semicolon or the end of the string.

 

'\2' replaces the full match with the 2nd capturing group only, i.e. removes the 1st.

When you switch to '\1' you can see which data is removed.

Fan

Re: Help with TD Regexp

Thank you so much!  That works perfectly!  And it works about 10-100 faster than the kludge I came up with yesteday, which was to concatinate a separate substring for each term surrounded by nvl and cast.  

My final solution is now this: 

trim(both ';' from TD_SYSFNLIB.RegExp_Replace(lic||';'  -- Make sure there's always a trailing semicolon
                  ,'(.*?)(\b(4|5|6|7|11|17|21|22|24|25|26|28|39|42|51|52|53|55|62|63|65|66|14A|8\/9\/10);|$)'
                  ,'\2')     -- remove the first capturing group
                  ) as lic -- trim trailing

Instead of this: 

TRIM(BOTH ';' FROM cast(
  cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b4\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b5\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b6\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b7\b;?)'),'')  as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b11\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b17\b;?)'),'')  as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b21\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b22\b;?)'),'')  as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b24;?\b)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b25\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b26\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b27\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b28\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b39\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b42\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b51\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b52\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b53\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b55\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b62\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b63\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b65\b;?)'),'') as varchar(5))
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b66\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b14A\b;?)'),'') as varchar(5)) 
|| cast(nvl(TD_SYSFNLIB.REGEXP_substr(REP.LIC,'(\b8\/9\/10\b;?)'),'') as varchar(8)) 
as varchar(200))) as LIC,

Thanks again!