Performance issue with strtok function

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Highlighted
Junior Supporter

Performance issue with strtok function

Hi,

We are facing a performance issue when using strtok funtion. There is one column derived in a view using below code. The base table has 11 million records.There is a lot of difference seen in cpu when we dont derive these columns and when we derive these columns. without derivation -  4K with derivation - 23K.

Column definition is - mycol VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL and the derivation expression is :

SyntaxEditor Code Snippet

                        case when substr(strtok(mycol , '_',length(mycol ) - length(oreplace(mycol ,'_',''))),1,1)= 'T' 
                                                    then strtok(mycol , '_',length(mycol ) - length(oreplace(mycol ,'_','')))
                                        else strtok(mycol , '_',length(mycol ) - length(oreplace(mycol ,'_',''))+1) end 
                            AS VARCHAR(10)) der_col

Is this expected ? And what is the way to fix this ?

--Samir

8 REPLIES 8
Teradata Employee

Re: Performance issue with strtok function

Hi Samir,

 

Yes this is expected. Functions and computing does cost CPU.

4k to 23k isn't much an expand for what the function does.

 

If you could provide some data in input and expected output we may avoid using strtok, or use it differently, your case seems a bit complex it may be simplified.

Junior Supporter

Re: Performance issue with strtok function

Hi Walder,

Thanks for looking into this. I have mocked up the data but it macthes the actual one in format. I have bifurcated each expression in the final expression  to give a better idea. The Col6 is the actual expression. So essentially, it counts the "_" in the string and uses strtok to find the last but one string. In this ex, there are 5 "_" and finally it finds fifth part - P004 and takes out P. is there a better way to do this ? Any idea would help.

sel  
'O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123' as col1,length('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123') as col2,oreplace('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123',   '_','') as col3,length(oreplace('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123',    '_','')) as col4,length('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123') - length(oreplace('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123',   '_','')) as col5,substr(strtok('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123', '_', length('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123') - length(oreplace('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123',    '_',''))),1,1) as col6

 

 --Samir

Teradata Employee

Re: Performance issue with strtok function

It should be doable within a single regexp but I can't find it.

Anyway give it a try :

select substring(regexp_substr('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123', '_(.[^_])*_(.[^_])*$') from 2 for 1);

Edit --

Ok this syntax works on its own :

select regexp_substr_gpl('O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123', '_([^_])[^_]*_[^_]*$', 1, 1, 'i', 1);

 

Teradata Employee

Re: Performance issue with strtok function

If there are always at least 3 tokens, this should return the next-to-last token if it starts with T, otherwise return the last token:

REGEXP_SUBSTR(mycol,'(?<=_)(((T[^_]*)(?=_[^_]+$))|([^_]+$))')

 

 

Ambassador

Re: Performance issue with strtok function

To avoid looahead/lookback syntax I usually use the undocumented RegExp_Substr_gpl which allows specifying the mathc group to be returned. But why oh why does a very similar regex not work for the second alternative?

RegExp_Substr_gpl(mycol,'_(T[^_]*)_[^_]+$|([^_]+$)',1,1,'i', 1) -- doesn't work for _P004_W123, but for _T004_W123
Teradata Employee

Re: Performance issue with strtok function

 

I could avoid the lookback from Fred's regexp using the \K argument, but not the lookahead.

Also, using branch reset (even if it looks like a negative lookahead, seems to not be the same thing), I could get your regexp to work :

with cte_data (mycol) as
(
select 'O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123' from cte_dual union all
select 'O12345678_TS_Thisismytestdatadatadatadat_mydat_T004_W124' from cte_dual
) , cte_dual (dummy) as (select 'X')
select mycol
     , regexp_substr (mycol, '(?<=_)(((T[^_]*)(?=_[^_]+$))|([^_]+$))')         as regexp_fred
     , regexp_substr (mycol, '_\K((T[^_]*)(?=_[^_]+$)|([^_]+$))')              as regexp_fred2
     , regExp_substr_gpl(mycol, '(?|(_(T[^_]*)_[^_]+$)|(([^_]+$)))',1,1,'i',2) as regexp_dnoeth
  from cte_data
;

 mycol                                                    regexp_fred regexp_fred2 regexp_dnoeth 
 -------------------------------------------------------- ----------- ------------ ------------- 
 O12345678_TS_Thisismytestdatadatadatadat_mydat_P004_W123 W123        W123         W123         
 O12345678_TS_Thisismytestdatadatadatadat_mydat_T004_W124 T004        T004         T004         

Side finding, on TD Studio I had to convert the regexp_substr_gpl to varchar(10), it may be a little too undocumented :)

Ambassador

Re: Performance issue with strtok function

Thanks for spotting the reset, I didn't notice that my regex creates two groups when I checked it on regex101.com

And it can be further simplified:

 

regExp_substr_gpl(mycol, '(?|_(T[^_]*)_[^_]+$|([^_]+$))',1,1,'i',1)

 

I'm still a regex noob :-)

 

Btw, both regExp_substr_gpl and regExp_substr return the same datatype, VARCHAR(8000) CHARACTER SET UNICODE, shouldn't cause a problem in Studio.

Teradata Employee

Re: Performance issue with strtok function

Thanks to @Waldar and @dnoeth both for finding a more efficient regex!