REGEXP_SPLIT_TO_TABLE within a WITH as

Database
Fan

REGEXP_SPLIT_TO_TABLE within a WITH as

Part of the challenge in using REGEXP_SPLIT_TO_TABLE or STRTOK_SPLIT_TO_TABLE is the inability to "carry" other columns from contributing tables along easily.  The function allows for a OUTKEY value, a count and the parsed value.  I'm trying to use CTE ( the WITH clause ) to make all the data available so one final query can bring it together and give me all the required data.  Unfortunately, it seems to work for a single REGEXP_SPLIT_TO_TABLE but when I add on a STRTOK_SPLIT_TO_TABLE the query throws an error that makes no sense to me.  I've posted a number of code chunks below to show various attempts.  In the end I did get something to work but I'm still looking for rationale why other attempts did not work.  Or, better ideas to "carrying" along the extra columns I want into the final result set.

 

Initial query works, but need more columns from the source table:

with split_groups as ( 
	select pk, group_cnt
	,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,1))		as varchar(32))	as province_nm
	,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,2))		as varchar(64))	as city_nm
	,	regexp_substr(prov_city_group_value,'[^/]+',1,3)			as npanxx_group_value
	from big_groups
	),
	
big_groups as ( 
	SELECT * 
         FROM TABLE (
   		REGEXP_SPLIT_TO_TABLE(test_data.pk, test_data.log_value, '\,\s+(?=[^0-9])', 'i')
   			RETURNS (pk VARCHAR(10)  character set unicode, group_cnt integer, prov_city_group_value varchar(16000) character set unicode)
   		) AS dt
	),
	
test_data as ( 
	SELECT 'abc' as pk, 1 as hee, 'def' as haw, 'New Brunswick/KEDGWICK/506283, 506284, New Brunswick/BLACKVILLE/506586, 506843,' as log_value
	)
SELECT npa.* 
FROM TABLE (
   		STRTOK_SPLIT_TO_TABLE(pk,  split_groups.npanxx_group_value, ',')
   			RETURNS (pk VARCHAR(32)  character set unicode, npanxx_cnt integer, npanxx varchar(10) character set unicode)
   		) AS npa

Inserting the STRTOK into a CTE ( so I can do one big join to bring everything together ) results in an error:  SELECT Failed.  [3822] Cannot resolve column ''. Specify table or view.

with split_npanxx as (
	SELECT * 
FROM TABLE (
   		STRTOK_SPLIT_TO_TABLE(split_groups.province_nm||split_groups.city_nm, split_groups, ',')
   			RETURNS (prov_city VARCHAR(100)  character set unicode, npanxx_cnt integer, npanxx char(6) character set unicode)
   		) AS npa
	),
	
split_groups as ( 
	select pk, group_cnt
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,1))		as varchar(32))	as province_nm
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,2))		as varchar(64))	as city_nm
		,	regexp_substr(prov_city_group_value,'[^/]+',1,3)				as npanxx_group_value
	from big_groups
	),
	
big_groups as ( 
	SELECT * 
FROM TABLE (
   		REGEXP_SPLIT_TO_TABLE(test_data.pk, test_data.log_value, '\,\s+(?=[^0-9])', 'i')
   			RETURNS (pk VARCHAR(10)  character set unicode, group_cnt integer, prov_city_group_value varchar(16000) character set unicode)
   		) AS dt
	),
	
test_data as ( 
	SELECT 'abc' as pk, 1 as hee, 'def' as haw, 'New Brunswick/KEDGWICK/506283, 506284, New Brunswick/BLACKVILLE/506586, 506843,' as log_value
	)
select *
from split_npanxx;

My ultimate intent is to create a query like below to bring it all together.  This one errors with:   SELECT Failed.  [3807] Object 'test_data' does not exist.

with split_groups as ( 
	select pk, group_cnt
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,1))		as varchar(32))	as province_nm
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,2))		as varchar(64))	as city_nm
		,	regexp_substr(prov_city_group_value,'[^/]+',1,3)								 as npanxx_group_value
	from big_groups
	),
	
big_groups as ( 
	SELECT * 
FROM TABLE (
   		REGEXP_SPLIT_TO_TABLE(test_data.pk, test_data.log_value, '\,\s+(?=[^0-9])', 'i')
   			RETURNS (pk VARCHAR(10)  character set unicode, group_cnt integer, prov_city_group_value varchar(16000) character set unicode)
   		) AS dt
	),
	
test_data as ( 
	SELECT 'abc' as pk, 1 as hee, 'def' as haw, 'New Brunswick/KEDGWICK/506283, 506284, New Brunswick/BLACKVILLE/506586, 506843,' as log_value
	)
SELECT npa.*, s.group_cnt, s.province_nm, s.city_nm
FROM TABLE (
   		STRTOK_SPLIT_TO_TABLE(pk,  split_groups.npanxx_group_value, ',')
   			RETURNS (pk VARCHAR(10)  character set unicode, npanxx_cnt integer, npanxx varchar(10) character set unicode)
   		) AS npa,
  split_groups	s
 where npa.pk = s.pk
 

However, if I create TEST_DATA as a volatile table ( which is really closer to the real situation ) it does work, though I was not able to include TEST_DATA in the final join to bring in columns HEE and HAW.  Not quite there yet....

create volatile multiset table test_data as (
 	with cte as ( 
	SELECT 'abc' as pk, 1 as hee, 'def' as haw, 'New Brunswick/KEDGWICK/506283, 506284, New Brunswick/BLACKVILLE/506586, 506843,' as log_value
	)
	select  pk, hee, haw, log_value
	 from cte
	 ) with data primary index ( pk ) on commit preserve rows
	 ;
 
with split_groups as ( 
	select pk, group_cnt
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,1))		as varchar(32))	as province_nm
		,	cast(trim(regexp_substr(prov_city_group_value,'[^/]+',1,2))		as varchar(64))	as city_nm
		,	regexp_substr(prov_city_group_value,'[^/]+',1,3)								 as npanxx_group_value
	from big_groups
	),
	
big_groups as ( 
	SELECT * 
FROM TABLE (
   		REGEXP_SPLIT_TO_TABLE(test_data.pk, test_data.log_value, '\,\s+(?=[^0-9])', 'i')
   			RETURNS (pk VARCHAR(10)  character set unicode, group_cnt integer, prov_city_group_value varchar(16000) character set unicode)
   		) AS dt
	)
	
SELECT npa.*, s.group_cnt, s.province_nm, s.city_nm
FROM TABLE (
   		STRTOK_SPLIT_TO_TABLE(pk,  split_groups.npanxx_group_value, ',')
   			RETURNS (pk VARCHAR(32)  character set unicode, npanxx_cnt integer, npanxx varchar(10) character set unicode)
   		) AS npa,
  split_groups	s
 where npa.pk = s.pk

Any pointers?

 

 

 

 


Accepted Solutions
Senior Apprentice

Re: REGEXP_SPLIT_TO_TABLE within a WITH as

You need to add another cte, but you must define the returned key column with the correct length. Latin must exact match and Unicode data must be twice the length. For your example data (VARCHAR(3) UNICODE) it must be VARCHAR(6) UNICODE  (if you define a larger size the result will be padded with binary zeroes and the join fails, bug or feature?):

 

WITH final_cte AS 
 (
   SELECT npa.*, s.group_cnt, s.province_nm, s.city_nm
   FROM TABLE (
        STRTOK_SPLIT_TO_TABLE(TRIM(split_groups.pk),  split_groups.npanxx_group_value, ',')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
            , npanxx_cnt INTEGER, npanxx VARCHAR(20) CHARACTER SET UNICODE)
        ) AS npa,
        split_groups AS s
   WHERE npa.pk = s.pk
 )
 
,split_groups AS ( 
   SELECT pk, group_cnt
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,1)) AS VARCHAR(32))   AS province_nm
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,2)) AS VARCHAR(64))   AS city_nm
      ,REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,3)                              AS npanxx_group_value
   FROM big_groups
   )
   
,big_groups AS ( 
   SELECT * 
   FROM TABLE (
        REGEXP_SPLIT_TO_TABLE(TRIM(test_data.pk), test_data.log_value, '\,\s+(?=[^0-9])', 'i')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
           , group_cnt INTEGER, prov_city_group_value VARCHAR(16000) CHARACTER SET UNICODE)
        ) AS dt
 )
   
SELECT * FROM final_cte 
JOIN test_data
ON final_cte.pk = test_data.pk

 

1 ACCEPTED SOLUTION
2 REPLIES
Senior Apprentice

Re: REGEXP_SPLIT_TO_TABLE within a WITH as

You need to add another cte, but you must define the returned key column with the correct length. Latin must exact match and Unicode data must be twice the length. For your example data (VARCHAR(3) UNICODE) it must be VARCHAR(6) UNICODE  (if you define a larger size the result will be padded with binary zeroes and the join fails, bug or feature?):

 

WITH final_cte AS 
 (
   SELECT npa.*, s.group_cnt, s.province_nm, s.city_nm
   FROM TABLE (
        STRTOK_SPLIT_TO_TABLE(TRIM(split_groups.pk),  split_groups.npanxx_group_value, ',')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
            , npanxx_cnt INTEGER, npanxx VARCHAR(20) CHARACTER SET UNICODE)
        ) AS npa,
        split_groups AS s
   WHERE npa.pk = s.pk
 )
 
,split_groups AS ( 
   SELECT pk, group_cnt
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,1)) AS VARCHAR(32))   AS province_nm
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,2)) AS VARCHAR(64))   AS city_nm
      ,REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,3)                              AS npanxx_group_value
   FROM big_groups
   )
   
,big_groups AS ( 
   SELECT * 
   FROM TABLE (
        REGEXP_SPLIT_TO_TABLE(TRIM(test_data.pk), test_data.log_value, '\,\s+(?=[^0-9])', 'i')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
           , group_cnt INTEGER, prov_city_group_value VARCHAR(16000) CHARACTER SET UNICODE)
        ) AS dt
 )
   
SELECT * FROM final_cte 
JOIN test_data
ON final_cte.pk = test_data.pk

 

Fan

Re: REGEXP_SPLIT_TO_TABLE within a WITH as

Thanks Dieter.  It appears the *SPLIT_TO_TABLE functions are a little finicky about translation and data types.  There's really no need to use UNICODE, so I'd rather use LATIN, but when I do, the final join gives me nothing.  And defining PK as CHAR(6) UNICODE won't work either, it must be VARCHAR(6) UNICODE...  It also looks like the OUTKEY ( my column PK ) is being generated as UNICODE - regardless what it's defined as.  When I define it as CHAR(3) CHARACTER SET LATIN the query errors with SELECT Failed.  [9134] Output column is not big enough to hold outkey.

 

Evidently there's a gap in my understanding of data types or the implementation and/or documentation is buggy / incomplete.

 

Solution: as provide by Dieter, define character output of *SPLIT_TO_TABLE functions as VARCHAR(x) and CHARACTER SET UNICODE.  

create volatile multiset table test_data as (
 	with cte as ( 
	SELECT 'abc' as pk, 1 as hee, 'def' as haw, 'New Brunswick/KEDGWICK/506283, 506284, New Brunswick/BLACKVILLE/506586, 506843,' as log_value
	)
	select  pk, hee, haw, log_value
	 from cte
	 ) with data primary index ( pk ) on commit preserve rows
	 ;
	 
WITH final_cte AS 
 (
   SELECT npa.*, s.group_cnt, s.province_nm, s.city_nm
   FROM TABLE (
        STRTOK_SPLIT_TO_TABLE(TRIM(split_groups.pk),  split_groups.npanxx_group_value, ',')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
            , npanxx_cnt INTEGER, npanxx VARCHAR(20) CHARACTER SET UNICODE)
        ) AS npa,
        split_groups AS s
   WHERE npa.pk = s.pk
 )
 
,split_groups AS ( 
   SELECT pk, group_cnt
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,1)) AS VARCHAR(32))   AS province_nm
      ,CAST(TRIM(REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,2)) AS VARCHAR(64))   AS city_nm
      ,REGEXP_SUBSTR(prov_city_group_value,'[^/]+',1,3)                              AS npanxx_group_value
   FROM big_groups
   )
   
,big_groups AS ( 
   SELECT * 
   FROM TABLE (
        REGEXP_SPLIT_TO_TABLE(TRIM(test_data.pk), test_data.log_value, '\,\s+(?=[^0-9])', 'i')
           RETURNS (pk VARCHAR(6) CHARACTER SET UNICODE
           , group_cnt INTEGER, prov_city_group_value VARCHAR(16000) CHARACTER SET UNICODE)
        ) AS dt
 )
   
SELECT * FROM final_cte 
JOIN test_data
ON final_cte.pk = test_data.pk