Searching for a huge constantly changing pattern list in a text column in teradata

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Highlighted
Xya
Fan

Searching for a huge constantly changing pattern list in a text column in teradata

Hi All,

 

I have a query pertaining to searching for a pattern containing a huge list of words in a text column in teradata.

 

Lets say I have a table Addresses that contains millions of address from all over the world.

 

I have another table CountryCode, which contains a list of country codes. For e.g., US for USA, IN for India, UK for United Kingdom, SA for Saudi Arabia, BA for Bosnia so on and so forth.

 

I would need to find all the addresses in table Addresses that contain country codes which are present in table CountryCode.

A simple like statement would have got the job done, but with some addresses it gets tricky.

 

Lets say I have an address, 'No 10, Bank of Xyz, Abc'. As it can be seen, there is no country code in this address, yet, a like statement would falsely assume the 'Ba' from 'Bank' is the country code 'BA' for Bosnia. Thus, to identify presence of actual country codes, we would need to check if table Addresses contains country codes from table Countrycodes which is not preceded or succeeded by any alphabet or number. It can be preceded or succeeded by any other character, or no character at all.


Table:  Addresses

ADDRESS1
11 Wall Street New York ,US
221 B Baker St, London, England
IN/282001 Agra Uttar Pradesh
4 Privet Drive Surrey -UK/12345
No 2 Bank of Baroda, Chennai
....

Table: Countrycodes
CCode Country
IN India
US United States of America
SA Saudi Arabia
BA Bosnia
UK United Kingdom
CH China
JA Japan
.....

 

The 1st, 3rd, 4th address contain valid country codes whereas the rest do not. Is there any query that I can use to flag address that contain the country codes from the list in the second table that is not preceded or succeeded by alphabets or numbers?

 

It is important to note that the list in table Countrycodes will be periodically updated. Thus, a regexp hardcoded with the countries in the list (which is also quite long!) is impractical.

 

What I've got so far:

 

I would like a neat little solution to do this. Is there an option where regexp_substr can take dynamically generated patterns?

 

Can the below query be modified in any way to achieve this? I am trying to generate the pattern dynamically, which is most probably not allowed in regexp.

 

SELECT
'REGEXP_SUBSTR(ADDRESS1||'' '','||(SELECT '''[^0-9a-zA-Z]'||OREPLACE('('||TRIM(TRAILING '|' FROM
(XMLAGG (TRIM (CCODE)||'|' ORDER BY CCODE) (VARCHAR(10000))))||')[^0-9a-zA-Z]''', ' ','')
from COUNTRYCODES) ||')' AS XX
;

 

this would give me the regex search expression. Can it be modified to execute from the below query, without trying to export to a file and then using the Bteq Run file?

 

select ADDRESS1,
'REGEXP_SUBSTR(ADDRESS1||'' '','||(SELECT '''[^0-9a-zA-Z]'||OREPLACE('('||TRIM(TRAILING '|' FROM
(XMLAGG (TRIM (CCODE)||'|' ORDER BY CCODE) (VARCHAR(10000))))||')[^0-9a-zA-Z]''', ' ','')
from COUNTRYCODES) ||')' AS XX
FROM ADDRESSES
WHERE XX IS NOT NULL
;

Please let me know if there is another simpler solution to achieve this.

 

Also, Im not allowed to use stored procedures. So a solution without that would be very helpful.

 

Thankyou in advance.

 


Accepted Solutions
Ambassador

Re: Searching for a huge constantly changing pattern list in a text column in teradata

You can return only the RegEx using a Scalar Subquery similar to the existing one (and a bit simplified).

I also switched to a word boundary \b (any character besides a-z,0-9,_)

SELECT ADDRESS1,
   RegExp_Substr(ADDRESS1
                , (SELECT '\b('
                     || OTranslate(XmlAgg (Trim (CCODE) ORDER BY CCODE) (VARCHAR(10000)), ' ' , '|')
                     || ')\b'
                   FROM COUNTRYCODES
) ) AS ccode FROM addresses;

 

1 ACCEPTED SOLUTION
2 REPLIES 2
Ambassador

Re: Searching for a huge constantly changing pattern list in a text column in teradata

You can return only the RegEx using a Scalar Subquery similar to the existing one (and a bit simplified).

I also switched to a word boundary \b (any character besides a-z,0-9,_)

SELECT ADDRESS1,
   RegExp_Substr(ADDRESS1
                , (SELECT '\b('
                     || OTranslate(XmlAgg (Trim (CCODE) ORDER BY CCODE) (VARCHAR(10000)), ' ' , '|')
                     || ')\b'
                   FROM COUNTRYCODES
) ) AS ccode FROM addresses;

 

Xya
Fan

Re: Searching for a huge constantly changing pattern list in a text column in teradata

The solution works perfectly! Thanks for the quick response, that too on a Sunday!