need help regex url query SQL

General
Enthusiast

need help regex url query SQL

I have a column with long URLs strings and I want to extract their domains, also can someone point me to regex operators in teradata ?

 

Most of the HowTos that I have found so far are kind of confusing and hard to read.

 

The URL string can start (there could be sub-domains) of various kinds.

 

https://www.domain.com/asdasd/asdasdasdasd/

http://www.domain.net/blbasdasdasd-asd

https://n.text.domain.com

 

I basically I want to strip the domain part and leave everything after /asdasd/asdasdasdasd/  behind

 

How does this work ?

 

Here is what I have so far, it returns: https://www.domain.com/events/

 

but I want to ignore /events ? and return https://www.domain.com

 

select 'https://www.domain.com/events/143403?sid=1090794&mid=35' as string_to_search
,REGEXP_SUBSTR(string_to_search, '^.*(?=(/))',1,1,'i') as extract_domain

 

Thank you!

 

3 REPLIES

Re: need help regex url query SQL

Dear beeroclock,

 

Here is a solution that does not use regex, just substr and position.  It is probably not the most elegant or efficient, but....

 

Basically, it locates the position of the third '/' by interrogating a substr of URL that omits the "https://".  Then substr'ing the url from 1 to the "third / position" plus an offset (7 is the magic number in this case).  The CASE statement is needed in case there isn't a third '/'.

 

It works with all the example data you provided.

 

Hope that helps.  Cheers!

 

SELECT
 'https://www.domain.com/events/143403?sid=1090794&mid=35' as urlstring
, case when position ('/' in substr(urlstring ,9)) = 0 then urlstring
   else substr(urlstring, 1, position ('/' in substr(urlstring ,9))+7 ) end as domainstring;
  
SELECT
'https://www.domain.com/asdasd/asdasdasdasd/' as urlstring
, case when position ('/' in substr(urlstring ,9)) = 0 then urlstring
   else substr(urlstring, 1, position ('/' in substr(urlstring ,9))+7 ) end as domainstring;

 

SELECT
   'http://www.domain.net/blbasdasdasd-asd'as urlstring
, case when position ('/' in substr(urlstring ,9)) = 0 then urlstring
   else substr(urlstring, 1, position ('/' in substr(urlstring ,9))+7 ) end as domainstring;
  
SELECT
'https://n.text.domain.com' as urlstring
, case when position ('/' in substr(urlstring ,9)) = 0 then urlstring
   else substr(urlstring, 1, position ('/' in substr(urlstring ,9))+7 ) end as domainstring;   
  
SELECT
'https://www.domain.com/events/' as urlstring
, case when position ('/' in substr(urlstring ,9)) = 0 then urlstring
   else substr(urlstring, 1, position ('/' in substr(urlstring ,9))+7 ) end as domainstring;

 

Enthusiast

Re: need help regex url query SQL

hi rewaltd - thanks for alternative - definately appreciated, I really want to get better with regex, I found something that works in the simple example:

 

select 'https://www.domain.com/events/143403?sid=1090794&mid=35' as string_to_search
,REGEXP_SUBSTR(string_to_search, '^.*?(?<!/)(?=/)(?!//)',1,1,'i') as extract_domain;

 

the Column that I am trying to apply this on is a: VARCHAR(1500).  I am not sure that I understand the error at all ??

 

But then when I run it for real on the column in the table that has the PAGE_REFERENCE_URLS, I get this error:

 

Executed as Single statement. Failed [9881 : HY000] Function 'REGEXP_SUBSTR' called with an invalid number or type of parameters Elapsed time = 00:00:00.405 STATEMENT 1: Select Statement failed.

 

Here is the info on that column, it is in a View. 

PAGREF_URL VARCHAR(1500) CHARACTER SET UNICODE NOT CASESPECIFIC TITLE 'PAGE_REF_URL',

 

and what I am using:

 

SELECT * FROM DBC.DBCINFOV; VERSION 15.10.03.05 LANGUAGE SUPPORT MODE Standard RELEASE 15.10.03.05

 

 

 

 

 

Enthusiast

Re: need help regex url query SQL

Hi Beerlock

 

Try below query:

 

SyntaxEditor Code Snippet

select 'https://www.domain.com/events/143403?sid=1090794&mid=35' as string_to_search,REGEXP_SUBSTR(string_to_search, '^(.*)\:\/\/(.*)\.[a-zA-Z]*',1,1,'i') as extract_domain

Thanks
Nitin 

Tags (1)