Recommendations for dynamic phrase searching with tagging output.

Database
Enthusiast

Recommendations for dynamic phrase searching with tagging output.

Hi All,

Looking for recommendations on Full Text search Tagging via LIKE / LIKE ANY etc.

This has to be dynamic, from a table of terms and tags.

I have a billion row table of free text (myPK bigint + myTEXT varchar 4000).

I have a 200-500 row table of phrases and/or words tied back to Tags. (myTAG varchar(10) + myPHRASE varchar(100))

I need to have an output table of any matches of myPK and myTERM.   It has to be exact phrase matching.  

Example, my table of TAGS, this is a current method I've been playing with to achieve this goal:

CREATE MULTISET TABLE DBNAME.Z_Tags, NO JOURNAL, NO FALLBACK 

(

myPhrase varchar(100),

myTag varchar(10)

)

PRIMARY INDEX (myPhrase, myTag)

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%this is a test%','TEST');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%testing testing 123%','TEST');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%the quality of this test%','TEST');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%RED ROBIN%','COLOR');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%BLUE MAN GROUP%','COLOR');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%PONTIAC%','CAR');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%TESLA%','CAR');

insert into DBNAME.Z_Tags (myPhrase, myTag) values ('%UNABLE TO CREATE%','CREATION');

Example, my table of FreeText

CREATE MULTISET TABLE DBNAME.Z_Text, NO JOURNAL, NO FALLBACK 

(

myPK bigint,

myTEXT varchar(4000)

)

PRIMARY INDEX (myPK)

insert into DBNAME.Z_Tags (myPK, myTEXT) values (123, 'this is a dummy text, will not find anything here');

insert into DBNAME.Z_Tags (myPK, myTEXT) values (124, 'what is the quality of this test for the following etc etc.');

insert into DBNAME.Z_Tags (myPK, myTEXT) values (125, 'Pontiac vehicles are no longer made');

insert into DBNAME.Z_Tags (myPK, myTEXT) values (126, 'Elon Musk will eventually install paypal systems inside tesla vehicles and possibly rockets');

insert into DBNAME.Z_Tags (myPK, myTEXT) values (127, 'UNABLE TO CREATE THE THIS IS A TEST OF THE TESLA BLUE MAN GROUP');

...  billion(s) of rows of unstructured text under 4k in length and structured in the fact we have a PK association.

Natural language processing eventually on the free text with other tools in a slow fashion, but for now, we want to brute force based on known phrases.  

Based on the above 2, my output would be as follows:

output table:

CREATE MULTISET TABLE DBNAME.Z_Tags_Found, NO JOURNAL, NO FALLBACK 

(

myPK BIGINT,

myTAG varchar(100)

)

124 TEST

125 CAR

126 CAR

127 CREATION

127 TEST

127 COLOR

127 CAR

NOTE HOW I HAVE MULTIPLE 127 ROWS, THIS IS EXPECTED.  

This tagged table will then be utilized for a wide range of reporting purposes.

One method I was testing is as follows, it gets killed in our environment for extensive cpu and/or spool.   Note I'm doing a LIKE in the Join, it actually does work for small amounts of terms, but not a gauntlet of 100 + terms. 

select myPK, myTAG from DBNAME.Z_Text c 

inner join DBNAME.Z_Tags t on c.myTEXT like t.myPHRASE

Based on that, I can make it work by doing smaller chunks of z_tags, such as just doing TEST tag from my example.

this is a proc I was testing as an iterative way to search.  I can remove myTAG = 'TEST', etc.  This proc is not fully vetted.

REPLACE PROCEDURE DBNAME.z_Tag  ( OUT m INTEGER )

BEGIN

DECLARE T VARCHAR(40);

DECLARE cursor_name CURSOR FOR select DISTINCT myTAG from dbname.z_tags where myTAG = 'TEST';

OPEN cursor_bob;

uglybastid:

LOOP

FETCH cursor_bob INTO T;

IF (SQLSTATE = '02000') THEN

  LEAVE uglybastid;

END IF;

insert into DBNAME.Z_Tags_Found (myPK, myTag) select myPK, myTAG from DBNAME.Z_Text c 

inner join DBNAME.Z_Tags t on c.myTEXT like t.myPHRASE and myTag = T;

END LOOP uglybastid;

CLOSE cursor_bob;

set m = 0;

END;

thanks in advanced.  

-Scott

Tags (3)
2 REPLIES
Enthusiast

Re: Recommendations for dynamic phrase searching with tagging output.

correction, on the second group of insert examples, I meant Z_Text not Z_Tag.   

Enthusiast

Re: Recommendations for dynamic phrase searching with tagging output.

and one more correction, cursor_name should be cursor_bob.

For completeness, here is a SQL Server implementation using Contains Full Text Indexing, this works.    I would much rather do this on Teradata and NOT pull all the TEXT down, even at 50-100k rows a second it takes a very long time to ETL, do a rollup, then ETL back to Teradata.   Not a fan of having to ETL to SQL server to perform a rollup operation because TeraData can't handle it. (kidding, but not really)   ;)

CREATE procedure [dbo].[Term_Tagging] as


---- table for list of terms

--create table z_classification_tags 

--(

--term varchar(50) not null, 

--phrase varchar(100) not null

--) 

---- table for results

--Create table z_classification_results

--(id_num varchar(64) not null,

--term varchar(50) not null,

--id_rank int not null)

---- bulk load a list of terms

--bulk insert z_classification_tags from 'c:\temp\terms.csv'

--WITH

--(FIRSTROW = 2, 

--  FIELDTERMINATOR = ',',

--  ROWTERMINATOR = '\n',

--  TABLOCK

--)

---- dedup any dup terms

--with dedup as 

--(

--select term, phrase, row_number() over (partition by term, phrase order by term, phrase) as rownum from z_classification_tags 

--)

--delete from dedup where rownum > 1

-- truncate table z_classification_results

DECLARE @TagCursor CURSOR;

DECLARE @SearchTerm varchar(50);

DECLARE @SearchPhrase varchar(100);

BEGIN

-- cursor of terms

    SET @TagCursor = CURSOR FOR

select quotename(phrase,'"') as phrasequoted, term from z_classification_tags

-- open and get first row

    OPEN @TagCursor 

    FETCH NEXT FROM @TagCursor

    INTO @SearchPhrase, @SearchTerm

    WHILE @@FETCH_STATUS = 0

    BEGIN

-- print for debugging purposes, comment these out for proc when not testing

print @searchPhrase

print @searchterm

-- insert found ids for search term into temporary table

insert into z_classification_results (id_num, term, id_rank)

select id_num, @SearchTerm as term, isnull(min(id_rank),1) as id_rank 

from TD_TEXT t 

inner join TD_HEADER cm on cm.id = t.ID 

where contains(Call_Text, @searchphrase) and id_NUM is not null

group by id_num -- group by is used in getting the id_rank as we need the first occurance num

;

-- get next item in cursor

FETCH NEXT FROM @TagCursor 

INTO @SearchPhrase, @SearchTerm

    END; 

    CLOSE @TagCursor;

    DEALLOCATE @TagCursor;

-- dedup final result

with dedup as 

(

select id_num, term, id_rank, row_number() over (partition by id_num, term order by id_num, term, id_rank) as rownum from z_classification_results 

)

delete dedup where rownum > 1

End; -- end of proc

GO