Identity Match function gives only exact matchings

Analytics

Identity Match function gives only exact matchings

Hi, 

 

I am facing a problem with IdentityMatch function. I have this main table as you can see below. I want to find the smilar strings in the "title" column.

 

1.JPG

 

 

By using this main table (havale_eft_node_2), I create two sub-tables (a and b) based on cust_tp (T or X) and I want to compare each record from "a" and each record from "b" for the title column on a string basis. 

 

SELECT * FROM IdentityMatch (

ON (SELECT ROW_NUMBER () OVER (ORDER BY a.ID DESC) AS ROW_ID , a.* FROM havale_eft_node_2 a WHERE CUST_TP IN ('T')) AS a PARTITION BY TITLE

ON (SELECT ROW_NUMBER () OVER (ORDER BY a.ID DESC) AS ROW_ID , a.* FROM havale_eft_node_2 a WHERE CUST_TP IN ('X')) AS b PARTITION BY TITLE

IDColumn ('a.row_id: b.row_id')

FuzzyMatchColumns ('a.title: b.title, JARO-WINKLER, 1')

Accumulate ('a.id', 'a.title', 'b.id', 'b.title', 'a.cust_tp', 'b.cust_tp')

Threshold (0.85)

)

 

 

The problem is when I run the query above, the function gives only the exact matchings as output. I could not figure out the reason and would greatly appreciate any help.

 

Best regards.

Ahmet

Tags (1)
2 REPLIES
Teradata Employee

Re: Identity Match function gives only exact matchings

Ahmet,

 

 

The only exact matches is due to partitioning each data set by TITLE. In this way, only equal partitions are being compared to each other (to reduce computation time). This means non-identical matches are never even being considered.

In you dataset there isn't a good column to partition by since we want an attribute that we know would likely be the same in both inputs such as an age range or country. Using the following syntax should work for getting your fuzzy matches:

 

SELECT * FROM IdentityMatch (
ON (
SELECT ROW_NUMBER () OVER (ORDER BY a.ID DESC) AS ROW_ID
, a.*
FROM havale_eft_node_2 a
WHERE CUST_TP IN ('T')
) AS a PARTITION BY ANY
ON (
SELECT ROW_NUMBER () OVER (ORDER BY a.ID DESC) AS ROW_ID
, a.* FROM havale_eft_node_2 a
WHERE CUST_TP IN ('X')
) AS b DIMENSION
IDColumn ('a.row_id: b.row_id')
FuzzyMatchColumns ('a.title: b.title, JARO-WINKLER, 1')
Accumulate ('a.id', 'a.title', 'b.id', 'b.title', 'a.cust_tp', 'b.cust_tp')
Threshold (0.85)
)

Your run time with increase since more comparisions are being made, but the output will include non-identical matches. 

 

Good luck!

Michelle

Re: Identity Match function gives only exact matchings

Hi Michelle,

 

Thanks for replying. I had run the query as you mentioned. However, the table has more than 2m rows and I had to cancel the execution after 3 days. Therefore I had tuned the query. 

 

I had a similar case where I took advantage of "partitioning by key" concept. It was a money transfer table (havale_eft_edge) of 8 million rows. You can see the data structure on the following sample table.

 

ROW_ID

SENDER_TITLE

RCPNT_TITLE

3

SAMI ESIM

CAVIDE GOKNAR

4

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

SUNAR MISIR ENTEGRETESIS LERI SAN.VETIC.A.S.

5

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

SUNAR MISIR A.S.

6

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

SUNAR MISIR ENTEGRE TESISLERI SAN.VE TIC.A.S. SISLERI SAN.VE TIC.A.S.

7

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

SUNAR MISIR ENTEGRE TES. A.S.

8

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

RAMAZAN ULUC

9

SUNAR MISIR ENTEGRE TESI SLERI SAN.VE TIC.A.S.

MAHSUM BINGOL

10

DEMIRER HIDROLIK SANAYI VE TICARET LTD STI

DEMIRER HIDROLIK SANAYI VE TICARET LTD STI I VE TICARET LTD STI

13

OZGUL OZDEMIR

KAMIL OZDEMIR

15

FATMA NILGUN SARI

GIMSA YAPI TAAHHUT SAN.V E TIC.LTD.STI.

 

 

I wanted to calculate the similarity between the SENDER_TITLE and RCPNT_TITLE columns for each row on a string basis and used the query below. The output included non-identical matches too.

 

SELECT * FROM IdentityMatch (

ON (SELECT * FROM havale_eft_edge) AS a PARTITION BY sender_title

ON (SELECT * FROM havale_eft_edge) AS b PARTITION BY sender_title

IDColumn ('a.row_id: b.row_id')

FuzzyMatchColumns ('a.sender_title: b.rcpnt_title, JARO-WINKLER, 1')

Accumulate ('a.sender_title', 'b.rcpnt_title')

Threshold (0.95)

)

 

Therefore I thought that I could use the same method here but I think there is a problem with the data structure for this example. 

 

Kind regards,

Ahmet