Powerful Identity Matching with Aster

Learn Aster
Teradata Employee

Aster’s IdentityMatch function

In these examples, I’m comparing all customer records to all customer records.  In a real-world process, it would likely be comparing ‘new records’ to the existing customer records or some scenario along those lines.

Sample Data

id

First

name

Middle

name

Last

name

city

zip

phone

dob

ssn

1

John

D

Rockefeller

New York

10021

212-555-1000

1839-07-08

111-22-3456

2

John

Rockefeller

New York

10021

212-555-1000

1839-01-01

3

J

D

Rockefeller

New York

10021

212-555-1000

1839-07-01

111-22-3456

4

J

Rockefeller

Newport

212-555-1000

5

Jon

B

Rockefella

New York

10021

7/1/1925

444-11-4321

6

J

B

Rockefella

New York

10021

5/15/1925

444-11-4321

select * from IdentityMatch

(

on customers as partition by any

on customers as dimension 

idColumn('a.id: b.id')

nominalMatchColumns('a.ssn: b.ssn')

fuzzyMatchColumns('a.first_name: b.first_name,JARO-WINKLER, 3'

                     ,'a.last_name: b.last_name,JARO-WINKLER,5'

                     ,'a.city: b.city,JARO-WINKLER,3'

                     ,'a.dob: b.dob,JARO,10')

accumulate('a.first_name','a.last_name','b.first_name','b.last_name'

                      ,'a.city','b.city','a.dob','b.dob','a.ssn','b.ssn')

)

where "a.id" <> "b.id"

order by "a.id", "b.id";

a.id

a.first

name

a.last

name

a.city

a.dob

a.ssn

b.id

b.first

name

b.last

name

b.city

b.dob

b.ssn

score

1

John

Rockefeller

New York

1839-07-08

111-22-3456

2

John

Rockefeller

New York

1839-01-01

0.9365

1

John

Rockefeller

New York

1839-07-08

111-22-3456

3

J

Rockefeller

New York

1839-07-01

111-22-3456

1

1

John

Rockefeller

New York

1839-07-08

111-22-3456

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

0.8818

1

John

Rockefeller

New York

1839-07-08

111-22-3456

6

J

Rockefella

New York

5/15/1925

444-11-4321

0.8064

2

John

Rockefeller

New York

1839-01-01

1

John

Rockefeller

New York

1839-07-08

111-22-3456

0.9365

2

John

Rockefeller

New York

1839-01-01

3

J

Rockefeller

New York

1839-07-01

111-22-3456

0.9292

2

John

Rockefeller

New York

1839-01-01

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

0.8591

2

John

Rockefeller

New York

1839-01-01

6

J

Rockefella

New York

5/15/1925

444-11-4321

0.8117

3

J

Rockefeller

New York

1839-07-01

111-22-3456

1

John

Rockefeller

New York

1839-07-08

111-22-3456

1

3

J

Rockefeller

New York

1839-07-01

111-22-3456

2

John

Rockefeller

New York

1839-01-01

0.9292

3

J

Rockefeller

New York

1839-07-01

111-22-3456

4

J

Rockefeller

Newport

0.5018

3

J

Rockefeller

New York

1839-07-01

111-22-3456

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

0.904

3

J

Rockefeller

New York

1839-07-01

111-22-3456

6

J

Rockefella

New York

5/15/1925

444-11-4321

0.8596

4

J

Rockefeller

Newport

3

J

Rockefeller

New York

1839-07-01

111-22-3456

0.5018

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

1

John

Rockefeller

New York

1839-07-08

111-22-3456

0.8818

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

2

John

Rockefeller

New York

1839-01-01

0.8591

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

3

J

Rockefeller

New York

1839-07-01

111-22-3456

0.904

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

6

J

Rockefella

New York

5/15/1925

444-11-4321

1

6

J

Rockefella

New York

5/15/1925

444-11-4321

1

John

Rockefeller

New York

1839-07-08

111-22-3456

0.8064

6

J

Rockefella

New York

5/15/1925

444-11-4321

2

John

Rockefeller

New York

1839-01-01

0.8117

6

J

Rockefella

New York

5/15/1925

444-11-4321

3

J

Rockefeller

New York

1839-07-01

111-22-3456

0.8596

6

J

Rockefella

New York

5/15/1925

444-11-4321

5

Jon

Rockefella

New York

7/1/1925

444-11-4321

1

Note: records with a 100% match are due to the nominal rule on SSN.

Also, the scores are much too high, look at the second to last row.  'Rockefella in 1925' should not have an 86% match with 'Rockefeller of 1839'!  Let's tune our model.

Model Iteration

New recorded added for a new 1955 Rockefeller.

insert into customers values('7','Jim','P','Rockefeller Jr','New Jersey','07047',null,'1955-11-22','555-01-6778');

For tuning, I’ve used simple SQL on the input data to extract the DOB year into a new field.  That will give us a more precise comparison.  In fact, I’ve commented out the original DOB rule and replaced it with a rule for DOB_year.

I’ve also commented out the exact match ‘nominal’ rules.  Those can easily be done in an initial step to remove the ‘exact matches’ first.

This now leaves the Fuzzy matches, which we can iterate and tune by specifying different models; ie Levenshtein vs Jaro, etc.

And we can easily tune the model by changing the weighting factor for each rule. In my example, I’m using a simple 1-10 scale.

And finally we can filter with the ‘threshold’ parameter; ie “only include results with scores greater than this threshold”.  The default is .5, but I’ve lowered it to .2 so that we can evaluate more results for tuning purposes.

select * from IdentityMatch

(

    on ( select *, date_part('year',dob) as dob_year from customers ) as A partition by any

    on ( select *, date_part('year',dob) as dob_year from customers ) as B dimension

    idColumn('a.id: b.id')

          -- if nominal Match columns are equal, records are considered a complete

          -- match and fuzzy Matching is not calculated

  --  nominalMatchColumns('a.ssn: b.ssn'

  --                     ,'a.last_name: b.last_name'

  --                     ,'a.dob: b.dob')

  -- fuzzy match weights are normalized; here I'm using a 1-10 scale (10=most weight)

    fuzzyMatchColumns('a.first_name: b.first_name,JARO-WINKLER, 5'

                     ,'a.last_name: b.last_name,JARO-WINKLER,10'

                     ,'a.city: b.city,JARO-WINKLER,4'

                --   ,'a.dob: b.dob,JARO,5'

                     ,'a.dob_year: b.dob_year,LD,7'

                     ,'a.ssn: b.ssn,LD,7'

                     )

    accumulate('a.first_name','b.first_name','a.last_name','b.last_name'

              ,'a.city','b.city'

              --,'a.dob','b.dob'

              ,'a.dob_year','b.dob_year'

              ,'a.ssn','b.ssn')

    threshold('.2')

)

where "a.id" <> "b.id"

order by "a.id", "b.id";

a.id

a.first

name

a.last_

name

a.city

a.dob

year

a.ssnb.id

b.first

name

b.last

name

b.city

b.dob

year

b.ssnscore

5

Jon

Rockefella

New York

1925

444-11-4321

6

J

Rockefella

New York

1925

444-11-4321

0.9798

6

J

Rockefella

New York

1925

444-11-4321

5

Jon

Rockefella

New York

1925

444-11-4321

0.9798

1

John

Rockefeller

New York

1839

111-22-3456

3

J

Rockefeller

New York

1839

111-22-3456

0.9773

3

J

Rockefeller

New York

1839

111-22-3456

1

John

Rockefeller

New York

1839

111-22-3456

0.9773

1

John

Rockefeller

New York

1839

111-22-3456

2

John

Rockefeller

New York

1839

0.7879

2

John

Rockefeller

New York

1839

1

John

Rockefeller

New York

1839

111-22-3456

0.7879

2

John

Rockefeller

New York

1839

3

J

Rockefeller

New York

1839

111-22-3456

0.7652

3

J

Rockefeller

New York

1839

111-22-3456

2

John

Rockefeller

New York

1839

0.7652

6

J

Rockefella

New York

1925

444-11-4321

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.7413

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

6

J

Rockefella

New York

1925

444-11-4321

0.7413

3

J

Rockefeller

New York

1839

111-22-3456

6

J

Rockefella

New York

1925

444-11-4321

0.7033

6

J

Rockefella

New York

1925

444-11-4321

3

J

Rockefeller

New York

1839

111-22-3456

0.7033

5

Jon

Rockefella

New York

1925

444-11-4321

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.7009

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

5

Jon

Rockefella

New York

1925

444-11-4321

0.7009

1

John

Rockefeller

New York

1839

111-22-3456

5

Jon

Rockefella

New York

1925

444-11-4321

0.6932

5

Jon

Rockefella

New York

1925

444-11-4321

1

John

Rockefeller

New York

1839

111-22-3456

0.6932

3

J

Rockefeller

New York

1839

111-22-3456

5

Jon

Rockefella

New York

1925

444-11-4321

0.6831

5

Jon

Rockefella

New York

1925

444-11-4321

3

J

Rockefeller

New York

1839

111-22-3456

0.6831

1

John

Rockefeller

New York

1839

111-22-3456

6

J

Rockefella

New York

1925

444-11-4321

0.6806

6

J

Rockefella

New York

1925

444-11-4321

1

John

Rockefeller

New York

1839

111-22-3456

0.6806

3

J

Rockefeller

New York

1839

111-22-3456

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.666

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

3

J

Rockefeller

New York

1839

111-22-3456

0.666

2

John

Rockefeller

New York

1839

5

Jon

Rockefella

New York

1925

444-11-4321

0.6546

5

Jon

Rockefella

New York

1925

444-11-4321

2

John

Rockefeller

New York

1839

0.6546

2

John

Rockefeller

New York

1839

6

J

Rockefella

New York

1925

444-11-4321

0.642

6

J

Rockefella

New York

1925

444-11-4321

2

John

Rockefeller

New York

1839

0.642

1

John

Rockefeller

New York

1839

111-22-3456

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.6218

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

1

John

Rockefeller

New York

1839

111-22-3456

0.6218

2

John

Rockefeller

New York

1839

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.5832

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

2

John

Rockefeller

New York

1839

0.5832

3

J

Rockefeller

New York

1839

111-22-3456

4

J

Rockefeller

Newport

0.5571

4

J

Rockefeller

Newport

3

J

Rockefeller

New York

1839

111-22-3456

0.5571

4

J

Rockefeller

Newport

6

J

Rockefella

New York

1925

444-11-4321

0.54

6

J

Rockefella

New York

1925

444-11-4321

4

J

Rockefeller

Newport

0.54

1

John

Rockefeller

New York

1839

111-22-3456

4

J

Rockefeller

Newport

0.5343

2

John

Rockefeller

New York

1839

4

J

Rockefeller

Newport

0.5343

4

J

Rockefeller

Newport

1

John

Rockefeller

New York

1839

111-22-3456

0.5343

4

J

Rockefeller

Newport

2

John

Rockefeller

New York

1839

0.5343

4

J

Rockefeller

Newport

5

Jon

Rockefella

New York

1925

444-11-4321

0.5198

5

Jon

Rockefella

New York

1925

444-11-4321

4

J

Rockefeller

Newport

0.5198

4

J

Rockefeller

Newport

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

0.5135

7

Jim

Rockefeller Jr

New Jersey

1955

555-01-6778

4

J

Rockefeller

Newport

0.5135

Iteration #2

A few tweaks to the weighting with two new additions.  First, I’m including the middle name field and also using the ‘EQUAL’ rule for SSN. Setting the threshold to 70% gives a reasonable set of matching records.

select * from IdentityMatch

(

    on ( select *, date_part('year',dob) as dob_year from usb.customers ) as A  partition by any

    on ( select *, date_part('year',dob) as dob_year from usb.customers ) as B  dimension

    idColumn('a.id: b.id')

  -- fuzzy match weights are normalized; here I'm using a 1-10 scale (10=most weight)

    fuzzyMatchColumns('a.first_name: b.first_name,JARO-WINKLER,10'

                     ,'a.last_name: b.last_name,JARO-WINKLER,10'

                     ,'a.middle_name: b.middle_name,LD,5'

                     ,'a.city: b.city,JARO-WINKLER,7'

                     ,'a.dob_year: b.dob_year,EQUAL,10'

                     ,'a.ssn: b.ssn,EQUAL,7'

                     )

    accumulate('a.first_name','b.first_name'

              ,'a.middle_name','b.middle_name'

              ,'a.last_name','b.last_name'

              ,'a.city','b.city'

              ,'a.dob_year','b.dob_year'

              ,'a.ssn','b.ssn')

    threshold('.7')

)

where "a.id" <> "b.id"

order by "a.id", "b.id";

a.id

a.first

name

a.middle

name

a.last

name

a.city

a.DOB

year

a.ssnb.id

b.first

name

b.middle

name

b.last

name

b.city

b.dob

year

b.ssnscore

5

Jon

B

Rockefella

New York

1925

444-11-4321

6

J

B

Rockefella

New York

1925

444-11-4321

0.9728

6

J

B

Rockefella

New York

1925

444-11-4321

5

Jon

B

Rockefella

New York

1925

444-11-4321

0.9728

1

John

D

Rockefeller

New York

1839

111-22-3456

3

J

D

Rockefeller

New York

1839

111-22-3456

0.9694

3

J

D

Rockefeller

New York

1839

111-22-3456

1

John

D

Rockefeller

New York

1839

111-22-3456

0.9694

1

John

D

Rockefeller

New York

1839

111-22-3456

2

John

Rockefeller

New York

1839

0.7551

2

John

Rockefeller

New York

1839

1

John

D

Rockefeller

New York

1839

111-22-3456

0.7551

2

John

Rockefeller

New York

1839

3

J

D

Rockefeller

New York

1839

111-22-3456

0.7245

3

J

D

Rockefeller

New York

1839

111-22-3456

2

John

Rockefeller

New York

1839

0.7245