Dear Teradata users,
I am fairly new to TD and struggling with a question. Hope you can help!
In my dataset there are two columns with company name and address. Unfortunately, a single company name is sometimes spelled slightly different on different occasions. Before further processing I would like to clean this up. My dataset contains about 10 million records and I would like something reasonably fast.
I came up with the following procedure (code below): first I select the compressed name and address of the first record as macro variables. Then I calculate a similarity measure against all other records (this is suprisingly fast with the buildin TD function ‘editdistance’). For the records with an acceptable similarity I keep the name. Then I go on to the next record in line. I repeat this step in a do while loop until every record is processed. This works but requires many many loopups (many company names occur only once or a few times) and takes a long time (>15 hours). I was hoping for something faster. Any help would be greatly appreciated.
Thanks in advance and best regards,
proc sql ; create table work.temp(dbcreate_table_opts = "no primary index") as select trim(Name) as Name, compress(Name,,'kad') as cName, compress(Address,,'kad') as cAddress, ("") format=$70. length=70 as Name_ from work.source where length(compress(Name,,'kad')) >= 4 and length(compress(Address,,'kad')) >= 6; quit; proc sql noprint; select Name, cName, cAddress into :name, :cname, :caddress from work.temp where Name_ is missing; quit; %macro clean_names; %do %while(%length(&cname)>3); proc sql; connect to teradata (server=server authdomain=TeradataAuth tpt=yes mode=teradata connection=global); execute( update work.temp set Name_ = %bquote('&name') where Name_ is null and editdistance(trim(cName), %bquote('&cname'), 1,1,2,2) <= 4 and editdistance(trim(cAddress), %bquote('&caddress'), 1,1,2,2) <= 4 ; ) by teradata; disconnect from teradata; quit; %let cname=fin; proc sql noprint; select Name, cName, cAddress into :name, :cname, :caddress from work.temp where Name_ is missing; quit; %end; %mend clean_names; %clean_names;
If you want to keep it fast, start by calculating the SOUNDEX of the names, which will be very fast, then compute the EDITDISTANCE between the companies which share the SOUNDEX. It will be less job to do.
Maybe with Aster function STRINGSIMILARITY we could do otherwise but I can't test it for now.