A Data Scientist is often tasked to find insights on huge swath of data. The data sources could be web logs, sensor time series, gene data, email text, scanned PDF docs, etc.,. One could be fishing for specific patterns or even finding out what patterns emerge, to model the underlying structure of the data. It would be great if the data we are analyzing is more structured or 'exact' to start with, so we can apply the analytic workflow over that. Unfortunately that's often the bottleneck !! I like numbers even if they are decimals as I can add and subtract or compare easily. If the data is text, this is where it gets funner as it gets pretty messy quickly .. (not to mention speech which adds a whole new dimension). Here are some fun problems we want to solve around text data. In this blog post, I will discuss some ideas around Document and String or Sequence Similarity (italicized below):
• Find documents that are similar to each other in content
• Classify documents to one or more categories
• Automatically "Topicize" documents into different buckets
• Find the 'best' match for a dictionary word or phrase in a given bucket of words (if it's human generated expect misspellings, abbreviations, prefixes and suffixes)
• Find sequence patterns on genome data with 'approximately close' matches
How does one find 'acceptable' structure or even analyze text data to solve problems like above ? Enter the world of 'fuzzy match'. In a not so exact world, there is enormous business value in getting close to even 80% accuracy of pattern finding. Most business analysts can fill the blanks and get insights if the data scientist or analyst can surface with only 60% ~ 80% accuracy as most times it's directional and usable. That's why fuzzy matches are so popular in text mining.
Example: If we are comparing two strings "Acme Finance" and "Acme Financial Inc" for similarity, there is a lot value in saying these strings match with a confidence of 0.85 than just saying they appear 'similar' or 'not equal'. A business analyst can easily decide a level of cutoff for a score that produces good results for aggregation or clustering purposes.
There are several terms that Data Scientists use to describe Similarity concepts. Aster Analytical Foundation guide has a SQL/MR function called "IdentityMatch" that does a bunch of algorithms. "Fuzzy Matching" is another way of describing the same problem. What I've seen is that the term "Similarity" is used around Documents and "Fuzzy Matching" when it comes matching words & phrases.
We want to measure the similarity between two text strings or documents TEXT1 and TEXT2. See picture below on the design pattern (Blog posts only talk about the techniques available in Aster Foundation or Field Supported). Algorithms such as Pierson Coefficient etc ., not covered here.
So the basic steps are:
The Term extraction requires the strings first need to be broken up into chunks. You'd then need to decide the weighting methodology and then feed the chunks of both the strings to the algorithm. The algorithm spits out a 'measure'. There are a couple of ways to break the strings up. You can either break a phrase, sentence etc., into sequential words or tokens also called as NGRAM. Or you can break them apart by a few sequential letters at a time - aka QGRAM. You can also break them apart by NGRAM that are not sequential but in proximity with each other - known as SKIP-GRAM.
Here's a phrase that we want to fuzzy match with a table full of potential matches - "A quick brown fox"
NGRAMs for "A quick brown fox" for Lengths 1,2 and 3:
|A||A quick||A quick brown|
|quick||quick brown||quick brown fox|
QGRAMs of Lengths 1,2 and 3 is as follows:
NOTE: QGRAM functions are currently field supported and not part of the Aster Foundation yet. You can write QGRAM as a row function with PERL or JAVA easily.
SKIP-GRAMs of Lengths 2 with Window=3 and 4 is as follows:
NOTE: SKIP-GRAM functions are currently field supported and not part of the Aster Foundation yet. You can write a SKIP-GRAM as a row function in PERL or JAVA easily.
|Length=2 and Window=3||Length=2 and Window=4|
|A quick||A quick|
|A brown||A brown|
|quick brown||A fox|
|quick fox||quick brown|
|brown fox||quick fox|
As you can see above, there are a million combinations that you can derive your workflow from. One needs to come up with some cross validation technique to find the best parameters starting with some initial values. NGRAM and QGRAM lengths up to 3 seem to be a good place to start.
Once we decide on the Term Extraction, the next thing to do is to determine what kind of Term Weighting we need. Not all NGRAMS, QGRAMS and SKIP-GRAMS are equal. We sometime want to provide higher weight to the grams that are occurring less frequently (prominence/differentiator). One can do this by running the TF/IDF function in Aster. This is great for comparing large documents as opposed to just strings.
The Exponential Decay weighting seems to work best when you do QGRAMS for strings. The first QGRAM gets some score say 0.9, the second QGRAM gets 0.9^2 followed by 0.9^3, 0.9^4 etc., This works great for names that has noisy suffixes like LLC, Inc, Services etc., which gets weighted lower.
Some algorithms don't care about the term weighting. Distance Algorithms like Levenshtein Distance only look at the GRAMS as a whole and each letter in the GRAM.
There are several algorithms available in Aster that does Fuzzy Matching - both in the Aster Foundation as well as the ones that are Field Supported. If used correctly, it can result in greater than 95% accuracy as we've seen in many use cases. Here are a list of the algorithm combinations that targets Similarity problems:
Given two strings String1 and String2, Edit based algorithms look at how many inserts, deletes and updates are required to go from String1 to String2. They are more suited for words or short phrases. Token based algorithms are often used to fuzzy match sentences, paragraphs or documents. They break up the content to words or parts of strings, apply some weighting (based on how frequent they appear within a document or how close they are to the beginning etc.,) and then use a formula like Cosine Similarity to score the match. Phonetic algorithms use more NLP aware techniques to find similar "sounding" words by breaking the words to phonemes and then comparing those.
The obvious answer is 'one size does not fit all'. Algorithm generates measures for different scenarios. In general algorithms are not decided by the user, but largely by the data !!! Each data set automatically is aligned best with some algorithm - or conversely each algorithm is built around an assumption of how the data is represented. So when the data assumptions fail for one algorithm, it's time to try to another one! Pick the winner by testing different scenarios and stick to it.
An algorithm is said to fail badly if it generates a lot of false positives & negatives (very high misses) There are also design patterns that are 'generally' popular because of the high success rate with less surprises out of the gate. One of them is a classic combination of NGRAM, TF/IDF with Cosine Similarity. See John Thuma's you-tube video on using this design pattern. The CosineSimilarity measure is calculated by Aster's VECTORDISTANCE() SQL/MR function:
All the other algorithms work really well with certain types of data - person names, addresses, short strings, long strings, common typos, merchant names in transactions etc., A more interesting analysis provided here on the subtle differences between different algorithms.
This blog post was intended more for providing a high level overview. As a follow on blog post, I will post the code sometime that creates a few QGRAM type fuzzy match workflows to help disambiguate text in your environment. Tks for your patience and hope you enjoyed the blog post!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.