A few words on the simil algorithm


T-SQL allows us to perform a wide range of text searches. Still, a lot remains to be desired, especially with regard to misspellings. If you want to find a set of records even if they have misspellings, or want to prevent misspellings, you need to perform fuzzy string comparisons, and Simil is one algorithm suited for that task. One use for Simil is in data cleanup. In one example, a company had a table with organic chemistry compounds, and their names were sometimes spelled differently. The application presents the user with the current record and similar records. The user can decide which records are duplicates, and choose the best one. One button click later, all child records are pointed to the chosen record, and the bad records are deleted. Then the user moves to the next record. Another typical use for Simil is in preventing bad data from entering the database in the first place. Our company has a Sales application with a Companies table. When a salesperson is creating or importing a new company, the application uses Simil to scan for similar company names. If it finds any records, it'll show a dialog box asking the user if the new company is one of those, or indeed a new company. Other uses include educational software with open-ended questions. One tantalizing option the original authors mention is to combine Simil with a compiler, which could then auto-correct common mistakes.

Let's look at Simil in more detail, and learn how we can take advantage of it. The Simil algorithm looks for the longest common substring, and then looks at the right and left remainders for the longest common substrings, and so on recursively until no more are found. It then returns the similarity as a value between 0 and 1, by dividing the sum of the lengths of the substrings by the lengths of the strings themselves. Simil is case sensitive. If you want to ignore case, convert both strings to uppercase or lowercase before calling Simil. At its core, Simil is a longest common substring or LCS algorithm, and its performance can be expected to be on par with that class of algorithms. Anecdotally, we know that using Simil to test a candidate company name against 20,000 company names takes less than a second. Simil has good performance and is easy to understand. It also has several weaknesses, including the following:

The result value is abstract. Therefore it'll take some trial and error to find a good threshold value above which you'd consider two strings similar enough to take action. For data such as company names, I recommend a starting Simil value of about 0.75. For the organic chemistry names, we found that 0.9 gave us better results.

It's insensitive for very small strings. For example, Adams and Ramos have three out of five characters in common, so the Simil value is 0.6. Most people wouldn't call those names similar.
It treats every letter the same, without regard for vowels or consonants, or for letters that often occur together, or for the location in the string, or any other criteria. Some other algorithms do; for example, in the English language the letters Q and U nearly always occur together and in that order, so much so that they could almost be considered a single letter. In a more comprehensive algorithm, such occurrences could be given special consideration. SOUNDEX is another algorithm that does take into account that some consonants are almost the same.
Simil can't be precalculated, always requires a table scan, and can't take advantage of indexes. This may be a problem for large datasets.

Legal Disclaimer

Our website is not responsible for the information contained by this article. Articleinput.com is a free articles resource thus practically any visitor can submit an article. However if you notice any copyrighted material, please contact us and we will remove the article(s) in discussion right away.

Note: This article was sent to us by: Loyd A. Swanson at 01092010

Related Articles

1. What can be used as a key in Python
Python permits more than just strings to be used in this manner. Any Python object that is immutable and hashable can be used as a key to a dic...

2. Definitions and uses of functions in Python
This article assumes you're familiar with function definitions in at least one other computer language and with the concepts that correspond to...

3. Python functions handle variable numbers of arguments
Python functions can also be defined to handle variable numbers of arguments. You can do this two different ways. One way handles the relativel...

4. Lambda expressions and generator functions in Python
Short functions like those you just saw can also be defined using lambda expressions of the form. Lambda expressions are anonymous little funct...

5. How to create a basic program in Python
Up until now, you've been using the Python interpreter mainly in interactive mode. For production use, you'll want to create Python programs or...

6. How to make Python script execution in UNIX and Mac OS X and in Windows
If you're on UNIX, you can easily make a script directly executable. Note that if Python 3.x isn't your default version of Python, you may need...

7. The difference between scripts on Windows scripts on UNIX
The way you call scripts on Windows differs from the way scripts are called on Linux/ UNIX, and that difference can affect what kind of scripts...

8. Python applications are distributed as source files
You can distribute your scripts as source files (as .py files). You can also ship them as byte code (as .pyc or .pyo files). A byte code file w...