Fuzzy Logic in SQL and SSIS

SQL Server provides the capability of detecting similarity between strings for the purpose of data entry discrepancy or other tasks in mind.

An on-demand capability which SSIS came up with: applying intelligent detection of string data entry discrepancy. There are two distinguished dataflow tasks for this:
Fuzzy Grouping dataflow task allows detection of similar strings within the same dataset and grouping them under same id and same one string.
Fuzzy Lookup dataflow task allows detection of similar strings between two different datasets, and matching them together as joins. Then one can optionally choose which row and field to keep.
Furthermore, one can set the similarity threshold they find appropriate for their dataset when using these dataflow tasks.
Here is a nice post describing Fuzzy Grouping and Fuzzy Lookup and how to build your own discrepancy detection task.
Here is an illustrated example on how to use Fuzzy Grouping to remove duplicates, along with other more traditional solutions.
And here is another illustrated example on using Fuzzy Grouping when you want to literally group by and aggregate numeric values for “similar” rows.

If you feel that these tasks are a little inflexible, then you can use the SQL statements: SOUNDEX and DIFFERENCE. SQL allow some flexibility SSIS fuzzy tasks don’t allow, but SOUNDEX and DIFFERENCE have their own limitations too, such as no similarity thresholds.
Here is also a nice explanation of how they work.

Leave a Reply

Your email address will not be published. Required fields are marked *