Open source software security

Using Soundex with MySQL

30 November -0001
by: Justin
on 21 May, 2002

Soundex is a really cool, and very often overlooked feature of a ton of different programming languages. I run across it most often in database work, and its a very under appreciated tool. I'll restrict myself in this discussion to soundex's uses in MySQL. Implementation varies across platforms, so the information in this guide may only be useful on identical systems. Soundex documentation in MySQL is sparse to say the least. Quoting from MySQL.com:

SOUNDEX(str)

Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is 4 characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphanumeric characters are ignored in the given string. All international alpha characters outside the A-Z range are treated as vowels:

mysql> SELECT SOUNDEX('Hello');
     -> 'H400'
mysql> SELECT SOUNDEX('Quadratically');
     -> 'Q36324'

Well, that doesn't give us much to work from. What soundex does is look at the phonetics of a particular string and convert it to a standardized string (using this process letters and words that sound similar will return the same value, for instance, the letter 'd' will have the same value as the letter 't'). This doesn't sound very helpful at all, but it actually is. Lets take the soundex of a few words in MySQL (I'm copying and pasting this out of a command prompt where I've already logged into MySQL ('mysql -u username -p)):

mysql> select 'text', soundex('text');
+------+-----------------+
| text | soundex('text') |
+------+-----------------+
| text | T230            |
+------+-----------------+
1 row in set (0.00 sec)

What you see here is soundex converting the word 'text' into the soundex value 'T230'. This is only really helpful if you compare it to some other strings:

mysql> select 'text', 'test', soundex('text'), soundex('test');
+------+------+-----------------+-----------------+
| text | test | soundex('text') | soundex('test') |
+------+------+-----------------+-----------------+
| text | test | T230            | T230            |
+------+------+-----------------+-----------------+
1 row in set (0.00 sec)

This query reveals that the soundex values of the word 'text' (T230) and the word 'test' (T230) are the exact same. Lets try another word for comparison:

mysql> select 'tracks', soundex('tracks');
+--------+-------------------+
| tracks | soundex('tracks') |
+--------+-------------------+
| tracks | T620              |
+--------+-------------------+
1 row in set (0.00 sec)

You can see the soundex of the new word 'tracks' is similar, but different from the 'T230' we have been getting up until now. The key to soundex is to compare the various soundex values for separate words. Many databases have fairly complex processes to do this and some give greater accuracy than others. MySQL doesn't have a whole lot to choose from, but there is one option to compare string values. The easiest way to do this in MySQL is with the strcmp function. Documentation for this function is as follows:

STRCMP(expr1,expr2)

STRCMP() returns 0 if the strings are the same, -1 if the first argument is smaller than the second according to the current sort order, and 1 otherwise:

mysql> SELECT STRCMP('text', 'text2');
     -> -1
mysql> SELECT STRCMP('text2', 'text');
     -> 1
mysql> SELECT STRCMP('text', 'text');
     -> 0

A practical example of comparing the soundex values of the words 'text' and 'test' is as follows:

mysql> select strcmp(soundex('text'), soundex('test'));
+------------------------------------------+
| strcmp(soundex('text'), soundex('test')) |
+------------------------------------------+
|                                        0 |
+------------------------------------------+
1 row in set (0.00 sec)

You see that the soundex values are the same so we get zero, you can modify a query to report values if soundex values are the same using a query like this one:

mysql> select * from search_table where strcmp(soundex(search_title), soundex('searchterm')) = 0;

This would return all the results from the column search_title in the table search_table where the column entry had the same soundex value as the word 'searchterm' (ideally you would use a variable instead of a specific word so you could reuse the code with various user input). I use this type of code in the search functionality on Madirish.net's search feature. Here's a snipit:

$query = mysql_query("select s.*, a.article_id, a.article_section, a.article_title from tech_article a, tech_section s where (a.article_title like '%$searchterm%' or a.article_body like '%$searchterm%' or strcmp(soundex(article_title), soundex('$searchterm')) = 0) and a.article_section = s.section_id group by s.section_name, a.article_id, a.article_title order by s.section_name, a.article_title");

Which works quite well since it will allow people to search for a term like 'Tutorial 9' and actually find the article titled 'Madirish Tutorial 09'. The point is to use the soundex to find out if there are other entries in a database that are similar, but not exactly the same as user input. While this will work in a pinch, you may want to explore alternative ways to compare strings so you can get more than a positive/negative result. Using functionality that will give you graded result depending on the degree of similarity is ideal, since this allows you to set a threshold for similarity and fine tune your results.