Posted in: Uncategorized

Creating a google-like search engine for your site

Google’s search engine (and others) are experts at correcting typos and misspellings. You would have seen many times Google suggesting, “did you mean orange” when you typed “oragne”.

Google’s algorithms for correcting typos and misspellings are not publicly known but we do know that Google has billions of searches performed every day and it could easily learn corrections in any language.

If you typed “oragne” and no relevant results came up because of your misspelling, so instead of clicking on anything you retyped your query correctly as “orange” and then clicked on a link, Google would know that you corrected a misspelling. Multiply that one example tens of billions of times across the entire world, in all languages, and Google could pretty quickly learn the world’s most common typos and misspellings.

But how do you implement such awesome correcting power on your own website?

In PHP there are two functions you can use.

metaphone – http://php.net/manual/en/function.metaphone.php

levenstein – http://php.net/manual/en/function.levenshtein.php

The metaphone function breaks down works into phonetic blocks. It creates the same “key” for similar sounding words. If the resulting “keys” are the same, then the word is considered the same. So “dore” and “door” are considered the same words by metaphone.

The levenstein function works by calculating the “distance” between two words. “Distance” is defined as the minimal number of characters you have to replace, insert or delete to transform the first word into the second word. So “dppr” would take 2 steps (two letters need to be changed) to transform into “door” so it’s a very close match according to this function.

Each function has their own uses. Usually, misspellings are captured using metaphone as it focuses on the way a word sounds. Usually a misspelling actually sounds the same when spoken as misspellings occur because people are trying to sound out the word. For example, someone might type “jemstone” which sounds exactly like “gemstone” and is probably what they meant if there are no results in your database for “jemstone”.

But levenshtein focuses on typos. Clearly “dppr” doesn’t sounds like “door” so you really need levenshtein to pick up typos.

but I like to use both functions together:

  1. First, search the database for an exact match. If nothing is found…
  2. Use levenshtein to check for typos. If nothing “close” (explained later) is found…
  3. Use metaphone to check for mispellings.

When using levenshtein, “close” is subjective. levenshtein(‘recurs’, ‘resource’) = 4, meaning ‘resource’ is 4 “steps” away from ‘recurs’. As a percentage of the length of the word, 4 steps on an 8 letter word means 50% of the word needs to change. Is a 50% change not “close enough” for you? That’s up to you. But a percentage should be determined, rather than a raw figure. 4 steps to get from gripe to green might only be 4 steps, but on a 5 letter word, that’s an 80% change.

Because using these functions means you need to loop through every word in your entire database, create the metaphone and levenshtein result for every word and compare that to the metaphone and levenshtein result of the search query, you may want to store in your database the levenshtein and metaphone results in separate fields.

But even then, depending on the size of your database, looping through every word may slow things down too much.

if ($query == $word) {
     return TRUE;
} else {
     $leven_steps = levenshtein($query, $word);
     $query_length = strlen($query);
     $word_length = strlen($word);
     $max_length = max($query_length, $word_length);
     $percent_diff = $leven_steps / $max_length;

     // set your own threshold. Eg; 0.4 (40%)
     $threshold = 0.4;

     if ($percent_diff <= $threshold) {
          return $word; // found a close matching levenshtein word
     } else {
          // try metaphone
          $query_meta = metaphone($query);
          $word_meta = metaphone($word);

          if ($query_meta == $word_meta) {
               return TRUE; // found metaphone match
          } else {
               // no exact, levenshtein or metaphone match found
          };
     };
};

Have your say

Recurring Date Settings

day week month year on
date of month day of week