NCBI C++ ToolKit
Public Types | Static Public Member Functions | List of all members
CDictionaryUtil Class Reference

Search Toolkit Book for CDictionaryUtil

Standard dictionary utility functions. More...

#include <util/dictionary_util.hpp>

Public Types

enum  { eMaxSoundex = 4 , eMaxMetaphone = 4 }
 
enum  EDistanceMethod { eEditDistance_Exact , eEditDistance_Similar }
 Return the Levenshtein edit distance between two words. More...
 

Static Public Member Functions

static void GetSoundex (const string &in, string *out, size_t max_chars=eMaxSoundex, char pad_char='0')
 Compute the Soundex key for a given word The Soundex key is defined as: More...
 
static void GetMetaphone (const string &in, string *out, size_t max_chars=eMaxMetaphone)
 Compute the Metaphone key for a given word Metaphone is a more advanced algorithm than Soundex; instead of matching simple letters, Metaphone matches diphthongs. More...
 
static void Stem (const string &in_str, string *out_str)
 Compute the Porter stem for a given word. More...
 
static size_t GetEditDistance (const string &str1, const string &str2, EDistanceMethod method=eEditDistance_Exact)
 
static int Score (const string &word1, const string &word2, size_t max_metaphone=eMaxMetaphone)
 Compute a nearness score for two different words or phrases. More...
 
static int Score (const string &word1, const string &meta1, const string &word2, const string &meta2, EDistanceMethod method=eEditDistance_Similar)
 Compute a nearness score based on metaphone as well as raw distance. More...
 

Detailed Description

Standard dictionary utility functions.

Definition at line 44 of file dictionary_util.hpp.

Member Enumeration Documentation

◆ anonymous enum

anonymous enum
Enumerator
eMaxSoundex 
eMaxMetaphone 

Definition at line 47 of file dictionary_util.hpp.

◆ EDistanceMethod

Return the Levenshtein edit distance between two words.

Two possible methods of computation are supported - an exact method with quadratic complexity and a method suitable for similar words with a near-linear complexity. The similar algorithm is suitable for almost all words we would encounter; it will render inaccuracies if the number of consecutive differences is greater than three.

Enumerator
eEditDistance_Exact 

This method performs an exhausive search, and has an algorithmic complexity of O(n x m), where n = length of str1 and m = length of str2.

eEditDistance_Similar 

This method performs a simpler search, looking for the distance between similar words.

Words with more than two consecutively different characters will be scored incorrectly.

Definition at line 110 of file dictionary_util.hpp.

Member Function Documentation

◆ GetEditDistance()

size_t CDictionaryUtil::GetEditDistance ( const string str1,
const string str2,
EDistanceMethod  method = eEditDistance_Exact 
)
static

◆ GetMetaphone()

void CDictionaryUtil::GetMetaphone ( const string in,
string out,
size_t  max_chars = eMaxMetaphone 
)
static

Compute the Metaphone key for a given word Metaphone is a more advanced algorithm than Soundex; instead of matching simple letters, Metaphone matches diphthongs.

The rules are complex, and try to match how languages are pronounced. The implementation here borrows some options from Double Metaphone; the modifications from the traditional Metaphone algorithm include:

  • all leading vowels are rendered as 'a' (from Double Metaphone)
  • rules regarding substitution of dge/dgi/dgy -> j were loosened a bit to permit such substitutions at the end of the word
  • 'y' is treated as a vowel if surrounded by consonants

Definition at line 47 of file dictionary_util.cpp.

References _ASSERT, CTempString::find(), in(), ITERATE, out(), and tolower().

Referenced by CSimpleDictionary::AddWord(), CStringMatching::CStringMatching(), CStringMatching::MatchString(), CSimpleDictionary::Read(), Score(), and CSimpleDictionary::SuggestAlternates().

◆ GetSoundex()

void CDictionaryUtil::GetSoundex ( const string in,
string out,
size_t  max_chars = eMaxSoundex,
char  pad_char = '0' 
)
static

Compute the Soundex key for a given word The Soundex key is defined as:

  • the first leter of the word, capitalized
  • convert the remaining letters based on simple substitutions and groupings of similar letters
  • remove duplicates
  • pad with '0' to the maximum number of characters

The final step is non-standard; the usual pad is ' '

Definition at line 332 of file dictionary_util.cpp.

References in(), int, ITERATE, out(), string, and toupper().

◆ Score() [1/2]

int CDictionaryUtil::Score ( const string word1,
const string meta1,
const string word2,
const string meta2,
EDistanceMethod  method = eEditDistance_Similar 
)
static

Compute a nearness score based on metaphone as well as raw distance.

Definition at line 549 of file dictionary_util.cpp.

References GetEditDistance().

◆ Score() [2/2]

int CDictionaryUtil::Score ( const string word1,
const string word2,
size_t  max_metaphone = eMaxMetaphone 
)
static

Compute a nearness score for two different words or phrases.

Definition at line 537 of file dictionary_util.cpp.

References GetMetaphone().

Referenced by CSimpleDictionary::SuggestAlternates().

◆ Stem()

void CDictionaryUtil::Stem ( const string in_str,
string out_str 
)
static

Compute the Porter stem for a given word.

Porter's stemming algorithm is one of many automated stemming algorithms; unlike most, Porter's stemming algorithm is a widely accepted standard algorithm for generating word stems.

A description of the algorithm is available at

http://www.tartarus.org/~martin/PorterStemmer/def.txt

The essence of the algorithm is to repeatedly strip likely word suffixes such as -ed, -es, -s, -ess, -ness, -ability, -ly, and so forth, leaving a residue of a word that can be compared with other stem sources. The goal is to permit comparison of socuh words as:

compare comparable comparability comparably

since they all contain approximately the same meaning.

This algorithm assumes that word case has already been adjusted to lower case.

Definition at line 804 of file dictionary_util.cpp.

References eConsonant, eVowel, NULL, s_EndsWith(), s_FindFirstVowel(), s_GetCharType(), s_MeasureWord(), s_ReplaceEnding(), s_TruncateEnding(), and str().

Referenced by CTextUtil::GetStemFrequencies(), and CTextUtil::GetWordFrequencies().


The documentation for this class was generated from the following files:
Modified on Wed Jun 12 11:17:51 2024 by modify_doxy.py rev. 669887