How Smart Can a Search Slot Be? (Part 1: Covering spellings)
Search is ubiquitous and users are spoiled by the Google search experience. They therefore expect other searches to be just as good. But this is easier said than done. Right from the first contact – the behavior of the search slot – a user senses whether the search engine understands them or not. In our short series of articles, we present how an intelligent search slot contributes to user satisfaction – and what needs to happen in the background at a technical level to achieve this. In this first part, we describe the challenges at the level of different spellings. Part 2 deals with ambiguity, synonyms and grammatical forms. In the final article, we show how a “thinking” search slot can be designed.
Users are often unsure whether they are typing everything correctly when they manually enter something into a program, e.g. a search term in Google, be it a question of standard spelling or of an individual spelling variant. What special features of the German language does software such as CONTEXTSUITE have to take into account in order to recognize and even correct an error? What “artificial intelligence” is needed to find out what is actually meant, even with ambiguous terms and complex types of error?
Do you know this situation? You want to type in a search term, but you’re not sure of the correct spelling. Is it email, e-Mail or E-Mail?
It would be practical if you could save yourself the trouble of looking it up in the dictionary and have an intelligent, benevolent check program recognize the imperfect search term as such and convert it into the correct spelling or a more suitable term with the same meaning.

How does a conventional program that receives and checks a user’s input work? Normally, the program compares the input for complete congruence (character string or string matching) with the entries in its own standard dictionary and, in the event of a match, then searches the target documents, lists and websites. However, what a human still considers “reasonably identical” is not a match for the computer!
- Written together or separated by a space or hyphen:
correct: E-Mail (see Duden entry)
incorrect: Email, EMail, email, eMail, E Mail, E mail, e Mail, e mail, E-mail, e-mail, e-Mail
- Upper and lower case:
correct: Start-up
incorrect: Start-Up, start-up, start-Up, Startup, startup, Start up …
der gefangene floh (the prisoner fled? the captured flea?)
- Umlauts:
Menü or Menu
Düsseldorf or Duesseldorf (e.g. because an English keyboard has no äöü)
- Words that almost everyone unknowingly misspells:
Amalgam, Eiffel Tower, Federation, heart attack, quartz watch, backbone, kinship …
- Proper names:
Libya; Tchibo, Chibo; Telekom, Telecom
- Old and new spelling:
dass or daß, aufwändig or aufwendig, Bestellliste or Bestelliste, Börsentipp or Börsentip …
- ph or f:
Foto or Photo, Christof or Christoph, Physik or Fysik
- Spellings when rewriting numbers and special characters:
Hartz IV, Hartz 4, Hartz Vier, Hartz vier, Hartz-IV, Hartz-4 …
Erste Hilfe, erste Hilfe, 1. Hilfe, 1ste Hilfe …
Einmaleins, EinMalEins, Einmal-Eins, 1×1, 1X1, 1 x 1, Ein x eins …
So there are many opportunities in the German language to spell a word incorrectly or not quite correctly. How can the strict program that checks the input be outwitted here?
Usually by creating spelling synonyms:
E-Mail –> Email, EMail, email, eMail, E Mail, E mail, e Mail, e mail, E-mail, e-mail, e-Mail
The check program now first looks in its synonym list. If it finds the user input in the alternatives, it can retrieve the corresponding correct Duden entry and then use it to search – much more successfully – in the documents.
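The lookup described here can be sketched in a few lines of Python. This is only a minimal illustration: the names `SPELLING_SYNONYMS`, `VARIANT_TO_CANONICAL` and `canonical_spelling` are hypothetical, and the two entries stand in for what would be a very large, curated list.

```python
# Hypothetical synonym list: correct (Duden) spelling -> known spelling variants.
SPELLING_SYNONYMS = {
    "E-Mail": ["Email", "EMail", "email", "eMail", "E Mail", "e mail", "e-mail", "e-Mail"],
    "Start-up": ["Start-Up", "start-up", "Startup", "startup", "Start up"],
}

# Invert the list once so that every lookup is a single dictionary access.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in SPELLING_SYNONYMS.items()
    for variant in variants
}


def canonical_spelling(user_input):
    """Return the dictionary spelling for the input, or None if it is unknown."""
    term = user_input.strip()
    if term in SPELLING_SYNONYMS:          # already the correct spelling
        return term
    return VARIANT_TO_CANONICAL.get(term)  # known variant, or None


print(canonical_spelling("eMail"))    # E-Mail
print(canonical_spelling("Startup"))  # Start-up
print(canonical_spelling("Emial"))    # None: neither dictionary entry nor known variant
```

The last call shows the limit of the approach: a typing error that nobody has entered into the list beforehand is simply not found.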
However, creating and maintaining lists of spelling synonyms is extremely time-consuming. According to Duden, “the vocabulary of contemporary German is estimated to be between 300,000 and 350,000 words.” More words are constantly being added, be they new word creations, foreign words, proper names or puns.
One way of having the synonym lists compiled by the computer rather than by hand is the concept of word embedding; more on this in part 2 of this series.
If a user input is neither in the dictionary nor in the spelling synonyms, there are other, albeit time-consuming, methods for analysis and correction:
- Phonetic comparison:
Words or character strings with very similar pronunciation are recognized as matching; very good for proper names
Maier, Mayer, Mayr
Margerite, Margarethe
- Approximate comparison:
Words or character strings are still treated as the same if they differ in at most n places; bad for short words
Deoxyribonucleic acid, deoxirybonucleic acid
Berg, Burg
- Transposed letters and typing errors on the keyboard:
Swapped keystrokes
Hitting letters that are close together on the keyboard
Regenwurm, Ergenwjrm
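Two of these methods can be illustrated with a short Python sketch. It rests on assumptions: for the phonetic comparison it uses the classic American Soundex code as a stand-in (systems for German text often use a German-specific scheme such as the Kölner Phonetik instead), and for the approximate comparison it uses the optimal string alignment distance, which counts insertions, deletions, substitutions and swaps of neighboring characters. The function names are illustrative, and keyboard proximity is not modeled here.

```python
# Soundex letter groups: letters in one group are treated as sounding alike.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word):
    """Classic four-character Soundex key (first letter plus three digits)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal codes
        code = _SOUNDEX_CODES.get(ch, "")  # vowels and umlauts get no code
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def osa_distance(a, b):
    """Optimal string alignment distance, compared case-insensitively."""
    a, b = a.lower(), b.lower()
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swapped neighbors
    return d[-1][-1]


print(soundex("Maier"), soundex("Mayer"), soundex("Mayr"))  # M600 M600 M600
print(soundex("Margerite"), soundex("Margarethe"))          # M626 M626
print(osa_distance("Regenwurm", "Ergenwjrm"))               # 2
print(osa_distance("Berg", "Burg"))                         # 1
```

The last line shows why the approximate comparison is problematic for short words: Berg (mountain) and Burg (castle) are only one edit apart, yet they are different words, whereas the garbled Regenwurm (earthworm) is two edits away from its intended form.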
There are other categories of user input where a dictionary does not necessarily help:
- Common abbreviations:
Mwst, MwSt, Mw.-St., Mehrwertsteuer
DHH, Doppelhaushälfte (semi-detached house)
BAB, Bundesautobahn (federal highway)
- Spellings for currencies, units of measurement and dates:
one hundred euros, 100 EUR, 100.00 € …
half cubic meter, 1/2 cubic meter, 0.5 cbm …
24/12/2019, 24-12-19, 2019/12/24, Dec. 24, 2019, Christmas Eve ’19 …
In the first group of cases, the abbreviations, the shorter an input word is, the more ambiguous it becomes. Does BAB really mean the federal highway (Bundesautobahn), or does it mean company accounting sheet (Betriebsabrechnungsbogen), vocational training grant (Berufsausbildungsbeihilfe) or Bachelor of Arts in Business? And the question arises as to whether an abbreviated word in a hit document can be regarded as irrelevant.
In the second group of cases, with its numerous input variants, lists of spelling synonyms would be bursting at the seams. Just imagine all the ways of entering a single day’s date, multiplied by 365 days and again by every calendar year …
For this reason, normalization is used for inputs that represent a currency, a date or a unit of measurement: the processing program has its own standard format adapted to the application, e.g. ‘dd.mm.yyyy’ for a date. The input is converted into this format, and the program then continues working with the converted value. Of course, this requires recognizing that an input is meant to represent a date in the first place.
Recognition of currencies, dates and units of measurement, including a plausibility check (92 Jan. 2019, +92° latitude), is handled by regex patterns and/or rule-based algorithms.
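As a rough illustration of such normalization, the following Python sketch recognizes a handful of date spellings with regular expressions, checks their plausibility and converts them to ‘dd.mm.yyyy’. The pattern list, the name `normalize_date` and the rule that two-digit years belong to the 2000s are assumptions made for this example; a production system would need many more patterns and language-specific month names.

```python
import re
from datetime import date

# English month abbreviations only; an assumption for this sketch.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

PATTERNS = [
    # 24/12/2019, 24-12-19, 24.12.2019 (day first)
    re.compile(r"^(?P<d>\d{1,2})[./-](?P<m>\d{1,2})[./-](?P<y>\d{4}|\d{2})$"),
    # 2019/12/24 (year first)
    re.compile(r"^(?P<y>\d{4})[./-](?P<m>\d{1,2})[./-](?P<d>\d{1,2})$"),
    # Dec. 24, 2019
    re.compile(r"^(?P<mon>[A-Za-z]{3})[a-z]*\.?\s+(?P<d>\d{1,2}),?\s+(?P<y>\d{4})$"),
    # 24 Dec. 2019
    re.compile(r"^(?P<d>\d{1,2})\.?\s+(?P<mon>[A-Za-z]{3})[a-z]*\.?\s+(?P<y>\d{4})$"),
]


def normalize_date(text):
    """Return the input as 'dd.mm.yyyy', or None if it is not a plausible date."""
    text = text.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if not match:
            continue
        parts = match.groupdict()
        if "mon" in parts:
            month = MONTHS.get(parts["mon"].lower())
            if month is None:
                return None  # looks like a date, but the month name is unknown
        else:
            month = int(parts["m"])
        year = int(parts["y"])
        if year < 100:
            year += 2000  # assumption: two-digit years mean 20xx
        try:
            parsed = date(year, month, int(parts["d"]))  # plausibility check
        except ValueError:
            return None  # e.g. day 92 or month 13
        return f"{parsed.day:02d}.{parsed.month:02d}.{parsed.year:04d}"
    return None


for raw in ["24/12/2019", "24-12-19", "2019/12/24", "Dec. 24, 2019",
            "92 Jan. 2019", "Christmas Eve ’19"]:
    print(raw, "->", normalize_date(raw))
```

The first four inputs all come out as 24.12.2019. The input 92 Jan. 2019 matches a pattern but fails the plausibility check, and Christmas Eve ’19 matches no pattern at all, so both yield None; the latter would need a rule of its own.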
However, this type of user input is usually found in special applications such as banking, hotel reservations or in the scientific field. There, the context of the input field also defines that a currency, a date or a unit of measurement is expected here.
Difficult categories of user input are synonyms of meaning (letter carrier – postman, lift – elevator), ambiguous terms (Wirtschaft: economy – inn, Golf: car – sport – gulf), compound nouns (butcher’s shop) and deviations from the basic grammatical form (red giant – red giants). These will be the subject of part 2 of this series of articles.
Computer scientist and computer linguist
Ulrike is a computer scientist and computer linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.