How Smart Can a Search Slot Be? (Part 2: Resolving Ambiguities)
A common problem when evaluating search fields is that users rarely enter precise terms; they type one or a few words that only roughly express what they mean. Part 1 of this article series dealt with the handling of different spellings. Part 2 looks at how software such as CONTEXTSUITE also covers more complex scenarios: How can ambiguous terms be resolved reliably? When do synonyms improve a search result? Are deviations from the basic grammatical form permitted?
There is a whole range of linguistic peculiarities that can affect user input.
How Can the Correct Meaning of a Word Be Identified in Context?
- Ambiguity of words:
A word has several meanings:
Bow: weapon – gesture
Bark: sound – tree covering
Seal: mammal – closure – emblem
Light: illumination – weight – gentle
If the input consists of only one or a few words, it is very difficult to resolve the ambiguity. Without the help of an indicative context, the correct reading and therefore the appropriate domain can only be estimated. The following considerations can help:
What is the thematic context of the application? Which usage of the ambiguous term is more common? Is it a currently trending term? Are there any clues from previous user input? The program can also ask the user:
“Did you mean: golf → car, golf → sport, golf → bay?”
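The considerations above can be sketched in a few lines. This is only an illustration, not how CONTEXTSUITE works: the sense inventory `SENSES`, the usage priors, and the functions `rank_senses` and `did_you_mean` are all assumptions made up for this example.

```python
# Hypothetical sense inventory: each reading of an ambiguous term has a
# domain label and an invented prior for how common that reading is.
SENSES = {
    "golf": [("car", 0.3), ("sport", 0.6), ("bay", 0.1)],
    "bark": [("sound", 0.5), ("tree covering", 0.5)],
}


def rank_senses(term, app_domain=None, boost=0.5):
    """Return the readings of `term`, most plausible first.

    A reading that matches the application's thematic context gets an
    additive boost; otherwise only the usage prior decides.
    """
    scored = []
    for sense, prior in SENSES.get(term.lower(), []):
        score = prior + (boost if sense == app_domain else 0.0)
        scored.append((score, sense))
    # Descending by score; equal scores keep inventory order (stable sort).
    scored.sort(key=lambda pair: -pair[0])
    return [sense for _, sense in scored]


def did_you_mean(term, app_domain=None):
    """Build the clarification question, or None if nothing is ambiguous."""
    senses = rank_senses(term, app_domain)
    if len(senses) < 2:
        return None
    options = ", ".join(f"{term} → {s}" for s in senses)
    return f"Did you mean: {options}?"
```

In a car portal one might call `rank_senses("golf", app_domain="car")` and take the first reading directly, and fall back to `did_you_mean("golf")` only when no thematic context is available.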
- Synonyms of meaning:
There are several words for one concept (regional differences, linguistic history):
Rolls – bread rolls
Letter carrier – mailman
Match – matchstick
often – frequently
vertical – upright?
Old-age provision – old-age pension – old-age insurance?
Bicycle – bike???
24 – 2 x 12 – “8×4”
Here, the user input could be supplemented with all synonyms of the same meaning in order to retrieve more hits.
But are these really fully equivalent terms that can be used interchangeably?
Examples:
Orange could also stand for the color orange and thus return irrelevant hits. Confusion – chaos – hullabaloo look like a good match at first glance, but if there is traffic chaos, is there also a traffic mess or a traffic hullabaloo? Wheel and bicycle are in a generic-term and sub-term relationship: wherever it says bicycle, you could also use wheel, but not vice versa.
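The difference between fully interchangeable synonyms and one-way substitutions can be made explicit in the data structure used for query expansion. The following is a minimal sketch under assumptions: the tables `SYMMETRIC` and `ONE_WAY` and the function `expand_query` are hypothetical, and the entries are taken from the examples in this article, not from a real lexicon.

```python
# Hand-checked, fully interchangeable groups (hypothetical sample data).
SYMMETRIC = [
    {"letter carrier", "mailman"},
    {"match", "matchstick"},
    {"often", "frequently"},
]

# One-way substitutions: the key may be replaced by the values, but not
# the other way round (a bicycle can be called a wheel, not vice versa).
ONE_WAY = {"bicycle": {"wheel"}}


def expand_query(term):
    """Return the term plus every substitute that is safe in this direction."""
    term = term.lower()
    expanded = {term}
    for group in SYMMETRIC:
        if term in group:
            expanded |= group
    expanded |= ONE_WAY.get(term, set())
    return sorted(expanded)
```

With these tables, `expand_query("bicycle")` adds "wheel", while `expand_query("wheel")` stays unchanged. Whether following a one-way relation helps or hurts precision depends on the application and would still need to be checked.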
Is the Thesaurus a Solution?
A thesaurus (e.g. the German OpenThesaurus) contains synonyms for a term, but also broader and narrower terms as well as associations with related terms. It is therefore more of a word network than a word index. It offers a wide range of interesting “pseudo” synonyms, but their quality needs to be checked by hand; they should not be applied automatically and unchecked.
Does Computational Linguistics Have a Solution?
The concept of word embeddings is particularly useful here. Roughly speaking, a word is described by the words that appear to its left and right across a wide range of contexts. The computer learns this description (machine learning) from very large quantities of training documents using special algorithms. One of the most common approaches is the Word2Vec model, in which each word is mapped to a multidimensional vector of numerical values. If two words have almost identical word vectors, i.e. they occur in very similar contexts, they are candidates for synonyms.
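How "almost identical word vectors" can be measured is commonly illustrated with cosine similarity. The sketch below does not train a Word2Vec model; the four-dimensional vectors in `VECTORS` are invented for illustration, and the threshold of 0.95 is an arbitrary assumption.

```python
import math

# Made-up 4-dimensional vectors purely for illustration; a trained
# model would supply vectors with far more dimensions.
VECTORS = {
    "mailman":        [0.90, 0.10, 0.30, 0.00],
    "letter carrier": [0.85, 0.15, 0.35, 0.05],
    "bicycle":        [0.10, 0.90, 0.00, 0.20],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def synonym_candidates(word, threshold=0.95):
    """Return all other words whose vector is very close to that of `word`."""
    target = VECTORS[word]
    return [other for other, vec in VECTORS.items()
            if other != word and cosine(target, vec) >= threshold]
```

The result is a list of candidates only: a high similarity says that two words occur in similar contexts, not that they are guaranteed to be interchangeable.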
- basic grammatical form or with inflection:
Bicycles → Bicycle
Of a green tree → A green tree
Succeeded → To succeed
A good checking program for user input should have a high-quality dictionary that lists all valid conjugations, declensions and comparative forms for each word. This means that entries that are not in the basic grammatical form (usually: nominative, singular, present tense) can also be recognized.
Consequently, it would also make sense to find the words in the documents to be searched using their basic form. This means more results, but also more effort.
In practice, however, tracing back to the basic form is only trivial for single words. In the case of multi-part expressions (phrases), grammatical peculiarities of the German language must be taken into account:
Agreement across several words, e.g. with indefinite and definite articles:
eines grünen Baumes (of a green tree) → ein grüner Baum; des grünen Baumes (of the green tree) → der grüne Baum
Separable verb prefixes at the end of a sentence, etc.:
das Gremium schlägt heute vor → das Gremium vorschlagen heute
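For single words, tracing back to the basic form can be sketched as a plain dictionary lookup that is applied to both the query and the documents. The full-form table `FULL_FORMS` and the helper names are hypothetical, and the sketch deliberately ignores the multi-word phenomena described above.

```python
# Tiny hand-written full-form dictionary (hypothetical sample data):
# every inflected form maps to its base form.
FULL_FORMS = {
    "bicycles": "bicycle",
    "bicycle": "bicycle",
    "succeeded": "succeed",
    "succeeds": "succeed",
    "succeed": "succeed",
    "trees": "tree",
    "tree": "tree",
}


def to_base_form(word):
    """Return the base form, or the lowercased word itself if unknown."""
    key = word.lower()
    return FULL_FORMS.get(key, key)


def normalize(text):
    """Split on whitespace and reduce every word to its base form."""
    return [to_base_form(w) for w in text.split()]


def matches(query, document):
    """True if every base form in the query also occurs in the document."""
    doc_forms = set(normalize(document))
    return all(w in doc_forms for w in normalize(query))
```

Because the document is normalized too, a query for "bicycles" also finds a text that only contains "bicycle", which is the "more results, but also more effort" trade-off mentioned above.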
- Compound words (composites):
Filter coffee, coffee filter
Input string, string input
Schweinebraten, Schweinsbraten (pig + s/e + roast)
Fahrgast (drive + guest)
Fleischereierzeugnis (butcher's product; or butcher + eggs + certificate?)
In hardly any other language is it as easy and popular to combine individual words into a new, meaningful word as it is in German. The possible variations are endless, so that a dictionary could not include everything. So how do you help yourself with an unknown compound word? You break it down into its original individual words and continue working with them.
But this involves some effort. The following must be taken into account: linking elements (departure time display), the basic form of the constituents (crown prince servant: crown + prince + servant; roast goose knife: goose + roast + knife, roast goose + knife, geese + roast knife), word order (filter coffee powder ≠ ground coffee filter) and meaningfulness (banana saddle, deer antler handlebar, flower idea world).
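Breaking an unknown compound into known words can be sketched as a recursive search over a lexicon that allows an optional linking element between the parts. Everything here is an assumption for illustration: the mini-lexicon `LEXICON`, the list `LINKERS`, and the function `split_compound` are hypothetical, and the sketch does not judge whether a split is meaningful.

```python
# Hypothetical mini-lexicon of known German base words (lowercased).
LEXICON = {"schwein", "braten", "fahr", "gast", "kaffee", "filter"}

# Common linking elements; the empty string means "no linking element".
LINKERS = ("", "s", "e", "es", "n", "en", "er")


def split_compound(word, lexicon=LEXICON):
    """Return every split of `word` into lexicon entries.

    Parts may be joined by one of LINKERS. The order of the parts is
    preserved, so "filterkaffee" and "kaffeefilter" stay distinguishable.
    """
    word = word.lower()
    results = []

    def walk(rest, parts):
        if not rest:
            results.append(parts)
            return
        for end in range(1, len(rest) + 1):
            head = rest[:end]
            if head not in lexicon:
                continue
            tail = rest[end:]
            if not tail:
                # The head consumed the whole remainder: a complete split.
                walk("", parts + [head])
                continue
            for linker in LINKERS:
                # A linking element must be followed by another part.
                if tail.startswith(linker) and len(tail) > len(linker):
                    walk(tail[len(linker):], parts + [head])

    walk(word, [])
    return results
```

Called with a lexicon that contains "fleischer", "fleischerei", "eier", "erzeugnis" and "zeugnis", the function returns both the intended split of "Fleischereierzeugnis" and the egg-certificate reading. Choosing between such candidates is exactly the meaningfulness problem described above.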
- Pseudonyms and official names:
Aliases and stage names
Loriot – Viktor von Bülow – Vicco von Bülow
Marilyn Monroe – Norma Jean Baker
Angela Merkel – Federal Chancellor – Federal Chancellor of the FRG – CDU Chairwoman – CDU Federal Chairwoman
Pope Francis – Pope – Bishop of Rome – Head of the Roman Catholic Church – Holy Father – Pontifex Maximus – Jorge Mario Cardinal Bergoglio SJ
Synonym directories for official names are tied to time periods and must be kept up to date. Older documents require additional work, because they still refer to earlier terms of office or to names that are no longer valid.
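One way to tie official names to time periods is to store a validity interval with every entry and to resolve the name against the date of the document being searched. The table layout and the function `resolve_alias` are assumptions for this sketch; the two entries use the terms of office of the German Federal Chancellors Gerhard Schröder and Angela Merkel.

```python
from datetime import date

# Hypothetical alias directory: (alias, person, valid_from, valid_to).
# Intervals are half-open: valid_from is included, valid_to is excluded.
# valid_to=None would mean that the alias is still valid today.
ALIASES = [
    ("federal chancellor", "Gerhard Schröder",
     date(1998, 10, 27), date(2005, 11, 22)),
    ("federal chancellor", "Angela Merkel",
     date(2005, 11, 22), date(2021, 12, 8)),
]


def resolve_alias(alias, on):
    """Return the people to whom `alias` referred on the date `on`."""
    alias = alias.lower()
    return [person for name, person, start, end in ALIASES
            if name == alias and start <= on and (end is None or on < end)]
```

A document dated 2000 that mentions the "Federal Chancellor" then resolves to a different person than one dated 2010, and a date outside every interval yields an empty list, which signals that the directory needs updating.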
There are certainly other difficulties and pitfalls in user input, but let’s leave it at the most common scenarios described above.
For the sake of completeness, it should be mentioned that an input check program usually performs the following tasks as a first step:
- Clean up the input string:
Remove multiple spaces and all irrelevant visible and invisible (control) characters that the keyboard of an imaginative or distracted user produces.
- Recognize the language in which the entry was made (if this has not already been defined via a previous language selection)
Winter in England (German?, English?)
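The clean-up step can be sketched with the Python standard library alone. Which characters count as "irrelevant" is an assumption here: the hypothetical function `clean_input` treats all whitespace as a plain space and drops the Unicode categories Cc (control) and Cf (format, e.g. zero-width spaces). Language recognition is not covered by this sketch.

```python
import re
import unicodedata


def clean_input(raw):
    """Remove control and format characters and collapse whitespace."""
    # Turn every whitespace character (tab, newline, no-break space, ...)
    # into a plain space first, so that words stay separated.
    text = "".join(" " if ch.isspace() else ch for ch in raw)
    # Drop control characters (Cc) and invisible format characters (Cf).
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) not in ("Cc", "Cf"))
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()
```

Replacing whitespace before filtering matters: tabs and newlines are themselves control characters, and deleting them outright would glue neighboring words together.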
What Options Does the User Have to Optimize Their Input?
- Specification:
from the generic term to the detailed sub-term
clasp, chain clasp, gold chain clasp, gold necklace clasp
- Specialization:
Application, material, production, size, quality
Net: Hair net, fishing net, traffic net, social net, grid net, spider net, mosquito net, shopping net, computer net …
- Context:
Adjectives, adverbs
Fön: hair dryer, Föhn: weather condition
inexpensive Föhn → correction to: inexpensive Fön (hair dryer)
Fön in the Alps → correction to: Föhn (weather condition) in the Alps
This makes it clear that in software that takes the human-machine interface into account, such as CONTEXTSUITE, not only the main algorithm matters but also the components that feed it: the program for receiving, checking, “decoding” and polishing the manual user input.
Computer scientist and computational linguist
Ulrike is a computer scientist and computational linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.