|

November 3, 2020

|

7 min. read

How Smart Can a Search Slot Be? (Part 2: Resolving Ambiguities)

A common problem when evaluating search fields is that users rarely enter precise terms. Instead, they type one or a few words that roughly express what they mean. Part 1 of this article series dealt with the handling of different spellings. Part 2 looks at how software such as CONTEXTSUITE also covers more complex scenarios: How can even ambiguous terms be reliably resolved? When do synonyms improve a search result? Are deviations from the basic grammatical form permitted?

There is a whole range of linguistic peculiarities that can affect user input.

How Can the Correct Meaning of a Word Be Identified in Context?

    • Ambiguity of words:
      A word has several meanings:
      Bow: weapon – gesture
      Bark: sound – tree covering
      Seal: mammal – closure – emblem
      Light: illumination – weight – gentle

If the input consists of only one or a few words, the ambiguity is very difficult to resolve. Without an indicative context, the correct reading, and therefore the appropriate domain, can only be estimated. The following questions can help:
What is the thematic context of the application? Which usage of the ambiguous term is more common? Is it a currently trending term? Are there clues from previous user input? The program can also ask the user:
“Did you mean: golf → car, golf → sport, golf → bay?”
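
The considerations above can be sketched in a few lines of Python. This is only an illustration: the sense inventory, the prior values, the bonus weights and the `margin` threshold are made-up assumptions, not how CONTEXTSUITE works.

```python
# Hypothetical sense inventory: (reading, domain, rough prior frequency).
# All numbers are invented for illustration.
SENSES = {
    "golf": [
        ("car", "automotive", 0.3),
        ("sport", "sports", 0.6),
        ("bay", "geography", 0.1),
    ],
}


def resolve(term, app_domain=None, history=(), margin=0.5):
    """Return (reading, None) if one reading clearly wins,
    otherwise (None, question) so the program can ask the user."""
    senses = SENSES.get(term.lower())
    if not senses:
        return None, None

    def score(sense):
        _, domain, prior = sense
        bonus_app = 1.0 if domain == app_domain else 0.0   # thematic context
        bonus_hist = 0.5 if domain in history else 0.0     # earlier user input
        return prior + bonus_app + bonus_hist

    ranked = sorted(senses, key=score, reverse=True)
    if len(ranked) > 1 and score(ranked[0]) - score(ranked[1]) < margin:
        options = ", ".join(f"{term} → {reading}" for reading, _, _ in ranked)
        return None, f"Did you mean: {options}?"
    return ranked[0][0], None


print(resolve("golf", app_domain="automotive"))
print(resolve("golf"))
```

In a car portal the first call settles on the car reading, while the second call, without any context, falls back to asking the user.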

    • Synonyms of meaning:
      There are several words for one term (regional differences, linguistic history)
      Rolls – bread rolls
      Letter carrier – mailman
      Match – matchstick
      often – frequently
      vertical – upright?

      Old-age provision – old-age pension – old-age insurance?
      Bicycle – Bike???
      24 – 2 x 12 – “8×4”

Here, the user input could be supplemented with all of its synonyms in order to return more hits.
But are these really fully equivalent terms that can be used interchangeably?
Examples:
Orange could also stand for the color orange and thus produce irrelevant hits. Confusion – chaos – hullabaloo look very suitable at first glance, but is there a traffic mess or a traffic hullabaloo in the same way that there is traffic chaos? Between wheel and bicycle there is a generic-term/sub-term relationship: wherever it says bicycle, you could also use wheel, but not vice versa.
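
A minimal sketch of such a query expansion, assuming a small hand-checked table (the `EXPANSIONS` dictionary and the function name are hypothetical). Because the table is directed, a one-way relationship like the wheel/bicycle example can be entered in one direction only.

```python
# Hypothetical, hand-checked expansion table. Entries are directed:
# "bicycle" may be expanded with "wheel", but not the other way round.
EXPANSIONS = {
    "often": ["frequently"],
    "frequently": ["often"],
    "bicycle": ["wheel"],
}


def expand_query(query):
    """Return, for each query word, the word itself plus its allowed alternatives."""
    return [[token] + EXPANSIONS.get(token, []) for token in query.lower().split()]


print(expand_query("bicycle often"))
print(expand_query("wheel"))
```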

Is the Thesaurus a Solution?

A thesaurus (e.g. the German OpenThesaurus) contains synonyms for a term, but also broader and narrower terms as well as associations with related terms. It is therefore more of a word network than a word index. It offers a wide range of interesting “pseudo” synonyms, but their quality needs to be checked by hand; they should not be applied automatically and unchecked.

Does Computational Linguistics Have a Solution?

The concept of word embeddings is particularly useful here. Roughly speaking, a word is described by the words that appear in a wide window to its left and right. The computer learns this (machine learning) from very large quantities of training documents using special algorithms. One of the most common approaches is the Word2Vec model, in which each word is mapped to a multidimensional vector of numerical values. If two words have almost identical word vectors (i.e. they occur in very similar contexts), they are candidates for synonyms.
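
How "almost identical" vectors can be detected is usually a matter of cosine similarity. The following sketch uses tiny invented three-dimensional vectors instead of a trained Word2Vec model (real vectors typically have hundreds of dimensions), and the threshold of 0.95 is an arbitrary assumption.

```python
import math

# Invented toy vectors; a real model would learn these from training documents.
TOY_VECTORS = {
    "often": [0.90, 0.10, 0.20],
    "frequently": [0.88, 0.12, 0.25],
    "bark": [0.10, 0.80, 0.50],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def synonym_candidates(word, vectors=TOY_VECTORS, threshold=0.95):
    """List all other words whose vector is nearly parallel to that of `word`."""
    target = vectors[word]
    return [
        other
        for other, vec in vectors.items()
        if other != word and cosine_similarity(target, vec) >= threshold
    ]


print(synonym_candidates("often"))  # ['frequently']
```

A high similarity only makes two words candidates. As with the thesaurus, the result still needs to be checked before it is used for automatic expansion.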

    • Basic grammatical form or inflected form:
      Bicycles → Bicycle
      Of a green tree → A green tree
      Succeeded → To succeed

A good checking program for user input should have a high-quality dictionary that lists all valid conjugations, declensions and comparative forms for each word. This means that entries that are not in the basic grammatical form (usually: nominative, singular, present tense) can also be recognized.
Consequently, it would also make sense to match the words in the documents to be searched by their basic form. This yields more results, but also takes more effort.
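
As a rough sketch of the idea, both the query and the documents can be reduced to basic forms before matching. The tiny `FULL_FORMS` dictionary below is a hypothetical stand-in for the high-quality dictionary described above.

```python
# Hypothetical full-form dictionary: inflected form -> basic form.
FULL_FORMS = {
    "bicycles": "bicycle",
    "succeeded": "succeed",
    "trees": "tree",
}


def to_base_form(token):
    """Look up the basic form; unknown words are kept as they are."""
    token = token.lower()
    return FULL_FORMS.get(token, token)


def base_forms(text):
    return {to_base_form(token) for token in text.split()}


def search(query, documents):
    """Return the documents that contain every query word in some inflected form."""
    wanted = base_forms(query)
    return [doc for doc in documents if wanted <= base_forms(doc)]


docs = ["She bought two bicycles", "A green tree", "He succeeded at last"]
print(search("bicycle", docs))  # ['She bought two bicycles']
```

The extra effort mentioned above shows up here as well: every document has to be normalized, not just the short query.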

In practice, however, tracing a word back to its basic form is only trivial for single words. In the case of multi-word expressions (phrases), grammatical peculiarities of the German language must be taken into account:
Agreement across several words, e.g. with indefinite and definite articles:
of a green tree → ein grüner Baum / der grüne Baum
Separable verb prefixes at the end of a sentence, etc.:
das Gremium vorgeschlagen heute → das Gremium vorschlagen heute

    • Compound words (composites):
      Filter coffee, coffee filter
      Input string, string input
      Schweinebraten, Schweinsbraten (pig + s/e + roast)
      Fahrgast (drive + guest)
      Fleischereierzeugnis (butchery + eggs + certificate ?)

In hardly any other language is it as easy and popular to combine individual words into a new, meaningful word as it is in German. The possible variations are endless, so no dictionary could include them all. So what do you do with an unknown compound word? You break it down into its individual words and continue working with those.
But this involves some effort. You have to take into account linking elements (departure time display), the basic form of the constituents (crown prince servant: crown + prince + servant; roast goose knife: goose + roast + knife, roast goose + knife, geese + roast knife), word order (filter coffee powder vs. ground coffee filter) and meaningfulness (banana saddle, deer antler handlebar, flower idea world).
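
The decomposition step can be sketched as a recursive dictionary lookup that tolerates linking elements between the parts. The lexicon and the list of linking elements below are small assumptions for illustration. The sketch returns every possible split and does not judge meaningfulness, so an ambiguous word such as Fleischereierzeugnis yields more than one reading.

```python
# Hypothetical mini lexicon of known base words (lowercase).
LEXICON = {
    "schwein", "braten",
    "fleischerei", "erzeugnis",
    "fleischer", "eier", "zeugnis",
}
# Assumed linking elements; "" means the parts are joined directly.
LINKERS = ("", "s", "e")


def split_compound(word, lexicon=LEXICON, linkers=LINKERS):
    """Return all ways to split `word` into lexicon words,
    allowing one linking element after each non-final part."""
    word = word.lower()
    results = [[word]] if word in lexicon else []
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        for linker in linkers:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if not tail:
                continue
            for parts in split_compound(tail, lexicon, linkers):
                results.append([head] + parts)
    return results


print(split_compound("Schweinebraten"))
print(split_compound("Schweinsbraten"))
print(split_compound("Fleischereierzeugnis"))
```

Both spellings of the roast are traced back to the same two words, while the third call returns two competing splits that a later step would have to rank.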

    • Pseudonyms and official names:
      Aliases and stage names
      Loriot – Viktor von Bülow – Vicco von Bülow
      Marilyn Monroe – Norma Jean Baker
      Angela Merkel – Federal Chancellor – Federal Chancellor of the FRG – CDU Chairwoman – CDU Federal Chairwoman
      Pope Francis – Pope – Bishop of Rome – Head of the Roman Catholic Church – Holy Father – Pontifex Maximus – Jorge Mario Cardinal Bergoglio SJ

Synonym directories for official names are tied to time periods and must be kept up to date. Older documents take extra work, because they still refer to earlier terms of office and to names that were valid at the time.
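
One way to picture such a time-bound directory is a table with validity periods that is queried with the date of the document. The structure and function name below are assumptions for illustration. The entry for Gerhard Schröder is added here only to show a past term of office, and the open end date reflects the time this article was written.

```python
from datetime import date

# Illustrative entries: (official name, person, valid from, valid until).
# `None` as end date means "still valid when the directory was last updated".
ALIASES = [
    ("federal chancellor", "Gerhard Schröder", date(1998, 10, 27), date(2005, 11, 22)),
    ("federal chancellor", "Angela Merkel", date(2005, 11, 22), None),
]


def resolve_title(title, on):
    """Return the person who held `title` on the given date, or None if unknown."""
    key = title.lower()
    for alias, person, start, end in ALIASES:
        if alias == key and start <= on and (end is None or on < end):
            return person
    return None


print(resolve_title("Federal Chancellor", date(2003, 5, 1)))   # Gerhard Schröder
print(resolve_title("Federal Chancellor", date(2020, 11, 3)))  # Angela Merkel
```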

There are certainly other difficulties and pitfalls in user input, but let’s leave it at the most common scenarios described above.

For the sake of completeness, it should be mentioned that an input check program usually performs the following tasks as a first step:

    • Clean up the input string:
      Remove multiple spaces and all irrelevant visible and invisible (control) characters that the keyboard of an imaginative or distracted user produces.
    • Recognize the language in which the entry was made (if this has not already been defined via a previous language selection)
      Winter in England (German?, English?)
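
The cleanup step could look roughly like this in Python, using only the standard library. Which characters count as "irrelevant" is an assumption here: the sketch drops Unicode control and format characters (such as zero-width spaces) and collapses all whitespace runs.

```python
import re
import unicodedata


def clean_input(raw):
    """Normalize a raw search string before any linguistic processing."""
    # Treat tabs and line breaks as ordinary spaces so words stay separated.
    text = "".join(" " if ch in "\t\n\r" else ch for ch in raw)
    # Drop control (Cc) and format (Cf) characters, e.g. BEL or zero-width space.
    text = "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf"))
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_input("  Winter \t in\u200b  England\x07 "))  # Winter in England
```

Language recognition is left out of this sketch, since it normally relies on a trained model or an external library.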

What Options Does the User Have to Optimize Their Input?

    • Specification:
      from the generic term to the detailed sub-term
      clasp, chain clasp, gold chain clasp, gold necklace clasp
    • Specialization:
      Application, material, production, size, quality
      Net: Hair net, fishing net, traffic net, social net, grid net, spider net, mosquito net, shopping net, computer net …
    • Context:
      Adjectives, adverbs
      Fön: hair dryer, Föhn: weather condition
      inexpensive Fön/Föhn
      → correction to: inexpensive hair dryer
      Fön/Föhn in the Alps
      → correction to: weather condition in the Alps

This makes it clear that in software that takes the human-machine interface into account, such as CONTEXTSUITE, it is not only the main algorithm that matters, but also the component that feeds it: the program that receives, checks, “decodes” and polishes the manual user input.

Ulrike Handelshauser

Computer scientist and computational linguist

Ulrike is a computer scientist and computational linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.
