April 26, 2021

8 min. read
How Can a Computer Learn to Understand a Natural Language? (Part 1: The Dictionary)


It is always amazing and fascinating to see how quickly computers can handle even the largest amounts of data and how “smart” the results they present can appear. In simple terms, this works when the data are numerically coded values fed into algorithms built from clear instructions and formulas, definitions, logical comparisons and “if-then-else” decisions. But can such algorithms also be applied to human language, whether spoken or written? How can you explain a natural language to a computer?

It is certainly not enough to assign a numerical value to every letter, symbol or punctuation mark and then “calculate” with it. Every natural language has a life of its own, an individual inner structure as well as features and usage patterns that do not always appear logical, which have become ingrained over many centuries and have also been modified by a wide variety of influences. And there is another leap in intelligence to master, namely “calculating” the meaning of a word or sentence.

This multi-part blog post uses examples to illustrate the challenges of digital processing of natural languages and the corresponding solutions. This is a field of computational linguistics and is referred to as Natural Language Processing (NLP) or, more broadly, Natural Language Understanding (NLU) – i.e. a computer-aided understanding of human language. Our contribution is primarily concerned with clarity and food for thought and not with correct or complete linguistic analysis. And our contribution focuses less on language learning and translation, where the computer, with its wide range of interaction options, has now become an excellent substitute teacher for humans – but rather on the understanding of language itself.

These are the central questions:

  • What makes a natural language so difficult for the computer?
  • Why is it not so easy to translate a language into a digital model?
  • A computer is good at recognizing phenomena such as language use and language change, but what else can it decode?

In addition to a teacher and textbook, a person usually needs a large vocabulary book, or even better, a dictionary, to learn a foreign language. Learning vocabulary is the most tedious part, while the stoic computer with its abundance of memory and enormously fast access times is clearly at an advantage. However, it is the dictionary itself that proves to be the problem.

A Challenge: There Is No Complete Dictionary!

Every natural language is in a constant state of change, for many reasons. Examples include:

  • Acoustic processes:
    Sound shifts, whose triggers have not yet been fully explained, give rise to new branches in language families;
    the pronunciation of “the” (/ðə/) grinds down to “se” as English is used worldwide as a universal lingua franca
  • Change of word type:
    The independent Germanic word “haidu” for “manner” became the word ending “-heit” via “heid” and “heit”
  • Change in the meaning of a word:
    The noun “race” underwent a transformation from a neutral term in genetics to a controversial and negatively charged one during the Third Reich and, more recently, in the anti-racism movement

    Figure: Word environments determined by the computer (CONTEXTSUITE software) for the term “race” in two different contexts: “animals” (left) and “society” (right)

  • Formation of sublanguages with special expressions depending on professional, social or age group
  • Occasional man-made spelling reforms
  • The phenomenon that we experience and help shape every day: the emergence of new words and the disappearance of words

The last sub-item deserves a closer look. Our consumer-oriented and networked world constantly needs new and unique word creations for products, start-ups, associations, trends and world events (gender asterisk, ghost games). As we increasingly consume words by reading rather than hearing, these new words and abbreviations also use upper and lower case letters, numbers and special characters (o2, H&M, BiFi, 8×4, Coffee2Go, 4you, AdCommerce, E.ON) or form acronyms (LOL, FAQ). In addition, globalization provides a flood of foreign words from all over the world, especially Anglicisms (urban gardening, sale, to-do list).

Each of us produces texts via WhatsApp, Twitter and email as if on an assembly line. Poetic liberties and spontaneously invented, short-lived coinages that satisfy grammar and spelling requirements are open to everyone and can even become established. Just think of the many new word creations connected with corona.

A characteristic of the German language that contributes exponentially to the creation of new words is its fondness for forming compound words, known as Komposita. All word types can participate (e.g., Vergissmeinnicht [forget-me-not], Grünspecht [green woodpecker], honigsüß [honey-sweet]). Word order matters (Wandregal [wall shelf] vs. Regalwand [shelf wall]), as does the use of hyphens and pronunciation emphasis (Welt-Wassertag vs. Weltwasser-Tag [World Water Day]). Other factors include the linking s (Kindskopf [child’s head]), singular vs. plural (Kinderkopf [children’s head]), numerical forms (Bauherrenmodell [developer model], Regenwaldretterhubschrauber [rainforest rescue helicopter]), and many other linguistic nuances.
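The combinatorial productivity of Komposita is one reason why fixed word lists fall short: any text may contain compounds that no dictionary lists. A minimal sketch of how a compound might be decomposed against a known vocabulary – the tiny vocabulary, the set of linking elements and the greedy strategy are simplifying assumptions for illustration, not how production splitters work:

```python
# Illustrative sketch: naive German compound splitting against a tiny,
# hypothetical vocabulary. Real splitters rely on corpus frequency statistics.
VOCAB = {"welt", "wasser", "tag", "regal", "wand", "kind", "kopf"}
LINKING = ("", "s", "es", "er", "n", "en")  # common German linking elements

def split_compound(word, vocab=VOCAB):
    """Return one possible decomposition of a compound, or None."""
    word = word.lower()
    if word in vocab:
        return [word]
    # Try the longest possible head first (heads shorter than 3 chars skipped)
    for i in range(len(word) - 1, 2, -1):
        head = word[:i]
        for link in LINKING:
            stem = head[: len(head) - len(link)] if link else head
            if head.endswith(link) and stem in vocab:
                rest = split_compound(word[i:], vocab)
                if rest:
                    return [stem] + rest
    return None

print(split_compound("Weltwassertag"))  # ['welt', 'wasser', 'tag']
print(split_compound("Regalwand"))     # ['regal', 'wand']
print(split_compound("Kinderkopf"))    # ['kind', 'kopf'] (linking "er" stripped)
```

The linking-element handling shows why splitting is harder than substring search: “Kinderkopf” only decomposes once the “er” between “Kind” and “Kopf” is recognized as glue rather than vocabulary.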

Such an oversupply of words makes creating or updating a dictionary a Sisyphean task. The spelling dictionary of the German language, the Duden, currently contains 148,000 entries, and the online version contains many more. If a new word has been in frequent use for a long period of time, it can be entered in the Duden. Conversely, words that have fallen out of fashion or been forgotten (dial, huntsman) are deleted. This requires constant analysis of German texts of all types of media by the Duden team and its data centers.

Does every company that wants to process texts with a computer need access to the Duden or the American Heritage Dictionary or Le Petit Robert? Or do they need to create their own digital dictionary with lots of information about meaning, origin, hyphenation rules, grammar, collocations (frequent surrounding words) as well as thesaurus and ontology data? Would it be possible to manage without a dictionary?

From the Dictionary Approach to Dynamic Learning – Artificial Intelligence Makes It Possible

In fact, the traditional practices for maintaining dictionaries in professional software solutions have been completely overtaken by the latest developments in artificial intelligence and machine learning. Computers – unlike humans – do not tire and can therefore read any number of texts and evaluate them in their overall context. New deep learning algorithms can use large volumes of text to determine which words occur in related contexts. Synonyms can be identified even before they have ever found their way into a dictionary, and related terms can be determined very reliably. Consider “social distancing”: it is related to the coronavirus pandemic, carries a special connotation in that context, and is even paraphrased in Austria with the term “baby elephant”. A machine learns such a development faster and more reliably than a human – not because it is more intelligent, but because it processes orders of magnitude more data than an editor ever could.
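The core idea – words that occur in related contexts end up with similar representations – can be sketched with plain co-occurrence counting, a toy stand-in for the deep learning models mentioned above. The mini-corpus, window size and function names here are invented for illustration:

```python
# Illustrative sketch of distributional semantics: each word gets a vector of
# its neighbours, and similar neighbourhoods mean similar vectors.
from collections import Counter
from math import sqrt

corpus = [
    "social distancing slows the coronavirus pandemic",
    "masks and distancing reduce coronavirus spread",
    "the baby elephant illustrates distancing in austria",
    "vaccination campaigns reduce pandemic impact",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` tokens."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            ctx = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
            vectors.setdefault(word, Counter()).update(ctx)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["distancing"], vecs["coronavirus"]))  # higher in this toy corpus
print(cosine(vecs["distancing"], vecs["vaccination"]))  # lower in this toy corpus
```

Real systems learn dense vectors from billions of tokens, but the principle is the same: no dictionary entry is needed to discover that two terms belong together.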

This means that purely dictionary-based approaches are increasingly taking a back seat. Maintaining dictionaries – both at the level of basic language vocabulary (Duden) and for the corporate context (keyword: thesauri and ontologies) – has always been extremely time-consuming. And the dynamic development of language means that a one-off creation (or modeling) of the dictionary or thesaurus is not enough. Editors can no longer perform this task to an economically viable extent. Instead, computers are increasingly taking over this task.

Their advantage is also that they can filter different contexts very quickly and thus determine context-specific language use. In companies, this is relevant at every interface where product descriptions must be brought in line with requirements – for example, matching the language of marketing with that of the developers. Applied to the Duden, this means not only publishing a standard Duden – and perhaps one for youth language – but also continuously creating dictionaries for various target groups (from migrants to vegans to senior citizens).
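Such context-specific profiling can be illustrated by counting a word's collocates separately per sub-corpus – the two mini-corpora and the function name below are invented assumptions, loosely echoing the “race” example above:

```python
# Hypothetical sketch: profile the same term in two sub-corpora by counting
# its most frequent neighbours in each one separately.
from collections import Counter

def collocates(sentences, target, window=2):
    """Count the words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            if word == target:
                counts.update(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
    return counts

# Two tiny, invented sub-corpora for the same term
animals = ["the dog race started early", "a horse race and a dog show"]
society = ["debates about race and society", "race discrimination in society"]

print(collocates(animals, "race").most_common(3))
print(collocates(society, "race").most_common(3))
```

The two collocate lists differ completely, which is exactly the signal a system needs to emit separate, audience-specific “dictionary” entries for the same surface word.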

This brings us to the next questions, which will be the subject of blog post part 2:
What can a computer recognize in a text? Can it also read “between the lines” or even grasp a meaning?

A final tip: if you are interested in the origin of a word and its use in the past, you can look it up – in addition to the Duden – in descriptive visualizations at:

  • Collocation analysis in a diachronic perspective (Diacollo)
    Partially animated display of search results along a timeline, best suited for “old” words
    Input example “Revolution”:
    QUERY: Revolution, FORMAT: Bubble or Cloud or select HighChart
    (click on the triangle on the left-hand side of the screen to start the animation or click anywhere on the timeline; conclude each new input or change with SUBMIT).

Computer scientist and computer linguist

Ulrike is a computer scientist and computer linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.
