How Can a Computer Learn to Understand a Natural Language? (Part 2: Text Coding and Text Editing)
Anyone who knows anything about computers knows that they only think in “zeros” and “ones”. So how can a computer not only read a natural language text, but even understand it? We will take a closer look at this in the second part of our article (click here for part 1).
At the beginning of the digital age, algorithms for numerical calculations, for example in administration or for scientific applications, were created on the basis of the computer words “0” and “1”. The handling of texts was initially limited to displaying, saving, retrieving, comparing, sorting, changing and deleting. What techniques and refinements are required for a computer to be able to “understand” a text?
It All Starts with Coding
All upper- and lower-case letters including diacritics, digits, punctuation marks and special characters are assigned a unique binary code, e.g. in the ASCII character set (128 characters) or the EBCDIC character set (256 characters). Only the “image” of a letter or character is encoded. At first, texts in different encodings could not be stored together in one file; their content had to be converted in a cumbersome process. This only became easier with the creation of UTF-8, the universal, international standard encoding for the Unicode character set. Unicode has grown steadily: version 1 from October 1991 contained 24 writing systems and around 7,000 characters; version 12 from May 2019 already contained 150 writing systems and around 138,000 characters. It has now become possible to use a wide variety of character systems in a colorful mix of
中國人, عربى, ᚮᚱᛄ, ⠃⠗⠊ and ♫ and store them side by side. An everyday convenience, but an underestimated and unnoticed milestone of the computer age and an engine of globalization.
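The difference between a character’s abstract Unicode code point and the bytes UTF-8 actually stores can be made visible with a few lines of Python; this is a minimal illustration of the encoding idea, not tied to any particular library:

```python
# UTF-8 assigns each Unicode character a variable-length byte sequence,
# so text from different writing systems can live side by side in one file.
mixed = "中國人 عربى ⠃⠗⠊ A"
data = mixed.encode("utf-8")

for ch in "中AB":
    print(f"{ch!r}: code point U+{ord(ch):04X}, "
          f"{len(ch.encode('utf-8'))} byte(s) in UTF-8")

# ASCII characters still need only one byte; CJK characters need three.
assert "A".encode("utf-8") == b"A"
assert len("中".encode("utf-8")) == 3
```

Because the one-byte range of UTF-8 coincides with ASCII, old ASCII files are automatically valid UTF-8, which helped the standard spread.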
Text Types and Information Content
So how can the computer make the leap from a binary-encoded “image” of a letter to a word, sentence, information, meaning and “reading between the lines”? Let’s look at the following example text:
“Call me Ismaelia Mercedes. Seven and a half years ago, on December 24, 1873 – no matter how long ago – when I had little or only $26 in my MyCashPocket and nothing in particular appealed to me in the land of Patagonia, I thought I would sail around a bit as a figurehead on a three-master and see the watery part of the world, namely the Bay of Biscay.”
While reading, images arise in a person’s mind, they may make a note of some concrete and interpreted data and feel strangely reminded of Moby Dick. Which passages in the sample text actually provide information or allow conclusions to be drawn?
“Call me Ismaelia Mercedes (protagonist is a woman). Seven and a half years ago, on December 24, 1873 (so it’s now June 1881) – it doesn’t matter exactly how long ago – I had little or only $26 in my MyCashPocket (?) and nothing in particular appealed to me in the land of Patagonia, so I thought I would sail around a bit as a figurehead on a three-master and see the watery part of the world (i.e. the seas), namely the Bay of Biscay.”
What stands out?
- Texts do not follow a fixed structure. Which words in which order, how many sentences in which length, how many headings or chapters – only the author of the text or the type of text such as email, tweet, stock ticker or poem can determine this.
- In a continuous text, there tend to be only a few information-bearing expressions, but a lot of linguistic embellishments and meaningless filler words, the so-called stop words. The 100 most common German words alone (die, der, und, in, zu, den, das, nicht, von, sie …) make up – through repeated use – a good 60% of a conventional text! These embellishments only become relevant when it comes to recognizing and characterizing a typical writing style (short or convoluted sentences, decorative adjectives, word variety, short or rather long words, abundance of stylistic devices such as metaphors or alliteration, etc.) in an authorship or plagiarism check by the computer.
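As a rough illustration, the share of stop words in a sentence can be estimated in a few lines. The stop-word list below is a tiny English sample invented for this sketch, not a real 100-word frequency list:

```python
# Estimate what share of a text consists of common stop words.
# STOP_WORDS is a small illustrative sample, not a complete list.
STOP_WORDS = {"the", "and", "in", "to", "of", "a", "i", "it", "not", "she"}

def stop_word_share(text: str) -> float:
    # Strip surrounding punctuation and lowercase each token.
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in STOP_WORDS for w in words) / len(words)

sample = "She likes to read books in the garden and it is quiet."
print(f"{stop_word_share(sample):.0%} stop words")
```

Even this toy sentence is half filler, which matches the intuition that only a few expressions carry the actual information.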
Tools and Steps for Text Editing
First of all, the computer needs a digital dictionary or lexicon containing all the inflected forms, decompositions and irregularities of each word. It thus knows the morphology of the language. However, since no dictionary is ever complete, it always remains somewhat imprecise. Secondly, the computer needs a digital language model. It describes the probabilities with which inflected words can follow one another and how sentences are formed: “She likes to read books” is more likely than “She likes to read cloths”. The computer thus knows the structure of the language. Such models range from long-established statistical language models based on N-grams or hidden Markov models to the latest neural language models such as BERT. They make it possible to predict likely next words from a given sentence position and to draw conclusions about words and phrases with similar meanings. A computer learns the probabilities from many millions of training sentences.
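A minimal bigram model illustrates the statistical idea: count how often word pairs occur in training sentences and turn the counts into conditional probabilities for the next word. The three-sentence corpus below is invented for this sketch; real models train on millions of sentences:

```python
from collections import Counter

# A toy bigram language model: P(next word | current word)
# estimated from counts in a tiny invented corpus.
corpus = [
    "she likes to read books",
    "she likes to read novels",
    "she likes to sing",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])          # words that have a successor
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs

def prob(current: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | current)."""
    if unigrams[current] == 0:
        return 0.0
    return bigrams[(current, nxt)] / unigrams[current]

print(prob("to", "read"))  # "read" follows "to" in 2 of 3 sentences
print(prob("to", "sing"))  # "sing" follows "to" in 1 of 3 sentences
```

Neural models such as BERT replace these raw counts with learned vector representations, but the underlying question – how likely is this word in this position? – stays the same.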
In order to recognize and extract information from a text, the text must first be processed and enriched with meta-information. First, the computer reads the text into its memory as an endlessly long string of zeros and ones and breaks it down into individual words and sentences using knowledge from the dictionary and the language model. Special characters and positions that we humans virtually ignore are of great importance here: the space character (also called blank or whitespace), the end of a sentence and the end of a self-contained word sequence such as a heading or footnote. Each separated word is assigned a word type and a base form – if it appeared in inflected form – or it is given the attribute “unknown”. The latter, however, only means that the word did not appear in the dictionary; the computer nevertheless knows from the language model that it should be a masculine singular noun in the accusative case and is similar in meaning to words such as cake or yeast plait (“She is baking a Googlehuup today.”).
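The lookup step can be sketched as follows. The mini-lexicon and its part-of-speech labels are invented for illustration; a word missing from the lexicon simply receives the attribute “unknown”, just like “Googlehuup” above:

```python
# Toy version of the first editing steps: split a sentence into tokens
# and look each one up in a tiny invented lexicon of (word type, base form).
LEXICON = {
    "she":    ("pronoun", "she"),
    "is":     ("verb", "be"),
    "baking": ("verb", "bake"),
    "a":      ("article", "a"),
    "today":  ("adverb", "today"),
}

def annotate(sentence: str):
    tokens = sentence.strip(".!?").split()
    result = []
    for tok in tokens:
        # Unknown words keep their surface form and are tagged "unknown".
        pos, lemma = LEXICON.get(tok.lower(), ("unknown", tok))
        result.append((tok, pos, lemma))
    return result

for tok, pos, lemma in annotate("She is baking a Googlehuup today."):
    print(f"{tok:12} {pos:10} {lemma}")
```

A real pipeline would now hand the “unknown” token to the language model, which can still guess its word type and likely meaning from the surrounding words.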
These editing steps embody a great deal of knowledge about language theory, language rules and, above all, language pragmatics, e.g. the use of subject-predicate-object order (“I will show you. – I’ll show you! The dog bites the girl.”). Especially with unstructured texts, only detective work and tricks help to push the recognition rate up by tenths of a percent.
As an illustrative example, the end-of-sentence recognition is presented:
“Did Dr. H.-C. Mustermann deposit €3.50 at A.B.C.-Bank e.V. on April 1 at 8.00 a.m.?”
A sentence does not always end with a “.”, but can also end with “?”, “!”, “-”, “:” or “;”. Conversely, a “.” can also occur in abbreviations, dates, times, currency amounts, e-mail addresses, URLs, consecutive numbering, etc. Often there are only sentence and text fragments such as headings, page numbers, greetings, lists, enumerations, tables, footnotes, references or a table of contents. Here the computer has to decide where best to set an artificial end of sentence.
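A naive version of such end-of-sentence recognition might look like this. The abbreviation list is a small illustrative sample; real systems combine far more elaborate heuristics with the language model:

```python
import re

# Naive sentence-boundary detector: a "." ends a sentence only if it is
# not part of a known abbreviation and not followed by a digit (as in 8.00).
# ABBREVIATIONS is a tiny illustrative sample, not a complete list.
ABBREVIATIONS = {"dr.", "prof.", "a.m.", "p.m.", "e.g.", "etc."}

def split_sentences(text: str):
    sentences = []
    start = 0
    # Match runs of ., ?, ! that are followed by whitespace or end of text.
    for m in re.finditer(r"[.?!]+(?=\s|$)", text):
        tokens = text[start:m.end()].split()
        last = tokens[-1].lower() if tokens else ""
        if last in ABBREVIATIONS:
            continue  # the "." belongs to an abbreviation, keep reading
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    return sentences

text = "Dr. Mustermann arrived at 8.00 a.m. sharp. Was he late? No!"
print(split_sentences(text))
```

Even this small sketch shows why the task is detective work: every new abbreviation, number format or fragment type needs its own exception.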
In the next part, you will find out what types of information the computer can extract and where it is even superior to humans.
Computer scientist and computer linguist
Ulrike is a computer scientist and computer linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.