How Can a Computer Learn to Understand a Natural Language? (Part 3: Text Analysis)
For better or worse, most of us remember text analysis from German lessons at school: we had to work out the key who-what-when-where-why facts and summarize the original text in a few sentences. Can’t a computer do that too? And can it also read between the lines?
Our series of articles on how a computer can learn to understand a natural language began with a description of the complexity and mutability of natural language. Next, we presented the coding and pre-processing of text, the results of which are the prerequisite for analyzing and extracting information. A digital dictionary including a language model, the text broken down into words and sentences, and numerous pieces of grammatical and statistical meta-information – without all of this, no well-founded automated text analysis is possible.
This third and final part is dedicated to answering the following questions:
What words or phrases provide information and how does the computer recognize them? What types of information are there?
Concrete Data Values
Examples of concrete data values are:
- Date and time information: 1.1.1703, 1703-Jan-01, New Year 1703
- Currencies: 17 DM and three pfennigs, $ 17.03, 17.030.000.- Euro
- URLs and e-mail addresses
Such format-oriented structures can be captured using rule-based algorithms or regex patterns. A regex pattern is a cryptic-looking, standardized character string that describes the desired format. For a date in the format dd.mm.yyyy restricted to the 20th century (1900-1999), a suitable regex pattern is [0123][0-9]\.[01][0-9]\.19[0-9]{2}.
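As a minimal sketch, the pattern above can be applied in Python; the sample text and its dates are invented for illustration:

```python
import re

# The article's loose pattern for dd.mm.yyyy dates in the 20th century.
# It deliberately over-matches (e.g. it would accept day "39");
# stricter validation would be a separate second step.
DATE_PATTERN = re.compile(r"[0123][0-9]\.[01][0-9]\.19[0-9]{2}")

text = "The contract was signed on 24.12.1953 and renewed on 01.01.1990."
matches = DATE_PATTERN.findall(text)
print(matches)  # -> ['24.12.1953', '01.01.1990']
```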
Named Entities
Named entities are proper names that generally comprise the following categories:
- Personal names: Mustermann Max, Prof. Dr. Willibald Murke, Baroness Gunilla-Eulalia von und zu Grumpertshofen-Drachenstein
- Place and country names
- Organization, company, brand and product names: MORESOPHY GmbH, CONTEXTSUITE, Willibald-Murke-Stiftung für Experimentelle und Angewandte Erforschung der Satzende-Erkennung München-Sendling
The computer could make do with prefabricated look-up lists, which are extremely long yet never complete or up to date. A better approach is Named Entity Recognition (NER), a sub-area of information extraction. It uses rules describing the typical left and right neighbors (words or word sequences) of an entity. For example, expressions that follow “live in”, “come from” or “go to” and do not appear in the German lexicon are candidates for a place or country name.
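A toy sketch of this rule idea in Python: the trigger phrases come from the example above, while the tiny lexicon and the helper function are invented for illustration (a real system would use a full dictionary and many more rules):

```python
# Words that follow a trigger phrase and are not in the (here deliberately
# tiny) lexicon become place-name candidates.
LEXICON = {"the", "a", "city", "house", "in", "from", "they", "live", "come"}
TRIGGERS = ("live in", "come from", "go to")

def place_candidates(sentence: str) -> list[str]:
    candidates = []
    lowered = sentence.lower()
    for trigger in TRIGGERS:
        idx = lowered.find(trigger)
        if idx != -1:
            rest = sentence[idx + len(trigger):].split()
            if rest:
                word = rest[0].strip(".,!?")
                if word.lower() not in LEXICON:
                    candidates.append(word)
    return candidates

print(place_candidates("They come from Grumpertshofen."))  # -> ['Grumpertshofen']
```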
Another method is based on so-called word embeddings. All possible surrounding words of a given source word are determined and converted into a mathematical description using their relative positions and occurrence statistics. The result is a multidimensional vector per word (“word to vector”, word2vec). This allows semantic conclusions to be drawn:

- Similar words and phrases have similar word vectors (“She successfully passed her exam / final test / fishing license test.”).
- Word probabilities help fill gaps (“The apple ??? from the tree.” –> falls, fell, tumbles, bounces, …).
- Unknown words or phrases can be assigned a meaning (“In the trunk of his SPEEDY 2000 he found the spare tire.” SPEEDY 2000 –> car, new car, BMW, Mercedes, Audi, convertible …).
- Co-occurrences, i.e. strikingly frequent joint appearances of two or more words (“a delicate subject”, “by hook or by crook”, “first … secondly …”), are easy to find.
- Statements with implausible content are recognized (“His name is Kuala Lumpur. She lives in Borussia Dortmund.”).
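The core idea of comparing word vectors can be illustrated with hand-made toy vectors and cosine similarity. The three-dimensional values below are invented; real embeddings are learned from large corpora and have hundreds of dimensions:

```python
import math

# Invented 3-dimensional toy vectors standing in for learned embeddings.
vectors = {
    "exam":  [0.9, 0.1, 0.0],
    "test":  [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.4],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Semantically similar words get similar vectors, hence a higher cosine.
print(cosine(vectors["exam"], vectors["test"])
      > cosine(vectors["exam"], vectors["apple"]))  # -> True
```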
A higher-quality word embedding that describes a word’s context even better is obtained with a neural network. Such a network can not only compute over the current sentence but also carry features over from earlier input, so a paragraph or an entire text can be examined and stored as a coherent unit.
Domain Mapping via Concise and Specific Expressions
Lexical words that you would highlight with a marker pen when reading the text provide important information about its content and topic. In most cases these are less frequently used but concise nouns (which the language model recognizes from its statistical descriptions) or domain-specific vocabulary. For example, “mortgage, discount, effective annual interest rate” points to a text about forms of investment and the banking industry, while “rear window, heated wing mirrors, power steering” points to the automotive industry.
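The marker-pen intuition corresponds roughly to TF-IDF weighting: a word scores highly in a document if it is frequent there but rare across the collection. A minimal sketch with invented example documents:

```python
import math
from collections import Counter

# Invented mini-corpus for illustration.
docs = [
    "the mortgage discount and the effective annual interest rate",
    "the rear window and the heated wing mirrors",
    "the power steering and the rear window",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    words = doc.split()
    tf = Counter(words)[term] / len(words)          # term frequency
    df = sum(1 for d in corpus if term in d.split())  # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# A concise domain word outscores a ubiquitous function word.
print(tf_idf("mortgage", docs[0], docs) > tf_idf("the", docs[0], docs))  # -> True
```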
In an elaborate and complex process, the computer can learn to assign a text to a specific topic category. To do this, it also needs a well-considered classification scheme that is unambiguous and contains all the content categories relevant for its own purposes. At moresophy, we are guided by the internationally used taxonomy standard from IAB (see also the moresophy blog post on IAB), which comprises a good 30 categories. Assigning a content category also simplifies the disambiguation of homonyms (one word standing for different concepts; in German, for instance, “Bank” can mean a financial institution or a bench).
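A drastically simplified illustration of category assignment: count cue words per category and pick the best match. The categories and word lists below are invented; a production system would use trained models and the full IAB taxonomy:

```python
# Hypothetical mini-classifier; real taxonomies have ~30 top-level
# categories and classifiers are trained, not hand-written.
CATEGORIES = {
    "finance":    {"mortgage", "discount", "interest", "bank"},
    "automotive": {"window", "mirrors", "steering", "tire"},
}

def classify(text: str) -> str:
    tokens = set(text.lower().split())
    # Pick the category sharing the most cue words with the text.
    return max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & tokens))

print(classify("the heated mirrors and power steering"))  # -> automotive
```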
“Between the lines”
A computer can “calculate” characteristics of an author’s typical writing style, for example: short or long sentences; few or many adjectives, subordinate clauses, foreign words, technical terms, passages of direct speech or characters; a small or large vocabulary range. This yields a kind of “fingerprint” that helps identify authorship or prove plagiarism.
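Two of these style features, average sentence length and vocabulary range (type-token ratio), can be sketched in a few lines; the sentence-splitting heuristic is deliberately naive:

```python
# Toy stylometric "fingerprint"; real attribution systems use
# many more features and proper tokenization.
def fingerprint(text: str) -> dict:
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    words = text.lower().split()
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
    }

fp = fingerprint("Call me Ishmael. Some years ago I went to sea.")
print(fp["avg_sentence_len"])  # -> 5.0
```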
There is much more to discover: a text can be in a positive, neutral or negative mood (sentiment: “The optimal design improves acceptance in an excellent way.”, “The miserable resignation and the dismaying downfall made it a pity.”), it can express an emotion (“Yay, everything was great and beautiful!”, “Grr, selling such terrible junk should be punished!”), it can fall into a risk category (it is about crime, weapons, drugs or the like) or it can contain swear words, obscenities and hate speech. It is even possible to determine which groups of people would read the text with great interest: seniors, beauty queens and kings, nutrition- and figure-conscious people, entertainment freaks and other personas.
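Lexicon-based scoring, the simplest variant of sentiment analysis, can be sketched as follows; the positive and negative word lists are invented for illustration (real lexicons contain thousands of weighted entries):

```python
# Invented mini-lexicons for a toy sentiment scorer.
POSITIVE = {"optimal", "excellent", "great", "beautiful", "improves"}
NEGATIVE = {"miserable", "dismaying", "terrible", "junk", "pity"}

def sentiment(text: str) -> str:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The optimal design improves acceptance in an excellent way."))
# -> positive
```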
Misspellings
Some information-bearing words cannot be recognized because they are misspelled and therefore cannot be found in the internal models. Spelling and grammatical errors (e.g. “hart attack” instead of “heart attack”, “Eifel Tower” instead of “Eiffel Tower”) are caught by the correction service of a word processing program, but in free text they simply remain. “Saew”, “Drizzelbranch” and “Screechpull” are examples of keyboard slips, misspellings and reading errors from automated optical character recognition (OCR). They can occur at any point and can be detected and corrected using similarity measures such as the Levenshtein distance, albeit at the cost of a very long program runtime.
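The Levenshtein distance mentioned above is the minimum number of insertions, deletions and substitutions needed to turn one string into another; a standard dynamic-programming implementation:

```python
# Classic Levenshtein distance with a rolling one-row DP table.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

# The keyboard slip "Saew" is one edit away from "Saw".
print(levenshtein("Saew", "Saw"))  # -> 1
```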
All in all, the example text from the previous article “Coding and pre-processing text” could be “understood” by a computer as follows:
“Call me Ismaelia-Mercedes (personal name, score 0.99; product name, score 0.01). Seven and a half years ago on December 24, 1873 (date 1873-12-24) – no matter how long ago exactly – when I had little or only $26 (currency $26.00) money in my MyCashPocket (unknown; wallet, purse) and nothing in particular appealed to me in the land of Patagonia (entity: place), I thought I would sail around a little as a figurehead on a three-master and see the watery part of the earth, namely the Bay of Biscay (entity: place).”
Content category: Travel, cruises
Risk category: none
Sentiment: neutral
Emotion: neutral
Target group: Adventurers and travel enthusiasts
Authorship: “Moby Dick, Herman Melville”, score 0.87
Outlook
Computational linguistics has by no means reached its goal. Natural language also includes irony, humor, sarcasm, lies, idioms, puns, rhetorical devices and paradoxes. Spoken language adds further subtleties: prosody, emphasis, pauses, speech melody and speed. All of these can reinforce, reverse or modify the literal meaning. No easy task for the further development of the “linguistically gifted” computer!
Computer scientist and computer linguist
Ulrike is a computer scientist and computer linguist. The thematic focus of her blog posts is on the comprehensibility and usability of the human-machine interface.
More articles from Content in Context