The problem of Simplification in Natural Language Processing (NLP)


There are so many extraordinary AI applications out there, but so few that aim to solve our biggest challenges as humans. Think about education, for example. What are we really doing to improve our learning systems? For the first time ever, almost all of the content generated by humankind is available out there. Yet only a few can access it.

Can we use AI technologies to balance unequal access to information?

During this year's CERN WebFest I presented SYNTHIA, an app that aims to change education by exploiting AI technologies and making content more accessible to everyone. How? The idea is to leverage Natural Language Processing (NLP) techniques to perform tasks like text summarization, paraphrasing, speech-to-text transcription, and more. You can play around with the app (it's free!), explore its source code and propose new developments.

The most direct path to reducing inequality is rebalancing education, and for that we need to talk about our challenges. Let's start with the problem of content complexity.

Keep it simple

One of the biggest barriers to the learning process is the inability to access complex content. But what does “complex” mean?

Content complexity can occur at different levels:

  • Lexical complexity: happens when the document or source contains infrequent words that are unknown to the receptor.
  • Syntactic complexity: happens when the source contains long sentences with structures that are difficult for the receptor to parse.
  • Semantic complexity: accounts for the amount of background knowledge required to understand the meaning of the source.

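To make the lexical level concrete, here is a minimal sketch of one common approach: score a text by the share of its words that fall outside a list of frequent words. The word list below is a tiny illustrative stand-in, not a real frequency lexicon like the ones production systems use.

```python
# A rough lexical-complexity score: the fraction of tokens in a text
# that do NOT appear in a list of "common" words. The list below is a
# tiny placeholder for a real word-frequency lexicon.
COMMON_WORDS = {
    "the", "a", "is", "of", "cell", "body", "and", "in", "to", "it",
}

def lexical_complexity(text: str) -> float:
    """Fraction of tokens not found in COMMON_WORDS (0.0 for empty text)."""
    tokens = [w.strip(".,;").lower() for w in text.split()]
    if not tokens:
        return 0.0
    rare = [w for w in tokens if w not in COMMON_WORDS]
    return len(rare) / len(tokens)

# The second sentence uses more infrequent words, so it scores higher.
print(lexical_complexity("The cell is the basic unit of the body"))          # ~0.22
print(lexical_complexity("The mitochondrion is the locus of oxidative phosphorylation"))  # 0.5
```

In practice, a real system would replace `COMMON_WORDS` with corpus-derived word frequencies, but the intuition is the same: rarer words mean higher lexical complexity for the receptor.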
Of the three levels, semantic complexity is probably the least studied. It refers to the number of “things” that we can talk about in a given domain, including all the objects in the domain, their attributes, and the relationships between them.

The semantic layer of a text is more tacit than its syntactic structure and, as a result, is more difficult to measure.

While there are well-established academic and commercial measures of syntactic complexity, measuring semantic complexity remains an open challenge.

How can we assess semantic complexity? Graphs!

Graphs are a data structure employed extensively in fields like computer science. Social networks, molecular structures, financial transactions, biological networks, transportation systems: these are all examples of domains that can be modeled as graphs.

Graphs capture interactions (edges) between individual units (nodes), allowing relational knowledge to be stored, accessed and analyzed. For this reason, they are playing a key role in modern Machine Learning.
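As a minimal sketch of that idea, a graph can be stored as an adjacency map: each node (a concept) maps to the set of concepts it is directly related to. The concepts below are illustrative placeholders, not drawn from any real knowledge base.

```python
# A tiny knowledge graph as an adjacency map. Nodes are concepts
# (all illustrative); each maps to the set of its directly related concepts.
knowledge_graph = {
    "photosynthesis": {"chlorophyll", "glucose", "sunlight"},
    "chlorophyll": {"photosynthesis", "chloroplast"},
    "chloroplast": {"chlorophyll"},
    "glucose": {"photosynthesis", "energy"},
    "sunlight": {"photosynthesis"},
    "energy": {"glucose"},
}

def neighbors(concept: str) -> set:
    """Concepts directly connected to the given one (empty set if unknown)."""
    return knowledge_graph.get(concept, set())

# Relational knowledge is stored once and queried directly.
print(neighbors("photosynthesis"))
```

Real systems use dedicated graph libraries or databases, but the core property is the same: relations are first-class data that can be stored, accessed and analyzed.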

Sanja Štajner and Ioana Hulpuș developed a method to automatically estimate the conceptual complexity of texts by exploiting a number of graph-based measures over a large knowledge base. Using a high-quality corpus for English language learners, they showed that graph-based measures of individual text concepts, as well as of the way those concepts relate to each other in the knowledge graph, have high discriminative power when distinguishing between two versions of the same text.

Automatic Assessment of Conceptual Text Complexity Using Knowledge Graphs. Source: Serbian AI Society
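To give a feel for this family of metrics, here is a sketch of one of the simplest graph-based signals: how many hops separate two concepts in a knowledge graph. Štajner and Hulpuș compute richer measures over a large knowledge base like DBpedia; this breadth-first shortest-path distance, over an invented toy graph, is only a minimal stand-in for that idea.

```python
from collections import deque

# Toy knowledge graph (illustrative concepts and relations only).
graph = {
    "aphasia": {"language", "brain"},
    "language": {"aphasia", "word", "sentence"},
    "brain": {"aphasia", "neuron"},
    "word": {"language"},
    "sentence": {"language"},
    "neuron": {"brain"},
}

def hop_distance(start: str, goal: str) -> int:
    """Breadth-first search: number of edges on the shortest path, -1 if none."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, set()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1  # concepts are not connected

# word → language → aphasia → brain → neuron: 4 hops apart.
print(hop_distance("word", "neuron"))
```

The intuition: concepts that sit far apart in the knowledge graph require more background knowledge to connect, so texts whose concepts are loosely related tend to be conceptually harder.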


If we succeed in capturing content complexity, we'll be on our way to simplifying existing information and making it more accessible to everyone.

Simplification has a variety of important societal applications, like increasing accessibility for those with cognitive disabilities such as aphasia, dyslexia, and autism, or for non-native speakers and children with reading difficulties.

That said, content simplification as a subfield of Natural Language Processing (NLP) is at a very early stage, which results in a lack of robust data sources and techniques. In fact, the subjective nature of simplification itself is at the core of the challenge. What is simple or complex for one person might not be for another, and even if we could agree on a complexity assessment, we would still find different levels of understanding (e.g. very complex vs. somewhat complex).

Interested in these topics? Follow me on LinkedIn or Twitter.




Diego Lopez Yse
