The problem of Simplification in Natural Language Processing (NLP)
How to accelerate understanding by reducing content complexity
There are so many extraordinary AI applications out there, but so few looking to solve our big challenges as humans. Think about education, for example. What are we really doing to improve our learning systems? For the first time ever, almost all the content generated by humankind is available online. Yet only a few can access it.
Can we use AI technologies to balance unequal access to information?
During this year’s CERN WebFest I presented SYNTHIA, an app that aims to change education by exploiting AI technologies and making content more accessible to everyone. How? The idea is to leverage Natural Language Processing (NLP) techniques to perform tasks like text summarization, paraphrasing, speech-to-text transcription, and more. You can play around with the app (it’s free!), explore its source code, and propose new developments.
A direct path to reducing inequality is rebalancing education, and for that we need to talk about our challenges. Let’s start with the problem of content complexity.
Keep it simple
One of the biggest barriers to the learning process is the inability to access complex content. But what does “complex” mean?
- Lexical complexity: the source contains infrequent words that are unknown to the reader.
- Syntactic complexity: the source contains long or convoluted sentences that are difficult for the reader to parse.
- Semantic complexity: the amount of background knowledge required to understand the meaning of the source.
Of the three levels, semantic complexity is probably the least studied. It refers to the number of “things” we can talk about in a given domain, including all the objects in the domain, their attributes, and the relationships between them.
The semantic layer of a text is more tacit than its syntactic structure and, as a result, more difficult to measure.
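The first two notions, at least, can be approximated with simple surface statistics. Below is a minimal sketch: it treats the share of rare words as a lexical-complexity proxy and average sentence length as a syntactic-complexity proxy. The tiny frequency table is invented purely for illustration; a real system would use corpus-derived frequencies.

```python
import re

# Illustrative word frequencies only -- all values here are made up.
# A real system would use frequencies estimated from a large corpus.
WORD_FREQ = {
    "the": 1.0, "cat": 0.6, "sat": 0.5, "on": 0.9, "mat": 0.4,
    "feline": 0.05, "reposed": 0.01, "upon": 0.3, "rug": 0.2,
}

def lexical_complexity(text: str) -> float:
    """Share of words below a frequency threshold (rarer words = harder)."""
    words = re.findall(r"[a-z]+", text.lower())
    rare = [w for w in words if WORD_FREQ.get(w, 0.0) < 0.1]
    return len(rare) / len(words) if words else 0.0

def syntactic_complexity(text: str) -> float:
    """Average sentence length in words (longer sentences = harder, roughly)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    return sum(lengths) / len(lengths) if lengths else 0.0

simple = "The cat sat on the mat."
fancy = "The feline reposed upon the rug."
```

Both sentences say the same thing, but the second scores higher on the lexical proxy because “feline” and “reposed” are rare. Proxies like these underlie classic readability formulas; semantic complexity is the part they miss.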
How can we assess semantic complexity? Graphs!
Graphs are a data structure used extensively in fields like Computer Science. Social networks, molecular structures, financial transactions, biological networks, transportation systems: these are all domains that can be modeled as graphs.
Graphs capture interactions (edges) between individual units (nodes), allowing relational knowledge to be stored, accessed and analyzed. For this reason, they are playing a key role in modern Machine Learning.
Sanja Štajner and Ioana Hulpus developed a method to automatically estimate the conceptual complexity of texts by exploiting a number of graph-based measures over a large knowledge base. Using a high-quality corpus for English language learners, they show that graph-based measures of individual text concepts, as well as of how those concepts relate to each other in the knowledge graph, have high discriminative power when distinguishing between two versions of the same text.
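To make the idea concrete, here is a toy sketch of one such graph-based measure: the average shortest-path distance between the concepts a text mentions. The miniature knowledge graph and the concept lists below are invented for illustration (the actual method works over a large knowledge base such as DBpedia and combines several measures), but the intuition carries over: concepts that sit far apart in the graph suggest the text demands more background knowledge to connect them.

```python
import networkx as nx

# Toy knowledge graph: nodes are concepts, edges are semantic relations.
# Invented for illustration; a real knowledge base has millions of nodes.
kg = nx.Graph()
kg.add_edges_from([
    ("dog", "animal"), ("cat", "animal"), ("animal", "organism"),
    ("organism", "cell"), ("cell", "mitochondrion"),
    ("mitochondrion", "ATP"),
])

def avg_pairwise_distance(graph, concepts):
    """Mean shortest-path distance between all pairs of mentioned concepts.

    Larger values mean the text links concepts that are far apart in the
    knowledge graph, i.e. it is conceptually more demanding.
    """
    pairs = [(a, b) for i, a in enumerate(concepts) for b in concepts[i + 1:]]
    dists = [nx.shortest_path_length(graph, a, b) for a, b in pairs]
    return sum(dists) / len(dists) if dists else 0.0

easy_text_concepts = ["dog", "cat", "animal"]          # tightly clustered
hard_text_concepts = ["dog", "mitochondrion", "ATP"]   # spread across the graph
```

A text about dogs, cats, and animals stays within one neighborhood of the graph, while a text connecting dogs to mitochondria and ATP has to bridge a much longer conceptual path, and the measure reflects that.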
If we succeed in capturing content complexity, we’ll be on our way to simplifying existing information and making it more accessible to everyone.
Simplification has a variety of important societal applications, like increasing accessibility for those with cognitive disabilities such as aphasia, dyslexia, and autism, or for non-native speakers and children with reading difficulties.
That said, content simplification as a subfield of Natural Language Processing (NLP) is at a very early stage, which results in a lack of robust data sources and techniques. In fact, the subjective nature of simplification itself is at the core of the challenge. What is simple or complex for one person might not be for another, and even if we could agree on a complexity assessment, we would still find different levels of understanding (e.g. very complex vs. somewhat complex).