Paraphrasing in Natural Language Processing (NLP)

How Paraphrase Generation techniques can be used and measured

Jan 17, 2022

How much do you trust scientific publications? The growth in research publication combined with the availability of new digital technologies is a double-edged sword. Although an increase in the volume of publications may seem like a positive indicator of science advancing and knowledge spreading, the reality below the surface is quite different.

The Problematic Paper Screener project searches through published science and seeks out “tortured phrases” in order to find suspect work. A tortured phrase is an established scientific concept paraphrased into a nonsensical sequence of words. Academic fraudsters and modern plagiarists are turning words like “Artificial Intelligence” into “counterfeit consciousness”, and “Mean square error” into “mean square blunder”.

Learning about Paraphrase Generation methods helps us not only to spot these kinds of threats, but also to exploit the bright side of paraphrasing: expanding our understanding.

Paraphrase generation is the task of generating an output sentence that preserves the meaning of the input sentence with variations in word choice and grammar.

Two sentences are paraphrases if their meanings are equivalent but their words and syntax are different.

Paraphrasing can be used to aid comprehension, stimulate prior knowledge, and assist in writing-skills development.

Paraphrasing text can facilitate reading comprehension by transforming the text into a more familiar form. In the field of composition, it allows writers to restate ideas from other works or their own drafts so that the reworded language better suits a voice, flow, or line of argument.

How is it done?

Let’s start from the beginning. Paraphrasing is considered a subtask within the Natural Language Processing (NLP) discipline.

Natural Language Processing, or NLP, is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human languages.

Paraphrase Generation is the process of conveying the information in an original sentence or phrase using alternative words and word order. It may be performed through two main methods:

Rule-based: in which rules are created manually to transform the original text into semantically equivalent text, or paraphrases (e.g. using WordNet or a thesaurus to replace words in the original text with their synonyms). This may also include changing active voice into passive, adding or deleting function words, co-reference substitution, or changing part-of-speech, among others.
Machine Learning based: where paraphrases are created automatically from the data. Deep Learning, Generative Adversarial Networks (GANs) and Reinforcement Learning models are just some examples of the techniques used for automatic paraphrasing. In fact, paraphrasing can even be treated as a language translation challenge, often performed using a bilingual corpus and pivoting back and forth between the two languages. But everything changed with the creation of Transformers, a novel Artificial Neural Network model that completely revolutionized the paraphrasing landscape, as well as many other NLP tasks.
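To make the rule-based approach more concrete, here is a minimal sketch of synonym substitution. The tiny `SYNONYMS` dictionary and the `rule_based_paraphrase` function are hypothetical names made up for illustration; a real system would rely on a much larger resource such as WordNet and would need to handle word senses and inflections.

```python
import re

# Hypothetical hand-made thesaurus, used here only for illustration.
# A real system might query WordNet or a full thesaurus instead.
SYNONYMS = {
    "quick": "fast",
    "big": "large",
    "begin": "start",
}

def rule_based_paraphrase(sentence: str) -> str:
    """Replace every word found in SYNONYMS, preserving initial capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        synonym = SYNONYMS.get(word.lower())
        if synonym is None:
            return word  # no rule applies, keep the original word
        return synonym.capitalize() if word[0].isupper() else synonym

    # Only alphabetic runs are treated as words; punctuation is left untouched.
    return re.sub(r"[A-Za-z]+", swap, sentence)

print(rule_based_paraphrase("The quick fox found a big problem."))
# The fast fox found a large problem.
```

Blind substitution like this ignores context, which is one way an established term can end up as a tortured phrase.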

Introduced in 2017, Transformers rapidly showed effective results at modeling data with long-range dependencies. Originally designed to solve NLP tasks, Transformers have since been applied to many other disciplines with remarkable results.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the image, respectively. Source: HarvardNLP
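For readers curious about what self-attention computes, the following is a simplified sketch of scaled dot-product attention in plain Python. It is only an illustration: an actual Transformer derives the queries, keys and values from learned projections and uses several attention heads, all of which are omitted here. The function names are my own.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Two toy token vectors attending to each other (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(tokens, tokens, tokens))
```

Because every token looks at every other token in a single step, distant words can influence each other directly, which is what helps with long-range dependencies.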

Today, Transformer-based architectures combined with self-supervised pre-training have produced some of the most effective automatic paraphrasing models:

GPT: developed by OpenAI, Generative Pre-trained Transformer (GPT) models require a small amount of input text to generate large volumes of relevant and sophisticated outputs.
BERT: or Bidirectional Encoder Representations from Transformers, is a large model designed by Google. While conceptually simple, BERT obtained new state-of-the-art results on eleven NLP tasks when it was introduced, including question answering, named entity recognition and other tasks related to general language understanding.
PEGASUS: which uses a self-supervised pre-training objective for Transformer encoder-decoder models to improve performance on abstractive summarization. Although designed for summarization tasks, it has also shown strong performance on paraphrasing.

How to choose the right model?

Evaluation of NLP systems can be classified into intrinsic and extrinsic methods, which can be performed either automatically or manually.

In an intrinsic evaluation, the quality of an NLP system's outputs is evaluated against a predetermined ground truth (reference text), whereas an extrinsic evaluation assesses a system's outputs based on their impact on the performance of other NLP systems.

Intrinsic evaluation

For example, in an intrinsic evaluation of a paraphrase generation system, we would ask: does the generated paraphrase preserve the meaning of the original text? These types of metrics are hard to automate since it's difficult to find an acceptable ground truth (reference paraphrases). Even for humans, it's very challenging to produce complete and ideal reference sentences or phrases.

Extrinsic evaluation

On the other hand, with an extrinsic evaluation we might ask: do the generated paraphrases significantly improve the performance of a question-answering model? Can the generated paraphrases be used as a substitute for the original text to train a text classification model? If so, it can be concluded that, extrinsically, the paraphrases under consideration are useful.
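As a toy illustration of an extrinsic check, the sketch below measures how often a downstream classifier makes the same prediction for an original sentence and for its paraphrase. Everything here is an assumption made for the example: the keyword-based `toy_sentiment` classifier, the word lists and the sample sentences stand in for a real trained model and a real dataset.

```python
# Hypothetical keyword lists standing in for a trained sentiment model.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def toy_sentiment(text: str) -> str:
    """Label a text by counting positive and negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

def prediction_agreement(originals, paraphrases, classify=toy_sentiment):
    """Fraction of pairs where the classifier gives both texts the same label."""
    matches = sum(classify(o) == classify(p) for o, p in zip(originals, paraphrases))
    return matches / len(originals)

originals = ["the film was great", "the service was terrible"]
paraphrases = ["the movie was excellent", "the service was awful"]

print(prediction_agreement(originals, paraphrases))
# 0.5
```

The second pair disagrees only because "awful" is missing from the toy word list, a reminder that an extrinsic score reflects the downstream model as much as the paraphrases themselves.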

The best of both worlds

These metrics don't need to run separately, and can actually be integrated into the same evaluation method. For example, in a study performed by Hailu, Yu and Fantaye, the authors developed a hybrid model that uses both intrinsic and extrinsic evaluation methods.

Intrinsically, they compared the generated paraphrases directly with the original text, discouraging overlapping words and encouraging the substitution of words with alternatives.

Extrinsically, they ran two sentiment classification models (one with original sentences and the other with paraphrases), and then compared their prediction results.

But paraphrasing is complex, and we also need to take into account things like grammatical correctness and fluency.

Main metrics

The evaluation of paraphrase generation faces difficulties similar to those of evaluating machine translation: the quality of a paraphrase often depends on its context and on its degree of lexical dissimilarity from the source phrase. In this regard, while originally used to evaluate machine translations, the Bilingual Evaluation Understudy metric (BLEU) has been used to evaluate paraphrase generation models as well. However, a sentence often has several lexically different but equally valid paraphrases, which hurts BLEU and other similar evaluation metrics.
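To see why this hurts BLEU, here is a simplified, single-reference sketch of the metric: clipped n-gram precision combined with a brevity penalty. It is not the official implementation. To keep short examples readable it defaults to unigrams and bigrams (standard BLEU goes up to 4-grams) and applies no smoothing; the function names are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))        # 1.0
print(simple_bleu("a feline rested upon the rug", reference))  # 0.0
```

The second candidate is a reasonable paraphrase, yet it scores zero because it shares no bigram with the reference. The exact copy, which is not a paraphrase at all, gets a perfect score.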

So, is it possible to automatically assess the quality of machine-generated paraphrases? Yes, and some of the metrics created for that are:

ParaMetric, a method that provides an objective measure of quality using a collection of multiple translations whose paraphrases have been manually annotated. It compares the paraphrases discovered by automatic paraphrasing techniques against gold standard alignments of words and phrases within equivalent sentences. Even though it is a good attempt, it is very challenging to prepare complete and ideal reference paraphrases.
PEM, which is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in the metric is a robust and shallow semantic similarity measure based on pivot language N-grams, which makes it possible to approximate adequacy independently of lexical similarity. Although PEM has been shown to correlate well with human judgments, its requirement of a large parallel text for training is as hard to satisfy as generating the paraphrases themselves.
PINC, designed to be used in conjunction with BLEU and help cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, the idea is to use PINC to fill that gap by measuring how many of the paraphrase's n-grams are new with respect to the source text, while relying on BLEU to measure the adequacy and fluency of the generated paraphrases. This way, BLEU and PINC are used together as a 2-dimensional scoring metric.
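The idea behind PINC can be sketched in a few lines. As it is commonly described, for each n-gram order PINC takes the share of the candidate's n-grams that do not appear in the source, and then averages those shares over the orders. The sketch below follows that description; skipping orders for which the candidate is too short is my own simplification, and the function names are hypothetical.

```python
def ngram_set(tokens, n):
    """Set of distinct contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """Average share of candidate n-grams that are absent from the source."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_grams = ngram_set(cand, n)
        if not cand_grams:
            continue  # assumption: skip orders the candidate is too short for
        overlap = len(cand_grams & ngram_set(src, n))
        scores.append(1 - overlap / len(cand_grams))
    return sum(scores) / len(scores) if scores else 0.0

source = "the cat sat on the mat"
print(pinc(source, "the cat sat on the mat"))      # 0.0
print(pinc(source, "a feline rested upon a rug"))  # 1.0
```

A high PINC score only says the wording changed, not that the meaning survived, which is why it is read together with BLEU rather than on its own.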

In conclusion

Some institutions have been documented to impose content production targets that are nearly impossible to meet. In some cases, doctors have to get published to get promoted, but many are too busy in the hospital to do so. Fraud through machine-generated content in scientific publications is likely to get worse, and we need to improve our understanding of the problem if we want to get better at fighting it.

But on the other hand, NLP is bringing all sorts of benefits to our lives, accelerating our learning and knowledge. And this effect is also likely to increase. Can you imagine a time when machines augment us with all the available content out there? A time when we can all access any content we want, no matter where it is or how complex it might be? I certainly can. And that time is right around the corner.
