Abstractive Text Summarization in Python: Comparing Transformer Models

A Hands-On Guide to Summarizing Texts with BART, T5, FLAN-T5, and PEGASUS

In a world flooded with information, abstractive text summarization has become an essential tool for making sense of it all, turning lengthy, complex documents into concise, readable summaries. Unlike extractive methods, which simply copy and paste key phrases from the source, abstractive summarization understands the main ideas and rephrases them in new words and sentences, generating novel text that captures the core meaning. It doesn’t just trim the fat; it rewrites the message.

From legal contracts to customer service logs, the potential use cases span across industries:

  • 🗞 News Aggregators: Summarize daily headlines into bite-sized briefs.
  • 📄 Legal Tech: Reduce multi-page agreements into key clauses.
  • 💬 Customer Support: Turn verbose user complaints into quick-read summaries.
  • 🏥 Healthcare: Condense medical notes and records for faster decision-making.

In this article, we’ll put leading transformer models to the test — BART, FLAN-T5, T5, and PEGASUS — to see how they perform on abstractive summarization tasks using Python and the Hugging Face transformers library.

This evaluation focuses on the out-of-the-box performance of these models, without any task-specific fine-tuning. This allows us to assess their general abstractive capabilities and understand their inherent biases and strengths.

End-to-End Summarization Pipeline in Python

Evaluating the models in this untuned state establishes a crucial baseline: it exposes each model’s inherent strengths and weaknesses before any fine-tuning or decoding optimizations are layered on, which helps guide initial model selection and set realistic expectations.

Below is a complete Python script that takes a long text input, splits it into manageable chunks, runs each chunk through a summarization model, and re-summarizes the combined output to produce a concise summary.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

text = """
Artificial Intelligence (AI) has transformed the way we interact with technology.
From voice assistants that help us manage daily tasks to autonomous vehicles navigating city streets,
AI-powered applications are becoming increasingly prevalent in everyday life.
One of the most impactful subfields of AI is natural language processing (NLP),
which enables machines to understand, interpret, and generate human language.

In the business world, NLP is revolutionizing customer service through intelligent chatbots
and sentiment analysis tools that provide real-time feedback from consumer data.
In healthcare, it's being used to process clinical notes, assist in diagnostics, and even predict disease outbreaks based on unstructured data.
Governments and NGOs are leveraging NLP to analyze policy documents and news articles at scale,
allowing them to make more informed decisions faster.

Among the many tasks NLP can perform, text summarization stands out as a key capability.
As the volume of digital information continues to grow, users are often overwhelmed by the amount of reading required
to stay informed. Text summarization helps address this challenge by condensing long passages into shorter,
more digestible summaries, while preserving the most important content.

There are two main approaches to text summarization: extractive, where key sentences are selected from the original text,
and abstractive, which involves generating new sentences that capture the core meaning.
While extractive techniques are easier to implement and often reliable, they may lack coherence or fluency.
Abstractive models, on the other hand, require more advanced language understanding,
but can produce more natural and concise outputs - much like how a human might summarize a news article or report.

Recent advancements in deep learning, particularly with transformer-based architectures such as BART, T5, and PEGASUS,
have significantly improved the quality of abstractive summaries. These models are pre-trained on massive corpora
and fine-tuned for summarization tasks, enabling them to generate coherent, human-like summaries from complex documents.
As AI continues to evolve, it's likely that summarization tools will become even more accurate and context-aware,
making them indispensable in fields ranging from journalism to legal research to education.
"""

models = {
    "BART": "facebook/bart-large-cnn",
    "FLAN-T5": "google/flan-t5-base",
    "T5": "t5-large",
    "PEGASUS": "google/pegasus-xsum"
}

def chunk_text_by_sentence(text, tokenizer, max_tokens):
    """Split text into chunks of whole sentences that fit within max_tokens."""
    sentences = text.split(". ")  # Simple sentence splitting
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_tokens = tokenizer.encode(sentence, add_special_tokens=False)
        joiner = 1 if current_chunk else 0  # Budget one token for the ". " joiner
        # Reserve ~20 tokens for special tokens added at encoding time
        if current_length + len(sentence_tokens) + joiner <= max_tokens - 20:
            current_chunk.append(sentence)
            current_length += len(sentence_tokens) + joiner
        else:
            if current_chunk:
                chunks.append(". ".join(current_chunk))
            current_chunk = [sentence]
            current_length = len(sentence_tokens)
    if current_chunk:
        chunks.append(". ".join(current_chunk))
    return chunks

def summarize_with_model(text, model, tokenizer, max_input_tokens=512, max_output_tokens=120, device="cpu"):
    chunks = chunk_text_by_sentence(text, tokenizer, max_input_tokens)
    partial_summaries = []

    # First pass: summarize each chunk independently
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=max_input_tokens).to(device)
        summary_ids = model.generate(
            inputs["input_ids"],
            max_length=max_output_tokens,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.9,
            num_return_sequences=1
        )
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        partial_summaries.append(summary)

    if len(partial_summaries) == 1:
        return partial_summaries[0]

    # Second pass: concatenate the partial summaries and re-summarize them
    combined_summary = " ".join(partial_summaries)
    inputs = tokenizer(combined_summary, return_tensors="pt", truncation=True, max_length=max_input_tokens).to(device)
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_output_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
        num_return_sequences=1
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
loaded_models = {}
loaded_tokenizers = {}

# Load each model/tokenizer pair; skip any that fail to download or load
for name, model_name in models.items():
    try:
        loaded_tokenizers[name] = AutoTokenizer.from_pretrained(model_name)
        loaded_models[name] = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    except Exception as e:
        print(f"Error loading model {name} ({model_name}): {e}")
        loaded_models[name] = None
        loaded_tokenizers[name] = None

# Run every successfully loaded model on the same input text
results = {}
for name, model in loaded_models.items():
    if model is not None:
        tokenizer = loaded_tokenizers[name]
        results[name] = summarize_with_model(text, model, tokenizer, device=device)

for model_name, summary in results.items():
    print(f"\n--- {model_name} ---\n{summary}")

How the Pipeline Works

1. Imports and Text Input
You start by importing Hugging Face’s transformers library (along with torch for device selection) and setting up a realistic input text. This simulates a long-form, multi-paragraph document on a topic (AI and NLP) rich enough to challenge the summarization models.

2. Model Selection
A dictionary of four transformer models is defined:

  • BART (facebook/bart-large-cnn)
  • FLAN-T5 (google/flan-t5-base)
  • T5 (t5-large)
  • PEGASUS (google/pegasus-xsum)

Each is known for its abstractive capabilities, but they were pre-trained on different data with different objectives.

3. Chunking Strategy
In the chunk_text_by_sentence() function, the input text is split into smaller chunks based on sentences. This approach aims to respect semantic boundaries within the text, providing more coherent context for the summarization models when processing long documents that exceed their token limits.
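To see the chunking behavior in isolation, here is a minimal sketch; it assumes chunk_text_by_sentence and text from the script above are in scope, and uses the BART tokenizer (chunk boundaries depend on whichever tokenizer you pass in):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
chunks = chunk_text_by_sentence(text, tokenizer, max_tokens=512)
for i, chunk in enumerate(chunks, start=1):
    # Report how much of the model's input budget each chunk consumes
    n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
    print(f"Chunk {i}: {n_tokens} tokens, starts with: {chunk[:60]!r}")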

4. Summarization Logic
The function summarize_with_model() performs the main task:

  • It encodes and summarizes each chunk individually.
  • Then, if there are multiple summaries, it concatenates them and performs a second pass to produce a final, coherent summary.
  • Sampling-based generation (do_sample=True) is used, so the generated summaries vary from run to run.

The model.generate() function utilizes various decoding strategies, controlled by parameters like temperature and top_p, which influence the randomness and predictability of the generated summary.
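As a rough illustration of how these parameters change the behavior, the sketch below contrasts the pipeline’s sampling setup with deterministic beam search, reusing the loaded BART model, text, and device from the script above:

bart_tok = loaded_tokenizers["BART"]
bart = loaded_models["BART"]

inputs = bart_tok(text, return_tensors="pt", truncation=True, max_length=512).to(device)

# Stochastic decoding: temperature flattens or sharpens the next-token
# distribution; top_k/top_p cut it down before sampling. Output varies per run.
sampled = bart.generate(inputs["input_ids"], max_length=120,
                        do_sample=True, top_k=50, top_p=0.95, temperature=0.9)

# Deterministic decoding: beam search tracks the 4 highest-scoring partial
# sequences and returns the best one. Output is reproducible.
beamed = bart.generate(inputs["input_ids"], max_length=120,
                       do_sample=False, num_beams=4)

print("Sampled:", bart_tok.decode(sampled[0], skip_special_tokens=True))
print("Beam:   ", bart_tok.decode(beamed[0], skip_special_tokens=True))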

5. Model Loading
Models and tokenizers are dynamically loaded into memory and pushed to GPU if available. Any model that fails to load is skipped gracefully with a printed error message.

6. Evaluation Loop
Each model is run independently on the same input, and the resulting summaries are printed out in a clean, labeled format. This allows for side-by-side qualitative comparisons.

Evaluation of Summarization Models

Each model was tested using the same input text. Below are the generated outputs, with strengths, limitations, and tuning tips for better performance.

✅ BART (facebook/bart-large-cnn)

BART (Bidirectional and Auto-Regressive Transformer): A sequence-to-sequence model pre-trained as a denoising autoencoder. It excels at text generation and comprehension tasks by combining a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT).

Output:

Artificial Intelligence (AI) has transformed the way we interact with technology. Natural language processing (NLP) enables machines to understand, interpret, and generate human language. NLP is revolutionizing customer service through intelligent chatbots and sentiment analysis tools that provide real-time feedback from consumer data.

Strengths:

  • Highly fluent and clearly structured.
  • Accurately reflects the first few sections of the input (AI’s influence, NLP’s role, and business applications).
  • Lexical quality is excellent — reads naturally and professionally.

Weaknesses:

  • Partial coverage: It doesn’t mention text summarization (the central topic in the second half).
  • Truncation bias: This is a clear case of front-loading (the summary focuses mostly on the first chunk of the input, and not the complete text).

Suggestions for Improvement:

  1. Increase max output tokens (max_output_tokens=180 or more) to allow for richer summaries.
  2. Post-process or re-summarize multiple partial summaries more carefully. Consider summarizing each chunk individually, collecting them, and then applying a final summarization pass with a higher max_length and a lower temperature (e.g., 0.7).
  3. Try deterministic decoding (do_sample=False, num_beams=4) for more stable and relevant outputs, as sketched below.
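A minimal sketch of these adjustments applied together; it reuses text and device from the script above, and the parameter values are the suggestions listed here rather than tuned constants:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

bart_tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
bart = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").to(device)

# bart-large-cnn accepts up to 1024 input tokens, double the 512 used above
inputs = bart_tok(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
summary_ids = bart.generate(
    inputs["input_ids"],
    max_length=180,     # larger output budget for a richer summary
    do_sample=False,    # deterministic decoding
    num_beams=4,        # beam search for more stable, relevant output
    early_stopping=True
)
print(bart_tok.decode(summary_ids[0], skip_special_tokens=True))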

⚠️ FLAN-T5 (google/flan-t5-base)

FLAN-T5: An enhanced version of the T5 model that has been fine-tuned on a large number of diverse tasks using natural language instructions. This instruction tuning improves its zero-shot and few-shot learning abilities across various NLP tasks, including summarization.

Output:

Text summarization is revolutionizing healthcare with a new field of AI. What is text summarization best for?

Strengths:

  • Includes “text summarization” and “AI” — the right domain.
  • Mentions healthcare, which is a key section of the original.

Weaknesses:

  • Very short and lacks substance. It doesn’t reflect the full document’s depth or structure.
  • The second sentence is interrogative, which is stylistically inappropriate unless the input is conversational or inquisitive.
  • The phrase “revolutionizing healthcare” is not an accurate paraphrase: NLP is applied in healthcare, but the core topic of the source is summarization across fields.

Suggestions for Improvement:

  1. Use a larger variant (flan-t5-xl) for better capacity and abstraction.
  2. Set a minimum generation length (min_length=60) and use beam search for more structured outputs.
  3. Reduce temperature to avoid hallucinated stylistic flourishes like rhetorical questions. (Hallucination refers to the generation of content in the summary that is not present in, and cannot be inferred from, the source text.) These tweaks are combined in the sketch below.
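A sketch of these tweaks together, reusing text and device from the script above; flan-t5-xl is several times larger than flan-t5-base, and the instruction wording in the prompt is an illustrative choice, not a fixed API:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

flan_tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl").to(device)

# FLAN-T5 is instruction-tuned, so a natural-language instruction helps
prompt = "Summarize the following article in a few sentences:\n" + text
inputs = flan_tok(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
summary_ids = flan.generate(
    inputs["input_ids"],
    min_length=60,      # force more substance than the one-liner above
    max_length=150,
    do_sample=False,    # deterministic decoding discourages rhetorical flourishes
    num_beams=4         # beam search for more structured output
)
print(flan_tok.decode(summary_ids[0], skip_special_tokens=True))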

❌ T5 (t5-large)

T5 (Text-to-Text Transfer Transformer): A transformer-based model that frames all NLP tasks, including summarization, as text-to-text problems. It uses a unified architecture and training objective across different tasks, aiming to simplify and improve transfer learning.

Output:

…enables them to comb through data with unprecedented detail and coherence. As AI continues to evolve, it’s likely that summarization tools will become even more accurate and context-aware. We’re excited to introduce the AI toolkit that drives our summarization capabilities for 2018 at NIPS 2018…

Strengths:

  • Mentions summarization tools, AI evolution, and broader applications, which aligns with the closing of the source text.

Weaknesses:

  • Hallucination: “AI toolkit… NIPS 2018” is completely invented and has no basis in the input.
  • Disjointed tone: Switches between formal and promotional voice (“we’re excited to…”).
  • Repeats ideas and uses vague constructs (e.g., “artificial intelligence is transforming everything…”).
  • Grammatical slip in the opening clause (“and enables them…” without a subject).

Suggestions for Improvement:

  1. Avoid sampling with T5: use do_sample=False, num_beams=4–6, repetition_penalty=2.0, as in the sketch below.
  2. This model wasn’t fine-tuned for summarization; consider using philschmid/t5-large-cnn-samsum or other fine-tuned variants.
  3. Lower temperature to reduce hallucinations and incoherence.
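A sketch of these fixes combined, reusing text and device from the script above; the "summarize: " prefix is the task convention T5 was pre-trained with, and the decoding values are the ones suggested here:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tok = AutoTokenizer.from_pretrained("t5-large")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-large").to(device)

# T5 frames every task as text-to-text and expects a task prefix
inputs = t5_tok("summarize: " + text, return_tensors="pt",
                truncation=True, max_length=512).to(device)
summary_ids = t5.generate(
    inputs["input_ids"],
    max_length=120,
    do_sample=False,         # avoid sampling with T5
    num_beams=4,
    repetition_penalty=2.0   # discourage repeated phrases
)
print(t5_tok.decode(summary_ids[0], skip_special_tokens=True))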

❌ PEGASUS (google/pegasus-xsum)

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization): A transformer model specifically pre-trained for abstractive summarization. Its pre-training task involves masking out important sentences from a document and then training the model to generate these “gap sentences,” making it particularly adept at creating concise summaries.

Output:

In our series of letters from African journalists, film-maker, and columnist Ahmedou Ould-Abdallah looks at the importance of text summarization in journalism.

Weaknesses:

  • This is a pure hallucination, completely unrelated to the original text.
  • PEGASUS models are extremely domain-sensitive. This output mirrors the boilerplate opening of BBC articles in its XSum training data (“In our series of letters from African journalists…”) rather than anything in your text.
  • It indicates the model is falling back on memorized training-data patterns instead of grounding its output in your input.

Suggestions for Improvement:

  1. Use google/pegasus-cnn_dailymail or google/pegasus-large instead; XSum is optimized for single-sentence summaries of news articles, not long-form documents (see the sketch after this list).
  2. You must fine-tune or adapt the model, or it will not generalize well.
  3. Avoid pegasus-xsum unless you're summarizing short BBC-style news articles.
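Swapping the checkpoint is a one-line change in the models dictionary from the script above; as a standalone sketch (again reusing text and device):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# pegasus-cnn_dailymail targets multi-sentence news summaries, a closer
# fit for long-form input than the single-sentence XSum variant
peg_tok = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
peg = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-cnn_dailymail").to(device)

inputs = peg_tok(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
summary_ids = peg.generate(inputs["input_ids"], max_length=120, num_beams=4)
print(peg_tok.decode(summary_ids[0], skip_special_tokens=True))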

Conclusion

Transformer models have brought us closer than ever to human-like summarization. Yet, as this test shows, performance varies widely depending on architecture, input structure, and decoding strategies. BART proves to be a reliable generalist. FLAN-T5, though smaller, benefits from thoughtful parameter tuning. Meanwhile, T5 and PEGASUS remind us that pretraining and domain alignment matter deeply.

While our tests used out-of-the-box models without fine-tuning or parameter optimization, building an effective summarization system in practice is an iterative and experimental process.

Ultimately, there is no one-size-fits-all model — the best summarization results come from combining strong models with smart preprocessing, chunking strategies, and decoding control. Experimentation is essential.

In future work, integrating summarization into real-time pipelines or domain-specific fine-tuning (e.g., legal, financial, medical) will be the key to unlocking its full value.

Interested in these topics? Follow me on LinkedIn, GitHub, or X
