Your Guide to Generative AI (GenAI)

Main concepts, opportunities, risks, and the future of GenAI

Diego Lopez Yse
11 min read · Jun 11, 2024
Photo by Andrea De Santis on Unsplash

Today, we can talk with digital models and guide their actions using human language. These models can output content that mirrors human creation, from writing text to generating pictures or videos, all powered by the forces of Generative AI (GenAI).

Demo from Adobe showcasing potential uses powered by GenAI. Source: YouTube

GenAI has the potential to unleash innovation, permit new ways of working, and amplify other AI systems and technologies. Since the release of OpenAI’s ChatGPT in 2022, GenAI has grown at an unprecedented pace and rapidly expanded into numerous industries.

Generative Artificial Intelligence (GenAI) is a type of Artificial Intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data.

GenAI involves models that generate new data similar to the data they were trained on. These models can learn the distribution of existing data and create new examples based on it.

A comparative view of AI, Machine Learning, Deep Learning, and GenAI. Source: ResearchGate

What does “generative” mean?

A generative model can take what it has learned from the examples it’s been shown and create something entirely new based on that information. The term “generative” describes a class of statistical models that contrasts with discriminative models:

  • Generative models can generate new data instances.
  • Discriminative models discriminate between different kinds of data instances.

A generative model could generate new photos of animals that look like real animals, while a discriminative model could tell a dog from a cat.
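To make the contrast concrete, here is a minimal, purely illustrative sketch in Python (NumPy and scikit-learn are my own choices, not tools mentioned in the article): the generative side learns each class’s distribution and samples brand-new instances from it, while the discriminative side only learns the boundary that separates the classes.

```python
# Toy contrast between a generative and a discriminative model on 1-D data.
# Illustrative only; real generative models (GANs, diffusion, LLMs) are far more complex.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two "kinds" of data instances: class 0 centered at -2, class 1 centered at +2.
x0 = rng.normal(-2.0, 1.0, size=500)
x1 = rng.normal(+2.0, 1.0, size=500)

# Generative view: learn the distribution of each class (here, just mean and std),
# then sample brand-new instances from it.
mu0, sd0 = x0.mean(), x0.std()
new_class0_samples = rng.normal(mu0, sd0, size=5)   # "new data instances"
print("Generated class-0 samples:", np.round(new_class0_samples, 2))

# Discriminative view: only learn the boundary that tells the classes apart.
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.concatenate([np.zeros(500), np.ones(500)])
clf = LogisticRegression().fit(X, y)
print("P(class 1 | x=0.5) =", round(clf.predict_proba([[0.5]])[0, 1], 3))
```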

Discriminative and generative models of handwritten digits. The discriminative model tries to tell the difference between handwritten 0’s and 1’s by drawing a line in the data space. If it gets the line right, it can distinguish 0’s from 1’s without ever having to model exactly where the instances are placed in the data space on either side of the line. Source: Google Developers

The generative aspect of these models allows them to create things based on patterns learned from the data they were trained on. The generalization aspect of these models allows them to create things that are not exactly like what they have seen before.

“I think of GenAI as Artificial Intelligence that can produce output that is open-ended. The output is one of an infinite number of possibilities.” (Chris Glaze, AI researcher)

Why use GenAI?

The benefits and implications of GenAI for organizations are mainly driven by three levers that elevate tasks and processes significantly:

  • Automation: Generative AI catalyzes process automation, streamlining repetitive and low-value tasks and enhancing operational efficiency. The technology enables organizations to automate workflows, reduce human intervention, and minimize errors. Thus, it increases efficiency by freeing up resources for value-adding or strategic tasks and decision-making.
  • Insights: GenAI profoundly transforms the extraction of valuable information. Insight extraction from datasets is not only automated but also enriched through deeper root-cause analysis, surfacing findings beyond what human analysis alone would reach.
  • User Experience: The success of AI applications hinges on their ability to seamlessly integrate into the user’s daily life, minimizing friction and enhancing accessibility, like Chatbots that allow for user-friendly Q&A or integrated natural language querying.

GenAI models are well poised to deliver considerable insights into nature itself, across biological, physical, and mental realms, with broad implications for solving key societal problems.

GenAI can perform multiple tasks more efficiently, accurately, and quickly. It is used in activities like technical assistance and troubleshooting, content creation and editing, personal and professional support, learning and education, creativity and recreation, research, analysis, and decision-making.

A summary of GenAI uses. Source: Harvard Business Review

Find here a report analyzing multiple GenAI applications.

How does GenAI work?

GenAI models use Artificial Neural Networks (ANNs) to identify the patterns and structures within existing data to generate new and original content. However, the ability to create new digital content is not new.

For example, Generative Adversarial Networks (GANs), introduced in 2014 by Ian Goodfellow and colleagues, provided the ability to create content that looked as if a human had made it. GANs are a type of ANN that can create new data instances resembling the training data and can be used for image, video, and voice generation.
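As a rough illustration of the adversarial idea, here is a minimal GAN sketch in PyTorch (my own toy example, not from the article): a generator learns to mimic a simple 1-D Gaussian while a discriminator tries to tell real samples from generated ones. Real image GANs use convolutional networks, far more data, and careful tuning.

```python
# Minimal GAN sketch: generator G maps noise to fake samples, discriminator D scores realness.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # sample -> P(real)

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # "training data": samples from N(3, 0.5)
    fake = G(torch.randn(64, 8))

    # Discriminator step: learn to tell real from fake.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: learn to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("Generated samples:", G(torch.randn(5, 8)).squeeze().tolist())
```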

These fake images were produced using GANs, trained with millions of pictures of bedrooms and people, and hundreds of thousands of Airbnb listings. Source: thisrentaldoesnotexist.com

While models like GANs can produce high-quality samples quickly, their sample diversity is limited and their training can be unstable. The most important limitation, however, is that these earlier generative solutions couldn’t generalize and, therefore, couldn’t tackle different kinds of problems.

What’s different now with GenAI?

One of the breakthroughs with GenAI models is the ability to leverage different learning approaches, including self-supervised learning for training.

Self-supervised learning is a Machine Learning approach in which labeled data is created from the data itself, without having to rely on historical outcome data or external (human) supervisors that provide labels or feedback. In effect, this is autolabeling or self-data tagging.

This has enabled organizations to leverage large amounts of unlabeled data more easily and quickly to train generative models.
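A minimal sketch of what this auto-labeling looks like in practice, assuming next-word prediction as the self-supervised task: the (input, label) pairs are carved directly out of raw text, with no human annotation involved.

```python
# "Auto-labeling": turning raw, unlabeled text into (input, target) training pairs.
# The labels come from the data itself; no human annotator is needed.
text = "generative models learn the distribution of existing data"
tokens = text.split()

training_pairs = [
    (tokens[:i], tokens[i])          # context -> next word
    for i in range(1, len(tokens))
]

for context, target in training_pairs[:3]:
    print(f"input: {context}  ->  label: {target}")
```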

But what unlocked the ability of GenAI models to parse these massive amounts of data with superior results are two types of models that radically improved how computers process content: Transformers and diffusion models.

Transformer models

Transformers are an Artificial Neural Network (ANN) architecture that generalizes across many tasks with remarkable results. Introduced in 2017, Transformers quickly proved effective at modeling data with long-range dependencies.

In very simple terms, a Transformer’s architecture consists of encoder and decoder components. The encoder receives an input (e.g., a sentence to be translated), processes it into a hidden representation, and passes it to the decoder, which returns an output (e.g., the translated sentence).

Before Transformers, ANN architectures had severe memory problems: they could retain only limited information about long-range dependencies, such as words encountered much earlier that still influence the prediction of the next word. The concept of self-attention fixed this problem, allowing Transformers to capture the context of a word from distant parts of a sentence, both before and after its appearance, to encode valuable information. Sentences are processed as a whole rather than word by word.
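The following is a stripped-down self-attention sketch in NumPy, my own simplification with no learned projections, multiple heads, or positional encodings: it only shows how every position in a sequence mixes in information from every other position.

```python
# Minimal (single-head, unprojected) self-attention over a sequence of token embeddings.
import numpy as np

def self_attention(X):
    """X: (seq_len, d) token embeddings. Every position attends to every other position."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the whole sequence
    return weights @ X                                # each output mixes info from all positions

X = np.random.default_rng(0).normal(size=(6, 4))      # a 6-token "sentence", embedding size 4
print(self_attention(X).shape)                        # (6, 4): context-aware representations
```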

What does “it” in this sentence refer to? Does it refer to the street or the animal? It’s a simple question to a human, but not as simple to an algorithm. As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word. Source: Jay Alammar

While the original Transformers were designed for language tasks, the same Transformer architecture has been applied to many other applications, such as generating images, audio, music, or even actions. Because of that, Transformers are considered a key component of the new wave of GenAI.

Diffusion models

Diffusion models are the current go-to for image generation. They are the base model for popular image generation services such as DALL-E, Stable Diffusion, Midjourney, and Imagen. Diffusion models are also used in pipelines for generating voices, video, and 3D content.

Image generated with Stable Diffusion 3. Source: Lykon

Compared to traditional generative models, diffusion models offer better image quality, a more interpretable latent space, and greater robustness to overfitting. They excel at generating realistic and coherent content from textual prompts and at efficiently handling image transformations and retrievals.

In a diffusion model, the structure in a data distribution is systematically and slowly destroyed through an iterative forward diffusion process. The model then learns a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative data model.

Diffusion models smoothly perturb data by adding noise, then reverse this process to generate new data from noise. Each denoising step in the reverse process typically requires estimating the score function (figure on the right), which is a gradient pointing to the directions of data with higher likelihood and less noise. Source: ResearchGate

The concept of diffusion models is to transform a simple and easily samplable distribution (e.g., a Gaussian distribution) into a more complex data distribution of interest through a series of invertible operations. Once the model learns the transformation process, it can generate new samples by starting from a point in the simple distribution and gradually “diffusing” it to the desired complex data distribution.
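A small sketch of the forward (noising) half of this process, under the usual DDPM-style formulation; the reverse half is only indicated in comments, because it requires a trained noise-prediction network, which I refer to here only as a hypothetical stand-in.

```python
# Forward diffusion: structure in the data is gradually destroyed by adding Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0): a noisier version of the clean data x0 after t steps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.normal(loc=3.0, scale=0.5, size=5)   # "clean" data points
print("t=0  :", np.round(x0, 2))
print("t=100:", np.round(forward_diffuse(x0, 100), 2))
print("t=999:", np.round(forward_diffuse(x0, 999), 2))  # nearly pure Gaussian noise

# The reverse process starts from pure noise and repeatedly denoises it. Each step would call a
# trained network (hypothetically, predict_noise(x_t, t)) to estimate and remove a little noise,
# gradually restoring structure until a new sample from the data distribution emerges.
```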

It is important to note that the diffusion mechanism does not depend on the Transformer architecture, although most modern diffusion approaches include a Transformer backbone.

Foundation & Large Language Models

Training a GenAI model from scratch can be extremely hard and expensive, requiring enormous computational power and consuming staggering amounts of energy.

Foundation Models (FMs)

Foundation Models (FMs) are pre-trained on large amounts of data, producing generalizable and adaptable output: text, audio, images, or videos. They serve as a “foundation” for building other things. For example, GPT (Generative Pre-trained Transformer) works as the foundation model of ChatGPT.

Trained on vast datasets, FMs are versatile and suitable for numerous downstream applications. Models such as GPT-4, Claude 3, and Llama 2 showcase remarkable abilities and are increasingly deployed in real-world scenarios.

Foundation models find a wide array of uses. Source: NVIDIA

Input a prompt, and the system generates an entire essay or a complex image based on your parameters, even if it wasn’t specifically trained to execute that exact task or generate an image in that way. Using Self-Supervised Learning (creating labels directly from the input data) and Transfer Learning, the model can apply information about one situation to another.

With the previous generation of AI techniques, if you wanted to build an AI model that could summarize bodies of text for you, you’d need tens of thousands of labeled examples just for the summarization use case. With an FM, we can dramatically reduce labeled data requirements.
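As a sketch of this reuse, assuming the Hugging Face transformers library is available (the article doesn’t prescribe any particular tooling), a pretrained model can summarize text out of the box, with no task-specific labeled examples:

```python
# Reusing a pretrained model for summarization with zero task-specific labels.
# Requires the `transformers` package; the first call downloads a pretrained summarization model.
from transformers import pipeline

summarizer = pipeline("summarization")   # a pretrained encoder-decoder model, reused as-is

article = (
    "Generative AI models are pre-trained on large amounts of data and can be adapted "
    "to many downstream tasks, such as summarization, without thousands of labeled examples."
)
print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])
```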

Some examples of Foundation Models. Source: Towards AI

This report comprehensively discusses many FMs and their most critical aspects, from their technical underpinnings to societal consequences.

Large Language Models (LLMs)

Large Language Models (LLMs) are one class of FMs. For example, OpenAI’s Generative Pre-trained Transformer (GPT) models are LLMs. LLMs specifically focus on language-based tasks such as summarization, text generation, classification, open-ended conversation, and information extraction.

LLMs are systems that take the context of an input (e.g., a text corpus) and predict the next output (e.g., the next word). They are designed to ingest and generate text or other forms of content (images, audio, video) based on the vast amount of data used to train them.

The general workflow of an LLM is predicting the next word: the model selects the most likely word and appends it to the input sequence. Source: NVIDIA
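A minimal sketch of that loop, assuming the Hugging Face transformers library and the small GPT-2 model as stand-ins (neither is specified in the article): at each step the model scores every possible next token, the most likely one is appended, and the process repeats.

```python
# Greedy next-token generation with a small pretrained LLM (GPT-2).
# Requires `transformers` and `torch`; the model is downloaded on first use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Generative AI can", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                              # generate 10 tokens, one at a time
        logits = model(ids).logits                   # scores for every token in the vocabulary
        next_id = logits[0, -1].argmax()             # greedy: pick the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```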

The introduction of LLMs enables efficient communication between humans and machines by converting instructions into machine-readable inputs and integrating solvers for multiple subtasks.

Besides excelling at general text, image, speech, or code generation, LLMs are being applied to solve deeper challenges. How? Languages can be thought of as systems of communication.

One typical language system is the human language, or natural language, such as English, Chinese, and Spanish. In addition, non-natural language systems exist, such as programming languages like Java and Python and chemical molecule languages like SMILES. Each language system possesses its unique set of vocabulary and grammar, often entirely disparate from others.
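For instance, a molecule written in SMILES can be split into tokens much like a sentence is split into words. The toy tokenizer below is my own rough sketch (real chemistry toolkits such as RDKit do proper parsing); an LLM trained on millions of such strings learns this grammar the same way it learns English.

```python
# SMILES is a "language" for molecules with its own vocabulary and grammar.
# A deliberately simple (and incomplete) regex tokenizer, for illustration only.
import re

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"     # acetylsalicylic acid written in SMILES
tokens = re.findall(r"Cl|Br|[A-Za-z]|\d|\(|\)|=|#|\[|\]|\+|-", aspirin)
print(tokens)
```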

Pointing LLMs at biological data, enabling them to learn the language of life, might unlock huge possibilities. Companies like DeepMind have also shown that LLMs can be used to search for new solutions in areas like mathematics and computer science.

The risks

Last year, Aza Raskin and Tristan Harris discussed in The AI Dilemma how AI capabilities pose catastrophic risks to a functional society and how companies have been caught in a race to deploy as quickly as possible without adequate safety measures. But the phrase that resonated with me the most was:

“This is the year that all content-based verification breaks, just does not work.”

Have we already lost control over the consequences of AI? Some of the most famous AI researchers released a paper stating that AI safety research was lagging, that governance initiatives lacked the mechanisms and institutions to prevent misuse and recklessness, and that they barely addressed autonomous systems.

In this regard, major organizations like the United Nations and the World Bank have issued AI guidelines suggesting how to safeguard implementations. The German Federal Office for Information Security released a detailed report on the opportunities and risks of LLMs and suggested possible countermeasures to address them.

The EU has pioneered the regulation of AI with the Artificial Intelligence Act (AI Act), the world’s first comprehensive AI law and the first concrete initiative to regulate AI globally. Its goal is to establish Europe as a leading hub for trustworthy AI by setting harmonized rules for developing, marketing, and using AI across the EU.

The EU AI regulation is coming. Source: EY

What’s next?

The next wave of AI development is powered by groundbreaking technologies. Advances in domain-specific and multimodal language models will turn GenAI-centric technology into an indispensable tool for technology providers.

As we navigate a landscape shaped by growing AI trust, regulatory pressures, and security concerns, the focus sharpens on the critical role of data, cutting-edge analytics techniques, and decision intelligence, heralding an era of unprecedented human augmentation and automation.

What are the key aspects of this reality? I believe we can define the current AI setting through the ideas of centralization, multimodality, the evolution of AI performance, and Agents. Let’s describe them next.

Centralization

As FMs become increasingly expensive to train, few players can afford to develop them. As a result, only a small set of organizations is driving growth, excluding others from developing their own leading-edge FMs.

A few massive and powerful FMs strengthen their creators’ dominance over computing and, through their broad applicability, over many other economic sectors, challenging our capacity for critical appraisal and regulatory response.

The figure depicts the cumulative count of foundation models released and attributed to respective countries since 2019. The country with the greatest number of foundation models released since 2019 is the United States (182), followed by China (30), and the United Kingdom (21). Source: Stanford University

Multimodality

Human communication is multimodal: We use text, voice, emotions, expressions, and images. Given this, it is safe to assume that future communication between humans and machines will also be multimodal.

Multimodal models can have multiple data inputs and outputs within a single generative model, such as images, videos, audio (speech), text, and numerical data. Multimodality augments GenAI's usability by allowing models to interact with and create outputs across various modalities.

A comparison between unimodal, cross-modal, and multimodal GenAI models. For unimodal and cross-modal, the results were generated using ChatGPT and Stable Diffusion 2.1. For the multimodal example, the results refer to GPT-4 at a time when access to it was still limited. Source: ResearchGate

Multimodal AI is important because robust information about the real world is typically multimodal. Multimodality helps capture the relationships between different data streams and scales GenAI's benefits across potentially all data types and applications.

The Evolution of AI Performance

The quest for ever-improving AI models prompts a critical question: What is the performance limit of FMs and LLMs? Can they continue to grow indefinitely? Researchers are beginning to pivot from simply scaling the model size and data volume to exploring more nuanced strategies.

Innovative training techniques, cutting-edge architectural designs, and meticulous data curation are emerging as the new frontiers in achieving state-of-the-art performance in natural language processing tasks. This shift represents a broader evolution in AI development, emphasizing quality and efficiency over sheer size.

Agents

The future will be agentic. Artificial Intelligence Agents (Agents) represent digital entities that evaluate their environment, learn from their interactions, and make decisions to accomplish particular objectives. These entities can execute tasks, comprehend the context, adjust their strategies, and develop new approaches to achieve their goals.
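To make that loop concrete, here is a deliberately tiny, non-LLM sketch of an agent (a thermostat) that observes its environment, decides, and acts toward an objective. Real AI Agents put an LLM in the “decide” step and call external tools, but the control loop is the same in spirit; everything below is my own toy example.

```python
# Toy agent loop: observe the environment, decide on an action, act, repeat.
import random

class ThermostatAgent:
    def __init__(self, target):
        self.target = target

    def decide(self, observation):
        """Map what the agent perceives to an action that moves it toward its objective."""
        if observation < self.target - 0.5:
            return "heat"
        if observation > self.target + 0.5:
            return "cool"
        return "idle"

temperature = 15.0
agent = ThermostatAgent(target=21.0)
for step in range(10):
    action = agent.decide(temperature)                       # evaluate environment -> decision
    temperature += {"heat": 1.5, "cool": -1.5, "idle": 0.0}[action]
    temperature += random.uniform(-0.2, 0.2)                 # the environment also changes on its own
    print(f"step {step}: temp={temperature:4.1f}  action={action}")
```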

To bridge the gap between traditional LLMs and AI Agents, a crucial aspect is to design rational AI Agent architectures to assist LLMs in maximizing their capabilities. Source: A Survey on Large Language Model based Autonomous Agents

Interpretability Agents, for example, can plan and perform tests on other computational systems, ranging in scale from individual neurons to entire models. They produce explanations of these systems in various forms, such as language descriptions of what a system does and where it fails, and code that reproduces the system’s behavior.

Looking ahead, a promising approach to making multimodal systems more interactive is embodying them as Agents within both physical and virtual environments, enhancing their ability to interact dynamically and respond to their surroundings.

Interested in these topics? Follow me on LinkedIn or X
