
Is It Time to Talk about GPT-5?—The Problem with Transformers

GPT-5 is on the horizon, and it promises to shake the industry. But are more parameters all that’s necessary to create a more powerful model?


By Nate Dow

Solutions Architect Nate Dow helps BairesDev teams deliver the highest-quality software and products through creative business solutions.


In the ever-evolving world of software development, artificial intelligence (AI) has emerged as a game-changer. Its potential to revolutionize industries and drive business growth has caught the attention of CEOs, CFOs, and investors alike. As technology continues to advance at an unprecedented pace, one question arises: can AI simply be made better by throwing more raw computing power at it? In this article, we will explore the possibilities and implications of empowering AI through increased computational capabilities.

AI has evolved at an incredible pace, from early chatbots like ELIZA to modern machine learning algorithms, and this rapid progression has been greatly supported by AI development services. AI can now match and even surpass human performance in many areas. However, this progress has come at a great cost: more powerful AI requires more power, as in more computational capability.

By adding more processing power to AI systems, whether through high-performance computing clusters or cloud-based infrastructure, engineers can unlock new levels of performance and achieve groundbreaking results.

Let’s take GPT-3 and its family of models as an example. With large language models (LLMs), the standard shorthand for estimating a model’s abilities is its number of parameters: the bigger the number, the more powerful the AI. And while yes, size does matter, parameters aren’t everything, and at some point we are going to hit the engineering problem of requiring more processing power than we can supply.

Before we delve deeper, I want to draw a parallel to a subject that’s dear to my heart: video games and consoles. See, I’m a child of the 80s; I was there for the great console wars of the 90s—Sega does what Nintendon’t and all that jazz. At some point, consoles stopped marketing their sound capabilities or the quality of their colors and instead began talking about bits.

In essence, the more bits, the more powerful the console; everyone was after those big bits. And this led to companies coming up with some extremely wacky architectures. It didn’t matter how insane the hardware was as long as they could promote it as having more bits than the competition (ahem, Atari Jaguar).

This kept on going for quite a while: Sega left the console market, Sony took the world by storm with the PlayStation, and Microsoft entered the competition with the Xbox. At the heart of every generation, we had the bits. Around the era of the PS2, we also started talking about polygons and teraflops; once again, it was all about the big numbers.

And then came the era of the PS3 and Xbox 360. Oh, the promise of realistic graphics, immersive sound, and so much more. Now it wasn’t about bits; it was about polygons on screen, frames per second, and storage capacity. Once again, it was about the biggest number.

The two console manufacturers went head to head, and without either ever realizing it, one small alternative made its way into the market: Nintendo’s Wii. The Wii was a toy in comparison to the beasts Sony and Microsoft pushed to the market, but Nintendo was smart. They targeted the casual audience, the one that wasn’t intoxicated with big numbers. The end result speaks for itself: during that console generation, the PS3 sold 80 million units, the Xbox 360 sold 84 million, and the Wii? 101 million units.

The small underdog took the market by storm, and all it took was some creativity and ingenuity.

What does my rambling have to do with the AI arms race? As we’ll see, there is a very strong reason to be careful about chasing bigger models, and it’s not because they are going to take over the world.

Why Do We Want Bigger Models?

So, what are the advantages of putting our models on bigger and more powerful hardware? Much like a box of energy drinks lets software developers pull off miracles, more RAM and more processing power give our models a boost, expanding what they can compute.

In practical terms, boosting AI with more computing power means giving it greater resources to process data faster and more efficiently.

One significant advantage of empowering AI with increased computational capabilities, aided by machine learning services, is the ability to analyze large datasets in real time. With access to immense computing power, AI algorithms can quickly identify patterns and trends that may have otherwise gone unnoticed. This enables CEOs and CFOs to make faster and more informed decisions based on accurate insights derived from complex datasets.

Also, more powerful AI systems, including AI for software testing, have the potential to process complex patterns within data sets more effectively, leading to highly accurate predictions that aid investors in making informed decisions. By harnessing increased computational power, organizations can leverage predictive analytics models that provide valuable insights into market trends, customer behavior, and investment opportunities.

Finally, empowered AI has the capability to automate repetitive tasks at scale while maintaining accuracy and reducing operational costs for businesses. With increased computational power, organizations can deploy advanced automation solutions that streamline processes across various departments, such as finance, operations, or customer service.

And all that is common sense, right? More power means more processing capability, which translates to bigger models and faster, more accurate results. Nevertheless, while the potential benefits of boosting AI with more computing power are significant, there are several broader issues that need to be considered:

  • Ethical Considerations: As AI becomes more powerful, ethical concerns surrounding privacy invasion or biased decision-making may arise. Organizations must ensure transparency and accountability when deploying empowered AI solutions to maintain trust and avoid potential pitfalls.
  • Environmental Impact: Increased computational power requires more energy consumption, which can have environmental implications. It is crucial for organizations to balance the benefits of empowered AI with sustainable practices and explore ways to minimize their carbon footprint.

The problem with just throwing more power into refining our models is that it is a bit like the dark side in Star Wars (I am such a geek…). Yes, it is a faster path toward power, but it also comes at a cost that might not be evident until it’s too late.

The Transformer Models: A Revolutionary Approach to AI

Just to build up some tension, let’s talk a bit about transformer models and why they are so important to modern-day computing and machine learning, exploring the transformative power of these models (pun completely intended) and their implications for businesses.

Transformer models are a type of deep learning architecture that uses self-attention mechanisms to process sequential data efficiently. In fact, attention is so important that the original paper was titled “Attention Is All You Need.”

To simplify a very complex subject, unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers can capture long-range dependencies in data without relying on sequential processing. In other words, imagine that you have a box full of photographs and you wish to organize them chronologically.

One method would be to stack the photos and then look at each one in order, classifying it based on its relation to its nearest neighbors. That could definitely work, but it comes with a key problem: you aren’t paying attention to the whole stack of photos, only a few at a time.

The second approach, the one that resembles transformers, involves laying all the photos on the floor and looking at all of them at once, finding which photos are closer to which based on the colors, styles, content, and so on. See the difference? This pays more attention to the context than a sequential analysis.

This breakthrough has paved the way for remarkable advancements in natural language processing (NLP) tasks such as machine translation, sentiment analysis, and question-answering.

One key advantage of transformer models is their ability to comprehend complex language structures with exceptional accuracy. By leveraging self-attention mechanisms, these models can analyze relationships between words or phrases within a sentence more effectively than previous approaches.
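To make that idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation behind transformers. It is a toy illustration, not how production models are implemented: the token embeddings and projection matrices below are random placeholders.

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: every token looks at every other token at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarity between all tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: how much attention each pair gets
    return weights @ V                                # each output mixes information from all tokens

# Toy example: 4 tokens with 8-dimensional embeddings (random placeholders)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(out.shape)  # (4, 8): every token's new representation draws on the full context
```

Notice that the attention weights form a full token-by-token matrix, which is exactly the “lay all the photos on the floor” approach from the analogy above, and also why the cost grows quickly with sequence length.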

It’s pretty simple when we put it like this, right? Context is everything in language, and transformers can be “aware” of more information than just a few words, so they have more information to accurately predict the next word in a sentence. Or, in the case of other applications like sentiment analysis, they can pinpoint the sentiment regarding a topic and even tell whether a comment is sarcastic based on the context.

Machine translation has always been a challenging task due to linguistic nuances and cultural differences across languages. However, transformer models have significantly improved translation quality by modeling global dependencies between words rather than relying solely on local context as traditional approaches do. This breakthrough empowers businesses operating globally with more accurate translations for their products, services, and marketing materials.

The Dark Side of Power: The Challenges of Scaling Transformer Models

While transformer models have revolutionized the field of AI and brought about significant advancements in language understanding, scaling these models to handle larger datasets and more complex tasks presents its own set of challenges.

First and foremost, transformers are resource-intensive. As they grow in size and complexity, they require substantial computational resources to train and deploy effectively. Training large-scale transformer models calls for high-performance computing clusters or cloud-based infrastructure with specialized hardware, such as graphics processing units (GPUs) or tensor processing units (TPUs). This increased demand for computational power can pose financial constraints for organizations without adequate resources.

Look no further than OpenAI and its GPT models. No one can deny just how amazing these models are, but that capability comes at a cost. The models run in data centers that make old mainframes look like laptops in comparison. In fact, you can download any of the open-source LLMs out there, try to run it on your computer, and watch your RAM cry in pain as the model swallows it.

And most open models are dwarfed by GPT-3 in terms of parameters. For example, LLaMA (Meta’s LLM) and its open-source cousins sit somewhere around 40 billion parameters. Compare that to the 175 billion parameters of GPT-3. And while OpenAI has chosen not to disclose how many parameters GPT-4 has, rumors place it at around 1 trillion.
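To get a feel for what those parameter counts mean in hardware terms, here is a back-of-the-envelope sketch. It assumes 2 bytes per parameter (roughly fp16) and counts only the weights, ignoring activations, optimizer state, and caches, which add substantially more.

```python
def weight_memory_gb(n_parameters, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes per parameter ~ fp16)."""
    return n_parameters * bytes_per_param / 1024**3

for name, params in [("~40B open model", 40e9), ("GPT-3 (175B)", 175e9), ("rumored ~1T", 1e12)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights")
# ~40B open model: ~75 GB of weights
# GPT-3 (175B): ~326 GB of weights
# rumored ~1T: ~1863 GB of weights
```

Even under these generous assumptions, anything beyond the smallest models is far outside the reach of a single consumer machine.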

Just to put it in perspective, Sam Altman, OpenAI’s CEO, told the press that training GPT-4 cost around $100 million. And take into consideration that this model uses data that had already been collected and preprocessed for the earlier models.

Scaling transformer models often necessitates access to vast amounts of labeled training data. While some domains may have readily available datasets, others may require extensive efforts to collect or annotate data manually. Additionally, ensuring the quality and diversity of training data is crucial to avoid biases or skewed representations within the model.

Just recently, a class action lawsuit was brought against OpenAI for lack of transparency in regard to their data gathering. Similar complaints have been raised by the EU. The theory is that just like you can’t make an omelet without cracking a few eggs, you can’t build a trillion-parameter model without obtaining data in a sketchy way.

Larger transformer models tend to have a higher number of parameters, making them more challenging to optimize during training. Fine-tuning hyperparameters and optimizing model architectures become increasingly complex tasks as the scale grows. Organizations must invest time and expertise into fine-tuning these parameters to achieve optimal performance while avoiding overfitting or underfitting issues.
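To see why this becomes painful at scale, consider a hypothetical hyperparameter grid (the values below are illustrative, not recommendations). Even a tiny search space multiplies into dozens of full training runs, and for a large transformer each run can cost GPU-weeks.

```python
from itertools import product

# Hypothetical search space: every combination means another full training run
grid = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "batch_size": [32, 64, 128],
    "warmup_steps": [500, 1000],
    "dropout": [0.0, 0.1],
}

configs = list(product(*grid.values()))
print(len(configs))  # 36 combinations, i.e., 36 full training runs for a tiny grid
```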

Deploying scaled-up transformer models into production environments can be a daunting task due to their resource requirements and potential compatibility issues with existing infrastructure or software systems. Organizations need robust deployment strategies that ensure efficient utilization of computational resources while maintaining scalability and reliability.

Open Source Strikes Back

The competition within the world of AI has long been perceived as a battleground between tech titans such as Google and OpenAI. However, an unexpected contender is rapidly emerging: the open-source community. A leaked internal memo from a Google engineer posits that open source has the potential to outshine Google and OpenAI in the race to AI dominance.

A significant advantage that open-source platforms hold is the power of collaborative innovation. With the leak of Meta’s capable foundation model, the open-source community took a quantum leap. Individuals and research institutions worldwide quickly developed improvements and modifications, some outpacing the developments of Google and OpenAI.

The range of ideas and solutions produced by the open-source community was far-reaching and high-impact thanks to its decentralized, open-to-all nature. The community iterated upon and improved existing solutions at a pace that Google and OpenAI would do well to consider in their own strategies.

Interestingly, the engineer in question also points to the fact that these open source models are being built with accessibility in mind. In contrast to the juggernaut that is GPT-4, some of these models yield impressive results and can run on a powerful laptop. We can summarize their opinion on LLMs in five key points:

  1. Lack of Flexibility and Speed: Large models are slow to develop, and it is difficult to make iterative improvements on them quickly. This hampers the pace of innovation and prevents quick reactions to new datasets and tasks.
  2. Costly Retraining: Every time there is a new application or idea, large models often need to be retrained from scratch. This not only discards the pretraining but also any improvements made on top of it. In the open-source world, these improvements add up quickly, making a full retrain extremely costly.
  3. Impediment to Innovation: While large models might initially offer superior capabilities, their size and complexity can stifle rapid experimentation. Smaller, quickly iterated models in the open-source community improve far faster than larger models, and their best versions are already largely indistinguishable from large models like ChatGPT. Thus, the focus on large models puts companies like Google at a disadvantage.
  4. Data Scaling Laws: Large models also often rely heavily on the quantity of data, rather than the quality. However, many open-source projects are now training on small, highly curated datasets, which potentially challenges the conventional wisdom about data scaling laws in machine learning.
  5. Restricted Accessibility: Large models often require substantial computational resources, which limits their accessibility to a wider range of developers and researchers. This factor impedes the democratization of AI, a key advantage of the open-source community.

In other words, smaller models allow for faster iterations and, consequently, faster development. This is one of those cases where we can confidently say that less is more. The experiments the open-source community is running with these models are incredible, and as we mentioned in the fourth point, they are calling into question a lot of the assumptions we’ve made so far about machine learning.

I started with a video game analogy, and I will close with one. In an interview, Yoshinori Kitase, the director behind the incredible Final Fantasy VI, was asked about the climate and culture of game development in the 90s. Unsurprisingly, Kitase admitted that it was a pain.

Having to fit an epic tale, with graphics, dialogue, music, and even cutscenes, into a measly 8 megabytes of storage seems impossible by today’s standards. But Kitase actually spoke quite favorably about the experience. In his mind, the limitations of the time forced the team to think creatively, to shape and reshape their vision until they managed to get it under 8 megabytes.

It feels like the open-source community embodies that spirit. Lacking the resources of the tech giants, they have taken on the task of creating and developing models that could potentially run on a potato. By leveraging ingenuity and, in some cases, enlisting software development services for specialized tasks, they have shown that chasing ever-larger models isn’t the only path to building powerful language solutions.

If you enjoyed this article, check out one of our other AI articles.

