The Fickle Nature of ChatGPT

As an AI researcher and developer with more than ten years of experience, I have had the opportunity to work on a wide range of projects and technologies in the field. Despite this, I am still surprised by ChatGPT's remarkable fluency but disappointed by its lack of accuracy. In this blog post, I will express my views on ChatGPT's fickle nature of being both surprising and disappointing even for AI experts like myself.

The underlying technology of ChatGPT is nothing new. It all started with research from Google in 2017, which introduced the concept of transformer-based models for natural language processing. These models, including ChatGPT, use a self-attention mechanism to process input text and generate output text. A self-attention mechanism allows the model to weigh the relative importance of different words in the input text and generate output text that is more coherent and fluent.

However, what really sets ChatGPT apart from other transformer-based models is its sheer scale. It was trained on a massive dataset of text, which allows it to understand and respond to a wide range of topics and questions.

Evidently, this combination of transformer-based architecture and large-scale training data has resulted in an AI model that can generate text that is often indistinguishable from text written by a real person. 

From my perspective, the most compelling question is: Why does more training data have such a dramatic effect? As data scientists know, more training data also introduces more noise, which should compromise the model's performance.

As Peter Norvig, the renowned computer scientist, famously stated,  “More data beats clever algorithms, but better data beats more data." This quote highlights the importance of the quality and relevance of the data used to train and test algorithms, as well as the fact that having a large amount of data does not guarantee good results if its quality is poor. In other words, it is more important to have accurate and diverse training data than a lot of data that might be irrelevant, noisy, or biased.

Bias and Noise 

Bias, refers to systematic errors in thinking or decision-making that leads to inaccurate or irrational judgments and decisions. Bias can be caused by many factors, including cognitive heuristics, social and cultural influences, and emotional states. Bias leads to a systematic deviation from rationality and to poor decision-making.

Noise, on the other hand, refers to random variation or error in the information used to make a decision or judgment. Noise leads to unreliable or inconsistent results and makes it difficult to accurately assess the true underlying trend or signal in the data.

In short, bias refers to systematic errors in thinking or decision-making while noise refers to random or unpredictable errors or variations in information. However, while bias is a recognized and controlled issue, the impact of noise is consistently underestimated.

For example, one study, cited by Kahneman, found that there is a significant variation in insurance premium quotes, with an average difference of 55% depending on the underwriter. This means that the premium for the same insurance policy can vary greatly depending on which underwriter is handling the case. For instance, if one underwriter sets the premium at $9,500, another underwriter might set it at $16,700. This variation in premium quotes can have important consequences, not just for the client, whose premium should not depend on which underwriter is handling the case but also for the company. Overpriced contracts can lead to loss of business, while underpriced contracts can lead to financial losses for the company.

The internet is famously full of noise. It includes a vast amount of information, much of which is inconsistent, unreliable, or irrelevant. Because ChatGPT was trained on a similarly vast amount of noisy data, one would expect “garbage in, garbage out.”

"Garbage in, garbage out" means that if the input data fed into an algorithm is inaccurate, the output of the algorithm will also be inaccurate. In other words, the output can only be as good as the input data.

ChatGPT was fed garbage, so why does it only partly produce garbage? To answer that, we need to take a short trip back to the early 20th century.

How much does an ox weight?

The "weight of an ox" experiment was conducted by Francis Galton in 1906  at a county fair. Galton asked 800 people to guess the weight of an ox on display. The guesses varied widely, from 600 pounds up to 1,200 pounds.

After collecting all the guesses, however, Galton found that the average of all the guesses was 1,197 pounds, just one pound away from the actual weight of the ox, which was 1,198 pounds. This experiment demonstrated that a group of people—even those with no expertise in the subject—can collectively come to a more accurate decision than any single individual.

Galton's experiment showed that the diversity of opinions and knowledge within a group can lead to a more accurate understanding of a situation or problem. The experiment is widely cited as an early example of the “wisdom of crowds.”.

It appears to me that ChatGPT is using the wisdom of crowds to its advantage in certain cases. Through a vast amount of diverse text data, the model is able to generate meaningful communication despite the presence of noise in the input data.

In short, my hypothesis is this: Contrary to popular belief, the presence of noise in the training data for ChatGPT may actually enhance the model's quality in specific cases. If this hypothesis holds true, it would indeed be rather surprising.

It's worth noting that adding noise to training data is a common practice in machine learning to prevent overfitting, but it usually comes at the cost of decreased accuracy. The type of noise I am referring to in the rest of this article is different from this kind of artificially added noise.


The wisdom of crowds is not universal

Research has shown that the wisdom of crowds is most effective when the task at hand is relatively simple, the crowd is diverse, and the crowd is adequately large. 

On the other hand, when the task is complex, the crowd is not diverse, or the crowd is too small, the wisdom of the crowd can be misleading.

Overall, the wisdom of crowds is a complex phenomenon that is influenced by many factors. While it has been found to be true in certain situations, it is not a universal phenomenon, and it is not always applicable. Case in point:


The uniforms of the Royal Marines were, of course, red, not blue. Despite ChatGPT's remarkable confidence, it is surprisingly inaccurate. As a side note, this confidence may cause major problems as incorrect information will easily spread. Here is another example:hobbes

Hobbes’s works on church history and the history of philosophy make it clear he was firmly against the separation of government powers, either between government branches or church and state.


ChatGPT is surprisingly convincing, so much so that some high schools are considering banning it to prevent wide-scale cheating. I hypothesize this is the result of its vast and diverse training data, creating an effect similar to the wisdom of crowds. However, in instances where the wisdom of crowds does not apply, ChatGPT is confidently incorrect, as evidenced by the two examples above.

It is important to keep in mind that ChatGPT—like all current AI models—lacks a model of the world like the one humans possess. So while it can write a competent essay about Shakespeare, it struggles to answer simple questions like the one below.