If you’ve ever used a Large Language Model (LLM), you know that it can answer just about any question you throw at it. But sometimes, the answer doesn’t quite make sense. Why does this happen, and why is it crucial to check the validity of these responses?
Why LLMs Always Have an Answer
To understand why LLMs always generate an answer, we need to look at how they learn. At the core of every LLM is a neural network trained on vast amounts of text. This training helps the LLM learn patterns in language, such as grammar, sentence structure, and even context, allowing it to generate responses to input it has never seen before.
However, an LLM doesn’t "understand" language the way humans do. Instead, it predicts the most likely next word in a sequence based on the input it receives. When you ask a question, the model predicts the word most likely to follow, appends it to the text, and repeats the process word by word. It does this by relying on patterns it has learned, such as ‘a verb follows its subject’ and ‘a plural subject takes a plural verb’. Over time, these predictions form a coherent response.
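To make this concrete, here is a toy sketch of that prediction loop in Python. The `next_token_probabilities` function is a made-up stand-in for a real model, and greedy decoding (always picking the most likely word) is only one of several ways a model can choose the next token:

```python
# Minimal sketch of autoregressive next-token prediction.
# `next_token_probabilities` is a toy stand-in for a real language model:
# it maps the current text to a probability distribution over candidate tokens.

def next_token_probabilities(text: str) -> dict[str, float]:
    # Hard-coded continuations, purely for illustration.
    table = {
        "The capital of France": {"is": 0.9, "was": 0.1},
        "The capital of France is": {"Paris": 0.95, "Lyon": 0.05},
        "The capital of France is Paris": {".": 1.0},
    }
    return table.get(text, {".": 1.0})

def generate(prompt: str, max_tokens: int = 10) -> str:
    text = prompt
    for _ in range(max_tokens):
        probs = next_token_probabilities(text)
        # Greedy decoding: always pick the most likely next token.
        token = max(probs, key=probs.get)
        text = text + token if token == "." else f"{text} {token}"
        if token == ".":
            break
    return text

print(generate("The capital of France"))  # -> "The capital of France is Paris."
```

A real LLM does exactly this, just with a learned neural network instead of a lookup table and a vocabulary of tens of thousands of tokens.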
Thanks to advancements in neural networks and computational power, LLMs have become capable of much more sophisticated predictions. Modern LLMs like GPT-3 not only predict words within a sentence but also use context from entire paragraphs to generate more accurate responses. When trained on specific datasets, such as question-and-answer pairs from forums, they learn patterns like "a question is typically followed by an answer."
But here’s the catch: because LLMs are trained on data where questions almost always receive answers, they’re not used to seeing responses like "I don’t know." This introduces a bias. The LLM’s training data rarely contains examples of unanswered questions, leading the model to always generate an answer, even when it doesn’t have enough information to do so correctly. When faced with a topic outside its expertise, the model may fall back on simpler linguistic patterns it has learned, which can result in an inaccurate or misleading response.
This is why validating LLM-generated content is so important - without it, there’s a risk of accepting incorrect information as fact.
Why Validation Against Benchmarks Alone Is Not Enough
The traditional approach for testing algorithms involves training a model on one dataset (A) and then evaluating its performance on a separate dataset (B). This method is widely used for assessing LLMs, just as it has been for earlier models, such as those in image recognition or classic machine learning techniques like random forests and decision trees. To obtain reliable metrics, it’s crucial to avoid any overlap between datasets A and B; if training and test data intersect, the evaluation results become “contaminated” and less meaningful.
In smaller models with controlled, often numerical data, it’s usually clear which data was included in training. LLMs, however, bring a unique challenge: it’s often unclear what data was used to train them. For many state-of-the-art models, the training data remains undisclosed, and even where it is available, the sheer volume of data makes it difficult to fully trace its origins.
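To illustrate why overlap matters, here is a minimal sketch of a contamination check between a training set and a test set. The exact-match comparison is an assumption made for brevity; real contamination checks typically look for fuzzier overlap, such as shared n-grams or near-duplicate embeddings:

```python
# Minimal sketch of a train/test contamination check.
# Real-world checks are usually fuzzier (n-gram or embedding overlap),
# but even an exact-match check catches the most obvious leakage.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def find_contamination(train_set: list[str], test_set: list[str]) -> list[str]:
    train_normalized = {normalize(item) for item in train_set}
    return [item for item in test_set if normalize(item) in train_normalized]

train = ["What is 2 + 2?", "Name the largest planet in our solar system."]
test = ["What is 2 + 2?", "How many moons does Mars have?"]

leaked = find_contamination(train, test)
print(f"{len(leaked)} contaminated test item(s): {leaked}")
```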
This lack of transparency, combined with the popularity of certain benchmark datasets, increases the likelihood of inadvertent overlap between training and test data, potentially skewing performance assessments. Furthermore, research from Mirzadeh et al. (2024) shows that many benchmark datasets are static and may not evaluate a model’s behaviour across different scenarios or question complexities. To address these gaps, researchers are exploring new ways to evaluate LLMs. Mirzadeh et al. (2024), for example, take a subset of questions from GSM8K - a popular grade-school math Q&A benchmark dataset (Cobbe et al., 2021) - and add a single extra clause that appears relevant to each question.
With this change alone, they observed significant performance drops (up to 65%) across top models. The added clause didn’t change the required reasoning, but it introduced minor variations that challenged the models’ robustness. Such insights demonstrate how even small changes can affect an LLM’s output, reflecting the varied conditions it may encounter in real-world use.
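The sketch below shows the general idea; the question, the added clause, and the `ask_llm` placeholder are all illustrative, not the authors’ actual setup. The point is simply to insert a clause that sounds relevant but does not change the arithmetic, then compare the model’s answers on both versions:

```python
# Rough sketch of a GSM8K-style robustness check, loosely inspired by
# Mirzadeh et al. (2024): insert a clause that sounds relevant but does not
# change the required reasoning, then compare the model's answers.

def add_irrelevant_clause(question: str, clause: str) -> str:
    # Insert the extra clause right before the final "How many ..." question.
    return question.replace("How many", clause + " How many")

original = (
    "Olivia has 23 apples. She gives 7 apples to her friend. "
    "How many apples does she have left?"
)
perturbed = add_irrelevant_clause(
    original, "Five of her apples are slightly smaller than the rest."
)

for question in (original, perturbed):
    # `ask_llm` is a placeholder for whatever model call you use:
    # answer = ask_llm(question)
    print(question)

# The correct answer is 16 in both cases; a robust model should not be
# distracted by the extra clause.
```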
How to Validate LLM-Generated Content?
In practical applications, ensuring that an LLM provides accurate answers is key. Here are some of the methods we like to use at Notilyze to validate content and reduce the risk of errors:
1. Expert/End User Feedback
Collecting feedback from users or subject matter experts is a crucial way to ensure the accuracy of LLM responses. By making it easy for users to share why they disagree with a particular answer, developers can prioritize updates and improve the model’s performance over time. This feedback loop is essential for continuously refining the model.
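A lightweight way to make that feedback usable is to capture it in a structured form. The sketch below is one possible shape; the field names and example values are chosen purely for illustration:

```python
# Minimal sketch of a structured feedback record, so that disagreements can be
# aggregated and prioritized later. Field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnswerFeedback:
    question: str
    model_answer: str
    is_correct: bool          # the user's verdict
    reason: str = ""          # why the user disagrees, if they do
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

feedback = AnswerFeedback(
    question="When was our refund policy last updated?",
    model_answer="January 2023",
    is_correct=False,
    reason="The policy was revised again afterwards.",
)
print(feedback)
```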
2. LLM-as-a-Judge
This method involves using a second LLM to evaluate the answer given by the first one. After LLM ‘A’ generates an answer, LLM ‘B’ can assess its validity based on a set of criteria provided by experts. This approach allows for a more automated form of validation while still adhering to standards set by professionals in the field.
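Below is a minimal sketch of this pattern, assuming an OpenAI-style chat client; the model name, criteria, and prompt wording are placeholders you would replace with your experts’ own standards:

```python
# Minimal LLM-as-a-judge sketch, assuming the `openai` Python client;
# the judging criteria and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(question: str, answer: str, criteria: list[str]) -> str:
    """Ask a second model (LLM 'B') to grade the answer produced by LLM 'A'."""
    prompt = (
        "You are a strict reviewer. Evaluate the answer against each criterion "
        "and reply with PASS or FAIL per criterion, plus a one-line justification.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Criteria:\n" + "\n".join(f"- {c}" for c in criteria)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

verdict = judge_answer(
    question="What is our standard delivery time?",
    answer="Orders are typically delivered within 3-5 business days.",
    criteria=["Directly answers the question", "Contains no unsupported specifics"],
)
print(verdict)
```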
3. Human-in-the-Loop
Although incorporating human validation can be more resource-intensive, it is still one of the most effective ways to ensure accuracy. In many applications, LLMs assist experts by providing possible answers, which the experts then validate. This approach can significantly speed up workflows by reducing the time experts spend on research, allowing them to focus on verifying the model’s suggestions.
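In code, this can be as simple as a review queue that holds the model’s draft answers until an expert signs off. The sketch below is illustrative; the statuses and fields would depend on your own workflow:

```python
# Minimal human-in-the-loop sketch: LLM suggestions sit in a queue until an
# expert approves or rejects them. Statuses and fields are illustrative.
from dataclasses import dataclass

@dataclass
class Suggestion:
    question: str
    draft_answer: str
    status: str = "pending"   # pending -> approved / rejected

review_queue = [
    Suggestion("Is clause 4.2 applicable to subcontractors?", "Yes, per section 4.2(b)."),
]

def expert_review(item: Suggestion, approved: bool, corrected_answer: str = "") -> None:
    item.status = "approved" if approved else "rejected"
    if corrected_answer:
        item.draft_answer = corrected_answer

# The expert verifies the suggestion instead of researching from scratch.
expert_review(review_queue[0], approved=True)
print(review_queue[0])
```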
4. Forcing LLMs to Include References
Lastly, one of the most effective ways to increase transparency and trust is by requiring LLMs to provide references along with their answers. By including the source of their information, users can easily verify the accuracy of a response. This method is especially useful when combined with Retrieval-Augmented Generation (RAG), where the model links answers to specific internal documents. Users can instantly access the source material, adding context and clarity to the information provided.
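Here is a minimal sketch of how references can be enforced in a RAG-style prompt; the document store and retrieval step are simple placeholders for whatever search your system actually uses:

```python
# Minimal RAG-with-citations sketch: retrieved snippets are passed in with IDs,
# and the prompt instructs the model to cite them. The document store and
# retriever here are illustrative stand-ins, not a specific product.

documents = {
    "policy-2024-06": "Refunds are processed within 14 days of the return being received.",
    "faq-shipping": "Standard shipping takes 3-5 business days within the EU.",
}

def retrieve(question: str, k: int = 2) -> dict[str, str]:
    # Placeholder retrieval: a real system would use keyword or vector search.
    return dict(list(documents.items())[:k])

def build_prompt(question: str) -> str:
    snippets = retrieve(question)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in snippets.items())
    return (
        "Answer the question using ONLY the sources below. "
        "Cite the source ID in square brackets after every claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long do refunds take?"))
```

Because every claim carries a source ID, a user (or a reviewer) can jump straight to the underlying document and check the answer.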
Curious how LLMs can enhance your organization’s capabilities? Reach out to let us help you!
Contact:
Eric Mathura
E-mail: eric.mathura@notilyze.com
Mobile: +31 6 53640514
References
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Retrieved from: https://arxiv.org/pdf/2110.14168
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv preprint arXiv:2410.05229. Retrieved from: https://arxiv.org/pdf/2410.05229