Better Left Unsaid: Preventing Hallucinations by Learning Abstention (Faculty/Junior Researcher Collaboration Opportunity)

PI: Dongwon Lee (IST)

Funding plan: two semesters of support at a 50% RA appointment, covering tuition for graduate students or the remainder of the researcher’s salary for postdocs and research faculty.

Abstract: Part of the reason large language models (LLMs) sometimes hallucinate is that the boundaries of their knowledge are not clearly defined. To mitigate this challenge, in this project, we propose to develop methods that enable LLMs to recognize these boundaries and respond appropriately—such as by saying “I abstain” or “I don’t know”—when faced with questions they cannot reliably answer. This form of abstention may involve fine-tuning the models to avoid overconfident or fabricated responses (i.e., hallucinations) and instead adopt behaviors that reflect genuine uncertainty or a lack of information.

Landscape: Current approaches to teaching LLMs to abstain rely heavily on instruction tuning, reinforcement learning from human feedback (RLHF), or confidence-based strategies. For instance, instruction tuning trains models on examples that explicitly guide them to say “I don’t know” when unsure, while RLHF helps models learn to prefer safe responses by optimizing for human-rated preferences. Some systems also incorporate logit-based confidence thresholds to trigger fallback responses when the model’s certainty is low. Additionally, retrieval-augmented generation (RAG) methods allow models to consult external knowledge sources, with abstention triggered when relevant information is missing or insufficient. In more recent research, special tokens like [IDK] and uncertainty-aware training objectives have been introduced to explicitly model abstention behavior. Despite these advances, current systems still face several limitations. LLMs often exhibit overconfidence, generating fabricated responses even when they lack the relevant knowledge. This is partly due to their pretraining on broad text corpora, which encourages answering regardless of accuracy. Moreover, most models do not capture “epistemic uncertainty” (the awareness of what they do not know), leading to hallucinations. Attempts to enforce abstention can also result in over-refusal, where models decline valid questions, severely reducing their utility. Their behavior can be inconsistent across different phrasings, and since LLMs lack real-time knowledge awareness, they cannot easily distinguish between known and unknown domains. These challenges highlight the need for novel methods that help LLMs recognize and express their uncertainty reliably and consistently.
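
As a concrete illustration of the logit-based confidence thresholds mentioned above, the following minimal Python sketch gates an already-generated answer on its mean token log-probability. The threshold value, the fallback message, and the function names are illustrative assumptions rather than part of any particular system described in this project.

# Minimal sketch of a logit-based abstention gate (illustrative only).
# Assumes per-step logits are already available from a decoder; the
# threshold is a hypothetical placeholder, not taken from the project text.
import numpy as np

def sequence_confidence(step_logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Mean log-probability of the emitted tokens under the model's logits.

    step_logits: (T, V) array of pre-softmax logits, one row per generated token.
    token_ids:   (T,)  array of the token ids that were actually emitted.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = step_logits - step_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    chosen = log_probs[np.arange(len(token_ids)), token_ids]
    return float(chosen.mean())

def answer_or_abstain(answer: str, step_logits: np.ndarray, token_ids: np.ndarray,
                      threshold: float = -1.5) -> str:
    """Return the answer only if its mean token log-prob clears the threshold."""
    if sequence_confidence(step_logits, token_ids) < threshold:
        return "I don't know."
    return answer

# Toy usage with random logits standing in for a real decoder's output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))        # 5 generated tokens, vocabulary of 100
tokens = rng.integers(0, 100, size=5)
print(answer_or_abstain("Paris.", logits, tokens))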

Exploration Plan: The project will explore several novel ideas as follows. First, we will extend RLHF to RLHF-A (Reinforcement Learning from Human Feedback with Abstention), where human evaluators assess not only the quality of the generated content but also the model’s decision to abstain or respond. Given sufficient feedback, much as the GPT series has shown remarkable improvement in alignment, we anticipate that LLMs can develop a user-aligned abstention profile, fine-tuned for different domains, users, and contexts. Second, instead of applying a confidence threshold to a single output, LLMs may run multiple lightweight decoders in parallel (with noise injection or dropout) to generate variations. This idea resembles ensemble techniques in data mining, which combine multiple (weak) models to make predictions that are more accurate and robust than any single model can alone. If the outputs of the parallel decoders are semantically inconsistent, the model can flag the response as uncertain and abstain. Third, before generating content, an LLM could internally ask itself: “Based on my response, what might the original question have been?” In a sense, this idea resembles a classical technique detectives use to detect lies: asking a suspect to describe an event in reverse order. If the LLM fails to reconstruct the original intent, the mismatch can serve as a signal of insufficient understanding and trigger abstention. The final output could then be suppressed or replaced with “I don’t know” based on semantic divergence rather than token-level probability.
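
To make the second idea more concrete, here is a minimal Python sketch of consistency-based abstention under simplifying assumptions: the generate callable, the lexical Jaccard-overlap agreement measure, and the thresholds are illustrative placeholders. A real implementation would instead sample from one model with dropout or noise injection and judge agreement with a semantic similarity model.

# Minimal sketch of the "parallel decoders" idea (second exploration item).
# Assumes an arbitrary generate(prompt, seed) callable returning one sampled answer.
from collections import Counter
from itertools import combinations
from typing import Callable, List

def jaccard(a: str, b: str) -> float:
    """Crude lexical proxy for semantic agreement between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def consistency_abstain(prompt: str,
                        generate: Callable[[str, int], str],
                        n_samples: int = 5,
                        agreement_threshold: float = 0.5) -> str:
    """Sample several answers; abstain when they disagree with one another."""
    answers: List[str] = [generate(prompt, seed) for seed in range(n_samples)]
    pairs = list(combinations(answers, 2))
    mean_agreement = sum(jaccard(a, b) for a, b in pairs) / max(len(pairs), 1)
    if mean_agreement < agreement_threshold:
        return "I abstain: my sampled answers are not consistent."
    # Otherwise return the most common answer among the samples.
    return Counter(answers).most_common(1)[0][0]

# Toy usage: a fake generator that is consistent on one question and not the other.
def fake_generate(prompt: str, seed: int) -> str:
    if "capital of France" in prompt:
        return "The capital of France is Paris."
    return ["It was 1947.", "Probably 1963.", "Around 1921.", "In 1988.", "Maybe 1902."][seed]

print(consistency_abstain("What is the capital of France?", fake_generate))
print(consistency_abstain("When was the fictional treaty signed?", fake_generate))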

Objectives: The project aims to explore the ideas above and produce a prototype with preliminary results. The participating junior researchers will have an opportunity to contribute to scientific publications in top AI venues, while the PI aims to use the preliminary findings to pursue an external grant from the NSF.