Anna D. 2026 | BASIS Independent Silicon Valley
- Project Title: Mechanistic Remedies to LLM Hallucination
- BASIS Independent Advisor: Movshovitz
- Internship Location: Algoverse
- Onsite Mentor: Kevin Zhu, Director, Algoverse
While large language models are increasingly relied on by billions of users, they are prone to hallucination: sounding confident while fabricating "facts." Because these models are trained and rewarded to always deliver an answer, they are pressured to produce a believable response even when they do not know the correct one. To build trust between users and generative AI, this issue must be addressed. While current research mainly focuses on either reducing the probability of hallucination or accurately detecting it, my project aims to create a mechanistic pipeline that does both: limiting hallucinations and informing the user when one likely occurs. Many studies have shown that an LLM's internal representations encode substantial information about its generations, including how and when it hallucinates. Mechanistically interpreting and manipulating these representations can help control hallucination. By combining mitigation and detection strategies, including steering, prompting, and activation probing, I hope to develop a more robust system for combating hallucinations in open-source LLMs. I aim to simultaneously achieve a higher rate of hallucination detection and a lower rate of hallucination itself, and to analyze which combination of methods best achieves this.
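To make the detection half concrete, here is a minimal sketch of activation probing: a linear classifier trained on a model's hidden states to flag likely hallucinations. The model (`gpt2`), the probed layer, and the two toy labeled examples are illustrative assumptions, not the project's actual setup.

```python
# Minimal activation-probing sketch (illustrative placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder open-source model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # hypothetical middle layer to probe

def last_token_activation(text: str) -> torch.Tensor:
    """Return the hidden state of the final token at the probed layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Hypothetical labeled data: 1 = hallucinated statement, 0 = faithful one.
texts = ["The Eiffel Tower is in Rome.", "The Eiffel Tower is in Paris."]
labels = [1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At inference time, the probe's probability can flag likely hallucinations
# so the user can be warned about a given generation.
print(probe.predict_proba(X)[:, 1])
```

In practice the probe would be trained on many generations with ground-truth hallucination labels, and its score would serve as the "informing the user" signal in the pipeline.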

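For the mitigation half, a common mechanistic approach is activation steering: adding a direction to the residual stream during generation. The sketch below continues the probing sketch above (it reuses `model`, `tokenizer`, `LAYER`, and `last_token_activation`); the "faithful minus hallucinated" direction and the steering strength are illustrative assumptions.

```python
# Minimal activation-steering sketch, continuing the probing sketch above.
import torch

# Hypothetical steering direction from a pair of contrasting examples.
direction = (
    last_token_activation("The Eiffel Tower is in Paris.")
    - last_token_activation("The Eiffel Tower is in Rome.")
)
direction = direction / direction.norm()
ALPHA = 4.0  # hypothetical steering strength

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # adding the scaled direction nudges the residual stream at this layer.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tokenizer("The Eiffel Tower is in", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=5, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls run unsteered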