Week 2: Llamas can reason, but slowly - Testing reasoning methods and fighting model limitations
March 8, 2025
Hello and welcome back to another edition of my blog!
This week was an interesting one, full of twists and turns. I began implementing reasoning techniques to see which methods of augmenting the base model were most effective at generating accurate, confident responses. I worked on three methods: chain-of-thought, self-consistency, and chain-of-verification.
Chain-of-thought prompts the model with similar example questions whose answers spell out sound step-by-step reasoning, encouraging the model to reason the same way and produce more thought-out answers. Chain-of-verification has the model check its own work: it drafts an answer, generates verification questions about that draft, answers them independently, and uses the results to produce a more confidently accurate final response. Finally, self-consistency aggregates multiple independent model runs and takes the consensus response, reducing one-off or probabilistic errors (see the sketch below).
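To make this concrete, here's a minimal sketch of self-consistency layered on top of a chain-of-thought prompt. The `generate` stub, the example prompt, and the answer-extraction regex are all illustrative assumptions of mine, not the exact code from my project:

```python
import re
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Stand-in for a model call (local Llama or an API); plug in your own."""
    raise NotImplementedError

# Chain-of-thought: one worked example with explicit reasoning,
# then the real question.
COT_PROMPT = """Q: A class has 12 girls and 8 boys. How many students are in the class?
A: There are 12 girls and 8 boys, so 12 + 8 = 20. The answer is 20.

Q: {question}
A:"""

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    """Sample several independent reasoning paths and majority-vote the answers."""
    answers = []
    for _ in range(n_samples):
        completion = generate(COT_PROMPT.format(question=question))
        # Pull out the final "The answer is X" span from each completion.
        found = re.findall(r"The answer is (.+?)(?:\.|$)", completion)
        if found:
            answers.append(found[-1].strip())
    if not answers:
        raise ValueError("no parsable answers; check the prompt format")
    # Consensus vote smooths out one-off sampling errors.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```

The nice part of this setup is that the voting wrapper doesn't care what sits behind `generate`, so the same code works whether the model runs locally or through an API.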
Implementing the methods was the smooth part. However, I quickly realized that running the models locally on my measly CPU was not going to be practical. With longer prompts and multiple back-and-forth calls between models, it would take minutes just to answer a single question. I pivoted to the Groq API service as a temporary fix. Groq has designed a custom Language Processing Unit (LPU), a chip built specifically to run language-model inference at lightning speed. However, the free tier has strict rate limits that I still need to work around (a simple retry-with-backoff, sketched below, is one option).
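For anyone curious, the calls themselves are simple, since the official `groq` Python client mirrors the OpenAI SDK. Below is a rough sketch of how a wrapper with naive exponential backoff might look; the model name and the retry numbers are my own guesses, not a tuned configuration:

```python
import os
import time
from groq import Groq, RateLimitError

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def ask(prompt: str, model: str = "llama3-8b-8192") -> str:
    """Single chat completion with naive exponential backoff on rate limits."""
    delay = 2.0
    for _ in range(5):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Free-tier limits: wait and retry with a growing delay.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("still rate-limited after 5 attempts")
```

Backoff keeps single questions working, but it won't save self-consistency, which multiplies the number of calls per question; that's the part I still need to figure out.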
Progress is coming along well, but I will have to deal with the compute limitations eventually. Next week, I'll explore quantized Llama-3 models, which store their weights at lower numerical precision so they can fit and run on my machine (rough sketch in the postscript below), along with other free options. Once that's in place, I'll begin benchmarking the capabilities of the different methods. Finally, I'll start on fine-tuning, which will expose our LLM to a large swath of AP-specific data. I'll begin scraping together various AP documents to give the LLM good background and guidelines for how to answer. (But remember: we can only give the LLM so much due to compute constraints, so we need to be efficient with what we provide!) This is the first step toward the AP-specific domain knowledge of our final system. Stay tuned!
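P.S. Here's roughly what the quantized route might look like using llama-cpp-python with a 4-bit GGUF build of Llama-3 8B. This is an untested sketch; the file path and settings are placeholders, not a working setup:

```python
from llama_cpp import Llama

# Hypothetical path to a 4-bit GGUF quantization of Llama-3 8B Instruct.
# Quantized weights use fewer bits per parameter, so the model fits in
# far less RAM and runs tolerably on CPU.
llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,    # context window size
    n_threads=8,   # tune to the number of CPU cores available
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Briefly explain photosynthesis."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```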