Week 1: Laying the Groundwork - Building a Uniform Testing System

February 28, 2025

Welcome back to my blog for the first real week of work!

This week, I focused on putting the infrastructure for the project in place. I started by importing the Llama 3 model family in all of its variants. Then I began building a testing pipeline that feeds questions to the model and collects its answers. The pipeline can augment the input with reasoning strategies and change parameters such as temperature or model type. I’ve also implemented a save function that writes the results to a file on my Google Drive, so I can analyze the differences between generations. I’ve tested the whole system end to end, and it is modular and robust. That means I can test different subjects and methods by changing a few lines of code and running it again! This will help me stay organized and keep testing uniform.
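To make the idea concrete, here is a minimal sketch of how a pipeline like this could be organized. It is not the project's actual code: the names `RunConfig`, `build_prompt`, `run_experiment`, and `fake_generate`, the model label, and the prompt wording are all hypothetical, and the model call is replaced by a stub so the example runs without any weights. It also assumes the Drive folder is reachable as an ordinary file path.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable

@dataclass
class RunConfig:
    """Everything that varies between runs lives here (all values are illustrative)."""
    model_name: str = "llama-3-8b-instruct"  # hypothetical label, not a real checkpoint id
    temperature: float = 0.7
    strategy: str = "none"

# Hypothetical prompt prefixes, one per reasoning strategy.
STRATEGY_PREFIXES = {
    "none": "",
    "chain_of_thought": "Think through the problem step by step before answering.\n\n",
}

def build_prompt(question: str, strategy: str) -> str:
    """Augment the raw question with the chosen reasoning strategy."""
    return STRATEGY_PREFIXES[strategy] + question

def run_experiment(
    questions: list[str],
    config: RunConfig,
    generate: Callable[[str, RunConfig], str],
    out_path: Path,
) -> dict:
    """Run every question through the model and save config plus results as JSON."""
    records = []
    for question in questions:
        prompt = build_prompt(question, config.strategy)
        records.append(
            {"question": question, "prompt": prompt, "answer": generate(prompt, config)}
        )
    payload = {"config": asdict(config), "results": records}
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, indent=2))
    return payload

def fake_generate(prompt: str, config: RunConfig) -> str:
    """Stand-in for a real model call, so the sketch runs anywhere."""
    return f"[{config.model_name} @ T={config.temperature}] stub answer"
```

With this layout, switching strategies or temperature means changing only the `RunConfig` values, for example `RunConfig(strategy="chain_of_thought", temperature=0.2)`, and calling `run_experiment` again with a new output path.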

I’ve also been building a strong foundation by reading (and re-reading!) papers that might be useful for my project. These range from the Meta Llama 3 technical report to seminal techniques like Chain-of-Thought (having the model reason in steps) and Self-Consistency (aggregating multiple model generations). I’ve researched how to build a fine-tuning pipeline, which will be key when we want the model to understand the College Board’s AP-specific guidelines and requirements. Finally, I’ve read about strategies like Chain-of-Verification (having the model double-check its outputs and provide evidence for them), which aim to reduce a model’s hallucination rate. With all this reading behind me, I feel confident about implementing these strategies next week and experimenting with their parameters and ideas.
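The aggregation step in Self-Consistency is simple enough to sketch. The snippet below assumes the final answers have already been extracted from several sampled generations (for example, multiple-choice letters) and picks the most common one by majority vote. The function name `self_consistent_answer` and the normalization rule are illustrative assumptions, not something taken from the paper or from this project's code.

```python
from collections import Counter

def self_consistent_answer(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of samples that agreed with it.

    Answers are compared after trimming whitespace and lowercasing;
    empty answers are ignored.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no non-empty answers to aggregate")
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Five sampled generations for the same multiple-choice question.
samples = ["B", "b ", "C", "B", "A"]
answer, agreement = self_consistent_answer(samples)
```

The agreement share is a handy by-product: a low value suggests the model is unsure, which could be one signal to look at when tracking hallucinations.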

Next week, I’ll begin testing different reasoning strategies to see how well they work. I’ll also be keeping a close eye on the hallucination rate, making sure my project gives the user the most accurate information possible. While the system doesn’t need to be right every time, it needs to recognize when it might be wrong instead of making up information to pretend it understands. Using overall accuracy and hallucination rate as metrics, I’ll determine the best strategies and combinations to carry forward in the project. See you next week!
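One possible way to turn these two metrics into numbers is sketched below. It assumes each response has already been graded by hand as `"correct"`, `"incorrect"`, or `"abstained"` (the model said it did not know), and it treats every confidently wrong answer as a hallucination. That definition, the label names, and the function `score_run` are assumptions made for illustration; the project may end up measuring hallucination differently.

```python
from collections import Counter

VALID_LABELS = {"correct", "incorrect", "abstained"}

def score_run(labels: list[str]) -> dict[str, float]:
    """Summarize graded responses as accuracy, hallucination, and abstention rates."""
    if not labels:
        raise ValueError("no graded responses to score")
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {
        "accuracy": counts["correct"] / total,
        # Assumption: a wrong answer given without abstaining counts as a hallucination.
        "hallucination_rate": counts["incorrect"] / total,
        "abstention_rate": counts["abstained"] / total,
    }

graded = ["correct", "correct", "incorrect", "abstained"]
metrics = score_run(graded)
```

Keeping abstentions as their own category rewards a strategy that admits uncertainty over one that guesses wrong, which matches the goal of recognizing when the system might be mistaken.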

View more of Ryan L.'s posts.
