Week 5: Generate, Generate, Generate
March 27, 2026
Hello readers! Welcome back! Last week, my focus was on creating the set of test questions for my models, organizing them into a spreadsheet, and writing a Colab script to send those questions to both my base and fine-tuned models and log their responses. Though I completed the first two steps, the last wasn’t as productive as I’d hoped because I was working through some bugs in my code.
Fortunately, with some help from my external advisor, I got the code running this past weekend! The script reads all 140 questions from my spreadsheet, sends each one to both models, and places their generated answers into dedicated columns in the same sheet. This saved me loads of time (and sanity), since it eliminated the need to manually paste 140 questions into two models (280 total!) and then copy and paste their answers back into the spreadsheet.
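The core of that script is a simple loop: read each question, ask both models, and write the answers into new columns. Here's a minimal sketch of that idea; all the names are my own for illustration, and `query_model` is a stub standing in for the real Gemini API call so the sketch runs offline:

```python
import pandas as pd

def query_model(model_name: str, question: str) -> str:
    """Placeholder for the real Gemini API call (e.g. via the
    google-generativeai library). Returns a canned string here so
    this sketch runs without credentials."""
    return f"[{model_name} answer to: {question}]"

def answer_all(df: pd.DataFrame) -> pd.DataFrame:
    """Send every question to both models and record the answers
    in dedicated columns, mirroring the spreadsheet layout."""
    df = df.copy()
    df["base_answer"] = df["question"].apply(lambda q: query_model("base", q))
    df["tuned_answer"] = df["question"].apply(lambda q: query_model("fine-tuned", q))
    return df

# Two sample questions in place of the full 140-row sheet.
questions = pd.DataFrame({"question": ["What is a tabla?", "What is a bol?"]})
answers = answer_all(questions)
```

In the real script the spreadsheet read/write replaces the toy DataFrame, but the loop structure is the same.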
Just to recap, I now have the answers both models produced to the exact same set of tabla-related questions. One model is the standard base Gemini model, while the other has been fine-tuned using additional information about tabla. My project now moves into the evaluation phase, where I test which model produces more accurate and more intelligible answers.
As I discussed in last week’s post, the evaluation process uses a third-party “LLM-as-a-judge.” This means I give a separate AI model all the information it needs to evaluate the models’ answers objectively. I explained my reasoning for using an AI judge instead of human evaluators in my Week 3 blog post.
This week, I began writing the evaluation script. I chose Gemini 1.5 Flash as the judging model because it’s fast, free, and can handle long contextual prompts. The script constructs a detailed instruction prompt that tells the judging model exactly how to evaluate each answer.
The prompt begins with a role description, something like: “You are an expert AI judge evaluating an AI-generated answer for accuracy.” This tells the model what perspective it should adopt. It then defines the task: for example, “Assess the AI answer based on the original question, the expected answer, and specific grading criteria. Provide a score from 1–5 along with a short explanation for your rating.” The explanation portion is especially important because it allows me to verify that the judge’s reasoning actually makes sense.
The prompt also includes several pieces of context:
- the original question asked to the models
- the model-generated answer being evaluated
- the grading criteria from my rubric
- a sample accurate answer I created last week using my dataset and NotebookLM
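Putting the role, task, and those four pieces of context together, the prompt assembly looks roughly like this (the function name and exact wording are illustrative, not my script's actual text):

```python
def build_judge_prompt(question: str, answer: str, criteria: str,
                       reference_answer: str) -> str:
    """Assemble the full evaluation prompt: role, task, then context."""
    return (
        # Role description: the perspective the judge should adopt.
        "You are an expert AI judge evaluating an AI-generated answer "
        "for accuracy.\n"
        # Task definition: what to produce and in what form.
        "Assess the AI answer based on the original question, the expected "
        "answer, and the grading criteria. Provide a score from 1-5 along "
        "with a short explanation for your rating.\n\n"
        # Context: the four pieces listed above.
        f"Question: {question}\n"
        f"AI answer: {answer}\n"
        f"Grading criteria: {criteria}\n"
        f"Reference answer: {reference_answer}\n"
    )

prompt = build_judge_prompt(
    "What is a tabla?",
    "A pair of hand drums from the Indian subcontinent...",
    "Answer must correctly describe the instrument's construction.",
    "The tabla is a pair of drums: the dayan and the bayan...",
)
```

One prompt like this gets sent to the judging model for every answer being evaluated.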
To say the least, this is a very long prompt; it’s much longer than anything I’ve previously asked Gemini to process. Even though the evaluation task itself is straightforward, the judge needs a lot of context to ensure the results are objective and consistent.
I plan to run the script four times:
- Evaluate the fine-tuned model’s answers for accuracy
- Evaluate the base model’s answers for accuracy
- Evaluate the fine-tuned model’s answers for instructional quality
- Evaluate the base model’s answers for instructional quality
Each run evaluates all 140 answers, so every iteration requires Gemini to process 140 evaluation prompts. Because of that, a single run takes one to two hours to complete.
You may be wondering: why is the Colab script necessary? Isn’t this just prompting an existing AI model? You’re right, it is. The reason for the script is again efficiency. It ensures I don’t have to construct each prompt manually by copying and pasting all the context listed above 560 times (140 answers × 4 runs), then loading the scores and explanations back into my spreadsheet.
This week, I ran two of the four iterations: both of the accuracy evaluations. After the scores and explanations were organized in my spreadsheet, however, I realized I had overlooked something important: My evaluation prompt asked the judge to assign a score from 1–5, but it didn’t include descriptions of what each number actually represents. In other words, I never defined the scale. Because of this, the judging model effectively created its own interpretation of the scale, which makes the evaluations more subjective than I’d like.
Next week, I’ll add clear definitions for each score in the prompt and rerun all of the evaluations. I expect this to take most of Week 6, since each iteration takes so long and I can only run about one per day without exceeding the request limits for a free Gemini model.
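The fix is to spell out what each number means and append those definitions to the prompt. Here's a sketch of what that addition could look like; the wording of each definition below is my own placeholder, not the final rubric text:

```python
# Illustrative 1-5 accuracy scale (placeholder wording) to append to the
# judge prompt so the model no longer invents its own interpretation.
SCALE = {
    5: "Fully accurate; no factual errors.",
    4: "Mostly accurate; one minor error or omission.",
    3: "Partially accurate; noticeable errors, but the core idea is correct.",
    2: "Largely inaccurate; major errors outweigh the correct content.",
    1: "Inaccurate or irrelevant to the question.",
}

def scale_text(scale: dict) -> str:
    """Render the scale as prompt text, highest score first."""
    lines = [f"{score}: {meaning}"
             for score, meaning in sorted(scale.items(), reverse=True)]
    return "Score definitions:\n" + "\n".join(lines)

definitions = scale_text(SCALE)
```

Appending `definitions` to the existing prompt anchors every score to an explicit meaning, which should make the judge's ratings more consistent across runs.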
Even though a few hours of work ended up being wasted this week (since I have to scrap the initial evaluations), I’m still happy with the progress I’ve been making. This was a pretty long post, so I truly appreciate those of you who made it to the end 🙂
Only one more week until a much-deserved spring break! See you all soon!