Week 8: Wrapping up Phase 2
April 24, 2026
Hello there! Week 8 was crazy: I bounced between building the last few scripts and making progress on my paper and figures. Last time, I organized live transcriptions of Telugu-English text captured while the audio files played. This week, I re-tested those scripts to understand and record their performance limits.
I revisited my implementation of MUSE and cosine similarity (see the earlier description in Week 3, during Phase 1) to check whether meaning stayed similar between the mixed transcription and the translated segments. However, unlike earlier scripts, where I could analyze the translation of a whole audio file, playing the clips “live” in chunks meant I needed a separate format for storing dialogue across timestamps. I ran MUSE alongside the live capture script so I could verify the results continuously as they were produced. I was happy to see that rapid transcription and translation didn’t drastically affect protocol performance compared to the Phase 1 results (though MUSE did skip over some 30-second segments).
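To make the per-chunk check concrete, here is a minimal sketch of how time-stamped chunks and their cosine similarity could be stored and scored. It is an illustration under assumptions, not the project's actual script: the `Chunk` fields, the function names, and the tiny example vectors are all hypothetical, and in practice the vectors would come from an embedding model such as MUSE.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Chunk:
    """One live-captured segment. Field names are hypothetical, for illustration only."""
    start_s: float            # chunk start time in seconds
    end_s: float              # chunk end time in seconds
    mixed_vec: list[float]    # embedding of the mixed Telugu-English transcription
    english_vec: list[float]  # embedding of the translated English text


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| |b|). Returns None if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None  # e.g. a chunk that produced no usable words
    return dot / (norm_a * norm_b)


def score_chunks(chunks):
    """Return (start_s, end_s, similarity) for every chunk, in time order."""
    ordered = sorted(chunks, key=lambda c: c.start_s)
    return [
        (c.start_s, c.end_s, cosine_similarity(c.mixed_vec, c.english_vec))
        for c in ordered
    ]


# Made-up vectors, just to show the shape of the data.
example = [
    Chunk(30.0, 60.0, [0.0, 0.0], [0.2, 0.9]),  # skipped chunk: no embedding
    Chunk(0.0, 30.0, [3.0, 4.0], [3.0, 4.0]),   # identical direction
]
for start, end, sim in score_chunks(example):
    print(f"{start:>5.1f}-{end:>5.1f}s  similarity={sim}")
```

Returning `None` for an empty chunk, rather than 0, keeps skipped segments distinguishable from segments whose meaning really did diverge.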
After the final testing, I started working on my other results tables and graphs. Since before the project started, I had wanted to compare the protocol’s performance against human-transcribed and translated text from the same interviews. During my first week of work, my external advisor handed me those scripts for reference. I already had the difference in cosine similarity across both transcription methods. The metrics I initially used were METEOR (Metric for Evaluation of Translation with Explicit ORdering) and BLEU (Bilingual Evaluation Understudy). METEOR accounts for exact word matches, word stems, and synonyms, while BLEU checks how many short word sequences (n-grams) in the machine output also appear in the reference, which makes it sensitive to word order.
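As a rough illustration of how strict n-gram matching is, the sketch below computes the clipped n-gram precision that sits at the core of BLEU. It is a simplified stand-in, not the full metric (it omits the brevity penalty and the averaging over several n-gram sizes), and the two example sentences are invented for the demonstration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate is shorter than n words
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Invented sentences: same meaning, different wording and order.
human = "I went to the market yesterday"
machine = "Yesterday I visited the shop"

print(clipped_precision(machine, human, 1))  # 0.6 (3 of 5 words match)
print(clipped_precision(machine, human, 2))  # 0.0 (no two-word sequence matches)
```

Even though both sentences say nearly the same thing, no two-word sequence survives the rewording, which is the kind of effect that drags scores down when speech-to-text skips or reinterprets words.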
While I initially thought they were the best options for measuring my baseline protocol performance, I didn’t account for future challenges, including how often the machine skips or reinterprets words in speech-to-text. METEOR and BLEU are very strict about word-by-word similarity, which made the calculated values extremely low. I felt that, at this stage of the protocol’s development, I needed metrics that compare similarity of meaning rather than exact word-by-word accuracy. After some more research, I found fastText cosine. It applies the same general cosine similarity formula, but to fastText word embeddings, which are built from pieces of words and so can handle misspellings and words the model has not seen before. The other metric I found was BERTScore F1, which measures how similar the file or live translated English is to the human text, using a transformer model’s representation of each word in context.
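For intuition about what an embedding-based F1 measures, here is a toy sketch of the greedy matching idea behind BERTScore: every token is paired with its most similar token on the other side, and those best-match cosine similarities are averaged into precision, recall, and F1. This is an assumption-laden simplification, with hand-made two-dimensional vectors standing in for real contextual embeddings and without the optional token weighting or score rescaling the real metric offers. The function names are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def greedy_match_f1(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from best-match cosine similarities.

    precision: average best match of each candidate token against the reference
    recall:    average best match of each reference token against the candidate
    """
    if not candidate_vecs or not reference_vecs:
        return 0.0, 0.0, 0.0
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall <= 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy token vectors: the machine output covers only one of two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]

p, r, f1 = greedy_match_f1(candidate, reference)
print(p, r, round(f1, 3))  # 1.0 0.5 0.667
```

In this toy case a skipped word lowers recall but leaves precision intact, so the score drops gradually instead of collapsing the way an exact-match metric can.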
The results from BERTScore and fastText were much more promising, with file transcription only slightly outperforming live transcription. As an example, on a scale of 0 to 1, values for the Rashimka file were between 0.7 and 0.9, indicating strong alignment. However, the word skipping and misinterpretation issues I had been analyzing throughout this whole process showed up again in the METEOR and BLEU results. In the coming weeks, I will focus on the last bit of refining so my paper presents the most polished protocol possible within this Senior Project timespan.
Comments

Hey Raghav! Haha, I completely relate to the chaos of balancing live code scripts with drafting the final paper.
Your pivot is such a smart, pragmatic data science decision. It reminds me a lot of product engineering: sometimes you realize your initial KPIs are measuring the wrong thing, and you have to adjust your metrics to measure actual value (or in your case, meaning) rather than rigid perfection. Since conversational human speech is naturally so messy, punishing the model for missing a filler word never really made sense anyway.
Good luck with the final refinements for the paper!