Week 1: Starting with Data Preparation
February 27, 2026
Hi everyone, welcome back! This is the first official week of the project! When planning this project out, I decided to dedicate the first two weeks to something that doesn’t always get much attention in machine learning: the data. Before any models can be trained or evaluated, the datasets have to be cleaned, structured, and prepared in a form the models can use. This step is often overlooked because models tend to be the center of attention in research projects like this one. Working with data can also be very annoying, as it involves a lot of tedious work.
I have been working with deliberation data that combines conversation transcripts with pre- and post-discussion survey responses. While this sounds straightforward at first, real data quickly makes things more complicated. Transcripts need to be cleaned without losing important context, survey responses have to be consistently formatted and matched to participants, and small inconsistencies can easily carry through the rest of the pipeline if they aren’t handled early. So far, I have taken a first pass at the data-cleaning process by getting to know the data and deciding what I truly want to keep. The upcoming week will involve actually removing the data I don’t need so that I am left with a final dataset that can be used for training.
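To make the matching step concrete, here is a minimal sketch of pairing pre- and post-discussion survey rows by participant. It is only an illustration, not the project's actual code: the `participant_id` field, the `q1` column, and both function names are hypothetical, and it assumes each survey has been loaded as a list of dictionaries (for example with `csv.DictReader`).

```python
def normalize_id(raw_id):
    """Strip whitespace and unify case so ' p01' and 'P01 ' compare equal."""
    return str(raw_id).strip().upper()


def match_surveys(pre_rows, post_rows, id_field="participant_id"):
    """Pair pre- and post-survey rows by participant ID.

    Returns (matched, unmatched): `matched` maps each ID found in both
    surveys to {"pre": row, "post": row}; `unmatched` is a sorted list of
    IDs that appear in only one survey and need a manual look.
    """
    # Hypothetical field name; if an ID repeats, the last row wins.
    pre = {normalize_id(row[id_field]): row for row in pre_rows}
    post = {normalize_id(row[id_field]): row for row in post_rows}

    both = pre.keys() & post.keys()          # IDs present in both surveys
    matched = {pid: {"pre": pre[pid], "post": post[pid]} for pid in both}
    unmatched = sorted(pre.keys() ^ post.keys())  # IDs in exactly one survey
    return matched, unmatched


if __name__ == "__main__":
    pre_rows = [{"participant_id": " p01", "q1": "4"},
                {"participant_id": "P02", "q1": "2"}]
    post_rows = [{"participant_id": "P01 ", "q1": "5"},
                 {"participant_id": "p03", "q1": "3"}]
    matched, unmatched = match_surveys(pre_rows, post_rows)
    print(sorted(matched), unmatched)
```

Returning the unmatched IDs instead of silently dropping them is a deliberate choice in this sketch: it keeps small inconsistencies visible early, before they can carry through the rest of the pipeline.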
Alongside this, I have been reading Hands-On Machine Learning with Scikit-Learn and TensorFlow, which has been helpful in putting these early steps into context. One idea that comes up repeatedly in the book is how much impact data preparation has on the final results of a model. This is something I have already seen myself in other machine-learning projects I have worked on: many issues that look like modeling problems actually come from decisions made during data preprocessing. The reading has also reinforced the idea that preprocessing isn’t just a setup step, but a design choice. Decisions about how to handle missing values or represent survey scales need to be made carefully if results are going to be reliable and comparable.
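As a small illustration of why these are design choices, here is one possible way to encode an agreement scale while keeping missing answers explicit. This is a sketch under assumptions: the five-point labels, the list of "missing" markers, and the function name are all hypothetical and not taken from the actual surveys.

```python
# Hypothetical five-point agreement scale, mapped to ordered integers.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# Hypothetical markers that should count as "no answer".
MISSING = {"", "n/a", "na", "prefer not to answer"}


def encode_response(raw):
    """Convert one raw survey answer to an integer from 1 to 5, or None if missing.

    Unknown labels raise an error instead of being guessed, so formatting
    problems surface during cleaning rather than during modeling.
    """
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if text in MISSING:
        return None
    if text in LIKERT:
        return LIKERT[text]
    raise ValueError(f"Unrecognized survey response: {raw!r}")


if __name__ == "__main__":
    answers = [" Agree ", "N/A", "strongly disagree", None]
    print([encode_response(a) for a in answers])
```

Keeping missing answers as `None` rather than filling in a default such as the scale midpoint postpones the imputation decision, so it can be made once and consistently for both the pre- and post-discussion surveys.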
Overall, this first week has been mostly about understanding the data, with a lot of reading along the way. It has made clear how much careful data preparation matters for reliable results. With this foundation in place, week 2 will focus on turning these decisions into code by cleaning and refining the datasets for training and experimentation. See you then!
Comments

Rishi, I really like how you are emphasizing the importance of data preparation. It is often overlooked, but as you pointed out, it can make or break a machine learning project. Your careful approach to cleaning transcripts and matching survey responses shows a lot of thought, and it is great that you are connecting your work to concepts from Hands-On Machine Learning with Scikit-Learn and TensorFlow. Excited to see how your dataset evolves in Week 2 and how it sets the stage for model training!