Week 7 - Bugs
April 11, 2025
Welcome back to another edition of my blog! This week, unfortunately, went a little slower than expected due to several bugs that consumed my time. However, the past week has made it clear that preprocessing the data and processing the data specifically for the model are two separate and important tasks. Lesson learned: I didn’t devote enough time to the second one.
While repeatedly running my model on the numerical data, I realized that it was overfitting heavily, no matter how I set the hyperparameters. After hours of exploring and debugging, I discovered that the grids for my features and labels didn’t even match up! The cells in the feature grid were about 69 miles across (roughly one degree of latitude), because I forgot to convert from degrees to miles when extracting the data from Google Earth Engine! This also meant the correctly gridded dataset would be significantly larger than I had expected, and to redownload it, I would need to break up my exports by year to work around the Google Earth Engine data limit (at this point, I also decided to just use data at weekly intervals). Once I figured out how to do that, I would also need to compress the data (yes, I need to be extremely conservative with storage, because I only have around 20 GB left on my Mac). After a little bit of research, I decided to use the .parquet file format in place of .csv, which significantly reduced the file size for me.
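To make the unit mix-up concrete, here is a minimal sketch of converting a cell size in miles into degrees. It is not my actual export code: the function name, the 69-miles-per-degree constant, and the example latitude of about 39.7°N for Butte County are all assumptions for illustration.

```python
import math

# Approximate length of one degree of latitude in miles (assumed constant).
MILES_PER_DEGREE_LAT = 69.0


def cell_size_in_degrees(cell_miles, latitude_deg):
    """Return (lat_step, lon_step) in degrees for a square cell of side cell_miles.

    A degree of longitude shrinks with cos(latitude), so the longitude
    step has to be larger than the latitude step away from the equator.
    """
    lat_step = cell_miles / MILES_PER_DEGREE_LAT
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg))
    lon_step = cell_miles / miles_per_degree_lon
    return lat_step, lon_step


# Hypothetical example: a 1-mile cell near 39.7 degrees north.
lat_step, lon_step = cell_size_in_degrees(1.0, 39.7)
print(f"1-mile cell: {lat_step:.5f} deg lat x {lon_step:.5f} deg lon")
```

Passing a plain `1` where degrees are expected gives a cell about 69 miles tall, which is the kind of mistake described above.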
The next step was even more challenging: I had to make sure the grids of the two datasets lined up again. I wasn’t careful with this step the first time around, which led to a lot of wasted time. So I switched from the Butte County shapefile to a simple polygon, checking at every step that the number of squares in each grid matched exactly.
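A check like this can be sketched in a few lines. This is only an illustration under simplifying assumptions, not my real pipeline: it treats the polygon as a plain bounding box, and the helper names `grid_cells` and `check_alignment` are made up for the example.

```python
def grid_cells(min_lon, min_lat, max_lon, max_lat, step):
    """Return the set of (row, col) cell indices covering a bounding box."""
    n_cols = round((max_lon - min_lon) / step)
    n_rows = round((max_lat - min_lat) / step)
    return {(r, c) for r in range(n_rows) for c in range(n_cols)}


def check_alignment(feature_cells, label_cells):
    """Return the cells found in only one grid; two empty sets mean the grids match."""
    only_in_features = feature_cells - label_cells
    only_in_labels = label_cells - feature_cells
    return only_in_features, only_in_labels


# Hypothetical example: the same box gridded at two different step sizes.
features = grid_cells(0.0, 0.0, 1.0, 1.0, 0.25)
labels = grid_cells(0.0, 0.0, 1.0, 1.0, 0.5)
extra_features, extra_labels = check_alignment(features, labels)
print(len(features), len(labels), len(extra_features), len(extra_labels))
```

Comparing cell counts catches the obvious mismatches, while comparing the actual cell indices also catches grids that have the same size but are shifted.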
After processing the numerical data, I was much more careful with the image data this time around, and getting it ready for the model required me to write several Python scripts. I wrote code to process the .tif image files, using metadata from the accompanying .met files to crop each image to the correct geographic bounds. The metadata provided key information such as the geographic coordinates of the image corners, which allowed me to map the raster data correctly onto the region of Butte County. This step involved some trial and error with the cropping function, but after some debugging, the script successfully cropped all of the images to the desired area, leaving only the pixels the model actually needs.
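The core idea of the cropping step, turning corner coordinates into pixel rows and columns, can be sketched like this. It is a simplified stand-in rather than my actual script: it assumes a north-up, unrotated image whose bounds are already known as (west, south, east, north), it uses a nested list in place of a real raster, and the function names are hypothetical. Parsing the .met files is not shown.

```python
import math


def crop_window(img_bounds, crop_bounds, n_rows, n_cols):
    """Return (row_start, row_stop, col_start, col_stop) for a geographic crop.

    Both bounds are (west, south, east, north). Row 0 is the northern edge.
    The window is clamped to the image and rounded outward to whole pixels.
    """
    west, south, east, north = img_bounds
    c_west, c_south, c_east, c_north = crop_bounds
    px_w = (east - west) / n_cols   # pixel width in map units
    px_h = (north - south) / n_rows  # pixel height in map units
    col_start = max(0, math.floor((c_west - west) / px_w))
    col_stop = min(n_cols, math.ceil((c_east - west) / px_w))
    row_start = max(0, math.floor((north - c_north) / px_h))
    row_stop = min(n_rows, math.ceil((north - c_south) / px_h))
    return row_start, row_stop, col_start, col_stop


def crop(image, window):
    """Slice a 2D list of pixel values with a window from crop_window."""
    r0, r1, c0, c1 = window
    return [row[c0:c1] for row in image[r0:r1]]


# Hypothetical example: a 10 x 10 image covering a 10 x 10 degree box.
image = [[r * 10 + c for c in range(10)] for r in range(10)]
window = crop_window((0, 0, 10, 10), (2, 3, 5, 7), 10, 10)
cropped = crop(image, window)
print(window, len(cropped), len(cropped[0]))
```

Most of my trial and error came from details like the ones handled here: rows count down from the north edge, and the crop has to be clamped so it never reaches outside the image.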
I may work a little over spring break next week to try to make up for the lost time, and if all goes well, my model should be good to go at the beginning of the next week of senior project. See you in two weeks!