Week 4: Data Cleaning and RAM Security
April 7, 2026
This week at MITRE, I built a proof of concept demonstrating memory-level security vulnerabilities. When programs run on a computer, they temporarily store information in RAM (the computer’s short-term memory). The goal was to show the difference between encrypted and unencrypted memory when systems store sensitive information. I created a simulation that stores data in RAM and then demonstrates what an attacker could access in two scenarios: one where the memory is unencrypted (everything is readable as plain text, like reading a regular document) and one where it’s encrypted (the data is scrambled, so the attacker only sees random numbers).
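The two scenarios can be sketched with a few lines of Python. This is a simplified stand-in for the actual proof of concept (the secret value is made up, and XOR with a random keystream stands in for real memory encryption), but it shows the core contrast: the same buffer is either readable as plain text or indistinguishable from noise.

```python
# Simplified sketch of the memory-dump scenario (illustrative only):
# a secret sits in a byte buffer, and we compare what an attacker
# reading raw memory would see with and without encryption.
import secrets

secret = b"ssn=123-45-6789"  # hypothetical sensitive value held in RAM

# Scenario 1: unencrypted memory -- the raw bytes are readable as-is.
plaintext_dump = bytes(secret)
print(plaintext_dump)  # b'ssn=123-45-6789'

# Scenario 2: encrypted memory -- XOR with a random keystream stands in
# for hardware encryption; the dump looks like random bytes.
key = secrets.token_bytes(len(secret))
encrypted_dump = bytes(b ^ k for b, k in zip(secret, key))
print(encrypted_dump)  # random-looking bytes

# Only with the key (held by the hardware, not the attacker)
# can the original value be recovered.
recovered = bytes(b ^ k for b, k in zip(encrypted_dump, key))
```

In the real hardware-encryption case, the key never leaves the processor, which is why even physical access to the RAM doesn't help the attacker.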
The interesting part was understanding that the vulnerability isn’t necessarily in the software itself, but in the underlying infrastructure. Even secure applications can leak sensitive data if the machine they’re running on doesn’t have proper memory protection. In production environments, hardware-level encryption solutions (special security features built directly into the computer’s processor chips, like AMD SEV or Intel TDX) protect memory at the chip level. This means the data is scrambled automatically by the hardware itself, making it unreadable even to someone with physical access to the machine.
On the research side, I finished the data cleaning work this week. The Flash Eurobarometer dataset had some missing values scattered throughout; basically, some businesses didn’t answer certain survey questions. I used mean imputation for numerical variables (filling in missing numbers with the average of what other businesses reported) and mode imputation for categorical ones (filling in missing categories with whatever answer was most common). For variables where more than 30% of the data was missing, I removed them entirely rather than trying to fill in too many blanks, since that would just be guessing.
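The imputation rules above can be expressed compactly in pandas. This is a minimal sketch with made-up column names, not the actual Flash Eurobarometer variables:

```python
# Sketch of the cleaning rules: drop columns with >30% missing,
# mean-impute numerical columns, mode-impute categorical ones.
# Column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "employees": [5, 12, None, 30],              # numerical
    "sector": ["retail", None, "retail", "IT"],  # categorical
    "mostly_missing": [None, None, None, 1.0],   # >30% missing
})

# Remove any column where more than 30% of values are missing.
df = df.loc[:, df.isna().mean() <= 0.30]

# Fill the remaining gaps: mean for numbers, most common value otherwise.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
```

Running this drops `mostly_missing` entirely, fills the missing employee count with the column mean, and fills the missing sector with "retail" (the most common answer).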
I also created some derived features (new variables calculated from existing data) to make patterns easier to spot. For example, I calculated “years in business” by subtracting the founding year from the current year, and I created a digital maturity score by combining several technology adoption questions into one overall measure of how digitally advanced each business is.
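Both derived features are simple column arithmetic in pandas. The column names and the choice of three adoption indicators below are hypothetical, just to show the shape of the calculation:

```python
# Sketch of the two derived features (hypothetical column names).
import pandas as pd

df = pd.DataFrame({
    "founding_year": [2001, 2015, 2020],
    "uses_cloud": [1, 0, 1],   # technology-adoption questions, coded 0/1
    "uses_crm": [1, 1, 0],
    "has_website": [1, 1, 1],
})

CURRENT_YEAR = 2026
df["years_in_business"] = CURRENT_YEAR - df["founding_year"]

# Digital maturity: sum the adoption indicators (0 = none, 3 = all three).
adoption_cols = ["uses_cloud", "uses_crm", "has_website"]
df["digital_maturity"] = df[adoption_cols].sum(axis=1)
```

A simple sum treats each adoption question as equally important; a weighted or standardized score would be a natural refinement.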
The next step was filtering the dataset into two groups: micro-businesses with 1-9 employees and larger small businesses with 10-49 employees. This is the core of what makes my research different from existing studies. I split each group into training sets (80% of the data the model learns from) and test sets (20% of the data I save to evaluate how well the model works on businesses it hasn’t seen before). This prevents the model from just memorizing patterns and lets me see if it can actually make good predictions.
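The filtering and splitting step looks roughly like this with scikit-learn, assuming a pandas DataFrame with hypothetical `employees` and `adopted_ai` columns:

```python
# Sketch of the size filtering and 80/20 split (hypothetical columns).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "employees":  [3, 7, 15, 40, 2, 9, 22, 48, 5, 30],
    "adopted_ai": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Split the sample into the two size groups.
micro = df[df["employees"].between(1, 9)]    # micro-businesses: 1-9
small = df[df["employees"].between(10, 49)]  # small businesses: 10-49

# Hold out 20% of each group; a fixed random_state keeps the split
# reproducible across runs.
micro_train, micro_test = train_test_split(micro, test_size=0.2, random_state=42)
small_train, small_test = train_test_split(small, test_size=0.2, random_state=42)
```

Splitting each size group separately ensures the test set for the micro-business model contains only micro-businesses, and likewise for the larger group.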
I started building the logistic regression models to test the TOE framework hypotheses. Logistic regression is a statistical method that predicts yes/no outcomes (in this case, whether a business adopted AI) based on various factors. The idea is to see whether technology factors (like cloud computing), organization factors (like company size and skills), and environmental factors (like infrastructure quality) actually predict AI adoption, as the theory says they should, and whether those relationships differ between the smallest businesses and slightly larger ones. Next week, I’ll finish running these models and start comparing the results across the two business size groups.
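The basic shape of one such model is below. The data here is synthetic (the real predictors come from the survey), and the three columns just stand in for one factor from each TOE dimension:

```python
# Sketch of a TOE-style logistic regression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),    # technology: uses cloud computing (0/1)
    rng.integers(1, 50, n),   # organization: number of employees
    rng.uniform(0, 1, n),     # environment: infrastructure quality score
])

# Synthetic yes/no outcome loosely tied to the predictors.
logits = 1.5 * X[:, 0] + 0.05 * X[:, 1] + 2.0 * X[:, 2] - 2.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_)       # one coefficient per TOE factor
print(model.score(X, y))  # in-sample accuracy
```

Fitting the same specification separately on the micro-business and small-business training sets, then comparing coefficients and test-set accuracy, is what lets the two groups be contrasted.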
