Week 10: The Final Product
May 9, 2025
Welcome back to my blog!
This project has finally come to an end.
Here is my presentation if you want to watch it!
I wanted to share the final product that I created with you! So here it is:
The AI Data Analysis
I aimed to test 5 main things with my AI experiment:
- AI’s ability to figure out women’s body shapes
- AI’s ability to estimate clothing sizes (including the measurements for bottoms)
- AI’s ability to estimate women’s specific measurements, specifically their shoulder width, their waist measurement, and their hip measurement
- AI’s ability to figure out their seasonal color palette (which is based on one’s undertones, darkness of overall coloring, and the level of contrast in one’s features)
- Is ChatGPT 4o or o4-mini more effective?
These are the questions I asked ChatGPT (some of which are modified from the versions in my Week 8 post).
- What is her body shape?
- She is [height] and [weight]. What size clothing would best fit her?
- When she is buying bottoms, and she is given the option of petite, regular, and larger lengths, which ones should she choose?
- How many inches is her shoulder width?
- How many inches is her waist?
- What is her hip measurement in inches?
- What inseams in inches would work best on her if she wants to wear pants that were made to have a regular fit that go right above her ankles, right below her ankles, and right above her heels?
- What inseams in inches would work best on her if she wants to wear pants that were made to have a baggy fit that go right above her ankles, right below her ankles, and right above her heels?
- What inseams in inches would work best on her if she wants to wear pants that were made to have a tight fit that go right above her ankles, right below her ankles, and right above her heels?
- What skirt length should she wear for a maxi skirt or a midi skirt?
- What is her seasonal color palette?
I recruited 11 women of different body shapes, sizes, ethnic origins, and ages. Each of the participants filled out a survey, which asked them about their measurements and the sizes that they usually wear. I made all the questions in the survey mandatory, other than questions 4) to 6) (which were about specific measurements). However, all participants answered all of the questions. Additionally, they all submitted videos of themselves slowly turning 360°, stopping every 90°. From these videos, I captured 4 images of each participant (one each of their front, back, left side, and right side). For six of the participants, I gave both ChatGPT 4o and o4-mini the pictures and the videos (as separate trials) and had them answer my questions. For the other five participants, I gave both models just the pictures to answer my questions.
Data Analysis
Here were the 3 things I looked at:
- Whether the video or the pictures predicted measurements closer to reality
- Whether 4o or o4-mini predicted measurements that are closer to reality
- The variation within data sets
The shoulder, waist, and hip measurements predicted by ChatGPT
To compare ChatGPT 4o versus o4-mini and pictures versus videos, I calculated the error of each predicted measurement, meaning how far it was from the participant's real value. For 4o, the average errors for the shoulder, waist, and hip measurements were 1.856, 2.000, and 2.129 inches, respectively. For o4-mini, the average shoulder, waist, and hip measurement errors were 2.176, 2.694, and 1.771 inches, respectively. To compare these errors, I performed an independent samples t-test with a 95% confidence level. The p-values for shoulder, waist, and hip measurements were 0.616, 0.384, and 0.572, respectively. All of these p-values are above 0.05, which indicates that there is no statistically significant difference between 4o and o4-mini’s abilities to estimate measurements.
I repeated the same process to compare the efficacy of pictures in predicting measurements to that of videos. When pictures were used as input, the errors for shoulder, waist, and hip measurements were 1.886, 2.227, and 1.795, respectively. When videos were analyzed by ChatGPT, the shoulder, waist, and hip measurement errors were 2.254, 2.567, and 2.233, respectively. According to an independent samples t-test with a confidence level of 95%, the p-values for shoulder, waist, and hip measurements were 0.564, 0.683, and 0.567, respectively, all of which demonstrate that there is no statistically significant difference between the efficacy of video and picture inputs in determining body measurements.
Across all three measurement types, both ChatGPT models and both input types (pictures and videos) were off by approximately 2 to 3 inches. Thus, using these estimations to recommend clothing sizes can cause the buyer to purchase one or two sizes larger or smaller than their actual size, which can lead to increased returns and decreased customer satisfaction.
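As a rough sketch of how this kind of comparison can be set up (not the exact script used for this project), the Python below turns predictions into absolute errors and computes the equal-variance independent samples t statistic. The measurements are made up for illustration and are not the participants' data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def absolute_errors(predicted, actual):
    """Distance of each prediction from the real measurement, in inches."""
    return [abs(p - a) for p, a in zip(predicted, actual)]


def pooled_t_statistic(sample_a, sample_b):
    """Equal-variance independent samples t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical waist measurements in inches (NOT the study data).
actual = [28.0, 30.0, 32.0]
predicted_4o = [29.0, 32.0, 35.0]
predicted_o4_mini = [30.0, 33.0, 36.0]

errors_4o = absolute_errors(predicted_4o, actual)
errors_o4_mini = absolute_errors(predicted_o4_mini, actual)
t, df = pooled_t_statistic(errors_4o, errors_o4_mini)

print(f"mean error 4o: {mean(errors_4o):.3f}, o4-mini: {mean(errors_o4_mini):.3f}")
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The p-value then comes from a t-distribution with `df` degrees of freedom. A statistics library such as SciPy (`scipy.stats.ttest_ind`) can compute the statistic and the p-value in one call.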
Sizing estimates
When comparing clothing sizes, I separated numerical sizes from the small-medium-large sizing system, and I calculated the accuracy. In total, ChatGPT 4o and o4-mini incorrectly estimated participants’ sizes 76.5% and 70.6% of the time, respectively. I performed a two-proportion z-test with a confidence level of 95% to see which of the models was more effective. The p-value was 0.452, which means that there is no statistically significant difference in the efficacy of ChatGPT 4o and o4-mini in their ability to estimate sizing.
Additionally, ChatGPT 4o was equally effective in estimating numerical sizes and small-medium-large sizes. For both categories of sizing, it had a 23.5% accuracy rate. According to a two-proportion z-test with a confidence level of 95%, the p-value is 1.0, which demonstrates that there is no statistically significant difference in ChatGPT 4o’s ability to estimate small-medium-large sizes versus numerical sizes. Similarly, ChatGPT o4-mini correctly predicted small-medium-large sizes 35.3% of the time and numerical sizes 23.5% of the time. According to another two-proportion z-test with a confidence level of 95%, the p-value is 0.45, which also demonstrates that there is no statistically significant difference between o4-mini’s ability to estimate small-medium-large sizes versus numerical sizes.
Furthermore, I compared whether videos or pictures were more effective inputs for estimating sizing. I found that the accuracy of videos was 33.3% overall, and the accuracy of pictures was 22.7%. Based on a two-proportion z-test with a confidence level of 95%, the p-value was 0.343, which indicates that there is no statistically significant difference between the overall efficacy of videos and pictures at estimating sizing.
Moreover, when comparing clothing sizes, videos correctly predicted small-medium-large sizes 41.7% of the time and numerical sizes 25.0% of the time. I performed a two-proportion z-test with a 95% confidence level to determine whether video was more effective for one sizing system over the other. The p-value was 0.386, which means that there is no statistically significant difference in ChatGPT’s accuracy when using videos to estimate small-medium-large versus numerical sizes. Similarly, when pictures were used as input, ChatGPT correctly estimated both small-medium-large and numerical sizes 22.7% of the time. A two-proportion z-test with a 95% confidence level comparing these two estimates showed a p-value of 1.0, which demonstrates that there is no difference in performance across the two sizing systems when using pictures. I also compared videos to pictures for each sizing format. For small-medium-large sizes, video-based predictions had a 41.7% accuracy rate compared to 22.7% for pictures, with a p-value of 0.247. For numerical sizes, video predictions had a 25.0% accuracy rate compared to 22.7% for pictures, with a p-value of 0.881. In both cases, there was no statistically significant difference, suggesting that neither videos nor pictures provided a clear advantage in estimating clothing sizes.
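The two-proportion z-tests in this section can be sketched in a few lines of Python. This is a minimal version that assumes a pooled standard error and a two-sided test. The counts in the example (8 correct out of 24 video estimates, 10 correct out of 44 picture estimates) are inferred from the reported 33.3% and 22.7% accuracy rates rather than taken from the raw data, so treat them as an assumption.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test using the pooled proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Assumed counts: 8/24 correct with videos, 10/44 correct with pictures.
z, p_value = two_proportion_z_test(8, 24, 10, 44)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A p-value above 0.05 means the difference is not statistically significant at the 95% confidence level. Note that this sketch divides by zero if every estimate in both groups is correct or every estimate is incorrect, so that case would need separate handling.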
Petite, tall, or regular inseams
Both 4o and o4-mini were equally consistent when it came to answering the third question: “When she is buying bottoms, and she is given the option of petite, regular, and larger lengths, which ones should she choose?” Both models always gave the same answer, so there was no variation in the answer to this question within the four trials (4o and pictures, o4-mini and pictures, 4o and the video, o4-mini and the video) of each person. Moreover, both of these models were mostly correct. However, for 3 participants who claimed in the survey that they wore regular inseams, ChatGPT consistently recommended that they wear petite. I have two possible explanations for this phenomenon:
- ChatGPT has trouble accounting for larger legs in terms of body proportions.
- Petite sizing is less available, and these participants are on the border between being petite and regular, so they choose to buy regular.
Color palette and body shape
In 16 of the 34 trials that I ran, participants were categorized as having rectangular body shapes. Overall, eight of the eleven participants were categorized as having rectangular body shapes in at least one trial. I know that many of these rectangle body shape predictions were incorrect because I looked at the videos and was able to figure out the participants’ body shapes. Maybe the reason for this occurrence is that people appear more rectangular throughout the day as they eat and become more bloated. Another possible explanation is that ChatGPT cannot analyze body shapes precisely enough.
A similar trend occurred when ChatGPT analyzed seasonal color palettes. ChatGPT categorized every single participant as having some type of autumn palette at least once. In fact, in 25 of the 34 trials, participants were characterized as either true or soft autumns. This could be because ChatGPT chose the wrong regions of participants’ faces to analyze when finding their seasonal color palette. This could also be because of the lighting chosen by participants for the video recordings that they submitted. In the instructions for the video submission, I clearly instructed participants that they should be standing in natural light because it is easiest to find one’s color palette based on analyzing the contrast of their features and their undertones in natural light. However, most participants took their videos in rooms with non-natural lighting, which impacted how their undertones and the contrast within their features appeared in the videos and pictures I used as input for ChatGPT.
Pant Inseams and lengths
I calculated the variance across each participant’s trials for their regular fit inseam, baggy fit inseam, tight fit inseam, midi skirt length, and maxi skirt length. Then I averaged these variances across participants. The average variances in regular and tight fit pant inseams were 2.917 and 2.433, respectively. This suggests that the models generally agree on these recommendations. The average variance of baggy fit inseams was 3.480, which also demonstrates the consistency of AI’s assessments. However, the average variances in midi and maxi skirt lengths were 7.200 and 5.602, respectively. This increased variance is less surprising than it looks, because midi and maxi skirts typically come in a wider range of lengths than pants do. Overall, these variances suggest that the ChatGPT models are relatively consistent when recommending pant inseams and skirt lengths. Therefore, there are no noticeable advantages when it comes to using either of the ChatGPT models or input types.
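Here is a small sketch of the averaging step described above: take the variance of one participant's recommendations across their trials, then average those variances over all participants. It assumes the sample variance (dividing by n - 1), which may not be the formula used for the numbers in this post, and the inseam values are invented for illustration.

```python
from statistics import mean, variance


def average_within_participant_variance(recommendations):
    """Mean of the per-participant sample variances across their trials."""
    return mean([variance(values) for values in recommendations.values()])


# Hypothetical regular fit inseam recommendations in inches (NOT the study data).
# Participants with videos have four trials; picture-only participants have two.
regular_fit_inseams = {
    "participant_a": [27.0, 28.0, 29.0, 28.0],
    "participant_b": [30.0, 30.0],
}

average_variance = average_within_participant_variance(regular_fit_inseams)
print(f"average variance: {average_variance:.3f}")
```

Each participant needs at least two trials, since a sample variance is undefined for a single value.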
Nevertheless, there was one pattern I noticed that may have skewed these results by increasing the variance. For the questions relating to pant inseams (questions 7) to 9)), the baggy pant inseams recommended by ChatGPT were supposed to be shorter than the regular fit inseams, and the tight inseams were supposed to be longer than the regular fit inseams, because baggier pants fall lower on the leg and tighter pants fall higher on the leg. I tried to word questions 7) to 9) so that ChatGPT would understand that the regular, baggy, or tight fit of the pants was intended by the designers rather than being a personal preference of the customer (as buying a smaller pant would increase the tightness of the fit, whereas buying a larger pant would increase the bagginess). However, in ten of the 34 trials of my experiment, ChatGPT either recommended longer baggy pant inseams than regular pant inseams, shorter tight inseams than regular fit inseams, or both. In eight of these ten instances, o4-mini was the model being used, and in the other two, 4o was being used. This suggests that o4-mini may be less receptive to instructions and prompting than 4o. Moreover, in three of these instances, videos were being used, while in the other seven, pictures were being used. Even so, this may not be a reflection of the efficacy of videos over pictures, because many more trials were carried out with pictures (22) than with videos (12).
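The check described above is easy to automate. The sketch below flags any trial where the baggy inseam is longer than the regular inseam or the tight inseam is shorter than the regular inseam, then counts the flagged trials by model. The trial records and field names are hypothetical and only illustrate the idea.

```python
from collections import Counter


def breaks_expected_order(trial):
    """True if baggy is longer than regular or tight is shorter than regular."""
    return trial["baggy"] > trial["regular"] or trial["tight"] < trial["regular"]


# Hypothetical inseam recommendations in inches (NOT the study data).
trials = [
    {"model": "4o", "input": "pictures", "regular": 28.0, "baggy": 27.0, "tight": 29.0},
    {"model": "o4-mini", "input": "pictures", "regular": 28.0, "baggy": 29.0, "tight": 29.0},
    {"model": "4o", "input": "video", "regular": 30.0, "baggy": 29.0, "tight": 29.0},
    {"model": "o4-mini", "input": "video", "regular": 30.0, "baggy": 30.0, "tight": 30.0},
]

flagged = [trial for trial in trials if breaks_expected_order(trial)]
flagged_by_model = Counter(trial["model"] for trial in flagged)
print(f"{len(flagged)} of {len(trials)} trials flagged: {dict(flagged_by_model)}")
```

In this version, equal inseams are not flagged. Whether a tie should count as breaking the expected order is a judgment call.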
Conclusions and Discussion
I would like to conclude by mentioning some of the problems I ran into during the process of this experiment. The first problem that I ran into (which I have already mentioned in a previous post) is that ChatGPT is unable to analyze videos. To solve this problem, I asked it to extract one frame per second and analyze the extracted frames. However, I noticed that ChatGPT didn’t use all the frames to answer all of the questions. I tried prompting ChatGPT by asking it to analyze every frame to answer each question, but that too was unsuccessful because that is a lot of data for ChatGPT to analyze at once.
After I had analyzed six people, including myself, both models of ChatGPT were unable to extract frames from the videos, claiming that there was too much information to process. I tried decreasing the number of frames that it had to extract. I tried doing it at a different time/day, but ChatGPT was still unable to analyze the videos. Later, I learned that OpenAI had introduced a new model, and with its introduction, 4o and o4-mini were reconfigured in a way that prevented them from processing as much information.
Additionally, I had to reprompt ChatGPT o4-mini more than 4o, but I often had to reprompt both of them to answer the last question on seasonal color palettes. However, when using OpenAI products, an automated script can be created to reprompt the AI if the desired results are not produced.
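As a rough illustration of what automated reprompting could look like, the sketch below retries a question until the answer passes a simple check. It does not call any real OpenAI API: `ask_model` is a hypothetical stand-in for whatever function sends a prompt and returns the reply, and the validity check (does the reply name a season?) is only an assumption about what a usable answer looks like.

```python
SEASONS = ("spring", "summer", "autumn", "winter")


def names_a_season(answer):
    """Very rough check that the reply actually names a seasonal palette."""
    return any(season in answer.lower() for season in SEASONS)


def reprompt_until_valid(ask_model, prompt, is_valid, max_attempts=3):
    """Ask the question again until the reply is valid or attempts run out."""
    message = prompt
    for attempt in range(1, max_attempts + 1):
        answer = ask_model(message)
        if is_valid(answer):
            return answer, attempt
        # Repeat the original question with a nudge to answer it directly.
        message = f"{prompt}\nYour previous reply did not answer the question. Please answer it directly."
    return None, max_attempts


def make_fake_model(replies):
    """Hypothetical stand-in for a real model call, used only for this demo."""
    remaining = iter(replies)
    return lambda message: next(remaining)


fake_model = make_fake_model(["I cannot determine that.", "She is likely a soft autumn."])
answer, attempts = reprompt_until_valid(
    fake_model, "What is her seasonal color palette?", names_a_season
)
print(f"answer after {attempts} attempts: {answer}")
```

Capping the number of attempts keeps the loop from running forever when the model keeps refusing.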
Although this is not so much a problem that occurred while experimenting as it is a limitation, the sample size for this research project was only eleven volunteers. More generalizable results could have been achieved had there been more participants and/or if the participants had been randomly selected.
Overall, there seems to be no significant difference between the efficacy of the two ChatGPT models that I tested in this experiment, or between the efficacy of pictures and videos. Both ChatGPT 4o and o4-mini were most often inaccurate in predicting people’s sizes and measurements, whether the input was pictures or videos. Furthermore, both models, regardless of the input type, also lacked accuracy when estimating body shapes and color palettes. However, 4o and o4-mini with both types of inputs were relatively consistent when predicting pant inseams and skirt lengths. Ultimately, at present, ChatGPT would not make a strong personal stylist because it needs to improve upon its precision when analyzing pictures and videos.
The Business Model Outline
Unfortunately, it shows up a bit blurry, and there is nothing I can do to fix that, but essentially this is what it says:
Our brand’s core value lies in AI-powered sizing and styling recommendations, with significant investment going into developing our AI tool and an integrated chatbot to enhance customer relationships. We serve petite, professional, upper-middle-class women in their 20s and 30s, reaching them directly through social media and indirectly through Shopify. Our mission also emphasizes sustainability through partnerships with ethical textile mills, recycling programs, and transparent manufacturing. In addition to direct clothing sales, we’ll generate revenue through pre-orders, brand and influencer collaborations, and investor backing.
We plan to be more environmentally sustainable through sourcing from ethical suppliers with local production in LA via The Evans Group or Lefty Productions Co., because they have low minimum order quantities. We also plan to use eco-friendly materials like GOTS-certified cotton, recycled silk, and deadstock fabrics. Additionally, we plan to take other measures like using carbon-neutral shipping, biodegradable packaging, compostable tags and trims, and giving clothes to recycling companies at the end of their life.
We also plan to be more socially sustainable by prioritizing safe working conditions and fair wages for our employees. In terms of our social values, we’re focused on raising awareness about sustainability and promoting minimalism. We advocate for body positivity and size inclusivity, while also ensuring fair labor practices, transparency, and traceability in our supply chain. We aim to shape societal culture by encouraging conscious consumption. Our fashion is designed to fit into capsule wardrobes, supporting a lifestyle that values fewer, better pieces. Our scale of outreach includes a global presence through our social media content, with partnerships and collaborations involving influencers and stylists who align with our mission. For the end user, this translates into more than just clothing; it’s about building body confidence and connecting with a brand that shares their values.
The Collections
Summer
Fall
Anyways, I guess that’s it for this blog! Thank you so much for following me through this journey!