Week 11 - Evaluation pt. 2
June 2, 2025
Hi and welcome to my penultimate blog post! This week, I interviewed a total of 5 translators and 4 children. These kind evaluators spoke Mandarin, French, Korean, Amharic, and Burmese; during transcreation, their prompts aimed to convert the images to represent China, Canada, Korea, Ethiopia, and Myanmar.
Translators (Scores)
In their platform feedback survey, translators were asked to rate the “quality” of the transcreation and translation on a scale from 1 to 10. The transcreation rating was further divided into ratings for overall image-editing quality (i.e. “did the models follow your instructions”) and cultural relevance (i.e. “did the result reflect your intended cultural conversion”).
The translators’ average score for translation was 5.8/10: four scores of 6 or 7, and one score of 3 from a particularly poor Amharic translation. Average scores for transcreation (both editing quality and cultural relevance) came out to 4.2/10. Four of the five translators preferred the translated book to the transcreated one; the fifth preferred the original, untranslated version. Most translators indicated that transcreation “worsened” the quality of the book. As for the two exceptions, one translator selected the original page nearly every time (hence a “no change in quality” rating), and the other selected “improved” despite stating that the edited images were not actually culturally relevant.
Unfortunately, three translators also reported that the image-editing models produced offensive content.
Translators (Interview and Observations)
Since my pool of evaluators was so small, I found their live platform evaluations and verbal feedback more interesting than the quantitative results of their survey responses. Throughout this evaluation, a couple of themes stood out. In general, the models performed poorly even with basic image-editing instructions, like in the following example:
Prompt: Change the teddy bear into a stuffed panda. Make the girl wear a red dress. Add paper lanterns in the background.
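For context, here is a minimal sketch of how a free-form instruction like this reaches an editing model, assuming an InstructPix2Pix-style editor served through Hugging Face diffusers (the file paths are hypothetical, and the actual models in my evaluation may differ):

    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline
    from diffusers.utils import load_image

    # Load an instruction-following editor (an illustrative choice, not
    # necessarily one of the models in my evaluation).
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")

    page = load_image("original_page.png")  # hypothetical path to a book page

    # One pass with the composite prompt; splitting it into three single-object
    # edits is an alternative, at the cost of repeated passes over the image.
    edited = pipe(
        prompt="Change the teddy bear into a stuffed panda. "
               "Make the girl wear a red dress. Add paper lanterns in the background.",
        image=page,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # higher values stay closer to the original page
    ).images[0]
    edited.save("edited_page.png")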
They also struggled with country-specific knowledge, failing to include changes like making characters wear Thanaka (a Burmese cosmetic) or adding a bottle of Yakult (a popular Korean drink) to the scene. Surprisingly, none of the image-editing models were able to recognize or include national flags; the one flag that did appear was China’s, coincidentally generated in response to the prompt, “Add the flag of Myanmar in the background.” There also seemed to be an odd tradeoff between editing quality and how closely the user’s instructions were followed. When the results appeared at a glance to align with the user’s target culture or to include requested elements, other parts of the image (especially people’s faces or figures) became distorted. On the other hand, the models sometimes produced images of wonderful visual quality and barely any relevance to the user’s prompt. Thus, there were very few examples that the translators were fully happy with; in most cases, they expressed reluctance to select any of the edited images.
Unexpectedly, despite its shortcomings, the platform showed a lot of potential once the image-editing portion was set aside. During their narrative interviews, most translators said that it was not easy to find resources in their native tongue, whether online or through a community teaching the language. In one extreme case, a translator added that their home country’s secondary education system used solely English instead of the more commonly spoken native language (Amharic); in another, the translator told me that even though there was a prevalent community of immigrant speakers and educators (Lingala, Swahili) around them, they were “not allowed” to teach the language to children at schools. Thus, when the translators (who were parents or teachers) were asked if they’d use the platform for language education, all said they’d use at least the translation portion.
Readers
In addition to the image-editing mistakes, most translators also pointed out that the translation was insufficiently localized, with abundant word-for-word translations and syntactical errors. However, since all the children had either an elementary or toddler reading level in the target language, they did not seem to notice the translation errors. In fact, all the children who participated in the evaluation seemed to thoroughly enjoy the readings. The children at younger reading levels (requiring parental guidance and read-alouds) stayed engaged with the digital content and could repeat some keywords in the target language; similarly, the more advanced children were able to stay focused on the translated text throughout the full reading.
The children did not show a distinct response to the (attempted) localization of the images. This may partly be because all image-editing results had already been filtered by the translators (leaving many pages unedited). For the pages with edits, the children’s preferences were, for the most part, not influenced by any cultural familiarity. Most of the children did in fact prefer transcreated pages to the original ones, but their preferences were driven by aesthetics, visual intrigue, or certain elements that their parent translator added specifically to suit their interests (e.g. pizza, hockey). As an example, two children preferred the following edits to the original images not because of relevance to their cultural identity (or even image-editing quality, given that the edited pictures no longer made sense in the plot), but because the new images “look cool.”
Original pages (left) versus image-edited (right) for target culture China.
Thoughts and Analysis
Based on the results of this admittedly small evaluation, culturally sensitive image “transcreation” is not yet achievable with image-editing models (at least, the ones I selected). The failures seem to fall into two categories: instruction failure and lack of knowledge.
To remedy the models’ inability to perform even basic edits (like object swapping or insertion) on children’s book pages, domain-specific training on picture book art styles may be necessary. Some models were trained partly on adjacent domains, including anime art and comic book panels; ironically, those models especially struggled with maintaining the original structure and content of the image (the model with comic data defaulted its outputs to a “panel” structure, and the model with anime data would occasionally output completely irrelevant anime-style artwork).
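For a sense of what that domain-specific data could look like, here is a minimal sketch of the (source page, instruction, edited page) triplet format that instruction-based editors like InstructPix2Pix are typically trained on; the class and field names are my own, purely hypothetical:

    from dataclasses import dataclass

    @dataclass
    class PictureBookEditExample:
        source_page: str   # path to the original picture-book illustration
        instruction: str   # e.g. "Change the teddy bear into a stuffed panda."
        edited_page: str   # path to the edited page, drawn in the same art style

    # A fine-tuning set would consist of many such triplets taken from picture
    # book art (rather than comics or anime), so the model learns to apply the
    # edit while preserving the page's original structure and style.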
It was particularly alarming to observe that the models did not seem to possess general knowledge as basic as national flags. To achieve full “transcreation” in the future with culturally-aware image editing, the models would need specific information about the world: traditional clothing, national flowers, and so on. The failures in my evaluation stemmed not only from poor instruction-following or entity recognition in the picture book images, but also from the absence of the requested information in the models’ training data (e.g. Burmese traditional cosmetics or Korean drinks). Luckily, the platform is constructed to improve with the technology behind its services. As new image-editing models are built to handle more (practically and culturally) diverse tasks, the platform, and its transcreation service, can evolve alongside them.
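As a rough illustration of that evolvability (all names here are hypothetical, not the platform’s actual code), the transcreation step only needs a model-agnostic interface; any future editor satisfying it can be swapped in without changing the rest of the pipeline:

    from typing import Protocol
    from PIL import Image

    class ImageEditor(Protocol):
        # Any image-editing backend the platform can call.
        def edit(self, page: Image.Image, instruction: str) -> Image.Image: ...

    class DiffusersEditor:
        # Wraps one concrete diffusers pipeline behind the common interface.
        def __init__(self, pipe):
            self.pipe = pipe

        def edit(self, page: Image.Image, instruction: str) -> Image.Image:
            return self.pipe(prompt=instruction, image=page).images[0]

    def transcreate_page(editor: ImageEditor, page: Image.Image,
                         instructions: list[str]) -> Image.Image:
        # Apply one instruction at a time; swapping `editor` swaps the model.
        for instruction in instructions:
            page = editor.edit(page, instruction)
        return page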