Week 3: 🤔
March 21, 2025
Hi reader(s)! This week’s post will be a briefer one: some updates on the pipeline and more experimentation with image editing.
Backend updates
The translation part of my backend is very nearly complete! Here’s a breakdown:
1. User uploads a PDF and selects a source language, target language, and target country. The user also gets to choose which pages to exclude from translation (e.g. publishing/copyright pages, which are typically too cluttered and not that useful).
2. The CRAFT text detector and OpenCV help find the bounding boxes of the text and images on each remaining page; Tesseract OCR then extracts the text (see the first sketch after this list).
3. For the moment, the only translation API I’m calling is Google Translate’s, but others may be added to expand language coverage if necessary (second sketch below).
4. The reconstructed page will have the translated text adjacent to the original text. To find the best location on the page for the translation to live in, I look for the bounding box with the least contrast relative to the original text paragraph (third sketch below).
5. Edited pages are stitched back together! Here’s a sample 🙂
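For the curious, here’s a minimal sketch of step 2. I’m assuming the craft-text-detector and pytesseract Python packages here; the real pipeline’s wiring (and its image-box detection) may differ.

```python
# Sketch of step 2: CRAFT finds text boxes, Tesseract reads each crop.
# Assumes the craft-text-detector and pytesseract packages.
import cv2
import pytesseract
from craft_text_detector import Craft

def extract_text_regions(page_image_path, lang="eng"):
    """Return a list of ((x0, y0, x1, y1), text) for one page image."""
    craft = Craft(output_dir=None, crop_type="box", cuda=False)
    boxes = craft.detect_text(page_image_path)["boxes"]  # 4-point polygons

    page = cv2.imread(page_image_path)
    regions = []
    for box in boxes:
        x0, y0 = int(box[:, 0].min()), int(box[:, 1].min())
        x1, y1 = int(box[:, 0].max()), int(box[:, 1].max())
        text = pytesseract.image_to_string(page[y0:y1, x0:x1], lang=lang).strip()
        if text:
            regions.append(((x0, y0, x1, y1), text))

    craft.unload_craftnet_model()
    craft.unload_refinenet_model()
    return regions
```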
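Step 3 is basically one API call per region. Here’s a sketch using the official google-cloud-translate client (v2); the pipeline’s actual call may look different.

```python
# Sketch of step 3: translate each OCR'd region, keeping its bounding box.
# Assumes the google-cloud-translate package, with credentials supplied
# via GOOGLE_APPLICATION_CREDENTIALS.
from google.cloud import translate_v2 as translate

def translate_regions(regions, source_lang, target_lang):
    """regions: list of ((x0, y0, x1, y1), text) from the OCR step."""
    client = translate.Client()
    return [
        (bbox, client.translate(text,
                                source_language=source_lang,
                                target_language=target_lang)["translatedText"])
        for bbox, text in regions
    ]
```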
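And one plausible reading of the least-contrast search in step 4: among candidate boxes near the original paragraph, pick the one whose pixels vary the least, i.e. the flattest background for the translated text to sit on. The candidate list and the exact contrast measure here are my assumptions.

```python
# Sketch of step 4: score candidate regions by local contrast (grayscale
# standard deviation) and place the translation in the flattest one.
import cv2

def least_contrast_box(page_bgr, candidates):
    """candidates: list of (x0, y0, x1, y1) boxes near the original text."""
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)

    def contrast(box):
        x0, y0, x1, y1 = box
        return float(gray[y0:y1, x0:x1].std())  # low std ≈ uniform patch

    return min(candidates, key=contrast)
```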
Some things to be aware of:
– The OCR has trouble extracting “artistic” text, including colorful book titles.
– The translation quality may be dubious (check out Aindra’s project!), so translators will have to screen the results.
Drawn pictures (Bloopers part 1)

Prompt: “Give the girl black hair. Change the teddy bear on the shelf in the back to a stuffed panda.”
Image credits: “Doing My Chores” by T. Albert (MonkeyPen books)
Observe the rather amusing set of edits above, where no model has followed the prompt quite right. Model 3 is almost there, but it overedits the girl, adding arbitrary black patterns everywhere except her hair, which has only darkened. Model 4 might be the closest in terms of following the prompt, but the style has completely changed.
Intriguingly, these image-editing models do alright with realistic images or photos, such as in the example above. However, when I test out prompts on children’s book pictures, from the vague (“convert to __ culture”) to the specific (“change A to B”), the models produce funny images. A combination of the exaggerated features and variability in drawn pictures (particularly human characters) and their underrepresentation in training data might’ve led to this unexpected weakness in editing capabilities.
It’s possible that with the right prompting, drawn pictures can be edited as nicely as photos. There may also be an existing model that specializes in cartoon-esque pictures like the ones in my project. I’ll be investigating these prospects alongside the construction of my platform. See you later this week!

