Week 9: Datasets

May 11, 2025

Hello everyone!

In this post, I’ll be talking about the datasets I’ve used for testing. (This is a post I probably should’ve made much earlier, oops.)

The Asian Language Treebank (ALT) Parallel Corpus

The ALT project was initiated by the Advanced Speech Translation Research and Development Promotion Center in Japan in order to promote NLP research for Asian languages. Its parallel corpus is the largest publicly available corpus of Burmese-English parallel sentences, which is the main reason I chose to use it.

The sentences themselves were originally taken from randomly selected English Wikinews articles and then manually translated into various Asian languages (as well as annotated for features such as parts of speech, but that isn’t relevant to my project). For Burmese specifically, human translators were told to exclusively use the formal register, stick to the original English as closely as possible, and avoid figurative language (Thu et al., 2016). While the last two instructions do make it easier for me to find where the translation and reference diverge, they also make my evaluations of ALT data less applicable to real-world contexts where different sentence structures and figures of speech may be used. As I mentioned in a previous post, the ALT dataset has also been used to train at least one (and most likely all) of the translation models I’m testing (Chen et al., 2019), which may inflate their accuracy on ALT data relative to real-world scenarios.

My Social Media Dataset

After discovering the issues mentioned in my Week 3 update, I began compiling monolingual English and Burmese datasets from social media. The English data all comes from Twitter/X, with most tweets coming from the U.S. trending page at the time of the dataset’s creation. Other tweets came from the profiles of public figures, including but not limited to Barack Obama, MrBeast, Neil deGrasse Tyson, and Charli XCX. I specifically selected tweets that were on the less formal side in order to distinguish this data more from the ALT dataset, and tried to include a wide range of topics. In terms of slang and abbreviations, I avoided more recent terms that most adults would be unfamiliar with (essentially, anything that could be categorized as “brainrot”), but still tried to keep some of the things that make translating social media a very different task from translating news articles (hashtags, emojis, and more well-known slang like “plz” and “lol,” for example). I also included a few tweets with grammar or spelling errors, but not to the extent that they would obscure the meaning of the sentence.

I initially planned to make my Burmese dataset with tweets as well, but I soon found out that the Burmese-speaking community on Twitter was much smaller than I anticipated – most of the trending page was still in other languages even after I set my language to Burmese and my location to Myanmar. I decided to head to the platform with the largest Burmese-speaking community instead and began taking sentences from Facebook posts and comments. This was very inefficient: I had to explore quite a few random groups, since Facebook doesn’t have a trending page and I didn’t want to exclusively take sentences from my relatives’ posts. Eventually, my external mentor recommended that I use YouTube comments instead, so my Burmese dataset is a mix of Facebook posts and YouTube comments. The Burmese dataset unintentionally contains far less slang, which may reflect demographic differences between Twitter and Facebook/YouTube, or perhaps just a difference in how Burmese speakers use the Internet (potential future research?). I tried to maintain a wide range of topics here as well.

In my next blog post, I’ll be going over some of my earlier data from the social media dataset (no spoilers for final data!).

Citations:

Chen, P., Shen, J., Le, M., Chaudhary, V., El-Kishky, A., Wenzek, G., Ott, M., & Ranzato, M. “Facebook AI’s WAT19 Myanmar-English Translation Task Submission.” (2019). Proceedings of the 6th Workshop on Asian Translation, pp. 112–122, https://aclanthology.org/D19-5213/.

Thu, Y. K., Pa, W. P., Utiyama, M., Finch, A., & Sumita, E. “Introducing the Asian Language Treebank (ALT).” (2016). Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp. 1574–1578, https://aclanthology.org/L16-1249/.

View more of Aindra T.'s posts.
