Week 4: Topic Homogeneity

March 21, 2026

Hello everyone, and welcome back to my blog! This week, I focused on analyzing topic homogeneity across two different datasets: a dataset of X comments and a dataset of Reddit comments from the news subreddit.

Topic homogeneity measures how similar the comments within each detected community are. In other words, it tells us whether people in the same community are actually talking about the same thing.

How I Computed Topic Homogeneity

To calculate topic homogeneity, I started with my community graph, in which comments were already grouped together. Then, I applied the same equation to each community.

In words, the equation calculates the cosine similarity for every possible pair of comments in a community, then averages these similarity scores into a single value showing how closely related the comments are.
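
In symbols, that description amounts to something like the following. The notation is assumed here for illustration: H(C) is the homogeneity of community C, n is the number of comments in it, and v_i is the vector representing comment i.

```latex
H(C) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \cos(v_i, v_j),
\qquad
\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}
```

The factor 2/(n(n-1)) is one over the number of distinct pairs, so the sum becomes a plain average.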

This gives a value between 0 and 1, where higher values mean that discussions are more focused on the same topic and lower values mean that a community may talk about a wider range of ideas.
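
The calculation described above can be sketched in Python. This is a minimal illustration, not the code used for the project: it assumes each comment has already been turned into a numeric vector (the post does not say how), and the function names `cosine_similarity` and `topic_homogeneity` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat an all-zero vector as sharing nothing with any other.
        return 0.0
    return dot / (norm_u * norm_v)


def topic_homogeneity(vectors):
    """Average cosine similarity over every distinct pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors to form a pair")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy community: two comments pointing the same way, one unrelated.
toy_community = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(topic_homogeneity(toy_community))
```

In the toy community, only one of the three pairs is similar, so the score lands well below 1. Note that the number of pairs grows quadratically with community size, so a large community may call for a vectorized or sampled version of the same idea.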

Results: X Dataset

The X dataset showed relatively high homogeneity scores across most topics:

Political Campaigns and Elections: 0.728691

International Relations and Strategy: 0.673934

National Security and Defense: 0.641711

Education and Teaching: 0.634931

Research and Scientific Development: 0.634908

Economic Policy and Finance: 0.484346

Social Issues and Race Relations: 0.470881

Media, Communication, and Public Discourse: 0.466458

Overall, five of the eight communities score above 0.6, meaning that discussions in those communities are fairly concentrated around the same ideas.

Results: Reddit (r/news)

The Reddit dataset showed much lower homogeneity scores:

Private Prisons: 0.468554

Tax Policy: 0.244306

Healthcare Policy: 0.200636

Donald Trump Policies: 0.193802

Economic Inequality: 0.178718

Iran Sanctions: 0.148979

Religious Freedom: 0.140398

Middle East Policy: 0.136789

Abortion Access: 0.123881

With eight of the nine values below 0.25, there is more variation within each topic. Each topic may contain related subtopics that people discuss, which could be why scores are much lower for these communities than for the communities on X.
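
For anyone who wants to check the tallies, the scores listed above can be summarized with a few lines of Python. This is only a sketch: the helper name `summarize` is hypothetical, and the per-platform averages it prints are not figures reported in this post, just simple unweighted means of the listed values.

```python
x_scores = {
    "Political Campaigns and Elections": 0.728691,
    "International Relations and Strategy": 0.673934,
    "National Security and Defense": 0.641711,
    "Education and Teaching": 0.634931,
    "Research and Scientific Development": 0.634908,
    "Economic Policy and Finance": 0.484346,
    "Social Issues and Race Relations": 0.470881,
    "Media, Communication, and Public Discourse": 0.466458,
}

reddit_scores = {
    "Private Prisons": 0.468554,
    "Tax Policy": 0.244306,
    "Healthcare Policy": 0.200636,
    "Donald Trump Policies": 0.193802,
    "Economic Inequality": 0.178718,
    "Iran Sanctions": 0.148979,
    "Religious Freedom": 0.140398,
    "Middle East Policy": 0.136789,
    "Abortion Access": 0.123881,
}


def summarize(scores, threshold):
    """Return (mean, count above threshold, count at or below threshold)."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    above = sum(1 for value in values if value > threshold)
    return mean, above, len(values) - above


x_mean, x_above, x_below = summarize(x_scores, 0.6)
reddit_mean, reddit_above, reddit_below = summarize(reddit_scores, 0.25)

print(f"X: mean {x_mean:.3f}, {x_above} of {len(x_scores)} above 0.6")
print(f"Reddit: mean {reddit_mean:.3f}, {reddit_below} of {len(reddit_scores)} at or below 0.25")
```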

Analyzing the Results

Higher topic homogeneity, as seen in the X dataset, suggests a higher chance of echo chambers, where users are constantly exposed to similar ideas within their community.

Lower homogeneity, as on Reddit, could mean that discussions are less isolated, with multiple viewpoints and subtopics discussed in the same place, which decreases the likelihood of echo chambers.

However, the real question has yet to be answered. Next week, I will begin working on sentiment analysis, where I will calculate how everyone in one community truly feels about the main topic. Even though Reddit shows low topic homogeneity scores, it could still turn out to have echo chambers as I continue to work on the Echo Chamber Index.

Thank you for reading, and I will see you all next week!

Harish


Comments

davidz2026 says

March 23, 2026 at 5:21 am

Great work from last week Harish! I was wondering how the general format of posts of selected platforms plays a role in these scores. X has a character limit, and posts tend to be shorter. On the other hand, Reddit posts can be massive rabbit holes that constantly expand on one idea. These may affect the cosine similarity used due to the trend in length. Do you think that the structural differences in text length might be skewing the homogeneity scores down for Reddit, regardless of the actual community behavior?

- harishs2026 says
  
  March 24, 2026 at 11:08 pm
  
  Hello David, and thank you for your comment!
  
  That’s a very good point, and I thought about that as I was cleaning the data I would use. To account for that, I calculated homogeneity scores based on sentences and not full comments. That way, longer Reddit posts wouldn’t disproportionately lower the similarity score just because they have more detail or subtopics. Because of this, I think Reddit had a lower score because comments there naturally tend to branch out and bring multiple topics into discussions.
  
Anav A. says

March 23, 2026 at 5:30 pm

These are some interesting findings, Harish. I believe they are a little counterintuitive to my current intuition. X has a single “For You” tab where the algorithm recommends content. This could possibly lead to some variety, as a user could have, on the lower side, 2-3 interests, leading to less of a hyper-fixation on a single topic. However, on Reddit, the existence of subreddits leads to a single discussion around a certain topic, meaning the structure of the site leads to more homogeneity. Yet your findings appear to be hinting at the opposite. I wonder what the gap could be.

- harishs2026 says
  
  March 24, 2026 at 10:52 pm
  
  Hello Anav, and thank you for your comment!
  
  I had a similar idea going into this part of my project, but I think the main difference comes from where I got the data. The r/news subreddit can be very broad in terms of discussed topics. People could be debating and making jokes, or even bringing up new ideas. On the other hand, X’s “For You” tab works with an algorithm that reinforces people’s online engagement patterns, which is a big trait of echo chambers. It is definitely a bit counterintuitive, but it might mean that echo chambers are more affected by how people interact within discussions than by how the platform organizes content.
  
Archita S. says

March 24, 2026 at 4:05 pm

Your project is so interesting for its combination of humanities and STEM. Many of the conversations you mentioned from X and Reddit are humanities-focused, whether it be politics or social issues. Finding a way to convert this into data and numbers is fascinating and a great way to compare two entirely different topics. I can’t wait to see where this goes next.

Arjun M. says

March 25, 2026 at 11:57 pm

Hello Harish,

These numbers are extremely intriguing! I’m curious as to how the complexity of a topic affects the comments and how they are pooled together. A topic like “Economic Inequality” has pretty low topic homogeneity, as expected, since it’s such a broad topic that could cover many different groups of people, countries, and situations. However, a topic like “Private Prisons”, which is much more specific and concrete, has a much higher topic homogeneity. I’m interested to see the extent to which the actual complexity and depth of the topic could affect this score!

