Week 2: Data Analysis
March 6, 2026
Hi everyone! Welcome back to my blog. This week, I began working with the data that will be used in my project. Since my goal is to study echo chambers, I need a dataset that captures the conversations people have online. To collect this data, I used the X API, which gives me access to posts and comments made on the site.
Using the API, I collected a random assortment of comments from different online discussions. Because of usage limits, I initially gathered 2000 comments, but I will increase the number of comments I collect in the future.
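As a rough sketch of what one collection request can look like: the endpoint and parameter names below follow the public X API v2 recent-search documentation, but the query string, field list, and helper function are illustrative assumptions rather than the exact code used in this project.

```python
import urllib.parse

def build_search_url(query: str, max_results: int = 100) -> str:
    """Hypothetical helper: build a request URL for X API v2 recent search.

    The endpoint and parameter names come from the v2 docs; the actual
    request would also need an Authorization: Bearer <token> header.
    """
    base = "https://api.x.com/2/tweets/search/recent"
    params = {
        "query": query,                # search terms, e.g. a topic keyword
        "max_results": max_results,    # v2 allows 10-100 results per request
        "tweet.fields": "created_at,conversation_id",  # extra metadata
    }
    return base + "?" + urllib.parse.urlencode(params)

url = build_search_url("climate lang:en", max_results=100)
```

Paginating with the response's `next_token` is how a collection like this grows past the per-request cap toward a few thousand comments.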
Preparing and Structuring the Data
After collecting the data, I spent time cleaning and organizing it so that it could be properly analyzed. In this step, I formatted the data, removed empty or incomplete comments, and prepared the text by creating embeddings so that it could be processed by my model.
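The filtering part of that step is simple to show. Here is a minimal sketch, where the comment records and field names are made up for illustration:

```python
# Toy comment records; in practice these would come from the API response.
raw_comments = [
    {"id": 1, "text": "Great point about turnout!"},
    {"id": 2, "text": ""},        # empty -> dropped
    {"id": 3, "text": "   "},     # whitespace only -> dropped
    {"id": 4, "text": "I disagree, the data says otherwise."},
]

def clean(comments):
    """Keep only comments with non-empty text, with whitespace trimmed."""
    out = []
    for c in comments:
        text = (c.get("text") or "").strip()
        if text:
            out.append({"id": c["id"], "text": text})
    return out

cleaned = clean(raw_comments)  # only comments 1 and 4 survive
```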
After the data was cleaned, I converted it into a network structure. In this network, each comment acts as a node, and connections between nodes represent the similarity between the content of the comments. For now, I have chosen to calculate this similarity using cosine similarity rather than Euclidean distance, but I will look into other similarity metrics in the coming weeks, which I talk more about at the end of this blog.
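The node-and-edge construction can be sketched in a few lines of plain Python. The 3-dimensional embeddings and the 0.8 threshold below are toy values for illustration, not the ones used in the project:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d embeddings standing in for real comment embeddings.
embeddings = {
    "c1": [1.0, 0.2, 0.0],
    "c2": [0.9, 0.3, 0.1],   # points in nearly the same direction as c1
    "c3": [0.0, 0.1, 1.0],   # points in a very different direction
}

# Connect two comments when their similarity clears a threshold.
THRESHOLD = 0.8
ids = list(embeddings)
edges = [
    (a, b)
    for i, a in enumerate(ids)
    for b in ids[i + 1:]
    if cosine_similarity(embeddings[a], embeddings[b]) > THRESHOLD
]
```

With these toy vectors, only c1 and c2 end up connected, which is exactly the behavior the comment network relies on: similar comments link up, dissimilar ones stay apart.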
To build this graph, I used the Python library igraph, which is well suited to building large networks. It makes graph analysis easier and more efficient, especially as I use the Leiden Algorithm to detect communities.
Detecting Communities in the Network
After building the graph, the next step was identifying communities in it. Communities are groups of comments that are more strongly connected to each other than to the rest of the graph.
To detect these clusters, I used the Leiden Algorithm, which is a widely used method for finding communities in networks. The Leiden Algorithm works by grouping nodes together in a way that maximizes how densely connected they are within the group compared to the rest of the network.
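The quantity being maximized there is typically modularity: for each pair of nodes in the same community, it compares the actual edge between them against the number of edges you would expect by chance given their degrees. A pure-Python sketch of the formula, on a toy graph of two triangles joined by a bridge:

```python
def modularity(edges, membership):
    """Q = (1 / 2m) * sum over same-community pairs of
    (A_ij - k_i * k_j / 2m) -- the score Leiden-style methods
    typically try to maximize."""
    m = len(edges)
    degree = {}
    adj = set()
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        adj.add((a, b))
        adj.add((b, a))
    q = 0.0
    for i in degree:
        for j in degree:
            if membership[i] != membership[j]:
                continue
            a_ij = 1.0 if (i, j) in adj else 0.0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)

# Two triangles joined by one bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})  # triangles
bad = modularity(edges, {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1})   # scrambled
```

The natural split into the two triangles scores higher than a scrambled partition, which is what lets the algorithm pick it out.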
Running this algorithm automates cluster detection; I will then analyze the resulting clusters to see if they really are echo chambers.
Debugging the Community Detection
While running the Leiden Algorithm, I came across an interesting issue. When I tested it on my dataset of 2000 comments, the algorithm found 268 communities. This number is much higher than expected, since meaningful communities should each contain many comments. After looking at the graph, I realized the issue came from the network being split into many disconnected pieces. Since the algorithm cannot place comments that are not connected into the same community, it struggled to form large clusters.
To fix this issue, I chose to focus on the largest connected group in the graph, which in python-igraph comes down to a single line:

```python
graph = graph.connected_components().giant()  # keep only the largest component
```
This line finds all the connected groups in the graph and keeps only the largest one, which is referred to as the giant component. This made my community detection much more reasonable, with the Leiden Algorithm outputting 13 communities as opposed to 268.
Looking Ahead
Next week, I want to improve how I calculate similarities between comments. Right now, I use cosine similarity to measure similarity based on embeddings. This metric is commonly used in Natural Language Processing (NLP) because it measures the angle between two vectors rather than their magnitudes.
However, one of my main goals is to experiment with different similarity metrics to see which one is most accurate for detecting communities. For example, I could test Euclidean Distance, which measures the straight-line distance between two vectors. I could also try Manhattan Distance or Jaccard Similarity, which are all viable techniques for calculating similarity values between texts. If you have any ideas for similarity metrics I could use, I am open to suggestions as well!
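For reference, these candidate metrics are all short to write down. A minimal sketch in plain Python (the example vectors and token lists are made up; note that the distances are "smaller means more similar" while Jaccard is "larger means more similar"):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences (smaller = more similar)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def jaccard(tokens_a, tokens_b):
    """Set overlap of two token lists: |A & B| / |A | B| (larger = more similar)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

d_e = euclidean([1.0, 0.0], [0.0, 1.0])   # sqrt(2)
d_m = manhattan([1.0, 0.0], [0.0, 1.0])   # 2.0
j = jaccard("the vote was close".split(), "the vote was rigged".split())  # 3/5
```

One design note: Euclidean and Manhattan operate on the embedding vectors, while Jaccard as written compares raw token sets, so plugging it into the same pipeline would mean thresholding a different kind of score.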
Thank you for reading, and I will see you all next week,
Harish
Comments
Hello Harish,
It looks like your project is progressing well! I like how you’ve already started to obtain results and find communities in a (relatively) small sample size of just 2000 comments.
13 communities seems like a much more reasonable number, but I’m curious to know what those communities are and how the Leiden algorithm actually delineates what a community is. It seems like there could be key parameters you could alter that may drastically affect the resultant size and number of communities.
Cosine similarity search is the vanilla example for NLP, but that doesn’t necessarily mean it’s bad. I think experimenting with different ways to explore vector similarity may possibly yield better results!
Hello Arjun, and thank you for your comment!
In this past week, I did look into some alternatives to Cosine Similarity. While BM25 and other metrics did not prove more accurate than Cosine Similarity, I will continue to look into other metrics to improve the outputs of my graphs.
Hi Harish, This progress looks super interesting! I had one question though. Why did you choose to base your data in X? Was it specifically because it’s an extremely polarizing platform, or just because the X API provided easy access to large amounts of data? Looking forward to your next few weeks!
Hello Anav, and thank you for your comment!
X has proven to be one of the most polarized social media platforms that exist. As part of a final project I did in class, I already completed the main research for Reddit, which is arguably known as the most divided platform when it comes to sports, politics, and news. The X API was convenient for me, as it made it easier to extract comments and other metadata like times and replies that may prove useful in my research.