Week 2: Scraping Political Rhetoric from the Web
March 12, 2024
Welcome to the first weekly update on my project. This past week has been filled with plenty of data mining as well as initial data cleaning. I now have a general idea of what kind of data I will be using and where I will be pulling it from.
My first point of interest this week was finding a general source from which I would be able to pull presidential speeches. After a short amount of searching, as well as conferring with my advisors, I found two good sources of presidential speech text.
The first source is C-SPAN’s extensive library of official presidential debate transcripts, which can be filtered by year, speaker, and text type, and searched for specific elements. In addition, a video link is attached to each transcript. The second source is The Commission on Presidential Debates (CPD) on debates.org, which contains another extensive library of debate transcripts dating back to 1960. The CPD hosts transcripts for both presidential and vice-presidential debates but lacks the filtering options that C-SPAN offers.
Both of these sources work well with my web scraping programs and have been easy to navigate. Going forward I will primarily use the CPD because its basic page format makes data mining much simpler. However, if I do choose to explore the auditory aspects of the debates, I will go back to C-SPAN’s database.
After identifying this general source of data, I then had to scrape the data from each of the webpages and make it suitable for analysis. To do this I used a number of Python libraries for browser automation and web scraping. First, I identified the home page on which links to each debate transcript are located. With that initial link as a starting point, I used Beautiful Soup, a Python library for pulling data out of HTML and XML files, to build an array of debate links. Using this method I produced a list of 56 links, essentially 50 transcripts. After this, I used the Selenium library to pull the text from each of the 56 links. Finally, I downloaded these transcripts as .txt files, and I now have a working corpus of texts to analyze.
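The link-collection step above can be sketched roughly as follows. This is a minimal illustration, not my exact script: the HTML snippet is a hypothetical stand-in for the CPD index page (which in practice would be fetched via Selenium or a similar tool), and the `debate-transcript` URL pattern is an assumption about how transcript links are distinguished from other links on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched index page HTML; the real page
# would be retrieved with a browser-automation tool such as Selenium.
index_html = """
<ul>
  <li><a href="/debate-transcripts/september-26-2016-debate-transcript/">First 2016 Debate</a></li>
  <li><a href="/debate-transcripts/october-9-2016-debate-transcript/">Second 2016 Debate</a></li>
  <li><a href="/about-cpd/">About</a></li>
</ul>
"""

soup = BeautifulSoup(index_html, "html.parser")

# Keep only anchors whose href points at a transcript page,
# building the array of debate links described above.
debate_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if "debate-transcript" in a["href"]
]

print(debate_links)
```

Each collected link would then be visited (e.g. with Selenium), and the extracted transcript text written out with something like `open(filename, "w").write(text)` to build the .txt corpus.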