COVID-19 may have taken away our in-person debate watch parties, but it’s not stopping us from making a drinking game out of the debates! In my latest YouTube video, I used text mining techniques to develop the _ultimate_ data-driven drinking game rules for the upcoming Presidential debates. This post will walk you through exactly how I did that.

To start, I scraped the transcripts of campaign rallies, speeches, and any other events from the last few weeks at which Biden or Trump (or both!) spoke. The full list of events I scraped can be seen in the GitHub repo for this project (see the “debates.csv” file).

I scraped the transcripts from rev.com (with their permission!) because it seemed to have the most exhaustive list of election 2020 events, and because the transcripts followed a standardized format, which made the scraping process easier. Here’s the function I used to scrape the transcripts:

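import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrapeTranscriptFormat1(url, sep):
    # Download the page and collect every paragraph tag
    html = requests.get(url).text
    bs = BeautifulSoup(html, "lxml")
    paragraphs = bs.find_all("p")
    # Remove the first underlined element from each paragraph, if there is one
    for paragraph in paragraphs:
        underlined = paragraph.find("u")
        if underlined is not None:
            underlined.decompose()
    speaker = []
    speech = []
    pattern = r"\[.*?\]"  # bracketed annotations
    for paragraph in paragraphs:
        speechText = paragraph.text.replace("\xa0", "")
        speechText = re.sub(pattern, "", speechText)
        if sep == "parenthesis":
            # speech starts on the line after two digits and a closing parenthesis
            speechMatch = re.search(r"[0-9]{2}\)[\r\n]+(.*)", speechText)
        else:
            # speech starts on the line after a colon
            speechMatch = re.search(r":[\r\n]+(.*)", speechText)
        # speaker name is everything before the first colon
        nameMatch = re.search(r"^(.*?):", speechText)
        # Skip paragraphs that don't fit the format; appending both values
        # together keeps the two lists the same length
        if speechMatch is None or nameMatch is None:
            continue
        speech.append(speechMatch.group(1).strip(" "))
        speaker.append(nameMatch.group(1))
    return pd.DataFrame({"name": speaker, "speech": speech})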
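
To make the regular expressions in that function a bit more concrete, here’s a minimal sketch that runs the same patterns on a single made-up paragraph, with no scraping involved. The sample text and the helper name `parse_paragraph` are hypothetical and only for illustration. The “Speaker: (mm:ss)” layout is my assumption about what a transcript paragraph looks like, inferred from the patterns themselves.

```python
import re

# Hypothetical paragraph text; the "Speaker: (mm:ss)\nspeech" layout is an assumption
sample = "Joe Biden: (01:23)\nThank you all for being here. [crowd cheering]"


def parse_paragraph(text, sep="parenthesis"):
    """Return (speaker, speech) for one paragraph, or None if it doesn't fit the format."""
    text = text.replace("\xa0", "")      # drop non-breaking spaces
    text = re.sub(r"\[.*?\]", "", text)  # drop bracketed annotations
    if sep == "parenthesis":
        # speech is the line after two digits and a closing parenthesis
        speech_match = re.search(r"[0-9]{2}\)[\r\n]+(.*)", text)
    else:
        # speech is the line after a colon
        speech_match = re.search(r":[\r\n]+(.*)", text)
    # speaker name is everything before the first colon
    name_match = re.search(r"^(.*?):", text)
    if speech_match is None or name_match is None:
        return None
    return name_match.group(1), speech_match.group(1).strip(" ")


print(parse_paragraph(sample))
print(parse_paragraph("Moderator:\nWelcome to the town hall.", sep="colon"))
```

Paragraphs that don’t contain a speaker label (a hypothetical “Transcript begins.” line, say) come back as `None`, which mirrors how the scraping function skips anything that doesn’t match.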
