DSC 180 – The Spread of Misinformation Online


Week 05 - Collecting the Data

Topics

This week, we'll collect a massive data set of ~2 million tweets on COVID.

Readings and Tasks

To study the spread of misinformation about COVID-19, we'll gather a collection of around 2 million tweets. The Panacea Lab at Georgia State University has been collecting tweets about the coronavirus since the beginning of the pandemic. The lab publishes only the tweet IDs, so it is up to us to "rehydrate" them, that is, to retrieve the full content of each tweet from its ID.

The Panacea Lab attempts to collect every tweet about COVID. As a result, the full data set contains, on average, 4.4 million tweets per day. This is far too many for us, so we will need to choose a random subset of them to hydrate. When doing so, you should sample uniformly at random from the whole period, so that the distribution of tweet dates in your sample is about the same as in the full data set.
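One way to keep the date distribution intact is to include each tweet ID independently with the same probability, regardless of which day it comes from. The sketch below works out that probability from the figures given here (4.4 million tweets per day is only an average, so the result is approximate) and applies it to a stream of IDs. The function names and the fixed seed are illustrative choices, not part of the assignment.

```python
import random
from datetime import date

def sampling_rate(target, per_day, start, end):
    """Fraction of IDs to keep so that about `target` tweets are sampled."""
    n_days = (end - start).days + 1  # count both endpoints
    return target / (per_day * n_days)

def bernoulli_sample(ids, rate, rng):
    """Keep each ID independently with probability `rate`."""
    return [tweet_id for tweet_id in ids if rng.random() < rate]

rate = sampling_rate(2_000_000, 4_400_000, date(2020, 2, 15), date(2021, 10, 10))
rng = random.Random(0)
# Stand-in for one daily file of IDs; real IDs would be read from disk.
sample = bernoulli_sample(range(1_000_000), rate, rng)
print(f"rate = {rate:.6f}, kept {len(sample)} of 1,000,000 IDs")
```

Because the same rate is applied on every day, busy days contribute proportionally more tweets, which is what uniform sampling over the whole population requires. The final sample size is random, so it lands near the target rather than exactly on it.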

There are two tasks this week, each very open-ended in that you'll need to make some decisions on how to fill in the details in a reasonable way:

  1. Hydrate a data set of (roughly) 2 million tweets from the period of February 15, 2020 to October 10, 2021. The only constraints are that you 1) use the "normal" data (as opposed to the "clean" data), as it includes retweets, and 2) sample uniformly at random.
  2. Once you have your data set of 2 million tweets, estimate the proportion of them that contain links (URLs).
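For the second task, a minimal sketch of the estimate is shown below. It assumes each hydrated tweet is a dictionary in the Twitter API v1.1 layout, where links appear as a list under `entities` and then `urls`; if your hydration tool produces a different layout, the `has_link` check will need to change. The standard error uses the usual normal approximation for a proportion, and the toy records are made up purely to show the input shape.

```python
import math

def has_link(tweet):
    """True if the tweet lists at least one URL entity (assumed v1.1 layout)."""
    return len(tweet.get("entities", {}).get("urls", [])) > 0

def link_proportion(tweets):
    """Return (estimate, standard error) for the share of tweets with links."""
    n = 0
    with_link = 0
    for tweet in tweets:
        n += 1
        with_link += has_link(tweet)
    if n == 0:
        raise ValueError("no tweets given")
    p = with_link / n
    return p, math.sqrt(p * (1 - p) / n)

# Tiny made-up examples, only to show the expected input shape.
toy = [
    {"entities": {"urls": [{"expanded_url": "https://example.com"}]}},
    {"entities": {"urls": []}},
    {"entities": {}},
    {},
]
p, se = link_proportion(toy)
print(f"{p:.2f} +/- {se:.2f}")
```

Since `link_proportion` accepts any iterable, the tweets can be streamed from disk one at a time instead of being loaded all at once.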

These tasks might be harder than they first appear! Even getting the tweet IDs involves a few challenges. For instance, you could get them from the GitHub repo provided by the Panacea Lab, but they are broken into daily files, so you'd need to write a script to download them one at a time and sample from each. Alternatively, you could download the entire data set of tweet IDs (the Panacea Lab provides a few links), but it is a single 10 GB file that may not fit in your computer's memory.
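If you take the single-file route, the file does not need to fit in memory, because it can be read one line at a time. The sketch below uses reservoir sampling (Algorithm R), which is one possible approach rather than a required one, to draw exactly k items uniformly at random in a single pass. It assumes one tweet ID per line; the file name in the closing comment is a placeholder, not the lab's actual file name.

```python
import random

def reservoir_sample(lines, k, rng):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

# Demonstration on an in-memory stand-in for the ID file.
rng = random.Random(0)
fake_ids = (str(n) for n in range(100_000))
sample = reservoir_sample(fake_ids, 10, rng)
print(sample)

# With a real file (placeholder name), iterate over the handle directly:
# with open("tweet_ids.txt") as f:
#     sample = reservoir_sample((line.strip() for line in f), 2_000_000, rng)
```

Only the k sampled items are held in memory at any time, so a 2 million ID sample stays small even when the input file is many gigabytes.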

Make your best effort at completing this by Wednesday. When we meet, we'll discuss the challenges you might have faced and talk about possible solutions.

For this week's participation, we'll use the "default" participation questions for a task.