DSC 180 – The Spread of Misinformation Online
Week 06 - Exploring the Data
Topics
This week, we'll finish collecting the entire data set and perform some exploratory analyses to understand it better.
Tasks
- Finish downloading the 2 million tweets using the techniques discussed in the last discussion.
- Estimate the proportion of tweets that are missing. That is, what proportion of tweet IDs can no longer be hydrated because the tweets have been deleted? You will likely need to do this on a sample of tweet IDs. If you do use a sample, report a 95% confidence interval for your proportion.
- Calculate or estimate the following summary statistics using your set of 2 million tweets:
  - the proportion of tweets that contain a URL
  - the number of unique users
  - the proportion of the data that are retweets
  - a visualization (e.g., histogram) of the distribution of tweets per user.
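
For the missing-tweet estimate, one possible approach is a normal-approximation (Wald) interval computed from a simple random sample of tweet IDs. The sketch below is only an illustration: the function name `proportion_confidence_interval` and the sample counts are made up, and other interval methods (for example, Wilson) may be preferable when the proportion is close to 0 or 1.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (p_hat, lower, upper) using the normal (Wald) approximation.

    successes: number of sampled tweet IDs that could not be hydrated
    n:         total number of sampled tweet IDs
    z:         critical value; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical counts for illustration only, not real results:
# 10,000 sampled IDs, of which 1,200 failed to hydrate.
p_hat, lower, upper = proportion_confidence_interval(1200, 10000)
print(f"missing proportion: {p_hat:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The interval is only meaningful if the sampled IDs are drawn at random from the full ID list; taking the first IDs in a file may bias the estimate.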
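
The summary statistics can be computed in a single pass over the hydrated tweets. The sketch below assumes each tweet is a dictionary in the Twitter API v1.1 layout (`user.id_str`, `entities.urls`, and a `retweeted_status` key on retweets); if your hydrated data uses a different layout, the field lookups will need to change. The function name `summarize` and the four sample tweets are hypothetical and exist only to show the shape of the input.

```python
from collections import Counter


def summarize(tweets):
    """Compute the requested summary statistics from an iterable of tweet dicts."""
    total = 0
    with_url = 0
    retweets = 0
    tweets_per_user = Counter()

    for tweet in tweets:
        total += 1
        # Assumed v1.1 layout: a non-empty entities.urls list means the tweet has a URL.
        if tweet.get("entities", {}).get("urls"):
            with_url += 1
        # Assumed v1.1 layout: retweets carry a "retweeted_status" key.
        if "retweeted_status" in tweet:
            retweets += 1
        tweets_per_user[tweet["user"]["id_str"]] += 1

    if total == 0:
        raise ValueError("no tweets supplied")

    return {
        "url_proportion": with_url / total,
        "unique_users": len(tweets_per_user),
        "retweet_proportion": retweets / total,
        # Maps "number of tweets" -> "number of users with that many tweets".
        "tweets_per_user_distribution": Counter(tweets_per_user.values()),
    }


# Tiny made-up sample for illustration only.
sample_tweets = [
    {"user": {"id_str": "1"}, "entities": {"urls": [{"expanded_url": "https://example.com"}]}},
    {"user": {"id_str": "1"}, "entities": {"urls": []}, "retweeted_status": {}},
    {"user": {"id_str": "2"}, "entities": {"urls": []}},
    {"user": {"id_str": "3"}, "entities": {"urls": [{"expanded_url": "https://example.org"}]},
     "retweeted_status": {}},
]

stats = summarize(sample_tweets)
print(stats)
```

For the full data set, `tweets` can be a generator that reads one JSON object per line from disk, so the 2 million tweets never need to be held in memory at once. The `tweets_per_user_distribution` counts can then be passed to a plotting library; because a few users typically account for many tweets, a logarithmic axis is often easier to read.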