{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Populations and Samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous chapter we ended off by talking about the \"birthday problem\" and used simulation to stumble upon a surprising result. But when we solved the problem we relied on the assumption that *every single birthday is equally likely* in our population. As a data scientist, we might like to check if that assumption is actually valid!\n",
"\n",
"What if in actuality less people are born over the summer, or perhaps more people are born nine months after Valentines day? In that case, certain birthdays would be more prevalent than others. But how can we actually find out whether or not this is true?\n",
"\n",
"One way is to ask everyone in the world what their birthday is... good luck with that! As it turns out, it's not feasible to survey billions of people. Fortunately, we can get a pretty good idea of how birthdays are distributed by collecting a *sample* of the population.\n",
"\n",
"This is precisely the concept we'll introduce in this chapter.\n",
"\n",
"Many interesting data-driven questions revolve around understanding some aspect of a 'population'. In practice, we usually can't collect data on the entire population, so instead we rely on samples to give us a representative idea of what the population actually looks like, or how it actually behaves."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Populations\n",
"\n",
"A {dterm}`population` in data science and statistics is an entire group of people, objects, or events on which we're capable of collecting data, in order to answer a specific question. The answers to questions like \"do people prefer to watch content A or content B\" or \"what is a typical price for a house\" are dependent on what specific *population* of people or houses they're referring to -- perhaps the populations about could be \"all Netflix users\", or \"all houses in the South of France\". Interestingly, a population can also refer to a group of events, such as \"all fair dice rolls\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specifying a population\n",
"\n",
"When we pose a question, we must narrow in on a corresponding population. There are two steps we follow to do so.\n",
"1. What group is our question about?\n",
"\n",
" Maybe we're only interested in particular group -- like house prices specifically in the South of France. Therefore, our population would be narrowed down from all houses in the world to just houses in the South of France.\n",
" \n",
"2. What group can we collect data from?\n",
"\n",
" Our data collection is often a limiting factor -- for example, if we're running a survey hosted on our website then we can only collect data on the people who visit our site. Therefore, our population is narrowed down from all users of the internet to just users who visit our site. We must remember that our population during analysis is always limited by the data we collect.\n",
"\n",
"In the birthday problem above, we're probably interested in the population being the entire world -- are *all* birthdays actually equally likely across the globe? However, we managed to collect [a sample](https://github.com/fivethirtyeight/data/tree/master/births) which only contains births in the United States. Therefore we must narrow our population to only birthdays in the U.S. since that's the population that our data is coming from."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Population distributions\n",
"\n",
"Recall that knowing the *distribution* of some data is imperative to understanding and gaining insight from the data. The distribution is referred to as the 'shape' of the data, or simply what the data *looks* like.\n",
"\n",
"The {dterm}`population distribution` is the shape of a feature across the entire specified population. In most settings, it's considered -- unchanging -- for the duration of a study. Notice that even something like the distribution of birthdays in the U.S., which technically changes every time someone is born or dies, is unlikely to change by any substantial amount over the course of a study."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Models\n",
"\n",
"In reality, unless we fully measure an entire population, we'll never actually *know* what a true population distribution looks like.\n",
"\n",
"However, in many scenarios we tend to operate under *assumptions* about what populations look like. Statisticians (and marketing folks) love to call a set of these assumptions about the population distribution a \"{dterm}`model` for the population\".\n",
"\n",
"For example, we proposed a model for the population of U.S. birthdays, which assumes that every birthday is equally likely and thus the population distribution is *uniform*. (Remember that we're ignoring leap years for simplicity -- sorry!)\n",
"\n",
"![](uniform-birthdays.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{hiddenanswer}\n",
"---\n",
"question: |\n",
" At the time of this writing, the United States populace is approximately $330,221,340$ people. Let's call the population size $N$. Under the assumptions of our model, each of the 365 days in a year is assumed to be equally likely $P(\\text{day})=\\frac{1}{365}$. \n",
"\n",
" By using a frequentist approach to probability, $P(\\text{day})=\\frac{\\text{# people with this birthday}}{N}$, can you calculate the number of people in the U.S. with a birthday on any given day?\n",
"\n",
"answer: |\n",
" $$\n",
" \\begin{aligned}\n",
" N &= 320,221,340 \\\\\n",
" \\\\\n",
" P(\\text{day}) &= \\frac{1}{365} = \\frac{\\text{# people with this birthday}}{N} \\\\\n",
" \\\\\n",
" \\text{# people with this birthday} &= \\frac{365}{N} \\\\\n",
" &= 904,716 \\\\\n",
" \\end{aligned}\n",
" $$\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This distribution is purely the result of assumptions from our model!\n",
"\n",
"Perhaps the distribution of U.S. birthdays actually looks totally different, like a wave ![](wave-birthdays.png), or a U ![](u-birthdays.png), or it could possess gaps where mysteriously no one has a birthday ![](missing-birthdays.png). (Do you actually *know* anyone born on March 19th? Yeah, nor do I.)\n",
"\n",
"The truth is: we don't know, and we'll never know what the *true* population distribution is for most measurements since it's infeasible to collect data on every individual in the population!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} A look forward\n",
"What we *can* find out is whether or not the data we eventually collect actually fits our model.\n",
"\n",
"The samples we're about to learn about are imperative for allowing us to create \"hypothesis tests\" which can tell is if our model is believable or not!\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Probability distributions\n",
"\n",
"In many *models*, we basically end up assuming that our distribution matches some known *probability distribution*. This is useful since we know the probability of every outcome in a probability distribution!\n",
"\n",
"There are also some cases where we *do* know the true population distribution, because we know that the members of the population must follow some probabilistic process.\n",
"\n",
"For example, when working with the population, \"outcomes of all fair six-sided dice rolls\", we know *by the nature of 'fair'* that the true population distribution is the uniform probability distribution.\n",
"\n",
"![](dice-distribution.png)\n",
"\n",
"This uniform distribution arises as the true population distribution when all values in that population are equally likely. There are *loads* of other probability distributions that exist, many with their own natural-world counterparts such as the 'bell curve' shape of people's birth weight, or the 'Poisson' shape of the number of supernovae seen per year, or the 'Pareto' distribution of wealth allocation (commonly abbreviated to simply the 80-20 rule)\n",
"\n",
"![](probability-distributions.png)\n",
"\n",
"If we can convince ourselves that measurements from a particular population is the result of a probabilistic process -- such as fair dice rolls, occurrences of cosmological events, mature heights of a specific natural organism, or even combinations of multiple random inputs -- then we know that the true population distribution will match a mathematically-calculatable probability distribution!\n",
"\n",
"That being said, if we're not entirely sure whether or not a population truly arises from a probabilistic process, we can still *hypothesize* that the population distribution looks similar to a probability distribution -- this is the case when we're comparing the distribution of U.S. birthdays to the uniform distribution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Samples\n",
"\n",
"No matter what our hypotheses may be, if we want to get our hands dirty and witness first-hand what our population looks like, we must turn to the power of sampling.\n",
"\n",
"In contrast to a population, a {dterm}`sample` is a subset of individuals randomly taken from a population. While it's infeasible to collect data from every member of a population, we can easily collect data from *some* of them.\n",
"\n",
"Here's that sample of U.S. birthdays we mentioned earlier. This subset of the population only includes total births each day between the years 2000 and 2014 (inclusive) and includes $45,538,958$ birthdays (out of the roughly $320,000,000$ birthdays that exist in the country."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import babypandas as bpd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"birthdays = bpd.read_csv('../../data/us_total_births_2000-2014_no_leaps.csv')\n",
"birthdays.plot(kind='bar', x='day_of_year', y='births', width=1, legend=False)\n",
"plt.xlabel('Day of year')\n",
"plt.xticks(range(0, 366, 10), range(0, 366, 10))\n",
"plt.ylabel('Frequency')\n",
"plt.title('Sample of U.S. Total Births per Day, 2000-2014')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hmmm, our sample seems to have a few days where considerably fewer births occur. This maybe just happened by random chance that less people were born on certain days. Maybe not, though, and our sample is a good way to find out. Out of curiosity, let's check which days had the fewest births."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Top 5 lowest number of births\n",
"birthdays.sort_values('births').iloc[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} A look forward\n",
"Notice that our sample doesn't quite match the uniform model we proposed. Is this sample evidence that our uniform model is wrong? Certainly is seems very odd that a truly uniform distribution could produce a random sample with some days (like December 25, January 1, and July 4) with so few births...\n",
"\n",
"These days don't seem so 'random' either. Chances are the true population of U.S. birthdays is *not* uniform. We'll learn how we can use our sample to make decisions about our model in the Hypothesis Testing chapter.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling schemes\n",
"\n",
"The process of choosing individuals from a population is called *sampling*, and while there are many different ways you can sample from a population, some approaches are better than others!\n",
"\n",
"For example, if we were tasked with collecting a sample of people in the U.S., in an attempt to answer the birthday assumption, we might be tempted to just ask the people around us -- you might choose to ask your classmates, or simply the first hundred people you encounter on campus. This is called a *convenience sample* because, well, it's convenient. Unfortunately it's a bad practice.\n",
"\n",
"Poor sampling schemes, like convenience sampling, produce samples that don't accurately represent the population. Remember that our population is determined in part by the data we're able to collect -- so when conducting convenience sampling we're essentially limiting our population to individuals in our geographical area at best."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{margin}\n",
"Did you know that the birthdays on sports teams are usually more common earlier in the year than later? So, if you were about to only ask your teammates on the local hockey team -- well, don't, that would probably be a skewed sample!\n",
"\n",
"https://en.wikipedia.org/wiki/Relative_age_effect\n",
"\n",
"https://www.wired.com/2013/03/nhl-selection-bias/\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A much better sampling scheme should produce samples that are *representative* of the population -- therefore diverse and random in nature.\n",
"\n",
"By far the most common, simple, and yet extremely powerful method of sampling is called the \"*simple random sample*\". NumPy has a name for it too: `np.random.choice`.\n",
"\n",
"As you might already suspect, this representative sampling scheme boils down to one simple step: *pick individuals from the total population, completely at random*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Collecting a random sample\n",
"\n",
"NumPy's `random.choice` function should at least sound familiar to us by now, since we used it to simulate coin flips. In general, the function works by randomly picking elements from a sequence, like a list or an array -- in this case, randomly picking elements from a population.\n",
"\n",
"In the case of simulating probabilities, we're essentially sampling from a probability distribution. But, in order get some experimental data samples, we need some population data to sample from!\n",
"\n",
"How about we look at some fish? Each year in August the London Zoo measures the weight of its nearly 20,000 animal inhabitants, 7072 of which were fish as of 2019. Weight is a very powerful indicator of general health, but measuring the entire population of fish at the London Zoo is, obviously, quite an arduous process.\n",
"\n",
"At other points in the year we still want to track the population distribution of fish weights, but we don't want to measure all seven-thousand fish again -- nor do we need to! As long as we collect a representative sample we can avoid a lot of work and still understand roughly what distribution of weights looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load in an array of fish weights in kilograms\n",
"fish = np.loadtxt('../../data/fish_kg.csv')\n",
"fish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, the 'true' population of fish weights in kilograms has been loaded into a table -- but we won't look at it just yet. Let's work on sampling first.\n",
"\n",
"By calling `np.random.choice` we can don our wetsuits, pull out a single random fish from anywhere in the entire zoo, measure it and then put the fish back in its tank."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.choice(fish)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sample size and replacement\n",
"\n",
"If you remember from the last chapter, `random.choice` also accepts a `size=` argument. The name of the argument, 'size', may have seemed strange in the context of simulating coin flips, but in the context of collecting a sample it fits right in. We can set the `size=` argument to our desired sampled size!\n",
"\n",
"Careful though. By default, the `random.choice` function *puts the fish back* each time it samples an individual, which poses a problem if we try to sample multiple individuals. If we're unlucky, we might measure a fish, put it back, and then swim around and accidentally grab the same fish again! In a large population, this chance is really slim so it doesn't matter if we replace or don't replace the fish, but it's good to be cognizant of when dealing with smaller populations.\n",
"\n",
"If we want to take our sample all at the same time, *without* replacing the individual after each pull. Therefore, we must remember to specify `replace=False`.\n",
"\n",
"Let's try it now by filling up our scuba tank and randomly pulling aside fifty fish."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.choice(fish, size=50, replace=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the power to sample representatively with whatever sample size we want. One fish, two fish, ~~red fish, blue fish~~, one-hundred, two-thousand...\n",
"\n",
"You may have heard before that bigger sample sizes are better, so how about we go large and sample ten-thousand fish from our zoo?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"np.random.choice(fish, size=10_000, replace=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can't.\n",
"\n",
"The error message spells out the issue nicely -- we cannot take a sample that is larger than the population if we're sampling without replacement!\n",
"\n",
"In most cases it is true that larger sample sizes are better -- we'll learn more about this in the next chapter -- but we have some notable limits. In real life, factors like cost or data availability might also limit our sample size."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sampling from tables\n",
"\n",
"Sampling is the act of picking *individuals* from a population. Those individuals might possess multiple measurements, and it's important we keep those measurements together in case we want to compare those measurements in the future, like for measuring correlation.\n",
"\n",
"When we're sampling individuals from a table, we need to create a sample of *rows*. For this, we use the DataFrame `.sample` method. It works exactly like the random choice function, but samples with `replace=False` as the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import babypandas as bpd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load a table of fish weights and fish lengths!\n",
"fish_frame = bpd.read_csv('../../data/fish_kg_cm.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To sample a single row of our table, the `.sample` method can be called with no arguments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fish_frame.sample()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if we want to specify a sample size, we can just pass in the size as the first argument without needing any keywords. Remember that `replace=False` is the default behavior for sampling from tables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Randomly sample 50 rows without replacement\n",
"fish_frame.sample(50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample distributions\n",
"\n",
"Just like populations, the distribution of a feature in our sample is called the {dterm}`sample distribution`. We can quickly use a visualization such as a bar chart or histogram to see what it looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = fish_frame.sample(50)\n",
"\n",
"sample.plot(kind='hist', y='WEIGHT', density=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each time we take a sample, we'll mostly end up collecting a different group of fish. As such, each time we collect an empirical sample distribution, it'll look a bit different.\n",
"\n",
"Let's take a look at a handful of sample distributions -- all coming from the same population and the same sample size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You don't need to understand this plotting code, but congrats if you do :)\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def draw_five_samples(sample_size):\n",
" \"\"\"\n",
" Collects five random samples with a specified sample size, then plots the\n",
" corresponding sample distributions side-by-side.\n",
" \"\"\"\n",
"\n",
" # Create a figure to hold five charts on the same row\n",
" fig, axes = plt.subplots(1, 5, sharey=True, figsize=(10, 2))\n",
" fig.suptitle('Sample Distributions of Weight, n=' + str(sample_size))\n",
"\n",
" # Draw a sample and histogram for each chart position\n",
" for ax in axes:\n",
" fish_frame.sample(sample_size).plot(kind='hist', y='WEIGHT', density=True,\n",
" ax=ax, legend=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"draw_five_samples(50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sample distributions converge to the population distribution\n",
"\n",
"It's finally time to look at the true population distribution. This whole experiment was to find out if we truly can collect just a sample of fish weights and still determine what the population looks like. So how close did we get?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fish_frame.plot(kind='hist', y='WEIGHT', density=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may have gotten close, we may have been quite wrong, depending on *which sample we pulled*. Notice that there is a lot of disparity between each sample distribution, and a varying amount of disparity between the population distribution and each sample distribution.\n",
"\n",
"Here's where sample size starts to matter. Let's increase the number of fish we measure from fifty to one-hundred."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"draw_five_samples(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quite a bit better, right? And it's still tremendously less work than measuring the entire population of fish.\n",
"\n",
"As we increase sample size, sample distributions will increasingly resemble the population distribution they come from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"draw_five_samples(500)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Why do sample distributions resemble their population?\n",
"\n",
"When we take a representative sample, we are picking individuals at random from the population. Recall that when more individuals share a similar value of a feature, the population distribution is considered 'denser' at that value.\n",
"\n",
"![](population-distribution.png)\n",
"\n",
"When we randomly pick an individual, we're more likely to pick individuals from denser regions (higher bars) since there are physically more individuals in our population that exist in that range of values. The opposite can be said for regions of low density (lower bars) -- if there's a low density, we're less likely to pick an individual there.\n",
"\n",
"![](population-samples.png)\n",
"\n",
"Therefore, once we've taken our full sample and plot out the sample distribution, we the sample will also exhibit more individuals from the same dense regions, and fewer individuals from the same low density regions.\n",
"\n",
"![](sample-distribution.png)\n",
"\n",
"Invariably, there is random chance involved during sampling, so in most cases we won't produce a sample distribution that is exactly the same as the population. Each sample looks a bit different from the population and from each other, but greater disparities have rarer probabilities. It's downright statistically *unlikely* that we could produce a large sample distribution that looks super different from our population distribution."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}