DataFrames

Most of the data sets that we are interested in are more complex than simple lists of numbers. For instance, consider a data set containing information about California wildfires. It might contain multiple pieces of data about each fire, including its name, size, location, and cause.

How would we store and analyze such a data set? While we could use NumPy arrays – one holding the fire’s name, another holding the size, etc. – there is a much better way: We will use a table to contain the data, like this:

name year cause acres county
0 CAMP 2018 11 - Powerline 153335.562500 Butte
1 BUTTE 2015 14 - Unknown 70846.531250 Calaveras
2 KING 2014 7 - Arson 97684.546875 El Dorado
3 ROUGH 2015 1 - Lightning 151546.812500 Fresno
4 MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ... ...
45 DAY 2006 5 - Debris 161815.656250 Ventura
46 MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
47 THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
48 SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
49 COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 5 columns

Pandas DataFrames

Like NumPy arrays, tables are provided by a third-party extension. The Python package which provides tables is called pandas. Pandas is the tool for doing data science in Python, and it is immensely popular – as of Summer 2020, it was downloaded nearly 1 million times per day. It is without a doubt a powerful tool, and you’ll need to know how to use it if you want to do serious data science. But there’s a problem: pandas is complicated. There are numerous ways to do even the simplest tasks. This makes it hard to learn, especially if you’re new to programming.

This leaves us in an interesting situation. On one hand, we want to learn pandas, because it is the tool used by actual data scientists. On the other hand, we don’t want to be thrown into the deep end. The solution? We’ll take pandas and remove everything that isn’t absolutely necessary, resulting in something simpler and easier to learn. What’s left is still pandas – just not all of it. Because this new package is a smaller (and cuter) version of pandas, we’re calling it babypandas.

To get access to the functionality that babypandas provides, we’ll need to import it:

import babypandas as bpd

Note

We’re going to be using babypandas in the rest of this book, but it should be stressed that babypandas is pandas, just a smaller version of it. So if someone asks if you have experience working with pandas (during a job interview, for instance), you’ll be able to say “yes!”.

In babypandas (and pandas), a table is called a DataFrame (though we’ll use the two terms interchangeably). Since DataFrames are often used to store very large data sets, they are not typically created by typing their entries one by one – instead, they are usually read from a file. We’ll see how to do that in a moment, but for now we assume that we have already loaded a DataFrame into a variable called fires. If we type fires in our Jupyter notebook cell and execute it, it will display the table with nice formatting:

fires
name year cause acres county
0 CAMP 2018 11 - Powerline 153335.562500 Butte
1 BUTTE 2015 14 - Unknown 70846.531250 Calaveras
2 KING 2014 7 - Arson 97684.546875 El Dorado
3 ROUGH 2015 1 - Lightning 151546.812500 Fresno
4 MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ... ...
45 DAY 2006 5 - Debris 161815.656250 Ventura
46 MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
47 THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
48 SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
49 COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 5 columns

If we ask for the type of fires, Python will tell us that it is a DataFrame:

type(fires)
babypandas.bpd.DataFrame

A DataFrame consists of columns and rows. Almost always, a row represents a single thing – in this case, a fire – and the columns provide different pieces of information about that thing. In this case, we have a column describing the name of the fire, another describing the cause, and so on.

We can get the number of rows and columns in a DataFrame by asking for its shape:

fires.shape
(50, 5)

This tells us that there are 50 rows and 5 columns. If for whatever reason we just wanted the number of rows, we could ask for the first element of this pair:

fires.shape[0]
50
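
As a quick check of how .shape behaves, here is a minimal sketch that builds a tiny table by hand from the first three fires shown above. It uses pandas directly (imported as pd) rather than babypandas, on the assumption that the pieces used here behave the same way in both. The variable name small_fires is made up for this illustration.

```python
import pandas as pd

# Hypothetical miniature table: the first three fires from the data set above.
small_fires = pd.DataFrame({
    'name': ['CAMP', 'BUTTE', 'KING'],
    'year': [2018, 2015, 2014],
    'acres': [153335.5625, 70846.53125, 97684.546875],
})

# .shape is a pair: (number of rows, number of columns).
n_rows, n_cols = small_fires.shape
print(n_rows, n_cols)
```

Unpacking the pair into two variables is an alternative to indexing with [0] and [1], and it is handy when both numbers are needed.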

Every row and column in a DataFrame has a label. We will use the row and column labels to refer to particular parts of the table and retrieve information from within it. The columns of the above DataFrame are labeled “name”, “year”, “cause”, and so on. The rows of the above table are simply labeled “0”, “1”, “2”, and so forth.

The Index

Together, the row labels are called the table index. By default, a table’s rows are labeled by numbering them. However, in many cases it makes more sense to label the rows in some other way. For example, each row in our current data set is a single fire. Perhaps it makes more sense to use the fire’s name as its row label. We can ask babypandas to use a particular column as the row labels with the .set_index method:

fires.set_index('name')
year cause acres county
name
CAMP 2018 11 - Powerline 153335.562500 Butte
BUTTE 2015 14 - Unknown 70846.531250 Calaveras
KING 2014 7 - Arson 97684.546875 El Dorado
ROUGH 2015 1 - Lightning 151546.812500 Fresno
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 4 columns

The .set_index method accepts one argument: the label of the column that should be used as the index. It then creates a new DataFrame in which the index has been replaced with the information from this column; the old DataFrame is not changed. In order to save the results, we’ll need to assign the new table to a variable, like so:

fires_by_name = fires.set_index('name')
fires_by_name
year cause acres county
name
CAMP 2018 11 - Powerline 153335.562500 Butte
BUTTE 2015 14 - Unknown 70846.531250 Calaveras
KING 2014 7 - Arson 97684.546875 El Dorado
ROUGH 2015 1 - Lightning 151546.812500 Fresno
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 4 columns

Notice that the fire names have been moved all the way to the left, and have been made bold – this is babypandas’ way of showing that these names are now the index.

Warning

The index is not a column – it is its own separate thing. When we use .set_index, the old index is thrown out and the number of columns decreases by one.

Since we will later use row labels to refer to specific rows by name, the labels should be unique. Here, that means that every fire should have a unique name. Every fire in this data set is uniquely named, so it is fine to use the fire name as the index. Later, we’ll see a larger version of this data set in which there are multiple fires with the same name. In that case, the name should probably not be used as the index.
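
The sketch below illustrates both points on a hand-built table: .set_index returns a new table with one fewer column, and the labels can be checked for uniqueness before we rely on them. It uses pandas directly on the assumption that .set_index behaves the same way there. The names small_fires, by_name, and labels_are_unique are invented for the example, and the uniqueness check is plain Python rather than a babypandas feature.

```python
import pandas as pd

# Hypothetical miniature table built from rows shown above.
small_fires = pd.DataFrame({
    'name': ['CAMP', 'BUTTE', 'KING'],
    'year': [2018, 2015, 2014],
    'county': ['Butte', 'Calaveras', 'El Dorado'],
})

# .set_index returns a new table; small_fires itself is left alone.
by_name = small_fires.set_index('name')

print(small_fires.shape)  # still three columns
print(by_name.shape)      # one fewer column: 'name' is now the index

# Plain-Python uniqueness check: a set discards repeated labels.
labels = list(by_name.index)
labels_are_unique = len(labels) == len(set(labels))
print(labels_are_unique)
```

If the check printed False, some name would appear more than once, and another column (or the default numbering) would be a safer choice of index.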

A table’s index is essentially an array. We can get the index by writing:

fires_by_name.index
Index(['CAMP', 'BUTTE', 'KING', 'ROUGH', 'MEGRAM', 'RANCH', 'VALLEY ',
       'ROCKY  ', 'FORK', 'RUSH', 'RAVENNA', 'WOOLSEY', 'STATION', 'FERGUSON',
       'DETWILER', 'SCARFACE', 'SOBERANES', 'INDIANS', 'KIRK', 'BASIN COMPLEX',
       'MARBLE-CONE', 'CHIPS', 'OLD', 'HARRIS 2', 'CEDAR', 'WITCH', 'LAGUNA',
       'LAS PILITAS', 'HIGHWAY 58', 'ZACA', 'REFUGIO', 'LA BREA', 'CARR ',
       'OAK', 'FRYING PAN', 'GLASS MOUNTAIN', 'KINCADE', 'CAMPBELL',
       'SKINNER MILL', 'HAPPY', 'MANTER', 'MCNALLY', 'RIM', 'CLAMPITT FIRE',
       'WHEELER #2', 'DAY', 'MATILIJA', 'THOMAS', 'SIMI FIRE', 'COUNTY'],
      dtype='object', name='name')

We can then access individual elements of the index using the same notation as used with arrays, remembering that Python starts counting from zero:

# the first element
fires_by_name.index[0]
'CAMP'
# the second element
fires_by_name.index[1]
'BUTTE'
# the last element
fires_by_name.index[-1]
'COUNTY'

Series

Getting a column with .get

We can retrieve a particular column from the table with the .get method. For instance, to get the column labeled “acres”, we would write:

fires_by_name.get('acres')
name
CAMP         153335.562500
BUTTE         70846.531250
KING          97684.546875
ROUGH        151546.812500
MEGRAM       125072.531250
                 ...      
DAY          161815.656250
MATILIJA     219999.281250
THOMAS       281790.875000
SIMI FIRE    107570.398438
COUNTY        89831.148438
Name: acres, Length: 50, dtype: float64

The result might look like a DataFrame with one column, but it’s actually a new type of object called a Series:

type(fires_by_name.get('acres'))
babypandas.bpd.Series

A Series is basically an array, but with an index. A Series represents a column in a DataFrame. This means that we can think of the columns of a DataFrame as being arrays (more or less).

Arithmetic

Because a Series is like an array, we can do similar things with it. For instance, we can perform elementwise arithmetic with a Series. Let’s try it out by converting the fire sizes from acres to square miles. Each acre is 0.0015625 square miles, so we can do the conversion with a simple multiplication:

fires_by_name.get('acres') * 0.0015625
name
CAMP         239.586816
BUTTE        110.697705
KING         152.632104
ROUGH        236.791895
MEGRAM       195.425830
                ...    
DAY          252.836963
MATILIJA     343.748877
THOMAS       440.298242
SIMI FIRE    168.078748
COUNTY       140.361169
Name: acres, Length: 50, dtype: float64

We can also perform arithmetic with two series, assuming that they are the same size.
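
For instance, here is a minimal sketch of arithmetic between two Series, written with pandas directly on the assumption that babypandas Series behave the same way here. Dividing each fire's size in acres by its size in square miles should give about 640 for every fire, since there are 640 acres in a square mile. The table small_fires is a hand-built stand-in for the real data.

```python
import pandas as pd

# Hypothetical miniature table with the fire name as the index.
small_fires = pd.DataFrame({
    'name': ['CAMP', 'BUTTE', 'KING'],
    'acres': [153335.5625, 70846.53125, 97684.546875],
}).set_index('name')

acres = small_fires.get('acres')
sqmiles = acres * 0.0015625  # Series times a number

# Series divided by Series: each fire's acres is divided by
# that same fire's square miles.
acres_per_sqmile = acres / sqmiles
print(acres_per_sqmile)
```

The result is again a Series with the same index, so it can be used anywhere a column can.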

Series methods: .mean, .max, .describe, etc.

Series objects also come with a bunch of useful methods attached, like .mean and .max. For example, the average size of a fire in this data set is:

fires_by_name.get('acres').mean()
135919.636875

The largest fire burned this many acres:

fires_by_name.get('acres').max()
410202.46875

And the earliest fire in the data set was in the year:

fires_by_name.get('year').min()
1910

A very useful Series method is .describe. It gives us a quick look at the basic statistics of the data in a particular column:

fires_by_name.get('year').describe()
count      50.000000
mean     1998.840000
std        25.220141
min      1910.000000
25%      1996.000000
50%      2007.500000
75%      2015.000000
max      2019.000000
Name: year, dtype: float64

From this, we can see that there are 50 fires in the data set, the earliest from 1910 and the latest from 2019. The 25%, 50%, and 75% refer to percentiles. That is, 25% of the fires occurred during or before 1996, and half occurred during or before 2007. This also means that half occurred between 2007 and 2019!

We will see more Series methods throughout this textbook, but only when we need to use them.
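
To experiment with these methods on numbers that are easy to check by hand, the sketch below uses only the first five fires from the table. It relies on pandas directly, assuming the same methods are available on a babypandas Series; the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical miniature table: the first five fires shown earlier.
small_fires = pd.DataFrame({
    'name': ['CAMP', 'BUTTE', 'KING', 'ROUGH', 'MEGRAM'],
    'year': [2018, 2015, 2014, 2015, 1999],
    'acres': [153335.5625, 70846.53125, 97684.546875,
              151546.8125, 125072.53125],
}).set_index('name')

largest = small_fires.get('acres').max()       # size of the biggest fire
earliest = small_fires.get('year').min()       # year of the earliest fire
average_year = small_fires.get('year').mean()  # mean of the five years

# .describe returns a Series of summary statistics, so .get works on it too.
summary = small_fires.get('year').describe()
print(largest, earliest, average_year)
print(summary.get('count'))
```

Because these five rows are only a slice of the full data set, the numbers differ from the ones computed above on all 50 fires.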

Jupyter Tip

You can ask Jupyter for some information on all of the Series methods available by writing help(bpd.Series). The methods starting with two underscores (__) are called “dunder” methods, and implement special behavior. You’re not meant to call them directly, so you can pretty much ignore them.

Adding and removing columns

Above, we saw that we could convert the 'acres' column to square miles using a little bit of array math. But doing so doesn’t change the table. What if we want to add this column to our table?

Adding a column with .assign

Adding a column can be done with the .assign method, like this:

fires_by_name.assign(sqmiles=fires_by_name.get('acres') * 0.0015625)
year cause acres county sqmiles
name
CAMP 2018 11 - Powerline 153335.562500 Butte 239.586816
BUTTE 2015 14 - Unknown 70846.531250 Calaveras 110.697705
KING 2014 7 - Arson 97684.546875 El Dorado 152.632104
ROUGH 2015 1 - Lightning 151546.812500 Fresno 236.791895
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt 195.425830
... ... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura 252.836963
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura 343.748877
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura 440.298242
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura 168.078748
COUNTY 2018 14 - Unknown 89831.148438 Yolo 140.361169

50 rows × 5 columns

There’s a lot going on here, so let’s break it down. First, the .assign method takes a single argument: a Series that will become the new column. But the way that we pass this argument is new. Instead of simply passing the argument itself, we also give the argument a name by writing sqmiles=. This will be the column’s label. Arguments written in the form argument_name=argument_value are called keyword arguments.
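
Keyword arguments are a general Python feature, not something specific to tables. The sketch below shows them with two made-up functions, describe_fire and new_column_labels, and then with .assign on a tiny hand-built table. It uses pandas directly, on the assumption that .assign treats keyword names the same way in babypandas.

```python
import pandas as pd

def describe_fire(name, acres):
    """Hypothetical helper with two ordinary parameters."""
    return f"{name}: {acres} acres"

# Keyword arguments are matched by name, so their order does not matter.
message = describe_fire(acres=100.0, name='EXAMPLE')

def new_column_labels(**columns):
    """Accept any keyword names at all and report which ones were used."""
    return list(columns)

labels = new_column_labels(sqmiles=[1.0, 2.0], year=[2018, 2015])

# With .assign, the keyword name we choose becomes the new column's label.
small_fires = pd.DataFrame({'acres': [640.0, 1280.0]})
with_sqmiles = small_fires.assign(sqmiles=small_fires.get('acres') / 640)

print(message)
print(labels)
print(list(with_sqmiles.columns))
```

The second function hints at why we are free to pick the name in .assign: a function can be written to accept whatever keyword names the caller supplies.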

We can call the column anything we like, as long as it is a valid Python variable name. This means that the name cannot contain spaces or start with a digit. If you try, you’ll get a SyntaxError:

fires_by_name.assign(square miles=fires_by_name.get('acres') * 0.0015625)
  File "<ipython-input-21-546d28f7098c>", line 1
    fires_by_name.assign(square miles=fires_by_name.get('acres') * 0.0015625)
                                    ^
SyntaxError: invalid syntax

Instead of spaces, we can use underscores:

fires_by_name.assign(square_miles=fires_by_name.get('acres') * 0.0015625)
year cause acres county square_miles
name
CAMP 2018 11 - Powerline 153335.562500 Butte 239.586816
BUTTE 2015 14 - Unknown 70846.531250 Calaveras 110.697705
KING 2014 7 - Arson 97684.546875 El Dorado 152.632104
ROUGH 2015 1 - Lightning 151546.812500 Fresno 236.791895
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt 195.425830
... ... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura 252.836963
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura 343.748877
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura 440.298242
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura 168.078748
COUNTY 2018 14 - Unknown 89831.148438 Yolo 140.361169

50 rows × 5 columns

The second thing to note is that .assign creates an entirely new table containing the new column. It does not change the old table, as we can verify by recalling the value of fires_by_name:

fires_by_name
year cause acres county
name
CAMP 2018 11 - Powerline 153335.562500 Butte
BUTTE 2015 14 - Unknown 70846.531250 Calaveras
KING 2014 7 - Arson 97684.546875 El Dorado
ROUGH 2015 1 - Lightning 151546.812500 Fresno
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 4 columns

Note

Wherever possible, DataFrame and Series methods return new objects instead of modifying existing ones. Creating copies like this results in code that is easier to reason about and helps prevent strange bugs.

In order to permanently add the column to the table, we need to save the result of .assign to a variable.

fires_with_sqmiles = fires_by_name.assign(
    sqmiles=fires_by_name.get('acres') * 0.0015625
)
fires_with_sqmiles
year cause acres county sqmiles
name
CAMP 2018 11 - Powerline 153335.562500 Butte 239.586816
BUTTE 2015 14 - Unknown 70846.531250 Calaveras 110.697705
KING 2014 7 - Arson 97684.546875 El Dorado 152.632104
ROUGH 2015 1 - Lightning 151546.812500 Fresno 236.791895
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt 195.425830
... ... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura 252.836963
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura 343.748877
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura 440.298242
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura 168.078748
COUNTY 2018 14 - Unknown 89831.148438 Yolo 140.361169

50 rows × 5 columns

Removing a column with .drop

Columns can be removed using the .drop method. It accepts one keyword argument: columns. The argument can either be the label of a single column as a string, or a list of column labels. As with .assign, the result is a new DataFrame (a copy).

For example, to get rid of the 'sqmiles' column:

fires_with_sqmiles.drop(columns='sqmiles')
year cause acres county
name
CAMP 2018 11 - Powerline 153335.562500 Butte
BUTTE 2015 14 - Unknown 70846.531250 Calaveras
KING 2014 7 - Arson 97684.546875 El Dorado
ROUGH 2015 1 - Lightning 151546.812500 Fresno
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 4 columns

If we didn’t want the cause or the county:

fires_with_sqmiles.drop(columns=['cause', 'county'])
year acres sqmiles
name
CAMP 2018 153335.562500 239.586816
BUTTE 2015 70846.531250 110.697705
KING 2014 97684.546875 152.632104
ROUGH 2015 151546.812500 236.791895
MEGRAM 1999 125072.531250 195.425830
... ... ... ...
DAY 2006 161815.656250 252.836963
MATILIJA 1932 219999.281250 343.748877
THOMAS 2017 281790.875000 440.298242
SIMI FIRE 2003 107570.398438 168.078748
COUNTY 2018 89831.148438 140.361169

50 rows × 3 columns

Note that the argument name (columns) is not something we can change, unlike the keyword argument name used in .assign. We must use columns=..., or else Python will not understand us. And if you don’t use the keyword name, Python will be upset:

fires_with_sqmiles.drop('county')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-7d1ab73fa32f> in <module>
----> 1 fires_with_sqmiles.drop('county')

TypeError: drop() takes 1 positional argument but 2 were given
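
Both forms of the columns argument can be tried on a small hand-built table, as in the sketch below. It uses pandas directly, and that is an assumption worth flagging: plain pandas does not produce the babypandas error shown above when the keyword is left out, so the sketch always writes columns= explicitly. The variable names are invented for the example.

```python
import pandas as pd

# Hypothetical miniature table with the fire name as the index.
small_fires = pd.DataFrame({
    'name': ['CAMP', 'BUTTE'],
    'year': [2018, 2015],
    'cause': ['11 - Powerline', '14 - Unknown'],
    'county': ['Butte', 'Calaveras'],
}).set_index('name')

# A single label, given as a string.
without_county = small_fires.drop(columns='county')

# Several labels, given as a list.
year_only = small_fires.drop(columns=['cause', 'county'])

print(list(without_county.columns))
print(list(year_only.columns))
print(list(small_fires.columns))  # the original table is unchanged
```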

Renaming columns

How do we rename a column? Suppose we want to rename sqmiles to square_miles. To do so, we:

  1. Add a new column with the desired name by copying the old column.

  2. Drop the old column.

For instance:

fires_with_new_name = fires_with_sqmiles.assign(
    square_miles=fires_with_sqmiles.get('sqmiles')
)
fires_with_new_name.drop(columns='sqmiles')
year cause acres county square_miles
name
CAMP 2018 11 - Powerline 153335.562500 Butte 239.586816
BUTTE 2015 14 - Unknown 70846.531250 Calaveras 110.697705
KING 2014 7 - Arson 97684.546875 El Dorado 152.632104
ROUGH 2015 1 - Lightning 151546.812500 Fresno 236.791895
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt 195.425830
... ... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura 252.836963
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura 343.748877
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura 440.298242
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura 168.078748
COUNTY 2018 14 - Unknown 89831.148438 Yolo 140.361169

50 rows × 5 columns

We can also do this in a single piece of code, without intermediate variables:

(
    fires_with_sqmiles
    .assign(square_miles=fires_with_sqmiles.get('sqmiles'))
    .drop(columns='sqmiles')
)
year cause acres county square_miles
name
CAMP 2018 11 - Powerline 153335.562500 Butte 239.586816
BUTTE 2015 14 - Unknown 70846.531250 Calaveras 110.697705
KING 2014 7 - Arson 97684.546875 El Dorado 152.632104
ROUGH 2015 1 - Lightning 151546.812500 Fresno 236.791895
MEGRAM 1999 1 - Lightning 125072.531250 Humboldt 195.425830
... ... ... ... ... ...
DAY 2006 5 - Debris 161815.656250 Ventura 252.836963
MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura 343.748877
THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura 440.298242
SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura 168.078748
COUNTY 2018 14 - Unknown 89831.148438 Yolo 140.361169

50 rows × 5 columns

Tip

You can break up long expressions by surrounding the whole expression with parentheses and inserting line breaks wherever makes sense. We’ll often break right at a method call.

This trick of applying two methods, one after the other, in a single expression is called method chaining. It works because the result of .assign is itself a table. When Python evaluates the expression, it first evaluates the .assign, then uses the resulting table during the call to .drop.

Method chaining is useful and can save us some typing, but it can be overused. It is sometimes better to save intermediate results.

Tip

If your method-chaining code isn’t working as you’d expect, break apart the code and save intermediate variables. Print out the values of these variables to do some debugging.
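
To make the comparison concrete, the sketch below renames a column twice on a tiny hand-built table, once with intermediate variables and once with a chain, and then checks that the two results match. It uses pandas directly on the assumption that .assign and .drop chain the same way in babypandas. The .equals comparison is a pandas method that may not be part of babypandas.

```python
import pandas as pd

# Hypothetical miniature table that already has a 'sqmiles' column.
with_sqmiles = pd.DataFrame({
    'name': ['CAMP', 'BUTTE'],
    'sqmiles': [239.586816, 110.697705],
}).set_index('name')

# Step by step, saving the intermediate table so it can be inspected.
copied = with_sqmiles.assign(square_miles=with_sqmiles.get('sqmiles'))
stepwise = copied.drop(columns='sqmiles')

# The same two steps written as one chained expression.
chained = (
    with_sqmiles
    .assign(square_miles=with_sqmiles.get('sqmiles'))
    .drop(columns='sqmiles')
)

# pandas-only check that both approaches produced the same table.
same_result = chained.equals(stepwise)
print(same_result)
print(list(copied.columns))  # the intermediate table has both columns
```

Printing copied shows the table halfway through the rename, which is exactly the kind of intermediate value that helps when a chain misbehaves.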

Reading CSV files

As mentioned above, DataFrames are not typically created by typing their entries by hand, one by one. Instead, we usually download a data set in a standard format and read it from disk. One such standard format is CSV, or comma-separated values.

A CSV file is simply a text file in a certain format. Here are the first few lines of the CSV file containing our wildfire data:

name,year,cause,acres,county
CAMP,2018,11 - Powerline,153335.5625,Butte
BUTTE,2015,14 - Unknown,70846.53125,Calaveras
KING,2014,7 - Arson,97684.546875,El Dorado
ROUGH,2015,1 - Lightning,151546.8125,Fresno
MEGRAM,1999,1 - Lightning,125072.53125,Humboldt
RANCH,2018,14 - Unknown,410202.46875,Lake
VALLEY ,2015,14 - Unknown,76084.8359375,Lake
ROCKY  ,2015,9 - Miscellaneous,69438.1640625,Lake
FORK,1996,7 - Arson,83056.9453125,Lake

As its name suggests, a CSV file consists of values separated by commas. The first line of the file usually contains the column labels. CSV is a widely used format, and can be read by many pieces of software, including Excel and Google Sheets.

We can read a CSV file into a babypandas DataFrame using the bpd.read_csv function. We give this function a string containing the filepath to the CSV file we want to read. For example, our wildfire data exists in a file called calfire.csv contained in the data/ directory. We can read it into a DataFrame as follows:

calfire = bpd.read_csv('data/calfire.csv')
calfire
name year cause acres county
0 CAMP 2018 11 - Powerline 153335.562500 Butte
1 BUTTE 2015 14 - Unknown 70846.531250 Calaveras
2 KING 2014 7 - Arson 97684.546875 El Dorado
3 ROUGH 2015 1 - Lightning 151546.812500 Fresno
4 MEGRAM 1999 1 - Lightning 125072.531250 Humboldt
... ... ... ... ... ...
45 DAY 2006 5 - Debris 161815.656250 Ventura
46 MATILIJA 1932 9 - Miscellaneous 219999.281250 Ventura
47 THOMAS 2017 9 - Miscellaneous 281790.875000 Ventura
48 SIMI FIRE 2003 14 - Unknown 107570.398438 Ventura
49 COUNTY 2018 14 - Unknown 89831.148438 Yolo

50 rows × 5 columns

Modifying the DataFrame will not affect the data on disk in any way.
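
The sketch below checks this claim on a throwaway file. It writes the first two data lines of the CSV shown above to a temporary directory, reads them back, changes the resulting table, and confirms that the text on disk is the same as before. It uses pd.read_csv from pandas directly, on the assumption that bpd.read_csv reads the file the same way. The file name tiny_calfire.csv and the variable names are made up for this example.

```python
import os
import tempfile

import pandas as pd

# The header and first two data lines from the CSV file shown above.
csv_text = (
    "name,year,cause,acres,county\n"
    "CAMP,2018,11 - Powerline,153335.5625,Butte\n"
    "BUTTE,2015,14 - Unknown,70846.53125,Calaveras\n"
)

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file name, created only for this demonstration.
    path = os.path.join(folder, 'tiny_calfire.csv')
    with open(path, 'w') as f:
        f.write(csv_text)

    # Read the file into a table, then make a changed copy in memory.
    tiny = pd.read_csv(path)
    changed = tiny.drop(columns='cause')

    # Re-read the raw text to see whether the file itself changed.
    with open(path) as f:
        text_on_disk = f.read()

print(tiny.shape)
print(changed.shape)
print(text_on_disk == csv_text)
```

The temporary directory and the file inside it are deleted when the with block ends, so nothing is left behind on disk.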