6. Defining and Applying Functions¶
We have seen that Python comes with a bunch of useful functions for performing common tasks. For instance, the built-in
round function rounds a number to a specified number of decimal places.
We have also seen that we can access even more functions by installing and importing a library, like NumPy or babypandas.
In some cases, however, there might not be a library providing the function that you need. Luckily, Python allows us to define our own functions. In this section, we’ll see how to create functions and apply them to tables.
6.1. Defining Functions¶
Suppose you are working with a dataset containing a bunch of street addresses, such as the following address of the University of California, San Diego:
ucsd = '9500 Gilman Dr, La Jolla, CA 92093'
Suppose we only care about the city and state. That is, we’d like to extract the string
'La Jolla, CA' from the full address. Python doesn’t come with a function that does exactly this, but we can write our own without too much work.
6.1.1. Splitting Strings¶
A typical address has several parts: the street address, the city name, the state, and the zip code. The parts are separated by commas (with the exception of the state and zip code). Python strings have a helpful
.split method which will split a string into parts according to whatever delimiter we provide. To split by a comma, we write:
ucsd.split(',')
['9500 Gilman Dr', ' La Jolla', ' CA 92093']
The result is a list of strings, each of them a part of the original string.
If we do not provide a delimiter, the default behavior of
.split is to split based on whitespace (such as spaces):
ucsd.split()
['9500', 'Gilman', 'Dr,', 'La', 'Jolla,', 'CA', '92093']
We can use
.split to retrieve the city and state name. Notice that when we split by commas, the city name will always be the second-to-last entry of the resulting list. This is because the last comma separates the city from the state and zip code. Remember that we can retrieve the second-to-last element of a list using square bracket notation, combined with using
-2 as the index:
city = ucsd.split(',')[-2]
city
' La Jolla'
The result has a leading space that we might want to get rid of – we’ll deal with that in a moment. For now, let’s retrieve the state name. To do this, it might be easiest to split based on whitespace – then the state abbreviation will again be the second-to-last element of the list:
state = ucsd.split()[-2]
state
'CA'
We’d like to put the city and state together into a single string, like
'La Jolla, CA'. To do so, remember that the
+ operator concatenates strings:
city_and_state = city + ', ' + state
city_and_state
' La Jolla, CA'
This is almost perfect, but let’s get rid of the leading space. We can do this with the
.strip() string method, which removes leading and trailing whitespace.
city_and_state.strip()
'La Jolla, CA'
Great! Putting it all together, here’s the code we used to retrieve the city and state:
city = ucsd.split(',')[-2]
state = ucsd.split()[-2]
city_and_state = city + ', ' + state
city_and_state.strip()
'La Jolla, CA'
This code might seem simple enough, but suppose we have another address that we’d like to process:
lego = 'LEGOLAND California Resort 1 Legoland Dr, Carlsbad, CA 92008'
We could copy and paste the code above, but there is a better way: let’s define a function.
In Python, new functions are created using the
def statement. Here is an example of a function which retrieves the city and state name from an address:
def city_comma_state(address):
    """Return CITY, ST from an address string."""
    city = address.split(',')[-2]
    state = address.split()[-2]
    city_and_state = city + ', ' + state
    return city_and_state.strip()
There is a lot to say about this, but first let’s test the function to see if it works. We call user-defined functions just like any other function:
city_comma_state('9500 Gilman Dr, La Jolla, CA 92093')
'La Jolla, CA'
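The payoff of defining a function is that the second address needs no copied code. As a quick check, here is a self-contained sketch that repeats the definition from above and calls it on the lego address:

```python
def city_comma_state(address):
    """Return CITY, ST from an address string."""
    city = address.split(',')[-2]          # second-to-last comma-separated part
    state = address.split()[-2]            # second-to-last whitespace-separated part
    city_and_state = city + ', ' + state
    return city_and_state.strip()          # drop the leading space


lego = 'LEGOLAND California Resort 1 Legoland Dr, Carlsbad, CA 92008'
lego_city = city_comma_state(lego)
print(lego_city)  # Carlsbad, CA
```

The function relies on the address having a comma before the city and a zip code at the very end; addresses in a different layout would need different code.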
Let’s take a closer look at the anatomy of a function definition. Fig. 6.1 below shows all of the different parts.
A function definition starts with a name. Above, we’ve named our function
city_comma_state, but any valid variable name would do. A function’s name should be short but descriptive.
Next come the function’s arguments. These are the “inputs” to the function. In this case, there is only one argument: the address that will be processed. We’ll see how to define functions with more than one argument in a moment. A function can also have zero arguments, in which case we would write
def function_with_no_args():. The arguments can be named anything, as long as they are valid variable names. The arguments are surrounded by parentheses, and separated by commas.
The body of the function contains the code that will be executed when the function is called. The arguments can be used within the body of the function. The body of the function must be indented – we usually do this with the tab key.
The docstring is a piece of documentation that tells the reader what the function does. Including it is optional but recommended. If you ask Python for information on your function using
help, the docstring will be displayed!
help(city_comma_state)
Help on function city_comma_state in module __main__:

city_comma_state(address)
    Return CITY, ST from an address string.
A function should usually return some value – this is done using the
return statement, followed by an expression whose value will be returned.
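To tie the parts together, here is a small annotated sketch. The functions full_name and answer are hypothetical examples made up for illustration; the comments label the name, arguments, docstring, body, and return statement of each.

```python
def full_name(first, last):                 # name: full_name; arguments: first, last
    """Return 'First Last' from two name strings."""   # docstring
    combined = first + ' ' + last           # body: indented code using the arguments
    return combined                         # return statement


def answer():                               # a function with zero arguments
    """Return a fixed number."""
    return 42


print(full_name('Ada', 'Lovelace'))         # Ada Lovelace
print(answer())                             # 42
print(full_name.__doc__)                    # the docstring is stored with the function
```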
6.1.3. Function Behavior¶
The code we include in a function behaves differently than the code we are used to writing in a couple of key ways.
6.1.3.1. Functions are “recipes”¶
The code inside of a function is not executed until we call the function. For instance, suppose we try to do something impossible inside of a function – like dividing by zero:
def foo():
    x = 1/0
    return x
If you run the cell defining this function, everything will be fine: you won’t see an error. But when you call the function, Python lets you know that you’re doing something that is mathematically impossible:
foo()
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-18-c19b6d9633cf> in <module>
----> 1 foo()

<ipython-input-17-477336bbf293> in foo()
      1 def foo():
----> 2     x = 1/0
      3     return x

ZeroDivisionError: division by zero
This is because function definitions are like recipes in the sense that handing someone a recipe is not the same as following the recipe and preparing the meal.
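The difference between defining and calling can be seen in a single cell. This is a minimal sketch; the name risky and the try/except wrapper are additions for illustration, used so that the cell finishes instead of stopping at the error.

```python
def risky():
    x = 1 / 0        # impossible, but nothing happens until the function is called
    return x


print('The definition ran without an error.')

call_failed = False
try:
    risky()          # the body executes only now
except ZeroDivisionError as err:
    call_failed = True
    print('Calling the function failed:', err)
```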
Variables defined within a function are available only inside of the function. We can define variables inside a function just as we normally would:
def foo():
    x = 42
    y = 5
    return x + y
If we run the function, we’ll see the number 47:
foo()
47
However, if we try to use the variable
x, Python will yell at us:
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-6fcf9dfbd479> in <module>
----> 1 x

NameError: name 'x' is not defined
This is because variables defined within a function are accessible only within the function. If we want to use that variable outside of the function, we need to pass it back to the caller using a return statement.
Note that arguments count as “variables defined within a function”. For instance:
def foo(my_argument):
    return my_argument + 2
If we call the function, everything will act as expected; for example, foo(40) returns 42.
But if we try to access
my_argument outside of the function, Python tells us that we can’t:
my_argument
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-785f83eb3696> in <module>
----> 1 my_argument

NameError: name 'my_argument' is not defined
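Both kinds of local names can be checked in one place. In this sketch the names add_two and local_total are hypothetical, and the try/except blocks exist only to record the NameError instead of halting the cell.

```python
def add_two(my_argument):
    local_total = my_argument + 2   # both names exist only while the function runs
    return local_total


result = add_two(40)
print(result)                       # 42: the returned value is available outside

leaked_names = []
for check in ('argument', 'local variable'):
    try:
        if check == 'argument':
            my_argument             # not defined outside the function
        else:
            local_total             # not defined outside the function
        leaked_names.append(check)
    except NameError as err:
        print(err)

print(leaked_names)                 # [] because neither name escaped the function
```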
On the other hand, variables defined outside of a function are available inside the function. Consider for instance:
x = 42

def foo():
    return x + 10
Use this behavior sparingly – it is usually better to “isolate” a function from the outside world by passing in all of the variables that it needs.
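One reason to be sparing is that a function which reads an outside variable can change its answer without any change to the call. The sketch below contrasts the two styles; the names add_ten_global and add_ten are made up for this comparison.

```python
x = 42


def add_ten_global():
    return x + 10        # reads x from outside the function


def add_ten(value):
    return value + 10    # depends only on its argument


print(add_ten_global())  # 52
x = 0
print(add_ten_global())  # 10: the same call now gives a different answer
print(add_ten(42))       # 52, no matter what x is
```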
return exits the function¶
As soon as Python encounters a
return statement, it stops executing the function and returns the corresponding value. As an example, consider the code below which has three
returns. Only the first return statement will ever run:
def foo():
    print('Starting execution.')
    return 1
    print('Hey, I made it!')
    return 2
    print('On to number three...')
    return 3
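Calling such a function confirms that only the code before the first return runs. The sketch also shows a case where several returns are useful: when each sits behind a condition, the first one reached decides the result. The names first_return_wins and sign are hypothetical.

```python
def first_return_wins():
    print('Starting execution.')
    return 1
    print('Hey, I made it!')    # never runs
    return 2                    # never runs


result = first_return_wins()    # prints only 'Starting execution.'
print(result)                   # 1


def sign(number):
    if number < 0:
        return 'negative'       # exits here for negative numbers
    if number == 0:
        return 'zero'           # exits here for zero
    return 'positive'           # reached only if neither condition held


print(sign(-3), sign(0), sign(5))
```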
As we saw above, functions are somewhat isolated from the rest of the world in the sense that variables defined within them cannot be used outside of the function. The “correct” way of transmitting values back to the world is to use a
return statement. However, a common mistake is to use print instead, because the results of printing and
returning look similar in a Jupyter notebook. For example, let’s define a function that both prints and returns:
def foo():
    x = 42
    y = 52
    print(y)
    return x
When we run this function, we’ll see both values:
z = foo()
z
52
42
42 is the output of the cell and can be “saved” to a variable.
52, on the other hand, is simply displayed to the screen and is afterwards lost forever. This can be checked by displaying the value of z, which is 42 and not 52.
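The difference becomes obvious once we try to save the result. In this sketch, shows_only and gives_back are hypothetical functions: the first only prints, the second only returns.

```python
def shows_only(x):
    print(x * 2)        # displayed, but not handed back to the caller


def gives_back(x):
    return x * 2        # handed back, so it can be stored


a = shows_only(21)      # prints 42
b = gives_back(21)      # prints nothing

print(a)                # None: a function without a return statement returns None
print(b)                # 42
```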
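A function can also return more than one value by separating the values with commas:
def foo():
    x = 42
    y = 52
    return x, y
When the function is run, it will return a tuple of two things:
foo()
(42, 52)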
A tuple is like a list, so we can use square bracket notation to retrieve each element:
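As a short sketch of both ways to get at the pieces: indexing with square brackets, and unpacking the tuple into separate names. The function name two_values is made up here to avoid clashing with the earlier foo.

```python
def two_values():
    x = 42
    y = 52
    return x, y                  # the comma makes this a tuple


pair = two_values()
print(pair)                      # (42, 52)
print(pair[0])                   # 42
print(pair[1])                   # 52

first, second = two_values()     # unpacking: one name per element
print(first + second)            # 94
```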
We won’t usually need to return more than one thing from a function, though.
Given a year, produce the decade
Given a year, such as 1994, we’d like to retrieve the decade; in this case, 1990. At first we might think that
round is useful:
But it won’t work for years like 1997, since it will round up:
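The two cases can be sketched as follows, assuming round is called with -1 as its second argument so that it rounds to the nearest ten (the exact calls are an assumption; the text only says that round was tried).

```python
rounded_1994 = round(1994, -1)   # nearest multiple of ten
rounded_1997 = round(1997, -1)

print(rounded_1994)  # 1990, the decade we want
print(rounded_1997)  # 2000, which is not the decade of 1997
```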
There are a few approaches that do work. One way is to use the
% operator. Remember that
x % y returns the remainder upon dividing x by y. For example:
1992 % 10
2
To find the decade, we can simply subtract the remainder obtained by dividing by ten:
1992 - (1992 % 10)
1990
1997 - (1997 % 10)
1990
2000 - (2000 % 10)
2000
Placing this code in a function makes it so we don’t have to remember this trick, and makes our code more readable:
def decade_from_year(year):
    return year - year % 10
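A quick way to gain confidence in the function is to run it on the years tried above. The sketch also includes one alternative, decade_by_floor_division, which is not from the text: it uses the // operator to drop the last digit and then multiplies back by ten.

```python
def decade_from_year(year):
    return year - year % 10          # subtract the remainder


def decade_by_floor_division(year):
    return year // 10 * 10           # hypothetical alternative: drop the last digit


for year in [1992, 1997, 2000]:
    print(year, decade_from_year(year), decade_by_floor_division(year))
```

Both versions agree on these inputs; the remainder version mirrors the reasoning in the text most directly.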
Given base and height, compute the area of a triangle
We need to define a function with two arguments. We do so by separating the argument names with a comma, like so:
def area_of_triangle(base, height):
    return 1/2 * base * height
Note that the order of the arguments matters. When
area_of_triangle(10, 5) is executed, Python assigns the value of 10 to
base and assigns the value of 5 to
height. If you wish, you can use the keyword argument form to call the function, in which case arguments can be provided in any order. This is slightly more readable, too:
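Here is a short sketch of both calling styles side by side. With keyword arguments the names do the matching, so the order in which they are written does not matter.

```python
def area_of_triangle(base, height):
    return 1/2 * base * height


positional = area_of_triangle(10, 5)           # base=10, height=5 by position
keyword = area_of_triangle(height=5, base=10)  # same values, matched by name

print(positional)  # 25.0
print(keyword)     # 25.0
```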
Perform a frequent query
Suppose we frequently want to retrieve only those rows of a table whose entries lie between some thresholds. For instance, we
might want only those fires in
calfire from between 1995 and 2000. By writing this query into a function accepting a table, a column, and the thresholds, we make it easy to repeat:
def between(table, column, start, stop):
    return table[(table.get(column) >= start) & (table.get(column) < stop)]
For instance, to get only those fires from between 1995 and 2000:
between(calfire, 'year', 1995, 2000)
6374 | 1995 | 11 | SEMINOLE | 2 - Equipment Use | 645.149780 | Riverside | -116.772176 | 33.890971
6375 | 1995 | 10 | SHINN | 7 - Arson | 13.694411 | Los Angeles | -117.679804 | 34.181470
6376 | 1995 | 8 | ECHO | 14 - Unknown | 372.055573 | Riverside | -117.209657 | 33.868816
6377 | 1995 | 10 | FREEWAY FIRE NO II | 14 - Unknown | 1233.456909 | Los Angeles | -118.372160 | 34.446973
6378 | 1995 | 12 | TOWSLEY FIRE | 14 - Unknown | 818.184509 | Los Angeles | -118.559865 | 34.345437
...
7333 | 1999 | 7 | VINTAGE | 2 - Equipment Use | 34.438271 | Kern | -119.521235 | 35.207022
7334 | 1999 | 8 | 41 | 10 - Vehicle | 167.177612 | Kern | -120.176735 | 35.768621
7335 | 1999 | 8 | WASHBURN | 1 - Lightning | 272.034943 | San Luis Obispo | -119.806037 | 35.132310
7336 | 1999 | 8 | WOODLAND | 1 - Lightning | 995.418335 | Butte | -121.699525 | 39.859890
7337 | 1999 | 8 | BLOOMER | 1 - Lightning | 2609.674805 | Butte | -121.476863 | 39.644791
964 rows × 8 columns
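Notice that the last rows shown are from 1999: because the function compares with < stop, the upper threshold itself is excluded. The sketch below checks this on a tiny made-up table. It uses plain pandas rather than babypandas, on the assumption that .get and boolean indexing behave the same way there; the fire names are invented.

```python
import pandas as pd


def between(table, column, start, stop):
    # Keep rows where start <= value < stop in the chosen column.
    return table[(table.get(column) >= start) & (table.get(column) < stop)]


# Hypothetical miniature stand-in for the calfire table.
fires = pd.DataFrame({
    'year': [1994, 1995, 1999, 2000],
    'name': ['ALPHA', 'BRAVO', 'CHARLIE', 'DELTA'],
})

recent = between(fires, 'year', 1995, 2000)
print(recent)   # the 1995 and 1999 rows; 1994 and 2000 are left out
```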
Because this function accepts the column name, it is very reusable. We can use it to get the fires whose size is between 10,000 and 20,000 acres:
between(calfire, 'acres', 10_000, 20_000)
16 | 1910 | 7 | COYOTE CREEK | 14 - Unknown | 11226.824219 | Ventura | -119.386960 | 34.421249
251 | 1924 | 8 | UPPER DESOLATION VAL | 14 - Unknown | 10973.407227 | El Dorado | -120.506799 | 38.571552
317 | 1926 | 7 | FORT BIDWELL | 9 - Miscellaneous | 13100.943359 | Modoc | -120.082147 | 41.934478
322 | 1927 | 8 | LIEBRE | 14 - Unknown | 17957.339844 | Los Angeles | -118.611143 | 34.719225
739 | 1938 | 8 | RED CAP | 14 - Unknown | 14867.953125 | Humboldt | -123.470725 | 41.174271
...
12949 | 2018 | 7 | CRANSTON | 7 - Arson | 13229.158203 | Riverside | -116.696822 | 33.715582
12954 | 2018 | 6 | LIONS | 1 - Lightning | 13462.742188 | Madera | -119.166248 | 37.577131
13290 | 2019 | 9 | TABOOSE | 1 - Lightning | 10267.631836 | Inyo | -118.348358 | 37.021613
13305 | 2019 | 7 | TUCKER | 9 - Miscellaneous | 14184.661133 | Modoc | -121.241082 | 41.803004
13365 | 2019 | 10 | MARIA | 14 - Unknown | 10042.458984 | Ventura | -119.056671 | 34.314244
235 rows × 8 columns
Because comparison operators such as >= and < work on strings, too, we can get all of the fires whose name is between A and E:
between(calfire, 'name', 'A', 'E')
2 | 1898 | 9 | COZY DELL | 14 - Unknown | 2974.585205 | Ventura | -119.265380 | 34.482316
9 | 1910 | 8 | CRAWFORD CREEK 2 | 7 - Arson | 497.885071 | Humboldt | -123.552471 | 41.300052
10 | 1910 | 7 | BLUFF CREEK | 4 - Campfire | 298.716553 | Del Norte | -123.760361 | 41.430391
16 | 1910 | 7 | COYOTE CREEK | 14 - Unknown | 11226.824219 | Ventura | -119.386960 | 34.421249
17 | 1910 | 8 | BULL CREEK | 4 - Campfire | 56.897217 | Humboldt | -123.621988 | 41.174766
...
13446 | 2019 | 9 | ANTELOPE | 9 - Miscellaneous | 167.332794 | San Benito | -120.831821 | 36.558925
13451 | 2019 | 9 | COW | 10 - Vehicle | 15.383965 | Shasta | -122.038452 | 40.612144
13456 | 2019 | 9 | DEER | 5 - Debris | 9.367375 | Santa Cruz | -122.085541 | 37.183180
13457 | 2019 | 10 | CABRILLO | 2 - Equipment Use | 61.750446 | San Mateo | -122.358885 | 37.171839
13460 | 2019 | 10 | CROSS | 14 - Unknown | 289.151428 | Monterey | -120.726245 | 35.793698
3620 rows × 8 columns
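String comparison follows dictionary-like ordering, which has a couple of consequences worth seeing. This sketch applies the same two comparisons to a plain Python list; 'ECHO' and the lowercase 'cabrillo' are included as hypothetical edge cases.

```python
names = ['COZY DELL', 'ANTELOPE', 'ECHO', 'E', 'DEER', 'cabrillo']

# Same test that between performs on each entry of the 'name' column.
selected = [name for name in names if 'A' <= name < 'E']
print(selected)   # ['COZY DELL', 'ANTELOPE', 'DEER']
```

Names starting with E are excluded because 'ECHO' comes after 'E', and lowercase letters sort after all uppercase letters, so a lowercase name would be missed.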
.apply Series Method¶
DataFrames come equipped with many useful methods, but defining our own functions allows us to make tables even more powerful. One way to use tables with functions is to pass the table into the function as one of its inputs, as we saw in the example above. In some situations, however, we don’t want to apply the function to the entire table, but rather to each entry in one of the table’s columns. In these cases, we can use the .apply method.
For instance, suppose we have a table containing a
'year' column, such as the
calfire table we have been using, and we want to convert each year into the corresponding decade. We have already written a function that converts a single year to a decade:
decade_from_year. Recall how it works: it subtracts the remainder year % 10 from the year, so 1994 becomes 1990.
We’d like to apply this function to each entry in the
'year' column. To do so, we’ll use .apply:
calfire.get('year').apply(decade_from_year)
0        1890
1        1890
2        1890
3        1900
4        1900
         ...
13459    2010
13460    2010
13461    2010
13462    2010
13463    2010
Name: year, Length: 13464, dtype: int64
Notice the pattern here: we first use .get('year') to retrieve the column we wish to work with, and then call .apply(decade_from_year) on the column. The result is a Series with the same number of entries as the column containing the years. Each entry is the result of applying the function to the corresponding entry of the original column.
Note that we pass the function into .apply without trailing parentheses. That is, we write .apply(decade_from_year) and not .apply(decade_from_year()). The .apply method accepts the name of a function. It will then call the function once for each entry of the given Series.
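A rough mental model, which is not how the library is actually implemented, is that .apply loops over the entries and calls the function on each one. The sketch below imitates this with a plain list; the years are a small sample and the variable function_itself is a made-up name.

```python
def decade_from_year(year):
    return year - year % 10


years = [1898, 1902, 1994, 2019]

# Without parentheses we refer to the function itself, so it can be called later.
function_itself = decade_from_year
decades = [function_itself(year) for year in years]
print(decades)   # [1890, 1900, 1990, 2010]

# With parentheses, Python would try to call it right away, with no year given.
called_too_early = False
try:
    decade_from_year()
except TypeError as err:
    called_too_early = True
    print(err)
```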
In many cases we’d like to add this new Series back to the table as a new column. We can do so with .assign:
with_decade = calfire.assign(
    decade=calfire.get('year').apply(decade_from_year)
)
with_decade
0 | 1898 | 9 | LOS PADRES | 14 - Unknown | 20539.949219 | Ventura | -119.367830 | 34.446830 | 1890
1 | 1898 | 4 | MATILIJA | 14 - Unknown | 2641.123047 | Ventura | -119.299625 | 34.488614 | 1890
2 | 1898 | 9 | COZY DELL | 14 - Unknown | 2974.585205 | Ventura | -119.265380 | 34.482316 | 1890
3 | 1902 | 8 | FEROUD | 14 - Unknown | 731.481567 | Ventura | -119.320979 | 34.417515 | 1900
4 | 1903 | 10 | SAN ANTONIO | 14 - Unknown | 380.260590 | Ventura | -119.253422 | 34.430616 | 1900
...
13459 | 2019 | 9 | STAGE | 7 - Arson | 13.019149 | Monterey | -121.599207 | 36.764065 | 2010
13460 | 2019 | 10 | CROSS | 14 - Unknown | 289.151428 | Monterey | -120.726245 | 35.793698 | 2010
13461 | 2019 | 9 | FRUDDEN | 2 - Equipment Use | 11.789393 | Monterey | -120.908061 | 35.908627 | 2010
13462 | 2019 | 9 | JOLON | 11 - Powerline | 61.592369 | Monterey | -121.010025 | 35.910750 | 2010
13463 | 2019 | 10 | SADDLE RIDGE | 14 - Unknown | 8799.325195 | Los Angeles | -118.516473 | 34.321859 | 2010
13464 rows × 9 columns
The .apply method is very useful for data cleaning. Data rarely comes to us in the exact form we need or prefer. For instance, we might wish to convert a year to its decade, or remove the leading number code from a fire’s cause. A common approach to doing so is to write a function capable of converting or cleaning a single entry, then
.applying this function to the entire column.
Example: clean the cause column
The cause column contains the cause of each fire as a string, such as
'14 - Unknown'. The string contains a numeric code unique to the cause of the fire, but this is redundant since the cause is described immediately after. Let’s get rid of the number, leaving only the description.
First, we’ll write a function that accepts a cause and returns only the description:
def cause_description(cause):
    return cause.split('-')[-1].strip()
cause_description('2 - Equipment Use')
'Equipment Use'
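One thing to keep in mind with this approach: taking the last piece after splitting on '-' assumes the description itself contains no hyphen. The sketch below shows what would happen otherwise, using a made-up cause string that is not from the dataset, and a hypothetical variant that splits only at the first hyphen.

```python
def cause_description(cause):
    return cause.split('-')[-1].strip()        # last piece after every '-'


def cause_description_first_dash(cause):
    # Hypothetical variant: split once, at the first '-', and keep the rest.
    return cause.split('-', 1)[-1].strip()


print(cause_description('2 - Equipment Use'))              # Equipment Use

made_up = '99 - Non-Firefighter Training'                  # invented for illustration
print(cause_description(made_up))                          # Firefighter Training
print(cause_description_first_dash(made_up))               # Non-Firefighter Training
```

For the causes shown in this section the two versions give the same result, so the simpler function from the text is sufficient here.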
Next, we .apply the function to the 'cause' column. We’ll save it back to the table using .assign:
calfire.assign(
    cause=calfire.get('cause').apply(cause_description)
)
...
13463 | 2019 | 10 | SADDLE RIDGE | Unknown | 8799.325195 | Los Angeles | -118.516473 | 34.321859
13464 rows × 8 columns