The first step in doing data science is to collect a data set. That is, if we want to answer a question – such as, “How much money does the average data scientist make per year?” – we don’t go out and ask only one person; we survey a lot of people and analyze the results. As such, we need ways of working with large collections of data. In this section, we’ll see two data types – lists and arrays – which enable us to work with sequential data.
In Python, the simplest way to make a collection of data is by creating a list. You can do this by surrounding a group of items with square brackets, [ ], and separating each item with a comma:
salaries = [110_000, 95_000, 100_000]
salaries
[110000, 95000, 100000]
Lists are their own data type:
type(salaries)
list
Any type of data is allowed inside a list (including other lists), and you can include variables:
x = 42
random_stuff = ['oranges', x, True, [1, 2, 3]]
random_stuff
['oranges', 42, True, [1, 2, 3]]
Python’s lists are versatile and easy to use, but they have a big problem: they are slow. As data scientists, we will be working with sequences of millions, if not billions, of entries – so speed is of the essence. Therefore, we will use another type of collection to store our sequential data: the array.
Arrays are like lists, but optimized for the types of heavy calculations done in data science. They are blazing fast, and memory-efficient.
Arrays aren’t included with Python, however. Remember that Python wasn’t originally designed specifically for data scientists. Instead, it is a general-purpose language, used by web developers, software engineers, and artists, too. So in order to give Python what it needs – a way of efficiently working with large sequences of numbers – a group of scientists independently developed an extension to Python called NumPy (short for “Numerical Python”).
Avoid the embarrassment – it’s pronounced “num-pie”.
To get access to arrays, we’ll need to import NumPy, just as we did with the math module in the previous chapter:
import numpy as np
The as np part means that we are giving numpy a new, shorter name that will be faster to type. Whenever we want to use a function in the package, we’ll write np. instead of numpy.
Let’s create an array. We do so by calling the np.array() function with a list of data:
hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])
hours_slept_array
array([8, 7, 7, 8, 5, 8, 9])
Note the square brackets! If you try to create an array without them, you’ll see an error.
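As a minimal sketch of what goes wrong, the block below calls np.array once with the brackets and once without. The error_raised flag is a name introduced here for illustration. The exact error type and message depend on the NumPy version, so the sketch catches both TypeError and ValueError.

```python
import numpy as np

# Correct: the square brackets bundle the values into a single list
hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])

# Incorrect: without brackets, each number is passed as a separate argument
error_raised = False  # illustrative flag, not part of NumPy
try:
    np.array(8, 7, 7, 8, 5, 8, 9)
except (TypeError, ValueError) as error:
    # The error type and wording vary between NumPy versions
    error_raised = True
    print(type(error).__name__, error)
```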
Arrays are their own data type:
type(hours_slept_array)
numpy.ndarray
The array we’ve created contains numbers, but arrays can also contain other types of data, like strings or bools. But in order to maximize their efficiency, a single array should only contain a single data type.
np.array(['this', 'is', 'also', 'fine'])
array(['this', 'is', 'also', 'fine'], dtype='<U4')
Remember what happened when we evaluated expressions that contained both ints and floats? The result was always a float. The same thing will happen if we try to make an array containing ints and floats:
np.array([1, 2, 3.0])
array([1., 2., 3.])
If possible, NumPy will always try to convert everything you give it to the same type. That means if you give it strings and numbers, it’ll turn everything into strings!
Why is this? Because you can always convert a number into a string (just place quotes around it!), but only a handful of strings can be reliably converted into a number. For the sake of consistency, NumPy turns it all into strings.
np.array([1, 2, '3'])
array(['1', '2', '3'], dtype='<U21')
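One way to check which type NumPy settled on is the array's dtype attribute. The sketch below, using illustrative variable names, prints the one-letter dtype.kind code for the three situations described above, since the full dtype name (for example int64 versus int32) can differ between platforms.

```python
import numpy as np

ints_only = np.array([1, 2, 3])
ints_and_floats = np.array([1, 2, 3.0])
numbers_and_strings = np.array([1, 2, '3'])

# dtype.kind is a one-letter code: 'i' = integer, 'f' = float, 'U' = text string
print(ints_only.dtype.kind)            # i
print(ints_and_floats.dtype.kind)      # f
print(numbers_and_strings.dtype.kind)  # U
```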
Sometimes it is useful to know how many elements are in an array. We can determine this with the len function:
arr = np.array([1, 2, 3])
len(arr)
3
1.3. Array Methods
NumPy comes with many additional functions and methods that perform a vast number of useful calculations on arrays. Better yet: these functions and methods are fast.
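To get a rough feel for the speed difference, the sketch below times Python's built-in sum on a list against the sum method of an array holding the same values. The variable names and the choice of 10,000 elements are assumptions made for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

# The same ten thousand integers, stored as a list and as an array
numbers_list = list(range(10_000))
numbers_array = np.array(numbers_list)

# Both approaches compute the same total
list_total = sum(numbers_list)
array_total = numbers_array.sum()

# Time 100 repetitions of each approach
list_seconds = timeit.timeit(lambda: sum(numbers_list), number=100)
array_seconds = timeit.timeit(lambda: numbers_array.sum(), number=100)

print(f"list:  {list_seconds:.5f} seconds")
print(f"array: {array_seconds:.5f} seconds")
```

On most machines the array version should finish in noticeably less time, though the size of the gap depends on the hardware and the NumPy version.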
The NumPy functions can be called just like we called the math functions. Once we’ve imported the numpy library (abbreviated as np) we can just type np. followed by a function name to access the function. For instance, you can use the np.mean function to calculate the average value of a sequence:
example_array = np.array([1, 1, 2, 3, 3])
np.mean(example_array)
There are loads of more complex functions, such as the np.diff function, which calculates the difference between each consecutive pair of elements:
np.diff(example_array)
array([0, 1, 1, 0])
Just like strings, arrays also have their own special methods that can perform calculations. A few useful ones are shown below:
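As a small illustration, the sketch below calls four methods that NumPy arrays provide (sum, mean, min, and max) on the example_array from above. These four are chosen for illustration and are not necessarily the ones the text has in mind.

```python
import numpy as np

example_array = np.array([1, 1, 2, 3, 3])

# A method is called on the array itself, using a dot after the array's name
print(example_array.sum())   # 10
print(example_array.mean())  # 2.0
print(example_array.min())   # 1
print(example_array.max())   # 3
```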
Don’t worry, you don’t need to memorize all of the different functions/methods (there are a lot!) – we’ll include references when necessary.
Every year, the programming community forum Stack Overflow surveys its users, asking them such important questions as: what is your salary? and, how many computer monitors are on your desk at home? The results are publicly available. Since many of those who respond are data scientists, we can use the data to get an idea of a typical data scientist’s salary.
salaries is a NumPy array containing the salaries of every US-based data scientist in the survey. How many were there? We can answer that with len(salaries).
What was the mean salary?
Nice. What about the median salary?
salaries.median()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-d0d1f0e8b82e> in <module>
----> 1 salaries.median()

AttributeError: 'numpy.ndarray' object has no attribute 'median'
Oops. It turns out that arrays have no method called “median”. There is, however, an np.median function, which we call as np.median(salaries).
Notice that the median is about $10,000 less than the mean. As a data scientist would point out, the mean is more “sensitive” to “outliers”, meaning that a few people who make a very large amount of money can skew the mean. Let’s see what the largest salary is:
2 million dollars! Remember, though: these salaries are self-reported.
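The effect of an outlier on the mean can be sketched with a tiny made-up data set. The example_salaries values below are hypothetical and are not taken from the survey; they simply pair four typical salaries with one very large one.

```python
import numpy as np

# Hypothetical salaries (not the survey data): four typical values and one outlier
example_salaries = np.array([90_000, 95_000, 100_000, 105_000, 2_000_000])

mean_salary = example_salaries.mean()        # pulled upward by the outlier
median_salary = np.median(example_salaries)  # the middle value, unaffected by it
largest_salary = example_salaries.max()

print(mean_salary)     # 478000.0
print(median_salary)   # 100000.0
print(largest_salary)  # 2000000
```

Here a single large salary raises the mean to almost five times the median, even though four of the five people earn close to $100,000.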
1.4. Accessing array items
An array is an ordered sequence of items that has a beginning and an end. We can retrieve an element by specifying its index. The index of the first item in an array is zero, the index of the second item is one, and so on. For example, let’s say we have an array with three elements:
names = np.array(['Xanthippe', 'Yvonne', 'Zelda'])
names
array(['Xanthippe', 'Yvonne', 'Zelda'], dtype='<U9')
To get the first element out of the array, we write:
names[0]
To get the second, we write:
names[1]
And to get the third (i.e., last) element, we write:
names[2]
Here’s a useful trick: if you use a negative number to retrieve an element, Python starts counting from the back of the array. So, for instance, to retrieve the last element we can also write:
names[-1]
The array above has only three things in it, and their indices are 0, 1, and 2. What happens if we try to access the array at an index that doesn’t exist, such as 99?
names[99]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-db3f0aeffdb7> in <module>
----> 1 names[99]

IndexError: index 99 is out of bounds for axis 0 with size 3
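One way to stay within bounds is to remember that the largest valid index is always one less than the length of the array. The sketch below, in which last_index is an illustrative name, computes that index and also shows how the error can be caught instead of stopping the program.

```python
import numpy as np

names = np.array(['Xanthippe', 'Yvonne', 'Zelda'])

# Valid indices run from 0 up to len(names) - 1
last_index = len(names) - 1
print(names[last_index])  # Zelda

# Asking for an index past the end raises an IndexError
try:
    names[99]
except IndexError as error:
    print("IndexError:", error)
```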
1.5. Element-wise operations
The power of arrays really starts to shine when math is involved. Arrays have the power to quickly perform operations over each element they contain. To begin, let’s create a simple array of numbers:
array1 = np.array([1, 2, 3])
To subtract 3 from all of these numbers, we can simply write:
array1 - 3
array([-2, -1, 0])
To multiply each of the numbers by 2, we would write:
array1 * 2
array([2, 4, 6])
And so on. In practice this means we could do something like convert an entire array of temperatures measured in Fahrenheit to Celsius by writing a single expression:
temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])
Remember that the formula for converting a measurement in Fahrenheit to Celsius is \(C = (F - 32) * (5/9)\). Therefore:
temperatures_c = (temperatures_f - 32) * (5 / 9)
temperatures_c
array([-17.5, 0. , 22. , 100. ])
In the above example, first (temperatures_f - 32) is evaluated and produces an array with 32 subtracted from every temperature. Then (5 / 9) is evaluated. Then every element in the new array is multiplied by 5/9, producing the final output array.
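The same evaluation order can be spelled out one step at a time. In the sketch below, the intermediate names shifted and scale are introduced only for illustration; the result matches the single-expression version.

```python
import numpy as np

temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])

# Step 1: subtract 32 from every temperature, producing a new array
shifted = temperatures_f - 32

# Step 2: evaluate the plain number 5 / 9
scale = 5 / 9

# Step 3: multiply every element of the new array by that number
temperatures_c = shifted * scale

print(temperatures_c)
```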
We can also do element-wise operations between pairs of data from two arrays.
For this to work, both arrays must have the same size. The arrays are then lined up next to each other, and the operation is performed between every corresponding pair of elements. This is best demonstrated with some examples:
array1 = np.array([1, 2, 3])
array2 = np.array([2, 4, 6])
array1 * array2
array([ 2, 8, 18])
array1 - array2
array([-1, -2, -3])
array1 ** array2
array([ 1, 16, 729])
Both paired element-wise operations and standalone element-wise operations can be used in the same expression, since we’re always producing another array as a result of each expression.
(array1 * 2) - array2
array([0, 0, 0])
Watch out for the new errors you might encounter! Let’s see what happens if our other array isn’t the same size.
array_short = np.array([2,4])
array1 * array_short
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-bb60a55f36b1> in <module>
      1 array_short = np.array([2,4])
----> 2 array1 * array_short

ValueError: operands could not be broadcast together with shapes (3,) (2,)
The error message is a little cryptic – what is this about “broadcasting”? Nevertheless, we can kind of understand that there is some issue with the “shape” of the two arrays not being compatible. In fact, this error is telling us that the first array has three elements but the other only has two, so the two arrays couldn’t be pushed into the same shape.
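The shapes mentioned in the error message can be inspected directly through each array's shape attribute. The sketch below compares the two shapes before attempting the multiplication; the product variable and the printed message are illustrative choices, not part of NumPy.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array_short = np.array([2, 4])

# shape is a tuple giving the number of elements along each axis
print(array1.shape)       # (3,)
print(array_short.shape)  # (2,)

# Only multiply pairwise when the shapes match
if array1.shape == array_short.shape:
    product = array1 * array_short
else:
    product = None
    print("Shapes differ, so the arrays cannot be multiplied pairwise.")
```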
Often it’s useful to create an array of consecutive numbers, such as:
np.array([0, 1, 2, 3, 4, 5, 6, 7])
array([0, 1, 2, 3, 4, 5, 6, 7])
Rather than write this array by hand, we can use the np.arange function to do it for us:
np.arange(8)
array([0, 1, 2, 3, 4, 5, 6, 7])
Notice that, just like indices, ranges start at zero by default and exclude the endpoint. So calling np.arange(12), for instance, will create an array with twelve elements whose first element is 0 and whose last element is 11.
While we saw an example of the np.arange function being called with one argument, it can be called with one, two, or three arguments:
np.arange(endpoint): Consecutive integers from 0 to endpoint (exclusive).
np.arange(start, endpoint): Consecutive numbers from start to endpoint (exclusive), increasing by 1 each step.
np.arange(start, endpoint, stepsize): Consecutive numbers from start to endpoint (exclusive), changing by stepsize each step.
Some examples might make this clearer:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([5.5, 6.5, 7.5, 8.5, 9.5])
np.arange(0, 1, 0.2)
array([0. , 0.2, 0.4, 0.6, 0.8])
np.arange(-1, -4, -1)
array([-1, -2, -3])
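The three ways of calling np.arange can also be compared side by side. The sketch below uses small illustrative values and variable names of its own, and it confirms that np.arange(12) holds twelve elements ending at 11.

```python
import numpy as np

# One argument: start at 0, stop before the endpoint
one_argument = np.arange(4)             # 0, 1, 2, 3

# Two arguments: start at the first, stop before the second
two_arguments = np.arange(2, 6)         # 2, 3, 4, 5

# Three arguments: the third is the step size
three_arguments = np.arange(0, 10, 3)   # 0, 3, 6, 9

# The endpoint itself is never included
twelve = np.arange(12)
print(len(twelve), twelve[0], twelve[-1])  # 12 0 11
```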
The result of np.arange is an array like any other, so we can write things like:
np.arange(5) + 3
array([3, 4, 5, 6, 7])
How would you use np.arange to create the array containing the first 6 powers of 2: 1, 2, 4, 8, 16, and 32?
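One possible answer, offered as a sketch rather than the only solution, relies on element-wise operations: raise 2 to the power of each element of np.arange(6).

```python
import numpy as np

# np.arange(6) holds the exponents 0, 1, 2, 3, 4, 5
exponents = np.arange(6)

# Raising 2 to each exponent is an element-wise operation
powers_of_two = 2 ** exponents

print(powers_of_two)  # [ 1  2  4  8 16 32]
```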
To get multiple pieces of data in one place, we create a collection. If the collection is ordered then it is a sequence.
Each item in a sequence has an index – its position, starting at zero.
Lists are the most basic sequence, and are created by surrounding a group of items with square brackets and separating each item by commas:
[item, item, ...]
Arrays are a sequence type from the NumPy library, and are created by passing a list into the np.array function:
np.array([item, item, ...])
NumPy offers lots of additional functions that can be called on sequences. These can be accessed using np. followed by the function name.
An item can be selected from an array by using brackets with the index of the item:
array[index]
Arrays support element-wise operations, such as adding or multiplying all elements by a single number.
Arrays of the same length support paired element-wise operations between the two arrays, such as adding or multiplying each element in one array with each element in the same position of another array.
An array of numbers with constant spacing can be easily constructed using np.arange.
A range will always exclude the endpoint – so np.arange(3) contains 0 1 2.