Tutorial aims:
- What is NumPy?
- Basic manipulation NumPy arrays
- Masked arrays
- Reading and writing data
- Cautions when using NumPy arrays
1. What is NumPy?
So what is NumPy? According to the official website, NumPy is the fundamental package for scientific computing with Python. One trade-off of using Python is its computing speed. On the other hand, C is known for its high speed. Hence, the developers came to the conclusion of writing a package of numerical functions which is written in C, but which you can run from Python. So, without having to learn C, you can use its power in Python.
The biggest advantage of NumPy is its ability to handle numerical arrays. For example, if you have a list of values and you want to square each of them, the code in base Python will look like:
a = [1, 2, 3, 4, 5]
b = []
for i in a:
b.append(a**2)
and you will get [1, 4, 9, 16, 25]
for b
. Now, if you want to do the same with a 2-dimensional array, the base Python to do this is:
a = [[1, 2], [3, 4]]
b = [[],[]]
for i in range(len(a)):
for j in range(len(a[i])):
b[i].append(a[i][j]**2)
This would give you b
equal to [[1, 4], [9, 16]]
. To do the same with a 3D array you would need 3 nested loops and to do it in 4D would require 4 nested loops. However, with NumPy you can take the square of an array of any dimensions using the same line of code and no loops:
import numpy as np
b = np.array(a)**2
Using numpy is much faster than the base python version! It is faster to run, saving you on computing time, and faster to write, saving you time writing your code. All of this allows you to write and run code much faster, and therefore do more science in less time. Not only that, if your friend has a look at your code, they will read the code and understand you want a squared value of the array in an instant, without having to decipher what the for loop is trying to do.
NumPy serves as the basis of most scientific packages in Python, including pandas, matplotlib, scipy, etc. Hence, it would be a good idea to explore the basics of data handling in Python with NumPy. This tutorial does not come with any pre-written files, but is a follow-along tutorial. So better start typing on your IDE or IPython.
2. Basic manipulation numerical arrays
Importing the package
So let us get started. If you have the NumPy package installed, you should import it.
import numpy as np
This is a standard import statement, which uses the syntax import package as alias
. This allows us to call NumPy with np
, instead of having to keep typing the whole word numpy
each time we want to use one of the NumPy functions. The same syntax can be used with other modules such as import pandas as pd
. It is very common to use the alias np
for NumPy, but you can choose whatever alias you want.
Creating NumPy arrays
Creating arrays in NumPy is very easy. Below we create 3 different arrays. Try running these lines of code in Python.
a = np.full((5, 6), 10.0)
b = np.ones((2, 2, 2))
c = np.zeros(3)
for i in [a, b, c]:
print(i)
These are simple ways create arrays filled with different values. We created the first array, a
, which is 2D, to have 5 rows and 6 columns, where every element is 10.0. The second array b
is a 3D array of size 2x2x2, where every element is 1.0. The last array, c
, is a 1D array of size 3, where every element is 0.
You can create 2D, 3D or any-D arrays, by creating a 1D array, and reshaping it. Try entering the code below in Python.
a = np.arange(9.0).reshape(3,3)
print(a.shape)
a
will be a 2D array of size 3x3. The rows will be [0., 1., 2.], [3., 4., 5.], [6., 7., 8.]
. The print statement should return a tuple (3, 3)
.
Accessing and slicing data
Accessing NumPy array is straight-forward.
a = np.arange(30.).reshape(10, 3)
# the next line will print the first row of the array
print(a[0])
# the next line will print the last row of the array
print(a[-1])
# which should be same as:
print(a[2])
# the next line will print the value in the third row and first column
print(a[2,0])
The first print statement above should return the first row of the array, which should be [0., 1., 2.]
. In Python, counting starts from 0, so a[1]
will give you the second row instead of the first.
If you are from another language such as IDL, Matlab, Fortran, etc., you would have expected the result of the above code to slice columns of the array. The difference comes in the fact that NumPy uses C style arrays, where the most rapidly changing index comes last. In this case, rows are the least rapidly changing index, hence the slice is made on the row.
For those unfamiliar with the term, there are 2 types of arrays in computing: C and Fortran style arrays. This is basically how each element of an array are linearly allocated in memory. For a 2D array, the former will store the array row by row in a long line, while the latter stores the data column by column. When accessing the element on the ith row and jth column in a 2D array x
, you would use x[i, j]
in C style arrays and x[j, i]
in Fortran style arrays.
Going back to NumPy, you can select range of rows using:
a = np.arange(30.)
# selects the elements in positions from 11 to 20
print(a[10:20])
# selects the elements in positions from start to 10
print(a[:10])
# which should be same as:
print(a[0:10])
# selects the elements in positions from 20 to last
print(a[20:])
# which should be same as:
print(a[20:30])
Now that we explored how to slice rows, let us slice columns.
a = np.arange(30.).reshape(10,3)
# select second column
print(a[:, 1])
# select columns 1 (second) and 2 (third)
print(a[:, 1:3])
:
on its own stands for select all in this dimension. So, you are in effect selecting all rows, then selecting the first values of each row. If you have been following the tutorial so far, you should now know that if you want to access values in a 3D array, you would use x[i, j, k]
. Now consider the following code snippet.
# create 3d array with dimensions (time, latitude, longitude)
a = np.zeros((3, 3, 3))
# create 4d array with dimensions (time, height, latitude, longitude)
b = np.zeros((3, 3, 3, 3))
# add 1 to the first columns
a[:, :, 0] += 1.
b[:, :, :, 0] += 1.
When with atmospheric data, you often find you are dealing with both 3D and 4D variables. If you would like to modify all points at certain longitudes (columns), you could write them explicitly as above. Or you could write a loop like below where you execute different commands depending on the dimension of the array:
# create 3d array with dimensions (time, latitude, longitude)
a = np.zeros((3, 3, 3))
# create 4d array with dimensions (time, height, latitude, longitude)
b = np.zeros((3, 3, 3, 3))
# add 1 to the first columns
for i in [a, b]:
dimensions = len(i.shape)
if dimensions == 3:
i[:, :, 0] += 1.
elif dimensions == 4:
i[:, :, :, 0] += 1.
While there is nothing wrong with this code, and it will do what we want, there is a better method:
# create 3d array with dimensions (time, latitude, longitude)
a = np.zeros((3, 3, 3))
# create 4d array with dimensions (time, height, latitude, longitude)
b = np.zeros((3, 3, 3, 3))
# add 1 to the first columns
for i in [a, b]:
i[..., 0] += 1.
...
is called the ellipsis, and it is used to select all unspecified dimensions.
NumPy has more slicing options, such as striding (e.g. select every other rows) and select multiple points at a time using fancy indexing. Or you can pass a Boolean array (array containing True
and False
) to select values and create 1D array with them.
# get every other row
a = np.arange(27.).reshape(9, 3)
print(a[::2])
# get diagonal values in the 5x5 array using fancy indexing
b = np.arange(25).reshape(5, 5)
print(b[[0, 1, 2, 3, 4],[0, 1, 2, 3, 4]])
# use Boolean array to create 1d array of selection
c = np.array([[0., 1., 2.], [2., 3., 4.]])
d = np.array([[True, False, False], [False, True, True]])
print(c[d])
# create and use the Boolean array to selectively change data
e = np.arange(9.).reshape(3, 3)
f = e>5 # select values greater than 5.0
e[f] = 10. # make all values greater than 5.0 to 10.0
print(e)
# or you can simply do
e = np.arange(9.).reshape(3, 3)
e[e>5] = 10.0
3. Masked Arrays
Suppose you have a time series with missing data, and want to perform the row sum.
a = np.array([[np.nan, 3., 1.], [2., 8., 5.]])
print(np.sum(a, axis=1))
What is the result? The sum along the second row comes out as 15.0, but the sum along the first would give you nan
, which stands for “not a number”. If we want to do the sum along rows but ignore nan
values in the data we can used masked arrays! Try the code below.
a = np.array([[np.nan, 3., 1.], [2., 8., 5.]])
print(np.sum(np.ma.masked_invalid(a), axis=1))
Now we get 4.0 and 15.0. Another common reason to mask the data is to get rid of some values not fit for purpose. For example, if you are studying the damages caused by earthquakes in a region using historic data, and decide you want the sum of losses of the events with magnitude greater or equal to 5.0. You can simply do:
magnitude = np.array([2., 5., 6., 1.])
damage = np.array([1000., 100000., 110000, 10.])
print(np.sum(np.ma.masked_where(magnitude<5.0, damage)))
This should give you 210,000. There are numerous ways to mask your data for different purposes. All the available methods are listed in the numpy.ma documentation on the web.
4. Reading and writing data
Using standard .csv files, there are number of methods that let you read data file. While the shortest way is to use np.genfromtxt()
function, my personal favourite is to use Pandas to read it and then convert it to pure NumPy arrays.
import numpy as np
import pandas as pd
# read file as pandas.DataFrame then get values
data = pd.read_csv('your_csv_file').values
When writing the data to file, I use np.savetxt
with delimeter=','
option, or pandas.DataFrame.to_csv
.
# save csv file using numpy
np.savetxt('save_file.csv', your_data, delimeter=',', comments='')
The choice of tool is entirely up to you. The biggest reason why I tend to read csv data with Pandas is because the np.genfromtxt()
often messes up the string/integer/float format of the data, and setting them up manually can be a bit messy.
If you just want to store data, and it does not matter whether it is human-readable or not, you can choose to use the NumPy binary format.
# save one data to npy file
np.save('save_file1.npy' ,data)
# save collection of data to single npz file
dataset = {'temperature': data1,
'humidity': data2}
np.savez('save_file2.npz', **dataset)
# reopen files
open_data = np.load('save_file1.npy')
open_dataset = np.load('save_file2.npz')
# access data in dataset
temperature = open_dataset['temperature']
There are also a number of packages available to allow you to open files of different formats as NumPy arrays (e.g. netCDF4
for netCDF files).
5. Cautions when using NumPy
Some of you (with some experience in Python) might have felt something is not quite right. The culprit might be the fact that we have been able change the values of the original arrays within loops, which is not the default behaviour of Python!
Consider the following code:
c = 1.
d = c
# add 1 to d 5 times
for i in range(5):
d += 1. # d = d + 1
print(c, d)
As expected, the print statement will return (1., 6.)
.
Now consider the code below.
c = np.arange(3.) # array containing [0., 1., 2.]
d = c
# add 1 to d 5 times
for i in range(5):
d += 1. # d = d + 1
print(c, d)
You must be expecting the answer to be [0., 1., 2.] [5., 6., 7.]
. But no, both of them comes out as [5., 6., 7.]
. You should be asking why c
is changed when you specified it is d
that changes. Well, the answer is that the c = np.arange(3.)
statement creates an array in memory and assigns c
with its id to access it. So when you did d = c
, you essentially copied the id, not the actual array in memory. The code should look like the following to do what you “expect” it to do.
c = np.arange(3.) # array containing [0., 1., 2.]
d = c.copy()
# add 1 to d 5 times
for i in range(5):
d += 1. # d = d + 1
print(c, d)
ndarray.copy()
will copy the array itself and produce the result you want.
Summary
As a scientific computing tool, Python is a powerful tool, with NumPy at its heart. This tutorial hopefully would have helped people starting in their quest of using numerical analysis with Python.
Tutorial outcomes:
- You know what NumPy is used for.
- You manipulate and explore NumPy arrays using slicing
- You can create simple masks on data to ignore some of the entries
- You can read data to and write data in NumPy array format
- You know that variables copied from NumPy objects might be different from other Python objects