Goodbye Excel sheets, hello Pandas

If you can see how to solve a problem in an Excel spreadsheet, there is a good chance you can do the same in Python using its Pandas library. In today’s article we introduce Pandas and see how similar they are to spreadsheets.

The name “Pandas” comes from “Panel Data” and “Python Data Analysis” and is the standard way to handle data sets in Python.

Pre-requisites

You will Python, which is free to download and its pandas library. Check your installation works by starting python and running the following import statement:

$ pip install pandas
:
$ python
Python 3.10.9 (main, Mar  1 2023, 12:33:47) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>>

Excel to Pandas

As we have used Excel in our previous articles we are going to continue that theme by doing the same thing in Excel and Pandas. Open Excel and fill some cells as follows. This is going to be our dataset.

A simple dataset

Save this as a CSV (comma separated file). Different versions of Excel differ but it will be something like Save As on the File menu, then in the dialog change the file format from .xslx to CSV UTF-8 (Comma-delimited) (.csv). I called my file data.csv. Make sure you know where that csv file was saved, we are going to load it into Python next. If you open it in a text editor it should look something like this:

A CSV file

CSV files are very common for report formats and database dumps, so if you intend to process data it is very likely you will come across CSVs. Let’s load it into a Panda. Make sure you are in the same folder as the CSV, then run python:

$ cd example/
$ python
Python 3.10.9 (main, Mar  1 2023, 12:33:47) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>>
>>> panda = pandas.read_csv('data.csv')
>>> panda
   Column A Column B    Column C
0         1    Data1  07/12/2023
1         2    Data2  08/12/2023
2         3    Data3  09/12/2023
3         4    Data4  10/12/2023
>>>

As you can see, the contents of our panda variable is a spreadsheet-like structure containing the data from the file.

Built-in functions

There is a lot you can do with Pandas, but to start with we are going to look at their similarity to spreadsheets. In our article on spreadsheets we started with simple functions such as MAX(). What is the equivalent in our Panda?

>>> panda['Column A'].max()
4
>>>

If this didn’t work for you, make sure you used the correct type of brackets, [square] then (round), and that your column title is exactly the same as in the file, capitals, spaces and all.

In Excel, we chained together functions by referencing other cells as the arguments to our functions. For example IF(J6="MALE",AVERAGE(K6,L6),AVERAGE(M6,N6)) where we change our calculation based on a what is in cell J6. We can do the same for one “cell” in our Panda but they are really set-up to handle whole columns or rows of data. Just as we “copy down” formulae for every row in a spreadsheet we can apply a function to every row in a Panda. Let’s double every value in Column A:

>>> panda['Column A'] * 2
0    2
1    4
2    6
3    8
Name: Column A, dtype: int64

That doubled every value but did not do anything with the result, it just gets printed to the terminal. Let’s put the results into a new column:

>>> panda['Twice'] = panda['Column A'] * 2
>>> panda
   Column A Column B    Column C  Twice
0         1    Data1  07/12/2023      2
1         2    Data2  08/12/2023      4
2         3    Data3  09/12/2023      6
3         4    Data4  10/12/2023      8

Writing your own functions

In Excel we wrote our own functions, so let’s do the same in Python, starting with a function that just prints “Hello, world!”. In Python the amount of space at the start of a line is important, so ensure the p of print lines up with the f of def, and don’t forget the : at the end of the def.

>>> def hello(n):
...   print("Hello, world!")
...
>>>

Our function hello takes one argument, which it names n but it does not use. We now have a hello function. Let’s apply it to our Panda.

>>> panda['Column A'].apply(hello)
Hello, world!
Hello, world!
Hello, world!
Hello, world!
0    None
1    None
2    None
3    None
Name: Column A, dtype: object

What is going on here is that Hello, world! is being printed to the terminal each time the function is applied to a row in Column A of the Panda, and as there are four rows we see it four times. Nothing is changed by the function so we are just left with Column A which is also printed to the terminal. We could store the results back into the Panda instead:

>>> panda['Hello'] = panda['Column A'].apply(hello)
Hello, world!
Hello, world!
Hello, world!
Hello, world!
>>>

Did that work? If we look at the Panda:

>>> panda
   Column A Column B    Column C  Twice Hello
0         1    Data1  07/12/2023      2  None
1         2    Data2  08/12/2023      4  None
2         3    Data3  09/12/2023      6  None
3         4    Data4  10/12/2023      8  None

The Hello column was created but it is empty! Our function prints a string to the terminal but does not return a value. Let’s fix that.

>>> def hello(n):
...   return "Hello, world!"
...
>>> panda['Hello'] = panda['Column A'].apply(hello)
>>> panda
   Column A Column B    Column C  Twice          Hello
0         1    Data1  07/12/2023      2  Hello, world!
1         2    Data2  08/12/2023      4  Hello, world!
2         3    Data3  09/12/2023      6  Hello, world!
3         4    Data4  10/12/2023      8  Hello, world!

That’s better. We can now put anything we like into that function (and rename it appropriately of course).

What’s in a name?

Defining a function just so that we can apply it once is a bit long-winded. For one-off functions we can define the function “on the fly” in which case we do not have to name it because it is clear to Python where the function is. We are not referring to a previously defined function. A nameless function like this is called a “lambda” which is some obscure jargon from computer science. Let’s do that again with a lambda function. Python expects the function to take at least one argument i.e. the value from column A, as it applies it to each row, but the function does not have to do anything with it. We can denote unused arguments with an underscore instead of proper name such as x:

>>> panda['Hello'] = panda['Column A'].apply(lambda _: 'Hello, world!")
>>> panda
   Column A Column B    Column C  Twice          Hello
0         1    Data1  07/12/2023      2  Hello, world!
1         2    Data2  08/12/2023      4  Hello, world!
2         3    Data3  09/12/2023      6  Hello, world!
3         4    Data4  10/12/2023      8  Hello, world!
>>>

Save your working

Once we have done some processing the results are sitting in memory and will be lost when we exit our Python session. We usually want to persist them into a file. Just as we can read data into a Panda we can write data out too:

>>> panda.to_csv('results.csv')
>>> [Ctrl-D]
$ cat results.csv
,Column A,Column B,Column C,Twice,Hello
0,1,Data1,07/12/2023,2,"Hello, world!"
1,2,Data2,08/12/2023,4,"Hello, world!"
2,3,Data3,09/12/2023,6,"Hello, world!"
3,4,Data4,10/12/2023,8,"Hello, world!"

That is not quite what we were expecting. Each row has a row number, and there is an empty column title at the start too. We can tell Panda not to do that:

>>> panda.to_csv('results.csv', index=False)
>>> [Ctrl-D]
$ cat results.csv
Column A,Column B,Column C,Twice,Hello
1,Data1,07/12/2023,2,"Hello, world!"
2,Data2,08/12/2023,4,"Hello, world!"
3,Data3,09/12/2023,6,"Hello, world!"
4,Data4,10/12/2023,8,"Hello, world!"

That’s more like it. We are dealing here with CSV files but Pandas can also read and write Excel files, .xlsx.