Cambridge Spark - Data Science Courses /
September 19 2018 /

If you are new to the world of data science, Python's Pandas libraries are some of the best tools for quick data analysis. Pandas are built on Numpy, another popular Python library.

The purpose of this tutorial is to provide 3 tips and tricks with plenty of practice examples that should be part of your tool kit as a Data Scientist.

You will use data from one of the most infamous shipwrecks, the sinking of the RMS Titanic. You've almost certainly heard stories about the sinking of the Titanic in 1912 when it hit an iceberg on its maiden voyage. Astonishingly, there were not enough lifeboats available for all the passengers. This tragic tale got even more interesting and personal when I found out that one of my friend's great-grandfather was on the Titanic. It get's worse - he was the helmsman! Two facts that most people seem to remember are. 1. The women were given preference to the lifeboats over the men and as a result, much more survived. 2. Those in first class were given access to the lifeboats over those in other classes.

In this tutorial, you will have access to a subset of the data and confirm these facts for yourself.

The Kaggle team have generously provided two data sets to perform some machine learning tasks. (https://www.kaggle.com/c/titanic/data) For the purposes of this tutorial we will only use one of them and so we will only work with a partial dataset for the passenger list on the Titanic.

Let's get started!

``````import pandas as pd
import numpy as np
``````

`passengers = pd.read_csv("titanic.csv")`

Before you start on any of the Panda tools, you will want to examine the dataframe to get an overview of your dataset. You will notice straight away that although the data set has 891 entries, you have missing values for Age, Cabin and Embarked.

`<code class="sourceCode python">passengers.info()</code>`
``````<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB``````
`<code class="sourceCode python">passengers.head(<span class="dv">5</span>)</code>`
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

# 1 Dealing with missing values

You can see that there are missing values in the data. There are a couple of ways you can deal with this. We will only show 5 example outputs to keep the tutorial short but you can run the operations on the whole dataset if you are curious.

## 1.1 Drop the missing values.

We can use Panda's `dropna()`.

`passengers.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`

This will drop any rows with missing values. Clearly this isn't a good idea in this instance because we will reduce the number of passengers in our data set from 891 to 183. Fortunately, Pandas doesn't make any of the changes to your dataframe object until you change the inplace=False flag to True.

What instead if we wanted to remove any columns with missing values? Modify the code in the line below. The original dataset has 12 columns. As Age, Cabin and Embarked all have null values, you should end up with only 9 columns.

Modify the line below to remove any columns with missing values.

`<code class="sourceCode python">tt.dropna().head(<span class="dv">5</span>)</code>`
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
11 1 3 Sandstrom, Miss. Marguerite ... female 4 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S

Here is the solution:

``````# Solution. Columns are axis 1.
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare
1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500
2 1 1 Cumings, Mrs. John Bradley... female 1 0 PC 17599 71.2833
3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250
4 1 1 Futrelle, Mrs. Jacques ... female 1 0 113803 53.1000
5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500

Another option that you have is to set a threshold. So, for example, you can specify that if 11 of a row's 12 fields has a non-null value, you won't drop that row. You are now left with 733 entries from the original 891 - a definite improvement on 183 with dropping any rows with missing values. Modify the line above to confirm this is the case. Dropping missing values isn't a good option here. Let's try another of Panda's options, filling the missing values.

Modify the line below to set a threshold of 11 non-null values for a row.

`<code class="sourceCode python">passengers.dropna().head(<span class="dv">5</span>)</code>`
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
11 1 3 Sandstrom, Miss. Marguerite ... female 4 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S

Here is the solution:

``````# Solution
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

## 1.2 Filling the missing values

There are many statistical schools of thought on what missing or null values should be replaced with. You already know from `passengers.describe()` that the average(mean) age is 29.69. You can see that the 6th entry below have missing values replaced.

``````# The mean
``````0    22.00
1    38.00
2    26.00
3    35.00
4    35.00
5    29.69
6    54.00
7     2.00
8    27.00
9    14.00
Name: Age, dtype: float64``````

The median is another good value that we could use. Modify the code above to use the median.

Complete the following line to fill in the missing values with the median age

`<code class="sourceCode python">passengers[<span class="st">"Age"</span>].fillna(value<span class="op">=</span></code>`
``````  File "<ipython-input-17-171bf02c8336>", line 1
passengers["Age"].fillna(value=
^
SyntaxError: unexpected EOF while parsing``````

Here is the solution:

`<code class="sourceCode python">passengers[<span class="st">"Age"</span>].fillna(value<span class="op">=</span>passengers[<span class="st">"Age"</span>].median()).head(<span class="dv">5</span>)</code>`
``````0    22
1    38
2    26
3    35
4    35
Name: Age, dtype: float64
``````

# 2 The apply function

This is one of the most powerful tools available in Pandas. Apply, allows you to either use Python's built-in functions or to create your own custom function and then run it across a set of your data.

Let's take a look at an example where we create own function and then "apply" that across a set of data.

The Titanic set off from Southampton, stopping off at Cherbourg and Queenstown.

I don't know about you, but I always struggle with remembering the names of French towns and ports. To help me out, I've written a function, `get_full_city_name`, that converts the one letter code of the `Embarked` column to its full name, for all the towns, French or otherwise . Using Panda's apply function, I can then run this throughout the entire column as shown below. Note that the function `apply` is not modifying the original data source but producing a new output instead.

``````def get_full_city_name(cityCode):
if (cityCode == "S"):
return "Southampton"
elif (cityCode == "C"):
return "Cherbourg"
elif (cityCode == "Q"):
return "Queenstown"
``````

`passengers["Embarked"].apply(get_full_city_name).head(10)`

``````0    Southampton
1      Cherbourg
2    Southampton
3    Southampton
4    Southampton
5     Queenstown
6    Southampton
7    Southampton
8    Southampton
9      Cherbourg
Name: Embarked, dtype: object``````

Now let's use the `apply` function, but this time using a lambda expression. Lambda expressions allow you to create functions on the fly. This means you can create a function when you need it and then apply that to the relevant data set.

What would have been the price of the tickets on the Titanic in today's money? Taking into account inflation, prices have increased by about 106 times. In the one-liner below – and that's part of the reason why lambda functions are so useful – the lambda function takes as input the Fare from the data frame, multiplies this by 106 returning today's fare.

``````# Updating the fares for all the passengers
passengers["Fare"].apply(lambda fare: 106 * fare).head(5)``````
``````0     768.5000
1    7556.0298
2     840.0500
3    5628.6000
4     853.3000
Name: Fare, dtype: float64``````

What change would you have to make to the line below, to determine the maximum fare in today's money?

How can you modify the line below to show the highest fare?

`<code class="sourceCode python">passengers[<span class="st">"Fare"</span>].<span class="bu">apply</span>(<span class="kw">lambda</span> fare: <span class="dv">106</span> <span class="op">*</span> fare).head(<span class="dv">5</span>)</code>`
``````0     768.5000
1    7556.0298
2     840.0500
3    5628.6000
4     853.3000
Name: Fare, dtype: float64``````

Here is the solution:

`<code class="sourceCode python">passengers[<span class="st">"Fare"</span>].<span class="bu">apply</span>(<span class="kw">lambda</span> fare: <span class="dv">106</span> <span class="op">*</span> fare).<span class="bu">max</span>()</code>`
`<code>54306.895199999999</code>`

Want to find out more about the `apply` function? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply

## Recap

• Dealing with missing values (`dropna()` and `fillna()`) gives you the freedom to determine how you will deal with missing data.
• `apply()` enables you to run a function (a built-in one or one you have created) across a series of data.

I hope you have had fun exploring this data set. These are the top 3 Panda tips I recommend every Data Scientist has in their tool kit. Let me know in the comments section below if you think another Panda tool should have made the list.

Jonathan Fernandes

Jonathan has an undergraduate in Computer Science and an MBA from Warwick University. Data Science combines his love for numbers, coding and statistics. He enjoys working together with people to make sense of their data, allowing them to make more informed decisions.

## You may also like

#### Deploying a machine learning model to the web

Data scientists often have to communicate results to other people. In my case, my supervisors might want to see some...

#### An Introduction to Plotting

Introduction I'm a teaching assistant for several MSc courses, covering natural language processing, natural language...

#### Getting started with regression and decision trees

Regression analysis is one of the approaches in the Machine Learning toolbox. It is widely used in many fields but its...