Data Processing with Pandas DataFrame

Intro to Pandas DataFrame for Data Analysis

The purpose of this tutorial is to teach you how to process data with Pandas DataFrame.

At the end of this tutorial, you will be able to:

  • load a dataset,
  • explore data and rename columns,
  • check and select columns,
  • change columns’ names,
  • describe data,
  • identify missing values,
  • iterate over rows and columns,
  • group data items,
  • concatenate dataframes.

 

Table of Contents:

  1. Resources
  2. Loading Data
  3. Exploring Data and Renaming Columns
  4. Checking and Selecting Columns
  5. Changing the Columns’ Names
  6. Describing Data
  7. Missing Values
  8. Iterating Over Rows and Columns
  9. Grouping
  10. Concatenating

Resources

For this tutorial, we will need Python and the NumPy, Pandas, and Matplotlib libraries. Check which versions you have installed before you start, since a few Pandas functions have changed between releases.
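
The version list from the original post is not reproduced here. As a minimal sketch, the snippet below prints the versions installed on your own machine. The `versions` dictionary is only an illustrative name.

```python
import sys

import numpy as np
import pandas as pd

# Collect the version string of each tool used in this tutorial.
versions = {
    "Python": sys.version.split()[0],
    "NumPy": np.__version__,
    "Pandas": pd.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

Matplotlib exposes its version in the same way, through `matplotlib.__version__`.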

 

Loading Data

For this tutorial, we will be working with the Titanic dataset. You can download it from https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html.

Download the titanic.csv file to your computer, then read the data using the following piece of code:

import pandas as pd

df = pd.read_csv("your_file_location/titanic.csv")
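
If you want to try the loading step before downloading the file, here is a small sketch that reads CSV text from memory instead of from disk. The three rows are made up for illustration, and the column names are assumed to match those in titanic.csv.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for titanic.csv.
# Column names are assumed to match the downloaded file.
csv_text = """Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Passenger A,male,22,1,0,7.25
1,1,Passenger B,female,38,1,0,71.2833
1,3,Passenger C,female,26,0,0,7.925
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```

With the real file, you would pass the path to titanic.csv instead of the `io.StringIO` object.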

 

Exploring Data and Renaming Columns

First of all, let's look at the first rows of the dataset with df.head() to see what it looks like.

Note:
By passing a number inside the parentheses, such as df.head(3), you specify how many rows are shown.
If you leave the parentheses empty, the first five rows are displayed by default.
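
The behaviour described in the note can be sketched on a small, made-up DataFrame (the names and ages below are hypothetical, not taken from the Titanic file):

```python
import pandas as pd

# Hypothetical sample data; the real dataset has more rows and columns.
df = pd.DataFrame({
    "Name": [f"Passenger {i}" for i in range(1, 9)],
    "Age": [22, 38, 26, 35, 35, 27, 54, 2],
})

first_three = df.head(3)   # the first three rows
default_head = df.head()   # the first five rows by default
print(first_three)
```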

 

Checking and Selecting Columns

Next, let's check which columns we have.
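
One way to do this is through the `columns` attribute. The sketch below uses a one-row, made-up DataFrame whose column names are assumed to match those in titanic.csv.

```python
import pandas as pd

# Hypothetical one-row sample; column names are assumed to match titanic.csv.
df = pd.DataFrame({
    "Survived": [0],
    "Pclass": [3],
    "Name": ["Passenger A"],
    "Sex": ["male"],
    "Age": [22.0],
    "Siblings/Spouses Aboard": [1],
    "Parents/Children Aboard": [0],
    "Fare": [7.25],
})

# df.columns is an Index object; list() turns it into a plain list of names.
column_names = list(df.columns)
print(column_names)
```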

Then, we can specify which columns to use by selecting them, passing a list of column names inside square brackets.
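
A possible selection looks like this. Which columns the original example kept is not known, so the three chosen here, and the name `df_selected`, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
})

# A list of names inside the brackets returns a DataFrame with only those columns.
# copy() makes the selection independent of the original DataFrame.
df_selected = df[["Survived", "Pclass", "Age"]].copy()
print(df_selected)
```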

Then, we can print the last five rows with tail() and the data types with dtypes to see what the new dataframe looks like.
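
Here is a short sketch of both calls on a made-up seven-row DataFrame (the name `df_selected` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data standing in for the selected columns.
df_selected = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 54.0],
})

last_rows = df_selected.tail()      # the last five rows by default
column_types = df_selected.dtypes   # one data type per column
print(last_rows)
print(column_types)
```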

 

Changing the Columns’ Names

Let’s change the name of the columns. We will be working on our original data with eight columns.

We first create a dictionary that maps each old column name to its new name. Then we change the names of the columns by passing that dictionary to the columns parameter of rename().
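
The two steps can be sketched as follows. The new names are the ones used later in this tutorial; the old names are assumed to be the column names in titanic.csv, and `name_map` is just an illustrative variable name.

```python
import pandas as pd

# Assumed original column names, and the new names used later in this tutorial.
old_names = ["Survived", "Pclass", "Name", "Sex", "Age",
             "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]
new_names = ["SURVIVED", "PCLASS", "NAME", "SEX", "AGE", "SIBSA", "PARCA", "FARE"]

# Hypothetical one-row sample with the old column names.
df = pd.DataFrame([[0, 3, "Passenger A", "male", 22.0, 1, 0, 7.25]],
                  columns=old_names)

# Step 1: build the dictionary {old name: new name}.
name_map = dict(zip(old_names, new_names))

# Step 2: rename() returns a new DataFrame, so assign the result back to df.
df = df.rename(columns=name_map)
print(list(df.columns))
```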

 

Describing Data

Let’s check some basic statistics to understand our data better.

The describe() function gives us descriptive statistics for the numeric columns: the count, mean, standard deviation, minimum, maximum, and quantile values. NaN values are ignored by default.
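
To see how NaN values are left out of the statistics, here is a small sketch with made-up ages and fares, one of which is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing age.
df = pd.DataFrame({
    "AGE": [22.0, 38.0, np.nan, 35.0],
    "FARE": [7.25, 71.28, 7.92, 53.10],
})

summary = df.describe()
print(summary)
```

The count for AGE is 3 rather than 4, and its mean is computed over the three known ages only.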

 

Missing Values

Pandas treats both None and NaN as missing or null values. Several functions are available to detect missing values in a Pandas DataFrame, such as:
isnull()
notnull()

Note:
The df.isnull() function returns a dataframe of the same shape in which every value is either True or False. True marks a null value.
df.notnull() does the opposite: True marks a value that is present.
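
A minimal sketch of both functions, using a made-up DataFrame in which the second row is entirely missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None and NaN are both treated as missing.
df = pd.DataFrame({
    "NAME": ["Passenger A", None, "Passenger C"],
    "AGE": [22.0, np.nan, 26.0],
})

null_mask = df.isnull()        # True where a value is missing
not_null_mask = df.notnull()   # True where a value is present
print(null_mask)
```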

Chaining any() onto isnull() gives a summary per column: the result is True for every column that contains at least one missing value.
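
For instance, on a made-up DataFrame where AGE and FARE each have one gap:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, np.nan, 26.0],
    "FARE": [7.25, 71.28, np.nan],
})

# any() works column by column by default (axis=0).
has_missing = df.isnull().any()
print(has_missing)
```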

Let's summarise the values along axis=1 instead, which tells us for each row whether it contains a missing value.
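
The same made-up data as a row-wise sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, np.nan, 26.0],
    "FARE": [7.25, 71.28, np.nan],
})

# axis=1 checks across the columns of each row.
rows_with_missing = df.isnull().any(axis=1)
print(rows_with_missing)
```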

But what if we want to display the actual rows that contain null values, instead of a dataframe of True and False cells? To do that, we can use the row-wise result of isnull().any(axis=1) as a filter on df.
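
The original code for this step is not available, so the following is one common way to write it, shown on made-up data. The name `null_rows` is only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, np.nan, 26.0],
    "FARE": [7.25, 71.28, np.nan],
})

# A boolean Series inside the brackets keeps only the rows where it is True.
null_rows = df[df.isnull().any(axis=1)]
print(null_rows)
```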

 

Iterating Over Rows and Columns

Let's start with iterating over rows and applying a user-defined function. To iterate through rows, we use the iterrows() function, which yields each row as a pair of index and row data.
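
As a sketch, the loop below calls a small helper on every row. The helper `age_group` and its threshold of 18 are hypothetical, chosen only to have something to apply.

```python
import pandas as pd


def age_group(age):
    """Hypothetical helper: label a passenger as a child or an adult."""
    return "child" if age < 18 else "adult"


# Hypothetical sample data.
df = pd.DataFrame({
    "NAME": ["Passenger A", "Passenger B", "Passenger C"],
    "AGE": [22.0, 4.0, 38.0],
})

labels = []
# iterrows() yields (index, row) pairs, where row is a Series.
for index, row in df.iterrows():
    labels.append(age_group(row["AGE"]))
    print(index, row["NAME"], labels[-1])
```

On large dataframes, row-by-row loops are slow, so column-wise operations are usually preferred where they are possible.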

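To iterate through columns, we use the items() function, which yields each column as a pair of column name and column data. In Pandas versions before 2.0 this function was also available as iteritems(), which has since been removed.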
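
A minimal sketch on made-up data, collecting the maximum of each column (the dictionary `column_max` is only an illustrative name):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "AGE": [22.0, 4.0],
    "FARE": [7.25, 21.0],
})

column_max = {}
# items() yields (column name, column) pairs, where column is a Series.
for column_name, column in df.items():
    column_max[column_name] = column.max()

print(column_max)
```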

 

Grouping

The Pandas groupby() function splits the data into groups based on some criteria. In other words, each row is assigned to a group according to a label, such as the value in one of its columns.

Let’s group our data according to PCLASS.
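
Which aggregation the original example used is not known, so this sketch assumes the mean of two numeric columns. The data is made up.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "SURVIVED": [1, 1, 0, 0, 1, 0],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Split the rows by PCLASS, then average the chosen numeric columns per group.
grouped = df.groupby("PCLASS")[["SURVIVED", "FARE"]].mean()
print(grouped)
```

Note that PCLASS is now the index of the result rather than a regular column.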

After grouping, PCLASS becomes the index of the result. To restore PCLASS as a regular column, use reset_index().
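
A short sketch of the difference, again assuming a mean aggregation over made-up data (the name `flat` is only illustrative):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "SURVIVED": [1, 1, 0, 0, 1, 0],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

grouped = df.groupby("PCLASS")[["SURVIVED", "FARE"]].mean()

# reset_index() moves PCLASS from the index back into the columns.
flat = grouped.reset_index()
print(flat)
```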

We can plot the returned dataframe.
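
The chart type used in the original is not known. As one possibility, the sketch below draws a bar chart of the mean fare per class from made-up data. It assumes Matplotlib is installed, and `class_means` is only an illustrative name.

```python
import matplotlib

# Non-interactive backend so the script also runs without a display.
# Remove this line when working in a notebook.
matplotlib.use("Agg")

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "PCLASS": [1, 1, 2, 3, 3, 3],
    "FARE": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

class_means = df.groupby("PCLASS")[["FARE"]].mean().reset_index()

# DataFrame.plot() returns a Matplotlib Axes object.
ax = class_means.plot(x="PCLASS", y="FARE", kind="bar", legend=False)
ax.set_ylabel("Mean fare")
```

In a script, you can write the chart to a file with `ax.figure.savefig(...)`.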

 

Concatenating

Pandas provides several functions for easily combining DataFrames. One of these functions is concat().

There are eight columns in our dataframe, namely SURVIVED, PCLASS, NAME, SEX, AGE, SIBSA, PARCA, and FARE. Let's create three different dataframes from our dataframe (df), then combine them with the concat() function.

Now, we have three different dataframes.
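
How the original example divided the columns is not known. The sketch below assumes one possible split of a made-up two-row DataFrame and then puts the pieces back together. The names `df1`, `df2`, `df3`, `combined`, and `stacked` are illustrative.

```python
import pandas as pd

# Hypothetical two-row sample with the renamed columns.
df = pd.DataFrame({
    "SURVIVED": [0, 1],
    "PCLASS": [3, 1],
    "NAME": ["Passenger A", "Passenger B"],
    "SEX": ["male", "female"],
    "AGE": [22.0, 38.0],
    "SIBSA": [1, 1],
    "PARCA": [0, 0],
    "FARE": [7.25, 71.28],
})

# One possible way to split the eight columns into three DataFrames.
df1 = df[["SURVIVED", "PCLASS", "NAME"]]
df2 = df[["SEX", "AGE"]]
df3 = df[["SIBSA", "PARCA", "FARE"]]

# axis=1 places the DataFrames side by side, matching rows by index.
combined = pd.concat([df1, df2, df3], axis=1)

# The default axis=0 stacks DataFrames on top of each other instead.
stacked = pd.concat([df.iloc[:1], df.iloc[1:]], ignore_index=True)
print(combined)
```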

Older code also combines DataFrames with the append() instance method, which concatenates along axis=0. This method was deprecated in Pandas 1.4 and removed in Pandas 2.0, so use concat() instead.


Summary

Congratulations, you have reached the end of the Data Processing with Pandas DataFrame tutorial!

 

AUTHOR: DILEK CELIK

A PhD candidate in Computer Science and Information Systems at Birkbeck College, University of London. IBM, Stanford University and MIT certified professional in Data Science and Machine Learning with advanced Java, Python, R, Data Science and Machine Learning expertise and experiences. Teaching Assistant in Machine Learning, R, Python, and Java Modules of University College London and Birkbeck College, University of London. 
