import pandas as pd
import seaborn as sns
Single Variable: Continuous
When you have a single continuous variable and want to visualise the distribution of its values in your dataset, a histogram is generally what you need. This groups the values into bins, where each bin is an interval within the range of values your variable can take. The x axis will show the interval of each bin, while the y axis shows the number of values in your dataset that fall within that interval.
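To make the binning concrete, here is a minimal sketch that sorts a handful of made-up values into three equal-width bins using NumPy's histogram function. The values and the choice of three bins are assumptions for illustration only; they are not taken from any dataset used in this post.

```python
import numpy as np

# Made-up values, purely for illustration.
values = np.array([1, 2, 2, 3, 5, 6, 8, 8, 9, 10])

# Split the range from 1 to 10 into 3 equal-width bins and count
# how many values fall inside each one.
counts, edges = np.histogram(values, bins=3)

# Each bin is the interval between two neighbouring edges.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

The edges are what you would read off the x axis of the histogram, and the counts are the bar heights on the y axis.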
Let's load in some data using seaborn's handy load_dataset() function. The flights dataset has three variables: two ordered categorical (year and month) and one continuous (passengers, the number of passengers).
flights = sns.load_dataset('flights')
A simple histogram will show the overall distribution of the passengers variable. This is easy to plot, as pandas dataframes have a built-in method for generating it.
flights.passengers.hist()
By default, pandas plots histograms using 10 bins but you could fine-tune this. Displaying more bins gives a more detailed overview of the distribution, up to a point: it all depends on how many observations you have overall and how they are distributed. You can see how using 20 bins shows more information about the distributions inside the larger 5 bins.
flights.passengers.hist(bins=5) # The blue bars
flights.passengers.hist(bins=20) # The orange bars
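One way to see that finer bins only subdivide the coarser ones is to count the same values twice and regroup. This is a rough sketch on synthetic, uniformly distributed numbers (an assumption, not the flights data), with a fixed range so that every coarse bin is made up of exactly four fine bins.

```python
import numpy as np

# Synthetic data: 200 values spread between 0 and 100.
rng = np.random.default_rng(seed=0)
values = rng.uniform(0, 100, size=200)

# Count the same values with 5 wide bins and with 20 narrow bins.
coarse, _ = np.histogram(values, bins=5, range=(0, 100))
fine, _ = np.histogram(values, bins=20, range=(0, 100))

# Every group of 4 consecutive narrow bins covers exactly one wide bin,
# so summing the groups recovers the coarse counts.
regrouped = fine.reshape(5, 4).sum(axis=1)
print(coarse)
print(regrouped)
```

The 20-bin version holds the same totals, but also shows how the values are spread inside each of the wider bins.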
So the range of passenger numbers is a little over 100 to a bit over 600, with most values towards the lower end. For a more precise overview, the describe method for a dataframe's columns will give general descriptive statistics.
flights.passengers.describe()
Name: passengers, dtype: float64
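As a small self-contained illustration of what describe reports, here it is on a made-up series of nine numbers (an assumption for the example, not the passenger data).

```python
import pandas as pd

# Made-up values, chosen so the statistics are easy to check by hand.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name="example")

# describe() returns the count, mean, standard deviation, minimum,
# the 25th, 50th and 75th percentiles, and the maximum.
stats = values.describe()
print(stats)
```

The 50% row is the median, and the 25% and 75% rows mark the edges of the box you would see in a boxplot.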
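For a visual representation of describe, a boxplot will show the minimum and maximum values (the left and right whiskers), the range of values covered by the 25th to 75th percentiles (the box) and the value of the median (the line inside the box). Note that with seaborn's default settings the whiskers stop at the most extreme values within 1.5 times the interquartile range of the box, and any values beyond that are drawn as individual points.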
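The numbers behind a boxplot can be worked out by hand. The sketch below does so for a made-up series with one extreme value, assuming the common convention (the default in matplotlib and seaborn) that whiskers reach no further than 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Made-up values with one obvious outlier at the top.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box runs from the 25th to the 75th percentile, with the median inside.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers extend to the most extreme values that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

When there are no outliers, the whiskers coincide with the minimum and maximum reported by describe.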
Single Variable: Categorical
When you have a variable which takes on named, rather than numerical, values then the most common way of representing them is with a bar chart.
Here, we'll load the titanic dataset. Each row is a passenger on the ship, while the class variable gives the class of that passenger's ticket.
titanic = sns.load_dataset('titanic')
titanic['class'].value_counts()
Name: class, dtype: int64
You can chain .plot(kind='bar') to the above value_counts() method, but I prefer to use seaborn as you can directly pass it the original data. It will then do the counting for you and allow you more control over appearance. For example, if you do not like the ordering seaborn used for the x axis, then you can set it manually as a list, e.g. order=['Third', 'Second', 'First']
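If you stay with pandas rather than seaborn, a similar reordering can be done on the counts themselves before plotting. This is a minimal sketch on a made-up series of ticket classes (an assumption, not the real titanic data), using reindex to impose the order.

```python
import pandas as pd

# Made-up ticket classes for six hypothetical passengers.
ticket_class = pd.Series(["Third", "First", "Third", "Second", "Third", "First"])

# value_counts() sorts by frequency, most common first.
counts = ticket_class.value_counts()

# reindex() rearranges the counts into whatever order we ask for.
ordered = counts.reindex(["Third", "Second", "First"])
print(ordered)
```

Chaining .plot(kind='bar') onto ordered would then draw the bars in that order.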
If you want to normalise the counts so as to see relative percentages rather than counts, then you just need to do that to the data before plotting it as a normal barplot.
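# Name the columns explicitly, as the defaults from reset_index() differ between pandas versions.
titanic_normed = titanic['class'].value_counts(normalize=True).rename_axis('class').reset_index(name='proportion')
sns.barplot(data=titanic_normed, x='class', y='proportion')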
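Here is the normalisation step on its own, as a sketch using a made-up series of ticket classes rather than the real dataset. The column names proportion and percent are my own choices for the example.

```python
import pandas as pd

# Made-up ticket classes for six hypothetical passengers.
ticket_class = pd.Series(
    ["Third", "First", "Third", "Second", "Third", "First"], name="class"
)

# normalize=True returns fractions of the total instead of raw counts.
proportions = (
    ticket_class.value_counts(normalize=True)
    .rename_axis("class")             # name the index so it becomes a 'class' column
    .reset_index(name="proportion")   # name the values column explicitly
)

# Multiply by 100 if percentages are easier to read than fractions.
proportions["percent"] = proportions["proportion"] * 100
print(proportions)
```

The resulting dataframe has one row per category, which is the shape a barplot expects.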
Plotting relationships between variables
Above, we only had a single variable. We examined it by looking at the frequency of values (the histogram) or by plotting descriptive statistics (the boxplot). But often we want to see how one variable is linked to another - as the value of one variable changes, what happens to the value of the other variable?
With continuous and ordered/unordered categorical variables, we have four possible combinations. Let's look at them in turn.
Continuous x continuous
The mpg dataset contains information about cars, measuring their weight, fuel efficiency and so on. We might expect heavier cars to have lower fuel efficiency.
When plotting continuous variables, the one you place on the x-axis should be the independent variable. This is generally some property or value we observe. The y-axis should display the dependent variable. This is a function of the values on the x-axis and is generally something we measure for each observed value on the x-axis. Here, we will place weight on the x-axis and miles per gallon on the y-axis.
Generally, the best choice of visualisation for this is a scatterplot. Each point represents the relation between a single value on the x-axis and its corresponding y value.
mpg = sns.load_dataset('mpg')
g = sns.scatterplot(data=mpg, x='weight', y='mpg')
There are several variations on this, which are made available through seaborn's jointplot. The default will add histograms on the margins, for each of the two variables.
sns.jointplot(data=mpg, x='weight', y='mpg')
By setting the kind argument to kde, you can instead plot a joint kernel density estimate, with individual density estimates on the margins.
sns.jointplot(data=mpg, x='weight', y='mpg', kind='kde')
Or you can set it to hex and plot the values as hexagons, which represent histogram-type bins. This can be very useful if you have a lot of observations in your dataset and plotting all those points is slow or messy.
sns.jointplot(data=mpg, x='weight', y='mpg', kind='hex')
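The idea of histogram-type bins in two dimensions can be sketched with NumPy. Note the assumptions: this uses rectangular bins rather than hexagons, and synthetic data rather than the mpg dataset, so it illustrates the principle and is not what seaborn does internally.

```python
import numpy as np

# Synthetic data: y tends to fall as x rises, plus some noise.
rng = np.random.default_rng(seed=1)
x = rng.normal(size=1000)
y = -0.8 * x + rng.normal(scale=0.5, size=1000)

# Lay a 10 x 10 grid over the data and count the points in each cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)        # one count per grid cell
print(int(counts.max()))   # the busiest cell
```

Colouring each cell by its count gives the same kind of picture as the hex plot: 1000 points are summarised by 100 cells.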
Continuous x unordered categorical
There are a few more options when it comes to jointly plotting continuous and categorical data. In general, the categorical data will go on the x-axis and you may need to change the order in which they are displayed.
Let's look at the relationship between fuel efficiency (continuous) and a car's country of origin (unordered categorical). Seaborn's stripplot will make a separate scatterplot for each value of the categorical variable and place it on the x axis, with its own colour. It will also stagger the points a little to help see their distribution - this can be controlled with the jitter argument.
sns.stripplot(data=mpg, x='origin', y='mpg', jitter=0.3)
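Conceptually, the staggering amounts to giving each category a position on the x axis and then nudging every point sideways by a small random amount. The sketch below shows that idea with NumPy on made-up category labels; it is an illustration of the principle under my own assumptions, not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Made-up category labels for six hypothetical cars.
categories = np.array(["usa", "japan", "europe", "usa", "usa", "japan"])

# Give each distinct category an integer position on the x axis.
labels, positions = np.unique(categories, return_inverse=True)

# Nudge every point left or right by a random amount of at most `jitter`.
jitter = 0.3
offsets = rng.uniform(-jitter, jitter, size=len(categories))
x_coords = positions + offsets

print(labels)
print(x_coords.round(2))
```

A larger jitter spreads the points of each category over a wider strip, while a value of 0 stacks them on a single vertical line.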
swarmplot does the same but arranges the points so that there is no overlapping.
sns.swarmplot(data=mpg, x='origin', y='mpg')
And if you want a boxplot for each category, there is no need to do them separately and manually place them in a figure - catplot is a great way to plot categorical x continuous data.
sns.catplot(data=mpg, x='origin', y='mpg', kind='box')
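The per-category boxplots have a tabular counterpart: grouping by the categorical variable and describing the continuous one. This is a sketch on a tiny made-up dataframe; the column names mirror the mpg dataset but the values are invented for the example.

```python
import pandas as pd

# Invented fuel-efficiency values for nine hypothetical cars.
cars = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "japan",
               "europe", "europe", "europe"],
    "mpg": [15.0, 18.0, 21.0, 30.0, 32.0, 34.0, 24.0, 26.0, 28.0],
})

# One row of descriptive statistics per category.
summary = cars.groupby("origin")["mpg"].describe()

# These five columns are the numbers each boxplot is drawn from.
print(summary[["min", "25%", "50%", "75%", "max"]])
```

Each row of the table corresponds to one box in the plot.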
Continuous x ordered categorical
Sometimes, the categorical data will have a natural order to it. The most common of these is times or dates. This can sensibly be plotted as a line, to show how the continuous variable changes over time. Generally, the categorical data must be unique - no value should appear more than once.
The gammas dataset contains fMRI measurements taken from multiple subjects. Let's look at subject 0, and see how a signal which is dependent on blood oxygen levels (BOLD signal) changed over time in various regions of interest (ROI) in the brain.
Seaborn's lineplot method has a hue argument that will separate out the three different values for ROI and plot them as their own lines.
gammas = sns.load_dataset('gammas')
subject_0_data = gammas[(gammas.subject == 0)]
sns.lineplot(data=subject_0_data, x='timepoint', y='BOLD signal', hue='ROI')
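One way to picture what separating by hue means is to reshape the data so that each ROI gets its own column, with one row per timepoint. The sketch below does this with pandas on a made-up dataframe; the column names follow the gammas dataset but the values are invented.

```python
import pandas as pd

# Invented long-format data: one row per (timepoint, ROI) pair.
long_data = pd.DataFrame({
    "timepoint": [0, 1, 2, 0, 1, 2],
    "ROI": ["IPS", "IPS", "IPS", "V1", "V1", "V1"],
    "BOLD signal": [0.1, 0.4, 0.2, 0.0, 0.3, 0.5],
})

# Check that no timepoint appears more than once within a single ROI.
has_repeats = long_data.duplicated(subset=["timepoint", "ROI"]).any()
print(has_repeats)

# Reshape so that each ROI becomes a column: one line per column.
wide = long_data.pivot(index="timepoint", columns="ROI", values="BOLD signal")
print(wide)
```

Each column of the reshaped table is one of the lines in the plot. If a timepoint did appear twice for the same ROI, pivot would refuse to reshape the data, which is the uniqueness requirement mentioned above.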
We could also focus on a particular ROI and then see how all subjects compare by setting hue to subject.
sns.lineplot(data=gammas[gammas.ROI == 'IPS'], x='timepoint', y='BOLD signal', hue='subject', legend=False)
# Remove the legend as it gets in the way with the default plot size.
Categorical x categorical
The most common non-graphical way of representing two joint categorical variables is as a contingency table. Each row of the table represents a possible value of one variable, and each column a possible value of the other. Cells are populated with the number of observations of pairs of those values.
We can create that table using pandas' crosstab function - just tell it which columns of a dataframe to use.
titanic = sns.load_dataset('titanic')
sex_class = pd.crosstab(titanic.sex, titanic['class'])
We can also normalise the values to show percentages, rather than counts.
sex_class_normed = pd.crosstab(titanic.sex, titanic['class'], normalize=True) * 100
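To see what the counts and percentages look like without loading the full dataset, here is a sketch on eight made-up passengers (the values are invented for illustration). It also shows the normalize='index' option of crosstab, which gives percentages within each row rather than of the whole table.

```python
import pandas as pd

# Eight invented passengers.
people = pd.DataFrame({
    "sex": ["male", "female", "male", "female",
            "male", "male", "female", "male"],
    "class": ["Third", "First", "Third", "Second",
              "First", "Third", "Third", "Second"],
})

# Raw counts for every combination of sex and class.
counts = pd.crosstab(people["sex"], people["class"])

# Percentages of the grand total: all cells together sum to 100.
percent_of_total = pd.crosstab(people["sex"], people["class"], normalize=True) * 100

# Percentages within each row: every row sums to 100.
percent_of_row = pd.crosstab(people["sex"], people["class"], normalize="index") * 100

print(counts)
print(percent_of_total)
print(percent_of_row.round(1))
```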
This tabular data is easy to represent visually as a heatmap. This essentially colours in the cells of the table, based on their value. It can be a great way to very quickly communicate the joint distribution of two categorical variables, especially where you want to highlight the fact that some particular combinations are very high or low.
sns.heatmap(sex_class, cmap='Blues', square=True, annot=True, fmt='g')
sns.heatmap(sex_class_normed, cmap='Blues', square=True, annot=True, fmt='.2f', cbar=False)
Questions to ask before plotting
Here are the questions to ask before you start plotting:
What is the purpose of my visualisation?
- Show the relationship between variables?
- Illustrate individual distributions of variables?
What kind of variables do I have? For each variable:
- Is it continuous?
- Or is it categorical?
Besides these variables, is there some other informative distinction I want to show? Do my variables come from...
- different groups of people/individuals/companies/locations?
- different time periods?
- different experiments?
- different models?
Have I included all the necessary information?
- Descriptive title?
- Informative caption?
- Axes have suitable labels?
- Units for axes, where appropriate?
- Axes using suitable scale?
- Do I need a legend?
- Do my colours and styling aid readability?
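As a rough sketch of how the last checklist translates into code, here is a small matplotlib figure with a title, axis labels with units, and a legend. The data, labels and group names are all invented for the example. Seaborn's axes-level functions such as scatterplot return a matplotlib Axes object, so the same set_title, set_xlabel and set_ylabel calls can be applied to their output.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented measurements for two hypothetical groups.
years = [1, 2, 3, 4]
group_a = [10, 12, 15, 19]
group_b = [8, 9, 13, 14]

fig, ax = plt.subplots()
ax.plot(years, group_a, label="Group A")
ax.plot(years, group_b, label="Group B")

# Descriptive title, and axis labels that include units.
ax.set_title("Hypothetical measurements by year")
ax.set_xlabel("Year of study")
ax.set_ylabel("Measurement (arbitrary units)")

# A legend is needed here, as there is more than one line.
ax.legend(title="Group")
```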
Cheat sheet: picking a visualisation for your data
And a quick list, linking types of data to types of visualisation:
single variable: continuous
- histogram: more visual, big picture, show distribution of ranges of values
- boxplot: more statistical and detailed
single variable: categorical
- barchart: show counts or proportions of values
continuous x continuous
- scatterplot: show relation between every x and y
- basic jointplot: as above, but with marginal histograms per variable
- kde jointplot: show distribution of joint values, with marginal density estimates per variable
- hex jointplot: as basic jointplot, but points are grouped into hexagonal bins
continuous x unordered categorical
- stripplot: multiple scatterplots arranged on x axis
- swarmplot: as above, but no overlapping points allowed
- catplot with boxplots: replace individual scatterplots with boxplots
continuous x ordered categorical
- line: shows exactly what values are seen over time
categorical x categorical
- cross-tabulate then heatmap: show relative proportions of joint variables
Look into seaborn's documentation for figure aesthetics and choosing colour palettes - these can make your visualisations look really great. The ones I did here use the default settings and could definitely be improved upon!
Think about how the plots could be improved in terms of the questions under "Have I included all the necessary information?". Seaborn makes it very easy to add titles and so on to figures.
Seaborn also makes it easy to visualise many aspects of the data at once, rather than individually as we did here. Read the documentation for jointplot and catplot to see how flexible and easy to use these methods are!
Try applying the above to real data that you have, rather than the toy datasets used here.
About the author
Alexander Robertson is a Data Science PhD student at the University of Edinburgh, where his research focuses on variation, usage and change in natural language and also emoji.