Data visualization plays a significant role throughout machine learning, from data analysis to model building, evaluation, testing, and selection. The heatmap is a widely used visualization technique that represents data values graphically as colors. You can easily create heatmaps using Seaborn, a popular Python data visualization package. In this post, we will discuss how to create heatmaps using Seaborn.
What is a heatmap?
A heatmap is a 2D graphical representation of data in which each value of a matrix is represented by a color. It is well suited to plotting “rectangular data”, and it makes cells with extreme values, such as unusually high or low counts or averages, stand out.
The Seaborn Python package can be used to create annotated heatmaps and tweak them using Matplotlib tools to match your requirements.
If you have a dataset with multiple variables, you can perform exploratory data analysis by visualizing it with heatmaps, which let you identify general patterns in the dataset quickly.
Python Heatmap Code
In the following sections, we will create a Seaborn heatmap using a dataset from the US Department of Transportation that tracks flight delays.
When you open this CSV file in Excel, you will see a column for each airline code and a row for each month (January = 1, February = 2, March = 3, etc.).
Each entry shows the average arrival delay in minutes for a given airline and month (throughout the year 2015). Negative entries indicate that flights tended to arrive early on average. For example, the airline with the code “AA” (American Airlines) arrived roughly 5 minutes late on average in April, and the airline with the code “DL” arrived roughly 2 minutes early on average in January.
Import the required Python packages
First, import the following Python packages, which are needed to create the heatmap.
|import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the dataset
In the code below, the pandas read_csv function is used to load the dataset, with the Month column as the row index. The loaded data is then printed for inspection:
|# Path of the file
file_path = "../input/flight_delays.csv"
# Read the file, using the Month column as the row index
flights = pd.read_csv(file_path, index_col="Month")
# Print the data
print(flights)
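If you do not have the CSV file at hand, the effect of index_col can be tried on a tiny in-memory table. This is only a sketch: the numbers below are made-up placeholders rather than values from the real dataset, and the name flights_demo is introduced here purely for illustration.

```python
import io

import pandas as pd

# Made-up placeholder numbers, NOT the contents of the real flight_delays.csv.
csv_text = """Month,AA,DL
1,6.0,-2.0
2,7.5,5.0
3,-1.0,0.5
"""

# index_col="Month" turns the Month column into the row index,
# so every remaining column corresponds to one airline code.
flights_demo = pd.read_csv(io.StringIO(csv_text), index_col="Month")

print(flights_demo.shape)          # (number of months, number of airlines)
print(list(flights_demo.columns))  # the airline codes
print(flights_demo.loc[1, "DL"])   # delay for airline DL in month 1
```

Because the months form the index, a single value can be looked up by month number and airline code, which is the same layout the heatmap uses for its rows and columns.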
Create the Heatmap
Next, you can use the heatmap function of the Seaborn package (sns.heatmap) to create the heatmap. The arguments used here are as follows:
- data=flights – All the entries of flights will be used to create the heatmap.
- annot=True – Ensures that the value of each cell is displayed on the heatmap. If you leave this out, no numbers are shown in the cells.
- cmap – A matplotlib colormap object or name that maps the data values to the color space.
The following code can be used to create a heatmap that visualizes patterns in flights. Each cell of the heatmap is color-coded based on its corresponding value.
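|# Width and height of the figure (example size in inches; adjust as needed)
plt.figure(figsize=(14, 7))
# Title of the figure
plt.title("Average Arrival Delay of Airlines, by Month")
# Heatmap which shows the average arrival delay for airlines by month
sns.heatmap(data=flights, annot=True, cmap='RdYlGn')
# Label of horizontal axis (the columns are airline codes)
plt.xlabel("Airline")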
Above is the Seaborn Python heatmap for flight delays. If you take a closer look at it, you will be able to detect some patterns. For example, the cells for the months towards the end of the year (from September to November) have a similar shade for all airlines. This suggests that delays in those months follow a common pattern across airlines rather than depending on the individual airline.
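One way to double-check a visual impression like this is to average each month across airlines and look at the numbers directly. The following is a minimal sketch under stated assumptions: the DataFrame demo is filled with random placeholder values (not the real delays), and the three airline codes are just a hypothetical subset.

```python
import numpy as np
import pandas as pd

# Random placeholder data shaped like the flights table:
# one row per month, one column per airline.
rng = np.random.default_rng(0)
airlines = ["AA", "DL", "UA"]  # hypothetical subset of airline codes
demo = pd.DataFrame(
    rng.normal(loc=5, scale=4, size=(12, len(airlines))),
    index=pd.Index(range(1, 13), name="Month"),
    columns=airlines,
)

# Average delay across airlines for every month.
monthly_mean = demo.mean(axis=1)

# The three months with the lowest average delay.
best_months = monthly_mean.nsmallest(3).index.tolist()

print(monthly_mean.round(2))
print("Months with the lowest average delay:", best_months)
```

Running the same two lines (mean and nsmallest) on the real flights DataFrame would show whether the months that look similar in the heatmap also have similar averages.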
Gridlines and Squares
There are also some other arguments that are useful when designing heatmaps. With some datasets, the colors of two neighboring cells can be quite similar, making it difficult to distinguish between specific values. You can solve this issue by adding gridlines to the heatmap with the linewidths and linecolor parameters.
Likewise, you can use the square parameter to make every cell square regardless of the size of the figure. Square cells are optional, however.
In the following code, we have added a thin white line between each cell to indicate that they are separate records:
|sns.heatmap(data=flights, annot=True, cmap='RdYlGn', linewidths=1, linecolor='w', square=True)
Whether to use gridlines and squares depends on the purpose of your visualization.
You can also plot a correlation matrix as a heatmap to gain insight into a Pandas DataFrame. The cells of the heatmap represent the correlation coefficients, which measure the strength of the linear relationship between pairs of variables in the DataFrame.
The correlation matrix itself is computed with pandas, and Seaborn draws it. You can use the corr method of a Pandas DataFrame to calculate Pearson’s correlation coefficient between all pairs of numeric columns of the DataFrame.
|titanic_data = pd.read_csv('../input/titanic.csv')
|# numeric_only=True (pandas 1.5+) restricts the calculation to numeric columns
sns.heatmap(titanic_data.corr(numeric_only=True), annot=True, cmap='rocket', fmt=".2f")
Annotated heatmap of Pearson correlation coefficients between variables
You can see that the cells of the above heatmap contain the correlation coefficients.
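To see what the corr method produces before it is drawn, here is a small self-contained sketch. It assumes a synthetic DataFrame (the column names x, x_noisy, x_neg, and label are invented for this example) in place of the Titanic data, and it uses the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with known relationships between the columns.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.5, size=200),  # positively related to x
    "x_neg": -x,                                      # exact negative of x
    "label": ["a", "b"] * 100,                        # non-numeric column
})

# Pearson correlation between all pairs of numeric columns;
# the non-numeric "label" column is left out.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Draw the correlation matrix as an annotated heatmap.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="rocket", ax=ax)
plt.close(fig)
```

The diagonal is always 1 because every column correlates perfectly with itself, and the x/x_neg cell is -1 because one column is the exact negative of the other.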
Using Color Effectively
The key feature of a heatmap is the effective use of color to denote the magnitude of an underlying quantity.
Seaborn lets you draw heatmaps in a variety of colors, which you can change through the optional cmap (colormap) parameter. For instance, passing cmap='flare' draws the heatmap with the ‘flare’ color palette.
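A heatmap with the ‘flare’ palette could be drawn roughly as follows. This is a sketch, not the original example: it uses random placeholder values in a DataFrame called demo, with four hypothetical airline codes, instead of the real flight data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random placeholder delays: 12 months x 4 hypothetical airline codes.
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    rng.normal(loc=5, scale=4, size=(12, 4)).round(1),
    index=pd.Index(range(1, 13), name="Month"),
    columns=["AA", "DL", "UA", "WN"],
)

# cmap="flare" selects one of Seaborn's built-in sequential palettes.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(data=demo, annot=True, cmap="flare", ax=ax)
ax.set_title("Placeholder delays drawn with the 'flare' palette")
```

Swapping "flare" for another palette name, such as "crest" or "mako", changes the colors without touching the rest of the call.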
Seaborn offers many built-in color palettes for users to choose from. However, be careful to select the palette that best fits your purpose and data.
Specifically, sequential palettes like “crest”, “flare”, “mako”, and “rocket” are ideal for displaying numerical data as in our example. These palettes are also perceptually uniform, so the difference we perceive between two colors is proportional to the difference between the corresponding numerical values. This enables you to get a quick idea of the distribution of data values by simply glancing at the heatmap.
Your heatmap will show clear patterns if you select a proper color palette, whereas a poor palette choice will obscure them. For example, consider the same heatmap drawn with the tab10 palette.
With such a palette, it is quite difficult to figure out how the different colors relate to each other, so you won’t be able to see the patterns we observed in the previous heatmap.
The reason is that the tab10 palette uses changes in hue to distinguish between categories, not to express order or magnitude. However, tab10 is a great choice if the data in your heatmap is categorical.
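The difference is easiest to judge when both palettes are drawn next to each other. The sketch below does that for the same table of random placeholder values (the demo DataFrame and its airline codes are assumptions made for this example, not the real dataset).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random placeholder delays: 12 months x 4 hypothetical airline codes.
rng = np.random.default_rng(2)
demo = pd.DataFrame(
    rng.normal(loc=5, scale=4, size=(12, 4)),
    index=pd.Index(range(1, 13), name="Month"),
    columns=["AA", "DL", "UA", "WN"],
)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: a sequential palette, where color changes smoothly with the value.
sns.heatmap(demo, cmap="flare", ax=axes[0])
axes[0].set_title("Sequential palette (flare)")

# Right: a qualitative palette, where neighboring values can get unrelated hues.
sns.heatmap(demo, cmap="tab10", ax=axes[1])
axes[1].set_title("Qualitative palette (tab10)")

fig.tight_layout()
```

In the left panel a darker or lighter cell can be read as "more" or "less", while in the right panel the hues carry no such order.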
You can refer to the Seaborn documentation if you need more detailed information about selecting color palettes.
A diverging color palette combines two contrasting colors that meet at a neutral midpoint. If you need to draw attention to both the high and the low values of your data, you can use a diverging palette like Spectral, vlag, icefire, or coolwarm to highlight both extremes.
Earlier, we explained how to draw a correlation matrix, which is a special kind of heatmap. Because correlations range from -1 to 1 and therefore extend in two directions from zero, a diverging palette works better here than a sequential one.
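As a rough illustration of this point, the sketch below draws a correlation matrix with the built-in vlag palette. The data is synthetic (columns a to d are invented for this example), and fixing vmin=-1, vmax=1, and center=0 is one possible choice that keeps the neutral color at a correlation of zero.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic columns with positive, negative, and near-zero correlations.
rng = np.random.default_rng(7)
base = rng.normal(size=300)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=1.0, size=300),   # positively related to a
    "c": -base + rng.normal(scale=1.0, size=300),  # negatively related to a
    "d": rng.normal(size=300),                     # unrelated noise
})
corr = df.corr()

# vmin/vmax pin the color scale to the full correlation range,
# and center=0 places the neutral midpoint of the palette at zero.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="vlag",
            vmin=-1, vmax=1, center=0, ax=ax)
mesh = ax.collections[0]
print(mesh.get_clim())
```

With this setup, positive and negative correlations of the same strength get equally intense colors on opposite sides of the palette.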
Seaborn’s diverging_palette function creates colormaps with one color at each end that converge to a different color in the center.
|cmap = sns.diverging_palette(h_neg, h_pos, sep=value, l=value, as_cmap=True)
- h_neg and h_pos: The anchor hues for the negative and positive extents of the map; each ranges from 0 to 359.
- l: The lightness of the negative and positive extents of the map; it ranges from 0 to 100.
- sep: The size of the intermediate region of the map.
- as_cmap: A boolean parameter; if set to True, the function returns a matplotlib colormap object.
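The parameters above can be tried out with a short sketch like the one below. The specific values (hues 145 and 300, s=60, l=50, sep=10, n=7) are only example choices; the point is to show what the function returns with and without as_cmap=True.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
from matplotlib.colors import Colormap
import seaborn as sns

# Without as_cmap, the function returns a list-like palette of n discrete colors.
palette = sns.diverging_palette(h_neg=145, h_pos=300, s=60, l=50, sep=10, n=7)

# With as_cmap=True, it returns a continuous matplotlib colormap instead,
# which is the form you pass to the cmap argument of sns.heatmap.
cmap = sns.diverging_palette(h_neg=145, h_pos=300, s=60, l=50, sep=10, as_cmap=True)

print(len(palette))                # number of discrete colors
print(isinstance(cmap, Colormap))  # whether a colormap object was returned
```

For heatmaps of continuous values, the colormap form is usually the one you want, since every value in the data range needs its own color.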
The diverging_palette function works with HUSL colors, so you need hue, saturation, and lightness values. Visit hsluv.org to find suitable colors for your heatmap.
To illustrate, here are the colors I have selected.
correlation = data.corr()
sns.heatmap(correlation, annot=True, cbar=True, cmap = sns.diverging_palette(145, 300, s=60, as_cmap=True))
In the above heatmap, we have drawn a correlation matrix using a diverging palette so that you can easily observe the most important correlation coefficients.
Seaborn comes with a built-in function called seaborn.mpl_palette() that returns a set of discrete colors taken from a matplotlib colormap, which is useful for showing discrete values in a dataset with different colors.
The syntax is sns.mpl_palette("Set3", value), where:
- value: The number of discrete colors in the returned palette.
- Set3: The name of the color palette (other matplotlib colormap names can be used as well).
For example, with a value of 11, passing the resulting palette to seaborn’s palplot function plots its 11 discrete colors in a horizontal array.
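The example described above can be sketched as follows, assuming the Set3 palette and a value of 11 as in the text. The Agg backend is used so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import seaborn as sns

# Take 11 discrete colors from matplotlib's "Set3" colormap.
palette = sns.mpl_palette("Set3", 11)

# Draw the colors as a horizontal row of swatches.
sns.palplot(palette)

print(len(palette))  # number of colors in the palette
plt.close("all")
```

Each entry of the palette is an RGB tuple, so the same list can also be passed to other Seaborn functions that accept a palette.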
In this post, we learned about heatmaps and how to draw them using Python and the Seaborn visualization library. You can easily create heatmaps with Seaborn and tweak them to suit your needs. Refer to the Seaborn documentation for more details on how to create impressive heatmaps that can be used to analyze different markets.
The effectiveness of a heatmap depends on the colors you select to convey the information. If you select suitable colors, you will be able to observe broad patterns at a glance.
The selection of scale and color palette plays a significant role in creating heatmaps. Moreover, there are customization options like gridlines and squares that help to indicate specific aspects of the heatmap. However, the most appropriate customizations for your heatmap will depend on your data visualization requirements.