Create a Sankey Diagram with Plotly in Python

A tutorial on visualising job search results with a Sankey diagram

Yu-En Hsu
3 min read · Jan 6, 2021
Photo by Cathryn Lavery on Unsplash

In this post, I first explain the inputs for creating a Sankey diagram with the Plotly library in Python, then walk through the script I used to create my job search chart. If you are looking to create a similar graph, I hope this post helps!

If you are a Redditor subscribed to r/dataisbeautiful, you have probably seen graphs like this one:

My internship search journey!

Sankey diagrams have mostly been used to visualise energy flows in the environmental sector, so it was quite interesting to see Redditors use them to plot their job search journeys.

Energy flow Sankey diagram from Eurostat (Source)

It’s always so much fun working with data about myself. In another post, I exported my Apple Health data and visualised my walking distance from 2016 to 2020. As I happened to log all my job applications, I already had the material for a Sankey diagram.

Basic

go.Sankey() in Plotly takes two main inputs: node for the terminal points and link for the flows between them. Using the chart below as an example, there are six nodes, labelled Node 0 to Node 5 to match their Python indices, and each label also shows the quantity for that node. Node 0 has 10 units, 8 of which go to Node 2 and 2 of which go to Node 3. The diagram also shows seven links (or flows).

Basic Sankey Diagram

Here is the code for the basic diagram:
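What follows is a minimal sketch that matches the description in this section, not necessarily the exact snippet behind the chart. Only three flows are spelled out in the text (0 to 2 with 8 units, 0 to 3 with 2 units, and 2 to 5 with 3 units); the other four are illustrative values I picked so that inflow equals outflow at Node 2 and Node 3. The helper name build_basic_sankey is also my own.

```python
# Six terminal points; a node's index is its position in this list.
labels = ["Node 0", "Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]

# Seven flows, read column by column: flow i goes from source[i] to
# target[i] and carries value[i] units.
# Flows 0, 1 and 6 follow the text; flows 2 to 5 are assumed values.
source = [0, 0, 1, 1, 3, 2, 2]
target = [2, 3, 2, 3, 4, 4, 5]
value = [8, 2, 2, 4, 6, 7, 3]


def build_basic_sankey():
    """Return a Plotly figure for the six-node, seven-flow example."""
    # Imported inside the function so the lists above can be inspected
    # even where Plotly is not installed.
    import plotly.graph_objects as go

    return go.Figure(
        data=[
            go.Sankey(
                node=dict(label=labels),
                link=dict(source=source, target=target, value=value),
            )
        ]
    )
```

Calling build_basic_sankey().show() should open the diagram in a browser or notebook.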

As shown, node defines the six terminal points through its label list. So far, it’s pretty self-explanatory. The link element, however, confused me at first. link is a dictionary with three keys: source, target, and value. Each key holds a list of numbers, and all three lists must have the same length, which equals the number of flows in the diagram. Position i in each list describes flow i: the index of its source node, the index of its target node, and its quantity. For instance, in the code above, all three keys have a list of seven numbers, so the diagram has seven flows.

Once I had figured out the number of flows, I looked up the node index of the source and the target for each flow, as well as its quantity. The first flow has 8 units and goes from Node 0 to Node 2; the last one has 3 units and goes from Node 2 to Node 5. That’s it!
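If looking up indices by hand feels error-prone, one option is to write each flow as a (source label, target label, quantity) row and let Python do the lookup. This is only a sketch of that idea: the names flows and index_of are my own, and it lists just the three flows spelled out in the text rather than all seven.

```python
labels = ["Node 0", "Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]

# One row per flow: (source label, target label, quantity).
# Only the flows stated in the text are listed here.
flows = [
    ("Node 0", "Node 2", 8),
    ("Node 0", "Node 3", 2),
    ("Node 2", "Node 5", 3),
]

# Map each label to its position in the labels list.
index_of = {label: i for i, label in enumerate(labels)}

# Split the rows into the three parallel lists that link expects.
link = {
    "source": [index_of[src] for src, _, _ in flows],
    "target": [index_of[dst] for _, dst, _ in flows],
    "value": [qty for _, _, qty in flows],
}
```

The resulting dictionary can be passed as the link argument of go.Sankey(), and the three lists have the same length by construction.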

Full Script

Data example

I used Notion to track all my job and internship applications, so I exported the data to CSV and imported the file into Python.

The data are from early 2020, so I don’t recall the difference between Result 1 and Result 2. Regardless, any similar data would do.

The nodes are the unique values across all columns. I created the labels list manually, but you can use the following code instead. As for the flows, I used groupby().count() to get the quantities.

labels = []
for col in data.columns:
    labels = labels + data[col].unique().tolist()
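To give an idea of the groupby().count() step, here is a small sketch on made-up rows. The Company and Type columns and every value in them are hypothetical stand-ins for the exported CSV; only the Result 1 column name comes from my data.

```python
import pandas as pd

# Made-up rows standing in for the exported CSV (hypothetical values).
data = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D", "E"],
        "Type": ["Job", "Job", "Job", "Internship", "Internship"],
        "Result 1": ["Interview", "No reply", "Interview", "Interview", "Rejected"],
    }
)

# Count the applications for every (Type, Result 1) combination.
# Each row of the result is one flow: where it starts, where it ends,
# and how many applications it carries.
flow_counts = (
    data.groupby(["Type", "Result 1"])["Company"]
    .count()
    .reset_index(name="count")
)
print(flow_counts)
```

Note that count() only counts non-missing values in the selected column, so it is worth picking a column that is always filled in.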

Putting everything together:
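The sketch below shows one way the pieces could fit together, from a table of applications to a figure. It is not my exact script: the sample rows, the Type column, and the helper names build_labels, build_links, and make_figure are all hypothetical. It also uses size() instead of count(), so that every row in a group is counted no matter which other columns are empty, and it skips missing values and repeated labels when building the node list.

```python
import pandas as pd

# Made-up rows; a missing Result 2 means the application ended at Result 1.
data = pd.DataFrame(
    {
        "Type": ["Job", "Job", "Job", "Internship", "Internship"],
        "Result 1": ["Interview", "No reply", "Interview", "Interview", "Rejected"],
        "Result 2": ["Offer", None, "Declined", "Offer", None],
    }
)


def build_labels(df):
    """Collect the unique non-missing values of every column, in order."""
    labels = []
    for col in df.columns:
        for item in df[col].dropna().unique().tolist():
            if item not in labels:
                labels.append(item)
    return labels


def build_links(df, labels):
    """Count the rows for each pair of values in neighbouring columns."""
    index_of = {label: i for i, label in enumerate(labels)}
    source, target, value = [], [], []
    cols = list(df.columns)
    for left, right in zip(cols, cols[1:]):
        # Rows with a missing value in either column are left out by groupby.
        counts = df.groupby([left, right]).size()
        for (src, dst), n in counts.items():
            source.append(index_of[src])
            target.append(index_of[dst])
            value.append(int(n))
    return dict(source=source, target=target, value=value)


def make_figure(labels, link):
    """Build the Sankey figure; Plotly is only needed for this step."""
    import plotly.graph_objects as go

    return go.Figure(data=[go.Sankey(node=dict(label=labels), link=link)])


labels = build_labels(data)
link = build_links(data, labels)
```

Calling make_figure(labels, link).show() should then render the chart. The column order matters here, because flows are only drawn between neighbouring columns.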

Plotly has more examples and detailed documentation. Have a good day!
