How to Turn Your Jupyter Notebook into an Interactive Data Science Report
Introduction
Jupyter notebooks are great for coding, exploring data, and adding context to what you are doing, but recently I ran into a challenge that needed a solution. What if you want to present findings produced with Python in an engaging way that leaves out the code, say for a report that will be read by a wide variety of people, some of whom can’t follow the code?
Previously I had solved this issue by hiding the code blocks and exporting notebooks in a variety of formats, all of which have drawbacks. PDF is mostly static and looks plain. HTML can be interactive, but doesn’t look all that interesting by default; you can restyle an HTML file to look better, but it takes time. LaTeX documents look better than PDFs, but setting up LaTeX is a headache with little payoff. Other tools like Tableau or Plotly seem more focused on individual plots, and I don’t find them as simple for building a report.
So I was pleasantly surprised when I came across Datapane. Datapane sells itself as “The reporting front-end for the modern data stack” and, having used it for a few weeks now, I can say it is my current favorite for building reports using Python or SQL. Throughout this post, I hope to introduce many (but not all) of the features Datapane offers that help to create a data science report that is visually appealing and conveys information effectively. To demonstrate the capabilities of Datapane, I’ll create an exploratory analysis of a heart disease dataset from Kaggle and build machine learning models to predict heart disease in patients.
To start, let’s cover some of the basics of Datapane. Datapane is a Python library for quickly creating reports, which you can either export as an HTML file or host on Datapane for free. (Pricing varies, so check Datapane’s site, but as of writing it allows unlimited free public reports.) These reports are interactive and shareable, can be automated, support authentication, and more. To create a report, all you have to do is install the library, initialize it with your account API token, and then build a report in a few lines of code. It will then be uploaded to Datapane and can be shared with anyone in the world. Let’s explore the components of a Datapane report using an example I created on the heart disease dataset.
Setup
The Datapane documentation has everything needed to get set up, which is quick and easy. You will need a supported version of Python (3.6–3.9 as of writing), and the library can be installed using either conda or pip. After installation, to access Datapane Studio, the hosted server, you need to log in with your personal API token. This ties any CLI/API interactions to your account, so don’t share the token. There is also Datapane Teams for organizations, but I won’t cover that in this post, so check out the docs if that interests you. Finally, you can check that everything is set up using these two lines in your Jupyter Notebook or Python session.
import datapane as dp
dp.ping()
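If you prefer to log in from Python rather than the CLI, a minimal sketch might look like the following. It assumes your installed Datapane version exposes dp.login(token=...), so check the docs for your release. The DATAPANE_TOKEN environment variable name and both helper functions are hypothetical names chosen for this example. Reading the token from the environment keeps it out of the notebook itself.

```python
import os


def read_token(env_var="DATAPANE_TOKEN"):
    """Return the API token stored in an environment variable, or None if unset.

    The variable name is a hypothetical choice for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="DATAPANE_TOKEN"):
    """Log in to Datapane with the token from the environment, then test the connection."""
    # Imported here so read_token() can be used without Datapane installed.
    import datapane as dp

    token = read_token(env_var)
    if token is None:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    dp.login(token=token)  # assumed signature; verify against your Datapane version
    dp.ping()
```

Calling login_from_env() once at the top of a notebook would then replace pasting the token into a cell.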
Features
After setting up Datapane, it is important to understand the layout and hierarchy of how reports are built. At the topmost level is a report. A report contains all of your data, visualizations, and anything else created in Python. It also supports code, LaTeX formulas, media, embedding, attachments, and more. A report can be a single page or have multiple pages. Each page can then hold different components, which together convey whatever ideas your data science report needs to get across to the viewer.
Datapane also offers multiple ways to share a report: by embedding it or by sharing a link. You can get email notifications when a report has been updated, and permission settings let you choose whether the report is public or private.
Below is an example of a report I have embedded in this blog. My report is the analysis of the Heart Disease dataset I mentioned earlier. Look through it, interact with it, and notice how clean the aesthetics of Datapane are. While the styling can be modified I think the defaults are great as is.
Implementation
Now you may be asking yourself: great, but how do I implement this? Below is the code I used to create this entire report. If you would like to see the entire code check out the Jupyter Notebook here. Datapane has a very user-friendly API built from what Datapane calls components and what I will call blocks. I use the term blocks because I find it helps to visualize each component as a block, with some blocks inside others.
To start with, the first block is the Report() block. This block contains everything you want in your report. The next block is Page(); you create one Page() block for each page you want your report to have. In my example, I created three pages titled Introduction, Data Analysis, and Classification. Each of these three pages then holds its own components.
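The Report and Page nesting can be sketched roughly as follows. This is a minimal sketch, not the full code behind my report: it assumes a Datapane release where dp.Report, dp.Page(title=..., blocks=[...]), and dp.Text are available (the API has changed between versions), and page_specs and build_report are hypothetical helper names with placeholder text on each page.

```python
PAGE_TITLES = ["Introduction", "Data Analysis", "Classification"]


def page_specs(titles=PAGE_TITLES):
    """Pair each page title with placeholder Markdown for its first Text block."""
    return [
        (title, f"## {title}\n\nContent for this page goes here.")
        for title in titles
    ]


def build_report(titles=PAGE_TITLES):
    """Build a Report containing one Page per title."""
    # Imported here so page_specs() can be inspected without Datapane installed.
    import datapane as dp

    pages = [
        dp.Page(title=title, blocks=[dp.Text(markdown)])
        for title, markdown in page_specs(titles)
    ]
    # The Report is the outermost block; each Page sits directly inside it.
    return dp.Report(*pages)
```

In the versions assumed here, the object returned by build_report() can be written to a standalone file with report.save(path="report.html") or hosted with report.upload(name="Heart Disease Report"), where the report name is just an example. Check the docs for your release, since these method names have changed over time.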
As you add blocks to a page, each new block is placed below the previous one. For example, the first page has a Text() block and a Select() block. The text block contains whatever text you would like and accepts Markdown formatting. The select block lets the viewer switch between the blocks inside it, showing one at a time; in this case, it holds two DataTable() components. Visually, this puts the text at the top of the page with the tables below it, and so on.
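A first page along these lines could be sketched as below. The function names, the Markdown wording, and the choice of a describe() summary as the second table are all assumptions made for illustration, and the label= argument on DataTable is assumed to be what names each option in the select. The df argument is expected to be a pandas DataFrame.

```python
def intro_markdown(n_rows, n_cols):
    """Return Markdown text for the top of the page."""
    return (
        "# Heart Disease Dataset\n\n"
        f"The table below has **{n_rows} rows** and **{n_cols} columns**. "
        "Use the selector to switch between the raw data and a summary."
    )


def build_intro_page(df):
    """Build a page with a Text block above a Select holding two DataTables."""
    # Imported here so intro_markdown() works without Datapane installed.
    import datapane as dp

    # The second table is a hypothetical choice for this sketch.
    summary = df.describe().reset_index()
    return dp.Page(
        title="Introduction",
        blocks=[
            dp.Text(intro_markdown(*df.shape)),  # rendered first, at the top
            dp.Select(
                blocks=[
                    dp.DataTable(df, label="Raw data"),
                    dp.DataTable(summary, label="Summary"),
                ]
            ),  # rendered below the text, one table visible at a time
        ],
    )
```

The order of the blocks list is the order on the page, which is why the text lands above the tables.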
Another block of note is Group(). This allows you to place multiple components in columns on the same row. So if you want three charts on one row, as I have on my second page, you just add the three components to a group (in left-to-right order) and specify the number of columns. The final block I will mention for now is BigNumber(), which displays an important number or metric and, if you want, its change. This is a nice way to draw attention to any important metrics. Datapane has other components, but this was a summary of the main ones. From here you can arrange these blocks in any configuration you like to make a fantastic data science report.
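Here is a hedged sketch of combining the two: one BigNumber per model, laid out side by side with Group. It assumes dp.BigNumber accepts heading, value, change, and is_upward_change, and that dp.Group accepts columns, as in the releases I am assuming. The helper names metric_change and build_metrics_row are hypothetical.

```python
def metric_change(current, baseline):
    """Format a metric and its change from a baseline for display.

    Returns (value, change, is_upward) for scores in the range [0, 1].
    """
    value = f"{current:.1%}"
    change = f"{abs(current - baseline):.1%}"
    return value, change, current >= baseline


def build_metrics_row(scores, baseline):
    """Place one BigNumber per model side by side in a single row."""
    # Imported here so metric_change() works without Datapane installed.
    import datapane as dp

    numbers = []
    for model_name, score in scores.items():
        value, change, is_upward = metric_change(score, baseline)
        numbers.append(
            dp.BigNumber(
                heading=f"{model_name} accuracy",
                value=value,
                change=change,
                is_upward_change=is_upward,
            )
        )
    # One column per block, so everything sits on the same row, left to right.
    return dp.Group(*numbers, columns=len(numbers))
```

For example, build_metrics_row({"Logistic regression": 0.85, "Random forest": 0.88}, baseline=0.80) would give a two-column row. These are placeholder numbers, not results from the report.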
Limitations
In my time using Datapane I haven’t come across many limitations yet. The only main drawback I currently see is limited support for visualization libraries in Python. Datapane currently supports Matplotlib/Seaborn, Altair, Bokeh, Plotly, and Folium. I mainly use Plotly, and this list covers the most common visualization libraries in Python, but it may not cover some of the more niche ones at this time. I would guess that if enough people request an additional library, it would be added.
Another drawback is the language limitation, though it may not be much of an issue. In the data viz and data science community, the other main language besides Python is R, which Datapane doesn’t currently support. Perhaps support will be added in the future, or perhaps R already has sufficient tools for building reports, in which case Datapane may not be needed there.
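If you work with several plotting libraries, a small guard like the one below can flag an unsupported figure before it goes into a report. This is only a heuristic and not part of Datapane: it looks at the top-level package of the figure’s class and compares it with the libraries listed above. All names here are hypothetical, and dp.Plot(fig) is assumed to be the block that wraps a figure.

```python
SUPPORTED_PLOT_PACKAGES = {
    "matplotlib", "seaborn", "altair", "bokeh", "plotly", "folium",
}


def plot_package(fig):
    """Return the top-level package that defines the figure's class."""
    return type(fig).__module__.split(".")[0]


def is_supported(fig):
    """Check whether a figure comes from one of the libraries listed above."""
    return plot_package(fig) in SUPPORTED_PLOT_PACKAGES


def to_plot_block(fig):
    """Wrap a supported figure in a Plot block, failing early otherwise."""
    if not is_supported(fig):
        raise TypeError(
            f"Figures from '{plot_package(fig)}' may not be supported by Datapane"
        )
    # Imported here so the checks above work without Datapane installed.
    import datapane as dp

    return dp.Plot(fig)
```

The set of package names mirrors the list in this post, so it would need updating if Datapane adds support for more libraries.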
Conclusion
Having recently come across Datapane, I wanted to share an example for those who have not yet seen it. I have really enjoyed my time using it so far and can’t wait to see what features the team adds next. It feels like the best tool for creating reports with Python that I have come across, and I hope others find it as useful as I have.