Sometimes, when introducing the idea of business intelligence for organizations, there is not a lot of clarity around what exactly business intelligence is, or what it can offer. To that end, one request our team gets from time to time is to put together a generic example of data visualization that we can share with clients. I recently received one of these requests. I thought to myself, “I like data visualization, I like hockey; let’s put those together. But how do I go from here to something I can show a client?”
My goal was simple: use open NHL data to create a data visualization that would offer value to end users.
I decided to use sports (specifically hockey) data for a few reasons. First, I really like hockey. Second, sports data lends itself to complex data models that are still conceptually easy to understand. It is also free of industry bias, making it easy to establish cause and effect. Finally, sports data is publicly available and ready to be tapped into.
A little searching led me to the somewhat undocumented NHL Stats open API. I say somewhat undocumented because some very smart individuals have put together unofficial documentation for the API, which helps the likes of me use it.
Like most web APIs, the gist of this one is that I make an HTTP GET request and the API sends me back some data, in JSON format. Each call returns a single “object”, depending on which endpoint I am calling (e.g. calling the Team endpoint returns a “Team” object). The main endpoint I needed for this project was the Feed endpoint, which returns an object with all the events in a single game, along with some metadata about the game. I could have visited the API with a web browser and saved the result to a JSON file, but the catch is that a single call returns data for a single game in a single season. Getting all the games played in one season would have required almost 1,500 calls, and I ended up needing data from more than one endpoint. I needed to automate this.
I chose to create a very basic SSIS package to automate the data extraction process. While repeatedly making GET requests isn’t exactly pushing SSIS to its limits, the tool made the extraction easy to parameterize and rerun, which was all I needed.
All the SSIS package had to do was loop through some input parameters (which seasons to download and which games), call each endpoint for each game, and save the data to local JSON files. It took a while to download, but once it was done, I had local data that my BI applications could ingest.
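The loop itself is simple enough to sketch outside of SSIS. Here is a minimal Python version, assuming the endpoint path and game-ID format (season start year + game type + game number) described in the unofficial API documentation; treat both as assumptions to verify, not a definitive spec.

```python
# Sketch of the extraction loop the SSIS package performs.
# ASSUMPTION: the base URL, the /game/{id}/feed/live path, and the
# game-ID format come from the unofficial NHL Stats API docs.
import urllib.request
from pathlib import Path

BASE_URL = "https://statsapi.web.nhl.com/api/v1"

def game_ids(season: int, n_games: int) -> list[str]:
    """Build regular-season game IDs, e.g. 2018020001 for the first
    regular-season game ("02") of the 2018-19 season."""
    return [f"{season}02{n:04d}" for n in range(1, n_games + 1)]

def download_feeds(season: int, n_games: int, out_dir: str) -> None:
    """Call the Feed endpoint for each game and save the raw JSON."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for gid in game_ids(season, n_games):
        url = f"{BASE_URL}/game/{gid}/feed/live"
        with urllib.request.urlopen(url) as resp:
            (out / f"{gid}.json").write_text(resp.read().decode("utf-8"))
```

In the real package, the season and game ranges were the input parameters being looped over; the sketch mirrors that with the `season` and `n_games` arguments.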
The next step in my process was to load the data from the local repository to my BI application; for this project I was using Microsoft’s Power BI to do both my data modelling and data visualizations. Luckily for me, Power BI has an option to load and combine files from a folder. Once I had my JSON files in a folder (grouped by source endpoint), I could apply my transformation steps in Power BI in one go and have them roll out to all the files. To start though, I just wanted to open and read all the files into memory. Unfortunately, I quickly ran into an issue.
The error Power BI threw wasn’t exactly the most helpful message, and I admit, at first I started down the wrong rabbit hole. I won’t get into all the troubleshooting details, but long story short, I had one file (out of roughly 16,000) that was not valid JSON, which caused the error. Once I had that figured out, it was easy enough to remove the offending file and load the data into Power BI.
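Finding one bad file among thousands is the kind of thing worth scripting. A sweep like the following (a Python sketch, not the exact troubleshooting I did) tries to parse every saved file and reports the ones that fail:

```python
# Report every file in a folder that does not parse as JSON.
import json
from pathlib import Path

def find_invalid_json(folder: str) -> list[str]:
    """Return the names of *.json files in folder that fail to parse."""
    bad = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError:
            bad.append(path.name)
    return bad
```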
So, what does that mean for my project? It means I needed to move some of the cleaning steps of my ELT process to before the data is ingested into Power BI (which I guess makes this an ETLT process, patent pending). For me, this was a simple exercise: add a component to my SSIS package that confirms each API call returned a valid JSON string before saving it to a file. Depending on the situation, though, it could have been more complicated. The moral: never trust your source for clean data!
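The validation component boils down to a gate in front of the file write. A minimal Python equivalent of that check (the SSIS component itself was built differently, but the logic is the same):

```python
# Refuse to write an API response to disk unless it parses as JSON.
import json
from pathlib import Path

def save_if_valid_json(payload: str, path: str) -> bool:
    """Write payload to path only if it is a valid JSON document."""
    try:
        json.loads(payload)
    except json.JSONDecodeError:
        return False  # skip (or retry) the bad response instead of saving it
    Path(path).write_text(payload)
    return True
```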
JSON is a semi-structured data format, but most data visualization tools work best with relational data. That meant the next step in this process was to come up with a relational model for my data and transform the data into it.
There’s no single right way to model data, and no one way to transform data to get there. The same dataset can be modelled several ways to answer different questions. For this project, since I was starting from semi-structured, nested data that was also quite denormalized, my strategy was to flatten the source out and then renormalize it: section the subsets into tables, then build what I needed to create my relationships and hierarchies.
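To make the flattening step concrete, here is a toy illustration in Python. The nested shape below is simplified and hypothetical, not the real feed schema; the point is only the pattern of walking nested events and emitting one flat row each:

```python
# Flatten a nested game feed into one row per goal event.
# ASSUMPTION: the dict shape here is illustrative, not the real schema.
def extract_goals(feed: dict) -> list[dict]:
    """Pull a flat row for each goal event out of a nested game feed."""
    game_id = feed["gamePk"]
    rows = []
    for play in feed["plays"]:
        if play["type"] != "GOAL":
            continue  # skip shots, penalties, faceoffs, etc.
        rows.append({
            "game_id": game_id,
            "team": play["team"],
            "scorer": play["scorer"],
            "x": play["x"],
            "y": play["y"],
        })
    return rows
```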
Before I started transforming any data, I needed to know what I wanted it to end up as (at least at a conceptual level). Rather than trying to create a full relational model, I focused on defining my facts and dimensions. For this data model, I kept it quite simple, with three facts (goals, penalties, and faceoffs) and three dimensions (players, teams, and seasons).
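The renormalization itself is mechanical once the facts and dimensions are chosen: split flat rows into a fact table plus dimension tables joined by surrogate keys. A small sketch of that split for the goals fact and players dimension (field names are illustrative, not the real schema):

```python
# Split flat goal rows into a players dimension and a goals fact,
# linked by a surrogate player key. Field names are illustrative.
def split_star(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (dim_player, fact_goal) built from flat goal rows."""
    player_key: dict[str, int] = {}
    dim_player, fact_goal = [], []
    for row in rows:
        name = row["scorer"]
        if name not in player_key:
            player_key[name] = len(player_key) + 1  # new surrogate key
            dim_player.append({"player_id": player_key[name], "name": name})
        fact_goal.append({"player_id": player_key[name],
                          "game_id": row["game_id"],
                          "x": row["x"], "y": row["y"]})
    return dim_player, fact_goal
```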
For the few lucky data visualization projects where there are defined requirements about the target audience, what they want to see, and what questions they want answered, life’s good. For the rest of us, often the main challenge to overcome is simply deciding what visualizations to display.
Referring back to the idea I used at the start of this venture, all I said was that I wanted to create something that offered value to end users. To translate that idea into a visualization I could actually build, I went through a simplified thought process.
So now that I had an idea about what might be important to end users, I wanted to create a data visualization focused on goal scoring. The next step was to figure out how to display the data.
Conveniently, the API includes the location of each goal in Cartesian coordinates. By plotting these points on a scatterplot and adding a background image of a hockey rink, we get a nice visual of all the goals scored by a team. Add some coordinate normalization, a mirrored graph, and some filters, and the result is an interactive dashboard that lets a user select two teams and compare goal volume and locations. Now that’s something that offers interesting, data-based insights into a game.
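The coordinate normalization deserves a word: the rink coordinates run roughly x in [-100, 100] and y in [-42.5, 42.5] with center ice at the origin, and teams switch ends during a game, so raw goal locations land at both ends. Rotating points from the left half 180° around the origin puts every goal at the same end; negating x again then mirrors one team onto the other half for the side-by-side comparison. A minimal sketch (the real dashboard also uses period and team metadata to decide which points to flip):

```python
# Normalize rink coordinates so all goals attack the same net, and
# mirror one team's points for side-by-side comparison.
def normalize(x: float, y: float) -> tuple[float, float]:
    """Rotate left-half points 180 degrees so all attacks face right."""
    return (-x, -y) if x < 0 else (x, y)

def mirror(x: float, y: float) -> tuple[float, float]:
    """Flip a normalized point to the left half for the second team."""
    return (-x, y)
```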
Sometimes, we don’t know what Business Intelligence can offer us, or what we can do with it, until we try to actually do something. Creating quick prototypes and examples is one way to see what is possible. It can also be an incredibly helpful tool in gathering requirements: often, business users have trouble articulating what they want, but once they see what we have created, they can offer suggestions or wish lists for improvement. These steps offer one path for taking an idea for a prototype and turning it into something demoable.
Need help with gaining deeper business insights out of your data? Get in touch with our BI team.