June 20th 2020
The competition began on February 13 and was originally scheduled to end on April 6. The challenge was “to tell a data story about college basketball through a combination of both narrative text and data exploration.” The rules for the notebook were relatively broad, so you could explore almost any topic as long as it related to March Madness.
I began exploring the data and saw that the opportunities were basically endless. There was data about both the Men’s and Women’s Tournament and Regular Season going back over 20 years. Fearful of drowning in a sea of data, I began brainstorming topics that would allow me to be creative and produce good visualizations, as well as have a chance at finishing this notebook on time.
After exploring some data on points per game (PPG) and the track records of different teams, an idea popped into my head. It was just a passing thought at first, but then, after a few hours, I could not stop thinking about it. I had begun to deeply wonder about a connection between the men’s and women’s data.
The more I thought, the more the idea took shape: schools that have two good teams (men’s and women’s) could be more likely to win games in the NCAA Tournament.
It was a crazy thought, I know. I didn’t even have any evidence or experience to back up the idea. But on some level, it just clicked with me. So, from there I began my exploration into what I have dubbed the “Two-Team Factor.”
I first needed to establish what exactly this factor was and how I could explain it in my notebook. Basically, the Two-Team Factor is the idea that when a program or school (like, say, Duke) has both its Men’s and Women’s teams competing in their respective NCAA Tournaments in the same season, those teams are more likely to win games and advance than teams from programs with only one team in the tournament.
I know that is a mouthful, so I condensed it to Two-Team programs and Single Team programs. Two-Team programs are special. They are an anomaly: schools lucky enough to watch both their Men’s and Women’s teams play in March Madness in the same year.
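To make the definition concrete, here is a minimal sketch of how Two-Team programs could be flagged. It assumes each tournament field is available as a set of (season, school) pairs. The school names, the season, and the helper names `two_team_programs` and `label_programs` are hypothetical and are not taken from the original notebook.

```python
# Toy tournament fields as (season, school) pairs.
# These schools and this season are made up for illustration only.
mens_field = {(2019, "School A"), (2019, "School B"), (2019, "School C")}
womens_field = {(2019, "School B"), (2019, "School C"), (2019, "School D")}


def two_team_programs(mens, womens):
    """Return the (season, school) pairs that appear in both fields."""
    return mens & womens


def label_programs(field, two_team):
    """Label every entry in one field as 'Two-Team' or 'Single Team'."""
    return {
        entry: "Two-Team" if entry in two_team else "Single Team"
        for entry in field
    }


two_team = two_team_programs(mens_field, womens_field)
mens_labels = label_programs(mens_field, two_team)

for entry in sorted(mens_labels):
    print(entry, mens_labels[entry])
```

Keying on the season as well as the school matters here, because a program can be Two-Team one year and Single Team the next.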
As this idea began to take shape and my fantasies about maybe even winning my first competition grew, the world suddenly changed. COVID-19 (Coronavirus) had spread to the U.S., and in early March the NCAA canceled the remainder of the college basketball season, including March Madness. This was a shock, and soon the prediction competitions for both the Men’s and Women’s Tournaments were canceled too.
As I worried about whether my hard work had been wasted, Kaggle announced that the analytics competition for Google Cloud & March Madness would continue and the deadline would be extended to April 30. This was great news, which also allowed many contestants from the prediction competitions to move into the analytics competition. Surprisingly, this influx of competitors only motivated me to work harder on my idea. Now that I had extra time, I could really flesh it out and make some great visualizations.
. Each library offered something different that I could use.
is the classic, the one that I, like everyone else, started out using. It is really great for specifying how each and every detail of your visualization will look, but this also means the code can become a very long and complex block.
is a visualization library that is built on top of
. It allowed me to use shorter code to make my plots. However,
was really the library that I had never used before, and it became a large part of my arsenal.
is a web-based library that allowed me to build interactive plots. This means I could build plots like this:
that allow users to hover over the columns and see the different values. It also allows me to make bar charts like this that can compare data from two different variables.
As I progressed, I learned that a notebook can get out of hand very quickly. I had done this entire competition so far in one notebook. It consisted of many of my experiments and many failed attempts. As the notebook got longer and longer it became more difficult to keep up with the variables I created and the different datasets.
My original notebook had served its purpose. I was able to explore the data and reach the conclusion that the Two-Team Factor had merit. I discovered that in each round the “Two-Team” programs, as a percentage of the teams in the round, grew from under 30% in the first round to over 60% in the final rounds. This held true for both the men’s and women’s tournaments, and showed that these few teams must be winning most of their matchups.
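The round-by-round percentage described above can be sketched as follows. This is a minimal illustration under assumptions: it expects one (season, round, school) record for every team that played in a given round, plus a set of Two-Team (season, school) pairs. The function name `two_team_share_by_round` and the toy records are hypothetical, and the resulting numbers are not the competition results.

```python
from collections import defaultdict


def two_team_share_by_round(appearances, two_team):
    """Return {round: fraction of teams in that round that are Two-Team}.

    appearances: iterable of (season, round_number, school) records,
                 one per team that played in that round.
    two_team:    set of (season, school) pairs flagged as Two-Team.
    """
    totals = defaultdict(int)  # teams seen per round
    hits = defaultdict(int)    # Two-Team teams seen per round
    for season, round_number, school in appearances:
        totals[round_number] += 1
        if (season, school) in two_team:
            hits[round_number] += 1
    return {rnd: hits[rnd] / totals[rnd] for rnd in sorted(totals)}


# Toy example: four teams in round 1, two survivors in round 2.
toy_appearances = [
    (2019, 1, "School A"),
    (2019, 1, "School B"),
    (2019, 1, "School C"),
    (2019, 1, "School D"),
    (2019, 2, "School B"),
    (2019, 2, "School C"),
]
toy_two_team = {(2019, "School B")}

shares = two_team_share_by_round(toy_appearances, toy_two_team)
print(shares)
```

If the share rises from one round to the next, Two-Team programs are surviving at a higher rate than the rest of the field, which is the pattern the exploration pointed to.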
During this entire experimentation period, the submission date was drawing nearer. It caused a certain amount of stress, but just enough to motivate me. I also had a lot of extra time because of the coronavirus pandemic, which allowed me to put in extra hours in the last couple of weeks. I eventually reached a point where there was no more time for exploration, and it was time to create a beautiful notebook.
My personal requirements for a beautiful notebook were: (1) clear and concise writing, (2) great visualizations, and (3) understandable, concise code.
style, and others need a simple
Finding the perfect fit took some tinkering, but I think it improves the value of a visualization a lot. Finally, I came to the code. I had decided I would use the hide feature on most of it in the final notebook for readability. But I still wanted my code to be clear, efficient, and understandable. The only problem was that I had written so much unstructured code in my original notebook that it was hard to piece it all together in the final one.
can be dangerous to developers, but data scientists love them. I love them! They are easy to use and great for data exploration, but as soon as you start running cells out of order things can get out of hand fast.
This was a big hurdle, but I now believe developer skills are much more valuable to a data scientist than I originally thought. Even though it is a
, we need reproducible code that has structure. This is something I want to focus on in the near future, and develop these programming skills further.
I could not have asked for a better first live competition. The other competitors were great and I was able to really dive deep into some of the skills I wanted to work on. I think this competition has really improved my dataset manipulation and visualization skills. This is something that I hope will be extremely helpful in creating EDAs in future competitions, as well as creating great notebooks to share.