Here is how you should be using data instead.
I can’t go one day without reading about big data. Companies often proudly proclaim how they have access to millions of data points (well, not anymore since a certain Ivy League-brand brought disrepute to the whole story). In the zeal to think big, we often forget why we were looking at the data in the first place.
We look at data to make decisions.
That is it, nothing else. Businesses — large multinational corporations and tiny garage startups alike — look at data solely because it helps them make informed decisions.
Companies are not run by analysts who pore over data every day. They are run by leaders who use (a fraction of the) data for effective decision-making.
So what should we be doing with data instead? My points below may not apply to everyone equally… but if you are a startup with limited resources, then these points apply to you. So, read on.
Let’s start with some basic assumptions. Assume you have a data dump of the orders on your e-commerce store for the last two years: say, about 200,000 orders from 120,000 customers in 120 cities across your country. That also means 120,000 e-mail addresses and phone numbers, plus shipment data and returns data.
What problem are you trying to solve?
Yes, that’s right. Don’t dive into the data right away. Look at what decision needs to be made. If you are working with a senior executive or founder, then ask them what challenge they are trying to solve. Don’t settle for: “I want to find out more about our orders.” Push for questions at least as specific as the examples below:
- What can we do to reduce the return rates on our orders? (broad)
- What can we do to bring back the customers who have not purchased from us in the last six months? (more specific)
- What ZIP codes do we get the most profits from? (quite specific)
Build your hypothesis first.
Let’s take the first question as an example and determine what a potential hypothesis could be. If the aim is to reduce the return rates on our orders, we could look at where most of the returns are coming from. But wait: don’t look at the data just yet. Is this really the best hypothesis? Do you want to know where returns are coming from, or do you want to know why the returns are happening? Since the question clearly is to find out the reasons for returns, build a hypothesis like this instead:
Customers are returning goods because shipments take too long to reach them.
Determine what data you need to prove/disprove your hypothesis.
This hypothesis is clearly going to be proved or disproved by your data. How do you test this? One way could be to check whether the average delivery time for orders that were returned is higher than the average delivery time for closed orders (let’s say closed orders are orders that were not returned).
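As a rough illustration, the comparison of averages could be sketched like this in Python. The sample orders and the helper name `mean_delivery_days` are made up for the example; they are assumptions, not real data or part of any particular tool.

```python
from statistics import mean

# Hypothetical sample: (delivery_days, was_returned) for each order.
orders = [
    (2.0, False), (3.0, False), (4.0, False), (3.0, False),
    (6.0, True), (7.0, True), (5.0, False), (8.0, True),
]


def mean_delivery_days(orders, returned):
    """Average delivery time for orders whose return flag equals `returned`."""
    return mean(days for days, was_returned in orders if was_returned == returned)


returned_avg = mean_delivery_days(orders, returned=True)
closed_avg = mean_delivery_days(orders, returned=False)
print(f"returned: {returned_avg:.2f} days, closed: {closed_avg:.2f} days")
```

If the returned orders show a clearly higher average than the closed ones, the hypothesis survives this first check.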
You could go a step further and think: what if averages skew the numbers so that returned orders and closed orders appear to have similar delivery times? That’s a valid risk. So instead, rank the orders by delivery time and determine the delivery times at (say) the 25th, 50th, and 70th percentiles.
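Here is a minimal sketch of the percentile idea using Python's standard `statistics.quantiles`. The function name `delivery_percentiles` and the sample delivery times are assumptions made up for illustration, and the "inclusive" interpolation method is just one reasonable choice among several.

```python
from statistics import quantiles


def delivery_percentiles(days, points=(25, 50, 70)):
    """Return the delivery time at each requested percentile.

    quantiles(..., n=100) yields 99 cut points; cut point p - 1 is the
    p-th percentile.
    """
    cuts = quantiles(days, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Hypothetical delivery times in days for each group of orders.
returned_days = [3, 5, 6, 7, 8, 9]
closed_days = [1, 2, 2, 3, 3, 4, 5]

print("returned:", delivery_percentiles(returned_days))
print("closed:  ", delivery_percentiles(closed_days))
```

Comparing the two groups percentile by percentile shows whether returned orders are slower across the board or only at the slow end, which a single average would hide.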
It’s important to note here that you probably need only a fraction of the data for this analysis: simply order numbers, order dates, delivery dates, and whether each order was returned. You don’t care about the customer’s location, value of orders, or birth dates, at least not to test this specific hypothesis.
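To make that concrete, here is a small sketch that keeps only the fields the hypothesis needs and derives a delivery time from the two dates. The column names, the `returned` flag, and the two sample rows are all hypothetical; a real export will look different.

```python
import csv
import io
from datetime import date

# Hypothetical export; the column names are assumptions for illustration.
RAW = """order_id,order_date,delivery_date,returned,city,order_value
1001,2023-01-02,2023-01-05,no,Pune,40.00
1002,2023-01-03,2023-01-10,yes,Delhi,15.50
"""

# The only columns this hypothesis needs; everything else is ignored.
NEEDED = ("order_id", "order_date", "delivery_date", "returned")


def load_delivery_times(text):
    """Parse the export, drop unneeded columns, and compute delivery days."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        slim = {key: record[key] for key in NEEDED}
        ordered = date.fromisoformat(slim["order_date"])
        delivered = date.fromisoformat(slim["delivery_date"])
        rows.append({
            "order_id": slim["order_id"],
            "delivery_days": (delivered - ordered).days,
            "returned": slim["returned"] == "yes",
        })
    return rows


rows = load_delivery_times(RAW)
print(rows)
```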
Ok, now you can run your analysis.
What remains is a much shorter analysis. You could pull all the data and test your hypothesis in 15 minutes flat. Ideally, remove the extremes of the data: e.g., if some orders got delivered in two hours and some delivery attempts happened 45 days later, remove those orders from your analysis. Focus the rest of your time on weaving a story around the data, and presenting your findings. Something like this:
We looked at the time taken to deliver each order and mapped the orders across a percentile scale. We found that orders in the slowest 10 percent by delivery time were returned 50% more often than orders in the remaining 90 percent. The slowest 10 percent of deliveries take 5.45 days or more. This means we have to ensure delivery to the customer in less than five and a half days.
There you go. Isn’t that more impactful?
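For the curious, the analysis behind a finding like the one above could be sketched roughly as follows: drop the extreme orders, find the delivery time at the 90th percentile, and compare return rates on either side of it. The trimming thresholds, the function names, and the sample orders are all assumptions made up for illustration; none of the numbers are real results.

```python
from statistics import quantiles


def trim_extremes(orders, min_days, max_days):
    """Keep only orders whose delivery time falls within [min_days, max_days]."""
    return [(days, ret) for days, ret in orders if min_days <= days <= max_days]


def return_rate(orders):
    """Share of orders that were returned (0.0 for an empty group)."""
    if not orders:
        return 0.0
    return sum(1 for _, ret in orders if ret) / len(orders)


def compare_slowest_decile(orders):
    """Return (cutoff_days, return rate of slowest 10%, return rate of the rest)."""
    days = [d for d, _ in orders]
    cutoff = quantiles(days, n=10, method="inclusive")[-1]  # 90th percentile
    slow = [o for o in orders if o[0] >= cutoff]
    rest = [o for o in orders if o[0] < cutoff]
    return cutoff, return_rate(slow), return_rate(rest)


# Hypothetical sample: deliveries taking 1..20 days, a few of them returned,
# plus two extreme orders (about two hours, and 45 days).
returned_on = {3, 9, 15, 19}
orders = [(float(d), d in returned_on) for d in range(1, 21)]
orders += [(0.08, False), (45.0, True)]

trimmed = trim_extremes(orders, min_days=0.5, max_days=30)
cutoff, slow_rate, rest_rate = compare_slowest_decile(trimmed)
print(f"cutoff: {cutoff:.2f} days, slowest 10%: {slow_rate:.0%}, rest: {rest_rate:.0%}")
```

Where exactly to draw the line for "extreme" orders is a judgment call; the thresholds here are placeholders to be replaced with whatever makes sense for your business.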