Prediction is the keyword on everybody’s lips. Whether it is Nate Silver predicting the US election (or not), or Facebook investing heavily in VR, business decisions should be based on the world of tomorrow, not on the world of today.

Predictive analytics is taking off in the business world, and every business wants to know what its customers will do next. But how is this done? We looked at the top 20 Google results for “Predicting customer behaviour” and found only vague tips and complex academic papers. This article offers something more concrete: a 5-step guide to predicting customer behaviour using data science methods.

How can we predict our customers’ future behaviours?

Step 1 — Define a clear goal:

For any prediction question, the most important step is to start with a concrete goal. Your goal must be able to produce testable predictions. Saying “Retail purchases will increase” is too vague. When will they increase? By how much? Your predictions will have to pass the “Clairvoyance Test”: they must be stated precisely enough that someone with perfect knowledge of the future could say, without asking for clarification, whether they came true. A better goal is to “identify which customers will make a retail purchase within the next 14 days with 90% accuracy”. Of course, there are many other behaviours you may want to predict, such as customer churn, lifetime value (LTV) or the response to a particular campaign.
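
A goal phrased this precisely can be turned directly into a label computed from data. The sketch below is one way to do that, assuming a hypothetical transaction log of `(customer_id, purchase_date)` pairs and a chosen cutoff date; the helper name `label_purchasers` and the sample data are made up for illustration.

```python
from datetime import date, timedelta

def label_purchasers(transactions, customer_ids, cutoff, horizon_days=14):
    """Return {customer_id: bool}: did the customer buy within
    `horizon_days` after `cutoff`? (Hypothetical helper for illustration.)"""
    window_end = cutoff + timedelta(days=horizon_days)
    labels = {cid: False for cid in customer_ids}
    for cid, purchase_date in transactions:
        # Only purchases after the cutoff and inside the window count.
        if cid in labels and cutoff < purchase_date <= window_end:
            labels[cid] = True
    return labels

# Made-up transaction log.
transactions = [
    ("alice", date(2024, 3, 5)),   # inside the 14-day window
    ("bob", date(2024, 2, 20)),    # before the cutoff
    ("carol", date(2024, 3, 20)),  # after the window closes
]
labels = label_purchasers(
    transactions, ["alice", "bob", "carol"], cutoff=date(2024, 3, 1)
)
```

Because the window and the cutoff are explicit, anyone can check afterwards whether a given prediction was right, which is exactly what a testable goal requires.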

Step 2 — Collect the right data:

Now that you have your goal, what data will you need to achieve it? A good way to decide is to work backwards. We want to predict purchase intent, so it is going to be very handy to know about historical purchases. It might also help to know how many items customers typically buy, how often they make transactions, and whether they buy during sales, after receiving rewards, or before their birthday. The most logical predictors are usually the most informative, but there is no guarantee, and choosing the right data (known as feature selection) can often be more art than science.
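
As a rough illustration of working backwards from the goal, the sketch below summarises each customer's purchase history into a few candidate predictors. The tuple layout `(customer_id, purchase_date, n_items)`, the feature names and the helper `build_features` are assumptions made for this example, not a prescribed schema.

```python
from datetime import date

def build_features(transactions, cutoff):
    """Summarise each customer's history up to and including `cutoff`.

    `transactions` is a list of (customer_id, purchase_date, n_items)
    tuples; this layout is an assumption made for the example.
    """
    history = {}
    for cid, purchase_date, n_items in transactions:
        if purchase_date <= cutoff:  # never use information from after the cutoff
            history.setdefault(cid, []).append((purchase_date, n_items))

    features = {}
    for cid, rows in history.items():
        last_purchase = max(d for d, _ in rows)
        features[cid] = {
            "days_since_last_purchase": (cutoff - last_purchase).days,
            "n_transactions": len(rows),
            "avg_items_per_transaction": sum(n for _, n in rows) / len(rows),
        }
    return features

# Made-up transaction log.
transactions = [
    ("alice", date(2024, 1, 10), 2),
    ("alice", date(2024, 2, 25), 4),
    ("bob", date(2023, 11, 2), 1),
    ("alice", date(2024, 3, 9), 1),  # after the cutoff, so it is ignored
]
features = build_features(transactions, cutoff=date(2024, 3, 1))
```

Filtering on the cutoff matters: a feature that accidentally includes purchases from the prediction window will make the model look far better in testing than it can ever be in use.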

How much data will you need? Unfortunately there is no clear rule, but in the case of retail trends you will want at least 2 years of historical data, so that your model sees each season more than once and can incorporate seasonal trends.

Step 3 — Build a model but start simple

[Figure: Example code using R + Caret for model training]

Next, open up your modelling tool of choice. If you use Python, try Scikit-Learn; for R we like Max Kuhn’s Caret package. Both can implement a large variety of complex machine learning algorithms, and while it is tempting to go fancy, the most important thing is to start with a simple model. This is not only because simple models are the easiest to fit and the most interpretable, but because you can turn them around quickly. That is critical because building predictive models is an iterative process, and the biggest gains can be made quickly. The risk of starting with a complicated model is that you don’t have time to improve on it or, worse, that the sophisticated model turns out to be no better than the simple one while being less interpretable.
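
To show how little a first model needs, here is a one-rule baseline written in plain Python: predict a purchase if the customer bought recently enough. It is only a stand-in for “a simple model” (a logistic regression in Scikit-Learn or Caret would be an equally reasonable start); the single feature, the helper names and the training data are all made up for illustration.

```python
def fit_recency_rule(days_since_last, purchased):
    """Pick the threshold t that maximises training accuracy for the rule
    'predict a purchase if days_since_last_purchase <= t'."""
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(days_since_last)):
        predictions = [d <= t for d in days_since_last]
        hits = sum(p == y for p, y in zip(predictions, purchased))
        accuracy = hits / len(purchased)
        if accuracy > best_accuracy:  # ties keep the smaller threshold
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

def predict_recency_rule(threshold, days_since_last):
    """Label each customer 'Purchase' or 'No Purchase' using the fitted rule."""
    return ["Purchase" if d <= threshold else "No Purchase" for d in days_since_last]

# Made-up training data: days since last purchase, and whether the
# customer went on to buy within the following 14 days.
days = [3, 7, 12, 30, 45, 60, 90, 120]
bought = [True, True, True, False, True, False, False, False]

threshold, train_accuracy = fit_recency_rule(days, bought)
new_labels = predict_recency_rule(threshold, [5, 40])
```

A baseline like this takes minutes to build, and every more sophisticated model then has a number to beat.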

Step 4 — Test your model

Congratulations, you now have a model that makes predictions. In our case, we wanted to know which customers are likely to make a purchase in the next 14 days. The result of our model is simply a new column or variable in our data with the label ‘Purchase’ or ‘No Purchase’ for each customer in our database. It is now important to practise good data hygiene. Make sure you test your predictions on an independent dataset that has not been used in training the model. In most cases this will be a ‘hold out’ sample of the data, typically 20%. This will give you the most reliable estimate of how your model will perform in the ‘real world’.
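
A minimal sketch of the hold-out idea, using only the standard library: shuffle the rows reproducibly, set 20% aside, and score the model only on that part. The helper names, the synthetic rows and the fixed-threshold `predict` function are assumptions for the example; in practice `predict` would be whatever model was fitted on the training rows.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out `test_fraction` of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Share of predictions that match the observed outcome."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Synthetic rows of (days_since_last_purchase, purchased_within_14_days).
rows = [(d, d <= 20) for d in range(1, 101)]
train_rows, test_rows = holdout_split(rows)

# Stand-in for a model fitted on train_rows (hypothetical threshold).
def predict(days_since_last):
    return days_since_last <= 25

holdout_accuracy = accuracy(
    [predict(d) for d, _ in test_rows],
    [y for _, y in test_rows],
)
```

For a time-dependent target such as a 14-day purchase, splitting by date (training on earlier periods and testing on later ones) is often a stricter check than a random split.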

Step 5—Set (but don’t forget) your model

[Figure: Use a REST API to make your model accessible to others in your organisation. Image from yhat.com]

To make use of your model you need to give it a place to work for you. These are your deployment options:

  • Keep it local: The simplest method is to run the model on the machine that trained it. This requires the least amount of work to set up.
  • Batch it: Set up a cron job or other automated service to run the model every hour/day/week and add the predictions to your database.
  • Let it REST: By creating a RESTful API, anyone in your organisation can easily access the model via HTTP and generate predictions in real time.
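
To make the REST option concrete, here is a minimal sketch built on Python's standard-library `http.server`. The `/predict` path, the port, the JSON field name and the hard-coded threshold are all assumptions for illustration; a production service would more likely sit behind a proper web framework and load a saved model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECENCY_THRESHOLD_DAYS = 12  # hypothetical value from an earlier training run

def predict_payload(payload):
    """Turn a request body such as {"days_since_last_purchase": 5}
    into a prediction dictionary."""
    days = payload["days_since_last_purchase"]
    label = "Purchase" if days <= RECENCY_THRESHOLD_DAYS else "No Purchase"
    return {"prediction": label}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(predict_payload(payload)).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing field, or a non-numeric value.
            self.send_error(400, "Expected JSON with days_since_last_purchase")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port=8000):
    """Build (but do not start) the HTTP server."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the service; a colleague can then POST `{"days_since_last_purchase": 5}` to `http://127.0.0.1:8000/predict` and receive a prediction as JSON. Keeping the prediction logic in `predict_payload` means the same function can also be reused by a batch job.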

Whatever your deployment option, keep in mind that a predictive model is only as good as the data it was built on. If you let the model sit for too long, the data it was trained on will become more and more out of date, and its predictions will get steadily worse. Always remember to retrain your models on the most recent data, or else they will quickly lose their value without you realising.
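
One way to avoid forgetting a deployed model is to check its age and its recent accuracy on a schedule. The sketch below is a simple illustration of that idea; the 30-day limit, the 5-point accuracy tolerance and the helper name `needs_retraining` are arbitrary assumptions, not recommended values.

```python
from datetime import date, timedelta

def needs_retraining(trained_on, today, max_age_days=30,
                     baseline_accuracy=None, recent_accuracy=None,
                     max_accuracy_drop=0.05):
    """Flag a model as stale if it is older than `max_age_days`, or if its
    accuracy on recent labelled data has fallen by more than
    `max_accuracy_drop` compared with the hold-out accuracy at training time."""
    too_old = (today - trained_on) > timedelta(days=max_age_days)
    degraded = (
        baseline_accuracy is not None
        and recent_accuracy is not None
        and (baseline_accuracy - recent_accuracy) > max_accuracy_drop
    )
    return too_old or degraded

# Two weeks old and no accuracy information: still considered fresh.
fresh = needs_retraining(date(2024, 1, 1), today=date(2024, 1, 15))

# Two months old: flagged on age alone.
stale_by_age = needs_retraining(date(2024, 1, 1), today=date(2024, 3, 1))

# Recent, but accuracy fell from 90% to 80%: flagged on performance.
stale_by_accuracy = needs_retraining(
    date(2024, 1, 1), today=date(2024, 1, 15),
    baseline_accuracy=0.90, recent_accuracy=0.80,
)
```

Run from the same scheduled job that produces the predictions, a check like this turns “without you realising” into an explicit alert.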

Last Word

Predictive modelling is an iterative process. Follow these 5 steps and you will be able to quickly generate predictions, but for those predictions to stand the test of time, revisit your model frequently.
