A high level explanation of a data analytics project workflow
It seems like everyone’s talking about data analytics these days, but what exactly data analytics “means” sometimes gets lost in the presentations, outcome lists and jargon. In this post I’ll explain what data analytics means, and what the shape of a data analytics project looks like.
In its most basic sense, data analytics is taking raw data and using it to draw conclusions. In a business context, this means using data to help guide thinking in areas as wide-ranging and diverse as marketing strategies, investment priorities, problem identification/characterization and customer behavior mapping.
Data analytics is a kind of multi-tool that allows data scientists to make predictions about anything that can be determined from a set of historical data. And although there are a number of different ways to go about doing data analytics, the general shape of an investigation goes like this:
Step 1: Figure out the business goal(s) for the study and narrow the scope
Before you start gathering data, running algorithms or making charts, data analytics projects almost always begin with a business discussion: what problem(s) are you trying to solve? Are you trying to find out when users are leaving your website? Do you need to better understand what factors determine which users use a particular service? Once you’ve got the larger business challenge set, the first major step of the project becomes narrowing down the scope as much as possible to try to predict a single thing.
A tried and true method is to pick a predictable outcome that can in turn be used to drive another, perhaps less tangible, outcome, whether that be attracting more users, increasing revenue, improving delivery, etc. By gaining a better understanding of when and why users quit your service, for example, you can then do things like offer them promotions that meet their needs before their engagement runs out. So by understanding something about behavior using data, you can solve a business problem (in this case, slowing user attrition).
Step 2: Make a hypothesis (about your data)
Just like in science class, you need to make an educated guess about what information is going to help you predict the desired outcome: will demographic data give you the information you need? Behavior/usage data? Environmental/economic data?
Once you’ve figured out what kind of data you need, you also need to make a hypothesis as to how much. We refer to the data used during a project as a “snapshot.” Based on the scope and goals of the study, you’ll pick a discrete set of data – say, three years’ worth – to create a static, historical “snapshot” with which to build the model.
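To make the “snapshot” idea concrete, here’s a minimal sketch of carving a fixed historical window out of a larger record set. The record layout, user IDs and dates are all hypothetical, invented purely for illustration:

```python
from datetime import date

# Hypothetical usage records: (user_id, event_date, churned)
records = [
    ("u1", date(2019, 6, 1), False),
    ("u2", date(2021, 3, 15), True),
    ("u3", date(2022, 9, 30), False),
    ("u4", date(2023, 1, 10), True),
]

def snapshot(records, start, end):
    """Keep only records whose date falls inside the snapshot window."""
    return [r for r in records if start <= r[1] <= end]

# A three-year historical snapshot, as described above
window = snapshot(records, date(2020, 1, 1), date(2022, 12, 31))
```

The point is that the snapshot is frozen once chosen: the model is built against this fixed slice, not against live, shifting data.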
Step 3: Gather and "cleanse" your data
The data you use in the study may come from internal sources and analytics, from commercial sources, or from public and/or free sources such as open data collections. Wherever it comes from, in the end all of the data must be brought into the same environment so it can be examined and analyzed together. There are a number of solutions and environments in which to do this (Hadoop, cloud-based solutions, etc.), but the important thing is to have an environment in which it’s easy to consolidate the data.
Once you’ve put together all the data from all your different sources, it’s time to engage in a process called “data cleansing.” During this process all the data is transformed into a format that a data scientist can work with. Sometimes this is as simple as making sure all the timestamps are in the same time zone and format; other times this step involves more complicated processes like back-filling blank fields with derived data.
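Both cleansing tasks mentioned above – normalizing timestamps and back-filling blanks – can be sketched in a few lines. The field names and the choice to fill blanks with the mean of known values are assumptions for illustration, not a prescription:

```python
from datetime import datetime, timezone

# Hypothetical raw rows with mixed time zones and a blank field
raw = [
    {"ts": "2023-05-01T09:00:00-05:00", "duration": 12.0},
    {"ts": "2023-05-01T16:30:00+02:00", "duration": None},  # blank
    {"ts": "2023-05-02T08:15:00+00:00", "duration": 20.0},
]

def cleanse(rows):
    # Normalize every timestamp to UTC so they're all comparable
    for row in rows:
        ts = datetime.fromisoformat(row["ts"])
        row["ts"] = ts.astimezone(timezone.utc).isoformat()
    # Back-fill missing durations with the mean of the known values
    known = [r["duration"] for r in rows if r["duration"] is not None]
    fill = sum(known) / len(known)
    for row in rows:
        if row["duration"] is None:
            row["duration"] = fill
    return rows

clean = cleanse(raw)
```

Real cleansing pipelines are messier, of course, but the shape is the same: get everything into one consistent format before any modeling starts.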
Step 4: Train the model
Now we get to the fun part. One way to think of data analytics is to think of it like a math equation: the datasets are combined into variables on the left-hand side of the equation, and the outcome you’re trying to predict sits on the right. Because you’ve already settled on an outcome from the right side of the equation, the challenge is filling in the variables on the left to balance the equation.
Essentially, you need to figure out which of the variables are most predictive of the desired outcome. To do this, you engage in a process called “training the model”: you iteratively test which variables/fields have the most power to predict your desired outcome, checking the results against your hypothesis about which data would produce the best prediction.
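As a toy version of “which variables are most predictive,” here’s a sketch that ranks two hypothetical features by the absolute Pearson correlation of each with a churn outcome. Real training uses proper learning algorithms and held-out data; this just illustrates the idea of comparing variables’ predictive power:

```python
import math

# Hypothetical snapshot rows: (visits, age, churned)
rows = [
    (2, 34, 1), (15, 29, 0), (3, 41, 1), (20, 35, 0),
    (1, 52, 1), (18, 27, 0), (4, 45, 1), (22, 31, 0),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

churn = [r[2] for r in rows]
scores = {
    "visits": abs(pearson([r[0] for r in rows], churn)),
    "age": abs(pearson([r[1] for r in rows], churn)),
}
best = max(scores, key=scores.get)  # the most predictive variable
```

In this made-up data, low visit counts line up almost perfectly with churn, so `visits` comes out as the strongest predictor.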
Step 5: Score your model
After the model has been sufficiently trained, the output will include a “score” of some kind, telling you the accuracy of the prediction, not unlike a “percent confidence” score. The model will also give you “reason codes,” letting you know which data contributed most to any prediction.
Based upon this information, you can take different actions depending on the results of the scoring of a given prediction. For example, if you see that an undesirable outcome is correlated via reason codes to something like user skill level, that might signal a need to invest in better instructions and walk-throughs within your app.
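One simple way to picture scores and reason codes together: with a linear model, each feature’s weight times its value is that feature’s contribution, and sorting features by the size of their contribution gives a crude reason-code list. The weights, feature names and values below are all invented for illustration:

```python
# Hypothetical model weights and one user's feature values
weights = {"visits": -0.6, "age": 0.02, "support_tickets": 0.5}
user = {"visits": 2, "age": 40, "support_tickets": 3}

def score_with_reasons(weights, features):
    # Contribution of each feature to the raw prediction
    contrib = {k: weights[k] * features[k] for k in weights}
    raw = sum(contrib.values())
    # "Reason codes": features ranked by absolute contribution
    reasons = sorted(contrib, key=lambda k: abs(contrib[k]), reverse=True)
    return raw, reasons

raw_score, reason_codes = score_with_reasons(weights, user)
```

If `support_tickets` tops the reason list for a churn prediction, that points investigation toward support quality rather than, say, pricing – exactly the kind of signal described above.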
Step 6: Driving business goals with data
Back to business. Once you’ve got your predictions, you can work the results into your workflows in any number of ways. You can take direct action on a given set of results – e.g., fix an application problem made obvious from the data – or you can generate reports and other metrics to guide your strategic planning going forward.
And since we’re now comfortable collecting, analyzing and acting on data, we can do one more thing with our new prediction equation: continuously test real-world results against our models and, when appropriate, tweak, update and otherwise improve them. As the actionable results from the original investigation are put into place, the models keep generating value – a virtuous cycle of data-driven goodness.
Have more questions about data analytics? Think you’ve got a project in mind? Connect with me – I’d be more than happy to help.