Many companies are striving to be more data-driven. Software development is a key part of integrating data-driven decision-making into your organization. Therefore, it helps to understand the analytics life cycle. The central part of this life cycle is analyzing the data. That could be with data analytics, a.k.a. descriptive statistics, or advanced analytics, a.k.a. inferential statistics. Both have their place, and both follow the same overall activities. Let’s look at those overall activities.
First, realize that what we’re going to say is merely a guideline. Unlike hair care, the process of analyzing data is messy, complicated, and full of trial and error, for the simple reason that we are trying to understand the world, and life is messy, complicated, and full of trial and error. With that caveat in mind, though, this framework will be very useful.
One way to break out the Analytics Life Cycle at this 30,000-foot view is with five steps: Define the Question, Collect the Data, Analyze the Data, Present the Results, and Maintain the Analytics. As the list below shows, there are many subtasks under each step.
- Define the Question
  - Business question
  - Desired outcome
  - Scope of data
  - Organizational benefit
- Collect the Data
  - Identify data
  - Access data
  - Extract data
  - Document data
- Analyze the Data
  - Explore data
  - Prepare data
  - Create model
  - Validate model
  - Document model
- Present the Results
  - Write story
  - Present results
  - Document presentation
- Maintain the Analytics
  - Track results
  - Measure deviations
  - Determine business changes
  - Update model
Much like managing people, data science is an art as well as a process. Even if there is one true answer, it can be hard to tell whether we have found it. The goal of performing analytics as a data scientist is to find the best answer with the information we have access to. To see this better, let’s consider two use cases where analytics has long been used: Regulatory Reporting and Customer Insights.
As an example, let’s consider mortgages to purchase a home. Federal regulations, such as the Fair Housing Act, prohibit discrimination in all aspects of residential real estate lending. Suppose that we work for a financial institution that issues such loans. As data scientists at this institution, we are tasked with answering the business question of whether we are discriminating or not and then reporting our results to the federal government and management.
The first thing we do is collect the data on each person applying for a loan, usually stored in a relational database. This would include financial factors (e.g., debts as a percentage of income, percentage of value borrowed, credit score), loan characteristics (e.g., conventional vs. government, length of loan), and, of course, demographics, such as age, race, sex, national origin, marital status and religion. For each person, we also include whether they received the loan or not (decision).
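To make the collection step concrete, here is a minimal sketch of what such a record might look like in a relational database. The table name, columns, and sample values are all hypothetical, and a real institution would track many more fields:

```python
import sqlite3

# Hypothetical schema for loan applications; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE loan_applications (
        applicant_id   INTEGER PRIMARY KEY,
        debt_to_income REAL,     -- debts as a percentage of income
        loan_to_value  REAL,     -- percentage of value borrowed
        credit_score   INTEGER,
        loan_type      TEXT,     -- e.g., 'conventional' or 'government'
        term_years     INTEGER,  -- length of loan
        age            INTEGER,
        race           TEXT,
        sex            TEXT,
        decision       TEXT      -- 'approved' or 'denied'
    )
""")
conn.execute(
    "INSERT INTO loan_applications VALUES (?,?,?,?,?,?,?,?,?,?)",
    (1, 0.32, 0.80, 710, "conventional", 30, 45, "Race A", "F", "approved"),
)
row = conn.execute(
    "SELECT credit_score, decision FROM loan_applications WHERE applicant_id = 1"
).fetchone()
print(row)  # (710, 'approved')
```

Storing one row per applicant, including the decision, is what lets us build the two-way tables discussed next.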
It is important to account for all factors when doing this analysis, and the statistical techniques used by actual institutions can do that. We will keep it simple and consider a two-way table of race vs. decision. The analysis calculates the number of people of each race who received the loan and the number who did not. We can then calculate test statistics that determine whether the differences in the percentages for each race are statistically significant or not. If they are not statistically significant, it could be that there is no bias in the lending practices, or it could be that there is not enough data to detect a bias that does exist. If the results are statistically significant, then there is an indication that some races are treated differently than other races (all other things being equal).
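A minimal sketch of such a significance test, using a hand-rolled Pearson chi-square statistic on a two-group approved/denied table (the counts are made up, and a real analysis would use a statistics library and control for the other factors):

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    given as [[a_approved, a_denied], [b_approved, b_denied]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 80/20 approved/denied for one group, 60/40 for another.
stat = chi_square([[80, 20], [60, 40]])
critical = 3.841  # chi-square critical value for df=1 at alpha=0.05
print(round(stat, 2), stat > critical)  # 9.52 True
```

Here the statistic exceeds the critical value, so the approval-rate difference between the two groups would be flagged as statistically significant and investigated further.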
We would then create a report with descriptive statistics about the data and the results of our inferential statistical analysis. This report would go to the proper authorities for review.
In this example, the collection of the data and the presentation of the results are pretty standard. All the information is collected on the loan application, and the reporting is regulated by the federal government. The analysis itself, on the other hand, can be the most complicated and contentious step. It is hard to admit there is bias, even if it is unintentional. Since analytics can be part art, such as when cleaning data or selecting the right model, there are often competing choices (which shade of green to paint your walls, for example), and people can have different, legitimate preferences.
A more complicated analysis to perform is to model customer behavior. Suppose we are data scientists working for a telecom company. New customers increase revenue, and customers who leave reduce it. Customer attrition, or churn, can be caused by many things, some of which are in our control. For example, maybe a customer had bad service, had bad experiences with the call center, or thinks the price is too high; for those reasons and many others, a customer may leave. As data scientists at that company, we are tasked with the business goal of reducing the number of customers who leave. With our simple framework in mind, let’s consider the steps we would take to achieve that goal.
The first step is collecting the data. In our scenario, this is probably going to be very hard. We don’t necessarily know why the customers leave, so we can’t point to the data that we want. Therefore, we have to collect all the customer interaction data points that we can. Since customers interact with many parts of our organization, we have to identify, access and combine data from each of those, probably siloed, data stores. We will want information about the customers themselves (e.g., name, address, job title, employment status, marital status and age) and data from Finance about their purchases and payments (e.g., product portfolio, payment history, length of service and lifetime value). We would also want data from their interactions with the company (e.g., website, call center, and in-store contacts). Unless the telecom company has been working for many years to create one view of the customer, and a 360-degree view at that, it will be very hard to pull all this data together into one customer record.
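Pulling those siloed stores together can be sketched as a merge keyed on a shared customer ID. The source names, fields, and values below are all hypothetical; in practice this is the hard part, because identifiers rarely line up this cleanly across systems:

```python
# Hypothetical extracts from three siloed systems, keyed by customer ID.
crm = {"C001": {"name": "Ada Example", "age": 42, "marital_status": "married"}}
finance = {"C001": {"payment_history": "on_time", "lifetime_value": 1860.00}}
interactions = {"C001": {"call_center_contacts": 3, "web_logins": 17}}

def build_customer_record(customer_id, *sources):
    """Combine per-system records into one 360-degree customer view."""
    record = {"customer_id": customer_id}
    for source in sources:
        record.update(source.get(customer_id, {}))
    return record

record = build_customer_record("C001", crm, finance, interactions)
print(record["lifetime_value"], record["call_center_contacts"])  # 1860.0 3
```

Each later source simply overlays the earlier ones here; a real integration effort also has to reconcile conflicting values and mismatched keys.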
But when that is done(!), we start the analysis. The first steps are usually descriptive statistics to answer simple questions about customer behavior. What were the top reasons customers churned? Did they churn voluntarily or not? How long do they stay customers before they leave? Then we move to advanced analytics and build models that accurately predict each customer’s probability of churning. Here, though, it’s not just whether they will churn but also the likely reasons why.
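The descriptive questions above come down to simple counting and averaging. A sketch over hypothetical churn records (the reasons, counts, and tenures are invented):

```python
from collections import Counter

# Hypothetical records for customers who churned last quarter.
churned = [
    {"reason": "price", "voluntary": True, "tenure_months": 14},
    {"reason": "bad service", "voluntary": True, "tenure_months": 7},
    {"reason": "price", "voluntary": True, "tenure_months": 25},
    {"reason": "moved away", "voluntary": False, "tenure_months": 31},
    {"reason": "price", "voluntary": True, "tenure_months": 9},
]

top_reasons = Counter(c["reason"] for c in churned).most_common(2)
voluntary_rate = sum(c["voluntary"] for c in churned) / len(churned)
avg_tenure = sum(c["tenure_months"] for c in churned) / len(churned)
print(top_reasons, voluntary_rate, avg_tenure)
```

Answers like these (top reasons, voluntary share, average tenure) guide which predictive model to build next.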
Once we build the model, then we will apply it every month to our active customers. From the model, we get a probability of leaving. We might decide that any probability over 80% is important for us to look at. Then we determine from the model, and the descriptive statistics we did, the likely reason(s) that they will churn. This information by itself is not enough, though. We have to act on it and do so early enough that we can affect their behavior. We want to do something that stops them from leaving.
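That monthly scoring pass might look like the sketch below. The 80% threshold and the idea of attaching a likely reason come from the discussion above; the function name, field names, and scores are our own invention:

```python
def flag_at_risk(scored_customers, threshold=0.80):
    """Return customers whose modeled churn probability exceeds the
    threshold, with the likely reason attached for the retention team."""
    return [c for c in scored_customers if c["churn_probability"] > threshold]

# Hypothetical output of a monthly model run over active customers.
scored = [
    {"customer_id": "C001", "churn_probability": 0.91, "likely_reason": "price"},
    {"customer_id": "C002", "churn_probability": 0.35, "likely_reason": "bad service"},
    {"customer_id": "C003", "churn_probability": 0.84, "likely_reason": "call center"},
]

at_risk = flag_at_risk(scored)
print([c["customer_id"] for c in at_risk])  # ['C001', 'C003']
```

The threshold is a business choice, not a statistical one: lowering it catches more at-risk customers at the cost of more retention-team effort.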
The information about the probability of, and reason for, leaving is passed on to the customer retention team. The presentation of results could, in a less mature organization, be done via a spreadsheet. In a more mature organization that has become more digital, those results will be available through an API into an interface that the customer retention team accesses. They will engage with the customer according to the likely reason for churn. Maybe they will offer a promotion that locks the customer in for a few more years, or a gift card as a thank-you and an apology for a recent call center debacle.
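Engaging "according to the likely reason" can be as simple as a lookup from reason to action. The playbook below is entirely hypothetical, as is the fallback behavior:

```python
# Hypothetical playbook mapping a likely churn reason to a retention action.
RETENTION_PLAYBOOK = {
    "price": "offer multi-year promotional rate",
    "call center": "send gift card with apology",
    "bad service": "schedule technician follow-up",
}

def recommend_action(likely_reason):
    """Pick a retention action; unrecognized reasons go to a human."""
    return RETENTION_PLAYBOOK.get(likely_reason, "route to retention specialist")

print(recommend_action("price"))    # offer multi-year promotional rate
print(recommend_action("unknown"))  # route to retention specialist
```

Keeping the playbook as data rather than code makes it easy for the retention team, not the data scientists, to own and update it.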
After the intervention, the results are tracked. Did they stay, or did they still leave? If they stayed, how long did they stay? The data from these kinds of questions are integrated back into the model. Just knowing whether and why they churn is not enough. Our true desired outcome is for them to remain valuable customers for a long time.
These two quite different examples show the high-level analytics life cycle. Both go through the steps of collecting the data, analyzing the data, and presenting the results, though how each step is done differs between the examples. Wherever possible, automating the process is the most desirable approach, and that is where software development can have its greatest impact. Once you start combining advanced analytics and automation, you get a form of specific (narrow) artificial intelligence. Once that is done for one business process, we can start on the next one. There are always more decisions that would benefit from enhanced data analysis.