Survival Data Mining for Customer Insight

Data mining techniques that have proven their worth in smaller applications are now crossing over into mainstream business computing. A practical approach will help you better understand customer behavior and reduce churn.

How long will an upgrade last? When customers upgrade to a new service, sometimes they eventually downgrade again. This is a competing risks problem, because customers might downgrade or stop during the period when they have upgraded. Upgrade survival curves are very useful for quantifying the value of the upgrade effort.

As the examples in this article show, survival data mining is a valuable tool for understanding customers and quantifying customer relationships. The basic techniques, borrowed from the statistics of medical studies, are moving far beyond the small studies that were their first applications. Survival data mining techniques are now proving their worth in the larger business world.

Calculating Hazards in a Database

Let's take a closer look at how survival data mining works with a database-in this case, running with Oracle's DBMS. Assume that a database contains one row for each customer with the following information:

  • Start_date
  • Stop_date (NULL is not stopped)
  • Other interesting variables such as stop reason, channel, and so on.

How is this data used to calculate hazards? Fortunately, SQL does most of the calculations; Oracle extensions make the full calculation possible. The first thing is to calculate the tenure and the stop flag:

SELECT ((case when stop_date is NULL then < today >
else stop_date end) - start_date) as tenure,
(case when stop_date is NULL then 0
else 1 end) as is_stopped
FROM customers

The next step is to aggregate these fields by tenure. This gives the number of customers with exactly each tenure and the number that stopped with the tenure (some customers with the tenure will still be active):

SELECT ((case when stop_date is NULL then < today >
else stop_date end) - start_date) as tenure,
count(*) as pop_at_t,
sum(case when stop_date is NULL then 0
else 1 end) as num_stopped
FROM customers
GROUP BY ((case when stop_date is NULL then < today >
else stop_date end) - start_date)

At this point, you could continue the calculation in a spreadsheet. However, Oracle's analytic functions make it possible to calculate the total population at risk, and thus the hazard. The total population is the sum of pop_at_t for all tenures greater than or equal to t. The hazard is num_stopped divided by this total. The following query does this calculation (assuming the previous query is the subquery):

SELECT tenure,
sum(pop_at_t) over
(order by tenure desc range unbounded preceding),
num_stopped /
(sum(pop_at_t) over
(order by tenure desc range unbounded preceding))
FROM < subquery >
GROUP BY tenure
ORDER BY tenure

This example shows how hazards can be calculated directly from the database. This technique, however, does require SQL extensions, such as those that support Oracle's analytic functions.

Gordon S. Linoff [[email protected]] is founder and principal of Data Miners and is an expert in data mining and data warehousing. He is coauthor of Mining the Web, Data Mining Techniques, and Mastering Data Mining, all published by John Wiley & Sons. Prior to Data Miners, Linoff worked with Thinking Machines Corp.