Using Spark to Predict Churn

Chrysanthi Polyzoni
4 min read · Sep 17, 2021

We can use Spark to manipulate large, realistic datasets and engineer relevant features for predicting churn. In this example project, we learn how to use Spark MLlib to build machine learning models on large datasets, going well beyond what is possible with single-machine libraries such as scikit-learn.

For the sake of simplicity and to stay within budget, this project only demonstrates Spark's capabilities on a smaller subset of the data running locally, rather than on a distributed cluster in the cloud, something I hope to do soon.

"Sparkify" is a fictional music streaming service generated for educational purposes by Udacity. Here's the link to my GitHub, where you can find all of the source code.

Load and Clean Dataset

To load a dataset in Python with PySpark, you first need to create a Spark session. We use the SparkSession to create DataFrames, register DataFrames as tables, execute SQL expressions over tables, cache tables, and read Parquet files. Here we create the session used to load the provided data:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("sparkify") \
    .config("config option", "config value") \
    .getOrCreate()
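
With the session in place, the event log can be read straight into a DataFrame. A minimal sketch, assuming the data is provided as a JSON event log (the file name below is an assumption; point it at your copy of the Sparkify data):

# assumed file name for the Sparkify event log; adjust the path to your copy
log_events = spark.read.json("mini_sparkify_event_data.json")

log_events.printSchema()   # inspect the available columns
print(log_events.count())  # number of logged events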

The dataset did not require much cleaning. We mainly dropped rows with missing identifiers and filled the null values in the song, artist, and length columns with placeholder values such as 'no_song', 'no_artist', and a song length of 0.
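
A minimal sketch of that cleaning step, assuming the standard Sparkify column names (userId, sessionId, song, artist, length):

# drop events that cannot be tied to a user or session
log_events = log_events.dropna(subset=["userId", "sessionId"])

# fill the remaining nulls with placeholder values
log_events = log_events.fillna({"song": "no_song", "artist": "no_artist", "length": 0})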

Exploratory Data Analysis

We share some visualisations below. Slightly more female than male users are active on the streaming platform, and the two groups have similar churn rates.
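
The counts behind that plot come from a simple aggregation, sketched here (one row per user, then a count by gender):

# deduplicate to one row per user, then count users by gender
log_events.select("userId", "gender") \
    .dropDuplicates() \
    .groupBy("gender") \
    .count() \
    .show()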

The distribution of locations is also interesting, with most of our users located in California and New York. However, no location data has been used in our models, in order to keep the computation fast.

We have also looked at the average length of time each user spent on the platform, shown below. Most users spent roughly 200 minutes on the platform between 1 October 2018 and 3 December 2018.

We also plotted the box-plot distributions of all the different page events in our dataset. We used the "Downgrade" and "Cancellation Confirmation" page events to define the churn label for our models; a minimal way to build that flag is sketched below.
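
A minimal sketch of that flag, assuming the page column holds the event names (only "Cancellation Confirmation" is shown; "Downgrade" can be flagged the same way):

from pyspark.sql import functions as F

# 1 for events that signal churn, 0 otherwise
log_events = log_events.withColumn(
    "churn",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
)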

It also looks like Windows and Macintosh devices have been primarily used to access the Sparkify streaming platform.

Feature Engineering

We created a phase feature using window functions. Window functions compute a value for each row from a group of related rows in a DataFrame. When defining the window, we choose how to group the rows (with partitionBy), how to sort them within each group (with orderBy), and how wide a window to use (described by rangeBetween or rowsBetween). Here, a cumulative window over each user's events propagates the churn flag into the phase column.

from pyspark.sql import Window

def create_phase_col(log_events):
    # for each user, order events by timestamp and cumulatively sum the churn flag
    windowval = Window.partitionBy("userId").orderBy(F.desc("ts")).rangeBetween(Window.unboundedPreceding, 0)
    log_events = log_events.withColumn("phase", F.sum("churn").over(windowval))
    return log_events

We have also created a device column by applying a user-defined function (udf) together with a lambda, as shown below:

def get_device_col(log_events):
    '''
    Create a device column in the log_events Spark DataFrame.
    INPUT:  log_events DataFrame
    OUTPUT: log_events DataFrame with a device column
    '''
    # pull the platform token (e.g. "Windows" or "Macintosh") out of the userAgent string
    get_device = F.udf(lambda x: x.split("(")[1].replace(";", " ").split(" ")[0])
    log_events = log_events.withColumn("device", get_device(log_events["userAgent"]))
    return log_events

Training and Evaluating

We have used three different machine learning classifiers: Logistic Regression, Random Forest, and Gradient Boosted Trees, evaluated with the F1 score because we are dealing with imbalanced labels. We used 60% of our dataset for training and 40% for testing.
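
A minimal sketch of the training and evaluation loop, assuming the engineered features have already been assembled into a user-level DataFrame (called features_df here, with a features vector column and a label column; only Logistic Regression is shown, and the other classifiers are swapped in the same way):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# features_df is assumed: one row per user with "features" and "label" columns
train, test = features_df.randomSplit([0.6, 0.4], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

# score both splits with the F1 metric
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
print("train F1:", evaluator.evaluate(model.transform(train)))
print("test F1:", evaluator.evaluate(model.transform(test)))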

Logistic Regression: Train F1 0.935 | Test F1 0.896

Random Forest: Train F1 0.938 | Test F1 0.963

Gradient Boosted Tree: Train F1 1.0 | Test F1 0.926

Conclusion

Although documentation is still limited, Spark's machine learning library lets us process large amounts of data and gain insights faster and more cheaply. Spark is fast and intuitive, brings the power of advanced analytics, and improves access to Big Data, but the set of built-in algorithms is limited.
