The Data Cleaning Process

The following is my solution to a practice exam paper with this brief:

An important part of a data scientist’s toolbox is the ability to clean data. To assess your ability to do this, you are required to explain the key goals of data cleaning and how it is applied in tidymodels.

Write a brief report explaining data cleaning and how to apply it in tidymodels. The report should be less than two pages.

Data Cleaning

Why Data Cleaning

Cleaning data by removing skewness and outliers can result in significant increases in model performance. Some models, such as tree-based models, are not as affected by irregularities in the underlying data as, for example, a linear regression. It is important, then, that data cleaning is done with the chosen model in mind.

Reproducibility

All data cleaning steps should be reproducible. A simple way to record and replay a reproducible set of actions is through a recipe, provided by the recipes package in the R framework tidymodels.
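As a brief illustration, the sketch below shows the basic recipe workflow. It borrows the cars2010 data from the example later in this report and assumes a numeric response column FE; both names are illustrative assumptions.

library(tidymodels)

# A recipe is a declarative record of cleaning steps. prep() estimates
# anything the steps need from the training data, and bake() applies
# the trained steps. (FE as the response is an assumption.)
rec <- recipe(FE ~ ., data = cars2010)

cleaned <- rec %>%
  prep() %>%
  bake(new_data = NULL)

Individual cleaning actions are then added to the recipe with step_*() functions, as shown in the sections below.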

Missing Data

Data may be missing from the dataset for a few reasons. We can define two types of missing data: missing at random (MAR) and missing completely at random (MCAR). MCAR is associated with the data collection process: when data has gone missing in this way, the missingness is completely random and has no relation to any other feature, observed or unobserved. Assuming the features are missing completely at random, there are a number of ways of proceeding [1]:

  1. Discard observations with any missing values.
  2. Rely on the learning algorithm to deal with missing values in its training phase.
  3. Impute all missing values before training.

(1) is a good option if there is only a limited amount of missing data. Since we want to maximise the amount of data available for training, it is a poor choice when a large proportion is missing. (2) only applies to some models and thus isn't always an option. (3) is the most common option. An easy way to impute missing values is to replace them with the mean or median of the non-missing values of that feature. An alternative is to fit a model that predicts the missing values of each feature from the other features, for example using the CART method. Once the data has been imputed, it is treated as though it had been observed.

Steps that can be added to the recipe include step_impute_mean() (formerly step_meanimpute()) for numerical variables and step_impute_mode() (formerly step_modeimpute()) for categorical variables; step_impute_bag(), which uses bagged trees, covers the model-based approach. A sketch follows.
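A hedged sketch of such an imputation recipe, reusing the cars2010 data from the example below and again assuming a numeric response FE:

library(tidymodels)

# Mean-impute numeric predictors and mode-impute categorical ones.
# (The cars2010 columns and the FE response are assumptions.)
impute_rec <- recipe(FE ~ ., data = cars2010) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())

imputed <- impute_rec %>% prep() %>% bake(new_data = NULL)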

Variable Conversion

When the dataset is first received, it is also important to make sure the features are encoded properly. For example, the R function as.numeric() will convert a column to a numeric variable. Commonly, a feature of type 'character' should be converted to a factor, for example with mutate_if(is.character, factor) or its modern equivalent, mutate(across(where(is.character), factor)).
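A small sketch of these conversions, where num_cyl is a hypothetical column name:

library(dplyr)

cars2010 <- cars2010 %>%
  # coerce a mis-typed column to numeric (num_cyl is hypothetical)
  mutate(num_cyl = as.numeric(num_cyl)) %>%
  # convert every character column to a factor
  mutate(across(where(is.character), factor))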

Other problems that may occur at this stage are factors with a large number of levels, or mis-labelled levels: for example, 4wd and 4WD should be treated as the same level. There may also be multiple sub-types that are not useful, or that over-complicate the analysis. For example, front 6 and front 4 could both be considered front wheel drive in order to simplify the model. To do this we could use code like the following:

library(dplyr)
library(stringr)

cars2010 <- cars2010 %>%
  mutate(
    # collapse the detailed drive descriptions into three broad levels
    drive = case_when(
      str_detect(drive_desc, "Front") ~ "front",
      str_detect(drive_desc, "Rear") ~ "rear",
      TRUE ~ "4WD"
    )
  )

This kind of strategy is also useful for fixing data that has been entered incorrectly. A common problem during data collection is inconsistent entry: for example, two different people might type 'Front WD' and 'Front' when they both mean 'Front Wheel Drive'. This would lead to many different factor levels in R that should be identical, and it needs to be fixed during data cleaning using the techniques shown above.
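A useful first pass, sketched below, is to normalise case and whitespace before collapsing levels; afterwards variants such as 'Front WD ' and 'front wd' compare equal.

library(dplyr)
library(stringr)

cars2010 <- cars2010 %>%
  # lower-case and trim entries so variants of the same label match
  mutate(drive_desc = str_to_lower(str_trim(drive_desc)))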

Data Transformations

The data may also contain variables that are skewed, or variables that are unstable for the model. In the case of skewness, the Box-Cox method can be added to the recipe: it identifies the power transformation that best reduces the skewness, and it only applies to strictly positive values. This can be added with step_BoxCox(). Other transformations, like centring and scaling the data, can also be useful, but they make the results more difficult to interpret as the units have been changed [2]. These can be added to the recipe with step_center() and step_scale(), or both at once with step_normalize().
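A sketch of these transformation steps in a recipe, under the same cars2010 and FE assumptions as the earlier examples:

library(tidymodels)

# Reduce skewness first, then centre and scale in one step.
# step_BoxCox() only transforms strictly positive columns.
transform_rec <- recipe(FE ~ ., data = cars2010) %>%
  step_BoxCox(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

transformed <- transform_rec %>% prep() %>% bake(new_data = NULL)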


  1. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2009.

  2. M. Kuhn and K. Johnson. Applied Predictive Modeling. Springer, New York, 2013.