Real-world data is often messy and full of missing values. As a result, data scientists spend the majority of their time <a href="https://appsilon.com/data-cleaning-in-r/" target="_blank" rel="noopener"><strong>cleaning and preparing the data</strong></a>, and have less time to focus on predictive modeling and machine learning. If there's one thing all data preparation steps share, it's <strong>dealing with missing data</strong>. Today we'll make this process a bit easier for you by introducing three approaches to <strong>data imputation in R</strong>. After reading this article, you'll know several approaches for imputation in R and for tackling missing values in general. Choosing an optimal approach often comes down to experimentation and domain knowledge, so we can only take you so far.
<blockquote>Interested in Deep Learning? <a href="https://appsilon.com/visualize-pytorch-neural-networks/" target="_blank" rel="noopener">Learn how to visualize PyTorch neural network models</a>.</blockquote>
Table of contents:
<ul><li><a href="#introduction">Introduction to Imputation in R</a></li><li><a href="#value-imputation">Simple Value Imputation in R with Built-in Functions</a></li><li><a href="#mice-imputation">Impute Missing Values with MICE</a></li><li><a href="#missforest">Imputation with missForest Package</a></li><li><a href="#summary">Summary</a></li></ul>
<hr />
<h2 id="introduction">Introduction to Imputation in R</h2>
In the simplest terms, imputation is the process of replacing missing or <code>NA</code> values in your dataset with values that can be processed, analyzed, or passed into a machine learning model. There are numerous ways to perform imputation in the R programming language, and choosing the best one usually boils down to domain knowledge. <b>Picture this</b> - there's a column in your dataset that stands for the amount the user spends on a phone service X. Values are missing for some clients, but what's the reason?
Can you impute them with a simple mean? Well, you can't, at least not without asking a business question first - <b>Why are these values missing?</b> Most likely, the user isn't using that phone service, so imputing missing values with the mean would be a terrible idea. Let's examine our data for today. We'll use the training portion of the Titanic dataset and try to impute missing values for the <code>Age</code> column. Start with the imports:
<pre><code class="language-r">library(ggplot2)
library(dplyr)
library(titanic)
library(cowplot)

# Print the Age column of the training set
titanic_train$Age
</code></pre>
You can see some of the possible values below:
<img class="size-full wp-image-16877" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2bbc2d0cc186c274b72_ddef6828_1.webp" alt="Image 1 - Possible Age values of the Titanic dataset" width="1300" height="592" />
Image 1 - Possible Age values of the Titanic dataset
There's a fair amount of <code>NA</code> values, and it's our job to impute them. They're most likely missing because the creator of the dataset had no information on the person's age. If you were to build a machine learning model on this dataset, the best way to evaluate the imputation technique would be to measure classification metrics (accuracy, precision, recall, F1) after training the model.
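It can also help to quantify how much is actually missing before choosing a technique. The snippet below is a minimal sketch in base R; the <code>age</code> vector is made up for illustration and only stands in for a column such as <code>titanic_train$Age</code>.

```r
# Made-up vector standing in for a column with missing values
# (an assumption for illustration, not the actual Titanic data)
age <- c(22, 38, NA, 35, NA, 54, 2, NA)

# is.na() returns a logical vector; summing it counts the NA entries
n_missing <- sum(is.na(age))

# Taking the mean of the same logical vector gives the share of NA entries
share_missing <- mean(is.na(age))

n_missing      # 3
share_missing  # 0.375
```

The same idea extends to a whole data frame with <code>colSums(is.na(df))</code>, which returns one count per column.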
But before diving into the imputation, let's visualize the distribution of our variable:
<pre><code class="language-r">ggplot(titanic_train, aes(Age)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  ggtitle("Variable distribution") +
  theme_classic() +
  theme(plot.title = element_text(size = 18))
</code></pre>
The histogram is displayed in the figure below:
<img class="size-full wp-image-16879" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2bcd16df45f371e3dc9_ce7b2fa3_2.webp" alt="Image 2 - Distribution of the Age variable" width="2174" height="1782" />
Image 2 - Distribution of the Age variable
<b>So, why is this important?</b> It's a good idea to compare variable distribution before and after imputation. You don't want the distribution to change significantly, and a histogram is a good way to check that.
<blockquote>Don't know the first thing about histograms? <a href="https://appsilon.com/ggplot2-histograms/" target="_blank" rel="noopener">Our detailed guide with ggplot2 has you covered</a>.</blockquote>
We'll now explore a suite of basic techniques for imputation in R.
<h2 id="value-imputation">Simple Value Imputation in R with Built-in Functions</h2>
You don't actually need an R package to impute missing values. You can do the whole thing manually, provided the imputation techniques are simple. We'll cover constant, mean, and median imputations in this section and compare the results. The <code>value_imputed</code> variable will store a <code>data.frame</code> of the imputed ages. The imputation itself boils down to replacing the entries of a column that are <code>NA</code> with the value of our choice.
We'll try three options:
<ul><li><b>Zero</b>: constant imputation, feel free to change the value.</li><li><b>Mean (average)</b>: average age after all <code>NA</code> values are removed.</li><li><b>Median</b>: median age after all <code>NA</code> values are removed.</li></ul>
Here's the code:
<pre><code class="language-r">value_imputed <- data.frame(
  original = titanic_train$Age,
  # Replace every NA with a constant
  imputed_zero = replace(titanic_train$Age, is.na(titanic_train$Age), 0),
  # Replace every NA with the mean of the observed ages
  imputed_mean = replace(titanic_train$Age, is.na(titanic_train$Age), mean(titanic_train$Age, na.rm = TRUE)),
  # Replace every NA with the median of the observed ages
  imputed_median = replace(titanic_train$Age, is.na(titanic_train$Age), median(titanic_train$Age, na.rm = TRUE))
)
value_imputed</code></pre>
We now have a dataset with four columns representing the age:
<img class="size-full wp-image-16881" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2bdd16df45f371e3fa9_8f817a52_3.webp" alt="Image 3 - Results of the basic value imputation" width="978" height="416" />
Image 3 - Results of the basic value imputation
Let's take a look at the variable distribution changes introduced by imputation on a 2x2 grid of histograms:
<pre><code class="language-r">h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_histogram(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original distribution") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = imputed_zero)) +
  geom_histogram(fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Zero-imputed distribution") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_mean)) +
  geom_histogram(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mean-imputed distribution") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_median)) +
  geom_histogram(fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Median-imputed distribution") +
  theme_classic()

# Arrange the four histograms on a 2x2 grid (cowplot)
plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)</code></pre>
Here's the output:
<img class="size-full wp-image-16883" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2be3778b40ed077f5ab_6c78818b_4.webp" alt="Image 4 - Distributions after the basic value imputation" width="2470" height="1840" />
Image 4 - Distributions after the basic value imputation
All imputation methods severely impact the distribution. There are a lot of missing values, so setting a single constant value doesn't make much sense. Zero imputation is the worst, as it's highly unlikely for close to 200 passengers to have the age of zero. Maybe mode imputation would provide better results, but we'll leave that up to you.
<h2 id="mice-imputation">Impute Missing Values in R with MICE</h2>
MICE stands for <i>Multivariate Imputation by Chained Equations</i>, and it's one of the most widely used imputation packages among R users. It assumes the missing values are missing at random (MAR). The basic idea behind the algorithm is to treat each variable that has missing values as a dependent variable in a regression and treat the others as independent variables (predictors). You can learn more about MICE in <a href="https://www.researchgate.net/publication/44203418_MICE_Multivariate_Imputation_by_Chained_Equations_in_R" target="_blank" rel="nofollow noopener">this paper</a>. The R <code>mice</code> package provides many <a href="https://www.rdocumentation.org/packages/mice/versions/3.14.0/topics/mice" target="_blank" rel="nofollow noopener">univariate imputation methods</a>, but we'll use only a handful. First, let's import the package and subset only the numerical columns to keep things simple.
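The regression idea described above can be illustrated without any package. The sketch below is a deliberately simplified, single-pass version on made-up data: it fits one linear model on the observed rows and fills the gaps with its predictions. It is not the algorithm <code>mice</code> runs, which cycles over the variables repeatedly and adds random variation to the imputed values. The data frame <code>toy</code> and its column names are assumptions for illustration only.

```r
# Made-up data: 'age' has gaps, the two predictors are complete.
# All names and values are hypothetical, not taken from the Titanic dataset.
toy <- data.frame(
  age    = c(22, 38, NA, 35, NA, 54, 2, 27, NA, 14),
  pclass = c(3, 1, 3, 1, 3, 1, 3, 3, 2, 2),
  sibsp  = c(1, 1, 0, 1, 0, 0, 3, 0, 1, 1)
)

# Logical index of the rows where age is missing
missing <- is.na(toy$age)

# Treat 'age' as the dependent variable and the rest as predictors.
# lm() drops the rows with a missing response by default.
fit <- lm(age ~ pclass + sibsp, data = toy)

# Keep observed ages as they are and fill only the gaps with predictions
toy$age_imputed <- toy$age
toy$age_imputed[missing] <- predict(fit, newdata = toy[missing, ])

toy
```

Because every gap is filled with a point prediction, this version understates the natural spread of the variable, which is one reason to prefer a dedicated package. With that intuition in place, back to the Titanic subset.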
Only the <code>Age</code> attribute contains missing values:
<pre><code class="language-r">library(mice)

# Keep only the numerical columns
titanic_numeric <- titanic_train %>%
  select(Survived, Pclass, SibSp, Parch, Age)

md.pattern(titanic_numeric)</code></pre>
The <code>md.pattern()</code> function gives us a visual representation of missing values:
<img class="size-full wp-image-16885" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2bf842ddef80cb68800_1a85356c_5.webp" alt="Image 5 - Missing map" width="1116" height="560" />
Image 5 - Missing map
On to the imputation now. We'll use the following MICE imputation methods:
<ul><li><b>pmm</b>: Predictive mean matching.</li><li><b>cart</b>: Classification and regression trees.</li><li><b>lasso.norm</b>: Lasso linear regression.</li></ul>
Once again, the results will be stored in a <code>data.frame</code>:
<pre><code class="language-r">mice_imputed <- data.frame(
  original = titanic_train$Age,
  # Each call runs mice with one method and extracts the completed Age column
  imputed_pmm = complete(mice(titanic_numeric, method = "pmm"))$Age,
  imputed_cart = complete(mice(titanic_numeric, method = "cart"))$Age,
  imputed_lasso = complete(mice(titanic_numeric, method = "lasso.norm"))$Age
)
mice_imputed</code></pre>
Let's take a look at the results:
<img class="size-full wp-image-16887" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2c04e6cc45db1f9f65b_4dfc929f_6.webp" alt="Image 6 - Results of MICE imputation" width="956" height="426" />
Image 6 - Results of MICE imputation
It's hard to judge from the table data alone, so we'll draw a grid of histograms once again (copy and modify the code from the previous section):
<img class="size-full wp-image-16889" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2c1519e83c01afa246d_936d178c_7.webp" alt="Image 7 - Distributions after the MICE imputation" width="2582" height="1944" />
Image 7 - Distributions after the MICE imputation
The imputed distributions overall look much closer to the
original one. The CART-imputed age distribution probably looks the closest. Also, take a look at the last histogram - the age values go below zero. This doesn't make sense for a variable such as age, so you will need to correct the negative values manually if you opt for this imputation technique. That covers MICE, so let's take a look at another R imputation approach - Miss Forest.
<h2 id="missforest">Imputation with R missForest Package</h2>
The Miss Forest imputation technique is based on the <a href="https://appsilon.com/r-mnist-random-forests/" target="_blank" rel="noopener">Random Forest</a> algorithm. It's a non-parametric imputation method, which means it doesn't make explicit assumptions about the functional form of the relationship between variables, but instead tries to estimate it in a way that's closest to the data points. In other words, it builds a random forest model for each variable and then uses the model to predict missing values. You can learn more about it by reading the <a href="https://academic.oup.com/bioinformatics/article/28/1/112/219101" target="_blank" rel="nofollow noopener">original paper, published in Bioinformatics</a>. Let's see how it works for imputation in R.
We'll apply it to the entire numerical dataset and only extract the age:
<pre><code class="language-r">library(missForest)

missForest_imputed <- data.frame(
  original = titanic_numeric$Age,
  # $ximp holds the imputed data frame returned by missForest()
  imputed_missForest = missForest(titanic_numeric)$ximp$Age
)
missForest_imputed</code></pre>
There's no option for different imputation techniques with Miss Forest, as it always uses the random forest algorithm:
<img class="size-full wp-image-16891" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2c18b0a609886a94101_e491d3f8_8.webp" alt="Image 8 - Results of the missForest imputation" width="582" height="426" />
Image 8 - Results of the missForest imputation
Finally, let's visualize the distributions:
<img class="size-full wp-image-16893" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d2c2e9d089718abd0fe4_e1d18d05_9.webp" alt="Image 9 - Distributions after the missForest imputation" width="2338" height="1598" />
Image 9 - Distributions after the missForest imputation
It looks like Miss Forest gravitated towards a constant value imputation, since a large portion of the imputed values is around 35. The distribution is quite different from the original one, which means Miss Forest isn't the best imputation technique we've seen today, at least for this dataset.
<hr />
<h2 id="summary">Summary of Imputation in R</h2>
And that does it for three ways to impute missing values in R. You now have several new techniques in your toolbelt, and these should simplify any data preparation and cleaning process. The imputation approach is almost always tied to domain knowledge of the problem you're trying to solve, so make sure to ask the right business questions when needed. For a homework assignment, we would love to see you build a classification machine learning model on the Titanic dataset, and use one of the discussed imputation techniques in the process. <i>Which one yields the most accurate model?
Which one makes the most sense?</i> Feel free to share your insights in the comment section below and to reach us on Twitter - <a href="http://twitter.com/appsilon" target="_blank" rel="nofollow noopener">@appsilon</a>. We'd love to hear from you. <blockquote>Looking for more guidance on Data Cleaning in R? <a href="https://appsilon.com/data-cleaning-in-r/" target="_blank" rel="noopener">Start with these two packages</a>.</blockquote>
