Data Cleaning in R: 2 R Packages to Clean and Validate Datasets
Real-world datasets are messy. Unless the dataset was created for teaching purposes, it's likely you'll have to spend hours or even tens of hours cleaning it before you can show it on a dashboard. That's where two packages for data cleaning in R come into play - <code>janitor</code> and <code>data.validator</code>. And today you'll learn how to use them together.
If you're a software engineer, think of data cleaning and validation as writing and testing code. Think of data cleaning as coding an app - it takes a huge amount of time to get it working correctly. On the other hand, you can't be sure it'll work as expected until you've tested it properly (validation). They're not two separated concepts, but one is rather an extension of the other.
Regardless if you're a software engineer or a data scientist, combining these two is the way to go.
<blockquote>Join the biggest R Shiny event of the year - <a href="https://appsilon.com/2022-appsilon-shiny-conference/" target="_blank" rel="noopener noreferrer">2022 Appsilon Shiny Conference</a>.</blockquote>
Table of contents:
<ul><li><a href="#data-cleaning">Data Cleaning in R with the Janitor Package</a></li><li><a href="#data-validation">Data Validation in R with the data.validator Package</a></li><li><a href="#summary">Summary</a></li></ul>
<hr />
<h2 id="data-cleaning">Data Cleaning in R with the Janitor Package</h2>
So, what is <code>janitor</code>? Put simply, it's an R package that has simple functions for examining and cleaning dirty data. It can format data frame column names, isolate duplicate and partially duplicate records, isolate empty and constant data, and much more!
We'll use <code>janitor</code> extensively through this section to clean custom datasets, and isolate duplicates of the well-known Iris dataset.
<h3>Cleaning column names</h3>
Imagine you had a dataset with terribly-formatted column names. Would you clean them by hand? Well, that's an option if you only have a couple of them. Real-world datasets oftentimes have hundreds of columns, so the by-hand approach is a no-go.
The <code>janitor</code> package has a nifty <code>clean_names()</code> function, and it's used to reformat data frame column names.
The snippet below creates a data frame with inconsistent column names - some are blank, repeated, or have unwanted characters. <code>janitor</code> cleans them instantly:
<script src="https://gist.github.com/darioappsilon/4f63df46e4cbff749156d4bcb02f0ec1.js"></script>
<img class="size-full wp-image-11594" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abc76ff6678a8b970c73_1.webp" alt="Image 1 - Cleaning column names with janitor" width="1170" height="108" /> Image 1 - Cleaning column names with janitor
<h3>Cleaning column names - Approach #2</h3>
There's another way you could approach cleaning data frame column names - and it's by using the <code>make_clean_names()</code> function.
The snippet below shows a tibble of the Iris dataset:
<script src="https://gist.github.com/darioappsilon/8fd2674d6b2dcd713076317c95636c4e.js"></script>
<img class="size-full wp-image-11596" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abc86ff6678a8b970d1c_2.webp" alt="Image 2 - The default Iris dataset" width="1304" height="364" /> Image 2 - The default Iris dataset
Separating words with a dot could lead to messy or unreadable R code. It's preferred to use underscores instead. <code>janitor</code> can do it automatically for you:
<script src="https://gist.github.com/darioappsilon/7001ee89335c67a45960e43b0ed85bc7.js"></script>
<img class="size-full wp-image-11598" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abc9e45a3a59ab7ef6f8_3.webp" alt="Image 3 - Iris dataset with cleaned column names" width="1310" height="362" /> Image 3 - Iris dataset with cleaned column names
The column names are now much more consistent with what you'd find in other datasets.
<h3>Removing empty data</h3>
It's not rare to get a dataset with missing values. Filling them isn't always straightforward. Approaches like average value imputation are often naive, and you should have good domain knowledge before using them.
Sometimes, it's best to remove missing values altogether. The <code>remove_empty()</code> function does just that - either for rows, columns, or both. Take a look at an example:
<script src="https://gist.github.com/darioappsilon/416e4dd17e98c55e51202c181d149716.js"></script>
<img class="size-full wp-image-11600" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abca5e32948c871ae1ff_4.webp" alt="Image 4 - Removing empty data with janitor" width="182" height="202" /> Image 4 - Removing empty data with janitor
Easy, right? Feel free to experiment with different options for the <code>which</code> parameter to get the hang of it.
<h3>Removing constant data</h3>
A column with only one unique value is just useless. It provides no value for analysis, visualization, and even training machine learning models. It's best to remove such columns entirely. Use the <code>remove_constant()</code> function for the task:
<script src="https://gist.github.com/darioappsilon/c73d10fb0ad5e433577047e0c6445859.js"></script>
<img class="size-full wp-image-11602" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abcbf99b81a7b8ed3f09_5.webp" alt="Image 5 - Removing constant data with janitor" width="176" height="208" /> Image 5 - Removing constant data with janitor
<b>Keep in mind:</b> Only remove constant data if you're 100% certain other values are not possible. For example, maybe you're looking at a small sample of a larger dataset that originally has multiple values in the given column. Be extra careful.
<h3>Isolating duplicate date</h3>
Missing data is no fun, but duplicates can even be worse! Two rows with identical values convey the same information. If they're in the dataset by accident, they might skew your analysis and models if left untouched.
Luckily, <code>janitor</code> comes with a <code>get_dupes()</code> function you can use to check. The code snippet below considers a value as duplicate only if values for all columns are identical:
<script src="https://gist.github.com/darioappsilon/2502831ccce18c8b8aa04f903ae28ceb.js"></script>
<img class="size-full wp-image-11604" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abcb719cbbf522f1b2a3_6.webp" alt="Image 6 - Isolating duplicate values (1)" width="1580" height="162" /> Image 6 - Isolating duplicate values (1)
You can also specify the columns that will be used when checking for duplicates:
<script src="https://gist.github.com/darioappsilon/c7c3a971a0d8a943b82822f13ed155b3.js"></script>
<img class="size-full wp-image-11606" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abcc1eed2d5d672dbd08_7.webp" alt="Image 7 - Isolating duplicate values (2)" width="1638" height="392" /> Image 7 - Isolating duplicate values (2)
As you can see, we get a much larger duplicate base the second time, just because fewer columns were used for the check.
<h3>Janitor package - summary</h3>
The <code>janitor</code> package is extremely powerful when it comes to data cleaning in R. We've explored basic functionality which is enough to clean most datasets. So, what's the next step?
As mentioned earlier, the next step is <b>data validation</b>. It will make sure all <i>test</i> cases have passed.
<h2 id="data-validation">Data Validation in R with the data.validator Package</h2>
Appsilon's <a href="https://github.com/Appsilon/data.validator" target="_blank" rel="noopener noreferrer">data.validator</a> is a go-to package for scalable and reproducible data validation. You can use it to validate a dataset in the IDE, and you can even export an interactive report. You'll learn how to do both.
For simplicity's sake, we'll use the Iris dataset for validation. You're free to use any dataset and any validation condition.
You'll have to start by creating a report object and then using validation functions, such as <code>validate_if()</code> and <code>validate_cols()</code> to validate conditions:
<script src="https://gist.github.com/darioappsilon/4ce624ff337fa604eaafdece111eda28.js"></script>
<img class="size-full wp-image-11608" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abcd9767c3154f5ec651_8.webp" alt="Image 8 - Validation report printed to the console" width="1882" height="224" /> Image 8 - Validation report printed to the console
It looks like one validation failed with three violations. You can't see more details in the console, unfortunately. But what you can do is create an HTML report instead:
<script src="https://gist.github.com/darioappsilon/5e4e16dad8fd8821995945a30c690aa0.js"></script>
<img class="size-full wp-image-11610" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abce6ff6678a8b970f95_9.webp" alt="Image 9 - Validation report as an HTML document (1)" width="2280" height="1904" /> Image 9 - Validation report as an HTML document (1)
Unlike with the console option, now you can click on the <i>Show</i> button to get detailed insights into why the validation check failed:
<img class="size-full wp-image-11612" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abd58644ca583e45e4f2_10.webp" alt="Image 10 - Validation report as an HTML document (2)" width="2280" height="1904" /> Image 10 - Validation report as an HTML document (2)
As you can see, the <code>Sepal.Width</code> column was outside the given range in these three instances, so the validation check failed.
<blockquote>Want to learn more about data.validator? <a href="https://appsilon.com/data-validation-with-data-validator-an-open-source-package-from-appsilon/" target="_blank" rel="noopener noreferrer">Read our complete guide on Appsilon blog</a>.</blockquote>
<hr />
<h2 id="summary">Summary of Data Cleaning in R</h2>
Long story short - it's crucial to clean and validate your dataset before continuing with analysis, visualization, or predictive modeling. Today you've seen how to approach this task with two highly-capable R packages. They work best when used together - <code>janitor</code> for data cleaning and <code>data.validator</code> for validation.
For a homework assignment, we recommend you download any messy dataset of your choice and use two discussed packages for cleaning and validation. Share your results with us on Twitter - <a href="https://twitter.com/appsilon" target="_blank" rel="noopener noreferrer">@appsilon</a>. We'd love to see what you come up with.
<blockquote>Are you completely new to R? <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">These are 6 R packages you must learn as a beginner.</a></blockquote>