R {targets}: How to Make Reproducible Pipelines for Data Science and Machine Learning

June 27, 2023

The R {targets} package is a pipeline tool for statistics, data science, and machine learning in R. The <a href="https://github.com/ropensci/targets" target="_blank" rel="noopener">package</a> allows you to write and maintain reproducible R workflows in pipelines that run only when necessary (e.g., either data or code has changed). The best part is - you'll learn how to use it today.

By the end of the article, you'll have an entire pipeline for loading, preparing, splitting, training and evaluating a logistic regression model on the famous <a href="https://www.kaggle.com/c/titanic" target="_blank" rel="noopener">Titanic dataset</a>. You'll see just how easy it is to use <code>targets</code>, and why it might be a good idea to use the package on your next data science project.
<blockquote>But what is Logistic Regression? <a href="https://appsilon.com/r-logistic-regression/" target="_blank" rel="noopener">Read our complete guide for machine learning beginners</a>.</blockquote>
Table of contents:
<ul><li><a href="#what">What Is R {targets} and Why Do You Need It?</a></li><li><a href="#ml-pipeline">Let's Code a Machine Learning Pipeline</a></li><li><a href="#targets-pipeline">Machine Learning Pipeline in R {targets} - Plain English Explanation</a></li><li><a href="#summary">Summing up R {targets}</a></li></ul>

<hr />

<h2 id="what">What Is R {targets} and Why Do You Need It?</h2>
Data science and machine learning boil down to experimentation. Each experiment can take from seconds to days to complete, and by the end, the results might not be valid (if you update the code or the data).

That's exactly the issue <code>targets</code> aims to solve. It enables you to maintain a reproducible workflow, learns how your pipeline fits together, skips the tasks that are already up to date, and runs only the necessary computations. This way, you save both time and compute power.

The package allows you to do your research and experiments entirely within R, which is a rare sight even in late 2022. Most pipeline tools are either language-agnostic or Python-specific, so a tool with native R support is always welcome.

The projects in which you'll use <code>targets</code> will have the following structure:
<pre> ├── _targets.R
├── data.csv
├── R/
│ ├── functions.R
</pre>
The first file, <code>_targets.R</code>, is created by the package, but more on that later. The <code>data.csv</code> file serves as a dummy example of a data source. You don't need it, as you can load data from a database or the web instead. The only file you need to create by hand is <code>R/functions.R</code>, as it will contain all the R functions that form your pipeline.

Create the folder and file if you haven't already, and let's start writing a couple of functions for training a machine learning model.
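
If you'd rather set up the skeleton from the R console than by hand, a minimal sketch could look like the one below. The <code>project_dir</code> path is a hypothetical placeholder (a temporary folder here, so nothing in your workspace is touched); point it at your actual project root instead.
<pre><code class="language-r"># Hypothetical project location; replace with your own project root.
project_dir <- file.path(tempdir(), "titanic-pipeline")
# Create the project folder together with the R/ subfolder.
dir.create(file.path(project_dir, "R"), recursive = TRUE, showWarnings = FALSE)
# Create an empty functions.R that will hold the pipeline functions.
functions_file <- file.path(project_dir, "R", "functions.R")
if (!file.exists(functions_file)) file.create(functions_file)
# List what was created, relative to the project root.
list.files(project_dir, recursive = TRUE)</code></pre>
The <code>_targets.R</code> file is deliberately left out, since the package generates it in a later step.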
<h2 id="ml-pipeline">Let's Code a Machine Learning Pipeline</h2>
As mentioned in the introduction section, we'll write a machine learning pipeline for training and evaluating a logistic regression model on the Titanic dataset.

The pipeline will be spread over five R functions in the <code>R/functions.R</code> file. Here's a description of what each function does:
<ul><li><code>get_data()</code> - Returns the training subset of the Titanic dataset. We'll use only this subset throughout the article.</li><li><code>prepare_data()</code> - A lengthy function that is used to clean up the dataset, extract meaningful features, drop unused ones, and impute missing values.</li><li><code>train_test_split()</code> - Splits our dataset into training and testing subsets (80:20).</li><li><code>fit_model()</code> - Fits a logistic regression model to the training set.</li><li><code>evaluate_model()</code> - Returns a confusion matrix of the test set.</li></ul>
It's quite a long file, so feel free to just copy/paste it from the snippet below (don't forget to install the packages if necessary):
<pre><code class="language-r">library(titanic)
library(dplyr)
library(modeest)
library(mice)
library(caTools)
library(caret)
<br>get_data <- function() {
  return(titanic_train)
}
<br>prepare_data <- function(data) {
  # Convert empty strings to NA
  data$Cabin[data$Cabin == ""] <- NA
  data$Embarked[data$Embarked == ""] <- NA
<br>  # Title extraction
  data$Title <- gsub("(.*, )|(\\..*)", "", data$Name)
  rare_titles <- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer")
  data$Title[data$Title == "Mlle"] <- "Miss"
  data$Title[data$Title == "Ms"] <- "Miss"
  data$Title[data$Title == "Mme"] <- "Mrs"
  data$Title[data$Title %in% rare_titles] <- "Rare"
<br>  # Deck extraction: the deck is the first character of the cabin code
  data$Deck <- factor(sapply(data$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
<br>  # Drop unused columns
  data <- data %>%
    select(-c(PassengerId, Name, Cabin, Ticket))
<br>  # Missing data imputation
  # Fill missing Embarked values with the most frequent value
  data$Embarked[is.na(data$Embarked)] <- mlv(data$Embarked, method = "mfv")
  factor_vars <- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Title")
  data[factor_vars] <- lapply(data[factor_vars], function(x) as.factor(x))
  # Impute the remaining missing values with random-forest-based MICE
  impute_mice <- mice(data[, !names(data) %in% c("Survived")], method = "rf")
  result_mice <- complete(impute_mice)
<br>  # Assign to the original dataset
  data$Age <- result_mice$Age
  data$Deck <- result_mice$Deck
  data$Deck <- as.factor(data$Deck)
<br>  return(data)
}
<br>train_test_split <- function(data) {
  set.seed(42)
  sample_split <- sample.split(Y = data$Survived, SplitRatio = 0.8)
  train_set <- subset(x = data, sample_split == TRUE)
  test_set <- subset(x = data, sample_split == FALSE)
<br>  return(list("train_set" = train_set, "test_set" = test_set))
}
<br>fit_model <- function(data) {
  model <- glm(Survived ~ ., data = data, family = "binomial")
  return(model)
}
<br>evaluate_model <- function(model, data) {
  probabilities <- predict(model, newdata = data, type = "response")
  pred <- ifelse(probabilities > 0.5, 1, 0)
  cm <- confusionMatrix(factor(pred), factor(data$Survived), positive = as.character(1))
  return(cm)
}</code></pre>
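Before sourcing the file, it can help to check which of these packages are actually installed. The snippet below is a small sketch using only base R. The <code>required</code> vector simply mirrors the <code>library()</code> calls above plus <code>targets</code> itself, and the <code>install.packages()</code> call is left commented out so that nothing gets installed without your say-so.
<pre><code class="language-r"># Packages used in R/functions.R, plus targets itself.
required <- c("targets", "titanic", "dplyr", "modeest", "mice", "caTools", "caret")
# requireNamespace() returns FALSE (instead of throwing an error) for packages that aren't installed.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}</code></pre>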
Next, we'll verify each function works as advertised before adding it to the pipeline.
<blockquote>Do you have incomplete data? If you prefer using R instead of Python Pandas, <a href="https://appsilon.com/imputation-in-r/" target="_blank" rel="noopener">here are three R packages for data imputation</a>.</blockquote>
<h3>Evaluating our machine learning pipeline by hand</h3>
Let's start by loading the data and displaying the head of the dataset:
<pre><code class="language-r">data <- get_data()
head(data)</code></pre>
<img class="size-full wp-image-15987" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aef806cc67965bdf4dc7_8284a40f_1-1.webp" alt="Image 1 - Head of the Titanic dataset" width="1155" height="156" /> Image 1 - Head of the Titanic dataset

The dataset is quite messy and full of missing values by default. The <code>prepare_data()</code> function is here to fix that:
<pre><code class="language-r">data_prepared <- prepare_data(data)
head(data_prepared)</code></pre>
<img class="size-full wp-image-15989" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aef82814a08d133da18d_54c4a2c7_2-1.webp" alt="Image 2 - Titanic dataset in a machine learning ready form" width="747" height="203" /> Image 2 - Titanic dataset in a machine learning ready form
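
To make the title-extraction step in <code>prepare_data()</code> less of a black box, here's the same regular expression applied to a few names written in the dataset's "Surname, Title. Given names" format. The <code>example_names</code> vector is a small illustrative sample, and only a subset of the rare titles is listed; the pattern itself is the one used in the function.
<pre><code class="language-r"># Illustrative names in the "Surname, Title. Given names" format.
example_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Smith, Mlle. Anne",
  "Jones, Col. John"
)
# Drop everything up to ", " and everything from the first "." onward.
titles <- gsub("(.*, )|(\\..*)", "", example_names)
# Harmonize the French title and group uncommon ones, as prepare_data() does.
titles[titles == "Mlle"] <- "Miss"
rare_titles <- c("Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir")
titles[titles %in% rare_titles] <- "Rare"
print(titles)
# Result: "Mr" "Miss" "Miss" "Rare"</code></pre>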

Next, we'll split the dataset into training and testing subsets. Once done, the dimensions of both are printed:
<pre><code class="language-r">data_splitted <- train_test_split(data_prepared)
dim(data_splitted$train_set)
dim(data_splitted$test_set)</code></pre>
<img class="size-full wp-image-15991" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aef97f844f0151b41b35_95c708e2_3-1.webp" alt="Image 3 - Dimensions of training and testing subsets" width="333" height="105" /> Image 3 - Dimensions of training and testing subsets
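
The <code>sample.split()</code> function from <code>caTools</code> splits within each class of the vector passed as <code>Y</code>, so the share of survivors should be roughly the same in both subsets. Here's a quick sketch for checking this on a toy label vector (the <code>survived_toy</code> values are invented for illustration and are not the Titanic data):
<pre><code class="language-r">library(caTools)
set.seed(42)
# Toy binary outcome: 60 zeros and 40 ones.
survived_toy <- rep(c(0, 1), times = c(60, 40))
# TRUE marks rows assigned to the training subset (about 80% of each class).
in_train <- sample.split(Y = survived_toy, SplitRatio = 0.8)
# Compare the share of ones overall, in the training subset, and in the test subset.
c(
  overall = mean(survived_toy),
  train = mean(survived_toy[in_train]),
  test = mean(survived_toy[!in_train])
)</code></pre>
Running the same comparison on <code>data_splitted$train_set$Survived</code> and <code>data_splitted$test_set$Survived</code> is a cheap sanity check before fitting the model.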

Let's fit a logistic regression model to the training set and print the model summary:
<pre><code class="language-r">data_model <- fit_model(data_splitted$train_set)
summary(data_model)</code></pre>
<img class="size-full wp-image-15993" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefa537ef68dd1d0cc56_9038a1bb_4-1.webp" alt="Image 4 - Summary of a logistic regression model" width="674" height="374" /> Image 4 - Summary of a logistic regression model

There's a lot more to the summary, but the important thing is that we don't get any errors. The final step is to print the confusion matrix:

<pre><code class="language-r">data_cm <- evaluate_model(data_model, data_splitted$test_set)
data_cm</code></pre>
<img class="size-full wp-image-15995" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefac683397a52ddfbff_02880959_5-1.webp" alt="Image 5 - Model confusion matrix on the test set" width="475" height="417" /> Image 5 - Model confusion matrix on the test set
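
Under the hood, <code>evaluate_model()</code> turns predicted probabilities into class labels with a 0.5 cutoff and cross-tabulates them against the true labels. The base R sketch below mirrors that idea on a handful of invented values (<code>probabilities</code> and <code>actual</code> are placeholders, not model output), without relying on <code>caret</code>:
<pre><code class="language-r"># Invented probabilities and true labels, for illustration only.
probabilities <- c(0.91, 0.12, 0.65, 0.40, 0.08, 0.77)
actual <- c(1, 0, 0, 1, 0, 1)
# Same rule as in evaluate_model(): probabilities above 0.5 become class 1.
pred <- ifelse(probabilities > 0.5, 1, 0)
# Fix the levels so both classes appear even if one is never predicted.
cm <- table(
  Prediction = factor(pred, levels = c(0, 1)),
  Reference = factor(actual, levels = c(0, 1))
)
print(cm)
# Accuracy is the share of predictions on the diagonal of the table.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy</code></pre>
Here four of the six predictions match the reference, so the accuracy is 4/6. The <code>confusionMatrix()</code> function reports this figure alongside additional statistics such as sensitivity and specificity.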

And that's it - everything works, so let's bring it over to <code>targets</code> next.
<h2 id="targets-pipeline">Machine Learning Pipeline in R {targets} - Plain English Explanation</h2>
By now, you should have an <code>R/functions.R</code> file created and populated with five functions for training and evaluating a logistic regression model.

The question remains - <b>how do you add these functions to a <code>targets</code> pipeline</b>? It's simple. Just run the following from the R console:
<pre><code class="language-r">targets::use_targets()</code></pre>
<img class="size-full wp-image-15997" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefb2814a08d133da202_c017d5fd_6-1.webp" alt="Image 6 - Initializing targets" width="1000" height="207" /> Image 6 - Initializing {targets}

The package will automatically create a new file for you - <code>_targets.R</code>. It's a normal R script:

<img class="size-full wp-image-15999" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefb7f844f0151b41cc1_408c1795_7-1.webp" alt="Image 7 - _targets.R file" width="1552" height="1349" /> Image 7 - _targets.R file

Inside it, you'll have to modify a couple of things:
<ul><li><b>Line 12</b> - Add all external R packages your script needs.</li><li><b>Line 28</b> - Replace the dummy pipeline with the actual one.</li></ul>
Each step of the pipeline has to be placed inside a <code>tar_target()</code> call, which takes the step name and the command to run. The name is an unquoted symbol, so a later step can use an earlier step's result simply by referring to its name in the command. That's also how <code>targets</code> works out the dependencies between steps.
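
To see in isolation how one step's name feeds into the next, here's a minimal sketch with two toy targets, run in a throwaway folder so it doesn't touch your project. The target names <code>x</code> and <code>y</code> and the <code>demo_dir</code> path are placeholders for illustration; <code>tar_script()</code> is a helper from <code>targets</code> that writes the given code to <code>_targets.R</code>.
<pre><code class="language-r">library(targets)
# Work in a throwaway folder (hypothetical path) and restore the working directory afterwards.
demo_dir <- file.path(tempdir(), "targets-minimal-demo")
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)
# Write a two-step pipeline to _targets.R: y refers to x by its unquoted name.
tar_script({
  library(targets)
  list(
    tar_target(x, 1 + 1),
    tar_target(y, x * 10)
  )
}, ask = FALSE)
# Run the pipeline, then read the stored result of the second step.
tar_make()
y_value <- tar_read(y)
setwd(old_wd)
y_value</code></pre>
Since <code>x</code> evaluates to 2, the stored value of <code>y</code> is 20. The Titanic pipeline follows exactly the same pattern, just with real functions in place of the arithmetic.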

Here's the entire modified <code>_targets.R</code> file:
<pre><code class="language-r"># Created by use_targets().
# Follow the comments below to fill in this target script.
# Then follow the manual to check and run the pipeline:
# https://books.ropensci.org/targets/walkthrough.html#inspect-the-pipeline # nolint
<br># Load packages required to define the pipeline:
library(targets)
# library(tarchetypes) # Load other packages as needed. # nolint
<br># Set target options:
tar_option_set(
  packages = c("tibble", "titanic", "dplyr", "modeest", "mice", "caTools", "caret"), # packages that your targets need to run
  format = "rds" # default storage format
  # Set other options as needed.
)
<br># tar_make_clustermq() configuration (okay to leave alone):
options(clustermq.scheduler = "multicore")
<br># tar_make_future() configuration (okay to leave alone):
# Install packages {{future}}, {{future.callr}}, and {{future.batchtools}} to allow use_targets() to configure tar_make_future() options.
<br># Run the R scripts in the R/ folder with your custom functions:
tar_source()
# source("other_functions.R") # Source other scripts as needed. # nolint
<br># Replace the target list below with your own:
list(
  tar_target(data, get_data()),
  tar_target(data_prepared, prepare_data(data)),
  tar_target(data_splitted, train_test_split(data_prepared)),
  tar_target(data_model, fit_model(data_splitted$train_set)),
  tar_target(data_cm, evaluate_model(data_model, data_splitted$test_set))
)</code></pre>
That's all we need to do, preparation-wise. We'll now check if there are any errors in the pipeline.
<h3>Check the pipeline for errors</h3>
Each time you write a new pipeline or make changes to an old one, it's a good idea to check if everything works. That's where the <code>tar_manifest()</code> function comes in.

You can call it directly from the R console:
<pre><code class="language-r">targets::tar_manifest(fields = command)</code></pre>
Here's the output you should see if there are no errors:

<img class="size-full wp-image-16001" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefc01be0f23ad2f2531_1ad01f40_8-1.webp" alt="Image 8 - Tar manifest output" width="729" height="205" /> Image 8 - Tar manifest output

The <code>tar_manifest()</code> function returns a data frame with one row per target, listing its name, its command, and, optionally, other settings. It doesn't run any of the targets. It only inspects the pipeline definition in <code>_targets.R</code>, which makes it a quick way to confirm that the pipeline is defined the way you intended.
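
Since <code>tar_manifest()</code> returns its result as a data frame, you can also inspect it programmatically. The sketch below does so for a toy two-target pipeline written to a temporary folder (the targets <code>x</code> and <code>y</code> and the <code>demo_dir</code> path are invented for illustration):
<pre><code class="language-r">library(targets)
# Throwaway folder (hypothetical path) holding a toy pipeline.
demo_dir <- file.path(tempdir(), "targets-manifest-demo")
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)
tar_script({
  library(targets)
  list(
    tar_target(x, 1 + 1),
    tar_target(y, x * 10)
  )
}, ask = FALSE)
# No target is run here; the manifest only describes the pipeline definition.
manifest <- tar_manifest()
setwd(old_wd)
manifest[, c("name", "command")]</code></pre>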

Everything looks good here, but that's not the only check we can do.
<h3>R {targets} dependency graph</h3>
Let's discuss an additional, more visual test. The <code>tar_visnetwork()</code> function renders the dependency graph which shows a natural progression of the pipeline from left to right.

When you call this function for the first time, you'll likely be prompted to install the <code>visNetwork</code> package (press 1 on the keyboard):
<pre><code class="language-r">targets::tar_visnetwork()</code></pre>
<img class="size-full wp-image-16003" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefcc0dc47f4ac1e474e_0d9c6927_9-1.webp" alt="Image 9 - Targets dependency graph" width="1325" height="665" /> Image 9 - {targets} dependency graph

You can clearly see the dependency relationship - the output of each pipeline element is passed as an input to the next element. The functions displayed as triangles on the left are connected with their respective pipeline element.

Next, let's finally run the pipeline.
<h3>Run targets pipeline</h3>
Running the pipeline also boils down to calling a single function - <code>tar_make()</code>:
<pre><code class="language-r">targets::tar_make()</code></pre>
<img class="size-full wp-image-16005" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefd34bd616f39b98553_44225f12_10-1.webp" alt="Image 10 - Pipeline progress" width="360" height="156" /> Image 10 - Pipeline progress

If you don't see any errors (chunks of red text), it means your pipeline was successfully executed. You should see a new <code>_targets</code> folder created:

<img class="size-full wp-image-16007" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6557b216c23ed2054c_11-1.webp" alt="Image 11 - Contents of the _targets folder" width="906" height="548" /> Image 11 - Contents of the _targets folder

<code>_targets/objects</code> is the folder that contains the output of each pipeline step. We can load the result of the last step (the confusion matrix) and display it in the R console:
<pre><code class="language-r">targets::tar_read(data_cm)</code></pre>
<img class="size-full wp-image-16009" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aefe7f844f0151b41f08_f15f4849_12-1.webp" alt="Image 12 - Confusion matrix of a logistic regression model" width="479" height="367" /> Image 12 - Confusion matrix of a logistic regression model

Yes, it's that easy! But one of the best <code>targets</code> features is that it skips any pipeline step whose code and upstream data haven't changed since the last run. We can verify this by re-running the pipeline:
<pre><code class="language-r">targets::tar_make()</code></pre>
<img class="size-full wp-image-16011" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6757b216c23ed20685_13-1.webp" alt="Image 13 - Pipeline progress - all steps skipped" width="367" height="160" /> Image 13 - Pipeline progress - all steps skipped
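
If you want to know in advance which steps would rerun, <code>tar_outdated()</code> lists the targets that are out of date. The sketch below demonstrates it on a toy two-target pipeline in a temporary folder (the targets <code>x</code> and <code>y</code> and the <code>demo_dir</code> path are placeholders). After a full run nothing should be outdated, and after the command behind <code>x</code> is edited, <code>x</code> and everything downstream of it should show up again.
<pre><code class="language-r">library(targets)
# Throwaway folder (hypothetical path) holding a toy pipeline.
demo_dir <- file.path(tempdir(), "targets-outdated-demo")
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)
tar_script({
  library(targets)
  list(tar_target(x, 1 + 1), tar_target(y, x * 10))
}, ask = FALSE)
tar_make()
# Right after a successful run, no target should be outdated.
outdated_before <- tar_outdated()
# Change the command of x; y depends on x, so it is affected as well.
tar_script({
  library(targets)
  list(tar_target(x, 1 + 2), tar_target(y, x * 10))
}, ask = FALSE)
outdated_after <- tar_outdated()
setwd(old_wd)
outdated_before
outdated_after</code></pre>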

And that's the power of the R <code>targets</code> package. Let's make a short recap next.

<hr />

<h2 id="summary">Summing up R {targets}</h2>
Today you've learned the basics of the {targets} package through a hands-on example. You now have a machine learning pipeline you can modify to your liking. For example, you could load the dataset from a file instead of a library, plot the confusion matrix, or print variable importance. The modifications you'd have to make in the <code>_targets.R</code> file are minimal, and you can easily figure them out by referencing the <a href="https://books.ropensci.org/targets/" target="_blank" rel="noopener">documentation</a>.
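
As one example of such a modification, here's a hedged sketch of loading data from a file instead of a package. It assumes a CSV file named <code>data.csv</code> in the project root (written to a temporary folder here with a few invented rows) and uses <code>format = "file"</code>, which tells <code>targets</code> to track the file's contents so that downstream steps rerun when the file changes. The target names <code>data_file</code> and <code>data</code> and the <code>demo_dir</code> path are illustrative.
<pre><code class="language-r">library(targets)
# Throwaway folder (hypothetical path) standing in for the project root.
demo_dir <- file.path(tempdir(), "targets-file-demo")
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)
# Invented stand-in for a real data source.
write.csv(
  data.frame(Survived = c(0, 1, 1), Age = c(22, 38, 26)),
  "data.csv",
  row.names = FALSE
)
# data_file is a file target: its command returns the path, and targets watches the file.
# data reads the CSV through that path, so it depends on the file's contents.
tar_script({
  library(targets)
  list(
    tar_target(data_file, "data.csv", format = "file"),
    tar_target(data, read.csv(data_file))
  )
}, ask = FALSE)
tar_make()
loaded <- tar_read(data)
setwd(old_wd)
loaded</code></pre>
In the Titanic pipeline, these two targets would take the place of the <code>get_data()</code> step, with the rest of the list left unchanged.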

<i>What's your favorite approach to writing machine learning pipelines in R? Are you still using the <code>drake</code> package?</i> Please let us know in the comment section below. We also encourage you to move the discussion to Twitter - <a href="http://twitter.com/appsilon" target="_blank" rel="noopener">@appsilon</a>. We'd love to hear your input.
<blockquote>Building a Shiny app for your machine learning project? <a href="https://appsilon.com/build-a-ci-cd-pipeline-for-shiny-apps/" target="_blank" rel="noopener">Add automated validation and implement a CI/CD pipeline with Posit Connect and GitLab-CI</a>.</blockquote>
