R Dplyr vs. DuckDB - How to Enhance Your Data Processing Pipelines with R DuckDB

Reading time:

time

min

tutorial

By:

Dario Radečić

May 7, 2024

At Appsilon, we love dplyr. It’s an amazing package that makes it easier for newcomers to dive into data analytics, as the syntax is super simple to follow, and there are many guides and books out there. The question is - Does dplyr stand a chance as datasets get large?

In a single word, no. Throwing a single-threaded language with a package built for convenience at the wall to see what sticks will reveal one harsh truth - almost nothing does. On the other side of the equation, you have your traditional databases for handling huge volumes of data. But these also come with their fair share of problems - from provisioning, managing, data access, and data movement.

As a developer, you need something in between. Something that’s built for data analytics and runs in memory. That’s where DuckDB chimes in, and today you’ll learn how to work with DuckDB in R!

Dplyr can’t handle the load but you still want to use it? Try dtplyr - it’s faster brother that uses data.table backend.

Table of contents:

What is DuckDB? A Crash Course
R DuckDB Benchmark - dplyr vs. Duck DB
Summing up R DuckDB

What is DuckDB? A Crash Course

Let us throw a couple of buzzwords and acronyms on you. DuckDB is an open-source, embedded, in-process, relational, OLAP DBMS. If you didn’t understand any of that, let’s break it down:

Open-source - DuckDB is released under the MIT License, which means the source code is available to the public. You can download it, modify it, and use it however you see fit.
Embedded - This database is designed to be embedded with other applications rather than functioning as a standalone database server. For developers, this means you won’t need external dependencies or separate installations.
In-Process - DuckDB operates in-process within the application that uses it. In other words, it runs within the same memory space as the application, which makes communication and data exchange seamless.
Relational - Data is stored in a tabular format (rows and columns). It supports SQL for table creation, data insertion, and data access.
OLAP - The database is designed for data analysis (OnLine Analytical Processing). Data is organized by columns. All data associated with a column is stored near one another, which means the database is optimized for fast reading and computing on columns.
DBMS - DuckDB is a full-fledged DataBase Management System. It provides comprehensive functionality for storing, retrieving, and managing structured data. It also packs features for transaction management, concurrency control, and query optimization.

In terms of further DuckDB categorization and comparison, the following image paints a clear picture:

Image 1 - DuckDB categorization (Source: Gábor Szárnyas)

This means DuckDB is a lot like your cloud data warehouses when it comes to data analysis, but doesn’t require a dedicated server, and can be shared as a single file - just like SQLite.

You now know what DuckDB is, but what are the things that make it so great? Do developers dislike it for some reason? That’s what we’ll answer next.

DuckDB Advantages

DuckDB thrives in many areas and is loved by data professionals who want to analyze larger-than-memory datasets quickly. This section goes over the main advantages of DuckDB when compared to traditional RDBMS systems.

‍Simplicity. Do you know which DBMS is most widely used today? It’s SQLite! It’s simple to install, embed and share. DuckDB adopts many of SQLite’s core ideas. It has no external dependencies, no external server software to install and maintain, and is completely embedded within a host process. You can analyze data on your disk and perform aggregations even without copying it to DuckDB first.

‍Portability. DuckDB is available for Linux, macOS, and Windows for x86 and ARM architectures. You can run it on something as small as a Raspberry Pi, web browser, or mobile device, all thanks to DuckDB-Wasm. If you’re wondering about programming language support, there are APIs for R, Python, Java, C, C++, Go, Rust, JavaScript, and others.

‍Speed. It’s designed to support analytical query workloads, or long-running queries that process large portions of data. It achieves this by utilizing and columnar-vectorized query execution engine in which large bunches of values are processed in one go.

‍Extensions. The flexible extension mechanism of DuckDB allows you to define new data types, functions, file formats, and much more. For example, you can install an extension to access data stored on AWS S3, among other things.

‍Data Management Features. DuckDB provides transactional guarantees (ACID) through its custom, bulk-optimized Multi-Version Concurrency Control. You can run the database in memory or store it in a single file. You can leverage secondary indexes to speed up single table entry lookups.

DuckDB Disadvantages

Just like all the other database systems, DuckDB isn’t a be-all-end-all solution that will magically fix your data problems. Here are a couple of disadvantages you need to consider:

‍High Volumes of Transactions. DuckDB is designed for read-heavy operations and analytical workloads. If your workload consists of many inserts and updates, or even streaming, DuckDB isn’t a good match (e.g., webshops, order tracking, and similar).

‍Database Access Mode. As of now (April 2024), you can configure DuckDB in 2 ways for concurrency - one process can read and write to the database, or multiple processes can read from the database, but no one can write. “Solving” the second one is in theory possible, but the implementation is left up to you. Read more about it on official docs.

‍Limited Data Options. You can load data into DuckDB by using INSERT statements, or through CSV, Excel, Parquet, and JSON files. If you want to bring data from other relational databases, you can go with MySQL, SQlite, and PostgreSQL. Not a short list by any stretch of the imagination, but leaves out most enterprise platforms. When it comes to the type of data you can actually store, the focus is on tabular datasets, so graphs, vectors, and similar structures won’t work.

‍Not Distributed. One of the main benefits of DuckDB is that it behaves a lot like SQLite. You can’t distribute workload across multiple nodes. Solutions such as Spark, which might be slower on smaller datasets when compared to DuckDB, are often better solutions if you need a distributed system.

‍Community and Support. DuckDB is relatively new (2019) and is still in the 0.X version. The documentation is well-written, but don’t expect to find as many tutorials and usage examples as for database vendors that have been out for decades.

Bottom Line

To conclude, DuckDB has a lot of pros and cons. If you’re only familiar with workloads on small-ish datasets and don’t need to distribute the workload at all, it’s safe to assume DuckDB will be faster than anything R has to offer.

That being said, the main selling points of DuckDB can come up as a downside if you’re working with distributed systems, existing enterprise databases (Oracle, SQL Server), or in general need a system that’s optimized for inserts and updates instead of selects.

It doesn’t mean DuckDB is a bad solution - it’s perfect for what it was designed to do. As with any technology, you just need to be aware of its limitations.

Up next, let’s explore how to use DuckDB in R and see how it compares to a more widely used option - dplyr.

R DuckDB Benchmark - dplyr vs. Duck DB

Now it’s time to get into the fun stuff. The goal of this section is to see how both dplyr and DuckDB handle reading and aggregating around 38 million rows spread between 12 Parquet files.

You’ll need a couple of packages to follow along, so we’ll show you how to install them first.

Package Installation

Most of the R packages you’ll need today install through a simple install.packages() command. However, the arrow package requires an additional repos argument as of April 2024:

install.packages("arrow", repos = c("https://apache.r-universe.dev"))
install.packages(c("dplyr", "duckdb", "lubridate"))

Dataset Info

As for the dataset, we’ve downloaded monthly Yellow Taxi Parquet files for 2023 from the TLC Trip Record Dataset:

Image 2 - TLC Trip Record Dataset homepage

After downloading, you’ll end up with 12 Parquet files that have about 38.3 million rows (19 columns) and take about 630 MB of disk space:

Parquet file format provides superb compression, but 600 MB distributed over this many rows will surely require some processing time in R. Let's see just how much!

Benchmark Results

The arrow package allows you to create an R dataset from multiple Parquet files. It’s a necessary part of the dplyr pipeline, and a far better alternative to reading them individually. With DuckDB, you don’t have these issues, as the read_parquet() or parquet_scan() can read every Parquet file on a provided path.

As for the benchmark, we want to compute the number of trips and median statistics (distance, fare amount, tip amount) for every day of 2023. It’s a simple operation, but will require some date manipulation at first.

What’s neat about DuckDB is that it allows you to use all dplyr verbs, meaning that only the data reading process has to change. In addition, DuckDB requires you to call collect() as the last chained function, as no data reading or processing will happen before it:

library(arrow)
library(duckdb)
library(dplyr)
library(lubridate)

# DPLYR
res_dplyr <- as.data.frame(
  Scanner$create(
    open_dataset(sources = "")
  )$ToTable()
) %>%
  mutate(
    ride_date = as.Date(tpep_pickup_datetime),
    ride_year = year(tpep_pickup_datetime)
  ) %>%
  filter(ride_year == 2023) %>%
  group_by(ride_date) %>%
  summarise(
    median_distance = median(trip_distance),
    median_fare_amount = median(fare_amount),
    median_tip = median(tip_amount),
    ride_count = n()
  ) %>%
  arrange(ride_date)


# DUCKDB
res_duckdb <- tbl(
  dbConnect(duckdb()),
  "read_parquet('', hive_partitioning = true)"
) %>%
  mutate(
    ride_date = as.Date(tpep_pickup_datetime),
    ride_year = year(tpep_pickup_datetime)
  ) %>%
  filter(ride_year == 2023) %>%
  group_by(ride_date) %>%
  summarise(
    median_distance = median(trip_distance),
    median_fare_amount = median(fare_amount),
    median_tip = median(tip_amount),
    ride_count = n()
  ) %>%
  arrange(ride_date) %>%
  collect()

Alternatively, you can write the DuckDB pipeline entirely in SQL:

res_duckdb_sql <- dbGetQuery(
  conn = dbConnect(duckdb()),
  statement = "
    SELECT
      ride_date,
      ROUND(MEDIAN(trip_distance), 2) AS median_distance,
      ROUND(MEDIAN(fare_amount), 2) AS median_fare_amount,
      ROUND(MEDIAN(tip_amount), 2) AS median_tip,
      COUNT(*) AS ride_count
    FROM (
      SELECT
        STRFTIME(tpep_pickup_datetime, '%Y-%m-%d') AS ride_date,
        DATE_PART('year', tpep_pickup_datetime) AS trip_year,
        trip_distance,
        fare_amount,
        tip_amount
      FROM parquet_scan('/Users/dradecic/Desktop/data/*parquet')
      WHERE trip_year = 2023
    )
    GROUP BY ride_date
    ORDER BY ride_date
  "
)

Both will do the same job, so the SQL option for DuckDB is worth checking out if you’re new to R and dplyr.

Whichever option you choose, you’ll end up with the following data frame:

In essence, data is grouped on a daily level, and summary statistics are calculated for each day. Simple, but remember that we’re dealing with over 38 million rows here.

But what about the runtime results? Let’s take a peek:

Image 5 - Dplyr vs DuckDB runtime results

The numbers are correct - we’ve triple-checked.

‍DuckDB is almost 20 times faster than Dplyr, all while requiring minimal changes to the code. Well, at least if you don’t go via the SQL route. The difference is impressive, and it means you can run R data pipelines faster, pull data for machine learning faster, and even make Shiny apps that run blazing fast on huge amounts of data.

Want to make a bar chart like ours? Make sure to check our guide on stunning visualizations with R and ggplot2.

Summing up R DuckDB

To conclude, R language has the misfortune of having its most popular data wrangling package being single-threaded. It’s a setback, sure, and you’re bound to have to learn new technologies as you start working with large datasets.

‍DuckDB can help. It’s not perfect. It doesn’t work with all data sources. It’s not designed for inserts and updates. And it’s relatively new. But if it fits your business requirements and the tech stack you’re using, you won’t need much else. It’s fast - even 20 times faster than dplyr, supports dplyr verbs, and even allows you to write all of your data aggregations in SQL!

‍What are your thoughts on DuckDB in R? Are you using another dplyr alternative instead? Make sure to let us know by joining the discussion in our community.

Looking for alternative R data processing frameworks? Read our guide on speeding up your data processing pipelines.

Have questions or insights?

Engage with experts, share ideas and take your data journey to the next level!

Stop Struggling with Outdated Clinical Data Systems

Join pharma data leaders from Jazz Pharmaceuticals and Novo Nordisk in our live podcast episode as they share what really works when building modern, compliant Statistical Computing Environments (SCEs).

Save My Spot

Is Your Software GxP Compliant?

Download a checklist designed for clinical managers in data departments to make sure that software meets requirements for FDA and EMA submissions.

Get the Checklist

Ensure Your R and Python Code Meets FDA and EMA Standards

A comprehensive diagnosis of your R and Python software and computing environment compliance with actionable recommendations and areas for improvement.

Book the Audit