Fast Data Loading from Files to R

Reading time:

time

min

April 11, 2017

<em><strong>Updated</strong>: June 6, 2022.</em> Loading large data frames when building Shiny Apps can have a significant impact on the app initialization time. When we ran into this issue in a recent project, we decided to conduct a review of the available methods for reading data from CSV files (as provided by our client) to <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">R</a>. In this article, we will identify the most efficient of these methods using benchmarking and explain our workflow. <blockquote>Want to use R and Python together in your Project? <a href="https://appsilon.com/use-r-and-python-together/">Our complete guide has you covered</a>.</blockquote> We will compare the following: <ol><li><code class="highlighter-rouge">read.csv</code> from <code class="highlighter-rouge">utils</code>, which was the standard way of reading <strong>csv</strong> files to R in RStudio,</li><li><code class="highlighter-rouge">read_csv</code> from <code class="highlighter-rouge">readr</code> which replaced the former method as a standard way of doing it in RStudio,</li><li><code class="highlighter-rouge">load</code> and <code class="highlighter-rouge">readRDS</code> from <code class="highlighter-rouge">base</code>, and</li><li><code class="highlighter-rouge">read_feather</code> from <code class="highlighter-rouge">feather</code> and <code class="highlighter-rouge">fread</code> from <code class="highlighter-rouge">data.table</code>.</li></ol> <hr /> <h2 id="data">R Fast Data Loading - The Dataset</h2> To kick things off, we have to generate a random dataset that's fairly large: <pre><code class="language-r">set.seed(123) df <- data.frame( replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)), replicate(10, stringi::stri_rand_strings(1000, 5)) ) <br>head(df)</code></pre> For reference, this is what the dataset looks like: <img class="size-full wp-image-13277" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b27064b2d248d6c9eda86d_1.webp" alt="Image 1 - Artificially created dataset" width="1760" height="306" /> Image 1 - Artificially created dataset Once created, we'll create variables to hold saving locations for all four file formats - CSV, Feather, RData, and RDS: <pre><code class="language-r">path_csv <- "./assets/data/fast_load/df.csv" path_feather <- "./assets/data/fast_load/df.feather" path_rdata <- "./assets/data/fast_load/df.RData" path_rds <- "./assets/data/fast_load/df.rds"</code></pre> From here, we can load in the required R packages and dump the datasets to disk: <pre><code class="language-r">library(feather) library(data.table) <br>write.csv(df, file = path_csv, row.names = F) write_feather(df, path_feather) save(df, file = path_rdata) saveRDS(df, path_rds)</code></pre> Next, we can check the resulting file sizes: <pre><code class="language-r">files <- c("./assets/data/fast_load/df.csv", "./assets/data/fast_load/df.feather", "./assets/data/fast_load/df.RData", "./assets/data/fast_load/df.rds") info <- file.info(files) info$size_mb <- info$size / (1024 * 1024) print(subset(info, select = c("size_mb")))</code></pre> Here's the output from the <code>print</code> statement: <img class="size-full wp-image-13279" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270659551644329c1c7ae_2.webp" alt="Image 2 - File size comparison" width="714" height="228" /> Image 2 - File size comparison Both CSV and Feather format files take up much more storage space. CSV takes up 6 times and Feather 4 times more space as compared to RDS and RData. <blockquote>Can you write R in... Excel? <a href="https://appsilon.com/r-and-excel/">Without any trouble - Here's our detailed guide</a>.</blockquote> <h2 id="benchmark">R Fast Data Loading - Benchmark and Results</h2> We will use the <code class="highlighter-rouge">microbenchmark</code> library to compare the read times in 10 rounds for the following methods: <ul><li>utils::read.csv</li><li>readr::read_csv</li><li>data.table::fread</li><li>base::load</li><li>base::readRDS</li><li>feather::read_feather</li></ul> Here's the entire code snippet you need to run the benchmark: <pre><code class="language-r">library(microbenchmark) <br>benchmark <- microbenchmark( readCSV = utils::read.csv(path_csv), readrCSV = readr::read_csv(path_csv, progress = F), fread = data.table::fread(path_csv, showProgress = F), loadRdata = base::load(path_rdata), readRds = base::readRDS(path_rds), readFeather = feather::read_feather(path_feather), times = 10 ) print(benchmark, signif = 2)</code></pre> Below you'll find the results obtained on an M1 Pro 16" MacBook Pro: <img class="size-full wp-image-13281" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b27066ebfca615e7a81a41_3.webp" alt="Image 3 - Benchmark results" width="824" height="342" /> Image 3 - Benchmark results And the <strong>winner</strong> is… Feather! However, using Feather requires prior conversion of the file to the feather format. Using <code class="highlighter-rouge">load</code> or <code class="highlighter-rouge">readRDS</code> can improve performance (second and third place in terms of speed) and has an added benefit of storing a smaller/compressed file. In both cases, it is necessary to first convert the file to the proper format. When it comes to reading from the CSV format, <code class="highlighter-rouge">fread</code> significantly beats <code class="highlighter-rouge">read_csv</code> and <code class="highlighter-rouge">read.csv</code>, and thus is the best option to read a CSV file. <blockquote>Supercharge your R Shiny dashboards with 1<a href="https://appsilon.com/apache-arrow-in-r-supercharge-r-shiny-dashboards/" target="_blank" rel="noopener noreferrer">0x faster data loading with Apache Arrow in R</a>.</blockquote> Ultimately, we chose to work with Feather files. The CSV to Feather conversion process is quick and we did not have a strict limitation on storage space in which case either the RDS or RData formats could probably have been a more appropriate choice. The final workflow was: <ol><li>Reading a CSV file provided by our customer using <code class="highlighter-rouge">fread</code>,</li><li>Writing it to Feather using <code class="highlighter-rouge">write_feather</code>, and</li><li>Loading a Feather file on app initialization using <code class="highlighter-rouge">read_feather</code>.</li></ol> The first two tasks were done once and outside of the Shiny App context. There is also quite an interesting benchmark done by <a href="https://gist.github.com/hadley/6353939" target="_blank" rel="noopener noreferrer">Hadley on reading complete files to R</a>. Please note that if you use functions defined in that post, you will end up with a character-type object and will have to apply string manipulations to obtain a commonly and widely used <strong>dataframe</strong>. If you run into any issues, as an <a class="c-link" href="https://wordpress.appsilon.com/appsilon-data-science-is-now-an-rstudio-full-service-certified-partner/" target="_blank" rel="noopener noreferrer" aria-describedby="slack-kit-tooltip">RStudio Full Certified Partner</a>, our team at Appsilon is ready to answer your questions about loading data into R and other topics related to R Shiny, Data Analytics, and Machine Learning. We're experts in this area, and we'd love to chat - <a href="https://appsilon.com/#contact">you can reach out to us at any time</a>. <hr /> <h2>Follow Us for More</h2><ul><li>Follow <a href="https://twitter.com/appsilon">@Appsilon</a> on Twitter</li><li>Follow Appsilon on <a href="https://www.linkedin.com/company/appsilon" target="_blank" rel="noopener noreferrer">LinkedIn</a></li><li>Learn more about our R Shiny <a href="https://appsilon.com/opensource/" target="_blank" rel="noopener noreferrer">open source</a> packages</li></ul>

Fast Data Loading from Files to R

Open source, pharma, and AI insights - once a week.

Share Your Data Goals with Us