Working with pretty big data in R

Exploring file format options in R.

Date Posted February 26, 2019 Last Updated December 13, 2024

Author Laura DeCicco

Reading Time 15 minutes Share

Introduction

The vast majority of the projects that my data science team works on use flat files for data storage. Sometimes, the files get a bit large, so we create a set of files…but basically we’ve been fine without wading into the world of databases. Recently however, the data involved in our projects are creeping up to be bigger and bigger. We’re still not anywhere in the “BIG DATA (TM)” realm, but big enough to warrant exploring options. This blog explores the options: csv (both from readr and data.table), RDS, fst, sqlite, feather, monetDB. One of the takeaways I’ve learned was that there is not a single right answer. This post will attempt to lay out the options and summarize the pros and cons.

In a blog post that laid out similar work: sqlite-feather-and-fst and continued here , Karl Broman discusses his journey from flat files to “big-ish data”. I’ve taken some of his workflow, added more robust analysis for fst and monetDB, and used my own data.

TLDR!

		Read Time (sec)
File Format	Read Method	Complete	Filter	Group & Summarize	Write Time (sec)	File Size (MB)
rds	bulk	57	62	48.7	66	1281
rds compressed	bulk	69	52	50.8	80	55
csv (readr)	limited partial	27	21	17.3	110	704
csv.gz (readr)	limited partial	105	112	97.9	112	66
csv (fread)	limited partial	14	8	6.4	4	504
feather	limited partial	11	3	0.8	6	818
fst	limited partial	6	4	0.5	10	989
fst compressed	limited partial	6	3	0.4	20	123
sqlite	partial	45	14	1.6	55	464
monetDB	partial	8	2	1.3	58	360

“Too long…didn’t read” summary:

There are a lot of useful code examples below, but if you want to jump ahead to the final results, this table summarizes the results. This table lists all of the packages that were tested, the time it took to read and write some sample data, and the file size.

Shared Database

First question: should we set up a shared database?

A database is probably many data scientist’s go-to tool for data storage and access. There are many database options, and discussing the pros and cons of each can fill a semester-long college course. This post will not cover those topics.

Our initial question was: when should we even consider going through the process of setting up a shared database? There’s overhead involved, and our group would either need a spend a fair amount of time getting over the initial learning-curve or spend a fair amount of our limited resources on access to skilled database administrators. None of these hurdles are insurmountable, but we want to make sure our project and data needs are worth those investments.

If a single file can be easily passed around to coworkers, and loaded entirely in memory directly in R, there doesn’t seem to be any reason to consider a shared database. Maybe the data can be logically chunked into several files (or 100’s….or 1,000’s) that make collaborating on the same data easier. What conditions warrant our “R-flat-file-happy” group to consider a database? I asked and got great advice from members of the rOpenSci community. This is what I learned:

“Identify how much CRUD (create, read, update, delete) you need to do over time and how complicated your conceptual model is. If you need people to be interacting and changing data a shared database can help add change tracking and important constraints to data inputs. If you have multiple things that interact like sites, species, field people, measurement classes, complicated requested_date concepts etc then the db can help.” Steph Locke
“One thing to consider is whether the data are updated and how, and by single or multiple processes.” Elin Waring
“I encourage people towards databases when/if they need to make use of all the validation logic you can put into databases. If they just need to query, a pattern I like is to keep the data in a text-based format like CSV/TSV and load the data into sqlite for querying.” Bryce Mecum
“I suppose another criterion is whether multiple people need to access the same data or just a solo user. Concurrency afforded by DBs is nice in that regard.” James Balamuta

All great points! In the majority of our data science projects, the focus is not on creating and maintaining complex data systems…it’s using large amounts of data. Most if not all of that data already come from other databases (usually through web services). So…the big hurdles involved in setting up a shared database for our projects at the moment seems unnecessary.

Flat Files

OK, so we don’t need to buy an Oracle license. We still want to make a smart choice in the way we save and access the data. We usually have one to many file(s) that we share between a few people. So, we’ll want to minimize the file size to reduce that transfer time (we have used Google drive and S3 buckets to store files to share historically). We’d also like to minimize the time to read and write the files. Maintaining attributes such as column types is also ideal.

I will be using a large, wide, data frame to test the data.table, readr, fst, feather, sqlite, and MonetDBLite data import functions.

I also considered the sparklyr and vroom packages. sparklyr looks and sounds like an appealing option especially for “really big data”. However, I was not able to get my standard examples presented here to work. The dependency on a specific version of Java made me nervous (at least, at the time of writing this blog post). So, while it might be an attractive solution, there was a bit too much of a learning curve for the needs of our group. At the time of writing this blog, vroom was still very much in development, and I found there to be a few bugs still being worked out.

The sample data I used was from an Apache server log. The columns are a mix of factors, characters, numerics, dates, and logicals. Keep in mind that your own personal “biggish” data frame and your hardware might have different results. Let’s start by loading the whole file into memory.

biggish <- readRDS("test.rds")

nrow(biggish)

## [1] 3731514

ncol(biggish)

## [1] 38

Read, write, and files size

Using the “biggish” data frame, I’m going to write and read the files completely in memory to start. Because we are often shuffling files around (one person pushes up to an S3 bucket and another pulls them down for example), I also want to compare compressed files vs not compressed when possible.

If you can read in all your data at once, read/write time and file size should be enough to help you choose your file format. There are many instances in our “biggish” data projects that we don’t always need nor want ALL the data ALL the time. I will also compare how long it takes to pull a subset of the data by pulling out a date, numeric, and string, and then do some filtering. Some of the functions to read in data (fst, fread, feather) can read in specific columns without loading the whole file initially. These functions will read and filter/summarize the data much quicker since less data is in memory from the start. The true database options (sqlite, monetDB) will rely on the databases to do the processing outside of R (so, they also will ultimately read in less data).

First, I’ll show individually how to do each of these operations. The end of this post will include a table summarizing all the information. It is generated using the microbenchmark package.

RDS No Compression

We’ll start with the basic R binary file, the “RDS” file. saveRDS has an argument “compress” that defaults to TRUE. Not compressing the files results in a bigger file size, but quicker read and write times.

library(dplyr)

file_name <- "test.rds"
# Write:
saveRDS(biggish, file = file_name, compress = FALSE)
# Read:
rds_df <- readRDS(file_name)

RDS files must be read entirely in memory so the “Read & Filter” and “Read & Group & Summarize” times will be driven by the “Read” timing. However, I will use 2 examples throughout to test the timings. The examples are deliberately set up to test some dplyr basic verbs and various data types, as well as tricky situations like timezones.

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter <- readRDS(file_name) %>%
  filter(bytes > !!min_bytes,
         grepl(!!param_cd, parametercds)) %>%
  select(bytes, requested_date, parametercds)

# Read and Group and Summarize:
read_group_summary <- readRDS(file_name) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > as.POSIXct("2016-10-02 00:00:00", 
                                     tz = "America/New_York")) %>%
  mutate(requested_date = as.Date(requested_date)) %>%
  group_by(.dots = c(group_col, "requested_date")) %>%
  summarize(MB = sum(as.numeric(bytes), na.rm = TRUE)/10^6)

Format	Read	Write	Read & Filter	Read & Group & Summarize
rds	57.1	66.3	61.7	48.7

Timing in seconds.

RDS Compression

file_name <- "test_compressed.rds"
# Write:
saveRDS(biggish, file = file_name, compress = TRUE)
# Read:
rds_compressed_df <- readRDS(file_name)

The “Read and Filter” data files will be the same process as “RDS No Compression”.

Format	Read	Write	Read & Filter	Read & Group & Summarize
rds compression	68.9	80.3	51.8	50.8

Timing in seconds.

readr No Compression

The readr package fits nicely in the tidyverse.

library(readr)
file_name <- "test.csv"
# Write:
write_csv(biggish, path = file_name)
# Read:
readr_df <- read_csv(file_name, progress = FALSE)
attr(readr_df$requested_date, "tzone") <- "America/New_York"

readr includes arguments “col_types” and “col_names” to only load specific columns into memory. This improves the load time if there are many columns that aren’t needed. If there’s a known, continuous set of rows, you can use the arguments “skip” and “n_max” to pull just what you need. However, that is not flexible enough for most of our needs, so I am not including that in this evaluation. Thanks to Jim Hester for clarifying how to use the col_types arguments, especially in the case of using a variable “group_col”.

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_readr <- read_csv(file_name, 
                              progress = FALSE,
                              col_types = cols_only("bytes" = col_integer(),
                                            "requested_date"=col_datetime(),
                                            "parametercds"=col_character())) %>%
  filter(bytes > !!min_bytes,
         grepl(!!param_cd, parametercds)) 

attr(read_filter_readr$requested_date, "tzone") <- "America/New_York"

# Read and Group and Summarize:
read_group_summary_readr <- read_csv(file_name, 
                                     progress = FALSE,
                                     col_types = rlang::list2(bytes = "d",
                                        requested_date = "T",
                                        service = "c",
                                        !!group_col := "c",
                                        .default = col_skip())) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > as.POSIXct("2016-10-02 00:00:00")) %>%
  mutate(requested_date = as.Date(requested_date)) %>%
  group_by(.dots = c(group_col, "requested_date")) %>%
  summarize(MB = sum(as.numeric(bytes), na.rm = TRUE)/10^6)

Format	Read	Write	Read & Filter	Read & Group & Summarize
readr	27.3	109.6	21.3	17.3

Timing in seconds.

readr Compression

library(readr)
file_name <- "test_readr.csv.gz"
# Write:
write_csv(biggish,  path = file_name)
# Read:
readr_compressed_df <- read_csv(file_name, progress = FALSE)

The “Read and Filter” data files will be the same process as “readr No Compression”.

Format	Read	Write	Read & Filter	Read & Group & Summarize
readr compression	105	111.7	111.5	97.9

Timing in seconds.

fread No Compression

library(data.table)
library(fasttime)
file_name <- "test.csv"
# Write:
fwrite(biggish, file = file_name)
# Read:
fread_df <- fread(file_name, 
                  data.table = FALSE, 
                  na.strings = "") %>%
  mutate(requested_date = fastPOSIXct(requested_date, tz = "America/New_York"))

fread includes arguments “select”/“drop” to only load specific columns into memory. This improves the load time if there are many columns that aren’t needed. If there’s a known, continuous set of rows, you can use the arguments “skip” and “nrows” to pull just what you need. However, that is not flexible enough for most of our needs, so I am not including that in this evaluation.

Also, I am keeping this analysis as a “data.frame” (rather than “data.table”) because it is the system our group has decided to stick with.

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_fread <- fread(file_name, na.strings = "",
                            data.table = FALSE,
                            select = c("bytes","requested_date","parametercds")) %>%
  filter(bytes > !!min_bytes,
         grepl(!!param_cd, parametercds)) %>%
  mutate(requested_date = fastPOSIXct(requested_date, tz = "America/New_York"))

# Read and Group and Summarize:
read_group_summary_fread <- fread(file_name,na.strings = "",
                             data.table = FALSE,
                             select = c("bytes","requested_date","service",group_col)) %>%
  mutate(requested_date = fastPOSIXct(requested_date, 
                                      tz = "America/New_York")) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > as.POSIXct("2016-10-02 00:00:00")) %>%
  mutate(requested_date = as.Date(requested_date)) %>%
  group_by(.dots = c(group_col, "requested_date")) %>%
  summarize(MB = sum(as.numeric(bytes), na.rm = TRUE)/10^6)

Format	Read	Write	Read & Filter	Read & Group & Summarize
fread	14.4	3.8	7.9	6.4

Timing in seconds.

Note! I didn’t explore adjusting the nThread argument in fread/fwrite. I also didn’t include a compressed version of fread/fwrite. Our crew is a hodge-podge of Windows, Mac, and Linux, and we try to make our code work on any OS. Many of the solutions for combining compression with data.table functions looked fragile on the different OSes. The data.table package has an open GitHub issue to support compression in the future. It may be worth updating this script once that is added.

feather Compression

library(feather)
file_name <- "test.feather"
# Write:
write_feather(biggish, path = file_name)
# Read:
feather_df <- read_feather(file_name)

read_feather includes an argument “columns” to only load specific columns into memory. This improves the load time if there are many columns that aren’t needed.

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_feather <- read_feather(file_name, 
        columns = c("bytes","requested_date","parametercds")) %>%
  filter(bytes > !!min_bytes,
         grepl(!!param_cd, parametercds))

# Read and Group and Summarize:
read_group_summarize_feather <- read_feather(file_name, 
                      columns = c("bytes","requested_date","service",group_col)) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > as.POSIXct("2016-10-02 00:00:00", 
       tz = "America/New_York")) %>%
  mutate(requested_date = as.Date(requested_date)) %>%
  group_by(.dots = c(group_col, "requested_date")) %>%
  summarize(MB = sum(as.numeric(bytes), na.rm = TRUE)/10^6)

Format	Read	Write	Read & Filter	Read & Group & Summarize
feather	10.6	5.6	2.8	0.8

Timing in seconds.

For the same reason as fread, I didn’t try compressing the feather format. Both data.table and feather have open GitHub issues to support compression in the future. It may be worth updating this script once those features are added.

fst No Compression

library(fst)
file_name <- "test.fst"

# Write:
write_fst(biggish, path = file_name, compress = 0)
# Read:
fst_df <- read_fst(file_name)

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_fst <- read_fst(file_name, 
        columns = c("bytes","requested_date","parametercds")) %>%
  filter(bytes > !!min_bytes,
         grepl(!!param_cd, parametercds)) 

# Read and Group and Summarize:
read_group_summarize_fst <- read_fst(file_name, 
                      columns = c("bytes","requested_date","service",group_col)) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > as.POSIXct("2016-10-02 00:00:00", 
       tz = "America/New_York")) %>%
  mutate(requested_date = as.Date(requested_date)) %>%
  group_by(.dots = c(group_col, "requested_date")) %>%
  summarize(MB = sum(as.numeric(bytes), na.rm = TRUE)/10^6)

Format	Read	Write	Read & Filter	Read & Group & Summarize
fst	6.2	9.6	3.8	0.5

Timing in seconds.

fst Compression

library(fst)
file_name <- "test_compressed.fst"

# Write:
write_fst(biggish, path = file_name, compress = 100)
# Read:
fst_df <- read_fst(file_name)

The “Read and Filter” and “Read and Group and Summarize” retrievals will be the same process as in “fst No Compression”.

Format	Read	Write	Read & Filter	Read & Group & Summarize
fst compression	6.3	19.6	3.3	0.4

Timing in seconds.

SQLite

SQLite does not have a storage class set aside for storing dates and/or times.

library(RSQLite)

file_name <- "test.sqlite"

sqldb <- dbConnect(SQLite(), dbname=file_name)

# Write:
dbWriteTable(sqldb,name =  "test", biggish,
             row.names=FALSE, overwrite=TRUE,
             append=FALSE, field.types=NULL)
# Read:
sqlite_df <- tbl(sqldb,"test") %>% 
  collect() %>%
  mutate(requested_date = as.POSIXct(requested_date, 
                                     tz = "America/New_York",
                                     origin = "1970-01-01"))

Things to notice here, you can’t just use grep.

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_sqlite <- tbl(sqldb,"test") %>% 
  select(bytes, requested_date , parametercds) %>%
  filter(bytes > !!min_bytes,
         parametercds %like% '%00060%') %>%
  collect() %>%
  mutate(requested_date = as.POSIXct(requested_date, 
                                     tz = "America/New_York",
                                     origin = "1970-01-01"))

# Read and Group and Summarize:
filter_time <- as.numeric(as.POSIXct("2016-10-02 00:00:00", tz = "America/New_York"))

read_group_summarize_sqlite <- tbl(sqldb,"test") %>%
  select(bytes, requested_date, service, !!group_col) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)),
         requested_date > !! filter_time) %>%
  mutate(requested_date = strftime('%Y-%m-%d', datetime(requested_date, 'unixepoch'))) %>%
  group_by(!!sym(group_col), requested_date) %>%
  summarize(MB = sum(bytes, na.rm = TRUE)/10^6) %>%
  collect()

dbDisconnect(sqldb)

Format	Read	Write	Read & Filter	Read & Group & Summarize
sqlite	45.3	54.9	14.2	1.6

Timing in seconds.

It is important to note that this is the first “Read and Filter” and “Read and Group and Summarize” solution that is completely done outside of R. So when you are getting data that pushes the limits (or passes the limits) of what you can load directly into R, this is the first basic solution.

MonetDB

library(MonetDBLite)
library(DBI)

file_name <- "test.monet"

con <- dbConnect(MonetDBLite(), dbname = file_name)

# Write:
dbWriteTable(con, name =  "test", biggish,
             row.names=FALSE, overwrite=TRUE,
             append=FALSE, field.types=NULL)
# Read:
monet_df <- dbReadTable(con, "test")
attr(monet_df$requested_date, "tzone") <- "America/New_York"

min_bytes <- 100000
param_cd <- "00060"
group_col <- "statecd"
service <- "dv"

# Read and Filter:
read_filter_monet <- tbl(con,"test") %>% 
  select(bytes, requested_date , parametercds) %>%
  filter(bytes > !!min_bytes,
         parametercds %like% '%00060%') %>%
  collect() 

attr(read_filter_monet$requested_date, "tzone") <- "America/New_York"

# Read and Group and Summarize:
# MonetDB needs the time in UTC, formatted exactly as:
# 'YYYY-mm-dd HH:MM:SS', hence the last "format" commend:
filter_time <- as.POSIXct("2016-10-02 00:00:00", tz = "America/New_York")
attr(filter_time, "tzone") <- "UTC"
filter_time <- format(filter_time)

read_group_summarize_monet <- tbl(con,"test") %>%
  select(bytes, requested_date, service, !!group_col) %>%
  filter(service == !!service, 
         !is.na(!!sym(group_col)), 
         requested_date > !! filter_time) %>% 
  mutate(requested_date = str_to_date(timestamp_to_str(requested_date, '%Y-%m-%d'),'%Y-%m-%d')) %>%
  group_by(!!sym(group_col), requested_date) %>%
  summarize(MB = sum(bytes, na.rm = TRUE)/10^6) %>%
  collect()

dbDisconnect(con, shutdown=TRUE)

Format	Read	Write	Read & Filter	Read & Group & Summarize
MonetDB	8	58.1	1.7	1.3

Timing in seconds.

Again, it is important to note that the “Read and Filter” and “Read and Group and Summarize” solutions are completely done outside of R. So when you are getting data that pushes the limits (or passes the limits) of what you can load directly into R, this is another good solution. There also appears to be a lot more flexibility in using date/times directly in MonetDB compared to SQLite.

Comparison

		Read Time (sec)
File Format	Read Method	Complete	Filter	Group & Summarize	Write Time (sec)	File Size (MB)
rds	bulk	57	62	48.7	66	1281
rds compressed	bulk	69	52	50.8	80	55
csv (readr)	limited partial	27	21	17.3	110	704
csv.gz (readr)	limited partial	105	112	97.9	112	66
csv (fread)	limited partial	14	8	6.4	4	504
feather	limited partial	11	3	0.8	6	818
fst	limited partial	6	4	0.5	10	989
fst compressed	limited partial	6	3	0.4	20	123
sqlite	partial	45	14	1.6	55	464
monetDB	partial	8	2	1.3	58	360

Note that sqlite and MonetDB are the only formats here that allow careful filtering and calculate summaries without loading the whole data set (classified as a “partial” read method). So if our “pretty big data” gets “really big”, those will formats will rise to the top. If you can read in all the rows without crashing R, readr, fread, feather, and fst are fast!

Another consideration, who are your collaborators? If everyone’s using R exclusively, this table on its own is a fine way to judge what format to pick. If your collaborators are half R, half Python…you might favor feather since that format works well in both systems.

Collecting this information has been a very useful activity for helping me understand the various options for saving and reading data in R. I tried to pick somewhat complex queries to test out the capabilities, but I acknowledge I’m not an expert on databases and file formats. I would be very happy to hear more efficient ways to perform these analyses.

Disclaimer

Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Keywords:

Tags:

The Hydro Network-Linked Data Index
November 2, 2020
Introduction updated 11-2-2020 after updates described here . updated 9-20-2024 when the NLDI moved from labs.waterdata.usgs.gov to api.water.usgs.gov/nldi/ The Hydro Network-Linked Data Index (NLDI) is a system that can index data to NHDPlus V2 catchments and offers a search service to discover indexed information.
Read More...
Reproducible Data Science in R: Flexible functions using tidy evaluation
December 17, 2024
Overview This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.
Read More...
Mapping water insecurity in R with tidycensus
December 9, 2024
Water insecurity can be influenced by number of social vulnerability indicators—from demographic characteristics to living conditions and socioeconomic status —that vary spatially across the U.
Read More...
Reproducible Data Science in R: Iterate, don't duplicate
July 18, 2024
Overview This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.
Read More...
Reproducible Data Science in R: Writing better functions
June 17, 2024
Overview This blog post is part of a series that works up from functional programming foundations through the use of the targets R package to create efficient, reproducible data workflows.
Read More...

Working with pretty big data in R

Exploring file format options in R.

Introduction

TLDR!

Shared Database

Flat Files

Read, write, and files size

RDS No Compression

RDS Compression

readr No Compression

readr Compression

fread No Compression

feather Compression

fst No Compression

fst Compression

SQLite

MonetDB

Comparison

Disclaimer

Categories:

Keywords:

Tags:

Share:

Related Posts

The Hydro Network-Linked Data Index

Reproducible Data Science in R: Flexible functions using tidy evaluation

Mapping water insecurity in R with tidycensus

Reproducible Data Science in R: Iterate, don't duplicate

Reproducible Data Science in R: Writing better functions