How to load data quickly into R?

I have some R scripts where I have to load several data frames into R as quickly as possible. This is quite important, as reading the data is the slowest part of the procedure, e.g. when plotting from different data frames. I get the data in sav (SPSS) format, but I could transform it to any suggested format. Unfortunately, merging the data frames is not an option.
What could be the fastest way to load the data? I was thinking of the following:
transform from sav to a binary R object (RData) the first time, and later always load this, as it seems a lot quicker than read.spss.
transform from sav to csv files and read the data from those with the parameters discussed in this topic,
or is it worth setting up a MySQL backend on localhost and loading the data from that? Could it be faster? If so, can I also save any custom attr values of the variables (e.g. variable.labels from SPSS-imported files)? Or should this be done in a separate table?
Any other thoughts are welcome. Thank you for every suggestion in advance!
I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution that loads only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, which is why I made (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development; any recommendation is welcome!
I will soon post a vignette with accurate benchmark results, with the help of the microbenchmark package.

Thank you all for the tips and answers, I did some summary and experiment based on that.
See a little test with a public database (ESS 2008 in Hungary) below. The database has 1508 cases and 508 variables, so it could count as mid-sized data. That might be a good example to run the test on (for me), but of course special needs would require an experiment with adequate data.
Reading the data from SPSS sav file without any modification:
> system.time(data <- read.spss('ESS_HUN_4.sav'))
user system elapsed
2.214 0.030 2.376
Loading with a converted binary object:
> save('data',file='ESS_HUN_4.Rdata')
> system.time(data.Rdata <- load('ESS_HUN_4.Rdata'))
user system elapsed
0.28 0.00 0.28
Trying with csv:
> write.table(data, file="ESS_HUN_4.csv")
> system.time(data.csv <- read.csv('ESS_HUN_4.csv'))
user system elapsed
1.730 0.010 1.824
Trying with "fine-tuned" csv loading:
> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=","))
user system elapsed
1.296 0.014 1.362
Also with package sqldf, which seems to load csv files a lot faster:
> library(sqldf)
> f <- file("ESS_HUN_4.csv")
> system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t")))
user system elapsed
0.939 0.106 1.071
And also loading the data from a MySQL database running on localhost:
> library(RMySQL)
> con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='')
> dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE)
> system.time(data <- dbReadTable(con, 'data'))
user system elapsed
0.583 0.026 1.055
> query <- 'SELECT * FROM data'
> system.time(data.sql <- dbGetQuery(con, query))
user system elapsed
0.270 0.020 0.473
Here, I think we should add up the two system.time results reported, as connecting to the database also counts in our case. Please comment if I misunderstood something.
But let us see what happens when querying only some variables: e.g. while plotting, we do not need the whole data frame in most cases, and querying only two variables is enough to create a nice plot of them:
> query <- 'SELECT c1, c19 FROM data'
> system.time(data.sql <- dbGetQuery(con, query))
user system elapsed
0.030 0.000 0.112
Which seems really great! Of course, this only works right after loading the table with dbReadTable.
Summary: nothing beats reading the whole dataset from a binary file, but reading only a few columns (or other filtered data) from the same database table might also be worth considering in some special cases.
Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD.
UPDATE (24/01/2011): I added a rather hackish, but quite "creative", way of loading only a few columns of a binary object - which looks a lot faster than any method examined above.
Be aware: the code will look really bad, but still very effective :)
First, I save all columns of a data.frame into different binary objects via the following loop:
attach(data)
for (i in 1:length(data)) {
  save(list = names(data)[i],
       file = paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep = ''))
}
detach(data)
And then I load two columns of the data:
> system.time(load('ESS_HUN_4-c19.Rdata')) +
> system.time(load('ESS_HUN_4-c1.Rdata')) +
> system.time(data.c1_c19 <- cbind(c1, c19))
user system elapsed
0.003 0.000 0.002
Which looks like a "superfast" method! :) Note: the data loaded about 100 times faster than with the fastest method above (loading the whole binary object).
I have made up a very tiny package (named: saves), look in github for more details if interested.
UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - if the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources or the one on my homepage for details, and here is a nice boxplot of some benchmarks:
This boxplot shows the benefit of using the saves package to load only a subset of variables, compared to load and read.table or read.csv from base, read.spss from foreign, or the sqldf and RMySQL packages.

It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting factor here is the speed of your hard drive, not R. The binary form is the internal representation of the data frame in the workspace, so no transformation is needed anymore.
Any kind of text file is a different story, as it invariably includes overhead: each time you read in the text file, the data has to be transformed into the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.
Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load the whole dataset before you can start selecting rows/columns can gain you quite some time. But this only pays off with huge datasets, as reading a binary file is quite a bit faster than searching a database.
If the data is not too big, you can save different data frames in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of data frames in a list or in a separate environment (see also ?environment for some simple examples). This allows for lapply / eapply solutions to process multiple data frames at once.
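A minimal sketch of the environment approach (the data frame names and contents are made up for illustration):

```r
# Keep several data frames together in a dedicated environment
datasets <- new.env()
datasets$hun <- data.frame(id = 1:3, value = c(0.1, 0.5, 0.9))
datasets$cze <- data.frame(id = 4:6, value = c(0.2, 0.4, 0.8))

# Process all of them at once with eapply
row.counts <- eapply(datasets, nrow)

# Save the whole collection into one RData file and restore it later
f <- tempfile(fileext = ".Rdata")
save(list = ls(datasets), envir = datasets, file = f)
restored <- new.env()
load(f, envir = restored)
```

Note that load() with an envir argument restores the objects into that environment instead of the global workspace, so several such collections can coexist.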

If it's at all possible, have the data transformed into csv or another "simple" format to make reading as fast as possible (see Joris' answer). I import csv files en masse with lapply, something along the lines of:
list.of.files <- list.files("your dir", full.names = TRUE)
my.data <- lapply(list.of.files, FUN = function(x) {
  read.table(x) # or some other function, like read.spss
})

I am pretty happy with RMySQL. I am not sure whether I understood your question the right way, but labels should not be a problem. There are several convenience functions that just use the default SQL table and row names, but of course you can use some SQL statements.
I would say (apart from large datasets that justify the hassle) one of the main reasons to use RMySQL is being more familiar with SQL syntax than with R's data-juggling functions. Personally, I prefer GROUP BY over aggregate. Note that using stored procedures from inside R does not work particularly well.
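For comparison, here is the same summary written both ways; the data frame and the SQL table/column names are made up for illustration:

```r
# A tiny example table
d <- data.frame(type = c("a", "a", "b"), energy = c(1, 2, 5))

# Base-R aggregation...
sums <- aggregate(energy ~ type, data = d, FUN = sum)

# ...is equivalent to this query on a MySQL backend:
#   SELECT type, SUM(energy) FROM mytable GROUP BY type
```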
Bottom line... setting up a MySQL localhost is not too much effort – give it a try! I cannot tell exactly about the speed, but I have the feeling there's a chance it's faster. However, I will try and get back here.
EDIT: here's the test... and the winner is: spacedman
# SQL connection
source("lib/connect.R")
dbQuery <- "SELECT * FROM mytable"
mydata <- dbGetQuery(con,dbQuery)
system.time(dbGetQuery(con,dbQuery))
# returns
#user system elapsed
# 0.999 0.213 1.715
save.image(file="speedtest.Rdata")
system.time(load("speedtest.Rdata"))
#user system elapsed
#0.348 0.006 0.358
File size was only about 1 MB here. MacBook Pro, 4 GB RAM, 2.4 GHz Intel Core Duo, Mac OS X 10.6.4, MySQL 5.0.41.
I just never tried that before, because I usually work with bigger datasets, and loading is not the issue, rather processing... if there are time issues at all. +1 for the Q!

Related

How to batch read 2.8 GB gzipped (40 GB TSVs) files into R?

I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of 1 column, and combine into one data frame.
I've read through several answers here, but none seem to work; I suspect that they are not meant to handle this much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried was read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
# Generate the paths to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))

# Define a function to filter data as it comes in.
call_back <- function(x, pos) {
  unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}

raw_data <- files %>%
  map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
                         chunk_size = 5000)) %>%
  reduce(rbind) %>%
  as_tibble()
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
# Generate the paths to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))

bind_rows(map(str_c("gunzip -c ", files), fread))
That looked like it started working, but then it locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criteria for the one column.
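For reference, select can be passed to fread() inside the loop directly; this is a sketch with tiny toy files standing in for the real TSVs (for gzipped input, fread's cmd argument, e.g. fread(cmd = paste("gunzip -c", f), select = ...), works the same way on systems where gunzip is available):

```r
library(data.table)

# Write two small example TSVs to stand in for the real files
f1 <- tempfile(fileext = ".tsv")
f2 <- tempfile(fileext = ".tsv")
fwrite(data.table(date = "2019-01-01", ip = "1.2.3.4",
                  type = c("purchase", "view"), category = "a"),
       f1, sep = "\t")
fwrite(data.table(date = "2019-01-02", ip = "5.6.7.8",
                  type = "purchase", category = "b"),
       f2, sep = "\t")

# Read only the needed columns, filter each file, then bind once
wanted <- lapply(c(f1, f2), function(f)
  fread(f, select = c("date", "ip", "type", "category"))[type == "purchase"])
result <- rbindlist(wanted)
```

Filtering per file before the single rbindlist keeps the peak memory at one file's worth of matching rows, rather than the whole 40 GB.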
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your RStudio instance and do a lot there. If you need further computing capacity, have a look at a hosted option such as AWS.
Have a read of this: https://spark.rstudio.com/
One technical point: there is a sparklyr function, spark_read_text, which will read delimited text files directly into the Spark instance. It's very useful.
From there you can use dplyr to manipulate your data. Good luck!
First, if base read.table is used, you don't need to gunzip anything, as it reads gzipped files directly through a compressed connection (via zlib). read.table also works much faster if the colClasses parameter is specified.
You might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still, it will help to have a machine with lots of fast virtual memory. I often work with datasets of this order, and I sometimes find an Ubuntu system short on memory, even if it has 32 cores. I have an alternative system where I have convinced the OS that an SSD is part of its memory, giving me an effective 64 GB of RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read using read.table, it's pretty compressed in memory, approaching what gzip delivers. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint and save.image at spots in between.
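The two points above can be illustrated on a toy file (file contents and column types are made up): read.table opens .gz files transparently, and colClasses skips the type-guessing pass.

```r
# Write a small gzipped TSV to stand in for one of the real files
gzf <- tempfile(fileext = ".tsv.gz")
con <- gzfile(gzf, "w")
write.table(data.frame(date = "2019-01-01", ip = "1.2.3.4",
                       type = "purchase", category = "a"),
            con, sep = "\t", row.names = FALSE)
close(con)

# read.table decompresses on the fly; no manual gunzip needed
d <- read.table(gzf, header = TRUE, sep = "\t",
                colClasses = rep("character", 4))
```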

Reading Large Files into Data Frames in R - Issues with sqldf

I am trying to read in and manipulate data that I have stored in large data sets. Each file is about 5 GB. I mostly need to be able to grab chunks of specific data out of these data sets. I also have a similar 38 MB file that I use for testing. I initially used read.table to read in chunks of the file using nrows and skip. However, this process takes a huge amount of time, because skipping an increasing number of rows is time consuming. Here is the code I had:
numskip = 0  # how many lines in the file to skip
cur_list = read.table("file.txt", header = TRUE, sep = ',', nrows = 200000,
                      skip = numskip, col.names = col)  # col is a vector of column names
I set this up in a while loop, increasing numskip to grab the next chunk of data, but as numskip increased, the process slowed significantly.
I briefly tried using readLines to read in data line by line, but a few threads pointed me towards the sqldf package. I wrote the following bit of code:
library(sqldf)
f = file("bigfile.txt")
dataset = sqldf("select * from f where CustomerID = 7127382") #example of what I would like to be able to grab
From what I understand, sqldf will allow me to use SQL queries to return subsets of the data from the database without R doing any of the processing, provided that the subset isn't too big for R to handle.
The problem is that my 4 GB machine runs out of memory when I run the large files (though not the smaller test file). I find this odd, because I know that SQLite can handle files much larger than 5 GB, and R shouldn't be doing any of the processing. Would using PostgreSQL help? Do I just need a better machine with more RAM? Should I give up on sqldf and find a different way to do this?
To wrap this up, here's an example of the data I am working with:
"Project" "CustomerID" "Stamp" "UsagePoint" "UsagePointType" "Energy"
21 110981 YY 40 Red 0.17
21 110431 YY 40 Blue 0.19
22 120392 YY 40 Blue 0.20
22 210325 YY 40 Red 0.12
Thanks
Have you tried
dat <- read.csv.sql(file = "file.txt", "select * from file where CustomerID = 7127382")
You're right about sqldf, and there are a ton of other great big-data tools in R, including bigmemory.
Conversions to csv or json can help (use RJSONIO), and you can also first load your data into a relational, NoSQL, Hadoop, or Hive database and read it in via RODBC, which is what I'd highly recommend in your case.
Also see fread in data.table and the CRAN High-Performance Computing task view.

Can I cache data loading in R?

I'm working on an R script which has to load data (obviously). The data loading takes a lot of effort (500 MB), and I wonder if I can avoid going through the loading step every time I rerun the script, which I do a lot during development.
I appreciate that I could do the whole thing in the interactive R session, but developing multi-line functions is just so much less convenient on the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script such that on subsequent executions, d is already available? Is there something like a cache=TRUE statement, as in R Markdown code chunks?
Sort of. There are a few answers:
Use a faster csv reader: fread() in the data.table package is beloved by many. Your time may come down to a second or two.
Similarly, read once as csv and then write in compact binary form via saveRDS() so that next time you can do readRDS() which will be faster as you do not have to load and parse the data again.
Don't read the data but memory-map it via the mmap package. That is more involved but likely very fast. Databases use such a technique internally.
Load on demand: e.g. the SOAR package is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your dataset with your R session at the end. Many of us advise against that, as clearly reproducible scripts which make the loading explicit are preferable in our view -- but R can help via the load() / save() mechanism (which loads several objects at once, where saveRDS() / readRDS() work on a single object).
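A sketch of the read-once pattern from the second bullet, with a toy csv standing in for the real 500 MB file:

```r
# Toy stand-in for the real file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = letters[1:5]), csv_file, row.names = FALSE)
rds_file <- sub("\\.csv$", ".rds", csv_file)

# Parse the csv only when no cached binary copy exists yet
if (file.exists(rds_file)) {
  d <- readRDS(rds_file)   # fast binary load on later runs
} else {
  d <- read.csv(csv_file)
  saveRDS(d, rds_file)     # cache for the next run
}
```

On the first run the else branch pays the parsing cost once; every later run takes the readRDS() branch.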
The R.cache package can also help:
library(R.cache)
library(WDI)  # World Development Indicators, the source of the cached data

start_year <- 2000
end_year <- 2013
brics_countries <- c("BR", "RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
            "GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")

key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
  brics_data <- WDI(country = brics_countries, indicator = indics,
                    start = start_year, end = end_year, extra = FALSE, cache = NULL)
  saveCache(brics_data, key = key, comment = "brics_data")
}
I use exists to check if the object is present and load conditionally, i.e.:
if (!exists("d")) {
  d <- read.csv("large.csv", header = TRUE)
  # Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just use rm(d) before sourcing. Just be careful not to use an object name that is already used elsewhere, otherwise the check will pick that up and not load.
I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS() or qs() from the 'qs' (quick serialization) package. My package, 'mustashe', uses qs() for reading and writing files, so you could just use mustashe::stash(), too.

read.csv is extremely slow in reading csv files with large numbers of columns

I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv, it turns out to be extremely slow. It is also very slow when I add the parameter nrows=100. I wonder if there is a way to accelerate read.csv, or use some other function instead to load the file into memory as a matrix or data.frame?
Thanks in advance.
If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:
‘read.table’ is not the right tool for reading large matrices,
especially those with many columns: it is designed to read _data
frames_ which may have columns of very different classes. Use
‘scan’ instead for matrices.
Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
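Both suggestions can be sketched on a toy file (3 x 4 integers with a header row, standing in for the real 40000 x 8000 matrix):

```r
# Write a small all-integer csv with a header
f <- tempfile(fileext = ".csv")
write.csv(matrix(0:11, nrow = 3, dimnames = list(NULL, paste0("c", 1:4))),
          f, row.names = FALSE)

# scan() skips the header and returns a flat integer vector,
# which we reshape row by row into a matrix
m <- matrix(scan(f, what = integer(), skip = 1, sep = ","),
            ncol = 4, byrow = TRUE)

# read.csv alternative: declaring colClasses avoids the type-guessing pass
d <- read.csv(f, colClasses = rep("integer", 4))
```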
Try using data.table::fread(). This is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.
library(data.table)
data <- fread("c:/data.csv")
If you want to make it even faster, you can also read only the subset of columns you want to use:
data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
Also try Hadley Wickham's readr package:
library(readr)
data <- read_csv("file.csv")
If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.
...You can then load it in with the (surprise!) load function.
d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)
# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec
# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','),
ncol=1000, byrow=TRUE) ) # 0.55 sec
# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
It might be worth trying the new vroom package:
vroom is a new approach to reading delimited and fixed width data into R.
It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.
Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.
This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
df <- vroom('example.csv')
Benchmark: readr vs data.table vs vroom for a 1.57GB file

Is there a built-in function for sampling a large delimited data set?

I have a few large data files I'd like to sample when loading into R. I can load the entire dataset, but it's really too large to work with. sample does roughly the right thing, but I'd like to take random samples of the input while reading it.
I can imagine how to build that with a loop and readLines and what-not, but surely this has been done hundreds of times.
Is there something in CRAN or even base that can do this?
You can do that in one line of code using sqldf. See part 6e of example 6 on the sqldf home page.
No pre-built facilities. The best approach would be to use a database management program. (Seems as though this was addressed on either SO or R-help in the last week.)
Take a look at: Read csv from specific row, and especially note Grothendieck's comments. I consider him a "class A wizaRd". He has first-hand experience with sqldf. (The author, IIRC.)
And another "huge files" problem with a Grothendieck solution that succeeded:
R: how to rbind two huge data-frames without running out of memory
I wrote the following function that does close to what I want:
readBigBz2 <- function(fn, sample_size = 1000) {
  f <- bzfile(fn, "r")
  rv <- c()
  repeat {
    lines <- readLines(f, sample_size)
    if (length(lines) == 0) break
    rv <- append(rv, sample(lines, 1))
  }
  close(f)
  rv
}
I may want to go with sqldf in the long-term, but this is a pretty efficient way of sampling the file itself. I just don't quite know how to wrap that around a connection for read.csv or similar.
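For reference, a base-R reservoir sample gives every line the same inclusion probability in a single pass, regardless of file length; this is a sketch (the function name is mine), and the sampled lines can afterwards be handed to read.csv(text = ...) for parsing:

```r
# Reservoir sampling over a connection: after the pass, each line of the
# input has probability k / n of being in the sample
sample_lines <- function(con, k = 1000) {
  reservoir <- character(0)
  n <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break
    n <- n + 1
    if (n <= k) {
      reservoir[n] <- line
    } else {
      j <- sample.int(n, 1)
      if (j <= k) reservoir[j] <- line  # replace a random slot
    }
  }
  reservoir
}

# Works with any connection, e.g. con <- bzfile("big.csv.bz2", "r")
```

Reading one line per iteration keeps the sketch short; buffering with readLines(con, 10000) per pass would be faster in practice.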
