Package a large data set (R)

Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?

I think I would store the data in inst/extdata. Then create a couple of functions in your package that read and return parts of that data. In your functions you can get the path to your data using system.file("extdata", "yourfile", package = "yourpackage") (as on the page you linked to).
The question then is in what format you store your data and how you obtain selections from it without reading the data into memory. For that, there are a number of options. To name some:
sqlite: store your data in an SQLite database. You can then perform queries on this data using the RSQLite package (a minimal sketch follows this list).
ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors), and in theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
CSV: store your data in a plain old CSV file. You can then make selections from this file using the LaF package. The performance will probably be lower than with ff but might be good enough.
RDS: store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any other R packages, and it is fast. The disadvantage is that you cannot do row selections (but it seems you do not need those).
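For instance, the SQLite option could look roughly like this (a sketch; mydata.sqlite, the table name mydata, my_big_data_frame and yourpackage are placeholders):
library(DBI)
library(RSQLite)

# One-time step: write the data frame into an SQLite file under inst/extdata
con <- dbConnect(SQLite(), "inst/extdata/mydata.sqlite")
dbWriteTable(con, "mydata", my_big_data_frame)
dbDisconnect(con)

# Inside the package: read only the columns you need
load_columns <- function(columns) {
  db <- system.file("extdata", "mydata.sqlite", package = "yourpackage")
  con <- dbConnect(SQLite(), db)
  on.exit(dbDisconnect(con))
  dbGetQuery(con, sprintf("SELECT %s FROM mydata", paste(columns, collapse = ", ")))
}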
If you only want to select columns, I would go with RDS.
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}
store_data <- function(package, name, data) {
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. by deleting it), you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
This loads the Sepal.Width column of the iris data set.
Of course, this is a very simple implementation of load_data: there is no error handling, it assumes all columns exist, and it does not know which columns or which data sets exist.
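If needed, the package could also grow small helpers that discover what is available on disk; a rough sketch along the same lines, using the layout created by store_data above:
list_datasets <- function() {
  list.dirs(system.file("extdata", package = "lazyload"),
            full.names = FALSE, recursive = FALSE)
}
list_columns <- function(dataset) {
  dir <- system.file("extdata", dataset, package = "lazyload")
  sub("\\.RDS$", "", list.files(dir, pattern = "\\.RDS$"))
}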

Related

Export dataset from R to REDCap?

I want to export a dataset containing personal IDs and other variables to REDCap. Does anyone know how to do it?
I have found a package called REDCapR, but it seems to cover only importing data into R.
REDCapR::redcap_write() moves info from a data.frame on your local R instance to the remote REDCap Server.
# Read the dataset for the first time.
result_read1 <- REDCapR::redcap_read_oneshot(redcap_uri=uri, token=token)
ds1 <- result_read1$data
ds1$telephone
# Manipulate a field in the dataset in a VALID way
ds1$telephone <- paste0("(405) 321-000", seq_len(nrow(ds1)))
ds1 <- ds1[1:3, ]
ds1$age <- NULL; ds1$bmi <- NULL # Drop the calculated fields before writing.
# Upload the data to the server.
result_write <- REDCapR::redcap_write(ds1, redcap_uri=uri, token=token)
Documentation and references:
The function's page in the reference manual: https://ouhscbbmc.github.io/REDCapR/reference/redcap_write.html
Troubleshooting document: https://ouhscbbmc.github.io/REDCapR/articles/TroubleshootingApiCalls.html#writing
Other vignettes that describe global approaches to working with R & REDCap: https://ouhscbbmc.github.io/REDCapR/articles/index.html
(I'm the primary REDCapR developer. The REDCapR::redcap_write() function was one of the first functions added, back in 2013.)

Importing very large dataset into h2o from sqlite

I have a database of about 500 GB. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables joined into one data frame in h2o, as they are needed for running a prediction with an XGB model. I have tried converting the individual SQL tables into the h2o environment using as.h2o, and h2o.cbind-ing them 2 or 3 tables at a time until they form one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables.
Is there a way around this?
My machine specs are 124 GB RAM, RHEL 7.8, a 1 TB root partition, a 600 GB home partition, and a 2 TB external HDD.
The model is run on this local machine and the max_mem_size is set at 100G. The details of the code are below.
library(data.table)
library(h2o)
h2o.init(
nthreads=14,
max_mem_size = "100G")
h2o.removeAll()
setwd("/home/stan/Documents/LUR/era_aq")
l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <-h2o.cbind(l1.hex,l2.hex[,-1])
h2o.rm (l1.hex,l2.hex)
l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <-h2o.cbind(l3.hex,l4.hex[,-1])
h2o.rm(l3.hex,l4.hex)
l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <-h2o.cbind(l5.hex,l6.hex[,-1])
h2o.rm(l5.hex,l6.hex)
l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <-h2o.cbind(l7.hex,l8.hex[,-1])
h2o.rm(ll7.hex,l8.hex)
test.hex <-h2o.cbind(test_l1.hex,test_l2.hex[,-1],test_l3.hex[,-1],test_l4.hex[,-1])
test <- test.hex[,-1]
test[1:3,]
First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So the better way is to export directly from sqlite to a CSV file, and then use h2o.importFile() to load that CSV file into H2O.
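A rough sketch of that workflow (assuming a database file mydb.sqlite and a table named d2; both names are placeholders):
# Export one table straight to CSV on disk via the sqlite3 command-line client:
system('sqlite3 -header -csv mydb.sqlite "SELECT * FROM d2;" > d2.csv')
# Let H2O stream the CSV from disk, bypassing R's memory entirely:
l1.hex <- h2o.importFile("d2.csv")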
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of import, it might be more efficient. A quick search found csvkit, but I'm not sure if it needs to load the files into memory, or can do work with the files completely on disk.
Since memory is at a premium and R works entirely in RAM, avoid storing large helper data.table and h2o objects in your global environment. Consider wrapping the build steps in a function so that temporary objects are removed when the function goes out of scope. Ideally, build your h2o objects directly from the file source:
# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[-1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[-1])
# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or even combine both lines with a named function, as opposed to an anonymous one. Then only the final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[-1]
# build_h2o <- function(f) h2o.importFile(f)[-1]
test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend the function with an if clause to distinguish datasets that need their first column dropped from those that retain it:
build_h2o <- function(f) {
  # fread(f, drop = 1) drops the first column (fread(f)[-1] would drop the first row)
  if (grepl("lai|lu1000|lu250|msl", f)) { tmp <- fread(f, drop = 1) }
  else { tmp <- fread(f) }
  return(as.h2o(tmp))
}
Finally, if possible, do all the column binding in data.table and convert to h2o just once (note that data.table provides rbindlist but no cbindlist, so use do.call with cbind):
final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f, drop = 1)))
test.h2o <- as.h2o(final_dt)
rm(final_dt)
gc()

How to batch read 2.8 GB gzipped (40 GB TSVs) files into R?

I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of 1 column, and combine into one data frame.
I've read through several answers here, but none seem to work—I suspect that they are not meant to handle that much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried was read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
# Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))

# Define a function to filter data as it comes in.
call_back <- function(x, pos) {
  unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}

raw_data <- files %>%
  map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
                         chunk_size = 5000)) %>%
  reduce(rbind) %>%
  as_tibble() # %>%
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
# Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))
bind_rows(map(str_c("gunzip -c ", files), fread))
That looked like it started working, but then locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criterion for the one column.
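For reference, passing select = and the filter inside map() might look roughly like this (a sketch, untested against the original data; the column names come from the description above):
raw_data <- files %>%
  map(~ fread(cmd = str_c("gunzip -c ", .x),
              select = c("date", "ip", "type", "category"))[type == "purchase"]) %>%
  rbindlist() %>%
  unique()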
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your R Studio instance and do a lot there. If you need further computing capacity have a look at a hosted option such as AWS.
Have a read of this https://spark.rstudio.com/
One technical point: sparklyr provides functions such as spark_read_csv() (for delimited files) and spark_read_text() that read text files directly into the Spark instance. They're very useful.
From there you can use dplyr to manipulate your data. Good luck!
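A minimal sketch of that workflow, assuming a local Spark installation and the gzipped TSVs from the question (the path pattern and filter value are illustrative):
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Spark reads gzipped text files transparently
logs <- spark_read_csv(sc, name = "logs", path = "import/*.tsv.gz",
                       delimiter = "\t", header = TRUE)

purchases <- logs %>%
  filter(type == "purchase") %>%
  distinct() %>%
  collect()   # bring only the filtered result back into R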
First, if base read.table is used, you don't need to gunzip anything, as it uses zlib to read the compressed files directly. read.table also works much faster if the colClasses parameter is specified.
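For example, a sketch assuming a header row and the four columns described in the question (the file name is a placeholder):
cols <- c(date = "character", ip = "character", type = "character", category = "character")
one_file <- read.table("import/file01.tsv.gz", sep = "\t", header = TRUE,
                       colClasses = cols, quote = "", comment.char = "")
one_file <- one_file[one_file$type == "purchase", ]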
You might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still, it will help to have a machine with lots of fast virtual memory. I often work with datasets of this order, and I sometimes find an Ubuntu system wanting for memory, even if it has 32 cores. I have an alternative system where I have convinced the OS that an SSD is part of its memory, giving me an effective 64 GB of RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read using read.table, it's pretty compressed, approaching what gzip delivers. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint and save.image at spots in between.

Load package data under a separate name

I want to load data from different packages but assign each data set to a separate object. Some packages have data sets with the same name, and I want to load both as separate objects. For example:
data("milk", package = "EMSC")
data("milk", package = "baseline")
But the latter will replace the former. So I want to assign them to objects such as milk.emsc and milk.baseline.
Is there an efficient and simple solution for this?
Since I came back to this question after a long time, I will write down the answer I came up with, in case someone has the same problem.
local({
  data("milk", package = "baseline", envir = environment())
  assign(x = "milk_baseline", envir = .GlobalEnv, value = milk)
})
local({
  data("milk", package = "EMSC", envir = environment())
  assign(x = "milk_emsc", envir = .GlobalEnv, value = milk)
})
This way the global environment stays clean and ends up with only the two data sets of the same name from the two different packages.
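A slightly more reusable variant of the same idea (a sketch; the helper name get_pkg_data is made up) wraps the temporary environment in a function:
get_pkg_data <- function(name, package) {
  e <- new.env()
  data(list = name, package = package, envir = e)
  get(name, envir = e)
}

milk_baseline <- get_pkg_data("milk", "baseline")
milk_emsc <- get_pkg_data("milk", "EMSC")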

How can I add blank top rows to an xlsx file using the xlsx package?

I would like to add two blank rows to the top of an xlsx file that I create.
So far I have tried:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
write.xlsx(matrix(rnorm(25),5),fn1)
wb <- loadWorkbook(fn1)
rows <- getRows(getSheets(wb)[[1]])
for (i in 1:length(rows))
  rows[[i]]$setRowNum(as.integer(i + 1))
saveWorkbook(wb,fn2)
But test2.xlsx is empty!
I'm a bit confused by what you're trying to do with the for loop, but:
You could create a dummy object with the same number of columns as your data, then use rbind() to join the dummy and the data before writing fn2. Something along these lines:
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
df <- read.xlsx(fn1, 1)   # read the data back in as a data frame
dummy <- df[c(1, 2), ]
dummy[,] <- NA            # set all values of dummy to whatever you want, e.g. NA or 0
out <- rbind(dummy, df)
write.xlsx(out, fn2, row.names = FALSE)
Hope that helps
So the xlsx package actually interfaces with a java library in order to pull in and modify the workbooks. The read.xlsx and write.xlsx functions are convenience wrappers to read and write data frames within R without having to manually write code to parse the individual cells and rows yourself using the java objects. The loadWorkbook and getRows functions give you access to the actual java objects, which you can then use to modify things such as cell styles.
However, if all you want to do is add a blank row to your spreadsheet before you output it, the easiest way is to add a blank row to your data frame before you export it (as Chris mentioned). You can accomplish this with the following variant on your code:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
# I have added row.names=FALSE because the read.xlsx function
# does not have a parameter to include row names
write.xlsx(matrix(rnorm(25),5),fn1,row.names=FALSE)
# If you read your data back in using the read.xlsx function
# you will have a data frame rather than a series of java object references
wb <- read.xlsx(fn1,1)
# Now that we have a data frame we can add a blank row at the top
wb <- rbind.data.frame(rep("", 5), wb)
# Then we write to the file using the same function as before
write.xlsx(wb,fn2,row.names=FALSE)
If you desire to use the advanced functionality in the java libraries, some of it is not currently implemented in the R package, and thus you will have to call it directly using .jcall. If you decide to pursue this line of action, I definitely recommend using .jmethods on the objects produced (i.e. .jmethods(rows[[1]])), which will list the available functions you can use on the object (at the cell level these are quite extensive).
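For instance, a quick sketch of poking at those Java objects (assumes rJava is attached; getRowNum is a standard Apache POI row method, shown here purely as an illustration):
library(xlsx)
library(rJava)
wb <- loadWorkbook('test1.xlsx')
rows <- getRows(getSheets(wb)[[1]])
.jmethods(rows[[1]])     # list the Java methods available on a row object
rows[[1]]$getRowNum()    # call one of them directly via rJava's $ operator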
