I'm new to R and I have basically no knowledge of data management in this language. I'm using the dynaTrees package to do some machine learning and I'd like to export the model to a file for further use.
The model is obtained by calling the dynaTrees function:
model <- dynaTrees(
  as.matrix(training.data[, -1]),
  as.matrix(training.data[, 1]),
  R = 10
)
I then want to export this model object so it can be loaded in another script later on. I tried the simple:
write(model, file="model.dat")
but that doesn't work (type list not supported).
Is there a generic way (or a dedicated package) in R to export complex data structures to file?
You probably want saveRDS (see ? saveRDS for details). Example:
saveRDS(model, file = "model.Rds")
This saves a single R object to file so that you can restore it later (using readRDS). save is an alternative that is designed for saving multiple R objects (or an entire workspace), which can be accessed later using load.
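Restoring it later is then a one-liner, and you can bind the result to any name you like:

model <- readRDS("model.Rds")

With save/load, by contrast, the object comes back under the name it was saved with:

save(model, file = "model.RData")
rm(model)
load("model.RData")  # 'model' reappears in the environment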
Your intuition was to use write, which is actually a fairly specialized function: it writes an atomic vector or matrix out as text, which is why it choked on your list-valued model ("type list not supported"). Here's an example:
write(as.matrix(warpbreaks[1:3,]), file = stdout())
# 26
# 30
# 54
# A
# A
# A
# L
# L
# L
I have a database of about 500G. It comprises 16 tables, each containing 2 or 3 columns (the first column can be discarded) and 1,375,328,760 rows. I need all the tables joined into one dataframe in h2o, as they are needed for running a prediction in an XGB model. I have tried converting the individual SQL tables into the h2o environment using as.h2o, and h2o.cbind-ing them 2 or 3 tables at a time, until they are one dataset. However, I get "GC overhead limit exceeded: java.lang.OutOfMemoryError" after converting 4 tables.
Is there a way around this?
My machine specs are 124G RAM, RHEL 7.8, a 1TB root partition, a 600G home partition, and a 2TB external HDD.
The model is run on this local machine and max_mem_size is set to 100G. The details of the code are below.
library(data.table)
library(h2o)
h2o.init(
  nthreads = 14,
  max_mem_size = "100G")
h2o.removeAll()
setwd("/home/stan/Documents/LUR/era_aq")
l1.hex <- as.h2o(d2)
l2.hex <- as.h2o(lai)
test_l1.hex <- h2o.cbind(l1.hex, l2.hex[, -1])
h2o.rm(l1.hex, l2.hex)

l3.hex <- as.h2o(lu100)
l4.hex <- as.h2o(lu1000)
test_l2.hex <- h2o.cbind(l3.hex, l4.hex[, -1])
h2o.rm(l3.hex, l4.hex)

l5.hex <- as.h2o(lu1250)
l6.hex <- as.h2o(lu250)
test_l3.hex <- h2o.cbind(l5.hex, l6.hex[, -1])
h2o.rm(l5.hex, l6.hex)

l7.hex <- as.h2o(pbl)
l8.hex <- as.h2o(msl)
test_l4.hex <- h2o.cbind(l7.hex, l8.hex[, -1])
h2o.rm(l7.hex, l8.hex)

test.hex <- h2o.cbind(test_l1.hex, test_l2.hex[, -1], test_l3.hex[, -1], test_l4.hex[, -1])
test <- test.hex[, -1]
test[1:3, ]
First, as Tom says in the comments, you're gonna need a bigger boat. H2O holds all data in memory, and generally you need 3 to 4x the data size to be able to do anything useful with it. A dataset of 500GB means you need the total memory of your cluster to be 1.5-2TB.
(H2O stores the data compressed, and I don't think sqlite does, in which case you might get away with only needing 1TB.)
Second, as.h2o() is an inefficient way to load big datasets. What will happen is your dataset is loaded into R's memory space, then it is saved to a csv file, then that csv file is streamed over TCP/IP to the H2O process.
So the better way is to export directly from sqlite to a csv file, and then use h2o.importFile() to load that csv file into H2O.
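For example (a sketch; the path is illustrative, and it assumes each table was already dumped to csv with the sqlite3 CLI's .mode csv and .output commands):

library(h2o)
h2o.init(max_mem_size = "100G")
# streams the file straight into the H2O cluster without going through R's memory
l1.hex <- h2o.importFile("/home/stan/Documents/LUR/era_aq/d2.csv")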
h2o.cbind() is also going to involve a lot of copying. If you can find a tool or script to column-bind the csv files in advance of import, it might be more efficient. A quick search found csvkit, but I'm not sure if it needs to load the files into memory, or can do work with the files completely on disk.
Since memory is at a premium and R runs everything in RAM, avoid storing large helper data.table and h2o objects in your global environment. Consider wrapping the build step in a function, so that temporary objects are removed when the function goes out of scope. Ideally, you build your h2o objects directly from the file source:
# BUILD LIST OF H2O OBJECTS WITHOUT HELPER COPIES
h2o_list <- lapply(list_of_files, function(f) as.h2o(data.table::fread(f))[-1])
# h2o_list <- lapply(list_of_files, function(f) h2o.importFile(f)[-1])

# CBIND ALL H2O OBJECTS
test.h2o <- do.call(h2o.cbind, h2o_list)
Or combine both steps in a named function, as opposed to an anonymous one; then only the final object remains after processing.
build_h2o <- function(f) as.h2o(data.table::fread(f))[-1]
# build_h2o <- function(f) h2o.importFile(f)[-1]
test.h2o <- do.call(h2o.cbind, lapply(list_of_files, build_h2o))
Extend the function with an if branch for the datasets that need their first column dropped and those that do not:
build_h2o <- function(f) {
  # data.table drops a column with [, -1]; a bare [-1] would drop the first row
  if (grepl("lai|lu1000|lu250|msl", f)) { tmp <- fread(f)[, -1] }
  else { tmp <- fread(f) }
  return(as.h2o(tmp))
}
Finally, if possible, do the column-binding on data.tables and convert to H2O once at the end (data.table has no cbindlist, but do.call with cbind achieves the same):
final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f)[, -1]))
test.h2o <- as.h2o(final_dt)
rm(final_dt)
gc()
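One more memory saver along the same lines: since the first column of every table is discarded anyway, fread's drop argument avoids reading it at all (a small sketch, reusing list_of_files from above):

final_dt <- do.call(cbind, lapply(list_of_files, function(f) fread(f, drop = 1L)))
test.h2o <- as.h2o(final_dt)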
I want to save within a function, using the input object's name as the file name:
saveNew <- function(dat){
  # Collect the original name
  originalName <- deparse(substitute(dat))
  # Do lots of fun and interesting things!
  # Now let's save it; first I have to get it
  newToSave <- get(originalName, envir = .GlobalEnv)
  save(newToSave, file = paste0(originalName, '.Rdata'))
}
But the problem is that when I go to save it, it saves the newly created data as newToSave. This is apparent when loading the newly created object:
load('funData.Rdata') # the loaded object is no longer funData but newToSave
How can I get this function to save the object as funData (as in the example below), and load it back as funData, not newToSave?
Example:
funData <- sample(seq(1,1000,.01))
saveNew(funData)
load("funData.Rdata")
You can use assign to assign dat to originalName:
saveNew <- function(dat){
  # Collect the original name
  originalName <- deparse(substitute(dat))
  # Do lots of fun and interesting things!
  assign(originalName, dat)
  save(list = originalName, file = paste0(originalName, '.Rdata'))
}
# Sample data
funData <- 1:10
# Save
saveNew(funData)
# Remove funData from the current environment
remove(funData)
# Load the RData object
load("funData.RData")
# Confirm that funData is in our current environment
funData
# [1] 1 2 3 4 5 6 7 8 9 10
Note that we need to use save with the list argument to enforce that save writes the value that has been assigned to originalName.
Disclaimer: This isn't really an answer, but as the OP wanted more clarification on the pros and cons of saveRDS, I thought I could put those under an answer. If you consider it should be deleted, please state so in a comment (before downvoting) and I'll be happy to withdraw it.
From ?saveRDS:
Details:
These functions provide the means to save a single R object to a connection (typically a file) and to restore the object, quite possibly under a different name. This differs from ‘save’ and ‘load’, which save and restore one or more named objects into an environment. They are widely used by R itself, for example to store metadata for a package and to store the ‘help.search’ databases: the ‘".rds"’ file extension is most often used.
saveRDS is specifically aimed at saving one object, while save can save one or more. For me, though, the main difference is that save and load bring the object back to life under the same name it had when saved, so a potential drawback is that it can overwrite an object already in the environment, whereas saveRDS and its companion readRDS let you load the saved object under any name you choose.
From ?load:
Warning:
...
‘load()’ replaces all existing objects with the same names in the current environment (typically your workspace, ‘.GlobalEnv’) and hence potentially overwrites important data. It is considerably safer to use ‘envir = ’ to load into a different environment, or to ‘attach(file)’ which ‘load()’s into a new entry in the ‘search’ path.
Consider this:
save(iris, "save_file.rdat")
iris[1, 2] <- 20000 # implement a change to iris
load("save_file.rdat") # overwrites iris
saveRDS(iris, "my_file.RDS")
iris[1, 2] <- 20000 # introduce a change to iris
new_iris <- readRDS("my_file.RDS") # modified-iris is kept. New object is created
I realize this is a pretty basic question, but I want to make sure I do it right, so I wanted to ask just to confirm. I have a vector in one project that I want to be able to use in another project, and I was wondering if there was a simple way to export the vector in a form that I can easily import it to another project.
The way that I figured out how to do it so far is to convert it to a df, then export the df as a csv, then import and unpack it to vector form, but that seems needlessly complicated. It's just a simple numeric vector.
There are a number of ways to read and write data/files in R. For reading, you may want to look at read.table, read.csv, readLines, source, dget, load, unserialize, and readRDS. For writing, you will want to look at write.table, writeLines, dump, dput, save, serialize, and saveRDS.
x <- 1:3
x
# [1] 1 2 3
save(x, file = "myvector.rda")
# Change x to prove a point.
x <- 4:6
x
# [1] 4 5 6
# Better yet, we could remove it entirely
rm(x)
x
# Error: object 'x' not found
# Now load what we saved to get us back to where we started.
load("myvector.rda")
x
# [1] 1 2 3
Alternatively, you can use saveRDS and readRDS -- best practice/convention is to use the .rds extension; note, however, that loading the object is slightly different as saveRDS does not save the object name:
saveRDS(x, file = "myvector_serialized.rds")
x <- readRDS("myvector_serialized.rds")
Finally, saveRDS is a lower-level function and therefore can only save one object at a time. The traditional save approach allows you to save multiple objects at the same time, but can become a nightmare if you re-use the same names in different projects/files/scripts...
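A small sketch of that nightmare (the file names are hypothetical):

results <- lm(mpg ~ wt, data = mtcars)
save(results, file = "projA.RData")
results <- lm(mpg ~ hp, data = mtcars)
save(results, file = "projB.RData")
load("projA.RData")
load("projB.RData")  # silently clobbers the projA version of 'results'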
I agree that saveRDS is a good way to go, but I also recommend the save and save.image functions, which I will demonstrate below.
# save.image
x <- c(5,6,8)
y <- c(8,9,11)
save.image(file="~/vectors.Rdata") # saves all workspace objects
Or, alternatively, choose which objects you want to save:
x <- c(5,6,8)
y <- c(8,9,11)
save(x, y, file="~/vectors.Rdata") # saves only the selected objects
One (minor) advantage of using .Rdata over .Rda is that you can click on the file in the file explorer (e.g. in Windows) and it will be loaded into the R environment. This doesn't work with .Rda files in, say, RStudio on Windows.
I've got a function that has a list output. Every time I run it, I want to export the results with save. After a couple of runs I want to read the files back in and compare the results. I do this because I don't know how many tasks there will be, and maybe I'll use different computers to calculate each task. So how should I name the archived objects so that I can read them all in later?
My best guess would be to dynamically name the variables before saving, and keep track of the object names, but I've read everywhere that this is a big no-no.
So how should I approach this problem?
You might want to use the saveRDS and readRDS functions instead of save and load. The RDS version functions will save and read single objects without the attached name. You would create your object and save it to a file (using paste0 or sprintf to create unique names), then when processing the results you can read in one object at a time, or read several into a list to work with them.
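A minimal sketch of that pattern (run_task and the file naming scheme are illustrative, not from the question):

# On each run/machine: save the result under a unique file name, no object name attached
result <- run_task(task_id)                        # hypothetical task function
saveRDS(result, file = sprintf("result_%03d.rds", task_id))

# Later, when comparing: read every result back into one list
files <- list.files(pattern = "^result_.*\\.rds$")
all_results <- lapply(files, readRDS)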
You can use scope to hide the retrieved name inside a function, so first you might save a list to a file:
mybiglist <- list(fred=1, john='dum di dum', mary=3)
save(mybiglist, file='mybiglist1.RData')
Then you can load it back in through a function and give it whatever name you like, be it inside another list or just a plain object:
# Use the fact that load returns the name of the object loaded
# and that scope will hide this object
myspecialload <- function(RD.fnam) {
  return(eval(parse(text = load(RD.fnam))))
}
# now let's reload that file but put it in another object
mynewbiglist <- myspecialload('mybiglist1.RData')
mynewbiglist
$fred
[1] 1
$john
[1] "dum di dum"
$mary
[1] 3
Note that this is not really a generic 'use it anywhere' type function, as for an RData file with multiple objects it appears to return the last object saved... so best stick with one list object per file for now!
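If you do need to handle a file containing several objects, a variant that loads everything into a throwaway environment and returns a named list is a reasonable sketch:

load_as_list <- function(RD.fnam) {
  e <- new.env()
  nms <- load(RD.fnam, envir = e)  # load returns the names of the restored objects
  mget(nms, envir = e)
}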
One time I was given several RData files, and each contained only one variable, called x. In order to read all of them into my workspace, I loaded each file sequentially into a temporary environment and used get() to read its value.
tenv <- new.env()
load("file_1.RData", envir = tenv)
ls(tenv) # x
myvar1 <- get(ls(tenv), tenv)
rm(tenv)
....
This code can be repeated for each file.
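Or, wrapped up as a loop over all the files (file names illustrative):

files <- sprintf("file_%d.RData", 1:3)
myvars <- lapply(files, function(f) {
  tenv <- new.env()
  load(f, envir = tenv)
  get(ls(tenv), envir = tenv)
})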
I would like to add two blank rows to the top of an xlsx file that I create.
so far I have tried:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
write.xlsx(matrix(rnorm(25),5),fn1)
wb <- loadWorkbook(fn1)
rows <- getRows(getSheets(wb)[[1]])
for (i in 1:length(rows))
  rows[[i]]$setRowNum(as.integer(i + 1))
saveWorkbook(wb,fn2)
But test2.xlsx is empty!
I'm a bit confused by what you're trying to do with the for loop, but:
You could create a dummy object with the same number of columns as your data, then use rbind() to join the dummy rows and the data before writing fn2. Since rbind() works on data frames rather than workbook objects, read the sheet back in with read.xlsx first:
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
dat <- read.xlsx(fn1, 1)   # read the sheet in as a data frame
dummy <- dat[c(1, 2), ]
# set all values of dummy to whatever you want, e.g. NA or 0
dummy[,] <- NA
out <- rbind(dummy, dat)
write.xlsx(out, fn2, row.names = FALSE)
Hope that helps
So the xlsx package actually interfaces with a java library in order to pull in and modify the workbooks. The read.xlsx and write.xlsx functions are convenience wrappers to read and write data frames within R without having to manually write code to parse the individual cells and rows yourself using the java objects. The loadWorkbook and getRows functions give you access to the actual java objects, which you can then use to modify things such as cell styles.
However, if all you want to do is add blank rows at the top of your spreadsheet before you output it, the easiest way is to add them to your data frame before you export it (as Chris mentioned). You can accomplish this with the following variant on your code:
library("xlsx")
fn1 <- 'test1.xlsx'
fn2 <- 'test2.xlsx'
# I have added row.names=FALSE because the read.xlsx function
# does not have a parameter to include row names
write.xlsx(matrix(rnorm(25),5),fn1,row.names=FALSE)
# If you read your data back in using the read.xlsx function
# you will have a data frame rather than a series of java object references
wb <- read.xlsx(fn1,1)
# Now that we have a data frame we can add two blank rows at the top
wb <- rbind.data.frame(rep("", 5), rep("", 5), wb)
# Then we write to the file using the same function as before
write.xlsx(wb,fn2,row.names=FALSE)
If you desire to use the advanced functionality in the java libraries, some of it is not currently implemented in the R package, and thus you will have to call it directly using .jcall. If you decide to pursue this line of action, I definitely recommend using .jmethods on the objects produced (i.e. .jmethods(rows[[1]]) which will list the available functions which you can use on the object (at the cellular level these are quite extensive).
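For example, picking up the objects from the original question (a sketch; it assumes test1.xlsx exists):

library(xlsx)
library(rJava)  # for .jmethods
wb <- loadWorkbook('test1.xlsx')
rows <- getRows(getSheets(wb)[[1]])
# list the java methods available on a row object
.jmethods(rows[[1]])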