saving and loading compressed R object - r

save(something, file="something.RData", compress="xz")
Then, when I load it for reuse:
load("something.RData")
print(something)
Error in print(something) : object 'something' not found
It is a random forest object.
Am I missing the unzip code?

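This works at the console, where the calling environment is the global environment, but not inside a function: by default load() assigns the restored objects into the environment it was called from, so they end up in the calling function's frame rather than in the global environment.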
Two simple alternatives:
Use saveRDS() and readRDS() for single objects.
Create an environment and use it as shown below.
Here is a short example of the second approach:
ne <- new.env()
load(somefile, ne) # now ls(ne) will show what was loaded
foo <- ne$something
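The first approach can be sketched in the same way. The object and file name below are placeholders rather than the asker's random forest; the point is that readRDS() returns the object, so it behaves the same inside a function as at the console.

```r
# Minimal sketch: round-trip a single object with xz compression.
# `something` and the temporary file are stand-ins for the real model and path.
something <- list(ntree = 500, note = "placeholder for a fitted model")
path <- tempfile(fileext = ".rds")

saveRDS(something, file = path, compress = "xz")

reload <- function(p) {
  # readRDS() returns the object instead of assigning it by name,
  # so nothing depends on which environment the caller is in.
  readRDS(p)
}

restored <- reload(path)
identical(restored, something)
```

No separate unzip step is needed: the compression is detected when the file is read.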

Related

``data.table`` object, nested into ``xgboost`` object, doesn't load when saved as ``.rda``

Assume I have an xgboost model that I want to store for later reuse. Additionally, I store extra information in the object for future convenience, one piece of which is the data clause, stored as a data.table, i.e.,
stored_data <- as.data.table(data)
model$data <- stored_data
Saving and loading the model works, i.e.,
save(model, file = 'data/model.rda')
however, the $data is not retained after reattaching the object:
load('data/model.rda')
model$data
R> NULL
On the other hand,
model$data <- as.data.frame(stored_data)
appears to work, but it would be interesting to figure out what is the problem with data.table object (or xgboost object, not sure which causes the problem here).
Side note: I'm sticking to .rda files because I'm using them for lazy loading into a package, i.e.,
.onLoad <- function(lib, pkg){
utils::data(model, package = pkg, envir = parent.env(environment()))
}
It would be nice to use .rds instead of .rda, but I'm not sure how one would do that.
EDIT:
What's weird is that it is NULL when called from within a function. I.e., model$data returns a frame in the global environment, but after debugonce(a_function), model$data returns NULL from within the debugger. The model object is the same; I've checked by adding a timestamp to comment(model).

An error while trying to use glm model for prediction on another computer

I would like to save a glm object on one machine and use it for prediction on another data set, located on another machine that has newer data. I tried to use save and load, but with no success. What am I doing wrong?
Here is a toy example:
# on machine 1:
glm <- glm(y ~ x1 + x2, data = dat1, family = binomial(link = "logit"))
save(glm,file="glm.Rdata") # the file is stored in a folder.
# on machine 2:
load(glm.RData) # got an error:"Error in load(glm.RData) : object 'glm.RData' not found"
#I tried :
load(file='glm.RData') # no error was displayed
print(glm) # got an error:"Error in load(glm.RData) : object 'glm.RData' not found"
Any help will be great.
As per @user3710546's advice, I would avoid saving your model under the name glm, as it'll mask (i.e. block) the glm() function, making it difficult for you to use it in your session.
Using save() and load()
save() is generally used to save several objects to a file, rather than a single object. Besides passing the objects directly, you can name them through its list argument, 'A character vector containing the names of objects to be saved.' So you could use it like this:
# On machine 1:
save(list = 'glm', file = '/path/to/glm.RData')
# On machine 2:
load(file = '/path/to/glm.RData')
Note that file extensions are often case-sensitive: you saved to a file with the extension .Rdata but loaded from one with the extension .RData, which is different. This may explain why the file isn't found.
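One way to check what load() actually restored is to look at its return value, which is the (invisible) character vector of the names of the objects it created. A small sketch, using a made-up object name `fit` and a temporary file rather than the asker's model:

```r
# Sketch: confirm what load() restored by capturing its return value.
# `fit` and the temporary path are hypothetical stand-ins.
fit <- list(coefficients = c(a = 1, b = 2))
path <- file.path(tempdir(), "fit.RData")
save(fit, file = path)

rm(fit)
loaded_names <- load(file = path)  # names of the objects that were created
print(loaded_names)
exists("fit")
```

If the expected name is not in that vector, the file does not contain the object, whatever the file itself is called.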
Using saveRDS() and readRDS()
An alternative to using save() and load() is to use saveRDS() and readRDS(), which are designed to be used with one object. They're used slightly differently:
# On machine 1
saveRDS(glm, file = '/path/to/glm.rds')
# On machine 2
glm = readRDS(file = '/path/to/glm.rds')
Note the .rds file extension and the fact that the object returned by readRDS() isn't automatically put in the environment (it needs to be assigned to something).
Saving parts of a GLM
If you just want the formula saved, that is, the actual text string, you can find it in glm$formula, where glm is the name of your object. It comes back as a formula object; deparse(glm$formula) converts it to text (as.character(glm$formula) instead returns the formula's pieces as a character vector), which can then be written to a text file or whatever.
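A small sketch of writing the formula out as text. It uses the built-in mtcars data instead of the asker's dat1, and the object name `fit` and the output file are assumptions made only to keep the example runnable:

```r
# Sketch: store only the model formula as plain text.
# mtcars, `fit`, and the temporary file are stand-ins for the real data and paths.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# deparse() may split a very long formula over several strings, so paste them.
formula_text <- paste(deparse(fit$formula), collapse = " ")

out <- tempfile(fileext = ".txt")
writeLines(formula_text, out)
readLines(out)
```

On the other machine, as.formula() applied to the text read back gives a formula object again.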
If, however, you want the model itself without the dataset it was created from (to cut down on disk space), have a look at this article, which discusses which parts of a glm object can be safely deleted.

Employ environments to handle package-data in package-functions

I recently wrote an R extension. The functions use data contained in the package and must therefore load them. Subroutines also need to access the data.
This is the approach taken:
main<- function(...){
data(data)
sub <- function(...,data=data){...}
...
}
I'm unhappy with the fact that the data resides in .GlobalEnv, so it still hangs around after the function has terminated (which also undermines the idea of passing it down via an argument).
Please put me on the right track! How do you employ environments, when you have to handle package-data in package-functions?
It looks like you are looking for the LazyData directive in your DESCRIPTION file:
LazyData: yes
Otherwise, data() has an envir argument you can use to control which environment the data is loaded into, so, for example, if you wanted the data to be loaded inside main, you could use:
main<- function(...){
data(data, envir = environment() )
sub <- function(...,data=data){...}
...
}
If the data is needed for your functions, not for the user of the package, it should be saved in a file called sysdata.rda located in the R directory.
From Writing R Extensions:
Two exceptions are allowed: if the R subdirectory contains a file
sysdata.rda (a saved image of R objects: please use suitable
compression as suggested by tools::resaveRdaFiles) this will be
lazy-loaded into the namespace/package environment – this is intended
for system datasets that are not intended to be user-accessible via
data.
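As a rough sketch of how such a file might be produced during development: the internal objects are saved with save() into R/sysdata.rda under the package source directory. The package directory below is a temporary stand-in and `lookup_table` is a hypothetical internal dataset.

```r
# Sketch: write internal data to R/sysdata.rda inside a package source tree.
# `pkg_dir` is a throwaway directory standing in for the real package root.
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical dataset that only the package's own functions need.
lookup_table <- data.frame(code = c("a", "b"), value = c(1, 2))

sysdata <- file.path(pkg_dir, "R", "sysdata.rda")
save(lookup_table, file = sysdata, compress = "xz")

# Optionally let R pick the most compact compression for the file.
tools::resaveRdaFiles(sysdata, compress = "auto")
```

Once the package is installed, its functions can refer to lookup_table directly, with no data() call and nothing left behind in .GlobalEnv.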

How do I load objects to the current environment from a function in R?

Instead of doing
a <- loadBigObject("a")
b <- loadBigObject("b")
I'd like to call a function like
loadBigObjects(list("a","b"))
And be able to access the a and b objects.
It is not clear what loadBigObjects() does or where it will look for a and b. Does it load the objects from a file or by sourcing code?
There are lots of options in general:
sys.source() allows an R file to be sourced to a given environment
load() which will load an .Rdata file to a given environment
assign() in combination with any object created by loadBigObjects() or a call to readRDS() can also load an object to a given environment.
From within your function, you'll want to specify the environment in which to load objects as the Global Environment by using globalenv(). If you don't do that then the object will only exist in the evaluation frame of the running loadBigObjects(). E.g.
loadBigObjects <- function(list) {
lapply(list, function(x) assign(x, readRDS(x), envir = globalenv()))
}
(as per your comment to @GSee's answer, and assuming the list("a","b") is sufficient information for readRDS() to locate and open the object.)
Without knowing anything about what loadBigObject is or does, you can use lapply to apply a function to a list of objects
lapply(list("a", "b"), loadBigObject)
If you provided the code for loadBigObject or at least describe what it is supposed to do, a better loadBigObjects function could probably be written.
The assign function can be used to define a variable in an environment other than the current one.
loadBigObjects <- function(lst) {
  lapply(lst, function(l) {
    # create a variable named after l in the global environment
    assign(l, loadBigObject(l), envir=globalenv())
  })
  lst
}
(Not that this is necessarily a good idea.)
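A variant that does not hard-code the global environment is to take the target environment as an argument defaulting to parent.frame(), so the objects land wherever the function was called from. This is only a sketch: it assumes each name "x" corresponds to a file "x.rds" in some directory, which the question does not actually state, and the `dir` and `envir` parameters are additions made for illustration.

```r
# Sketch: load several .rds files into the caller's environment.
# Assumption: object "x" is stored in "<dir>/x.rds".
loadBigObjects <- function(names, dir = ".", envir = parent.frame()) {
  force(envir)  # resolve the caller's environment before doing anything else
  for (nm in names) {
    assign(nm, readRDS(file.path(dir, paste0(nm, ".rds"))), envir = envir)
  }
  invisible(names)
}

# Usage with two small stand-in objects written to a temporary directory.
d <- tempdir()
saveRDS(1:3, file.path(d, "a.rds"))
saveRDS(letters, file.path(d, "b.rds"))
loadBigObjects(c("a", "b"), dir = d)
```

Called at the console this behaves like the globalenv() versions; called inside another function, a and b stay local to that function.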

Examining contents of .rdata file by attaching into a new environment - possible?

I am interested in listing objects in an RDATA file and loading only selected objects, rather than the whole set (in case some may be big or may already exist in the environment). I'm not quite clear on how to do this when there are conflicts in names, as attach() doesn't work as nicely.
1: For examining the contents of an R data file without loading it: This question is similar, but different from, the one asked at listing contents of an R data file without loading
In that case, the solution offered was:
attach(filename)
ls(pos = 2)
detach()
If there are naming conflicts between objects in the file and those in the global environment, this warning appears:
The following object(s) are masked _by_ '.GlobalEnv':
I tried creating a new environment, but I cannot seem to attach into that.
For instance, this produces the same error:
lsfile <- function(filename){
tmpEnv <- new.env()
evalq(attach(filename), envir = tmpEnv)
tmpls <- ls(pos = 2)
detach()
return(tmpls)
}
lsfile(filename)
Maybe I've made a mess of things with evalq (or eval). Is there some other way to avoid the naming conflict?
2: If I want to access an object - if there are no naming conflicts, I can just work with the one from the .rdat file, or copy it to a new one. If there are conflicts, how does one access the object in the file's namespace?
For instance, if my file is "sample.rdat", and the object is surveyData, and a surveyData object already exists in the global environment, then how can I access the one from the file:sample.rdat namespace?
I currently solve this problem by loading everything into a temporary environment, and then copy out what's needed, but this is inefficient.
Since this question has just been referenced let's clarify two things:
attach() simply calls load() so there is really no point in using it instead of load
if you want selective access to prevent masking it's much easier to simply load the file into a new environment:
e = local({load("foo.RData"); environment()})
You can then use ls(e) and access contents like e$x. You can still use attach on the environment if you really want it on the search path.
FWIW .RData files have no index (the objects are stored in one big pairlist), so you can't list the contained objects without loading. If you want convenient access, convert it to the lazy-load format instead which simply adds an index so each object can be loaded separately (see Get specific object from Rdata file)
I just use the envir= argument to load():
> x <- 1; y <- 2; z <- "foo"
> save(x, y, z, file="/tmp/foo.RData")
> ne <- new.env()
> load(file="/tmp/foo.RData", envir=ne)
> ls(envir=ne)
[1] "x" "y" "z"
> ne$z
[1] "foo"
>
The cost of this approach is that you do read the whole RData file---but on the other hand that seems to be unavoidable anyway as no other method seems to offer a list of the 'content' of such a file.
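Building on that, a small helper can hide the scratch environment and hand back only the objects asked for. This is a sketch; `load_selected` is a hypothetical name, and the whole file is still read, as noted above.

```r
# Sketch: read an .RData file into a scratch environment and return only
# the requested objects, leaving the caller's workspace untouched.
load_selected <- function(file, names) {
  scratch <- new.env()
  available <- load(file, envir = scratch)  # names of everything in the file
  missing_names <- setdiff(names, available)
  if (length(missing_names) > 0) {
    stop("not found in file: ", paste(missing_names, collapse = ", "))
  }
  mget(names, envir = scratch)  # named list of the selected objects
}

# Usage with a temporary file holding three small objects.
x <- 1; y <- 2; z <- "foo"
path <- tempfile(fileext = ".RData")
save(x, y, z, file = path)
picked <- load_selected(path, c("x", "z"))
names(picked)
```

Because the result is a list, an object from the file can share a name with one in the global environment without masking either.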
You can suppress the warning by setting warn.conflicts=FALSE in the call to attach. If an object is masked by one in the global environment, you can use get to retrieve it from your attached data.
x <- 1:10
save(x, file="x.rData")
#attach("x.rData", pos=2, warn.conflicts=FALSE)
attach("x.rData", pos=2)
(x <- 1)
# [1] 1
(x <- get("x", pos=2))
# [1] 1 2 3 4 5 6 7 8 9 10
Thanks to @Dirk and @Joshua.
I had an epiphany. The command/package foreach with SMP or MC seems to produce environments that only inherit, but do not seem to conflict with, the global environment.
lsfile <- function(list_files){
aggregate_ls = foreach(ix = 1:length(list_files)) %dopar% {
attach(list_files[ix])
tmpls <- ls(pos = 2)
return(tmpls)
}
return(aggregate_ls)
}
lsfile("f1.rdat")
lsfile(dir(pattern = "\\.rdat$"))
This is useful to me because I can now parallelize this. This is a bare-bones version, and I will modify it to give more detailed information, but so far it seems to be the only way to avoid conflicts without ignoring the warnings.
So, question #1 can be resolved by either ignoring the warnings (as @Joshua suggested) or by using whatever magic foreach summons.
For part 2, loading an object, I think @Joshua has the right idea - "get" will do.
The foreach magic can also work, by using the .noexport option. However, this has risks: whatever isn't specifically excluded will be inherited/exported from the global environment (I could do ls(), but there's always the possibility of attached datasets). For safety, this means that get() must still be used to avoid the risk of a naming conflict. Loading into a subenvironment avoids the naming conflict, but doesn't avoid the loading of unnecessary objects.
@Joshua's answer is far simpler than my foreach detour.
