How to delete temporary files of deleted objects in the R package raster

Since I work with large RasterBrick objects, the objects can't be held in memory and are instead stored as temporary files in the current temporary directory tempdir(), or to be exact in its subfolder "raster".
Because of the large file sizes, I would very much like to delete the temporary files of objects I no longer use.
If I delete objects I no longer need by
rm(list=ls(pattern="xxx"))
the temporary files still exist.
To my understanding, garbage collection with gc() has no effect here, since it does not touch files on the hard drive.
The automatically generated names of the temporary files bear no relation to the object names.
Therefore it is not possible to delete them with code like
raster_temp_dir <- paste(tempdir(), "/raster", sep="")
files_to_be_removed <- list.files(raster_temp_dir, pattern="xxx", full.names=T)
Unfortunately, the files of objects still in use aren't read-only.
Therefore I would also delete the files of objects I still need by running:
files_to_be_removed <- list.files(raster_temp_dir, full.names=T)
Did somebody already solve this problem or has any ideas how to solve it?
It would be perfect if the code could somehow distinguish between used and unused objects.
Since this is unlikely to be possible, a workaround could be to name the temporary files of the Raster objects manually, but I haven't found an option for that either: to my knowledge, the filename argument can only be used when explicitly writing files to disk, not when temporary files are created.
Thanks!

I think the function you're looking for is file.remove(). Just pass it a vector with the file names you want to delete.
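As a sketch, building on the paths in the question (note that this removes the temporary files of every Raster object in the session, including ones you may still need):

# Collect every temporary raster file in this session's "raster" temp folder
raster_temp_dir <- file.path(tempdir(), "raster")
files_to_be_removed <- list.files(raster_temp_dir, full.names = TRUE)

# Delete them; file.remove() returns TRUE/FALSE for each file
file.remove(files_to_be_removed)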

Related

Is there a way to reference files in a folder within the working directory in R?

I have already finished my R Markdown document and I'm trying to clean up the workspace a little. This isn't strictly necessary, more of an organizational practice (I'm not even sure it's a good one), so that I can keep the data separate from the scripts and other R and git related files.
I have a bunch of .csv files for the data that I used. Previously they were in (for example)
C:/Users/Documents/Project
which is what I set as my working directory. But now I want them in
C:/Users/Documents/Project/Data
The problem is that this breaks the following code, because the files are no longer in the working directory.
#create one big dataframe by unioning all the data
bigfile <- vroom(list.files(pattern = "*.csv"))
I've tried pointing list.files() at the full path where the csvs are, but no luck.
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv"))
Error: 'data1.csv' does not exist in current working directory ('C:/Users/Documents/Project').
Is there a way to only access the /Data folder once for creating my dataframe with vroom() instead of changing the working directory multiple times?
You can list files including those in all subdirectories (Data in particular) using list.files(pattern = "*.csv", recursive = TRUE)
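Plugged into the vroom() call from the question, that could look like this sketch (assuming Data is the only subfolder containing csv files):

library(vroom)

# recursive = TRUE returns relative paths such as "Data/data1.csv",
# which vroom() can read from the current working directory
csv_files <- list.files(pattern = "\\.csv$", recursive = TRUE)
bigfile <- vroom(csv_files)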
Best practices
Have one directory of raw and only raw data (the stuff you measured)
Have another directory of external data (e.g. reference databases). This is something you can remove afterwards and redownload if required.
Have another directory for the source code
Put only the source code directory under version control, plus one other file containing checksums of the raw and external data to prove their integrity
Everything else must be reproducible from the raw data and the source code, and can be removed after the project. You may want to keep small result files (e.g. tables) that take a long time to reproduce.
You can list the files and capture the full file paths, right?
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv", full.names = T))
and that should read the files in that directory without reference to your working directory.
Try one of these:
# list all csv files within Data within current directory
Sys.glob("Data/*.csv")
# list all csv files within immediate subdirectories of current directory
Sys.glob("*/*.csv")
If you only have csv files then these would also work, but they seem less desirable. They might be useful, though, if you quickly want to review which files and directories are there. (I would be very careful not to use the second one within statements that delete files: if you are not in the directory you think you are in, you can wind up deleting files you did not intend to delete. The first one might too, but it is a bit safer, since it would only lead to deleting the wrong files if the directory you are in happens to have a Data subdirectory.)
# list all files & directories within Data within current directory
Sys.glob("Data/*")
# list all files & directories within immediate subdirectories of current directory
Sys.glob("*/*")
If the subfolder always has the same name (or the same number of characters), you should be able to do it with substring. In your example, "Data" has 4 characters (5 with the /), so the following code should do the trick:
Repository <- substring(getwd(), 1, nchar(getwd())-5)

How to load a single object from .Rdata file? [duplicate]

I have an .Rdata file containing various objects:
New.Rdata
|_ Object 1 (e.g. data.frame)
|_ Object 2 (e.g. matrix)
|_...
|_ Object n
Of course I can load the whole file with load('New.Rdata'); however, is there a smart way to load only one specific object out of this file and discard the others?
.RData files don't have an index (the contents are serialized as one big pairlist). You could hack a way to go through the pairlist and assign only entries you like, but it's not easy since you can't do it at the R level.
However, you can simply convert the .RData file into a lazy-load database which serializes each entry separately and creates an index. The nice thing is that the loading will be on-demand:
# convert .RData -> .rdb/.rdx
e = local({load("New.RData"); environment()})
tools:::makeLazyLoadDB(e, "New")
Loading the DB then only loads the index but not the contents. The contents are loaded as they are used:
lazyLoad("New")
ls()
x # if you had x in the New.RData it will be fetched now from New.rdb
Just like with load() you can specify an environment to load into so you don't need to pollute the global workspace etc.
You can use attach rather than load; this attaches the data file to the search path, so you can copy the one object you are interested in and then detach the .Rdata file again.
This still loads everything, but is simpler to work with than loading everything into the global workspace (possibly overwriting things you don't want overwritten) then getting rid of everything you don't want.
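A rough sketch of that approach, using the file name from the question and a placeholder object name:

# Attach the saved file to the search path instead of loading it
# into the global environment
attach("New.Rdata")

# Copy only the object you need ("Object1" is a placeholder name),
# then detach the file again
my_object <- get("Object1")
detach("file:New.Rdata")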
Simon Urbanek's answer is very, very nice. A drawback is that it doesn't seem to work if an object to be saved is too large:
tools:::makeLazyLoadDB(
  local({
    x <- 1:1e+09
    cat("size:", object.size(x), "\n")
    environment()
  }),
  "lazytest")
size: 4e+09
Error: serialization is too large to store in a raw vector
I'm guessing that this is due to a limitation of the current implementation of R (I have 2.15.2) rather than running out of physical memory and swap. The saves package might be an alternative for some uses, however.
Here is a function that extracts a single object without loading everything in the .RData file.
extractorRData <- function(file, object) {
  #' Function for extracting an object from a .RData file created by R's save() command
  #' Inputs: RData file, object name
  E <- new.env()
  load(file = file, envir = E)
  return(get(object, envir = E, inherits = FALSE))
}
See the full answer here: https://stackoverflow.com/a/65964065/4882696
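Usage then looks like this (the object name "Object1" and the result name are placeholders):

# Fetch a single object from the .RData file; nothing else is kept
my_df <- extractorRData("New.RData", "Object1")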
This blog post describes a neat practice that prevents this sort of issue in the first place. The gist of it is to use the saveRDS() and readRDS() functions instead of the regular save() and load() functions.

How to output a list of dataframes so that it can be used by another user

I have a list whose elements are several dataframes, and the dataframes in that list have different numbers of rows.
Because it is hard for another user to use these data by re-running my original code, I would like to export the list. I am wondering if there is any method to export it as a file without losing any information, so that it can be used again in RStudio. I have tried to save it as RData, but I don't know how to save the information.
Thanks a lot
To output objects in R, here are 4 common methods:
dput() writes a text representation of an R object
This is very convenient if you want to allow someone to get your object by copying and pasting text (for instance on this site), without having to email or upload and download a file. The downside however is that the output is long and re-reading the object into R (simply by assigning the copied text to an object) can hang R for large objects. This works best to create reproducible examples. For a list of data frames, this would not be a very good option.
You can print an object to a .csv, .xlsx, etc. file with write.table(), write.csv(), readr::write_csv(), xlsx::write.xlsx(), etc.
While the file can then be used by other software (and re-imported into R with read.csv(), readr::read_csv(), readxl::read_excel(), etc.), the data can be transformed in the process and some objects cannot be printed in a single file without prior modifications. So this is not ideal in your case either.
save.image() saves your entire workspace (objects + environment)
The workspace can then be recreated with load(). This can be useful, but you are here only interested in saving one object. In that case, it is preferable to use:
saveRDS() allows you to write one object to a file
The object can then be re-created with readRDS(). This is the best option for saving an R object to a file without any modification and re-creating it later.
In your situation, this is definitely the best solution; a minimal sketch follows below.
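A minimal sketch of that, with a made-up list of data frames of different lengths standing in for yours:

# A list of data frames with different numbers of rows
my_list <- list(
  first  = data.frame(a = 1:3, b = c("x", "y", "z")),
  second = data.frame(a = 1:5)
)

# Write the whole list, structure intact, to a single file
saveRDS(my_list, "my_list.rds")

# The other user re-creates it under any name they like
my_list_again <- readRDS("my_list.rds")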

How to save large output sufficiently fast in text or any other format?

My question is: how can I save the output, i.e. mydata
mydata=array(sample(100),dim=c(2,100,4000))
reasonably fast?
I used the reshape2 package as suggested here.
mydata_melted <- melt(mydata)
and
write.table(mydata_melted, file = "data_1")
But it is taking more than one hour to save the data into the file. I am looking for any other faster ways to do the job.
I strongly suggest referring to this great post, which helps make the issues around file saving clear.
Anyway, saveRDS could be the most suitable option for you. The most relevant difference in this case is that save can write many objects to a file in a single call, whilst saveRDS, being a lower-level function, works with a single object at a time.
save and load allow you to save a named R object to a file or other connection and restore that object again. But, when loaded, the named object is restored to the current environment with the same name it had when saved.
saveRDS and readRDS, instead, allow you to save a single R object to a connection (typically a file) and to restore it, possibly under a different name. This lower-level behaviour probably makes the RDS functions more efficient for your case.
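To sketch the contrast with the array from the question (file names are arbitrary):

mydata <- array(sample(100), dim = c(2, 100, 4000))

# save()/load(): the object always comes back under its original name
save(mydata, file = "mydata.RData")
load("mydata.RData")        # re-creates `mydata` in the current environment

# saveRDS()/readRDS(): you assign the restored object to any name you like
saveRDS(mydata, file = "mydata.rds")
restored <- readRDS("mydata.rds")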
Read the help text for saveRDS using ?saveRDS. This will probably be the best way for you to save and load large dataframes.
saveRDS(yourdata, file = "yourdata.rds")

A more efficient .RData?

I am working with large data sets and often switch between my work station and laptop. Saving a workspace image to .RData is for me the most natural and convenient way, so this is the file that I want to synchronize between the two computers.
Unfortunately, it tends to be rather big (a few GB), so efficient synchronisation requires me either to connect my laptop with a cable or to move the files on a USB stick. If I forget to synchronize my laptop when I am next to my workstation, it takes me hours to make sure everything is synchronized.
The largest objects, however, change relatively rarely (although I constantly work with them). I could save them to another file, and then delete them before saving the session and load them after restoring the session. This would work, but would be extremely annoying. Also, I would have to remember to save them whenever they are modified. It would soon end up being a total mess.
Is there more efficient way of dealing with such large data chunks?
For example, my problem would be solved if there was an alternative format to .RData -- one in which .RData is a directory, and files in that directory are objects to be loaded.
You can use saveRDS:
objs.names <- ls()
objs <- mget(objs.names)
invisible(
  lapply(
    seq_along(objs),
    function(x) saveRDS(objs[[x]], paste0("mydatafolder/", objs.names[[x]], ".rds"))
  )
)
This will save every object in your session to the "mydatafolder" folder as a separate file (make sure to create the folder beforehand).
Unfortunately, this will modify the timestamps of all the files, so you can't rely on rsync alone. You could first read the objects back in with readRDS, see which ones have changed with identical, and only run the lapply above on the changed objects; then you can use something like rsync.
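A rough, untested sketch of that idea, reusing the mydatafolder layout from the snippet above:

# Re-save an object only when the copy on disk is missing or differs,
# so unchanged .rds files keep their timestamps for rsync
objs.names <- ls()
for (nm in objs.names) {
  path <- file.path("mydatafolder", paste0(nm, ".rds"))
  current <- get(nm)
  if (!file.exists(path) || !identical(readRDS(path), current)) {
    saveRDS(current, path)
  }
}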
