I am working with a large dataset in an R package.
I need to get all of the separate data frames into my global environment, preferably into a list of data frames so that I can use lapply to do some repetitive operations later.
So far I've done the following:
l.my.package <- data(package="my.package")
lc.my.package <- l.my.package[[3]]
lc.df.my.package <- as.data.frame(lc.my.package)
This effectively creates a data frame of the location and name of each of the .RData files in my package, so I can load them all.
I have figured out how to load them all using a for loop.
I create a vector of path names and feed it into the loop:
f <- path('my/path/folder', lc.df.my.package$Item, ext="rdata")
f.v <- as.vector(f)
for (i in f.v) {load(i)}
This loads everything into separate data frames (as I want), but it obviously doesn't put the data frames into a list. I thought lapply would work here, but when I use lapply, the resulting list is a list of character strings (the title of each dataframe with no data included). That code looks like this:
f.l <- as.list(f)
func <- function(i) {load(i)}
df.list <- lapply(f.l, func)
I am looking for one of two possible solutions:
how can I efficiently collect the output of for loop into a list (a "while" loop would likely be too slow)?
how can I adjust lapply so the output includes each entire dataframe instead of just the title of each dataframe?
Edit: I have also tried introducing the "envir=.GlobalEnv" argument into load() within lapply. When I do that, the data frames load, but still not in a list. The list still contains only the names as character strings.
If you are willing to use a packaged solution, I wrote a package call libr that does exactly what you are asking for. Here is an example:
library(libr)
# Create temp directory
tmp <- tempdir()
# Save some data to temp directory
# for illustration purposes
saveRDS(trees, file.path(tmp, "trees.rds"))
saveRDS(rock, file.path(tmp, "rocks.rds"))
# Create library
libname(dat, tmp)
# library 'dat': 2 items
# - attributes: not loaded
# - path: C:\Users\User\AppData\Local\Temp\RtmpCSJ6Gc
# - items:
# Name Extension Rows Cols Size LastModified
# 1 rocks rds 48 4 3.1 Kb 2020-11-05 23:25:34
# 2 trees rds 31 3 2.4 Kb 2020-11-05 23:25:34
# Load library
lib_load(dat)
# Examine workspace
ls()
# [1] "dat" "dat.rocks" "dat.trees" "tmp"
# Unload the library from memory
lib_unload(dat)
# Examine workspace again
ls()
# [1] "dat" "tmp"
#rawr's response works perfectly:
df.list <- mget(l.my.package$results[, 'Item'], inherits = TRUE)
Let's say that we assign some variable:
variable_name <- runif(100)
letter <- "a"
listed <- list("a", "b", "C")
I would love to have a function assignment_code(object) that will output assignment code of these objects i.e.
>assignment_code(variable_name)
"variable_name <- runif(100)"
>assignment_code(letter)
"letter <- "a""
>assignment_code(listed)
"listed <- list("a", "b", "C")"
I tried to do it but I wasn't sure how it can be done. I tried to do some magic with ls() but I wasn't sure about proper algorithm of picking elements in ls(). Do you know how it can be done ?
In general, no it is not possible to find out from an object how it was created, for example the assignments x <- 1 and x <- 3-2 would leave x looking the same, with no clue as to which was used to create it.
Some possible solutions though are:
Accessing the history of R to see how variables were created by using the up arrow in Rgui, or the 'history' pane in Rstudio. This is also stored in a file called .rhistory.
Saving all your code as R scripts, so that you have a record of how each variable was created.
Saving your work as Rmarkdown (integrated well with Rstudio) where the code are resulting output can be combined.
See below for my reprex of my issues with source, <-, <<-, environments, etc.
There's 3 files, testrun.R, which calls inputs.R and CODE.R.
# testrun.R (file 1)
today <<- "abcdef"
source("inputs.R")
for (DC in c("a", "b")) {
usedlater_3 <- paste("X", DC, used_later2)
print(usedlater_3)
source("CODE.R", local = TRUE)
}
final_output <- paste(OD_output, used_later2, usedlater_3)
print(final_output)
# #---- file 2
# # inputs.R
# used_later1 <- paste(today, "_later")
# used_later2 <- "l2"
#
# #---- file 3
# # CODE.R
# OD_output <- paste(DC, today, used_later1, usedlater_2, usedlater_3)
I'm afraid I didn't learn R or CS in a proper way so I'm trying to catch up now. Any bigger picture lessons would be helpful. Previously, I've been relying on a global environment where I keep everything (and save/keep between sessions), but now I'm trying to make everything reproducible, so I'm using RStudio to run local jobs that start from scratch.
I've been trying different combinations of <-, <<-, and source(local = TRUE) (instead of local = FALSE). I do use functions for pieces of code where I know the inputs I need and outputs I want, but as you can see, CODE.R uses variables from both testrun.R, the loop inside testrun.R, and input.R. Converting some of the code into functions might help ? but I'd like to know of alternatives as well given this case.
Finally you can see my own troubleshooting log to see my thought process:
first run: variable today wasn't found, so I made today <<- "abcdef" double arrow assignment
second run: DC not found, so I will switch to local = TRUE
third run: but now usedlater_2 not found, so i will change usedlater_2 to <<-. (what about usedlater_1? why didn't this show up as error? we'll see...)
result of third run: usedlater_2 still not found when CODE.R needs it. out of ideas. note: used_later2 was found to create used_later3 in the for loop in testrun.R.
I want to pass a list of variables to saveRDS() to save their values, but instead it saves their names:
variables <- c("A", "B", "C")
saveRDS(variables, "file.R")
it saves the single vector "variables".
I also tried:
save(variables, "file.RData")
with no success
You need to use the list argument of the save function. EG:
var1 = "foo"
var2 = 2
var3 = list(a="abc", z="xyz")
ls()
save(list=c("var1", "var2", "var3"), file="myvariables.RData")
rm(list=ls())
ls()
load("myvariables.RData")
ls()
Please note that the saveRDS function creates a .RDS file, which is used to save a single R object. The save function creates a .RData file (same thing as .RDA file). .RData files are used to store an entire R workspace, or whichever names in an R workspace are passed to the list argument.
YiHui has a nice blogpost on this topic.
If you have several data tables and need them all saved in a single R object, then you can go the saveRDS route. As an example:
datalist = list(mtcars = mtcars, pressure=pressure)
saveRDS(datalist, "twodatasets.RDS")
rm(list=ls())
datalist = readRDS("twodatasets.RDS")
datalist
Another option is to store all your variables within a new environment and save this as an Rds file. You can then move the objects of this environment to the global environment (or leave them where they are).
e <- new.env()
with(e, {
var1 = "foo"
var2 = 2
var3 = list(a="abc", z="xyz")
})
saveRDS(e, "my_obj.Rds")
## new Session
my_obj <- readRDS("my_obj.Rds")
list2env(as.list(my_obj), globalenv())
I know I can use ls() and rm() to see and remove objects that exist in my environment.
However, when dealing with "old" .RData file, one needs to sometimes pick an environment a part to find what to keep and what to leave out.
What I would like to do, is to have a GUI like interface to allow me to see the objects, sort them (for example, by there size), and remove the ones I don't need (for example, by a check-box interface). Since I imagine such a system is not currently implemented in R, what ways do exist? What do you use for cleaning old .RData files?
Thanks,
Tal
I never create .RData files. If you are practicing reproducible research (and you should be!) you should be able to source in R files to go from input data files to all outputs.
When you have operations that take a long time it makes sense to cache them. If often use a construct like:
if (file.exists("cache.rdata")) {
load("cache.rdata")
} else {
# do stuff ...
save(..., file = "cache.rdata")
}
This allows you to work quickly from cached files, and when you need to recalculate from scratch you can just delete all the rdata files in your working directory.
Basic solution is to load your data, remove what you don't want and save as new, clean data.
Another way to handle this situation is to control loaded RData by loading it to own environment
sandbox <- new.env()
load("some_old.RData", sandbox)
Now you can see what is inside
ls(sandbox)
sapply(ls(sandbox), function(x) object.size(get(x,sandbox)))
Then you have several posibilities:
write what you want to new RData: save(A, B, file="clean.RData", envir=sandbox)
remove what you don't want from environment rm(x, z, u, envir=sandbox)
make copy of variables you want in global workspace and remove sandbox
I usually do something similar to third option. Load my data, do some checks, transformation, copy final data to global workspace and remove environments.
You could always implement what you want. So
Load the data
vars <- load("some_old.RData")
Get sizes
vars_size <- sapply(vars, function(x) object.size(get(x)))
Order them
vars <- vars[order(vars_size, decreasing=TRUE)]
vars_size <- vars_size [order(vars_size, decreasing=TRUE)]
Make dialog box (depends on OS, here is Windows)
vars_with_size <- paste(vars,vars_size)
vars_to_save <- select.list(vars_with_size, multiple=TRUE)
Remove what you don't want
rm(vars[!vars_with_size%in%vars_to_save])
To nice form of object size I use solution based on getAnywhere(print.object_size)
pretty_size <- function(x) {
ifelse(x >= 1024^3, paste(round(x/1024^3, 1L), "Gb"),
ifelse(x >= 1024^2, paste(round(x/1024^2, 1L), "Mb"),
ifelse(x >= 1024 , paste(round(x/1024, 1L), "Kb"),
paste(x, "bytes")
)))
}
Then in 4. one can use paste(vars, pretty_size(vars_size))
You may want to check out the RGtk2 package.
You can very easily create an interface with Glade Interface Designer and then attach whatever R commands you want to it.
If you want a good starting point where to "steal" ideas on how to use RGtk2, install the rattle package and run rattle();. Then look at the source code and start making your own interface :)
I may have a go at it and see if I can come out with something simple.
EDIT: this is a quick and dirty piece of code that you can play with. The big problem with it is that for whatever reason the rm instruction does not get executed, but I'm not sure why... I know that it is the central instruction, but at least the interface works! :D
TODO:
Make rm work
I put all the variables in the remObjEnv environment. It should not be listed in the current variable and it should be removed when the window is closed
The list will only show objects in the global environment, anything inside other environment won't be shown, but that's easy enough to implement
probably there's some other bug I haven't thought of :D
Enjoy
# Our environment
remObjEnv <<- new.env()
# Various required libraries
require("RGtk2")
remObjEnv$createModel <- function()
{
# create the array of data and fill it in
remObjEnv$objList <- NULL
objs <- objects(globalenv())
for (o in objs)
remObjEnv$objList[[length(remObjEnv$objList)+1]] <- list(object = o,
type = typeof(get(o)),
size = object.size(get(o)))
# create list store
model <- gtkListStoreNew("gchararray", "gchararray", "gint")
# add items
for (i in 1:length(remObjEnv$objList))
{
iter <- model$append()$iter
model$set(iter,
0, remObjEnv$objList[[i]]$object,
1, remObjEnv$objList[[i]]$type,
2, remObjEnv$objList[[i]]$size)
}
return(model)
}
remObjEnv$addColumns <- function(treeview)
{
colNames <- c("Name", "Type", "Size (bytes)")
model <- treeview$getModel()
for (n in 1:length(colNames))
{
renderer <- gtkCellRendererTextNew()
renderer$setData("column", n-1)
treeview$insertColumnWithAttributes(-1, colNames[n], renderer, text=n-1)
}
}
# Builds the list.
# I seem to have some problems in correctly build treeviews from glade files
# so we'll just do it by hand :)
remObjEnv$buildTreeView <- function()
{
# create model
model <- remObjEnv$createModel()
# create tree view
remObjEnv$treeview <- gtkTreeViewNewWithModel(model)
remObjEnv$treeview$setRulesHint(TRUE)
remObjEnv$treeview$getSelection()$setMode("single")
remObjEnv$addColumns(remObjEnv$treeview)
remObjEnv$vbox$packStart(remObjEnv$treeview, TRUE, TRUE, 0)
}
remObjEnv$delObj <- function(widget, treeview)
{
model <- treeview$getModel()
selection <- treeview$getSelection()
selected <- selection$getSelected()
if (selected[[1]])
{
iter <- selected$iter
path <- model$getPath(iter)
i <- path$getIndices()[[1]]
model$remove(iter)
}
obj <- as.character(remObjEnv$objList[[i+1]]$object)
rm(obj)
}
# The list of the current objects
remObjEnv$objList <- NULL
# Create the GUI.
remObjEnv$window <- gtkWindowNew("toplevel", show = FALSE)
gtkWindowSetTitle(remObjEnv$window, "R Object Remover")
gtkWindowSetDefaultSize(remObjEnv$window, 500, 300)
remObjEnv$vbox <- gtkVBoxNew(FALSE, 5)
remObjEnv$window$add(remObjEnv$vbox)
# Build the treeview
remObjEnv$buildTreeView()
remObjEnv$button <- gtkButtonNewWithLabel("Delete selected object")
gSignalConnect(remObjEnv$button, "clicked", remObjEnv$delObj, remObjEnv$treeview)
remObjEnv$vbox$packStart(remObjEnv$button, TRUE, TRUE, 0)
remObjEnv$window$showAll()
Once you've figured out what you want to keep, you can use the function -keep- from package gdata does what its name suggests.
a <- 1
b <- 2
library(gdata)
keep(a, all = TRUE, sure = TRUE)
See help(keep) for details on the -all- and -sure- options.
all: whether hidden objects (beginning with a .) should be removed, unless explicitly kept.
sure: whether to perform the removal, otherwise return names of objects that would have been removed.
This function is so useful that I'm surprised it isn't part of R itself.
The OS X gui does have such a thing, it's called the Workspace Browser. Quite handy.
I've also wished for an interface that shows the session dependency between objects, i.e. if I start from a plot() and work backwards to find all the objects that were used to create it. This would require parsing the history.
It doesn't have checkboxes to delete with, rather you select the file(s) then click delete. However, the solution below is pretty easy to implement:
library(gWidgets)
options(guiToolkit="RGtk2")
## make data frame with files
out <- lapply((x <- list.files()), file.info)
out <- do.call("rbind", out)
out <- data.frame(name=x, size=as.integer(out$size), ## more attributes?
stringsAsFactors=FALSE)
## set up GUI
w <- gwindow("Browse directory")
g <- ggroup(cont=w, horizontal=FALSE)
tbl <- gtable(out, cont=g, multiple=TRUE)
size(tbl) <- c(400,400)
deleteThem <- gbutton("delete", cont=g)
enabled(deleteThem) <- FALSE
## add handlers
addHandlerClicked(tbl, handler=function(h,...) {
enabled(deleteThem) <- (length(svalue(h$obj, index=TRUE)) > 0)
})
addHandlerClicked(deleteThem, handler=function(h,...) {
inds <- svalue(tbl, index=TRUE)
files <- tbl[inds,1]
print(files) # replace with rm?
})
The poor guy answer could be :
ls()
# spot the rank of the variables you want to remove, for example 10 to 25
rm(list= ls()[[10:25]])
# repeat until satisfied
To clean the complete environment you can try:
rm(list(ls())