I have a huge data frame loaded in global environment in R named df. How can I rename the data frame without copying the data frame by assigning it to another symbol and remove the original one?
R does not copy an object when you merely bind it to a second name, so just go ahead: reassign it and rm() the original.
Example:
x <- 1:10
tracemem(x)
# [1] "<0000000017181EA8>"
y <- x
tracemem(y)
# [1] "<0000000017181EA8>"
As we can see, both names point to the same address. R makes a new copy in memory only when one of them is modified, i.e. when the two objects are no longer identical.
# Now change one of the vectors
y[2] <- 3
# tracemem[0x0000000017181ea8 -> 0x0000000017178c68]:
# tracemem[0x0000000017178c68 -> 0x0000000012ebe3b0]:
tracemem(x)
# [1] "<0000000017181EA8>"
tracemem(y)
# [1] "<0000000012EBE3B0>"
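Applied to the original question, the rename itself is just a rebind followed by rm(). A minimal sketch, assuming a hypothetical stand-in data frame big_df (not from the question) and an R build with memory profiling enabled, which tracemem() requires:

```r
# big_df is a hypothetical stand-in for the huge data frame in the question
big_df <- data.frame(id = 1:1000, value = seq(0, 1, length.out = 1000))
addr_before <- tracemem(big_df)     # start tracing; returns the address as a string

renamed_df <- big_df                # bind a second name to the same object
rm(big_df)                          # drop the old name; the data itself is untouched

addr_after <- tracemem(renamed_df)  # address of the object under its new name
same_object <- identical(addr_before, addr_after)
untracemem(renamed_df)              # stop tracing
```

If same_object is TRUE, the rename did not copy the data.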
Related post: How do I rename an R object?
There is a function called mv in the gdata package.
library(gdata)
x <- data.frame(A = 1:100, B = 101:200, C = 201:300)
tracemem(x)
# [1] "<0000000024EA66F8>"
mv(from = "x", to = "y")
tracemem(y)
# [1] "<0000000024EA66F8>"
You will notice that the output from tracemem is identical for x and y. Looking at the code of mv, you will see that it assigns the object to the environment in scope and then removes the old object. This is quite similar to the approach C8H10N4O2 used (although mv is for a single object), but at least the function is convenient to use.
To apply the accepted answer to many objects, you could use a loop of assign(new_name, get(old_name)) followed by rm(list = old_names). For example, if you wanted to replace old_df, old_x, old_y, ... with new_df, new_x, ...
old_names <- ls(pattern = '^old_')  # anchored, so names like threshold_x are not matched
for (obj_old_name in old_names) {
  assign(sub('^old_', 'new_', obj_old_name), get(obj_old_name))
}
rm(list = old_names)
In R, is there a function or other way to make a list of all variables that have been created in the global environment after a certain point in the script? I am using an R notebook, so it has chunks of code, and the goal is to eventually delete all variables that were made in certain chunks. The first part of the script creates many variables (and takes a long time to rerun) that I would like to keep, while deleting all the variables created in the second part of the script. I know I can just clear the environment etc., but for certain reasons I can't do this. I also have too many variables to type out the ones I want to rm(). The variables are all different. A (pseudo) example of what I want to do:
x <- 1
y <- 2
df <- data.frame()
rr <- raster()
## Function here to iteratively list all variables created after this line of code##
dd <- data.frame()
z <- c(1,2,3)
rm(list = listofvars) # contains "dd" and "z" only
Alternatively, is there a way to list all variables in the global environment in the order that they were created?
I hope this makes sense. Any help is appreciated.
I don't think it's a great idea to get into the business of parsing your script and determining variable definition order in that manner. Here's an alternative: set "known variables" checkpoints.
Imagine this is your notebook, first code block:
.known_vars <- list()
### first code block
# some code here
a <- 1
bb <- 2
# more code
.known_vars <- c(.known_vars, list(setdiff(ls(), unlist(.known_vars))))
End each of your code-blocks (or even more frequently, it's entirely up to you) with that last part, which appends a list of variables not known in the previous code block(s).
Next:
### second code block
# some code here
a <- 2 # over-write, not new
quux1 <- quux2 <- 9
# more code
.known_vars <- c(.known_vars, list(setdiff(ls(), unlist(.known_vars))))
Again, that last line is the same as before. Just use that same line of code.
When you want to do some cleanup, then
.known_vars
# [[1]]
# [1] "a" "bb"
# [[2]]
# [1] "quux1" "quux2"
In this case, if we want to remove all variable except those in the first code block, then we'd do
unlist(.known_vars[-1])
# [1] "quux1" "quux2"
rm(list = unlist(.known_vars[-1]))
The reason I chose a dot-leading variable name is that by default it is not shown in ls() output: you'd need ls(all.names=TRUE) to see it as well. While not a problem, I just want to keep things a little cleaner. If you choose not to start with a dot, and for some reason choose to delete variables from the same code block in which known_vars is defined, then you might lose the checkpoints for other blocks, too.
If you want this a little more formal, then you can do
.vars <- local({
  .cache <- list()
  function(add = NULL, clear = FALSE) {
    if (clear) .cache <<- list()
    if (length(add)) .cache <<- c(.cache, list(setdiff(add, unlist(.cache))))
    if (is.null(add)) .cache else invisible(.cache)
  }
})
Calling it with nothing returns its current state, and calling it with ls() makes a new entry. For example:
ls() # proof we're starting "empty"
# character(0)
.vars(clear = TRUE) # instantiate with an empty list of variables
### first code block
# some code here
a <- 1
bb <- 2
# more code
.vars(ls())
### second code block
# some code here
a <- 2 # over-write, not new
quux1 <- quux2 <- 9
# more code
.vars(ls())
.vars()
# [[1]]
# [1] "a" "bb"
# [[2]]
# [1] "quux1" "quux2"
And removing unwanted variables is done in the same way.
Since this is still just an object in the global environment, the next best way to keep this protected (and perhaps as a not-leading-dot object name), would be to make sure it is in its own environment (not .GlobalEnv) and still in R's search path. This is likely easily done with a custom package, though that may be more work than you were expecting for this simple utility.
BTW: R does not store when an object is created, modified, or deleted, so you'd need to keep track of that, too. If you feel the need to add timestamps to .vars(), then you'll need to restructure things a bit ... again, perhaps more effort than needed here.
BTW 2: this is prone to deleted-then-redefined variables: it does not know if vars have been deleted, just that they were defined at some time. If anything else removes variables, this won't know, and then rm(list=...) will complain about missing variables. Not horrible, but still good to know.
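One way to sidestep the complaint about missing variables described in BTW 2 is to intersect the recorded names with what currently exists before calling rm(). A minimal sketch, reusing the .known_vars layout from above with made-up variable names:

```r
# Checkpoint list in the same shape as .known_vars above (names are illustrative)
.known_vars <- list(c("a", "bb"), c("quux1", "quux2"))
a <- 1; bb <- 2; quux1 <- 9   # quux2 is recorded but no longer exists

# Only try to remove recorded names that still exist in this environment
to_remove <- intersect(unlist(.known_vars[-1]), ls())
rm(list = to_remove)
```

The same intersect() step would work with the output of .vars() in place of .known_vars.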
Using the script created in the Note at the end, read it in using readLines, then grep out those lines that start with optional whitespace, a word, more optional whitespace and <-. Then remove the <- and everything after it and trim off whitespace, leaving the variable names v in the order encountered in the script. Next, as an example, form vv as the subvector of v containing "df" and the variable names that follow it.
L <- grep("^\\s*\\S*\\s*<-", readLines("myscript.R"), value = TRUE)
v <- trimws(sub("<-.*", "", L)); v
## [1] "x" "y" "df" "rr" "dd" "z"
vv <- tail(v, -(match("df", v)-1)); vv
## [1] "df" "rr" "dd" "z"
To remove the variables in vv from the global environment use rm(list = vv, envir = .GlobalEnv).
Note
Lines <- "
x <- 1
y <- 2
df <- data.frame()
rr <- raster()
## Function here to iteratively list all variables created after this line of code##
dd <- data.frame()
z <- c(1,2,3)
"
cat(Lines, file = "myscript.R")
Right before a function returns in R, I would like to remove all of the local variables with the exception of one or two.
Here's a minimal reproducible example:
f <- function(){
  keep_this_local_var <- 3
  remove_this_local_var <- 4
  rm(setdiff(ls(environment()), c("keep_this_local_var"))) # doesn't work
  return(ls(environment()))
}
f() # should only list "keep_this_local_var"
Motivation: my "real life" f function calls source() several times. Each source() call generates a possibly large, unpredictable number of variables. I won't know the names of these variables in advance; however, I do have a short list of variable names I want to keep. Similar code has worked for me in the past, but only when I sourced into .GlobalEnv.
This works:
f <- function(){
  keep_this_local_var <- 3
  remove_this_local_var <- 4
  rm(
    list = setdiff(ls(environment()), "keep_this_local_var"),
    envir = environment()
  )
  return(ls(environment()))
}
f()
# [1] "keep_this_local_var"
As this thread mentions, you need to specify list=.
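If the same cleanup is needed in several functions, it could be wrapped in a small helper. The sketch below is only an illustration: keep_only() is a hypothetical name, not an existing R function, and it relies on parent.frame() to reach the caller's environment.

```r
# Hypothetical helper: remove everything from `envir` except the names in `keep`
keep_only <- function(keep, envir = parent.frame()) {
  force(envir)  # resolve the caller's environment before doing anything else
  rm(list = setdiff(ls(envir), keep), envir = envir)
}

f <- function() {
  keep_this_local_var <- 3
  remove_this_local_var <- 4
  keep_only("keep_this_local_var")  # envir defaults to f's own environment
  ls()
}
remaining <- f()
```

Because ls() skips dot-prefixed names by default, objects whose names start with a dot would survive this cleanup unless ls(envir, all.names = TRUE) is used instead.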
I have a vector say varNames which has "name" of certain variables as "character". Now I want to save those particular variables as rdata using save(). How should I go about that?
I was trying to do the following:
> varSet
[1] "blah1" "blah2" ...
> str(varSet)
chr [1:44] "blah1" "blah2" ...
> foo <- lapply(varSet, function(x) as.name(x))
As expected foo is a list of symbols. I was thinking of doing something like
eval(unlist(foo), file="fileName")
I guess unlist(foo) is not working. How should I solve this issue? Can you also explain why unlist(foo) does not unlist the list of symbols?
Edit: Adding an artificial example
> x <- c(1,2,3)
> y <- data.frame(m=c(1,2,3), n=c(1,2,3))
I can do this to save x and y.
> save(x, y, file="filename.rda")
But suppose I have
> varSet <- c("x", "y")
In my real case varSet is very large, so I need to use varSet itself to save the variables whose names it stores.
You can save any data object as:
save(varSet, file="varSet.RData")
But your inquiry sounds a bit confused. Do you want just to save it, or save it in a particular way, like data.frame?
If you want a data.frame and your list of lists is called varSet, you can use a plyr solution:
library(plyr)
df <- ldply(varSet, data.frame)
Or use a more manual strategy, assuming your list has 100 elements:
df <- data.frame(matrix(unlist(varSet), nrow=100, byrow=TRUE))
In R versions before 4.0.0 the above converts all character columns to factors; to avoid this, add a parameter to the data.frame() call:
df <- data.frame(matrix(unlist(varSet), nrow=100, byrow=TRUE), stringsAsFactors=FALSE)
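For the original question of saving the variables whose names are stored in varSet, save() accepts a character vector of object names through its list argument, so no conversion to symbols is needed. On the side question, unlist() does not coerce non-vector elements such as symbols, so a list of symbols stays a list. A minimal sketch using the artificial example from the question, with a temporary file standing in for the real file name:

```r
x <- c(1, 2, 3)
y <- data.frame(m = c(1, 2, 3), n = c(1, 2, 3))
varSet <- c("x", "y")                      # names of the objects to save

out_file <- tempfile(fileext = ".rda")     # placeholder path for illustration
save(list = varSet, file = out_file)       # saves x and y, not varSet itself

# Load into a fresh environment to check what the file contains
check_env <- new.env()
loaded_names <- load(out_file, envir = check_env)
```

Here loaded_names holds the names of the objects restored from the file, which should match varSet.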
I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The results are quite long, about 2 million documents (rows).
While doing performance tests, I found that the time to write values to a row increases with the size of the data frame. Maybe it is a well-known issue and I am the last one to notice it.
Some code example:
set.seed(20140430)
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
On my machine, the assignment on the 2-million-row data.frame takes about 0.4 seconds. That is a lot of time if I want to fill the whole dataset. Here is a second simulation to illustrate the issue.
nreg <- seq(2e1,2e7,length.out=10)
te <- NULL
for(i in nreg){
  dfres <- as.data.frame(matrix(rep(NA,i*7),nrow=i,ncol=7))
  te <- c(te,mean(replicate(10,{r <- sample(1:i,1); system.time(dfres[r,] <- c(1:5,"a","b"))[3]}) ) )
}
plot(nreg,te,xlab="Number of rows",ylab="Avg. time for 10 random assignments [sec]",type="o")
#rm(nreg,dfres,te)
Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?
Let's start with "columns" first and see what goes on and then return to rows.
R versions < 3.1.0 (unnecessarily) copy the entire data.frame when you operate on it. For example:
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# x 0x7ff9343fb4d0 0x7ff9326dfba8
# y 0x7ff9343fb488 0x7ff9326dfbf0
# z <added> 0x7ff9326dfc38
# Changed attributes:
# old new
# names 0x7ff934170c28 0x7ff934308808
# row.names 0x7ff934551b18 0x7ff934308970
# class 0x7ff9346c5278 0x7ff935d1d1f8
You can see that adding the new column z has resulted in copies of the old columns (the addresses are different). The attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.
Shallow copies only copy the vector of column pointers, not the data itself, whereas deep copies copy everything (which is unnecessary here).
However, R v3.1.0 brought a welcome change: the old columns are no longer deep copied. All credit to the R core dev team.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# z <added> 0x7f85d328dda8
# Changed attributes:
# old new
# names 0x7f85d1459548 0x7f85d297bec8
# row.names 0x7f85d2c66cd8 0x7f85d2bfa928
# class 0x7f85d345cab8 0x7f85d2d6afb8
You can see that the columns x and y aren't changed at all (and are therefore not present in the output of the changes function call). This is a huge (and welcome) improvement!
So far, we have looked at the issue of adding columns in R < 3.1.0 and in v3.1.0.
Now, coming to your question: what about the "rows"? Let's consider an older version of R first and then come back to R v3.1.0.
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# x 0x7f968b423e50 0x7f968ac6ba40
# y 0x7f968b423e98 0x7f968ac6bad0
#
# Changed attributes:
# old new
# names 0x7f968ab88a28 0x7f968abca8e0
# row.names 0x7f968abb6438 0x7f968ab22bb0
# class 0x7f968ad73e08 0x7f968b580828
Once again we see that changing column y has resulted in copying column x as well in older versions of R.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# y 0x7f85d3544090 0x7f85d2c9bbb8
#
# Changed attributes:
# old new
# row.names 0x7f85d35a69a8 0x7f85d35a6690
We see the nice improvements in R v3.1.0, which result in a copy of just column y. Once again, great improvements in R v3.1.0! R's copy-on-modify has gotten wiser.
But still, using data.table's assignment by reference semantics, we can do one step better: not copy even the y column, as R v3.1.0 still does.
The idea being: as long as the type of the object you assign to a column at certain indices doesn't change (here, column y is integer, so as long as you assign an integer back to y), we can modify in place (by reference) without having to copy.
Why? Because we don't have to allocate or re-allocate anything here. By contrast, if you had assigned a double/numeric value, which requires 8 bytes of storage per element as opposed to 4 bytes for the integer column y, then we would have to create a new column y and copy the values back.
That is, we can sub-assign by reference using data.table. We can use either := or set() to do this. I'll demonstrate using set() here.
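A minimal sketch of sub-assignment by reference with set(), assuming the data.table package is installed; the small table mirrors the df used above, and address() is data.table's helper for inspecting where a column lives in memory:

```r
library(data.table)

dt <- data.table(x = 1:5, y = 6:10)
addr_before <- address(dt$y)            # memory address of column y

# Assign an integer into an integer column: same type, so no reallocation is needed
set(dt, i = 1L, j = "y", value = -6L)

addr_after <- address(dt$y)
same_column <- identical(addr_before, addr_after)  # TRUE if y was modified in place
```

The equivalent with := would be dt[1L, y := -6L]; set() avoids the overhead of [.data.table, which matters when it is called repeatedly in a loop.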
Now, here's a comparison with base R and data.table on your data with 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.
Plot for comparison against R v3.0.3:
Plot for comparison against R v3.1.0:
The min, median and max for R v3.0.3, R v3.1.0 and data.table on 20 million rows with 10 replications are:
type        min    median  max
base_3.0.3  10.05  10.70   18.51
base_3.1.0   1.67   1.97    5.20
data.table   0.04   0.04    0.05
Note: You can see the complete timings in this gist.
This clearly shows the improvement in R v3.1.0, but also shows that the column being changed is still copied, which still takes some time. Sub-assignment by reference in data.table avoids that copy.
HTH