What should I use instead of pass-by-reference in R?

I wrote a function in R that, if I could pass by reference, would work as I intend. It involves nesting sapply calls and performing assignment inside them. Since R does not use pass by reference, the functions don't work.
I am aware that there are packages available such as R.oo to bring pass by reference-y aspects to R, but I would like to learn if there is a better way. What is the 'R' way of passing data if pass by reference is not available?

If you don't modify an argument, R won't actually copy it (copy-on-modify), so it may be pointless to do anything special:
> gc() # using 6.2 MB of Vcells
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 560642 15.0 984024 26.3 984024 26.3
Vcells 809878 6.2 2670432 20.4 2310055 17.7
> x <- as.numeric(1:1000000)
> gc() # now we are using 13.9 MB
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 560640 15.0 984024 26.3 984024 26.3
Vcells 1809867 13.9 2883953 22.1 2310055 17.7
> f0 <- function(x) { s <- sum(x); print(gc()); s }
> f0(x) # f0 did not use appreciably more despite using a huge vector
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 560655 15.0 984024 26.3 984024 26.3
Vcells 1809872 13.9 2883953 22.1 2310055 17.7
[1] 500000500000
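As for the 'R way' of passing data: functions return (possibly modified) copies and the caller reassigns the result, so the usual fix for the nested-sapply pattern is to build the output from sapply's return values instead of assigning into an outer object from inside the calls. If you genuinely need reference semantics, an environment (which is not copied) can hold the state. A minimal sketch with made-up names (scale_rows, acc, add):
## return-and-reassign: build the result from sapply's return values
scale_rows <- function(m, f) t(sapply(seq_len(nrow(m)), function(i) m[i, ] * f))
m <- matrix(1:6, nrow = 2)
m <- scale_rows(m, 10)                               # reassign the returned copy
## reference semantics via an environment (environments are not copied)
acc <- new.env()
acc$total <- 0
add <- function(env, x) env$total <- env$total + x   # updates acc in place
invisible(sapply(1:5, function(i) add(acc, i)))
acc$total                                            # 15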

Related

R Programming: What Happens to a dangling child environment when its parent has been removed?

I have googled on this subject to no avail and hope that someone will answer this question for me.
BTW, parent.env(child) still shows the parent environment x even after I have removed it; it doesn't make sense to me why it would still return that environment:
e.g.
x<-new.env()
child<-new.env(parent=x)
print(x) # shows <environment: 0x00000000217b8498>
parent.env(child) # shows <environment: 0x00000000217b8498>
rm(x)
parent.env(child) # still shows <environment: 0x00000000217b8498>
Appreciate any help on this question.
In your example, the parent environment hasn't been removed.
Calling rm(x) doesn't remove the object itself, it only removes the binding
of the name x from the environment in which rm() was called. As long as
an object is reachable from the current environment, it won't ever be
removed.
Paraphrasing the beginning of the Names and Values
chapter of the Advanced R book,
it may be helpful to think of x <- new.env() as doing two things: creating
an environment object, and then binding the object to the name x in the
current environment.
Even if this original binding is removed, as long as we can reach the object,
we can restore a binding to it in the global environment. Here's an extension
of your example to demonstrate:
x <- new.env()
x
#> <environment: 0x0000000015043a78>
x$foo <- "bar"
y <- new.env(parent = x)
parent.env(y)
#> <environment: 0x0000000015043a78>
rm(x)
parent.env(y)
#> <environment: 0x0000000015043a78>
z <- parent.env(y)
z # the name z is now bound to the same object that x was
#> <environment: 0x0000000015043a78>
z$foo
#> [1] "bar"
So to answer the titular question: it's not possible to reach a state where
the parent environment of a still-existing child environment has been
removed.
The parent environment doesn't get removed because child still depends on it. rm() only removes a name from the environment's symbol table; the value bound to that name is untouched. gc() handles the eventual removal and freeing of memory, but only if there are no further references to the value.
Consider the following:
x$largevec <- numeric(1e7)
memory.size()
[1] 99.69
rm(x)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 217423 11.7 460000 24.6 350000 18.7
Vcells 10399066 79.4 15376413 117.4 10402077 79.4
memory.size()
[1] 97.51
rm(child)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 217416 11.7 460000 24.6 350000 18.7
Vcells 399008 3.1 12301130 93.9 10402077 79.4
memory.size()
[1] 21.2
The memory allocated for largevec in x doesn't get freed until child is removed, because there is still a reference to its environment.
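If you need that memory back while keeping child, one option (just a sketch; ?parent.env warns that the replacement form parent.env<- is considered dangerous and may be removed) is to re-parent child so the old parent environment becomes unreachable:
parent.env(child) <- globalenv()   # the old parent (and its largevec) is no longer reachable
invisible(gc())                    # so its memory can now be reclaimed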

High memory consumption in unlist of list-of-lists of POSIXct

I have around 500-2000 lists of POSIXct dates in a list of lists looking like this:
ts <- lapply(c(1:500), function(x) seq(as.POSIXct("2000/1/1"), as.POSIXct("2017/1/1"), "hours"))
I need a list of unique dates. I have tried several things:
t <- unique(do.call("c", ts))
This preserves the POSIXct class, but takes a very long time and uses 7-8GB of memory, even though the whole list of lists is only around 500MB.
t <- as.POSIXct(unique(unlist(ts, use.names = FALSE)), origin = "1970-01-01")
This goes much faster, though the memory consumed is roughly the same. So I tried to split it with this:
t <- lapply(split(ts, ceiling(seq_along(ts)/30)), function(x) {
  return(unique(unlist(x, use.names = FALSE)))
})
t <- unique(unlist(t, use.names = FALSE))
Same consumption, and it seems to me that the memory comes from the unlist() or unique() call on just one of the "small" lists.
Is there a way to do this memory-efficiently? Processing time matters, but only a little. If the list size doubles (which is likely) this may cause serious problems.
Instead of creating one large vector with do.call(c, .)/unlist(.) and a single large hash table (which, as Joshua's answer shows, has high memory usage), we could take the slower but more memory-efficient route of processing "ts" iteratively:
ff1 = function(x) ## a simple version of `Reduce(unique(c()), )`
{
  ans = NULL
  for(elt in x) ans = unique(c(ans, elt))
  return(.POSIXct(ans))
}
system.time({ ans1 = ff1(ts) })
# user system elapsed
# 11.41 1.25 12.74
In this example all elements of "ts" are identical. That ideal case won't hold in general, but we can still try to avoid some concatenations where possible:
ff2 = function(x)
{
  ans = NULL
  for(elt in x) {
    new = !(elt %in% ans)
    if(any(new)) ans = c(ans, elt[new])
  }
  return(.POSIXct(ans))
}
system.time({ ans2 = ff2(ts) })
# user system elapsed
# 6.65 1.12 7.93
On the same note, the fastmatch package has very interesting, but unfortunately unexported, high-level hash table functionality that we could try using here. It should also be lighter on memory consumption.
First define some convenient wrappers:
HASH = function(x, size) fastmatch:::mk.hash(x = x, size = size)
APPEND = function(x, what) fastmatch:::append.hash(hash = x, x = what, index = FALSE)
HTABLE = function(x) fastmatch:::levels.fasthash(x)
And build the same concept on it:
ff3 = function(x, size)
{
  h = HASH(double(), size)
  for(elt in x) h = APPEND(h, elt)
  return(.POSIXct(HTABLE(h)))
}
system.time({ ans3 = ff3(ts, sum(lengths(ts)) / 1e2) }) #an estimate of unique values
# user system elapsed
# 4.81 0.00 4.87
system.time({ ans3b = ff3(ts, length(ts[[1]])) }) #we know the number of uniques
# user system elapsed
# 2.03 0.03 2.10
And to compare:
all.equal(ans1, ans2)
#[1] TRUE
all.equal(ans2, ans3)
#[1] TRUE
On a smaller example to illustrate:
set.seed(1821)
tmp = split(sample(1e2, 26, TRUE) + 0, rep(1:4, c(6, 3, 11, 6)))
identical(unique(unlist(tmp)), as.double(ff1(tmp)))
#[1] TRUE
identical(unique(unlist(tmp)), as.double(ff2(tmp)))
#[1] TRUE
identical(unique(unlist(tmp)), as.double(ff3(tmp, 1e2)))
#[1] TRUE
The unique(do.call("c", ts)) call uses < 4GB (3869.9 - 606 ~ 3GB) of RAM on my machine. And the ts object is 568MB.
R> ts <- lapply(c(1:500), function(x) seq(as.POSIXct("2000/1/1"), as.POSIXct("2017/1/1"), "hours"))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221687 11.9 460000 24.6 392929 21
Vcells 74925836 571.7 112760349 860.3 79427802 606
R> t <- unique(do.call("c", ts))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221729 11.9 460000 24.6 392929 21.0
Vcells 75074953 572.8 413082177 3151.6 507227169 3869.9
R> print(object.size(ts), units="MB")
568.8 Mb
R> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
Setting recursive = FALSE and use.names = FALSE makes it faster and drops the memory consumption to ~2GB.
R> ts <- lapply(1:500, function(x) seq(as.POSIXct("2000-01-01"), as.POSIXct("2017-01-01"), "hours"))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221686 11.9 460000 24.6 371201 19.9
Vcells 74925836 571.7 111681359 852.1 80924280 617.5
R> u <- do.call("c", c(ts, recursive = FALSE, use.names = FALSE))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221725 11.9 460000 24.6 371201 19.9
Vcells 149446409 1140.2 413082872 3151.6 373009943 2845.9
Using unlist with the same arguments is a little lighter on memory consumption:
R> ts <- lapply(1:500, function(x) seq(as.POSIXct("2000-01-01"), as.POSIXct("2017-01-01"), "hours"))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221686 11.9 460000 24.6 371201 19.9
Vcells 74925836 571.7 111681359 852.1 80924280 617.5
R> u <- .POSIXct(unlist(ts, recursive = FALSE, use.names = FALSE))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 221695 11.9 460000 24.6 371201 19.9
Vcells 149446337 1140.2 358453576 2734.8 298487368 2277.3
Using the approach from alexis_laz's comment, you can see that the additional memory consumption is a measly 230MB:
R> ts <- lapply(c(1:500), function(x) seq(as.POSIXct("2000/1/1"), as.POSIXct("2017/1/1"), "hours"))
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 218429 11.7 460000 24.6 389555 20.9
Vcells 74922694 571.7 111432506 850.2 81226910 619.8
R> u <- Reduce(function(x, y) unique(c(x, y)), ts)
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 218893 11.7 460000 24.6 389555 20.9
Vcells 75072416 572.8 111432506 850.2 111399894 850.0
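For completeness, the split idea from the question can be combined with a per-chunk unique() so that only one chunk is expanded at a time. unique_chunked and the chunk size of 30 below are made-up names/values for illustration:
unique_chunked <- function(ts, chunk = 30) {
  groups <- split(ts, ceiling(seq_along(ts) / chunk))
  partial <- lapply(groups, function(g) unique(unlist(g, use.names = FALSE)))
  .POSIXct(unique(unlist(partial, use.names = FALSE)))
}
u <- unique_chunked(ts)
When there is heavy duplication across the sub-lists, the partial results stay small, so the final unlist()/unique() pass is cheap.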

Release memory by gc() in silence

I am running R code on Ubuntu and want to release some memory. After I remove variables with rm(), I call gc(). It seems to work, but how can I make it run silently (i.e. not print its report)?
I tried setting gcinfo(verbose=FALSE), but gc() still prints its report.
gcinfo(verbose=FALSE)
# [1] FALSE
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 256641 13.8 467875 25.0 350000 18.7
# Vcells 103826620 792.2 287406824 2192.8 560264647 4274.5
The invisible() function is useful for this. One way would be to write a little wrapper around gc() that, when called without arguments, returns the result of gc() invisibly.
gcQuiet <- function(quiet = TRUE, ...) {
  if(quiet) invisible(gc()) else gc(...)
}
gcQuiet() ## runs gc() invisibly
gcQuiet(FALSE)
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 283808 15.2 531268 28.4 407500 21.8
# Vcells 505412 3.9 1031040 7.9 896071 6.9
gcQuiet(FALSE, verbose=TRUE)
# Garbage collection 26 = 12+1+13 (level 2) ...
# 15.2 Mbytes of cons cells used (53%)
# 3.9 Mbytes of vectors used (49%)
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 283813 15.2 531268 28.4 407500 21.8
# Vcells 505412 3.9 1031040 7.9 896071 6.9
Quick and dirty method that I use:
echo "gc()" > gc.R
Then you can just do this:
source("gc.R", echo=FALSE)

Suppress output of gc()

Is there a way to suppress all output of gc() in R?
The usual approaches like suppressWarnings(gc()) or suppressMessages(gc()) don't work. gc() itself has a verbose option, but it doesn't behave the way I'd like:
> gc(verbose=TRUE)
Garbage collection 375 = 234+40+101 (level 2) ...
17.9 Mbytes of cons cells used (41%)
171.2 Mbytes of vectors used (43%)
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 334493 17.9 818163 43.7 818163 43.7
Vcells 22431904 171.2 52178020 398.1 50193465 383.0
> gc(verbose=FALSE)
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 334496 17.9 818163 43.7 818163 43.7
Vcells 22431916 171.2 52178020 398.1 50193465 383.0
Thanks in advance!
I sometimes use invisible(gc()).
Not pretty, but
foo <- gc(); rm(foo)
will take care of it

Memory leak in data.table grouped assignment by reference

I'm seeing odd memory usage when using assignment by reference by group in a data.table. Here's a simple example to demonstrate (please excuse the triviality of the example):
library(data.table)
N <- 1e6
dt <- data.table(id=round(rnorm(N)), value=rnorm(N))
gc()
for (i in seq(100)) {
  dt[, value := value+1, by="id"]
}
gc()
tables()
which produces the following output:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 303909 16.3 597831 32.0 407500 21.8
Vcells 2442853 18.7 3260814 24.9 2689450 20.6
> for (i in seq(100)) {
+ dt[, value := value+1, by="id"]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 315907 16.9 597831 32.0 407500 21.8
Vcells 59966825 457.6 73320781 559.4 69633650 531.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
So about 440MB of used Vcells memory were added by the loop. This memory is still held, and not accounted for by any object, after removing the data.table:
> rm(dt)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 320888 17.2 597831 32 407500 21.8
Vcells 57977069 442.4 77066820 588 69633650 531.3
> tables()
No objects of class data.table exist in .GlobalEnv
The memory leak seems to disappear when removing the by=... from the assignment:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 312955 16.8 597831 32.0 467875 25.0
Vcells 2458890 18.8 3279586 25.1 2704448 20.7
> for (i in seq(100)) {
+ dt[, value := value+1]
+ }
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 322698 17.3 597831 32.0 467875 25.0
Vcells 2478772 19.0 5826337 44.5 5139567 39.3
> tables()
NAME NROW MB COLS KEY
[1,] dt 1,000,000 16 id,value
Total: 16MB
To summarize, two questions:
Am I missing something or is there a memory leak?
If there is indeed a memory leak, can anyone suggest a workaround that lets me use assignment by reference by group without the memory leak?
For reference, here's the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.8.10
loaded via a namespace (and not attached):
[1] tools_3.0.2
UPDATE from Matt - Now fixed in v1.8.11. From NEWS :
Long outstanding (usually small) memory leak in grouping fixed. When
the last group is smaller than the largest group, the difference in
those sizes was not being released. Most users run a grouping query
once and will never have noticed, but anyone looping calls to grouping
(such as when running in parallel, or benchmarking) may have suffered,
#2648. Test added.
Many thanks to vc273, Y T and others.
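So the practical workaround for question 2 is simply to upgrade data.table to a version containing the fix; a quick check (a sketch, assuming the usual install route):
packageVersion("data.table")       # the grouping fix is included from 1.8.11 onwards
# install.packages("data.table")   # upgrade if the installed version is older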
From Arun ...
Why was this happening?
I wish I had come across this post before sitting on this issue. Nevertheless, it was a nice learning experience. Simon Urbanek summarises the issue pretty succinctly: it's not a memory leak, but bad reporting of memory used/freed. I had a feeling this was what was happening.
What is the reason for this happening in data.table? This part is about identifying the portion of code in dogroups.c responsible for the apparent memory increase.
Okay, so after some tedious testing, I think I've at least managed to find the reason this happens. Hopefully someone can help me get the rest of the way from this post. My conclusion is that this is not a memory leak.
The short explanation is that this seems to be an effect of the use of the SETLENGTH function (from R's C interface) in data.table's dogroups.c.
In data.table, when you use by=..., for example,
set.seed(45)
DT <- data.table(x=sample(3, 12, TRUE), id=rep(3:1, c(2,4,6)))
DT[, list(y=mean(x)), by=id]
For id=1, the values of "x" (= c(1,2,1,1,2,3)) have to be picked out. This means memory has to be allocated for .SD (all the columns not in by) for each value of by.
To avoid that allocation for every group in by, data.table cleverly allocates .SD once, with the length of the largest group in by (here the group for id=1, of length 6). Then, for each value of id, it re-uses the (over-)allocated .SD and, using the function SETLENGTH, simply adjusts its length to the length of the current group. Note that by doing this no memory is actually allocated, other than the single allocation for the biggest group.
What seems strange is that when all the groups in by have the same number of items, nothing special happens in the gc() output. However, when the group sizes differ, gc() reports increasing Vcells usage, even though no extra memory is being allocated in either case.
To illustrate this point, I've written some C code that mimics the use of SETLENGTH in data.table's dogroups.c.
// test.c
#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>
int sizes[100];
#define SIZEOF(x) sizes[TYPEOF(x)]
// test function - no checks!
SEXP test(SEXP vec, SEXP SD, SEXP lengths)
{
    R_len_t i;
    char before_address[32], after_address[32];
    SEXP tmp, ans;
    sizes[INTSXP] = sizeof(int);   // fill the type-size lookup (data.table sets this up elsewhere)
    PROTECT(tmp = allocVector(INTSXP, 1));
    PROTECT(ans = allocVector(STRSXP, 2));
    snprintf(before_address, 32, "%p", (void *)SD);
    for (i = 0; i < LENGTH(lengths); i++) {
        // copy the first lengths[i] elements of vec into SD ...
        memcpy((char *)DATAPTR(SD), (char *)DATAPTR(vec), INTEGER(lengths)[i] * SIZEOF(tmp));
        // ... then shrink SD's reported length to the current group size
        SETLENGTH(SD, INTEGER(lengths)[i]);
        // do some computation here.. ex: mean(SD)
    }
    snprintf(after_address, 32, "%p", (void *)SD);
    SET_STRING_ELT(ans, 0, mkChar(before_address));
    SET_STRING_ELT(ans, 1, mkChar(after_address));
    UNPROTECT(2);
    return(ans);
}
Here vec is equivalent to any data.table dt, SD is equivalent to .SD, and lengths holds the length of each group. This is just a dummy program: for each value n in lengths, the first n elements are copied from vec onto SD, and then one could compute whatever one wants on this SD (which is not done here). For our purposes, the addresses of SD before and after the SETLENGTH operation are returned, to illustrate that SETLENGTH makes no copy.
Save this file as test.c and then compile it as follows from terminal:
R CMD SHLIB -o test.so test.c
Now, open a new R-session, go to the path where test.so exists and then type:
dyn.load("test.so")
require(data.table)
set.seed(45)
max_len <- as.integer(1e6)
lengths <- as.integer(sample(4:(max_len)/10, max_len/10))
gc()
vec <- 1:max_len
for (i in 1:100) {
  SD <- vec[1:max(lengths)]
  bla <- .Call("test", vec, SD, lengths)
  print(gc())
}
Note that in data.table, .SD would be allocated at a different memory location for each query; that's replicated here by re-assigning SD on each iteration of i.
By running this code, you'll find that 1) for each i the two addresses returned are identical, matching that of address(SD), and 2) the Vcells used (Mb) keeps increasing. Now remove all variables from the workspace with rm(list=ls()) and then call gc(); you'll find that not all of the memory is restored/freed.
Initial:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332708 17.8 597831 32.0 467875 25.0
Vcells 1033531 7.9 2327578 17.8 2313676 17.7
After 100 runs:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 332912 17.8 597831 32.0 467875 25.0
Vcells 2631370 20.1 4202816 32.1 2765872 21.2
After rm(list=ls()) and gc():
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 341275 18.3 597831 32.0 467875 25.0
Vcells 2061531 15.8 4202816 32.1 3121469 23.9
If you remove the line SETLENGTH(SD, ...) from the C-code, and run it again, you'll find that there's no change in the Vcells.
Now, as to why SETLENGTH has this effect when the group lengths are not identical, I'm still trying to understand - check out the link in the edit above.

Resources