How to efficiently grow a vector in R?

Let's say you have a function that takes a number as input and outputs a vector. However, the size of the output vector depends on the input, so you can't calculate it before calling the function.
For example, take the famous 3N+1 (Collatz) algorithm. A simple implementation of that algorithm, returning the whole path down to 1, could look like this:
compute <- function(x) {
  if (x %% 2 == 0)
    return(x / 2)
  return(3 * x + 1)
}
algo <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  while (x != 1) {
    x <- compute(x)
    output <- c(output, x)
  }
  return(output)
}
The algo function returns the whole path from an input x down to 1, according to that rule. As you can tell, the output variable grows dynamically, using the c() (combine) function.
Are there any alternatives to this? Is growing a list faster? Should I adopt some classic dynamic-vector logic, such as initializing an empty N-sized vector and doubling it every time it fills up?
EDIT: Please don't mind how my helper functions are structured. I get it, but that's not the point here! I am only concerned with the c() function and alternatives to it.

Update
As per your edit, maybe you can check the following solution
algo_TIC2 <- function(x) {
  res <- x
  repeat {
    u <- tail(res, 1)
    if (u != 1) {
      res[length(res) + 1] <- if (u %% 2) 3 * u + 1 else u / 2
    } else {
      return(res)
    }
  }
}
You can use recursion like below
compute <- function(x) if (x %% 2) 3 * x + 1 else x / 2
algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(compute(x)))
}
and you will see
> algo_TIC1(3000)
[1] 3000 1500 750 375 1126 563 1690 845 2536 1268 634 317 952 476 238
[16] 119 358 179 538 269 808 404 202 101 304 152 76 38 19 58
[31] 29 88 44 22 11 34 17 52 26 13 40 20 10 5 16
[46] 8 4 2 1
If you don't want any helper function, i.e., compute, you can try
algo_TIC1 <- function(x) {
  if (x == 1) {
    return(1)
  }
  c(x, algo_TIC1(if (x %% 2) 3 * x + 1 else x / 2))
}

So, what bothers you is reallocation, and you are right. Let's see.
library(microbenchmark)
microbenchmark({
  a <- c()
  for (i in seq(1e4)) {
    a <- c(a, i)
  }
})
microbenchmark({
  a <- numeric(1e4)
  for (i in seq(1e4)) {
    a[[i]] <- i
  }
})
microbenchmark({
  a <- numeric(1)
  k <- 1
  for (i in seq(1e4)) {
    if (i > k) {
      a <- c(a, numeric(k))
      k <- k + k
    }
    a[[i]] <- i
  }
  a <- head(a, 1e4)
})
And the timings:
Append
min lq mean median uq max neval
78.0162 78.67925 83.36224 79.54515 81.79535 166.6988 100
Preallocate
min lq mean median uq max neval
1.484901 1.516051 1.567897 1.5552 1.569451 1.895601 100
Amortize
min lq mean median uq max neval
3.316501 3.377201 3.62415 3.484351 3.585701 11.7596 100
Never append many elements to a vector. If possible, preallocate, otherwise amortized allocation will do.
Even if you don't know the actual size beforehand, you may have an upper bound. Then you can still preallocate and truncate in the end. Even a reasonable estimate is useful: preallocate that size, and then resort to amortization if needed.
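For illustration, a minimal sketch of that advice applied to the question's algo() (the initial estimate of 50 and the name algo_est are assumptions for the example, not tuned values):
algo_est <- function(x) {
  output <- numeric(50)          # preallocate an estimated size
  output[1] <- x
  n <- 1
  while (x != 1) {
    x <- compute(x)
    n <- n + 1
    if (n > length(output))      # estimate exhausted: fall back to doubling
      length(output) <- 2 * length(output)
    output[n] <- x
  }
  output[seq_len(n)]             # truncate to the elements actually used
}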
A remark: R is not good at loops. For small loops, for instance over the variables of a data frame or the files in a directory, there is usually no problem. But if you have a long computation that really needs many loops and can't be vectorized, R might not be the right tool. On occasion, writing a function in C, C++, Fortran or Java can help: it's fairly easy to build plugins or to use Rcpp, and the performance gain is considerable.
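For example, here is a hedged Rcpp sketch of the same path computation (assuming the Rcpp package is installed; algo_cpp is a name made up for this example):
library(Rcpp)
cppFunction('
NumericVector algo_cpp(double x) {
  // std::vector gives amortized O(1) push_back
  std::vector<double> out;
  out.push_back(x);
  while (x != 1) {
    x = (std::fmod(x, 2.0) == 0.0) ? x / 2 : 3 * x + 1;
    out.push_back(x);
  }
  return Rcpp::wrap(out);
}')
algo_cpp(3000)  # same path as algo(3000), computed in C++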

You can set the length of a vector and then make assignments to specific elements. Your code would look like this:
algo2 <- function(x) {
  if (x == 1)
    return(1)
  output <- x
  index <- 1
  while (x != 1) {
    x <- compute(x)
    index <- index + 1
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  return(output[seq_len(index)])
}
This makes a difference, though not a big one in your example, because all those calls to compute() (and return()!) are quite costly. If you folded that calculation into algo you'd see more improvement. You could also initialize output to a length that is likely to be good enough for most cases, and rarely need doubling.
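For illustration, a sketch of what that folding could look like (algo3 is just a name for this example):
algo3 <- function(x) {
  output <- x
  index <- 1
  while (x != 1) {
    x <- if (x %% 2) 3 * x + 1 else x / 2   # compute() folded in
    index <- index + 1
    if (index > length(output))
      length(output) <- 2 * length(output)
    output[index] <- x
  }
  output[seq_len(index)]
}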

Related

Sum the odd numbers of a "number"

I am trying to sum the odd numbers below a specific number (excluding the number itself), for example: N = 5, then 1 + 3 = 4
a <- 5
sum <- function(x){
  k <- 0
  for (n in x) {
    if (n %% 2 == 1)
      k <- k + 1
  }
  return(k)
}
sum(a)
# [1] 1
But the function is not working, because it counts the odd numbers instead of summing them.
We may use a vectorized approach
a1 <- head(seq_len(a), -1)
sum(a1[a1 %% 2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
  s <- 0
  k <- 1
  while (k < x) {
    if (k %% 2 == 1) {
      s <- s + k
    }
    k <- k + 1
  }
  s
}
f1(5)
The issue in the OP's code is
for(n in x)
where x is just a single value, so the loop body runs only once - i.e., if our input is 5, it loops once with 'n' equal to 5. Instead, it should iterate over seq_len(x - 1). The correct loop would be something like
f2 <- function(x){
  k <- 0
  for (n in seq_len(x - 1)) {
    if (n %% 2 == 1) {
      k <- k + n
    }
  }
  k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name.
Mathematically, we can use the following code to calculate the sum (N can be odd or even). There are ceiling((N - 1) / 2) odd numbers below N, and the sum of the first k odd numbers is k^2:
(ceiling((N - 1) / 2))^2
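A quick check for both parities:
N <- 5; (ceiling((N - 1) / 2))^2   # 4  (= 1 + 3)
N <- 6; (ceiling((N - 1) / 2))^2   # 9  (= 1 + 3 + 5)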
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably gonna be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which, as @tjebo points out, silently gives the wrong answer for N = 1.
We could use a logical index (recycled along the vector) to access the values:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
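This works by recycling: the logical index c(TRUE, FALSE) is repeated along a1, keeping positions 1, 3, 5, ..., and since a1 is the sequence 1, 2, ..., a - 1, the odd positions hold exactly the odd values:
rep(c(TRUE, FALSE), length.out = length(a1))
# [1]  TRUE FALSE  TRUE FALSE
a1[c(TRUE, FALSE)]
# [1] 1 3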
Fun benchmarking. Is it a surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
  (ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
  a1 <- head(seq_len(x), -1)
  sum(a1[a1 %% 2 == 1])
}
count_odds_dash2 <- function(x){
  sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
  akrun = count_odds_akrun(10^6),
  dash2 = count_odds_dash2(10^6),
  thomas = count_odds_thomas(10^6)
)
m
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas's solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas's works fine up to infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)

Factorial Memoization in R

I wrote this function to find the factorial of a number:
fact <- function(n) {
  if (n < 0) {
    cat("Sorry, factorial does not exist for negative numbers", "\n")
  } else if (n == 0) {
    cat("The factorial of 0 is 1", "\n")
  } else {
    results <- 1
    for (i in 1:n) {
      results <- results * i
    }
    cat(paste("The factorial of", n, "is", results, "\n"))
  }
}
Now I want to implement memoization in R. I have a basic grasp of R and am trying to implement it with what I know, but I am not sure this is the way forward. Could you please elaborate on this topic as well? Thanks in advance.
Memoized Factorial
fact_tbl <- c(0, 1, rep(NA, 100))
fact_mem <- function(n){
  stopifnot(n > 0)
  if (!is.na(fact_tbl[n])){
    fact_tbl[n]
  } else {
    fact_tbl[n-1] <<- fact_mem(n-1) * n
  }
}
print(fact_mem(4))
First of all, if you need an efficient implementation, use R's factorial function. Don't write it yourself. Then, the factorial is a good exercise for understanding recursion:
myfactorial <- function(n) {
  if (n == 1) return(1)
  n * myfactorial(n - 1)
}
myfactorial(10)
#[1] 3628800
With this function, memoization is only useful if you intend to use the function repeatedly. You can implement memoization in R using closures. Hadley explains these in his book.
createMemFactorial <- function() {
  res <- 1
  memFactorial <- function(n) {
    if (n == 1) return(1)
    # grow res if necessary
    if (length(res) < n) res <<- `length<-`(res, n)
    # return pre-calculated value
    if (!is.na(res[n])) return(res[n])
    # calculate new values
    res[n] <<- n * factorial(n - 1)
    res[n]
  }
  memFactorial
}
memFactorial <- createMemFactorial()
memFactorial(10)
#[1] 3628800
Is it actually faster?
library(microbenchmark)
microbenchmark(factorial(10),
               myfactorial(10),
               memFactorial(10))
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# factorial(10) 235 264.0 348.02 304.5 378.5 2463 100 a
# myfactorial(10) 4799 5279.5 6491.94 5629.0 6044.5 15955 100 c
# memFactorial(10) 950 1025.0 1344.51 1134.5 1292.0 7942 100 b
Note that microbenchmark evaluates the functions (by default) 100 times. Since we have stored the value for n = 10 when testing the memFactorial, we time only the if conditions and the lookup here. As you can also see, R's implementation, which is mostly written in C, is faster.
A better (and easier) example implements Fibonacci numbers. Here the algorithm itself benefits from memoization.
# naive recursive implementation
fib <- function(n) {
  if (n == 1 || n == 2) return(1)
  fib(n - 1) + fib(n - 2)
}
# with memoization
fibm <- function(n) {
  if (n == 1 || n == 2) return(1)
  seq <- integer(n)
  seq[1:2] <- 1
  calc <- function(n) {
    if (seq[n] != 0) return(seq[n])
    seq[n] <<- calc(n - 1) + calc(n - 2)
    seq[n]
  }
  calc(n)
}
#try it:
fib(20)
#[1] 6765
fibm(20)
#[1] 6765
#Is memoization faster?
microbenchmark(fib(20),
               fibm(20))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# fib(20) 8005.314 8804.130 9758.75325 9301.6210 9798.8500 46867.182 100 b
#fibm(20) 38.991 44.798 54.12626 53.6725 60.4035 97.089 100 a

Speed in for loops

With the data set here:
https://www.dropbox.com/s/gyimxbz5f3v0uq3/kfg.RData?dl=0
And executing the below code:
matrix(nrow=1600, ncol=8) -> ctw
for (k in 1:8) {
  for (i in 1:1600) {
    which(kfg[, 9] == i) -> aj
    if (length(aj) != 0) {
      sample(kfg[aj, 11], prob=kfg[aj, k], size=1) -> ctw[i, k]
    }
    ctw[i, k]
  }
}
This is doable, but the real data set is over 800k rows and it takes very long. Is there a way, with data.table or another package, to do this faster? The which() step is very slow.
I had to revise your original code to check for non-zero probabilities. I also removed the statement ctw[i,k] from the last line of the inner loop, because it has no effect. Your code is
matrix(nrow=1600, ncol=8) -> ctw
for (k in 1:8) {
  for (i in 1:1600) {
    which(kfg[, 9] == i) -> aj
    if ((length(aj) != 0) && any(kfg[aj, k] > 0)) {
      sample(kfg[aj, 11], prob=kfg[aj, k], size=1) -> ctw[i, k]
    }
  }
}
ctw
I reversed the order of the loops, so that kfg[,9] == i is only evaluated once instead of 8 times. I also took the test for length(aj) != 0 outside the loops using tabulate(). My revised code is
matrix(nrow=1600, ncol=8) -> ctw
which(tabulate(kfg[, 9], 1600) != 0) -> ii
for (i in ii) {
  kfg[, 9] == i -> aj
  for (k in 1:8)
    if (any(kfg[aj, k] > 0))
      sample(kfg[aj, 11], 1, prob=kfg[aj, k]) -> ctw[i, k]
}
ctw
This is approximately 5x faster for your sample data.
It is much faster to extract the vector of sample values kfg[,11] (i.e., kfg[[11]]) once, and to work with a matrix of probabilities, as.matrix(kfg[, 1:8]), rather than a data.frame. For the sample data it is marginally faster to hoist the split on column 9 out of the loop, and to avoid the conditional inside the k loop by doing a vectorized calculation outside the loop to identify the relevant indices:
nrow <- 1600
matrix(nrow=nrow, ncol=8) -> ctw
x <- kfg[[11]]
pr <- as.matrix(kfg[, 1:8])
ajs <- split(seq_len(nrow(kfg)), factor(kfg[[9]], levels=seq_len(nrow)))
ii <- seq_along(ajs)[lengths(ajs) > 0]
for (i in ii) {
  aj <- ajs[[i]]
  kk <- which(colSums(pr[aj, , drop=FALSE]) > 0)
  for (k in kk)
    sample(x[aj], 1, prob=pr[aj, k]) -> ctw[i, k]
}
ctw
These lead to a further 5x speed-up, so 25 times faster than the original.
To measure the speed, I enclosed each of the above in a function, e.g.,
f0 <- function() {
  matrix(nrow=1600, ncol=8) -> ctw
  for (k in 1:8) {
    for (i in 1:1600) {
      which(kfg[, 9] == i) -> aj
      if ((length(aj) != 0) && any(kfg[aj, k] > 0)) {
        sample(kfg[aj, 11], prob=kfg[aj, k], size=1) -> ctw[i, k]
      }
    }
  }
  ctw
}
and used the microbenchmark package
> library(microbenchmark)
> microbenchmark(f0(), f1(), f2(), times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
f0() 466.12527 483.43954 484.34258 483.74805 484.21627 521.19957 10 c
f1() 92.77415 94.79052 94.99273 95.10352 95.45368 96.10641 10 b
f2() 17.33708 17.83257 17.87095 17.87205 18.01723 18.16400 10 a
f1() and f2() should be identical, but they are not
> set.seed(123); res1 <- f1(); set.seed(123); res2 <- f2()
> all.equal(res1, res2)
[1] "'is.NA' value mismatch: 12096 in current 12133 in target"
Investigating, this is because the values in column 9 are numeric but are treated (e.g., in kfg[, 9] == i) as though they were integers. For instance,
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)]
[1] 29 29 29
> kfg[[9]][(kfg[[9]] > 28 & kfg[[9]] <= 29)] == 29
[1] FALSE FALSE FALSE
Perhaps the intention is
kfg[[9]] = round(kfg[[9]])
With this change, we have
> all.equal(res1, res2)
[1] TRUE
> identical(res1, res2)
[1] TRUE

pick a random number, always with increasing value over last random number picked

How would I efficiently go about taking a 1-by-1 ascending random sample of the values 1:n, making sure that each of the randomly sampled values is always higher than
the previous value?
e.g.:
For the values 1:100, pick a random number, say 61 (current list = 61).
Then pick another number between 62 and 100, say 90 (current list = 61, 90).
Then pick another number between 91 and 100, say 100.
Stop the process as the max value has been hit (final list = 61, 90, 100).
I have been stuck in loop land, thinking in this clunky manner:
a1 <- sample(1:100, 1)
if (a1 < 100) {
  a2 <- sample((a1+1):100, 1)
}
etc etc...
I want to report a final vector that is the concatenation of a1, a2, ..., an:
result <- c(a1,a2)
Even though this sounds like a homework question, it is not. I thankfully left the days of homework many years ago.
Coming late to the party, but I think this is gonna rock your world:
unique(cummax(sample.int(100)))
sample.int(100) is a random permutation of 1:100, cummax() turns it into its sequence of running maxima, and unique() keeps each new maximum once; since 100 appears somewhere in the permutation, the result always ends at 100.
This uses a while loop and is wrapped in a function
# from ?sample
resample <- function(x, ...) x[sample.int(length(x), ...)]
sample_z <- function(n){
  z <- numeric(n)
  new <- 0
  count <- 1
  while (new < n){
    from <- seq(new+1, n, by=1)
    new <- resample(from, size=1)
    z[count] <- new
    if (new < n) count <- count+1
  }
  z[1:count]
}
set.seed(1234)
sample_z(100)
## [1] 12 67 88 96 100
Edit
Note the change to deal with the case when the new sample is 100, and with the way sample() deals with a single integer as opposed to a vector for x.
Edit 2
Actually, reading the help for sample gave the useful resample function, which avoids the pitfalls when length(x) == 1.
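To see the pitfall: when sample()'s x argument is a single positive number, it draws from 1:x rather than from x itself, whereas resample() always treats x as the set of values to draw from.
x <- 99:100
sample(x[x > 99], 1)    # x[x > 99] is just 100, so this draws from 1:100!
resample(x[x > 99], 1)  # always returns 100, as intended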
Not particularly efficient but:
X <- 0
samps <- c()
while (X < 100) {
  if (is.null(samps)) {z <- 1} else {z <- 1 + samps[length(samps)]}
  if (z == 100) {
    samps <- c(samps, z)
  } else {
    samps <- c(samps, sample(z:100, 1))
  }
  X <- samps[length(samps)]
}
samps
EDIT: Trimming a little fat from it:
samps <- c()
while (is.null(samps[length(samps)]) || samps[length(samps)] < 100) {
  if (is.null(samps)) {z <- 1} else {z <- 1 + samps[length(samps)]}
  if (z == 100) {
    samps <- c(samps, z)
  } else {
    samps <- c(samps, sample(z:100, 1))
  }
}
samps
Even later to the party, but just for kicks (the loop keeps dropping elements that are not larger than their predecessor until the sequence stabilizes as strictly increasing):
X <- Y <- sample(100L)
while(length(X <- Y) != length(Y <- X[c(TRUE, diff(X)>0)])) {}
> print(X)
[1] 28 44 60 98 100
Sorting Random Vectors
Create a vector of random integers and sort it afterwards.
sort(sample(1:1000, size = 10, replace = FALSE),decreasing = FALSE)
This gives 10 random integers between 1 and 1000.
> sort(sample(1:1000, size = 10, replace = FALSE),decreasing = FALSE)
[1] 44 88 164 314 617 814 845 917 944 995
This of course also works with random decimals and floats.

Append an object to a list in R in amortized constant time, O(1)?

If I have some R list mylist, you can append an item obj to it like so:
mylist[[length(mylist)+1]] <- obj
But surely there is some more compact way. When I was new at R, I tried writing lappend() like so:
lappend <- function(lst, obj) {
  lst[[length(lst)+1]] <- obj
  return(lst)
}
but of course that doesn't work due to R's call-by-value semantics (lst is effectively copied upon call, so changes to lst are not visible outside the scope of lappend()). I know you can do environment hacking in an R function to reach outside the scope of your function and mutate the calling environment, but that seems like a large hammer for writing a simple append function.
Can anyone suggest a more beautiful way of doing this? Bonus points if it works for both vectors and lists.
If it's a list of strings, just use the c() function:
R> LL <- list(a="tom", b="dick")
R> c(LL, c="harry")
$a
[1] "tom"
$b
[1] "dick"
$c
[1] "harry"
R> class(LL)
[1] "list"
R>
That works on vectors too, so do I get the bonus points?
Edit (2015-Feb-01): This post is coming up on its fifth birthday. Some kind readers keep pointing out shortcomings with it, so by all means also see some of the comments below. One suggestion for list types:
newlist <- list(oldlist, list(someobj))
In general, R types can make it hard to have one and just one idiom for all types and uses.
The OP (in the April 2012 updated revision of the question) is interested in knowing if there's a way to add to a list in amortized constant time, such as can be done, for example, with a C++ vector<> container. The best answer(s?) here so far only show the relative execution times for various solutions given a fixed-size problem, but do not address any of the various solutions' algorithmic efficiency directly. Comments below many of the answers discuss the algorithmic efficiency of some of the solutions, but in every case to date (as of April 2015) they come to the wrong conclusion.
Algorithmic efficiency captures the growth characteristics, either in time (execution time) or space (amount of memory consumed) as a problem size grows. Running a performance test for various solutions given a fixed-size problem does not address the various solutions' growth rate. The OP is interested in knowing if there is a way to append objects to an R list in "amortized constant time". What does that mean? To explain, first let me describe "constant time":
Constant or O(1) growth:
If the time required to perform a given task remains the same as the size of the problem doubles, then we say the algorithm exhibits constant time growth, or stated in "Big O" notation, exhibits O(1) time growth. When the OP says "amortized" constant time, he simply means "in the long run"... i.e., if performing a single operation occasionally takes much longer than normal (e.g. if a preallocated buffer is exhausted and occasionally requires resizing to a larger buffer size), as long as the long-term average performance is constant time, we'll still call it O(1).
For comparison, I will also describe "linear time" and "quadratic time":
Linear or O(n) growth:
If the time required to perform a given task doubles as the size of the problem doubles, then we say the algorithm exhibits linear time, or O(n) growth.
Quadratic or O(n²) growth:
If the time required to perform a given task increases by the square of the problem size, then we say the algorithm exhibits quadratic time, or O(n²) growth.
There are many other efficiency classes of algorithms; I defer to the Wikipedia article for further discussion.
I thank @CronAcronis for his answer, as I am new to R and it was nice to have a fully-constructed block of code for doing a performance analysis of the various solutions presented on this page. I am borrowing his code for my analysis, which I duplicate (wrapped in a function) below:
library(microbenchmark)
### Using environment as a container
lPtrAppend <- function(lstptr, lab, obj) {lstptr[[deparse(substitute(lab))]] <- obj}
### Store list inside new environment
envAppendList <- function(lstptr, obj) {lstptr$list[[length(lstptr$list)+1]] <- obj}
runBenchmark <- function(n) {
  microbenchmark(times = 5,
    env_with_list_ = {
      listptr <- new.env(parent=globalenv())
      listptr$list <- NULL
      for(i in 1:n) {envAppendList(listptr, i)}
      listptr$list
    },
    c_ = {
      a <- list(0)
      for(i in 1:n) {a = c(a, list(i))}
    },
    list_ = {
      a <- list(0)
      for(i in 1:n) {a <- list(a, list(i))}
    },
    by_index = {
      a <- list(0)
      for(i in 1:n) {a[length(a) + 1] <- i}
      a
    },
    append_ = {
      a <- list(0)
      for(i in 1:n) {a <- append(a, i)}
      a
    },
    env_as_container_ = {
      listptr <- new.env(parent=globalenv())
      for(i in 1:n) {lPtrAppend(listptr, i, i)}
      listptr
    }
  )
}
The results posted by @CronAcronis definitely seem to suggest that the a <- list(a, list(i)) method is fastest, at least for a problem size of 10000, but the results for a single problem size do not address the growth of the solution. For that, we need to run a minimum of two profiling tests, with differing problem sizes:
> runBenchmark(2e+3)
Unit: microseconds
expr min lq mean median uq max neval
env_with_list_ 8712.146 9138.250 10185.533 10257.678 10761.33 12058.264 5
c_ 13407.657 13413.739 13620.976 13605.696 13790.05 13887.738 5
list_ 854.110 913.407 1064.463 914.167 1301.50 1339.132 5
by_index 11656.866 11705.140 12182.104 11997.446 12741.70 12809.363 5
append_ 15986.712 16817.635 17409.391 17458.502 17480.55 19303.560 5
env_as_container_ 19777.559 20401.702 20589.856 20606.961 20939.56 21223.502 5
> runBenchmark(2e+4)
Unit: milliseconds
expr min lq mean median uq max neval
env_with_list_ 534.955014 550.57150 550.329366 553.5288 553.955246 558.636313 5
c_ 1448.014870 1536.78905 1527.104276 1545.6449 1546.462877 1558.609706 5
list_ 8.746356 8.79615 9.162577 8.8315 9.601226 9.837655 5
by_index 953.989076 1038.47864 1037.859367 1064.3942 1065.291678 1067.143200 5
append_ 1634.151839 1682.94746 1681.948374 1689.7598 1696.198890 1706.683874 5
env_as_container_ 204.134468 205.35348 208.011525 206.4490 208.279580 215.841129 5
>
First of all, a word about the min/lq/mean/median/uq/max values: Since we are performing the exact same task for each of 5 runs, in an ideal world, we could expect that it would take exactly the same amount of time for each run. But the first run is normally biased toward longer times due to the fact that the code we are testing is not yet loaded into the CPU's cache. Following the first run, we would expect the times to be fairly consistent, but occasionally our code may be evicted from the cache due to timer tick interrupts or other hardware interrupts that are unrelated to the code we are testing. By testing the code snippets 5 times, we are allowing the code to be loaded into the cache during the first run and then giving each snippet 4 chances to run to completion without interference from outside events. For this reason, and because we are really running the exact same code under the exact same input conditions each time, we will consider only the 'min' times to be sufficient for the best comparison between the various code options.
Note that I chose to first run with a problem size of 2000 and then 20000, so my problem size increased by a factor of 10 from the first run to the second.
Performance of the list solution: O(1) (constant time)
Let's first look at the growth of the list solution, since we can tell right away that it's the fastest solution in both profiling runs: In the first run, it took 854 microseconds (0.854 milliseconds) to perform 2000 "append" tasks. In the second run, it took 8.746 milliseconds to perform 20000 "append" tasks. A naïve observer would say, "Ah, the list solution exhibits O(n) growth, since as the problem size grew by a factor of ten, so did the time required to execute the test." The problem with that analysis is that what the OP wants is the growth rate of a single object insertion, not the growth rate of the overall problem. Knowing that, it's clear then that the list solution provides exactly what the OP wants: a method of appending objects to a list in O(1) time.
Performance of the other solutions
None of the other solutions come even close to the speed of the list solution, but it is informative to examine them anyway:
Most of the other solutions appear to be O(n) in performance. For example, the by_index solution, a very popular solution based on the frequency with which I find it in other SO posts, took 11.6 milliseconds to append 2000 objects, and 953 milliseconds to append ten times that many objects. The overall problem's time grew by a factor of 100, so a naïve observer might say "Ah, the by_index solution exhibits O(n²) growth, since as the problem size grew by a factor of ten, the time required to execute the test grew by a factor of 100." As before, this analysis is flawed, since the OP is interested in the growth of a single object insertion. If we divide the overall time growth by the problem's size growth, we find that the time growth of appending objects increased by a factor of only 10, not a factor of 100, which matches the growth of the problem size, so the by_index solution is O(n). There are no solutions listed which exhibit O(n²) growth for appending a single object.
In the other answers, only the list approach results in O(1) appends, but it produces a deeply nested list structure rather than a plain single list. I have used the data structures below; they support O(1) (amortized) appends and allow the result to be converted back to a plain list.
expandingList <- function(capacity = 10) {
  buffer <- vector('list', capacity)
  length <- 0
  methods <- list()
  methods$double.size <- function() {
    buffer <<- c(buffer, vector('list', capacity))
    capacity <<- capacity * 2
  }
  methods$add <- function(val) {
    if(length == capacity) {
      methods$double.size()
    }
    length <<- length + 1
    buffer[[length]] <<- val
  }
  methods$as.list <- function() {
    b <- buffer[0:length]
    return(b)
  }
  methods
}
and
linkedList <- function() {
  head <- list(0)
  length <- 0
  methods <- list()
  methods$add <- function(val) {
    length <<- length + 1
    head <<- list(head, val)
  }
  methods$as.list <- function() {
    b <- vector('list', length)
    h <- head
    for(i in length:1) {
      b[[i]] <- h[[2]]
      h <- h[[1]]
    }
    return(b)
  }
  methods
}
Use them as follows:
> l <- expandingList()
> l$add("hello")
> l$add("world")
> l$add(101)
> l$as.list()
[[1]]
[1] "hello"
[[2]]
[1] "world"
[[3]]
[1] 101
These solutions could be expanded into full objects that support all list-related operations by themselves, but that will remain as an exercise for the reader.
Another variant for a named list:
namedExpandingList <- function(capacity = 10) {
  buffer <- vector('list', capacity)
  names <- character(capacity)
  length <- 0
  methods <- list()
  methods$double.size <- function() {
    buffer <<- c(buffer, vector('list', capacity))
    names <<- c(names, character(capacity))
    capacity <<- capacity * 2
  }
  methods$add <- function(name, val) {
    if(length == capacity) {
      methods$double.size()
    }
    length <<- length + 1
    buffer[[length]] <<- val
    names[length] <<- name
  }
  methods$as.list <- function() {
    b <- buffer[0:length]
    names(b) <- names[0:length]
    return(b)
  }
  methods
}
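Hypothetical usage, analogous to the expandingList example above:
> nl <- namedExpandingList()
> nl$add("greeting", "hello")
> nl$add("answer", 42)
> nl$as.list()
$greeting
[1] "hello"
$answer
[1] 42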
Benchmarks
Performance comparison using @phonetagger's code (which is based on @Cron Arconis' code). I have also added a better_env_as_container and changed the env_as_container_ a bit. The original env_as_container_ was broken and didn't actually store all the numbers.
library(microbenchmark)
lPtrAppend <- function(lstptr, lab, obj) {lstptr[[deparse(lab)]] <- obj}
### Store list inside new environment
envAppendList <- function(lstptr, obj) {lstptr$list[[length(lstptr$list)+1]] <- obj}
env2list <- function(env, len) {
  l <- vector('list', len)
  for (i in 1:len) {
    l[[i]] <- env[[as.character(i)]]
  }
  l
}
envl2list <- function(env, len) {
  l <- vector('list', len)
  for (i in 1:len) {
    l[[i]] <- env[[paste(as.character(i), 'L', sep='')]]
  }
  l
}
runBenchmark <- function(n) {
  microbenchmark(times = 5,
    env_with_list_ = {
      listptr <- new.env(parent=globalenv())
      listptr$list <- NULL
      for(i in 1:n) {envAppendList(listptr, i)}
      listptr$list
    },
    c_ = {
      a <- list(0)
      for(i in 1:n) {a = c(a, list(i))}
    },
    list_ = {
      a <- list(0)
      for(i in 1:n) {a <- list(a, list(i))}
    },
    by_index = {
      a <- list(0)
      for(i in 1:n) {a[length(a) + 1] <- i}
      a
    },
    append_ = {
      a <- list(0)
      for(i in 1:n) {a <- append(a, i)}
      a
    },
    env_as_container_ = {
      listptr <- new.env(hash=TRUE, parent=globalenv())
      for(i in 1:n) {lPtrAppend(listptr, i, i)}
      envl2list(listptr, n)
    },
    better_env_as_container = {
      env <- new.env(hash=TRUE, parent=globalenv())
      for(i in 1:n) env[[as.character(i)]] <- i
      env2list(env, n)
    },
    linkedList = {
      a <- linkedList()
      for(i in 1:n) { a$add(i) }
      a$as.list()
    },
    inlineLinkedList = {
      a <- list()
      for(i in 1:n) { a <- list(a, i) }
      b <- vector('list', n)
      head <- a
      for(i in n:1) {
        b[[i]] <- head[[2]]
        head <- head[[1]]
      }
    },
    expandingList = {
      a <- expandingList()
      for(i in 1:n) { a$add(i) }
      a$as.list()
    },
    inlineExpandingList = {
      l <- vector('list', 10)
      cap <- 10
      len <- 0
      for(i in 1:n) {
        if(len == cap) {
          l <- c(l, vector('list', cap))
          cap <- cap*2
        }
        len <- len + 1
        l[[len]] <- i
      }
      l[1:len]
    }
  )
}
# We need to repeatedly add an element to a list. With normal list concatenation
# or element setting this would lead to a large number of memory copies and a
# quadratic runtime. To prevent that, this function implements a bare bones
# expanding array, in which list appends are (amortized) constant time.
expandingList <- function(capacity = 10) {
  buffer <- vector('list', capacity)
  length <- 0
  methods <- list()
  methods$double.size <- function() {
    buffer <<- c(buffer, vector('list', capacity))
    capacity <<- capacity * 2
  }
  methods$add <- function(val) {
    if(length == capacity) {
      methods$double.size()
    }
    length <<- length + 1
    buffer[[length]] <<- val
  }
  methods$as.list <- function() {
    b <- buffer[0:length]
    return(b)
  }
  methods
}
linkedList <- function() {
  head <- list(0)
  length <- 0
  methods <- list()
  methods$add <- function(val) {
    length <<- length + 1
    head <<- list(head, val)
  }
  methods$as.list <- function() {
    b <- vector('list', length)
    h <- head
    for(i in length:1) {
      b[[i]] <- h[[2]]
      h <- h[[1]]
    }
    return(b)
  }
  methods
}
# We need to repeatedly add an element to a list. With normal list concatenation
# or element setting this would lead to a large number of memory copies and a
# quadratic runtime. To prevent that, this function implements a bare bones
# expanding array, in which list appends are (amortized) constant time.
namedExpandingList <- function(capacity = 10) {
  buffer <- vector('list', capacity)
  names <- character(capacity)
  length <- 0
  methods <- list()
  methods$double.size <- function() {
    buffer <<- c(buffer, vector('list', capacity))
    names <<- c(names, character(capacity))
    capacity <<- capacity * 2
  }
  methods$add <- function(name, val) {
    if(length == capacity) {
      methods$double.size()
    }
    length <<- length + 1
    buffer[[length]] <<- val
    names[length] <<- name
  }
  methods$as.list <- function() {
    b <- buffer[0:length]
    names(b) <- names[0:length]
    return(b)
  }
  methods
}
result:
> runBenchmark(1000)
Unit: microseconds
expr min lq mean median uq max neval
env_with_list_ 3128.291 3161.675 4466.726 3361.837 3362.885 9318.943 5
c_ 3308.130 3465.830 6687.985 8578.913 8627.802 9459.252 5
list_ 329.508 343.615 389.724 370.504 449.494 455.499 5
by_index 3076.679 3256.588 5480.571 3395.919 8209.738 9463.931 5
append_ 4292.321 4562.184 7911.882 10156.957 10202.773 10345.177 5
env_as_container_ 24471.511 24795.849 25541.103 25486.362 26440.591 26511.200 5
better_env_as_container 7671.338 7986.597 8118.163 8153.726 8335.659 8443.493 5
linkedList 1700.754 1755.439 1829.442 1804.746 1898.752 1987.518 5
inlineLinkedList 1109.764 1115.352 1163.751 1115.631 1206.843 1271.166 5
expandingList 1422.440 1439.970 1486.288 1519.728 1524.268 1525.036 5
inlineExpandingList 942.916 973.366 1002.461 1012.197 1017.784 1066.044 5
> runBenchmark(10000)
Unit: milliseconds
expr min lq mean median uq max neval
env_with_list_ 357.760419 360.277117 433.810432 411.144799 479.090688 560.779139 5
c_ 685.477809 734.055635 761.689936 745.957553 778.330873 864.627811 5
list_ 3.257356 3.454166 3.505653 3.524216 3.551454 3.741071 5
by_index 445.977967 454.321797 515.453906 483.313516 560.374763 633.281485 5
append_ 610.777866 629.547539 681.145751 640.936898 760.570326 763.896124 5
env_as_container_ 281.025606 290.028380 303.885130 308.594676 314.972570 324.804419 5
better_env_as_container 83.944855 86.927458 90.098644 91.335853 92.459026 95.826030 5
linkedList 19.612576 24.032285 24.229808 25.461429 25.819151 26.223597 5
inlineLinkedList 11.126970 11.768524 12.216284 12.063529 12.392199 13.730200 5
expandingList 14.735483 15.854536 15.764204 16.073485 16.075789 16.081726 5
inlineExpandingList 10.618393 11.179351 13.275107 12.391780 14.747914 17.438096 5
> runBenchmark(20000)
Unit: milliseconds
expr min lq mean median uq max neval
env_with_list_ 1723.899913 1915.003237 1921.23955 1938.734718 1951.649113 2076.910767 5
c_ 2759.769353 2768.992334 2810.40023 2820.129738 2832.350269 2870.759474 5
list_ 6.112919 6.399964 6.63974 6.453252 6.910916 7.321647 5
by_index 2163.585192 2194.892470 2292.61011 2209.889015 2436.620081 2458.063801 5
append_ 2832.504964 2872.559609 2983.17666 2992.634568 3004.625953 3213.558197 5
env_as_container_ 573.386166 588.448990 602.48829 597.645221 610.048314 642.912752 5
better_env_as_container 154.180531 175.254307 180.26689 177.027204 188.642219 206.230191 5
linkedList 38.401105 47.514506 46.61419 47.525192 48.677209 50.952958 5
inlineLinkedList 25.172429 26.326681 32.33312 34.403442 34.469930 41.293126 5
expandingList 30.776072 30.970438 34.45491 31.752790 38.062728 40.712542 5
inlineExpandingList 21.309278 22.709159 24.64656 24.290694 25.764816 29.158849 5
I have added linkedList and expandingList and an inlined version of both. The inlineLinkedList is basically a copy of list_, but it also converts the nested structure back into a plain list. Beyond that, the difference between the inlined and non-inlined versions is due to the overhead of the function calls.
All variants of expandingList and linkedList show O(1) append performance, with the benchmark time scaling linearly with the number of items appended. linkedList is slower than expandingList, and the function call overhead is also visible. So if you really need all the speed you can get (and want to stick to R code), use an inlined version of expandingList.
I've also had a look at the C implementation of R, and both approaches should be O(1) append for any size up until you run out of memory.
I have also changed env_as_container_, the original version would store every item under index "i", overwriting the previously appended item. The better_env_as_container I have added is very similar to env_as_container_ but without the deparse stuff. Both exhibit O(1) performance, but they have an overhead that is quite a bit larger than the linked/expanding lists.
Memory overhead
In the C R implementation there is an overhead of 4 words and 2 ints per allocated object. The linkedList approach allocates one list of length two per append, for a total of (4*8+4+4+2*8=) 56 bytes per appended item on 64-bit computers (excluding memory allocation overhead, so probably closer to 64 bytes). The expandingList approach uses one word per appended item, plus a copy when doubling the vector length, so a total memory usage of up to 16 bytes per item. Since the memory is all in one or two objects the per-object overhead is insignificant. I haven't looked deeply into the env memory usage, but I think it will be closer to linkedList.
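As a rough illustration (exact sizes vary by R build and platform):
# one fresh two-slot list is allocated per append in the linked-list approach
object.size(list(NULL, 1L))
# the expanding list keeps one pointer-sized slot per item in a single vector
object.size(vector('list', 1000))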
In Lisp we did it this way:
> l <- c(1)
> l <- c(2, l)
> l <- c(3, l)
> l <- rev(l)
> l
[1] 1 2 3
though it was 'cons', not just 'c'. If you need to start with an empty list, use l <- NULL.
You want something like this maybe?
> push <- function(l, x) {
    lst <- get(l, parent.frame())
    lst[length(lst)+1] <- x
    assign(l, lst, envir=parent.frame())
  }
> a <- list(1,2)
> push('a', 6)
> a
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 6
It's not a very polite function (assigning to parent.frame() is kind of rude) but IIUYC it's what you're asking for.
If you pass in the list variable as a quoted string, you can reach it from within the function like:
push <- function(l, x) {
  assign(l, append(eval(as.name(l)), x), envir=parent.frame())
}
so:
> a <- list(1,2)
> a
[[1]]
[1] 1
[[2]]
[1] 2
> push("a", 3)
> a
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
>
or for extra credit:
> v <- vector()
> push("v", 1)
> v
[1] 1
> push("v", 2)
> v
[1] 1 2
>
Not sure why you think your first method won't work. You have a bug in the lappend function: length(list) should be length(lst). This works fine and returns a list with the appended obj.
I have made a small comparison of methods mentioned here.
n = 1e+4
library(microbenchmark)
### Using environment as a container
lPtrAppend <- function(lstptr, lab, obj) {lstptr[[deparse(substitute(lab))]] <- obj}
### Store list inside new environment
envAppendList <- function(lstptr, obj) {lstptr$list[[length(lstptr$list)+1]] <- obj}
microbenchmark(times = 5,
  env_with_list_ = {
    listptr <- new.env(parent=globalenv())
    listptr$list <- NULL
    for(i in 1:n) {envAppendList(listptr, i)}
    listptr$list
  },
  c_ = {
    a <- list(0)
    for(i in 1:n) {a = c(a, list(i))}
  },
  list_ = {
    a <- list(0)
    for(i in 1:n) {a <- list(a, list(i))}
  },
  by_index = {
    a <- list(0)
    for(i in 1:n) {a[length(a) + 1] <- i}
    a
  },
  append_ = {
    a <- list(0)
    for(i in 1:n) {a <- append(a, i)}
    a
  },
  env_as_container_ = {
    listptr <- new.env(parent=globalenv())
    for(i in 1:n) {lPtrAppend(listptr, i, i)}
    listptr
  }
)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
env_with_list_ 188.9023 198.7560 224.57632 223.2520 229.3854 282.5859 5 a
c_ 1275.3424 1869.1064 2022.20984 2191.7745 2283.1199 2491.7060 5 b
list_ 17.4916 18.1142 22.56752 19.8546 20.8191 36.5581 5 a
by_index 445.2970 479.9670 540.20398 576.9037 591.2366 607.6156 5 a
append_ 1140.8975 1316.3031 1794.10472 1620.1212 1855.3602 3037.8416 5 b
env_as_container_ 355.9655 360.1738 399.69186 376.8588 391.7945 513.6667 5 a
Try this function lappend:
lappend <- function(lst, ...){
  lst <- c(lst, list(...))
  return(lst)
}
and other suggestions from this page Add named vector to a list
Bye.
In fact there is a subtlety with the c() function. If you do:
x <- list()
x <- c(x,2)
x = c(x,"foo")
you will obtain as expected:
[[1]]
[1] 2
[[2]]
[1] "foo"
but if you add a matrix with x <- c(x, matrix(5,2,2)), your list will gain another 4 elements, each of value 5!
You had better do:
x <- c(x, list(matrix(5,2,2)))
It works for any other object and you will obtain as expected:
[[1]]
[1] 2
[[2]]
[1] "foo"
[[3]]
     [,1] [,2]
[1,]    5    5
[2,]    5    5
Finally, your function becomes:
push <- function(l, ...) c(l, list(...))
and it works for any type of object. You can be smarter and do:
push_back <- function(l, ...) c(l, list(...))
push_front <- function(l, ...) c(list(...), l)
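For example:
l <- list(1, 2)
l <- push_back(l, 3)    # list(1, 2, 3)
l <- push_front(l, 0)   # list(0, 1, 2, 3)
Note that c() still copies its arguments, so each push is O(n) in the length of the list; this is about convenience, not the amortized O(1) appends discussed elsewhere on this page.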
I think what you want to do is actually pass by reference (a pointer) to the function: create a new environment (environments are passed by reference to functions) with the list added to it:
listptr <- new.env(parent=globalenv())
listptr$list <- mylist
# Then the function is modified as:
lPtrAppend <- function(lstptr, obj) {
  lstptr$list[[length(lstptr$list)+1]] <- obj
}
Now you are only modifying the existing list (not creating a new one)
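A hypothetical usage sketch, continuing from the code above with mylist <- list(1, 2):
lPtrAppend(listptr, 3)
length(listptr$list)  # 3: the environment is shared, so the caller sees the update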
This is a straightforward way to add items to an R List:
# create an empty list:
small_list = list()
# now put some objects in it:
small_list$k1 = "v1"
small_list$k2 = "v2"
small_list$k3 = 1:10
# retrieve them the same way:
small_list$k1
# returns "v1"
# "index" notation works as well:
small_list["k2"]
Or programmatically:
kx = paste(LETTERS[1:5], 1:5, sep="")
vx = runif(5)
lx = list()
cn = 1
for (itm in kx) { lx[itm] = vx[cn]; cn = cn + 1 }
print(length(lx))
# returns 5
There is also list.append from the rlist package:
require(rlist)
LL <- list(a="Tom", b="Dick")
list.append(LL,d="Pam",f=c("Joe","Ann"))
It's very simple and efficient.
> LL<-list(1:4)
> LL
[[1]]
[1] 1 2 3 4
> LL<-list(c(unlist(LL),5:9))
> LL
[[1]]
[1] 1 2 3 4 5 6 7 8 9
This is a very interesting question, and I hope my thought below can contribute a way of solving it. This method does give a flat list without indexing, but it does need list and unlist to avoid nested structures. I'm not sure about the speed since I don't know how to benchmark it.
a_list <- list()
for (i in 1:3) {
  a_list <- list(unlist(list(unlist(a_list, recursive = FALSE), list(rnorm(2))), recursive = FALSE))
}
a_list
[[1]]
[[1]][[1]]
[1] -0.8098202 1.1035517
[[1]][[2]]
[1] 0.6804520 0.4664394
[[1]][[3]]
[1] 0.15592354 0.07424637
For validation I ran the benchmark code provided by @Cron. There is one major difference (in addition to running faster on the newer i7 processor): the by_index now performs nearly as well as the list_:
Unit: milliseconds
expr min lq mean median uq
env_with_list_ 167.882406 175.969269 185.966143 181.817187 185.933887
c_ 485.524870 501.049836 516.781689 518.637468 537.355953
list_ 6.155772 6.258487 6.544207 6.269045 6.290925
by_index 9.290577 9.630283 9.881103 9.672359 10.219533
append_ 505.046634 543.319857 542.112303 551.001787 553.030110
env_as_container_ 153.297375 154.880337 156.198009 156.068736 156.800135
For reference, here is the benchmark code copied verbatim from @Cron's answer (just in case he later changes the contents):
n = 1e+4
library(microbenchmark)
### Using environment as a container
lPtrAppend <- function(lstptr, lab, obj) {lstptr[[deparse(substitute(lab))]] <- obj}
### Store list inside new environment
envAppendList <- function(lstptr, obj) {lstptr$list[[length(lstptr$list)+1]] <- obj}
microbenchmark(times = 5,
  env_with_list_ = {
    listptr <- new.env(parent=globalenv())
    listptr$list <- NULL
    for(i in 1:n) {envAppendList(listptr, i)}
    listptr$list
  },
  c_ = {
    a <- list(0)
    for(i in 1:n) {a = c(a, list(i))}
  },
  list_ = {
    a <- list(0)
    for(i in 1:n) {a <- list(a, list(i))}
  },
  by_index = {
    a <- list(0)
    for(i in 1:n) {a[length(a) + 1] <- i}
    a
  },
  append_ = {
    a <- list(0)
    for(i in 1:n) {a <- append(a, i)}
    a
  },
  env_as_container_ = {
    listptr <- new.env(parent=globalenv())
    for(i in 1:n) {lPtrAppend(listptr, i, i)}
    listptr
  }
)
I ran the following benchmark:
bench=function(...,n=1,r=3){
a=match.call(expand.dots=F)$...
t=matrix(ncol=length(a),nrow=n)
for(i in 1:length(a))for(j in 1:n){t1=Sys.time();eval(a[[i]],parent.frame());t[j,i]=Sys.time()-t1}
o=t(apply(t,2,function(x)c(median(x),min(x),max(x),mean(x))))
round(1e3*`dimnames<-`(o,list(names(a),c("median","min","max","mean"))),r)
}
ns=10^c(3:7)
m=sapply(ns,function(n)bench(n=5,
`vector at length + 1`={l=c();for(i in 1:n)l[length(l)+1]=i},
`vector at index`={l=c();for(i in 1:n)l[i]=i},
`vector at index, initialize with type`={l=integer();for(i in 1:n)l[i]=i},
`vector at index, initialize with length`={l=vector(length=n);for(i in 1:n)l[i]=i},
`vector at index, initialize with type and length`={l=integer(n);for(i in 1:n)l[i]=i},
`list at length + 1`={l=list();for(i in 1:n)l[[length(l)+1]]=i},
`list at index`={l=list();for(i in 1:n)l[[i]]=i},
`list at index, initialize with length`={l=vector('list',n);for(i in 1:n)l[[i]]=i},
`list at index, initialize with double length, remove null`={l=vector("list",2*n);for(i in 1:n)l[[i]]=i;l=head(l,i)},
`list at index, double when full, get length from variable`={len=1;l=list();for(i in 1:n){l[[i]]=i;if(i==len){len=len*2;length(l)=len}};l=head(l,i)},
`list at index, double when full, check length inside loop`={len=1;l=list();for(i in 1:n){l[[i]]=i;if(i==length(l)){length(l)=i*2}};l=head(l,i)},
`nested lists`={l=list();for(i in 1:n)l=list(l,i)},
`nested lists with unlist`={if(n<=1e5){l=list();for(i in 1:n)l=list(l,i);o=unlist(l)}},
`nested lists with manual unlist`={l=list();for(i in 1:n)l=list(l,i);o=integer(n);for(i in 1:n){o[n-i+1]=l[[2]];l=l[[1]]}},
`JanKanis better_env_as_container`={env=new.env(hash=T,parent=globalenv());for(i in 1:n)env[[as.character(i)]]=i},
`JanKanis inlineLinkedList`={a=list();for(i in 1:n)a=list(a,i);b=vector('list',n);head=a;for(i in n:1){b[[i]]=head[[2]];head=head[[1]]}},
`JanKanis inlineExpandingList`={l=vector('list',10);cap=10;len=0;for(i in 1:n){if(len==cap){l=c(l,vector('list',cap));cap=cap*2};len=len+1;l[[len]]=i};l[1:len]},
`c`={if(n<=1e5){l=c();for(i in 1:n)l=c(l,i)}},
`append vector`={if(n<=1e5){l=integer(n);for(i in 1:n)l=append(l,i)}},
`append list`={if(n<=1e9){l=list();for(i in 1:n)l=append(l,i)}}
)[,1])
m[rownames(m)%in%c("nested lists with unlist","c","append vector","append list"),4:5]=NA
m2=apply(m,2,function(x)formatC(x,max(0,2-ceiling(log10(min(x,na.rm=T)))),format="f"))
m3=apply(rbind(paste0("1e",log10(ns)),m2),2,function(x)formatC(x,max(nchar(x)),format="s"))
writeLines(apply(cbind(m3,c("",rownames(m))),1,paste,collapse=" "))
Output:
1e3 1e4 1e5 1e6 1e7
2.35 24.5 245 2292 27146 vector at length + 1
0.61 5.9 60 590 7360 vector at index
0.61 5.9 64 587 7132 vector at index, initialize with type
0.56 5.6 54 523 6418 vector at index, initialize with length
0.54 5.5 55 522 6371 vector at index, initialize with type and length
2.65 28.8 299 3955 48204 list at length + 1
0.93 9.2 96 1605 13480 list at index
0.58 5.6 57 707 8461 list at index, initialize with length
0.62 5.8 59 739 9413 list at index, initialize with double length, remove null
0.88 8.4 81 962 11872 list at index, double when full, get length from variable
0.96 9.5 92 1264 15813 list at index, double when full, check length inside loop
0.21 1.9 22 426 3826 nested lists
0.25 2.4 29 NA NA nested lists with unlist
2.85 27.5 295 3065 31427 nested lists with manual unlist
1.65 20.2 293 6505 8835 JanKanis better_env_as_container
1.11 10.1 110 1534 27119 JanKanis inlineLinkedList
2.66 26.3 266 3592 47120 JanKanis inlineExpandingList
1.22 118.6 15466 NA NA c
3.64 512.0 45167 NA NA append vector
6.35 664.8 71399 NA NA append list
The table above shows the median time for each method and not the mean time, because occasionally a single run took much longer than a typical run which distorted the mean running time. But none of the methods became much faster on subsequent runs after the first run, so the minimum time and median time were typically similar for each method.
The method "vector at index" (l=c();for(i in 1:n)l[i]=i) was about 5 times faster than "vector at length + 1" (l=c();for(i in 1:n)l[length(l)]=i), because getting the length of the vector took longer than adding an element to the vector. When I initialized the vector with a predetermined length, it made the code about 20% faster, but initializing with a specific type didn't make a difference, because the type just needs to be changed once when the first item is added to the vector. And in the case of lists, when you compare the methods "list at index" and "list at index initialized with length", initializing the list with a predetermined length made a bigger difference as the length of the list increased, because it made the code about twice as fast at length 1e6 but about 3 times as fast at length 1e7.
The method "list at index" (l=list();for(i in 1:n)l[[i]]=i) was about 3-4 times faster than the method "list at length + 1" (l=list();for(i in 1:n)l[[length(l)+1]]=i).
The linked list and expanding list methods by JanKanis were slower than "list at index" but faster than "list at length + 1". The linked list was faster than the expanding list.
Some people claim that the append function is faster than the c function, but in my benchmark append was about 3-4 times slower than c.
In the table above, the lengths 1e6 and 1e7 are missing for four methods: for "c", "append vector", and "append list" because they have quadratic time complexity, and for "nested lists with unlist" because it results in a stack overflow.
The "nested lists" option was the fastest, but it doesn't include the time that it takes to flatten the list. When I used the unlist function to flatten the nested list, I got a stack overflow when the length of the list was around 1.26e5 or higher, because the unlist function calls itself recursively by default: n=1.26e5;l=list();for(i in 1:n)l=list(l,list(i));u=unlist(l). And when I used repeated calls of unlist(recursive=F), it took about 4 seconds to run even for a list with only 10,000 items: for(i in 1:n)l=unlist(l,recursive=F). But when I did the unlisting manually, it only took about 0.3 seconds to run for a list with a million items: o=integer(n);for(i in 1:n){o[n-i+1]=l[[2]];l=l[[1]]}.
If you don't know how many items you are going to append to a list in advance but you know the maximum number of items, then you can try to initialize the list at the maximum length and then later remove NULL values. Or another approach is to double the size of the list every time the list becomes full (which you can do faster if you have one variable for the length of the list and another variable for the number of items you have added to the list, so then you don't have to check the length of the list object on each iteration of a loop):
ns=10^c(2:7)
m=sapply(ns,function(n)bench(n=5,
`list at index`={l=list();for(i in 1:n)l[[i]]=i},
`list at length + 1`={l=list();for(i in 1:n)l[[length(l)+1]]=i},
`list at index, initialize with length`={l=vector("list",n);for(i in 1:n)l[[i]]=i},
`list at index, initialize with double length, remove null`={l=vector("list",2*n);for(i in 1:n)l[[i]]=i;l=head(l,i)},
`list at index, initialize with length 1e7, remove null`={l=vector("list",1e7);for(i in 1:n)l[[i]]=i;l=head(l,i)},
`list at index, initialize with length 1e8, remove null`={l=vector("list",1e8);for(i in 1:n)l[[i]]=i;l=head(l,i)},
`list at index, double when full, get length from variable`={len=1;l=list();for(i in 1:n){l[[i]]=i;if(i==len){len=len*2;length(l)=len}};l=head(l,i)},
`list at index, double when full, check length inside loop`={len=1;l=list();for(i in 1:n){l[[i]]=i;if(i==length(l)){length(l)=i*2}};l=head(l,i)}
)[,1])
m2=apply(m,2,function(x)formatC(x,max(0,2-ceiling(log10(min(x)))),format="f"))
m3=apply(rbind(paste0("1e",log10(ns)),m2),2,function(x)formatC(x,max(nchar(x)),format="s"))
writeLines(apply(cbind(m3,c("",rownames(m))),1,paste,collapse=" "))
Output:
1e4 1e5 1e6 1e7
9.3 102 1225 13250 list at index
27.4 315 3820 45920 list at length + 1
5.7 58 726 7548 list at index, initialize with length
5.8 60 748 8057 list at index, initialize with double length, remove null
33.4 88 902 7684 list at index, initialize with length 1e7, remove null
333.2 393 2691 12245 list at index, initialize with length 1e8, remove null
8.6 83 1032 10611 list at index, double when full, get length from variable
9.3 96 1280 14319 list at index, double when full, check length inside loop
