Revert list structure - r

GOAL
Given a list of lists, my goal is to reverse its structure (in R).
That is, I want the elements of the nested lists to become the elements of the top-level list.
An example probably specifies my purpose better. Given:
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
I want an output equivalent to the following R object:
o <- list(a = list(z1 = 1, z2 = 1), b = list(z1 = 2, z2 = 4), c = list(z1 = 3, z2 = 0))
SOLUTIONS
MY SOLUTION
I created my own solution, which I am attaching below, but let me know if there is a better one.
revert_list_str_1 <- function(ls) {
res <- lapply(names(ls[[1]]), function(n, env) {
name <- paste(n, 'elements', sep = '_')
assign(name, vector('list', 0))
inner <- sapply(ls, function(x) {
assign(name, c(get(name), x[which(names(x) == n)]))
})
names(inner) <- names(ls)
inner
})
names(res) <- names(ls[[1]])
res
}
Executing str(revert_list_str_1(z)) I obtain the following output, which corresponds to what I wanted.
List of 3
$ a:List of 2
..$ z1: num 1
..$ z2: num 1
$ b:List of 2
..$ z1: num 2
..$ z2: num 4
$ c:List of 2
..$ z1: num 3
..$ z2: num 0
But as I said, I'd like to find out (and learn) whether a more elegant and dynamic solution exists.
In fact, my solution works fully only if all the nested lists have the same names (possibly in a different order). This is because of names(ls[[1]]). I would also point out that it acts only on two-level lists, like the one shown.
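The dependence on names(ls[[1]]) can be seen on an unbalanced input. This is only a minimal sketch: z_unbalanced, first_names and all_names are names made up for the illustration.

```r
# Hypothetical unbalanced input: z1 lacks the element 'c'
z_unbalanced <- list(z1 = list(a = 1, b = 2),
                     z2 = list(b = 4, a = 1, c = 0))

# Names taken from the first nested list only: 'c' is never visited
first_names <- names(z_unbalanced[[1]])

# Union of the names of all nested lists: 'c' is included
all_names <- unique(unlist(lapply(z_unbalanced, names)))

print(first_names)
print(all_names)
```

Any approach that iterates over first_names alone will silently drop 'c', while one that iterates over all_names has to decide what to do with the missing z1 entry.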
So, do you know other solutions that are more dynamic? Can rapply and/or Filter functions be useful for this task?
end edit 1.
ANALYSIS OF PROPOSED SOLUTIONS
I've done a little analysis of the proposed solutions, thank you all!
The analysis consists of verifying the following points for all functions:
1. accepted classes (nested list elements)
  1.1 type preserved even if there are elements with different types (if they are atomic)
  1.2 object contained in the elements preserved (e.g. a matrix)
2. columns considered (by columns I mean the names of the nested lists)
  2.1 not common columns ignored (the classification 'not' is understood positively in this case)
  2.2 not common columns preserved
  2.3 it works also when columns do not match (based only on the names of the first nested list)
In all these cases the classification 'yes' is understood positively, except for point 2.1.
These are all the functions I've considered (the comments relate to the analysis items mentioned above):
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, not 2.3
revert_list_str_1 <- function(ls) { # #leodido
# see above
}
# not 1.1
# not 1.2
# not 2.1, not 2.2, not 2.3
revert_list_str_2 <- function(ls) { # #mnel
# convert each component of list to a data.frame
# so rbind.data.frame so named elements are matched
x <- data.frame((do.call(rbind, lapply(ls, data.frame))))
# convert each column into an appropriately named list
o <- lapply(as.list(x), function(i, nam) as.list(`names<-`(i, nam)), nam = rownames(x))
o
}
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, yes 2.3
revert_list_str_3 <- function(ls) { # #mnel
# unique names
nn <- Reduce(unique, lapply(ls, names))
# convert from matrix to list `[` used to ensure correct ordering
as.list(data.frame(do.call(rbind,lapply(ls, `[`, nn))))
}
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, yes 2.3
revert_list_str_4 <- function(ls) { # #Josh O'Brien
# get sub-elements in same order
x <- lapply(ls, `[`, names(ls[[1]]))
# stack and reslice
apply(do.call(rbind, x), 2, as.list)
}
# not 1.1
# not 1.2
# not 2.1, not 2.2, not 2.3
revert_list_str_5 <- function(ls) { # #mnel
apply(data.frame((do.call(rbind, lapply(ls, data.frame)))), 2, as.list)
}
# not 1.1
# not 1.2
# not 2.1, yes 2.2, yes 2.3
revert_list_str_6 <- function(ls) { # #baptiste + #Josh O'Brien
b <- recast(ls, L2 ~ L1) # recast() comes from the reshape package
apply(b, 1, as.list)
}
# yes 1.1
# yes 1.2
# not 2.1, yes 2.2, yes 2.3
revert_list_str_7 <- function(ll) { # #Josh O'Brien
nms <- unique(unlist(lapply(ll, function(X) names(X))))
ll <- lapply(ll, function(X) setNames(X[nms], nms))
ll <- apply(do.call(rbind, ll), 2, as.list)
lapply(ll, function(X) X[!sapply(X, is.null)])
}
CONSIDERATIONS
This analysis shows that:
functions revert_list_str_7 and revert_list_str_6 are the most flexible regarding the names of the nested lists
functions revert_list_str_4 and revert_list_str_3, followed by my own function, are complete enough and good trade-offs
the most complete function overall is revert_list_str_7.
BENCHMARKS
To complete the work I've run some small benchmarks (with the microbenchmark R package) on these functions (times = 1000 for each benchmark).
BENCHMARK 1
Input:
list(z1 = list(a = 1, b = 2, c = 3), z2 = list(a = 0, b = 3, d = 22, f = 9)).
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 250.069 467.5645 503.6420 527.5615 2028.780
2 func_3 204.386 393.7340 414.5485 429.6010 3517.438
3 func_4 89.922 173.7030 189.0545 194.8590 1669.178
4 func_6 11295.463 20985.7525 21433.8680 21934.5105 72476.316
5 func_7 348.585 387.0265 656.7270 691.2060 2393.988
Winner: revert_list_str_4.
BENCHMARK 2
Input:
list(z1 = list(a = 1, b = 2, c = 'ciao'), z2 = list(a = 0, b = 3, c = 5)).
revert_list_str_6 is excluded because it does not support nested elements of different types.
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 249.558 483.2120 502.0915 550.7215 2096.978
2 func_3 210.899 387.6835 400.7055 447.3785 1980.912
3 func_4 92.420 170.9970 182.0335 192.8645 1857.582
4 func_7 257.772 469.9280 477.8795 487.3705 2035.101
Winner: revert_list_str_4.
BENCHMARK 3
Input:
list(z1 = list(a = 1, b = m, c = 'ciao'), z2 = list(a = 0, b = 3, c = m)).
m is a 3x3 matrix of integers, and revert_list_str_6 has been excluded again.
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 261.173 484.6345 503.4085 551.6600 2300.750
2 func_3 209.322 393.7235 406.6895 449.7870 2118.252
3 func_4 91.556 174.2685 184.5595 196.2155 1602.983
4 func_7 252.883 474.1735 482.0985 491.9485 2058.306
Winner: revert_list_str_4. Again!
end edit 2.
CONCLUSION
First of all: thanks to all, wonderful solutions.
In my opinion, if you know in advance that your list will have nested lists with the same names, revert_list_str_4 is the winner as the best compromise between performance and support for different types.
The most complete solution is revert_list_str_7, although its full flexibility makes it on average about 2.5 times slower than revert_list_str_4 (it is useful if your nested lists have different names).

Edit:
Here's a more flexible version that will work on lists whose elements don't necessarily contain the same set of sub-elements.
fun <- function(ll) {
nms <- unique(unlist(lapply(ll, function(X) names(X))))
ll <- lapply(ll, function(X) setNames(X[nms], nms))
ll <- apply(do.call(rbind, ll), 2, as.list)
lapply(ll, function(X) X[!sapply(X, is.null)])
}
## An example of an 'unbalanced' list
z <- list(z1 = list(a = 1, b = 2),
z2 = list(b = 4, a = 1, c = 0))
## Try it out
fun(z)
Original answer
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
zz <- lapply(z, `[`, names(z[[1]])) ## Get sub-elements in same order
apply(do.call(rbind, zz), 2, as.list) ## Stack and reslice

EDIT -- working from #Josh O'Brien's suggestion and my own improvements
The problem was that do.call rbind was not calling rbind.data.frame which does some matching of names. rbind.data.frame should work, because data.frames are lists and each sublist is a list, so we could just call it directly.
apply(do.call(rbind.data.frame, z), 1, as.list)
However, while this may be succinct, it is slow because do.call(rbind.data.frame, ...) is inherently slow.
Something like this (in two steps):
# convert each component of z to a data.frame
# so rbind.data.frame so named elements are matched
x <- data.frame((do.call(rbind, lapply(z, data.frame))))
# convert each column into an appropriately named list
o <- lapply(as.list(x), function(i,nam) as.list(`names<-`(i, nam)), nam = rownames(x))
o
$a
$a$z1
[1] 1
$a$z2
[1] 1
$b
$b$z1
[1] 2
$b$z2
[1] 4
$c
$c$z1
[1] 3
$c$z2
[1] 0
And an alternative
# unique names
nn <- Reduce(unique,lapply(z, names))
# convert from matrix to list `[` used to ensure correct ordering
as.list(data.frame(do.call(rbind,lapply(z, `[`, nn))))

reshape can get you close,
library(reshape)
b = recast(z, L2~L1)
split(b[,-1], b$L2)

The recently released purrr package contains a function, transpose, whose purpose is to 'revert' a list structure. There is a major caveat to the transpose function: it matches on position and not name (https://cran.r-project.org/web/packages/purrr/purrr.pdf). This means that it is not the correct tool for benchmark 1 above. I therefore only consider benchmarks 2 and 3 below.
Benchmark 2
B2 <- list(z1 = list(a = 1, b = 2, c = 'ciao'), z2 = list(a = 0, b = 3, c = 5))
revert_list_str_8 <- function(ll) { # #z109620
purrr::transpose(ll) # transpose() comes from the purrr package
}
microbenchmark(revert_list_str_1(B2), revert_list_str_3(B2), revert_list_str_4(B2), revert_list_str_7(B2), revert_list_str_8(B2), times = 1e3)
Unit: microseconds
expr min lq mean median uq max neval
revert_list_str_1(B2) 228.752 254.1695 297.066646 268.8325 293.5165 4501.231 1000
revert_list_str_3(B2) 211.645 232.9070 277.149579 250.9925 278.6090 2512.361 1000
revert_list_str_4(B2) 79.673 92.3810 112.889130 100.2020 111.4430 2522.625 1000
revert_list_str_7(B2) 237.062 252.7030 293.978956 264.9230 289.1175 4838.982 1000
revert_list_str_8(B2) 2.445 6.8440 9.503552 9.2880 12.2200 148.591 1000
Clearly function transpose is the winner! It also utilizes much less code.
Benchmark 3
B3 <- list(z1 = list(a = 1, b = m, c = 'ciao'), z2 = list(a = 0, b = 3, c = m))
microbenchmark(revert_list_str_1(B3), revert_list_str_3(B3), revert_list_str_4(B3), revert_list_str_7(B3), revert_list_str_8(B3), times = 1e3)
Unit: microseconds
expr min lq mean median uq max neval
revert_list_str_1(B3) 229.242 253.4360 280.081313 266.877 281.052 2818.341 1000
revert_list_str_3(B3) 213.600 232.9070 271.793957 248.304 272.743 2739.646 1000
revert_list_str_4(B3) 80.161 91.8925 109.713969 98.980 108.022 2403.362 1000
revert_list_str_7(B3) 236.084 254.6580 287.274293 264.922 280.319 2718.628 1000
revert_list_str_8(B3) 2.933 7.3320 9.140367 9.287 11.243 55.233 1000
Again, transpose outperforms all others.
The problem with the benchmark tests above is that they focus on very small lists. For this reason, the numerous loops nested within functions 1-7 do not pose too much of a problem. As the size of the list, and therefore the number of iterations, increases, the speed gains of transpose will likely increase.
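To check how the timings scale, a larger input is needed. Here is a rough sketch of a generator for balanced nested lists; make_nested_list and its arguments are hypothetical names, not something used in the benchmarks above, and the sizes are arbitrary.

```r
# Hypothetical helper: builds n_outer nested lists, each holding
# n_inner named numeric elements (same names, same order everywhere)
make_nested_list <- function(n_outer, n_inner) {
  inner_names <- paste0("e", seq_len(n_inner))
  outer_names <- paste0("z", seq_len(n_outer))
  lapply(setNames(outer_names, outer_names), function(i) {
    setNames(as.list(runif(n_inner)), inner_names)
  })
}

set.seed(1)
big <- make_nested_list(1000, 50)

# Quick look at the shape of the generated input
print(length(big))
print(head(names(big[[1]])))
```

The resulting big list can then be passed to the same microbenchmark call as B2 and B3. No timings are claimed here.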
The purrr package is awesome! It does a lot more than revert lists. In combination with the dplyr package, the purrr package makes it possible to do most of your coding using the powerful and beautiful functional programming paradigm. Thank the lord for Hadley!

How about this simple solution, which is completely general, and almost as fast as Josh O'Brien's original answer that assumed common internal names (#4)?
zv <- unlist(unname(z), recursive=FALSE)
ans <- split(setNames(zv, rep(names(z), lengths(z))), names(zv))
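To see why these two lines work, it may help to look at the intermediate values. A small sketch on the unbalanced example used earlier in this thread; outer_names is a name introduced here only for the illustration.

```r
z <- list(z1 = list(a = 1, b = 2),
          z2 = list(b = 4, a = 1, c = 0))

# Flatten one level: a single list whose names are the inner names
zv <- unlist(unname(z), recursive = FALSE)
print(names(zv))

# For every flattened element, the name of the outer list it came from
outer_names <- rep(names(z), lengths(z))
print(outer_names)

# Relabel by outer name, then group by inner name
ans <- split(setNames(zv, outer_names), names(zv))
str(ans)
```

The element c ends up with a single entry (z2), because z1 has no c. Nothing is padded with NULL.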
And here is a general version that is robust to not having names:
invertList <- function(z) {
zv <- unlist(unname(z), recursive=FALSE)
zind <- if (is.null(names(zv))) sequence(lengths(z)) else names(zv)
if (!is.null(names(z)))
zv <- setNames(zv, rep(names(z), lengths(z)))
ans <- split(zv, zind)
if (is.null(names(zv)))
ans <- unname(ans)
ans
}

I'd like to add a further solution to this valuable collection (to which I have turned many times):
revert_list_str_9 <- function(x) do.call(Map, c(c, x))
If this were code golf, we'd have a clear winner! Of course, this requires the individual list entries to be in the same order. This can be extended, using various ideas from above, such as
revert_list_str_10 <- function(x) {
nme <- names(x[[1]]) # from revert_list_str_4
do.call(Map, c(c, lapply(x, `[`, nme)))
}
revert_list_str_11 <- function(x) {
nme <- Reduce(unique, lapply(x, names)) # from revert_list_str_3
do.call(Map, c(c, lapply(x, `[`, nme)))
}
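The ordering requirement is easy to see on the z from the question, where z2 lists b before a. A minimal sketch; positional and by_name are names made up for the illustration.

```r
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))

# Positional pairing: the first element of z1 (a) meets the first element of z2 (b)
positional <- do.call(Map, c(c, z))

# Reordering every nested list by the names of the first one restores the pairing
by_name <- do.call(Map, c(c, lapply(z, `[`, names(z[[1]]))))

print(positional$a)
print(by_name$a)
```

Note that with c as the combining function the inner results are named vectors rather than lists.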
Performance-wise it's also not too shabby. If stuff is properly sorted, we have a new base R solution to beat. If not, timings still are very competitive.
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
microbenchmark::microbenchmark(
revert_list_str_1(z), revert_list_str_2(z), revert_list_str_3(z),
revert_list_str_4(z), revert_list_str_5(z), revert_list_str_7(z),
revert_list_str_9(z), revert_list_str_10(z), revert_list_str_11(z),
times = 1e3
)
#> Unit: microseconds
#> expr min lq mean median uq max
#> revert_list_str_1(z) 51.946 60.9845 67.72623 67.2540 69.8215 1293.660
#> revert_list_str_2(z) 461.287 482.8720 513.21260 490.5495 498.8110 1961.542
#> revert_list_str_3(z) 80.180 89.4905 99.37570 92.5800 95.3185 1424.012
#> revert_list_str_4(z) 19.383 24.2765 29.50865 26.9845 29.5385 1262.080
#> revert_list_str_5(z) 499.433 525.8305 583.67299 533.1135 543.4220 25025.568
#> revert_list_str_7(z) 56.647 66.1485 74.53956 70.8535 74.2445 1309.346
#> revert_list_str_9(z) 6.128 7.9100 11.50801 10.2960 11.5240 1591.422
#> revert_list_str_10(z) 8.455 10.9500 16.06621 13.2945 14.8430 1745.134
#> revert_list_str_11(z) 14.953 19.8655 26.79825 22.1805 24.2885 2084.615
Unfortunately, this is not my creation, but exists courtesy of #thelatemail.

Related

make a variable based on a cumulative sum with reset based on condition

I want a variable such as desired_output, based on a cumulative sum over cumsumover, where the cumsum function resets every time it reaches the next number in thresh.
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
desired_output <- c(3, 3 ,7 ,11 ,11 ,11) # same length as cumsumover
This question is similar, but I can't wrap my head around the code.
dplyr / R cumulative sum with reset
Compared to similar questions my condition is specified in a vector of different length than the cumsumover.
Any help would be greatly appreciated. Bonus if both a base R and a tidyverse approach is provided.
In base R, we can use cut with breaks as thresh and labels as letters of same length as thresh.
cut(cumsum(cumsumover),breaks = c(0, thresh[-1], max(cumsum(cumsumover))),
labels = letters[seq_along(thresh)])
#[1] a a b c c c
The breaks are 0, then thresh without its first element, then max(cumsum(cumsumover)) as the upper bound, so that anything beyond the last element of thresh is assigned the last label.
If we want labels as thresh instead of letters
cut(cumsum(cumsumover),breaks = c(0, thresh[-1], max(cumsum(cumsumover))),labels = thresh)
#[1] 3 3 7 11 11 11
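Keep in mind that cut returns a factor, so the printed 3 3 7 11 11 11 are labels, not numbers. If a numeric vector like desired_output is needed, one way (a sketch, with running, out and out_num as made-up names) is to convert through character:

```r
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)

running <- cumsum(cumsumover)
out <- cut(running, breaks = c(0, thresh[-1], max(running)), labels = thresh)

# as.numeric() directly on a factor would return the level codes (1, 2, 3),
# so convert the labels to character first
out_num <- as.numeric(as.character(out))
print(out_num)
```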
Here is another solution:
data:
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
code:
outp <- letters[1:3] # to make solution more general
cumsumover_copy <- cumsumover # <<- is used inside sapply, so work on a copy to stay safe
unlist(
sapply(seq_along(thresh), function(x) {
cs_over <- cumsum(cumsumover_copy)
ntimes = sum( cs_over <= thresh[x] )
cumsumover_copy <<- cumsumover_copy[-(1:ntimes)]
return( rep(outp[x], ntimes) )
} )
)
result:
#[1] "a" "a" "b" "c" "c" "c"
Using .bincode you can do this:
thresh[.bincode(cumsum(cumsumover), c(-Inf,thresh[-1],Inf))]
[1] 3 3 7 11 11 11
.bincode is used by cut, which basically adds labels and checks, so it's more efficient:
x <-rep(cumsum(cumsumover),10000)
microbenchmark::microbenchmark(
bincode = thresh[.bincode(x, c(-Inf,thresh[-1],Inf))],
cut = cut(x,breaks = c(-Inf, thresh[-1], Inf),labels = thresh))
# Unit: microseconds
# expr min lq mean median uq max neval
# bincode 450.2 459.75 654.794 482.10 642.20 5028.4 100
# cut 1739.3 1864.90 2622.593 2215.15 2713.25 12194.8 100

Count number of occurrences of vector in list

I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable data structure is acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.
match and unique accept and handle "list"s too (?match warns that it is slow on "list"s). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
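If the list-of-pairs shape from the question is preferred over a data frame, the same two pieces can be zipped together. A small sketch; u, n and res are names chosen here for illustration.

```r
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))

u <- unique(q)               # distinct vectors, in order of first appearance
n <- tabulate(match(q, u))   # how often each distinct vector occurs

# Pair every distinct vector with its count
res <- Map(list, u, n)
str(res)
```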
As an alternative to the match(x, unique(x)) approach, we could map each element to a single value by deparsing:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since each vector here contains unique integers, and assuming they fall in a small range, we could map each element to a single integer after transforming it to a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate as before.
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R >= 3.3.0 has a grouping function, built off data.table, which, along with the ordering, provides some attributes for further manipulation:
Make all elements of equal length and "transpose" (probably the slowest operation in this case, though I'm not sure how else to feed grouping):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before.
To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison q's elements are of small length to accommodate f3, but f3 (because of large exponentiation) and f4 (because of mapply) will suffer, in performance, if "list"s of larger elements are used.
One way is to paste each vector, unlist and tabulate, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1

how to remove unique values from a vector

I have a large numeric vector - how can I remove the unique values from it efficiently?
To give a simplified example, how can I get from vector a to vector b?
> a = c(1, 2, 3, 3, 2, 4) # 1 and 4 are the unique values
> b = c(2, 3, 3, 2)
To add to the options already available:
a[duplicated(a) | duplicated(a, fromLast=TRUE)]
# [1] 2 3 3 2
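The reason for scanning in both directions is that duplicated only flags the second and later occurrences of a value. A short sketch of the two masks (fwd and bwd are names made up here):

```r
a <- c(1, 2, 3, 3, 2, 4)

# Flags repeats seen when scanning left to right (first occurrences are FALSE)
fwd <- duplicated(a)

# Flags repeats seen when scanning right to left (last occurrences are FALSE)
bwd <- duplicated(a, fromLast = TRUE)

print(fwd)
print(bwd)

# A value that occurs more than once is flagged by at least one of the scans
print(a[fwd | bwd])
```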
Update: More benchmarks!
Comparing Prasanna's answer with mine, and comparing it against asieira's functions, we get the following:
fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast=TRUE)]
set.seed(1)
a <- ceiling(runif(1000000, min=0, max=100))
library(microbenchmark)
microbenchmark(remove.uniques1(a), remove.uniques2(a),
fun1(a), fun2(a), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# remove.uniques1(a) 1957.9565 1971.3125 2002.7045 2057.0911 2151.1178 20
# remove.uniques2(a) 2049.9714 2065.6566 2095.4877 2146.3000 2210.6742 20
# fun1(a) 213.6129 216.6337 219.2829 297.3085 303.9394 20
# fun2(a) 154.0829 155.5459 155.9748 158.9121 246.2436 20
I suspect that the number of unique values would also make a difference in terms of the efficiency of these approaches.
a[a %in% a[duplicated(a)]]
[1] 2 3 3 2
This should give the right answer.
a = c(1, 2, 3, 3, 2, 4)
dups <- duplicated(a)
dup.val <- a[dups]
a[a %in% dup.val]
One vectorized way to do this is to use the built-in table function to find which values only appear once, and then remove them from the vector:
> a = c(1, 2, 3, 3, 2, 4)
> tb.a = table(a)
> appears.once = as.numeric(names(tb.a[tb.a==1]))
> appears.once
[1] 1 4
> b = a[!a %in% appears.once]
> b
[1] 2 3 3 2
Notice the table function converts the values from the original vector to the names, which is character. So we need to convert it back to numeric in your example.
Another way of doing that with data.table:
> library(data.table)
> dt.a = data.table(a=a)
> dt.a[,count:=.N,by=a]
> b = dt.a[count>1]$a
> b
[1] 2 3 3 2
Now let's time them:
remove.uniques1 <- function(x) {
tb.x = table(x)
appears.once = as.numeric(names(tb.x[tb.x==1]))
return(x[!x %in% appears.once])
}
remove.uniques2 <- function(x) {
dt.x = data.table(data=x)
dt.x[,count:=.N,by=data]
return(dt.x[count>1]$data)
}
> a = ceiling(runif(1000000, min=0, max=100))
> system.time( remove.uniques1(a) )
user system elapsed
1.598 0.033 1.658
> system.time( remove.uniques2(a) )
user system elapsed
0.845 0.007 0.855
So both are pretty fast, but the data.table version is clearly faster. Not to mention that remove.uniques2 preserves whatever type the input vector is. In the case of remove.uniques1, however, you have to replace the call to as.numeric with whatever fits the type of your original vector.
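If the type conversion is the main concern and data.table is not available, a base R variant that never goes through names could look like the sketch below. It is not one of the functions timed above: remove_uniques_ave is a hypothetical name, and no claim is made about its speed.

```r
# Hypothetical base R variant: count group sizes with ave() instead of table()
remove_uniques_ave <- function(x) {
  # For every element, the number of elements sharing its value
  group_size <- ave(seq_along(x), x, FUN = length)
  x[group_size > 1]
}

print(remove_uniques_ave(c(1, 2, 3, 3, 2, 4)))
print(remove_uniques_ave(c("a", "b", "a")))
```

Because the input is only ever subset, the result has the same type as x.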

Finding the index of first changes in the elements of a vector

I have a vector v and I would like to find the index of first changes in elements of a vector in R. How can I do this? For example:
v = c(1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2)
rle is a good idea, but if you only want the indices of the changepoints you can just do:
c(1,1+which(diff(v)!=0))
## 1 8 10
You're looking for rle:
rle(v)
## Run Length Encoding
## lengths: int [1:3] 7 2 5
## values : num [1:3] 1 1.5 2
This says that the value changes at locations 7+1, 7+2+1 (and 7+2+5+1 would be the index of the element "one off the end")
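That arithmetic can be done directly on the rle result. A minimal sketch (r and starts are names made up here) that returns the start index of every run, including the first:

```r
v <- c(1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2)

r <- rle(v)

# The first run starts at 1; each later run starts right after the previous ones end
starts <- cumsum(c(1, head(r$lengths, -1)))
print(starts)
```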
The data.table package internally (meaning not exported yet) uses a function uniqlist (in dev 1.8.11) or alternatively duplist (in current 1.8.10 #CRAN) that does exactly what you're after. It should be quite fast. Here's a benchmark:
require(data.table)
set.seed(45)
# prepare a huge vector (sorted)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))
require(microbenchmark)
ben <- function(v) c(1,1+which(diff(v)!=0))
matthew <- function(v) rle(v)
matteo <- function(v) firstDiff(v)
exegetic <- function(v) first.changes(v)
# if you use 1.8.10, replace uniqlist with duplist
dt <- function(v) data.table:::uniqlist(list(v))
microbenchmark( ans1 <- ben(x), matthew(x), matteo(x),
exegetic(x), ans2 <- dt(x), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- ben(x) 1181.808 1229.8197 1313.2646 1357.5026 1553.9296 10
# matthew(x) 1456.918 1496.0300 1581.0062 1660.4067 2148.1691 10
# matteo(x) 28609.890 29305.1117 30698.5843 32706.3147 34290.9864 10
# exegetic(x) 1365.243 1546.0652 1576.8699 1659.5488 1886.6058 10
# ans2 <- dt(x) 113.712 114.7794 143.9938 180.3743 221.8386 10
identical(as.integer(ans1), ans2) # [1] TRUE
I'm not that familiar with Rcpp, but it seems like the solution could be improved quite a bit.
Edit: Refer to Matteo's updated answer for Rcpp timings.
> v <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 6)
first.changes <- function(d) {
p <- cumsum(rle(d)$lengths) + 1
p[-length(p)]
}
> first.changes(v)
[1] 4 6 11 12 14
Or, with your data,
> v = c(1, 1, 1, 1, 1, 1, 1, 1.5, 1.5, 2, 2, 2, 2, 2)
> first.changes(v)
[1] 8 10
If you need the operation to be fast you can use the Rcpp package to call C++ from R:
library(Rcpp)
library(data.table)
library(microbenchmark)
# Rcpp solution
cppFunction('
NumericVector firstDiff(NumericVector & vett)
{
const int N = vett.size();
std::list<double> changes;
changes.push_back(1.0);
NumericVector::iterator iterH = vett.begin() + 1;
NumericVector::iterator iterB = vett.begin();
int count = 2;
for(iterH = vett.begin() + 1; iterH != vett.end(); iterH++, iterB++)
{
if(*iterH != *iterB) changes.push_back(count);
count++;
}
return wrap(changes);
}
')
# Data table
dt <- function(input) data.table:::uniqlist(list(input))
# Comparison
set.seed(543)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))
microbenchmark(ans1 <- firstDiff(x), which(diff(x) != 0)[1], rle(x),
ans2 <- dt(x), times = 10)
Unit: milliseconds
expr min lq median uq max neval
ans1 <- firstDiff(x) 50.10679 50.12327 50.14164 50.16343 50.28475 10
which(diff(x) != 0)[1] 545.66478 547.58388 556.15550 561.78275 617.40281 10
rle(x) 664.53262 687.04316 709.84949 714.91528 721.37204 10
dt(x) 60.60317 82.30181 82.70207 86.13330 94.07739 10
identical(as.integer(ans1), ans2)
#[1] TRUE
Rcpp is slightly faster than data.table and much faster than the other alternatives in this example.

How can I flatten two lists within a list without using data.table?

I would like to form one data.frame from lists within a list
L1 <- list(A = c(1, 2, 3), B = c(5, 6, 7))
L2 <- list(A = c(11, 22, 33), B = c(15, 16, 17))
L3 <- list(L1, L2)
L3
library(data.table)
According to the 'data.table' manual : "'rbindlist' Same as do.call("rbind",l), but much faster"
I would like to achieve what 'rbindlist' does using R base package
rbindlist does exactly what I need but 'do.call' does not!
rbindlist(L3)
do.call does not do what I want
do.call(rbind, L3)
identical(rbindlist(L3), do.call(rbind, L3))
I'd think calling as.data.frame each time could be costly. How about this?
as.data.frame(do.call(mapply, c(L3, FUN=c, SIMPLIFY=FALSE)))
mapply basically takes the first elements of each list in L3 and applies the function FUN to them, then the second elements, and so on. Suppose you had two lists (L3[[1]] and L3[[2]]); then you'd do:
mapply(FUN=c, L3[[1]], L3[[2]], SIMPLIFY=FALSE)
Here SIMPLIFY=FALSE makes sure the output is not converted (or simplified) to a matrix. Thus it'll be a list. For a general case, we use do.call and pass our list with all other arguments for function mapply. Hope this helps.
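As a quick check of the explanation, here is a sketch on the L3 from the question (cols, general and df are names introduced only for the illustration):

```r
L1 <- list(A = c(1, 2, 3), B = c(5, 6, 7))
L2 <- list(A = c(11, 22, 33), B = c(15, 16, 17))
L3 <- list(L1, L2)

# Two lists spelled out: c() is applied to the A's, then to the B's
cols <- mapply(FUN = c, L3[[1]], L3[[2]], SIMPLIFY = FALSE)
print(cols$A)

# General form: the elements of L3 become the arguments of mapply
general <- do.call(mapply, c(L3, FUN = c, SIMPLIFY = FALSE))

# A list of equal-length columns converts to a data.frame in one step
df <- as.data.frame(general)
print(df)
```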
Benchmarking on big data:
ll <- unlist(replicate(1e3, L3, simplify=FALSE), rec=FALSE)
aa <- function() as.data.frame(do.call(mapply, c(ll, FUN=c, SIMPLIFY=FALSE)))
bb <- function() do.call(rbind, lapply(ll, as.data.frame))
require(microbenchmark)
microbenchmark(o1 <- aa(), o2 <- bb(), times=10)
Unit: milliseconds
expr min lq median uq max neval
o1 <- aa() 4.356838 4.931118 5.462995 7.623445 20.5797 10
o2 <- bb() 673.773795 683.754535 701.557972 710.535860 724.2267 10
identical(o1, o2) # [1] TRUE
You need to convert the sublists in L3 to data.frames first:
> do.call(rbind, lapply(L3, as.data.frame))
A B
1 1 5
2 2 6
3 3 7
4 11 15
5 22 16
6 33 17
