I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable datastructure acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.
match and unique also accept "list"s (although ?match warns that matching on "list"s is slow). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
As an alternative to the match(x, unique(x)) approach, we could map each element to a single string by deparsing:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since each vector here contains unique integers, and assuming they fall in a small range, we could map each element to a single integer after transforming it to a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate as before.
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R >= 3.3.0 has a grouping function, derived from data.table, which, along with the ordering, provides some attributes for further manipulation.
Make all elements of equal length and "transpose" (probably the slowest operation in this case, though I'm not sure how else to feed grouping):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before.
To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison q's elements are short, to accommodate f3, but f3 (because of the large exponentiation) and f4 (because of .mapply) will suffer in performance if "list"s with larger elements are used.
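A further caveat on f3, offered as a hedged illustration rather than as part of the comparison above: besides speed, the binary encoding can also stop being exact. A double carries 53 bits of mantissa, so once the values in the vectors go much beyond 53, distinct vectors may receive the same key. The helper name key_f3 below is made up for this sketch; it only reproduces the encoding used inside f3.

```r
# Hypothetical helper reproducing the encoding used inside f3
key_f3 <- function(x, n) sum(tabulate(x, nbins = n) * 2 ^ (0:(n - 1)))

n  <- 60
k1 <- key_f3(c(1, 60), n)   # 2^0 + 2^59
k2 <- key_f3(60, n)         # 2^59

# 2^59 + 1 cannot be represented as a double, so both keys are equal
collide <- k1 == k2
collide
```

If the values can be that large, the match/unique or paste-based keys are likely the safer choice.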
One way is to paste each vector, unlist, and tabulate, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1
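If the original vectors are needed next to the counts (the question asks for the vector together with its count), the pasted strings can serve as keys only. The following is a minimal sketch under that assumption; the names key, first and res are illustrative and not from the answer above.

```r
q <- list(c(1, 3, 5), c(2, 4), c(1, 3, 5), c(2, 5), c(7), c(2, 5))

key   <- vapply(q, paste, character(1), collapse = ",")  # one string per vector
first <- !duplicated(key)                                # first occurrence of each key

res <- data.frame(key = key[first],
                  n   = tabulate(match(key, key[first])),
                  stringsAsFactors = FALSE)
res$vec <- q[first]   # list column holding the original vectors
res
```

Unlike table(), this keeps the groups in order of first appearance rather than sorting the keys alphabetically.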
I have a data.frame of 1,480 rows and 1,400 columns like:
1 2 3 4 5 6 ..... 1399 1400
1 0 0 0 1 0 0 ..... 1 0 #first occurrence would be at 4
2 0 0 0 0 0 1 ..... 0 1
3 1 0 0 1 0 0 ..... 0 0
## and etc
Each row contains a series of 0's and 1's - predominantly 0's. For each row, I want to find at which column the first 1 shows up and set the remaining values to 0's.
My current implementation can efficiently find the occurrence of the first 1, but I've only figured out how to zero out the remaining values iteratively by row. In repeated simulations, this iterative process is taking too long.
Here is the current implementation:
N <- length(df[which(df$arm == 0), "pt_id"]) # of patients
M <- max_days
#
# df is like the data frame shown above
#
df[which(df$arm == 0), 5:length(colnames(df))] <- unlist(lapply(matrix(data = rep(pbo_hr, M*N), nrow=N, ncol = M), rbinom, n=1, size = 1))
event_day_post_rand <- apply(df[,5:length(colnames(df))], MARGIN = 1, FUN = function(x) which (x>0)[1])
df <- add_column(df, "event_day_post_rand" = event_day_post_rand, .after = "arm_id")
##
## From here trial days start on column 6 for df
##
#zero out events that occurred after the first event, since each patient can only have 1 max event which will be taken as the earliest event
for (pt_id in df[which(!is.na(df$event_day_post_rand)),"pt_id"]){
event_idx = df[which(df$pt_id == pt_id), "event_day_post_rand"]
df[which(df$pt_id == pt_id), as.character(5+event_idx+1):"1400"] <- 0
}
We can do
mat <- as.matrix(df) ## data frame to matrix
j <- max.col(mat, ties.method = "first")
mat[] <- 0
mat[cbind(1:nrow(mat), j)] <- 1
df <- data.frame(mat) ## matrix to data frame
I also suggest just using a matrix to store these values. In addition, the result will be a sparse matrix. So I recommend
library(Matrix)
sparseMatrix(i = 1:nrow(mat), j = j, x = rep(1, length(j)))
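One thing to watch with the max.col approach, shown here as a hedged sketch on made-up data: for a row that contains no 1 at all, every entry ties at 0, so max.col with ties.method = "first" returns column 1 and the row would end up with a 1 it never had. Assuming such rows can occur, they can be skipped when writing the 1s back; the name has_one is illustrative.

```r
mat <- rbind(c(0, 0, 1, 1),
             c(0, 0, 0, 0),   # no event in this row
             c(1, 0, 1, 0))

j <- max.col(mat, ties.method = "first")
has_one <- rowSums(mat) > 0          # rows that contain at least one 1

out <- matrix(0, nrow(mat), ncol(mat))
out[cbind(which(has_one), j[has_one])] <- 1   # only rows with an event get a 1
out
```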
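We can get a little more performance by leaving the matrix as it is and zeroing only those 1 elements that are not the first 1 in their row, i.e. whose row index is a duplicate of one already seen.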
Since the OP is open to starting with a matrix rather than a data.frame, I'll do the same.
# dummy data
m <- matrix(sample(0:1, 1480L*1400L, TRUE, c(0.9, 0.1)), 1480L, 1400L)
# proposed solution
f1 <- function(m) {
ones <- which(m == 1L)
m[ones[duplicated((ones - 1L) %% nrow(m), nmax = nrow(m))]] <- 0L
m
}
# Zheyuan Li's solution
f2 <- function(m) {
j <- max.col(m, ties.method = "first")
m[] <- 0L
m[cbind(1:nrow(m), j)] <- 1L
m
}
microbenchmark::microbenchmark(f1 = f1(m),
f2 = f2(m),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1457 11.45020 12.04258 11.9011 12.3529 37.6716 100
#> f2 12.8424 14.92955 17.31811 15.3251 16.0550 43.6314 100
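To make the index arithmetic in f1 easier to follow, here is a small worked example on a made-up 3 x 4 matrix. which() returns linear indices in column-major order, so the first time a row index appears it belongs to the leftmost 1 of that row; every later appearance is flagged by duplicated(). The intermediate names rows and later are only for illustration.

```r
m <- rbind(c(0L, 1L, 0L, 1L),
           c(1L, 1L, 0L, 0L),
           c(0L, 0L, 0L, 0L))

ones  <- which(m == 1L)                # linear indices, column-major: 2 4 5 10
rows  <- (ones - 1L) %% nrow(m) + 1L   # row of each 1: 2 1 2 1
later <- duplicated(rows)              # TRUE for every 1 after the first in its row

m[ones[later]] <- 0L
m
```

Rows without any 1 are simply left untouched.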
Zheyuan Li's suggestion to go with a sparse matrix is a good idea.
# convert to a memory-efficient nsparseMatrix
library(Matrix)
m1 <- as(Matrix(f1(m), dimnames = list(NULL, NULL), sparse = TRUE), "nsparseMatrix")
object.size(m)
#> 8288216 bytes
object.size(m1)
#> 12864 bytes
# proposed function to go directly to a sparse matrix
f3 <- function(m) {
n <- nrow(m)
ones <- which(m == 1L) - 1L
i <- ones %% n
idx <- which(!duplicated(i, nmax = n))
sparseMatrix(i[idx], ones[idx] %/% n, dims = dim(m), index1 = FALSE, repr = "C")
}
# going directly to a sparse matrix using Zheyuan Li's solution
f4 <- function(m) {
sparseMatrix(1:nrow(m), max.col(m, ties.method = "first"), dims = dim(m), repr = "C")
}
identical(m1, f3(m))
#> [1] TRUE
identical(m1, f4(m))
#> [1] TRUE
microbenchmark::microbenchmark(f1 = f1(m),
f3 = f3(m),
f4 = f4(m))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1719 9.30715 11.12569 9.52300 11.92740 83.8518 100
#> f3 7.4330 7.59875 12.62412 7.69610 11.08815 84.8291 100
#> f4 8.9607 9.31115 14.01477 9.49415 11.44825 87.1577 100
I have a vector of threshold values, thresholds, and another vector, x. I'd like to create a new vector, say vec_sum, of the same length as thresholds, that stores, for each element of thresholds, the sum of values of x larger than this element.
What is the fastest way of doing this?
The naive way I'm doing it is
vec_sum <- rep(NA,length(thresholds))
for(i in seq_along(thresholds))
{
vec_sum[i] <- sum(x[x>thresholds[i]])
}
In case it helps, thresholds is already sorted.
Here is another solution using cumsum:
f1 <- function(v, th){
v2 <- v[order(v)]
v2s <- rev(cumsum(rev(v2)))
return(v2s[findInterval(th, v2) + 1])
}
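A small worked example may help to see why this works, together with one caveat. If a threshold is greater than or equal to max(v), findInterval returns length(v), so the index runs one past the end of v2s and the result is NA instead of 0. The variant below, with the made-up name f1_safe, appends a trailing 0 to cover that case; it is a sketch under the assumption that 0 is the desired answer when no value exceeds the threshold.

```r
f1_safe <- function(v, th) {
  v2  <- sort(v)                        # ascending order
  v2s <- c(rev(cumsum(rev(v2))), 0)     # v2s[k] = sum(v2[k:length(v2)]), then a trailing 0
  v2s[findInterval(th, v2) + 1]         # first position whose value is > th
}

# sorted v is 1 3 4, suffix sums are 8 7 4
res <- f1_safe(c(4, 1, 3), c(0, 2, 3, 5))
res
# [1] 8 7 4 0
```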
Here are some tests and comparison with the other answer (as well as the example data) by Ronak:
f2 <- function(x, thresholds){
if (all(x < thresholds[1])) return(rep(0, length(thresholds)))
if (all(x > thresholds[length(thresholds)])) return(rep(sum(x), length(thresholds)))
return(rev(cumsum(rev(tapply(x,
findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
}
test_th <- c(3, 5, 10)
test_x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20)
vec_sum <- rep(NA,length(test_th))
for(i in seq_along(test_th)) {
vec_sum[i] <- sum(test_x[test_x>test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] TRUE
set.seed(123)
test_x <- rnorm(10000)
test_th <- sort(rnorm(100)) ## f2 requires sorted threshold values
vec_sum <- rep(NA,length(test_th))
for(i in seq_along(test_th)) {
vec_sum[i] <- sum(test_x[test_x>test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] FALSE
# Warning message:
# In x - y : longer object length is not a multiple of shorter object length
library(microbenchmark)
microbenchmark(
a = f1(test_x, test_th),
b = f2(test_x, test_th)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# a 587.116 682.864 900.3572 694.713 703.726 10647.206 100
# b 1157.213 1203.063 1260.0663 1223.600 1258.552 2143.069 100
Not sure if this is any faster, but we can use findInterval to cut x by thresholds. We take sum of each group using tapply and take cumsum in reverse.
as.integer(rev(cumsum(rev(tapply(x,
findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
Tested on
thresholds <- c(3, 5, 10)
x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20) #1:20 in random order
vec_sum <- rep(NA,length(thresholds))
for(i in seq_along(thresholds)) {
vec_sum[i] <- sum(x[x>thresholds[i]])
}
vec_sum
#[1] 204 195 155
Using the proposed solution
as.integer(rev(cumsum(rev(tapply(x,
findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
#[1] 204 195 155
To explain the answer: findInterval returns, for each value of x, the group (interval between thresholds) it belongs to
findInterval(x, thresholds, left.open = TRUE)
#[1] 0 0 0 3 1 2 1 3 2 2 2 2 3 3 3 3 3 3 3 3
We use tapply to get sum of each group
tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)
# 0 1 2 3
# 6 9 40 155
The 0-group should be excluded, since its values are not larger than any of the thresholds (hence the -1). The result for the first threshold must include the sums of groups 1, 2 and 3, the result for the second threshold the sums of groups 2 and 3, and the result for the third threshold only group 3. So we reverse the sequence and take cumsum
cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1]))
# 3 2 1
#155 195 204
To get it in original order and to match it with threshold we reverse it again
rev(cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1])))
# 1 2 3
#204 195 155
Edge cases:
If all values of x are below the first threshold, or all values are above the last threshold, we need an extra check that returns the following.
if (all(x < thresholds[1])) rep(0, length(thresholds))
if (all(x > thresholds[length(thresholds)])) rep(sum(x), length(thresholds))
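Another case those two checks do not cover, sketched here as an assumption about what can go wrong rather than a confirmed diagnosis: tapply only returns groups that actually occur, so if no value of x falls between two neighbouring thresholds, the result is shorter than thresholds. One possible way around this is to turn the groups into a factor with all levels present; the function name sum_above is made up for this sketch.

```r
sum_above <- function(x, thresholds) {
  # keep every group 0..length(thresholds), even the empty ones
  g <- factor(findInterval(x, thresholds, left.open = TRUE),
              levels = 0:length(thresholds))
  s <- tapply(x, g, sum)
  s[is.na(s)] <- 0                        # empty groups contribute nothing
  as.numeric(rev(cumsum(rev(s[-1]))))     # drop group 0, reverse cumulative sum
}

# no value of x lies in (3, 5] or (5, 8]
res <- sum_above(c(1, 2, 10), c(0, 3, 5, 8))
res
# [1] 13 10 10 10
```

With the factor in place, the two special cases above also fall out of the same code path.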
I have a list of vectors that looks like
[[1]]
[1] 1 1 2
[[2]]
[1] 1 1 2
[[3]]
[1] 2 1 1
[[4]]
[1] 2 2 2
I would like to replace the first component of each of the vectors with a 9. I have tried
out <- append(vecs2T2[[1]], y, after=0)
but this just adds a 9 at the start and does not replace anything (see below).
[1] 9 1 1 2
I would like this entry to read 9 1 2.
lapply(ll, replace, 1, 9)
This goes vector by vector and replaces the 1st item with 9. (replace()'s arguments are (data, list-of-indexes, list-of-values), with the list of values recycled to be as long as the list of indexes.)
replace() is just defined as:
replace <- function (x, list, values) {
x[list] <- values
x
}
so you can also use that method.
lapply(ll, function(x) { x[1] <- 9 ; x })
You can use either with purrr::map(), too:
purrr::map(ll, ~{ .x[1] <- 9 ; .x })
purrr::map(ll, replace, 1, 9)
Head-to-head (not the best microbenchmark setup in the world tho):
microbenchmark::microbenchmark(
purr_repl = purrr::map(ll, replace, 1, 9),
purr_op = purrr::map(ll, ~{ .x[1] <- 9 ; .x }),
lapp_repl = lapply(ll, replace, 1, 9),
lapp_op = lapply(ll, function(x) { x[1] <- 9 ; x }),
Map = Map(function(x, y)c(x, y[-1]), 9, ll)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## purr_repl 27.510 29.7555 49.98242 31.4735 33.4805 1506.400 100
## purr_op 84.415 86.9550 125.07364 90.0665 98.9465 2423.406 100
## lapp_repl 4.422 4.8350 5.94472 5.1965 5.5930 34.947 100
## lapp_op 4.672 5.4250 19.14590 5.9045 6.5015 1215.477 100
## Map 10.670 12.2490 28.94712 13.5935 14.7170 1238.311 100
Another idea is to use Map and concatenate 9 with each vector minus its first element
Map(function(x, y)c(x, y[-1]), 9, l1)
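For a quick sanity check, the approaches can be run side by side on the list from the question. This is only a sketch; the object names by_replace, by_assign and by_map are illustrative.

```r
ll <- list(c(1, 1, 2), c(1, 1, 2), c(2, 1, 1), c(2, 2, 2))

by_replace <- lapply(ll, replace, 1, 9)
by_assign  <- lapply(ll, function(x) { x[1] <- 9; x })
by_map     <- Map(function(x, y) c(x, y[-1]), 9, ll)   # 9 is recycled over ll

same <- identical(by_replace, by_assign) &&
  identical(by_replace, unname(by_map))
same
by_replace[[1]]
# [1] 9 1 2
```

One difference worth keeping in mind: the Map version always puts 9 in front, so for a zero-length vector it returns c(9), just as replace() would extend the vector.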
I have a matrix with exactly 2 rows and n columns, for example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, with 1 in both rows) I need to find, to the left and to the right, the first instance of 2/0 or 0/2. In this example the one to the left is column 2 and the one to the right is column 14.
The elements of every row can only be 0, 1 or 2, nothing else. Is there a way to do that operation fast on large matrices (with 2 rows)? I need to do it 600k times, so speed might be a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
li <- ri <- cl
nc <- ncol(m)
repeat {
li <- li - 1
if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
l <- li
break
}
}
repeat {
ri <- ri + 1
if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
r <- ri
break
}
}
c(l, r)
})
and, after taking into account @Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than @Laterow, but anyhow, this is a similar approach
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring each row and adding the results. For a valid column (0/2 or 2/0) the sum is 4, and no other combination of 0, 1 and 2 gives 4. Then simply find the last valid column that is smaller than 9 (rev(which())[1]) and the first valid column that is larger than 9 (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
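To see why a sum of squares of 4 is enough to identify the 0/2 and 2/0 columns, one can enumerate all nine possible columns. This is just a small check, and the names g, top, bottom and ssq are made up for it.

```r
# every possible column of a 2-row matrix with entries 0, 1, 2
g <- expand.grid(top = 0:2, bottom = 0:2)
g$ssq <- g$top^2 + g$bottom^2

hit <- g[g$ssq == 4, ]   # only (2, 0) and (0, 2) remain
hit
```

A plain colSums(matr) == 2 would not be enough, because the column 1/1 also sums to 2; squaring is what separates the two cases.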
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to @MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i<-findInterval(9,loc);loc[c(i,i+1)]
# [1] 2 14
Note that findInterval is vectorized, should you care to specify multiple target columns.
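A minimal sketch of the vectorized use, with two assumptions that go beyond the lines above: a target column that is itself a 0/2 or 2/0 column should not be reported as its own left neighbour (hence left.open = TRUE for the left side), and NA is the desired result when there is no neighbour on one side. The function name nearest_cols is made up.

```r
a1 <- c(0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1, 1, 1, 0, 2)
a2 <- c(0, 2, 0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 2, 0)
loc <- which((a1 == 2 & a2 == 0) | (a1 == 0 & a2 == 2))   # 2 14 15

nearest_cols <- function(loc, targets) {
  li <- findInterval(targets, loc, left.open = TRUE)  # last loc strictly below target
  ri <- findInterval(targets, loc) + 1L               # first loc strictly above target
  li[li == 0L] <- NA                                  # nothing to the left
  cbind(target = targets, left = loc[li], right = loc[ri])
}

res <- nearest_cols(loc, c(1, 9, 14, 15))
res
```

Indexing loc past its end yields NA, which takes care of targets with nothing to their right.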
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.
I am looking for an efficient solution for the following problem:
b <- matrix(c(0,0,0,1,1,0), nrow = 2, byrow = T)
weight <- c(1,1)
times <- 5
abc <- do.call(rbind, replicate(times, b, simplify=FALSE))
weight <- rep.int(weight,times)
sum1 <- as.numeric(rep.int(NA,nrow(abc)))
##Rprof()
for(j in 1:nrow(abc)){
a <- abc[j,]
sum1[j] <- sum(weight[rowSums(t(a == t(abc)) + 0) == ncol(abc)])
}
##Rprof(NULL)
##summaryRprof()
Is there a faster way to do this? Rprof shows that rowSums(), t(), == and + are quite slow. With 20,000 rows it takes about 21 seconds.
Thanks for helping!
Edit: I have a matrix abc and a vector weight with length equal to nrow(abc). The first value of weight corresponds to the first row of abc, and so on. I would like to determine which rows of abc are equal, and then, for each row, store the sum of the weights of all rows that are identical to it.
Here is a way that looks valid and fast:
ff <- function(mat, weights)
{
rs <- apply(mat, 1, paste, collapse = ";")
unlist(lapply(unique(rs),
function(x)
sum(weights[match(rs, x, 0) > 0])))[match(rs, unique(rs))]
}
ff(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
And comparing with your function:
ffOP <- function(mat, weights)
{
sum1 <- as.numeric(rep.int(NA,nrow(mat)))
for(j in 1:nrow(mat)) {
a <- mat[j,]
sum1[j] <- sum(weights[rowSums(t(a == t(mat)) + 0) == ncol(mat)])
}
sum1
}
ffOP(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
library(microbenchmark)
m = do.call(rbind, replicate(1e3, matrix(0:11, 3, 4), simplify = F))
set.seed(101); w = runif(1e3*3)
all.equal(ffOP(m, w), ff(m, w))
#[1] TRUE
microbenchmark(ffOP(m, w), ff(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 969.83968 986.47941 996.68563 1015.53552 1051.23847 10
# ff(m, w) 20.42426 20.64002 21.36508 21.97182 22.59127 10
For the record, I also implemented your approach in C, and here are the benchmarks:
#> microbenchmark(ffOP(m, w), ff(m, w), ffC(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 957.66691 967.09429 991.35232 1000.53070 1016.74100 10
# ff(m, w) 20.60243 20.85578 21.70578 22.13434 23.04924 10
# ffC(m, w) 36.24618 36.40940 37.18927 37.39877 38.83358 10
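The paste-key idea behind ff can probably be written more compactly with ave(), which applies a function within groups and returns a result of the same length as its input. This is a hedged sketch, not benchmarked against the timings above; the name ff_ave is made up.

```r
ff_ave <- function(mat, weights) {
  key <- apply(mat, 1, paste, collapse = ";")   # one string per row
  ave(weights, key, FUN = sum)                  # group sum, repeated for every row
}

b   <- matrix(c(0, 0, 0, 1, 1, 0), nrow = 2, byrow = TRUE)
abc <- do.call(rbind, replicate(5, b, simplify = FALSE))

res1 <- ff_ave(abc, rep(1, 10))
res2 <- ff_ave(abc, as.numeric(1:10))   # odd rows sum to 25, even rows to 30
res1
res2
```

As with ff, this relies on the pasted strings being unique per distinct row, which holds as long as the separator cannot occur inside the values.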