Repeating calculation based on conditions - r

What I am trying to do is pretty simple. However, I am new to R, have not learned much about loops and functions, and am not sure of the most efficient way to get the results. Basically, I want to count the number of rows that meet my conditions and do a division. Here is an example:
df1 <- data.frame(
  Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
  B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
  C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
  D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)
df1
#        Main         B         C         D
# 1  0.008900        NA  0.018100 -0.012700
# 2 -0.050667  0.034500  0.000000 -0.025995
# 3 -0.030379 -0.068300 -0.056197 -0.042930
# 4  0.066484 -0.052774  0.040794  0.057816
# 5  0.006439  0.014661  0.035160  0.033458
# 6 -0.026076 -0.040537 -0.022662 -0.058382
My criterion for the numerator is to count the number of B/C/D values that are > 0 when Main is > 0; for the denominator, count the number of B/C/D values that are != 0 when Main is != 0. I can use length(which(df1$Main > 0 & df1$B > 0)) / length(which(df1$Main != 0 & df1$B != 0)) to get the ratio for each column individually. But my data set has many more columns, and I am wondering if there is a way to get all of those ratios at once, so that my result looks like:
# B C D
# 1 0.2 0.6 0.3

Use apply:
apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0)) / length(which(df1$Main !=0 & x !=0)))

You could do this vectorized (no apply or for loop needed):
tail(colSums(df1[df1$Main>0,]>0, na.rm = T) / colSums(df1[df1$Main!=0,]!=0, na.rm = T), -1)
# B C D
#0.2000000 0.6000000 0.3333333

One way to do this would be with a for loop that loops over the columns and applies the function that you wrote. Something like this:
ratio1 <- vector()
for(i in 2:ncol(df1)){
  ratio1[i-1] <- length(which(df1$Main > 0 & df1[,i] > 0)) / length(which(df1$Main != 0 & df1[,i] != 0))
}
Maybe there is a better way to do this with apply or data.table, but this is a simple solution I could come up with. It works on any number of columns. Use round() if you want the answer to one decimal place.
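Following up on that remark, here is a hedged data.table sketch (an addition, not one of the original answers; it assumes df1 as defined above): melt to long format, then compute both counts per column in a single grouped step.
library(data.table)
long <- melt(as.data.table(df1), id.vars = "Main", variable.name = "col")
# numerator counts value > 0 where Main > 0; denominator counts value != 0 where Main != 0
long[, .(ratio = sum(Main > 0 & value > 0, na.rm = TRUE) /
                 sum(Main != 0 & value != 0, na.rm = TRUE)), by = col]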

criteria1 <- df1[which(df1$Main > 0), -1] > 0
criteria2 <- df1[which(df1$Main != 0), -1] != 0
colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T)
## B C D
## 0.2000000 0.6000000 0.3333333
Edit: It appears Niek's method is quickest for this specific data
# Unit: microseconds
# expr min lq mean median uq max neval
# Jim(df1) 216.468 230.0585 255.3755 239.8920 263.6870 802.341 300
# emilliman5(df1) 120.109 135.5510 155.9018 142.4615 156.0135 1961.931 300
# Niek(df1) 97.118 107.6045 123.5204 111.1720 119.6155 1966.830 300
# nine89(df1) 211.683 222.6660 257.6510 232.2545 252.6570 2246.225 300
#[[1]]
# [,1] [,2] [,3] [,4]
#median 239.892 142.462 111.172 232.255
#ratio 1.000 0.594 0.463 0.968
#diff 0.000 -97.430 -128.720 -7.637
However, when there are many columns the vectorized approach is quicker.
Nrow <- 1000
Ncol <- 1000
mat <- matrix(runif(Nrow*Ncol),Nrow)
df1 <- data.frame(Main = sample(-2:2,Nrow,T), mat) #1001 columns
#Unit: milliseconds
# expr min lq mean median uq max
# Jim(df1) 46.75627 53.88500 66.93513 56.58143 62.04375 185.0460
#emilliman5(df1) 73.35257 91.87283 151.38991 178.53188 185.06860 292.5571
# Niek(df1) 68.17073 76.68351 89.51625 80.14190 86.45726 200.7119
# nine89(df1) 51.36117 56.79047 74.53088 60.07220 66.34270 191.8294
#[[1]]
# [,1] [,2] [,3] [,4]
#median 56.581 178.532 80.142 60.072
#ratio 1.000 3.155 1.416 1.062
#diff 0.000 121.950 23.560 3.491
The functions used in the benchmarks:
Jim <- function(df1){
  criteria1 <- df1[which(df1$Main > 0), -1] > 0
  criteria2 <- df1[which(df1$Main != 0), -1] != 0
  colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T)
}
emilliman5 <- function(df1){
  apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0)) / length(which(df1$Main !=0 & x !=0)))
}
Niek <- function(df1){
  ratio1 <- vector()
  for(i in 2:ncol(df1)){
    ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0)) / length(which(df1$Main !=0 & df1[,i] !=0))
  }
  ratio1
}
nine89 <- function(df){
  tail(colSums(df[df$Main>0,]>0, na.rm = T) / colSums(df[df$Main!=0,]!=0, na.rm = T), -1)
}

Related

For a dataset of 0's and 1's, set all but the first 1 in each row to 0's

I have a data.frame of 1,480 rows and 1,400 columns like:
1 2 3 4 5 6 ..... 1399 1400
1 0 0 0 1 0 0 ..... 1 0 #first occurrence would be at 4
2 0 0 0 0 0 1 ..... 0 1
3 1 0 0 1 0 0 ..... 0 0
## and etc
Each row contains a series of 0's and 1's - predominantly 0's. For each row, I want to find at which column the first 1 shows up and set the remaining values to 0's.
My current implementation can efficiently find the occurrence of the first 1, but I've only figured out how to zero out the remaining values iteratively by row. In repeated simulations, this iterative process is taking too long.
Here is the current implementation:
N <- length(df[which(df$arm == 0), "pt_id"]) # of patients
M <- max_days
#
# df is like the data frame shown above
#
df[which(df$arm == 0), 5:length(colnames(df))] <- unlist(lapply(matrix(data = rep(pbo_hr, M*N), nrow=N, ncol = M), rbinom, n=1, size = 1))
event_day_post_rand <- apply(df[,5:length(colnames(df))], MARGIN = 1, FUN = function(x) which (x>0)[1])
df <- add_column(df, "event_day_post_rand" = event_day_post_rand, .after = "arm_id")
##
## From here trial days start on column 6 for df
##
#zero out events that occurred after the first event, since each patient can only have 1 max event which will be taken as the earliest event
for (pt_id in df[which(!is.na(df$event_day_post_rand)), "pt_id"]){
  event_idx = df[which(df$pt_id == pt_id), "event_day_post_rand"]
  df[which(df$pt_id == pt_id), as.character(5+event_idx+1):"1400"] <- 0
}
We can do
mat <- as.matrix(df) ## data frame to matrix
j <- max.col(mat, ties.method = "first")
mat[] <- 0
mat[cbind(1:nrow(mat), j)] <- 1
df <- data.frame(mat) ## matrix to data frame
I also suggest just using a matrix to store these values. In addition, the result is naturally sparse, so I recommend:
library(Matrix)
sparseMatrix(i = 1:nrow(mat), j = j, x = rep(1, length(j)))
We can get a little more performance by setting to 0 the 1 elements whose rows are duplicated, i.e. every 1 after the first one in its row.
Since the OP is open to starting with a matrix rather than a data.frame, I'll do the same.
# dummy data
m <- matrix(sample(0:1, 1480L*1400L, TRUE, c(0.9, 0.1)), 1480L, 1400L)
# proposed solution
f1 <- function(m) {
  ones <- which(m == 1L)
  m[ones[duplicated((ones - 1L) %% nrow(m), nmax = nrow(m))]] <- 0L
  m
}
# Zheyuan Li's solution
f2 <- function(m) {
  j <- max.col(m, ties.method = "first")
  m[] <- 0L
  m[cbind(1:nrow(m), j)] <- 1L
  m
}
microbenchmark::microbenchmark(f1 = f1(m),
                               f2 = f2(m),
                               check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1457 11.45020 12.04258 11.9011 12.3529 37.6716 100
#> f2 12.8424 14.92955 17.31811 15.3251 16.0550 43.6314 100
Zheyuan Li's suggestion to go with a sparse matrix is a good idea.
# convert to a memory-efficient nsparseMatrix
library(Matrix)
m1 <- as(Matrix(f1(m), dimnames = list(NULL, NULL), sparse = TRUE), "nsparseMatrix")
object.size(m)
#> 8288216 bytes
object.size(m1)
#> 12864 bytes
# proposed function to go directly to a sparse matrix
f3 <- function(m) {
  n <- nrow(m)
  ones <- which(m == 1L) - 1L
  i <- ones %% n
  idx <- which(!duplicated(i, nmax = n))
  sparseMatrix(i[idx], ones[idx] %/% n, dims = dim(m), index1 = FALSE, repr = "C")
}
# going directly to a sparse matrix using Zheyuan Li's solution
f4 <- function(m) {
  sparseMatrix(1:nrow(m), max.col(m, ties.method = "first"), dims = dim(m), repr = "C")
}
identical(m1, f3(m))
#> [1] TRUE
identical(m1, f4(m))
#> [1] TRUE
microbenchmark::microbenchmark(f1 = f1(m),
f3 = f3(m),
f4 = f4(m))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> f1 9.1719 9.30715 11.12569 9.52300 11.92740 83.8518 100
#> f3 7.4330 7.59875 12.62412 7.69610 11.08815 84.8291 100
#> f4 8.9607 9.31115 14.01477 9.49415 11.44825 87.1577 100

Does Column of Sentences Contain Word in Another Column of Sentences?

I have two large tables, each containing a "sentence" column with a string of words. For each sentence in one column, I want a true/false output indicating whether it shares a word with each sentence in the other column. My tables are very large and the code below can take a very long time. Is there a faster way to go about doing this?
Thank you!
# Determine if any "words" in either column of sentences match.
# Packages
library(tidyverse)
# Help functions
helper_in_2 <- function(b, a){
return(any(b %in% a))
}
helper_in <- function(a, b){
return(lapply(b, helper_in_2, a))
}
# Sample columns
sentence_col_a <- c("This is an example sentence.", "Here is another sample sentence?", "One more sentence that is not complicated.", "Last sentence to show an example!")
sentence_col_b <- c("Short string A.", "Another longer string.", "Final string example!")
# Extract words from each column
list_col_a <- str_to_lower(sentence_col_a) %>%
str_extract_all("[:alpha:]+")
list_col_b <- str_to_lower(sentence_col_b) %>%
str_extract_all("[:alpha:]+")
# Check for matches.
# (Code after first line isn't actually used in my code - it's just to show matches)
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
Output:
Sentences                                   Short string A. Another longer string. Final string example!
This is an example sentence.                              0                      0                      1
Here is another sample sentence?                          0                      1                      0
One more sentence that is not complicated.                0                      0                      0
Last sentence to show an example!                         0                      0                      1
Update after Ronak's Answer
library(microbenchmark)
microbenchmark("Original method:" = sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric),
"Ronak's method:" = sapply(list_col_a, function(x) as.integer(grepl(sprintf('\\b(%s)\\b', paste0(x, collapse = '|')), list_col_b))))
#Unit: microseconds
# expr min lq mean median uq max neval
#Original method: 72.9 76.65 88.082 82.35 86.1 173.9 100
# Ronak's method: 262.1 277.40 354.741 286.40 348.6 3724.3 100
Here I can provide several options, but the nested for-loop method might be the most efficient one so far:
outer
TIC1 <- function() {
  +outer(list_col_a, list_col_b, FUN = Vectorize(function(x, y) any(x %in% y)))
}
nested sapply
TIC2 <- function() {
  sapply(
    list_col_b,
    function(x) {
      sapply(
        list_col_a,
        function(y) sum(y %in% x)
      )
    }
  )
}
nested for loops
TIC3 <- function() {
  res <- matrix(nrow = length(list_col_a), ncol = length(list_col_b))
  for (a in seq_along(list_col_a)) {
    for (b in seq_along(list_col_b)) {
      res[a, b] <- any(list_col_a[[a]] %in% list_col_b[[b]])
    }
  }
  +res
}
Benchmarking
# Original solution
original <- function() {
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
}
# Waldi's data.table solution
Waldi <- function() {
la <- data.table(id = 1:length(list_col_a), list_col_a)
lb <- data.table(id = 1:length(list_col_b), list_col_b)
la_long <- la[, .(words = unlist(list_col_a)), by = id]
lb_long <- lb[, .(words = unlist(list_col_b)), by = id]
unique(la_long[lb_long, on = .(words = words)][!is.na(id), .(idxa = id, idxb = i.id)])
}
mustafaakben1 <- function(rows = list_col_a, cols = list_col_b) {
  to_matrix <- function(X_t) {
    matrix(unlist(X_t),
           nrow = length(list_col_a),
           ncol = length(list_col_b))
  }
  to_matrix(lapply(
    1:length(cols),
    FUN = function(X) {
      lapply(
        X = 1:length(rows),
        FUN = function(Y) {
          sum(rows[[Y]] %in% cols[[X]])
        }
      )
    }
  ))
}
library(fastmatch)
mustafaakben2 <- function() {
search_keywords <- unlist(list_col_b)[unlist(list_col_b) %in% unlist(list_col_a)]
b_col_filter <- which(unlist(lapply(list_col_b, function(X) any(X %in% search_keywords))))
a_row_filter <- which(unlist(lapply(list_col_a, function(X) any(X %in% search_keywords))))
res <- matrix(0,
nrow = length(list_col_a),
ncol = length(list_col_b)
)
for (a in a_row_filter) {
for (b in b_col_filter) {
res[a, b] <- any(list_col_a[[a]] %fin% list_col_b[[b]])
}
}
+res
}
# ThomasIsCoding's outer solution
TIC1 <- function() {
+outer(list_col_a, list_col_b, FUN = Vectorize(function(x, y) any(x %in% y)))
}
TIC2 <- function() {
sapply(
list_col_b,
function(x) {
sapply(
list_col_a,
function(y) sum(y %in% x)
)
}
)
}
TIC3 <- function() {
res <- matrix(nrow = length(list_col_a), ncol = length(list_col_b))
for (a in seq_along(list_col_a)) {
for (b in seq_along(list_col_b)) {
res[a, b] <- any(list_col_a[[a]] %in% list_col_b[[b]])
}
}
+res
}
microbenchmark::microbenchmark(
original(),
Waldi(),
mustafaakben1(),
mustafaakben2(),
TIC1(),
TIC2(),
TIC3(),
unit = "relative"
)
and you will see
Unit: relative
            expr        min         lq      mean     median         uq      max neval
      original() 172.895884 149.346066 49.841448 142.676077 134.111459 3.130206   100
         Waldi() 107.441122  92.004380 30.290680  88.474026  83.690267 1.971249   100
 mustafaakben1()   1.596981   1.551978  1.646884   1.610160   1.553021 1.683034   100
 mustafaakben2()   1.635812   1.731991  2.186106   1.912535   1.831179 2.332797   100
          TIC1()   3.854043   3.845866  1.977066   3.943445   3.707308 1.041416   100
          TIC2()   2.888118   2.627955  1.607401   2.719427   2.538536 1.142211   100
          TIC3()   1.000000   1.000000  1.000000   1.000000   1.000000 1.000000   100
Try nested lapply
to_matrix <- function(X_t){
  matrix(unlist(X_t),
         nrow = length(list_col_a),
         ncol = length(list_col_b))
}
nested_lappy <- function(rows = list_col_a, cols = list_col_b) {
  to_matrix(lapply(
    1:length(cols),
    FUN = function(X)
      lapply(
        X = 1:length(rows),
        FUN = function(Y)
          sum(rows[[Y]] %in% cols[[X]])
      )
  ))
}
> nested_lappy()
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 0 0
[4,] 0 0 1
Your result matrix will probably be huge too, so it would be better to store it as a sparse matrix; the Matrix package can help you carry out the analysis in a more memory-efficient way.
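As a minimal sketch of that suggestion (added here, not part of the original answer; it assumes the dense 0/1 result returned by nested_lappy() above):
library(Matrix)
res <- nested_lappy()                     # dense 0/1 matrix from above
res_sparse <- Matrix(res, sparse = TRUE)  # keeps only the non-zero matches
object.size(res_sparse)                   # smaller than object.size(res) for large, mostly-zero results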
Here is the benchmark
microbenchmark::microbenchmark(
original(),
Waldi(),
TIC(),
nested_lappy(),
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval
original() 99.97881 89.869163 83.011249 67.88434 69.883301 260.704657 100
Waldi() 56.55076 51.185905 45.436361 39.35327 42.730942 46.438114 100
TIC() 2.27000 2.249311 1.986625 1.84108 1.837013 3.974149 100
nested_lappy() 1.00000 1.000000 1.000000 1.00000 1.000000 1.000000 100
Edits
I will cheat here a little bit because #ThomasIsCoding is an amazing coder. I need to cheat :)
So, because you have a huge table, you need to focus on an efficient way to search your keyword space. As you may notice, not all keywords are shared between the two columns, nor are they distributed uniformly across the sentences. So, even before starting to search, you can eliminate from the search space the sentences that cannot match. By doing so, we focus only on the words shared by both the column and the row dimensions.
search_keywords<- unlist(list_col_b)[unlist(list_col_b) %in% unlist(list_col_a)]
b_col_filter <- which(unlist(lapply(list_col_b, function(X) any(X %in% search_keywords))))
a_row_filter <- which(unlist(lapply(list_col_a, function(X) any(X %in% search_keywords))))
Then, use the fastmatch package to make the %in% faster.
library(fastmatch)
mustafaakben2 <- function() {
  res <- matrix(0,
                nrow = length(list_col_a),
                ncol = length(list_col_b))
  for (a in a_row_filter) {
    for (b in b_col_filter) {
      res[a, b] <- any(list_col_a[[a]] %fin% list_col_b[[b]])
    }
  }
  +res
}
> mustafaakben2()
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 0 0
[4,] 0 0 1
Benchmark results
microbenchmark::microbenchmark(
original(),
Waldi(),
TIC1(),
TIC2(),
TIC3(),
mustafaakben(),
mustafaakben2(),
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
original() 288.620155 254.429012 193.446439 190.457965 171.914286 115.0415822 100 c
Waldi() 182.751938 153.864198 115.182908 115.778761 103.518095 36.9411765 100 b
TIC1() 6.581395 6.277778 5.074523 5.066372 4.685714 2.3732252 100 a
TIC2() 4.705426 4.385802 3.503269 3.466814 3.281905 1.5811359 100 a
TIC3() 1.767442 1.685185 1.360847 1.338496 1.249524 0.7728195 100 a
mustafaakben() 2.589147 2.330247 1.944260 2.017699 1.864762 0.7322515 100 a
mustafaakben2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
You could use a data.table join to get the id of the matching sentences.
library(data.table)
# Original solution
original <- function(){
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
}
# data.table solution
new <- function(){
la <- data.table(id = 1:length(list_col_a),list_col_a)
lb <- data.table(id = 1:length(list_col_b),list_col_b)
la_long <- la[,.(words=unlist(list_col_a)),by= id]
lb_long <- lb[,.(words=unlist(list_col_b)),by= id]
unique(la_long[lb_long, on=.(words=words)][!is.na(id),.(idxa=id, idxb = i.id)])
}
new()
idxa idxb
1: 2 2
2: 1 3
3: 4 3
microbenchmark::microbenchmark(original(),new())
Unit: milliseconds
expr min lq mean median uq max neval cld
original() 4.1623 5.1190 5.857155 5.5528 6.18345 23.5442 100 b
new() 2.2492 2.7993 3.255741 3.1298 3.68645 5.1872 100 a
As data.table allows indexing, this could be much more efficient for a larger number of sentences/words; it would need to be tested on a bigger dataset.
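A hedged sketch of how such a test could be set up (synthetic data, not from the original answer; the vocabulary size and list lengths are arbitrary, and this overwrites the small example lists):
set.seed(1)
vocab <- replicate(5000, paste(sample(letters, 7, TRUE), collapse = ""))
list_col_a <- replicate(1000, sample(vocab, 8), simplify = FALSE)  # 1000 "sentences" of 8 words each
list_col_b <- replicate(1000, sample(vocab, 8), simplify = FALSE)
microbenchmark::microbenchmark(new(), times = 10)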
With the help of regular expressions, you can do this with one sapply call. We create a pattern with each value in list_col_a and check if any of it exists in list_col_b.
sapply(list_col_a, function(x) as.integer(grepl(sprintf('\\b(%s)\\b',
paste0(x, collapse = '|')), list_col_b)))
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 0 1 0 0
#[3,] 1 0 0 1
You can include your remaining code as it is to get the matches.

Fill all entries between two specified values

I have a long vector, thousands of entries, which has elements 0, 1, 2 in it sporadically. 0 means "no signal", 1 means "signal on", and 2 means "signal off". I am trying to find the runs from 1 to the next occurrence of 2 and fill the space with 1s. I also need to do the same thing between a 2 and the next occurrence of 1 but fill the space with 0s.
I currently have a solution for this issue using loops but it's slow and incredibly inefficient:
example vector:
exp = c(1,1,1,0,0,1,2,0,2,0,1,0,2)
desired result:
1,1,1,1,1,1,2,0,0,0,1,1,2
Thank you
You could use rle & shift from the data.table-package in the following way:
library(data.table)
# the example vector from the question
x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)
# create the run-length object
rl <- rle(x)
# create indexes of the spots in the run-length object that need to be replaced
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
# replace these values
rl$values[idx1] <- 1
rl$values[idx0] <- 0
Now you will get the desired result by using inverse.rle:
> inverse.rle(rl)
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2
As an alternative to the shift function, you could also use the lag and lead functions from dplyr.
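For illustration, a hedged sketch of that substitution (added; it assumes the run-length object rl from above, with dplyr's default argument playing the role of shift's fill):
library(dplyr)
idx1 <- rl$values == 0 & lag(rl$values, default = 0) == 1 & lead(rl$values, default = 0) %in% 1:2
idx0 <- rl$values == 2 & lag(rl$values, default = 0) == 0 & lead(rl$values, default = 2) %in% 0:1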
If you want to assess the speed of both approaches, the microbenchmark-package is a useful tool. Below you'll find 3 benchmarks, each for a different vector size:
# create functions for both approaches
jaap <- function(x) {
  rl <- rle(x)
  idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
  idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
  rl$values[idx1] <- 1
  rl$values[idx0] <- 0
  inverse.rle(rl)
}
john <- function(x) {
  Reduce(f, x, 0, accumulate = TRUE)[-1]
}
Execute the benchmarks:
# benchmark on the original data
> microbenchmark(jaap(x), john(x), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x) 58.766 61.2355 67.99861 63.8755 72.147 143.841 100 b
john(x) 13.684 14.3175 18.71585 15.7580 23.902 50.705 100 a
# benchmark on a somewhat larger vector
> x2 <- rep(x, 10)
> microbenchmark(jaap(x2), john(x2), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x2) 69.778 72.802 84.46945 76.9675 87.3015 184.666 100 a
john(x2) 116.858 121.058 127.64275 126.1615 130.4515 223.303 100 b
# benchmark on a very larger vector
> x3 <- rep(x, 1e6)
> microbenchmark(jaap(x3), john(x3), times = 20)
Unit: seconds
expr min lq mean median uq max neval cld
jaap(x3) 1.30326 1.337878 1.389187 1.391279 1.425186 1.556887 20 a
john(x3) 10.51349 10.616632 10.689535 10.670808 10.761191 10.918953 20 b
From this you can conclude that the rle-approach has an advantage when applied to vectors that are larger than 100 elements (which is probably nearly always).
You could also use Reduce with the following function:
f <- function(x, y){
  if(x == 1){
    if(y == 2) 2 else 1
  } else {
    if(y == 1) 1 else 0
  }
}
Then:
> x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)
> Reduce(f, x, 0, accumulate = TRUE)[-1]
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2

Sum of subset based on second vector

I got two vectors:
a <- c(1,1,2,3,4,4,4,4,5,6)
b <- c(T,F,T,F,T,T,F,F,F,T)
I would like to have a vector that tells me how many TRUEs there are in b for each unique value in a (the second column in the desired output below).
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 3 0
[4,] 4 2
[5,] 5 0
[6,] 6 1
The best I can come up with here is using sapply:
sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
This is fine, but for larger vectors it is rather slow. (I tried some subset variants.)
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
microbenchmark::microbenchmark(
subset = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
, iN = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a %in% uniqueA & b), a = a, b = b)
, equal = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a == uniqueA & b), a = a, b = b)
, times = 5
)
Unit: milliseconds
expr min lq mean median uq max neval
subset 389.1995 390.6002 413.6969 393.0396 445.6553 449.9897 5
iN 2746.8407 2798.0462 2797.3155 2806.9477 2814.6317 2820.1110 5
equal 1080.3430 1089.2507 1111.0267 1096.8082 1135.1957 1153.5358 5
Does anyone have an idea how to do this faster?
You could use aggregate:
aggregate(b, list(a), sum)
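For the question's small a and b, that call returns a two-column data frame (shown here for illustration):
#   Group.1 x
# 1       1 1
# 2       2 1
# 3       3 0
# 4       4 2
# 5       5 0
# 6       6 1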
For the fastest performance, I'd suggest a data.table. It will take longer to set up, but the performance should be quite good for larger amounts of data.
library(data.table)
dt <- data.table(a = a, b = b)
dt[,sum(b), by = a]
Speed test comparing (1) aggregate, (2) sapply, (3) data.table, (4) tapply:
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
summarize_dt <- function(x) {
dt <- data.table(a = a, b = b)
dt[,sum(b), by = a]
}
microbenchmark::microbenchmark(
aggregate = aggregate(b, list(a), sum),
sapply = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b),
datatable = summarize_dt(),
tapply = tapply(b, a, sum)
)
#expr min lq mean median uq max neval
#aggregate 130.995347 133.672041 141.404597 135.301762 137.199151 213.730345 100
#sapply 335.344866 357.387474 394.432339 411.994214 425.604144 486.548520 100
#datatable 1.540011 1.914712 2.430220 2.027578 2.239999 5.297593 100
#tapply 3.075646 3.627395 4.719595 4.089434 5.934675 8.758332 100
Looks like data.table is the fastest by a lot
Here is one option using table in base R:
t <- table(a[b])
z <- as.numeric(names(t))
rbind(unname(cbind(z, t)), cbind(setdiff(unique(a),z),0))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 1
# [3,] 4 2
# [4,] 6 1
# [5,] 3 0
# [6,] 5 0
If you only need the values of a with a non-zero number of TRUEs, table(a[b]) alone would suffice.
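For illustration, with the small a and b from the question that shorter call gives:
table(a[b])
# 1 2 4 6
# 1 1 2 1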
Or we can use tidyverse
library(tidyverse)
tibble(a, b) %>%
group_by(a) %>%
summarise(b = sum(b))
A base R option would be
rowsum(+b, a)
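A usage note added for clarity (again with the question's small a and b): the unary + coerces the logical b to 0/1, so rowsum() adds up the TRUEs within each group of a and returns a one-column matrix whose row names are the unique values of a.
rowsum(+b, a)
#   [,1]
# 1    1
# 2    1
# 3    0
# 4    2
# 5    0
# 6    1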

Find elements in vector in R

I have a matrix with exactly 2 rows and n columns. Example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, which has 1 in both rows) I need to find, to the left and to the right, the first column containing 2/0 or 0/2 - in this example column 2 on the left and column 14 on the right.
The elements of every row can only be 0, 1, or 2 - nothing else. Is there a way to do that operation fast on large matrices (with 2 rows)? I need to do it 600k times, so speed is a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
  li <- ri <- cl
  nc <- ncol(m)
  repeat {
    li <- li - 1
    if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
      l <- li
      break
    }
  }
  repeat {
    ri <- ri + 1
    if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
      r <- ri
      break
    }
  }
  c(l, r)
})
and, after taking into account #Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than #Laterow, but anyhow, this is a similar approach
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the rows and adding them: the columns holding a 2/0 or 0/2 pair are exactly those where this sum is 4. Then simply find the last such column before column 9 (rev(which())[1]) and the first such column after column 9 (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to #MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i<-findInterval(9,loc);loc[c(i,i+1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
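A hedged sketch of that vectorized use (added for illustration; it assumes the loc vector computed above, whose qualifying columns are 2, 14 and 15, and picks arbitrary target columns):
targets <- c(5, 9, 12)
i <- findInterval(targets, loc)
cbind(target = targets, left = loc[i], right = loc[i + 1])
#      target left right
# [1,]      5    2    14
# [2,]      9    2    14
# [3,]     12    2    14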
That is an interesting question. Here's how I would address it.
First, a vector is defined that contains the product of the two row entries in each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.
