Speeding up a function: checking NA count before computing mean - r

The function below calculates the mean of a vector. However, it first checks the proportion of NA's present in the vector and, if that proportion is above a given threshold, returns NA instead of the mean.
My issue is that my current implementation is rather inefficient. It takes more than 7x longer than simply running mean(vec, na.rm=TRUE).
I tried an alternate method using na.omit, but that is even slower.
Given the size of my data, executing the single lapply call is taking over 40 minutes.
Any suggestions on how to accomplish the same task more quickly?
UPDATE - RE: @thelatemail's solution and @Arun's comment:
I am executing this function over several hundred groups, each of varying size. The sample data originally provided in this question was presented as a neat data frame simply for ease of creating artificial data.
Alternate sample data, to avoid the confusion:
# Sample Data
# ------------
set.seed(1)
# slightly different sizes for each group
N1 <- 5e3
N2 <- N1 + as.integer(rnorm(1, 0, 100))
# One group has only a moderate amount of NA's
SAMP1 <- rnorm(N1)
SAMP1[sample(N1, .25 * N1, FALSE)] <- NA # add in NA's
# Another group has many NA's
SAMP2 <- rnorm(N2)
SAMP2[sample(N2, .95 * N2, FALSE)] <- NA # add in large number of NA's
# put them all in a list
SAMP.NEW <- list(SAMP1, SAMP2)
# keep it clean
rm(SAMP1, SAMP2)
# Execute
# -------
lapply(SAMP.NEW, meanIfThresh)
Original sample data, function, etc.
# Sample Data
# ------------
set.seed(1)
rows <- 20000 # actual data has more than 7M rows
cols <- 1000
SAMP <- replicate(cols, rnorm(rows))
SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA # add in NA's
# Select 5 random rows, and have them be 90% NA
tooSparse <- sample(rows, 5)
for (r in tooSparse)
SAMP[r, sample(cols, cols * .9, FALSE)] <- NA
# Function
# ------------
meanIfThresh <- function(vec, thresh=12/15) {
  # Calculates the mean of vec; however, if the proportion of NA's in vec
  # is greater than thresh, returns NA instead.
  # thresh : the maximum allowed proportion of NA's.
  #   Example: if I'm looking at 14 days and I have 12 NA's,
  #   my proportion is 85.7% = (12 / 14).
  #   The default thresh is 80.0% = (12 / 15), so 12 NA's in a group
  #   of 14 would be rejected.
  len <- length(vec)
  # if the proportion of NA's is greater than the threshold, return NA
  if( (sum(is.na(vec)) / len) > thresh)
    return(NA_real_)
  # else, calculate the mean, removing NA's
  return(mean(vec, na.rm=TRUE))
}
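As a quick sanity check (my own example, reusing the 14-day scenario from the comments), the cutoff behaves as described:
# 12 NA's out of 14 values: proportion 85.7% > 80%, so NA is returned
meanIfThresh(c(rnorm(2), rep(NA_real_, 12)))
# 11 NA's out of 14 values: proportion 78.6% <= 80%, so the mean is returned
meanIfThresh(c(rnorm(3), rep(NA_real_, 11)))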
# Execute
# -----------------
apply(SAMP, 1, meanIfThresh)
# Compare with `mean`
#----------------
plain <- apply(SAMP, 1, mean, na.rm=TRUE)
modified <- apply(SAMP, 1, meanIfThresh)
# obviously different
identical(plain, modified)
plain[tooSparse]
modified[tooSparse]
library(microbenchmark)
microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
, "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
, times = 15L)
# With the actual data, the penalty is sevenfold
# Unit: seconds
# expr min lq median uq max neval
# meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871 15
# mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450 15

Couldn't you just replace the high-NA rows' mean values afterwards, like so?
# previously: result <- apply(SAMP, 1, mean, na.rm=TRUE)
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA
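The same two-pass idea should carry over to the ragged list from the update; a minimal sketch, assuming the groups are stored as a list of numeric vectors as in SAMP.NEW above (meanIfThreshList is a hypothetical helper name, not something from this answer):
meanIfThreshList <- function(lst, thresh = 0.8) {
  means  <- vapply(lst, mean, numeric(1), na.rm = TRUE)          # pass 1: means ignoring NA's
  naProp <- vapply(lst, function(v) mean(is.na(v)), numeric(1))  # pass 2: NA proportion per group
  means[naProp > thresh] <- NA_real_                             # blank out the too-sparse groups
  means
}
meanIfThreshList(SAMP.NEW)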
Some benchmarking:
Ricardo <- function(vec, thresh=12/15) {
len <- length(vec)
if( (sum(is.na(vec)) / len) > thresh)
return(NA_real_)
return(mean(vec, na.rm=TRUE))
}
DanielFischer <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
thelatemail <- function(mat) {
result <- rowMeans(mat, na.rm=TRUE)
NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
result[NArows] <- NA
result
}
require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo),
m2 <- apply(SAMP, 1, DanielFischer),
m3 <- thelatemail(SAMP), times = 5L)
Unit: milliseconds
expr min lq median uq max neval
m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283 5
m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015 5
m3 <- thelatemail(SAMP) 337.1862 340.6339 371.6148 376.5517 383.4436 5
all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE

Do you really have to go through your vector vec twice in your function? If you store the NA positions first, it could speed up your calculation a bit:
meanIfThresh2 <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
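A quick check (my own) that the single-pass version agrees with the original on a small vector:
v <- c(1, NA, 3, NA, 5)
meanIfThresh(v)   # 3
meanIfThresh2(v)  # 3
identical(meanIfThresh(v), meanIfThresh2(v))  # TRUE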
EDIT: I performed similar benchmarking to see the effect of this change:
> microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
+ , "meanIfThresh2" = apply(SAMP, 1, meanIfThresh2)
+ , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+ , times = 15L)
Unit: seconds
expr min lq median uq max neval
meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493 15
meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028 15
mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495 15

Related

How to speed up loop in loop calculations in R

Simply put, I have 378,742 observations (each observation has a launch and a deadline date) and I want to check the overlap of the duration of each observation against all other (378,741) observations and sum them up.
I am running the following code, which takes forever (my estimate is 205 days) because of the nested loop. Is there a way to speed up the calculations? (I use the DescTools package for the Overlap function.)
a <- c(1:378742)
for (i in 1:378742) {
mydata$competition[i] <- sum(a, na.rm = T)
for (j in 1:378742) {
a[j] <- Overlap(c(mydata$Launched[i], mydata$Deadline[i]), c(mydata$Launched[j], mydata$Deadline[j]))
}
}
You can save significant time by vectorizing your inner loop (I then use apply() for the outer loop):
# We'll need both DescTools and microbenchmark
library(DescTools)
library(microbenchmark)
# Make example data
set.seed(123) # setting seed for reproducibility
n <- 10
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
(mat <- cbind(x, y))
#> x y
#> [1,] 15222 17667
#> [2,] 17050 15827
#> [3,] 15665 16645
#> [4,] 17395 16262
#> [5,] 17603 14547
#> [6,] 14338 17454
#> [7,] 16098 15069
#> [8,] 17425 14325
#> [9,] 16181 15367
#> [10,] 15835 17650
# First get the answer using nested loops
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Notice I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
}
z
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
apply(mat, 1, function(r) sum(Overlap(r, mat)))
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Notice I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#>   expr       min        lq      mean    median        uq       max neval cld
#>  apply  7.538967  7.688929  7.894379  7.767989  7.891177  13.57523   100  a
#>   loop 76.051011 77.203810 80.045325 78.158369 79.206538 114.68139   100   b
Created on 2018-10-20 by the reprex package (v0.2.1)
Now let's try to get a sense of how it scales with (slightly) bigger example data (if the data gets too big the benchmarks take forever):
#
n <- 100
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
mat <- cbind(x, y)
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
z[i] <- sum(a, na.rm = T)
}
# In case you're concerned it still works:
all.equal(z, apply(mat, 1, function(r) sum(Overlap(r, mat))))
#> [1] TRUE
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
# Notice I've moved this sum to the bottom,
# so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#>   expr       min        lq      mean    median        uq       max neval cld
#>  apply  258.1151  262.8007  269.8172  265.9643  276.8799  296.2167   100  a
#>   loop 5806.9834 5841.3362 5890.4988 5863.7317 5884.2308 6222.1670   100   b
Created on 2018-10-20 by the reprex package (v0.2.1)
In bioinformatics, we use the GenomicRanges package for finding overlapping ranges.
I once calculated, using my usual for loops and lapply functions, that my computer would need 5 days. But then I found the GenomicRanges package - and it did the job in seconds!
(Shame on me, I still don't know exactly how it works ... it has to do with ordered tree structures and efficient intersection ... and perhaps partly C++ code as well.)
The result, anyway, is that it is lightning fast.
You will be amazed!
GenomicRanges package for lightning fast range calculations
############################
# Install GenomicRanges package
############################
# Bioconductor recently introduced `BiocManager` as its installer.
# Bioconductor is the main code repository for bioinformaticians;
# it is a kind of `CRAN` for bioinformaticians programming with R.
install.packages("BiocManager")
require(BiocManager)
BiocManager::install("GenomicRanges")
# In older systems, you have to do:
install.packages("BiocInstaller")
require(BiocInstaller)
biocLite("GenomicRanges")
############################
# Load the GenomicRanges package
############################
require(GenomicRanges)
############################
# create dates as positive intervals
############################
set.seed(123) # for reproducibility of random stuff
n <- 1000 # later: 378742
x <- sample(seq(as.Date("2008/10/20"), as.Date("2038/10/20"), "day"), replace=TRUE, n)
# y <- sapply(x, function(date) date + sample(1:1000, 1)) # too slow!
deltas <- sample(1:10000, replace=TRUE, n) # immediate, whereas the `sapply` version above takes very long
y <- x + deltas
df <- data.frame(seqnames="1", start=x, end=y)
gr <- GRanges(df)
gr <- sort(gr)
############################
# Be careful, GRanges obj is 1-based system and not 0-based!
############################
# each row is one range - gr behaves like a vector when indexing
gr[5] # selects fifth row
gr[4:7] # selects 4th to 7th row
############################
# which range overlaps with which range?
############################
system.time({hits <- findOverlaps(gr, gr)})
# system.time({ your-R-expression }) - very convenient speed measuring!
# the numbers in the result are row indices into the query and subject
# tables - which in this case are the same table, gr
############################
# what is the amount of overlap?
############################
overlaps <- pintersect(gr[queryHits(hits)], gr[subjectHits(hits)])
amount.overlaps <- width(overlaps) - 1 # -1 because widths in a 1-based closed-interval system are end - start + 1
# 1-base versus 0-based coordinate systems: https://www.biostars.org/p/84686/
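To get back to the original goal (one summed overlap per observation), the per-hit widths can then be aggregated; a minimal sketch of that last step (my own addition; note that after sort(gr) the indices refer to the sorted ranges, not the original row order):
# sum the overlap amounts by query index (self-hits are included,
# just as i == j was included in the original nested loop)
competition <- tapply(amount.overlaps, queryHits(hits), sum)
head(competition)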

Combining the data of randomly selected participants with dplyr

I have the following data frame 'df'.
Each participant (here 10 participants) saw several stimuli (here 100), and made
a judgment about each one (here a random number). For each stimulus, I know the true
answer (here a random number; a different number for each stimulus, but always
the same answer for all participants).
library(dplyr) # needed for %>% and the dplyr verbs used below
participant <- rep(1:10, each=100)
stimuli <- rep(1:100, 10)
judgment <- rnorm(1000)
df1 <- data.frame(participant, stimuli, judgment)
df2 <- data.frame(stimuli=1:100, criterion=rnorm(100))
df <- merge(df1, df2, by='stimuli') %>% arrange(participant, stimuli)
Here is what I am trying to do:
1) Take n randomly selected participants (here n is between 1 and 10).
2) Compute the mean of their judgments per stimulus.
3) Compute the correlation between this mean and the true answer.
I want to perform steps 1-3 for every n (that is, take 1 randomly selected participant and perform steps 1-3, then take 2 randomly selected participants and perform steps 1-3, ... up to 10 randomly selected participants).
The result should be a data frame with 10 rows and 2 variables: N and the correlation. I want to work only with dplyr.
My solution is based on lapply. Here it is:
participants_id = unique (df$participant)
MyFun = function(Data) {
HelpFun = function(x, Data) {
# x is the index for the number of participants.
# It will be used in the lapply call below
participants_x = sample(participants_id, x)
filter(Data, participant %in% participants_x) %>%
group_by(stimuli) %>%
summarise( mean_x = mean(judgment),
criterion = unique(criterion) ) %>%
summarise(cor = cor(.$mean_x, .$criterion))
}
N <- length(unique(Data$participant))
lapply(1:N, HelpFun, Data) %>% bind_rows()
}
MyFun(df)
The problem is that this code is slow. Since every selection is random, I perform all of this 10,000 times, and that is slow. On my machine (Windows 10, 16 GB), 1,000 simulations take 2 minutes; 10,000 simulations take 20 minutes. (I also tried loops, but it did not help, although for some reason they were a little bit faster.) There has to be a faster solution; after all, the computations are not that complicated.
Below I run only 100 simulations, so as not to tie up your computer.
system.time(replicate(100, MyFun(df), simplify = FALSE ) %>% bind_rows())
Any ideas about making all of this faster?
Using data.table and for loops we can get a solution that is about 10 times faster.
My function:
minem <- function(n) { # n - simulation count
require(data.table)
participants_id <- unique(df$participant)
N <- length(unique(df$participant))
dt <- as.data.table(df)
setkey(dt, stimuli)
L <- list()
for (j in 1:n) {
corss <- rep(0, N)
for (i in 1:N) {
participants_x <- sample(participants_id, i)
xx <- dt[participant %in% participants_x,
.(mean_x = mean(judgment),
criterion = first(criterion)),
by = stimuli]
corss[i] <- cor(xx$mean_x, xx$criterion)
}
L[[j]] <- corss
}
unlist(L)
}
head(minem(10))
# [1] 0.13642499 -0.02078109 -0.14418400 0.04966805 -0.09108837 -0.15403185
Your function:
Meir <- function(n) {
replicate(n, MyFun(df), simplify = FALSE) %>% bind_rows()
}
Benchmarks:
microbenchmark::microbenchmark(
Meir(10),
minem(10),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Meir(10) 1897.6909 1956.3427 1986.5768 1973.5594 2043.4337 2048.5809 10 b
# minem(10) 193.5403 196.0426 201.4132 202.1085 204.9108 215.9961 10 a
around 10 times faster
system.time(minem(1000)) # ~19 sec
Update
If your data size and memory limits allow, you can do it much faster with this approach:
minem2 <- function(n) {
require(data.table)
participants_id <- unique(df$participant)
N <- length(unique(df$participant))
dt <- as.data.table(df)
setkey(dt, participant)
L <- lapply(1:n, function(x)
sapply(1:N, function(i)
sample(participants_id, i)))
L <- unlist(L, recursive = F)
names(L) <- 1:length(L)
g <- sapply(seq_along(L), function(x) rep(names(L[x]), length(L[[x]])))
L <- data.table(participant = unlist(L), .id = as.integer(unlist(g)),
key = "participant")
L <- dt[L, allow.cartesian = TRUE]
xx <- L[, .(mean_x = mean(judgment), criterion = first(criterion)),
keyby = .(.id, stimuli)]
xx <- xx[, cor(mean_x, criterion), keyby = .id][[2]]
xx
}
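A quick shape check (my own sketch): both functions return one correlation per (simulation, group size) pair, i.e. n * 10 values for this data set:
length(minem(2))   # 20  (2 simulations x 10 group sizes)
length(minem2(2))  # 20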
microbenchmark::microbenchmark(
Meir(100),
minem(100),
minem2(100),
times = 2, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval cld
# Meir(100) 316.34965 316.34965 257.30832 257.30832 216.85190 216.85190 2 c
# minem(100) 31.49818 31.49818 26.48945 26.48945 23.05735 23.05735 2 b
# minem2(100) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 2 a
But you will need to test it yourself.

Compare Matrices in R efficiently

I have an array a with some matrices in it. Now I need to efficiently check how many different matrices I have and what indices (in ascending order) they have in the array. My approach is the following: paste the columns of the matrices into character vectors and have a look at the frequency table, like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n),1),
c(2,2,n))
paste_a <- apply(a, c(3), paste, collapse=" ") # collapse each matrix into one string (by column)
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0, 1, 2 and 3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below refer to the data found in the real data set from the link.)
Base4Approach <- function() {
toBase4 <- sapply(1:dim(a)[3], function(x) {
v <- as.vector(a[,,x])
pows <- which(v > 0)
coefs <- v[pows]
sum(coefs*(4^pows))
})
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
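As a toy illustration (my own) of why the encoding is collision-free: each matrix entry is a base-4 digit, so distinct matrices map to distinct integers as long as all values lie in 0..3. (This uses 0-based exponents, as in the update further down; the function above uses 1-based exponents, which works just as well.)
v <- c(2, 0, 3, 1)              # one flattened 2x2 matrix
sum(v * 4^(seq_along(v) - 1))   # 2*1 + 0*4 + 3*16 + 1*64 = 114, a unique key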
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
### commented this out for comparison reasons
# dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3]
a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
library(microbenchmark)
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by @d.b. doesn't really do the same thing as the previous two approaches (it simply identifies duplicates and doesn't remove them).
DBApproach <- function() {
a[, , 9] = a[, , 1]
#Convert to list
mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
#The indices in 'a' where the matrices are same
temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of @alexis_laz
As mentioned in the comments by @alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
My first attempt was actually really slow, so here is a slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want, but here is a way you can extract the indices where the matrices are the same. More processing may be necessary to get exactly what you want.
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
MARGIN = 1,
FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9

Find elements in vector in R

I have a matrix with exactly 2 rows and n columns. Example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, which has a 1 in both rows) I need to find, to the left and to the right, the first instance of a 2/0 or 0/2 pair - in this example column 2 to the left and column 14 to the right.
The elements of every row can only be 0, 1 or 2 - nothing else. Is there a way to do that operation on large matrices (with 2 rows) fast? I need to do it 600k times, so speed might be a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
li <- ri <- cl
nc <- ncol(m)
repeat {
li <- li - 1
if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
l <- li
break
}
}
repeat {
ri <- ri + 1
if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
r <- ri
break
}
}
c(l, r)
})
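A quick check (mine) on the example matrix from the question:
myfun(matr, 9)
# [1]  2 14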
After taking into account @Martin Morgan's observations, here are the benchmarks:
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than @Laterow, but anyhow, this is a similar approach:
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the rows and adding them: columns with a 2/0 or 0/2 pair are exactly those where the result is 4 (0^2 + 2^2 = 4, while every other allowed pair gives 0, 1, 2, 5 or 8). Then simply find the last such column before column 9 (rev(which())[1]) and the first such column after column 9 (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to @MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i<-findInterval(9,loc);loc[c(i,i+1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
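For example (my own sketch, reusing loc from above), several target columns can be handled in a single call:
targets <- c(5, 9, 12)
i <- findInterval(targets, loc)
# nearest qualifying column to the left and to the right of each target
cbind(target = targets, left = loc[i], right = loc[i + 1])
# (targets lying before the first hit or after the last one would need the
#  NA handling shown in the other answers)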
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.
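For reference, the steps above can be wrapped into a single function; here is a sketch of the RHertel() wrapper referred to in the benchmark earlier in this thread (my own packaging, so the exact version that was benchmarked may differ slightly):
RHertel <- function(matr, thecol) {
  a3     <- matr[1, ] * matr[2, ]
  the02s <- which(colSums(matr) == 2 & a3 == 0)
  pos    <- findInterval(thecol, the02s)
  pos    <- c(pos, pos + 1)
  pos[pos == 0] <- NA  # no qualifying column on the left
  the02s[pos]          # an out-of-range index on the right also yields NA
}
RHertel(matr, 9)
# [1]  2 14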

Compare each row with other rows of matrix

I am looking for an efficient solution for the following problem:
b <- matrix(c(0,0,0,1,1,0), nrow = 2, byrow = T)
weight <- c(1,1)
times <- 5
abc <- do.call(rbind, replicate(times, b, simplify=FALSE))
weight <- rep.int(weight,times)
sum1 <- as.numeric(rep.int(NA,nrow(abc)))
##Rprof()
for(j in 1:nrow(abc)){
a <- abc[j,]
sum1[j] <- sum(weight[rowSums(t(a == t(abc)) + 0) == ncol(abc)])
}
##Rprof(NULL)
##summaryRprof()
Is there a faster way to do this? Rprof shows that rowSums(), t(), == and + are quite slow. If nrow is 20,000 it takes about 21 seconds.
Thanks for helping!
Edit: I have a matrix abc and a vector weight whose length equals nrow(abc). The first value of weight corresponds to the first row of matrix abc, and so on... Now I would like to determine which rows of matrix abc are equal. Then I want to remember the positions of those rows in order to sum up the weights at those same positions, and store the resulting sum for each row.
Here is a way that looks valid and fast:
ff <- function(mat, weights)
{
rs <- apply(mat, 1, paste, collapse = ";")
unlist(lapply(unique(rs),
function(x)
sum(weights[match(rs, x, 0) > 0])))[match(rs, unique(rs))]
}
ff(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
And comparing with your function:
ffOP <- function(mat, weights)
{
sum1 <- as.numeric(rep.int(NA,nrow(mat)))
for(j in 1:nrow(mat)) {
a <- mat[j,]
sum1[j] <- sum(weights[rowSums(t(a == t(mat)) + 0) == ncol(mat)])
}
sum1
}
ffOP(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
library(microbenchmark)
m = do.call(rbind, replicate(1e3, matrix(0:11, 3, 4), simplify = F))
set.seed(101); w = runif(1e3*3)
all.equal(ffOP(m, w), ff(m, w))
#[1] TRUE
microbenchmark(ffOP(m, w), ff(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 969.83968 986.47941 996.68563 1015.53552 1051.23847 10
# ff(m, w) 20.42426 20.64002 21.36508 21.97182 22.59127 10
For the record, I also implemented your approach in C; here are the benchmarks:
#> microbenchmark(ffOP(m, w), ff(m, w), ffC(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 957.66691 967.09429 991.35232 1000.53070 1016.74100 10
# ff(m, w) 20.60243 20.85578 21.70578 22.13434 23.04924 10
# ffC(m, w) 36.24618 36.40940 37.18927 37.39877 38.83358 10
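For comparison (my own sketch, not part of the answer above): the same row-key idea can be written as a single grouped sum with base R's ave(); it should agree with ff() on the benchmark data (ffAve is a hypothetical name):
ffAve <- function(mat, weights)
{
  key <- do.call(paste, c(as.data.frame(mat), sep = ";"))  # one key per row
  ave(weights, key, FUN = sum)                             # sum of weights within each key group
}
all.equal(ff(m, w), ffAve(m, w))
# TRUE expected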
