How to improve performance when counting matrix columns below a threshold? - r

In my code I subtract one column of a matrix from every other column of the same matrix.
Then I count how many of the resulting columns contain only elements smaller than r.
I do this for each column of the matrix. You can see my code below; I left out the part where I put values into the matrix.
Is there any way to improve the performance of this code? I can't seem to figure out a way to make it faster.
B <- matrix(NA, m, window_step)
B_m_r <- 1:window_step
for (i in 1:window_step) {
  # count the columns j != i whose elements all differ from column i by less than r
  B_m_r[i] <- sum(apply(abs(B[, -i] - B[, i]), 2, function(x) max(x) < r))
}
Solution
B <- matrix(NA, m, window_step)
B_m_r <- 1:window_step
buffer_B <- matrix(NA, window_step - 1, window_step - 1)
for (i in 1:(window_step - 2)) {
  # compare column i only against the columns after it and store the results
  buffer_B[i, i:(window_step - 1)] <- apply(abs(B[, -c(1:i)] - B[, i]), 2, function(x) max(x) < r)
  # reuse the comparisons stored in earlier iterations for the columns before i
  B_m_r[i] <- sum(buffer_B[i, i:(window_step - 1)]) + sum(buffer_B[1:i, i])
}
B_m_r[window_step] <- sum(buffer_B[1:(window_step - 1), window_step - 1])
B_m_r[window_step - 1] <- sum(buffer_B[1:(window_step - 2), window_step - 2])
OK, based on the help from Яaffael I found a solution that doesn't calculate the differences twice.
Instead I save the results of the comparison with r from previous loop iterations in the matrix buffer_B and reuse them in later iterations to count the columns whose differences are all smaller than r.
The code now takes only half the time to finish.
Thanks!

You can, for example, reduce the calculation time by 50% by only checking "< r" for half of the column differences, because they are symmetric: you are calculating both abs(first of B - last of B) and abs(last of B - first of B).
Plus, you can precalculate the whole difference matrix instead of using a for loop to set it up step by step.
# I am using single-row matrices to keep it simple
> A <- matrix(1:4,ncol=4)
> A[,1:ceiling(ncol(A)/2)]
[1] 1 2
> A[,ncol(A):(floor(ncol(A)/2)+1)]
[1] 4 3
> A <- matrix(1:5,ncol=5)
> A[,1:ceiling(ncol(A)/2)]
[1] 1 2 3
> A[,ncol(A):(floor(ncol(A)/2)+1)]
[1] 5 4 3
> abs(A[,1:ceiling(ncol(A)/2)] - A[,ncol(A):(floor(ncol(A)/2)+1)])
[1] 4 2 0
When you want to speed up code in R, the first thing to try is turning all loops into vectorized expressions using R functions. A loop runs within the R interpreter, while vectorized function calls let R execute essentially compiled C code.
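As a further illustration of that advice (my addition, not part of the original answers): base R's dist() with method = "maximum" computes exactly the max-abs column differences that the loop builds, so the whole count can be vectorized. A minimal sketch with made-up dimensions:

set.seed(1)
m <- 5; window_step <- 100; r <- 2
B <- matrix(rnorm(m * window_step), m, window_step)
# dist() on t(B) gives pairwise Chebyshev distances between the columns of B,
# i.e. D[i, j] == max(abs(B[, i] - B[, j]))
D <- as.matrix(dist(t(B), method = "maximum"))
# the diagonal is 0 and always passes "< r", so subtract 1 to exclude column i itself
B_m_r <- colSums(D < r) - 1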

Related

R: looking for a more efficient way of running rbinom() taking probabilities from a large matrix

I have a 20000x90 matrix M of probabilities ranging between 0 and 1. I want to use these probabilities to return a 20000x90 matrix of 1's and 0's (i.e. for each probability p in the matrix I want to run rbinom(1,1,p)).
This is my current solution which works:
apply(M,1:2,function(x) rbinom(1,1,x))
However, it is quite slow and I am wondering if there is a faster way of achieving the same thing.
rbinom is vectorised, so you can pass a vector (or matrix) of probabilities directly; you just need to change n, the number of observations, from 1 to the number of values in M.
An example:
# with loop
set.seed(74309281)
M = matrix(runif(20000*90), nr=20000, nc=90)
a1 = apply(M,1:2,function(x) rbinom(1,1,x))
# without loop
set.seed(74309281)
M = matrix(runif(20000*90), nr=20000, nc=90)
a2 = matrix(rbinom(length(M),1,M), nr=nrow(M), nc=ncol(M))
all.equal(a1, a2)
# [1] TRUE

The sum of the first n odd integers

I am trying to create a function that takes the sum of the first n odd integers, i.e. the summation from i = 1 to n of (2i - 1).
If n = 1 it should output 1.
If n = 2 it should output 4.
I'm having problems with a for loop that only outputs the nth term:
n <- 2
for (i in 1:n)
{
  y <- (2 * i) - 1  # y is overwritten on every pass, so only the nth term survives
}
y
In R programming we try to avoid for loops:
cumsum(seq(1, 2 * n, by = 2))
Or just use 'sum' if you don't want the series of partial sums.
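For example, the single total for n = 2:

sum(seq(1, 2 * n, by = 2))
# [1] 4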
There's actually no need to use a loop or to construct the sequence of the first n odd numbers here -- this is an arithmetic series so we know the sum of the first n elements in closed form:
sum.first.n.odd <- function(n) n^2
sum.first.n.odd(1)
[1] 1
sum.first.n.odd(2)
[1] 4
sum.first.n.odd(100)
[1] 10000
This should be a good deal more efficient than any solution based on for or sum because it never computes the elements of the sequence.
[[Just seeing the title -- the OP apparently knows the analytic result and wanted something else...]]
Try this:
sum <- 0
n <- 2
for (i in seq(1, 2 * n, 2)) {
  sum <- sum + i
}
But, of course, R is rather slow when working with loops. That's why one should avoid them.

i not showing up as number in loop

so I have a loop that finds the position in the matrix of the largest difference between consecutive elements. For example, if thematrix[8] and thematrix[9] have the largest difference of any two consecutive elements, the result should be 8.
I made the loop ignore comparisons where one of the elements is NaN (because my data contains some). The loop I made looks like this.
thenumber = 0  # will store the position of the largest difference
for (i in 1:nrow(thematrix) - 1) {
  if (!is.na(thematrix[i]) & !is.na(thematrix[i + 1])) {
    if (abs(thematrix[i] - thematrix[i + 1]) > thenumber) {
      thenumber = i
    }
  }
}
This looks like it should work, but whenever I run it I get:
Error in if (!is.na(thematrix[i]) & !is.na(thematrix[i + 1])) { :
argument is of length zero
I tried the same thing with a literal number in the brackets instead of i and it works. It fails only with the i defined by the for loop, as if R doesn't recognize that i represents a number. Why doesn't R recognize i?
Also, if there's a better way to do this task I'd appreciate it greatly if you could explain it to me.
You are pretty close, but 1:nrow(thematrix) - 1 evaluates to the sequence 0:(nrow(thematrix) - 1), so the loop starts at i = 0, which is what causes this issue. I would suggest either i in 1:nrow(thematrix) or i in 2:nrow(thematrix) - 1 to start your loop at i = 1. I think your approach is generally pretty intuitive, but one suggestion would be to use the print() function frequently to see how i changes over the course of your loop.
The issue is that the : operator has higher precedence than -; you just need to use parentheses around (nrow(thematrix)-1). For example,
thematrix <- matrix(1:10, nrow = 5)
##
wrong <- 1:nrow(thematrix) - 1
right <- 1:(nrow(thematrix) - 1)
##
R> wrong
#[1] 0 1 2 3 4
R> right
#[1] 1 2 3 4
The error message comes from trying to access the zero-th element of thematrix:
R> thematrix[0]
integer(0)
The other two answers address your question directly, but I must say this is about the worst possible way to solve this problem in R.
set.seed(1) # for reproducible example
x <- sample(1:10,10) # numbers 1:10 in random order
x
# [1] 3 4 5 7 2 8 9 6 10 1
which.max(abs(diff(x)))
# [1] 9
The diff(...) function calculates sequential differences, and which.max(...) identifies the element number of the maximum value in a vector.
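One note on the OP's NaN concern (my addition, not part of the answer): diff() propagates missing values, but which.max() discards NA/NaN, so differences involving missing data are simply never selected:

y <- c(3, 4, NA, 7, 2)
abs(diff(y))             # 1 NA NA 5
which.max(abs(diff(y)))  # [1] 4 -- the NA entries are skipped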

counting matching elements of two vectors but not including repeated elements in the count

I've searched a lot in this forum, but I didn't find a problem similar to the one I'm facing.
My question is:
I have two vectors
x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8) and z <- c(1,1,2,4,5,5,5)
I need to count how many times the elements of x and z appear in each other, taking repetitions into account.
The answer for this example should be 4, because both vectors contain two 1s, one 2, and one 4.
Functions like match() don't help, since they report a match for every occurrence of a repeated value even when it appears only once in the other vector, and using unique() changes the answer from 4 to 3.
What I came up with is a loop that removes each number from the other vector as soon as it is found, so it won't be counted again.
The loop works fine for an example of this size; however, running this search many times over larger vectors makes my loop too slow for my purposes.
system.time({
  for (n in 1:1000) {
    x <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 6, 7, 8)
    z <- c(1, 1, 2, 4, 5, 5, 5)
    score <- 0
    for (s in z) {
      if (s %in% x) {
        # remove the matched element so it is not counted again
        x <- x[-which(x == s)[1]]
        score <- score + 1
      }
    }
  }
})
Can someone suggest a better method?
I've tried using lapply; it is faster for short vectors, but it became slower for longer ones.
Use R's vectorization to your advantage here. There's no looping necessary.
You could use a table to look at the frequencies,
table(z[z %in% x])
#
# 1 2 4
# 2 1 1
And then take the sum of the table for the total
sum(table(z[z %in% x]))
# [1] 4
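One caveat (my addition, not part of the original answer): z[z %in% x] counts every matching element of z, so it can over-count when a value repeats more often in z than in x. A sketch that caps each value at its smaller multiplicity:

u  <- union(x, z)
tx <- table(factor(x, levels = u))  # per-value counts in x
tz <- table(factor(z, levels = u))  # per-value counts in z
sum(pmin(tx, tz))
# [1] 4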

optimizing markov chain transition matrix calculations?

As an intermediate R user, I know that for loops can very often be optimized with functions like apply. However, I am not aware of functions that can optimize my current code for generating a Markov chain transition matrix, which is running quite slowly. Have I maxed out on speed, or am I overlooking something? I am trying to find the transition matrix for a Markov chain by counting the number of occurrences in the 24-hour period before each alert. The vector ids contains all possible ids (about 1700).
The original matrix looks like this, as an example:
> matrix
id       time
 1 1376084071
 1 1376084937
 1 1376023439
 2 1376084320
 2 1372983476
 3 1374789234
 3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time), ]
frequency <- 86400  # number of seconds in 1 day

# Initialize matrix that will contain probabilities
transprobs <- matrix(data = 0, nrow = length(ids), ncol = length(ids))

# Loop through each type of event
for (i in 1:length(ids)) {
  localmatrix <- matrix[matrix$id == ids[i], ]
  # Loop through each row of the event
  for (j in 1:nrow(localmatrix)) {
    localtime <- localmatrix[j, ]$time
    # Find top and bottom row number defining the 1-day window
    indices <- which(matrixtimesort$time < localtime & matrixtimesort$time >= (localtime - frequency))
    # Find IDs that occur within the 1-day window
    positiveids <- unique(matrixtimesort[min(indices):max(indices), ]$id)
    # Add one to each cell in the matrix that corresponds to the occurrence of an event
    for (l in 1:length(positiveids)) {
      k <- which(ids == positiveids[l])
      transprobs[i, k] <- transprobs[i, k] + 1
    }
  }
  # Divide each row by total number of occurrences to determine probabilities
  transprobs[i, ] <- transprobs[i, ] / nrow(localmatrix)
}

# Normalize rows so that row sums are equal to 1
normalized <- transprobs / rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?
Using nested loops here seems like a bad idea; your code can be vectorized to speed it up.
For example, why find the top and bottom row numbers at all? You can simply compare the time values against the window bound "time_0 + frequency" directly: that is a single vectorized operation.
HTH.
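To make that concrete, here is a minimal sketch of the vectorized window test the answer describes (my own illustration; events stands in for the OP's data frame):

events <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 3),
                     time = c(1376084071, 1376084937, 1376023439,
                              1376084320, 1372983476, 1374789234, 1370234809))
frequency <- 86400  # seconds in one day

# for one reference time, the whole 1-day window test is a single
# vectorized comparison over all rows -- no index bookkeeping needed
localtime <- events$time[1]
in_window <- events$time < localtime & events$time >= localtime - frequency
positiveids <- unique(events$id[in_window])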
