R - vectorizing a which operation - r

Hi I have a function in R that I'm trying to optimize for performance. I need to vectorize a for loop. My problem is the slightly convoluted data structure and the way I need to perform lookups using the 'which' command.
Lets say we are dealing with 5 elements (1,2,3,4,5), the 10x2 matrix pairs is a combination of all unique pairs the 5 elements (i.e. (1,2), (1,3),(1,4) ....(4,5)). all_prods is a 10x1 matrix that I need to look up using the pairs while iterating through all the 5 elements.
So for 1, I need to index rows 1, 2, 3, 4 (pairs 1,2 1,3 1,4 and 1,5) from all_prods and so on for 1, 2, 3, 4, 5.
I have only recently switched to R from matlab so would really appreciate any help.
foo <- function(AA , BB , CC ){
pa <- AA*CC;
pairs <- t(combn(seq_len(length(AA)),2));
all_prods <- pa[pairs[,1]] * pa[pairs[,2]];
result <- matrix(0,1,length(AA));
# WANT TO VECTORIZE THIS BLOCK
for(st in seq(from=1,to=length(AA))){
result[st] <- sum(all_prods[c(which(pairs[,1]==st), which(pairs[,2]==st))])*BB[st];
}
return(result);
}
AA <- seq(from=1,to=5); BB<-seq(from=11,to=15); CC<-seq(from=21,to=25);
results <- foo(AA,BB,CC);
#final results is [7715 164208 256542 348096 431250]
I want to convert the for loop into a vectorised version. Instead of looping through every element st, I'd like to do it in one command that gives me a results vector (rather than building it up element by element)

You could write your function like this:
foo <- function(AA, BB, CC) {
pa <- AA*CC
x <- outer(pa, pa)
diag(x) <- 0
res <- colSums(x)*BB
return(res)
}
The key idea is to not break the symmetry. Your use of ordered pairs corresponds to the upper right triangle of my matrix x. Although this seems like just half as many values to compute, the syntactic and computational overhead becomes quite large. You are distinguishing situations where st is the first element in the pair from those where it is the second. Later on this leads to quite some trouble to get rid of that distinction. Having the full symmetric matrix, you don't have to worry about order, and things vectorize smoothly.

Related

In R, is it possible to use a pair, tuple or equivalent in a matrix?

I am trying to create a matrix of coordinates(indexes) that I randomly pick one from using the sample function. I then use these to select a cell in another matrix. What is the best way to do this? The trouble is how to store these integers in the matrix so that they are easy to separate. Right now I have them stored as strings with a comma, that I then split. Someone suggested I use a pair, or a string, but I cannot seam to get these to work with a matrix. Thanks!
EDIT:What i currently have looks like this (changed a little to make sense out of context):
probs <- matrix(c(0,0,0.6,0,0,
0,0.7,1,0.7,0,
0.6,1,0,1,0.6,
0,0.7,1,0.7,0,
0,0,0.6,0,0),5,5)
cordsMat <- matrix("",5,5)
for (x in 1:5){
for (y in 1:5){
cordsMat[x,y] = paste(x,y,sep=",")
}
}
cords <- sample(cordsMat,1,,probs)
cordsVec <- unlist(strsplit(cords,split = ","))
cordX <- as.numeric(cordsVec[1])
cordY <- as.numeric(cordsVec[2])
otherMat[cordX,cordY]
It sort of works but i would also be interested for a better way, as this will get repeated a lot.
If you want to set the probabilities it can easily be done by providing it to sample
# creating the matrix
matrix(sample(rep(1:6, 15:20), 25), 5) -> other.mat
# set the probs vec
probs <- c(0,0,0.6,0,0,
0,0.7,1,0.7,0,
0.6,1,0,1,0.6,
0,0.7,1,0.7,0,
0,0,0.6,0,0)
# the coordinates matrix
mat <- as.matrix(expand.grid(1:nrow(other.mat),1:ncol(other.mat)))
# sampling a row randomly
sample(mat, 1, prob=probs) -> rand
# getting the value
other.mat[mat[rand,1], mat[rand,2]]
[1] 6

Indexing variables in R

I am normally a maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Because you R is not zero indexed add one
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
This will work and I think is what you were trying to do, BUT is not great R code. I strongly recommend using the lapply family instead of for loops, learning to use data.table and parallelisation if you need to get things to scale. Additionally if you want to read more about indexing in R and subsetting Hadley Wickham has a comprehensive break down here.
Hope this helps!
Let me start with a few remarks and then show you, how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, you get an error. What you probably want to use is a list, as you will see in the example below.
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
X <- ifelse(x > 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.

Easiest way in R to get vector of frequencies of elements in vector

Suppose I have a vector of values v. What is the easiest way to get a vector f of length equal to v, where the ith element of f is the frequency of the ith element of v in v?
The only way I know to do it seems unnecessarily complicated:
v = sample(1:10,100,replace=TRUE)
D = data.frame( idx=1:length(v), v=v )
E = merge( D, data.frame(table(v)) )
E = E[ with(E,order(idx)), ]
f = E$Freq
Surely there's a simpler way to do this, along the lines of "frequencies(v)"?
For a vector of small positive integers v, as in the question, the expression
tabulate(v)[v]
is particularly simple as well as speedy.
For more general numerical vectors v you can persuade ecdf to help you out, as in
w <- sapply(v, ecdf(v)) * length(v)
tabulate(w)[w]
It's probably better to do the coding of the underlying algorithm yourself, though--and it certainly avoids the floating point rounding error implicit in the preceding solution:
frequencies <- function(x) {
i <- order(x)
v <- x[i]
w <- cumsum(c(TRUE, v[-1] != v[-length(x)]))
f <- tabulate(w)[w]
return(f[order(i)])
}
This algorithm sorts the data, assigns sequential identifiers 1, 2, 3, ... to the values as it encounters them (by summing a binary indicator of when the values change), uses the preceding tabulate()[] trick to obtain the frequencies efficiently, and then unsorts the results to make the output match the input, component by component.
I think the best solution here is:
ave(v,v,FUN=length)
It is simply ave()'s design to replicate and map the return value of FUN() back to every index of the input vector whose element was part of the group for which that particular invocation of FUN() was performed.
Something like this works for me:
sapply(v, function(elmt, vec) sum(vec == elmt), vec=v)
i would suggest you use table and as.vector:
as.vector(table(dataInVector))

How to print the name of current row when using apply in R?

For example, I have a matrix k
> k
d e
a 1 3
b 2 4
I want to apply a function on k
> apply(k,MARGIN=1,function(p) {p+1})
a b
d 2 3
e 4 5
However, I also want to print the rowname of the row being apply so that I can know which row the function is applied on at that time.
It may looks like this:
apply(k,MARGIN=1,function(p) {print(rowname(p)); p+1})
But I really don't do how to do that in R.
Does anyone has any idea?
Here's a neat solution to what I think you're asking. (I've called the input matrix mat rather than k for clarity - in this example, mat has 2 columns and 10 rows, and the rows are named abc1 through to abc10.)
In the code below, the result out1 is the thing you wanted to calculate (the outcome of the apply command). The result out2 comes out identically to out1 except that it prints out the rownames that it is working on (I put in a delay of 0.3 seconds per row so you can see it really does do this - take this out when you want the code to run full speed obviously!)
The trick I came up with was to cbind the row numbers (1 to n) onto the left of mat (to create a matrix with one additional column), and then use this to refer back to the rownames of mat. Note the line x = y[-1] which means that the actual calculation within the function (here, adding 1) ignores the first column of row numbers, which means it's the same as the calculation done for out1. Whatever sort of calculation you want to perform on the rows can be done this way - just pretend that y never existed, and formulate your desired calculation using x. Hope this helps.
set.seed(1234)
mat = as.matrix(data.frame(x = rpois(10,4), y = rpois(10,4)))
rownames(mat) = paste("abc", 1:nrow(mat), sep="")
out1 = apply(mat,1,function(x) {x+1})
out2 = apply(cbind(seq_len(nrow(mat)),mat),1,
function(y) {
x = y[-1]
cat("Doing row:",rownames(mat)[y[1]],"\n")
Sys.sleep(0.3)
x+1
}
)
identical(out1,out2)
You can use a variable outside of the apply call to keep track of the row index and pass the row names as an extra argument to your function:
idx <- 1
apply(k, 1, function(p, rn) {print(rn[idx]); idx <<- idx + 1; p + 1}, rownames(k))
This should work. The cat() function is what you want to use when printing results during evaluation of a function. paste(), conversely, just returns a character vector but doesn't send it to the command window.
The solution below uses a counter created as a closure, allowing it to "remember" how many times the function has been run before. Note the use of the global assign <<-. If you really want to understand what's going on here, I recommend reading through this wiki https://github.com/hadley/devtools/wiki/
Note there may be an easier way to do this; my solution assumes that there is no way to access the rownumber or rowname of a current row using typical means within an apply function. As previously mentioned, this would be no problem in a loop.
k <- matrix(c(1,2,3,4),ncol=2)
rownames(k) <- c("a","b")
colnames(k) <- c("d","e")
make.counter <- function(x){
i <- 0
function(){
i <<- i+1
i
}
}
counter1 <- make.counter()
apply(k,MARGIN=1,function(p){
current.row <- rownames(k)[counter1()]
cat(current.row,"\n")
return(p+1)
})
As far as I know you cannot do that with apply, but you could loop through the rownames of your data frame. Lame example:
lapply(rownames(mtcars), function(x) sprintf('The mpg of %s is %s.', x, mtcars[x, 1]))

Make nested loops more efficient?

I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
if (x == c_1) {
1
}else{
0
}
}
both_c_check <- function(x){
if (x[res_1] == c_1 && x[res_2] == c_1) {
1
}else{
0
}
}
variance_function <- function(x,y){
sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
for (c_1 in 1:conf_vector[res_1]){
res_1_insertion <- res_1_insertion + 1
insertion <- insertion + 1
res_1_subset <- sapply(M[,res_1], c_check)
freq_vector[insertion] <- sum(res_1_subset)/frames_total
res_2_insertion <- 0
for (res_2 in 1:cols){
if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
for (c_2 in 1:max(c_vector[res_2])){
res_2_insertion <- res_2_insertion + 1
both_res_subset <- apply(M, 1, both_c_check)
co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
}
}
}
}
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply goes faster, look at Is R's apply family more than syntactic sugar? . It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code, that slows down your code incredibly. preallocate the matrix and fill it up, that can increase your code speed more than a tenfold. You're growing different vectors and matrices through your code, that's insane (forgive me the strong speech)
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, You recalculate the same values over and over again. fre_res_2 for example is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Get out of your loops what you don't need to recalculate, and save it in matrices you can just access again.
Heck, now I'm at it: Please use vectorized functions. Think again and see what you can drag out of the loops : This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.

Resources