Vectorizing an R loop for better performance

I am having trouble finding a vectorized formulation for a specific loop in R. My objective is to improve the loop's performance, because it has to run thousands of times in the algorithm.
For each row, I want to find the position of the lowest value within each section of the array, where the sections are defined by a vector 'Level'.
Example:
Level = c(2,3)
Let the first row of the array X be: c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is, (2, -1)), I get 2, because -1 < 2 and -1 is in the second position of the row. Then, searching for the position of the lowest value in the second range (Level[1]+1):(Level[1]+Level[2]) (that is, (3, 0.5, 4)), I get 4, because 0.5 < 3 < 4 and 0.5 is in the fourth position of the row.
I have to perform this over each row in the array.
My solution to the problem works as follows:
Level = c(2,3,3)   # elements per section, here: 3 sections with 2, 3 and 3 levels
rows = 10          # number of rows in array X
X = matrix(runif(rows*sum(Level),-5,5), rows, sum(Level))   # array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0, rows, length(Level))   # array in which the position of the minimum value for each section and row is stored
for(i in 1:rows){
  for(j in 1:length(Level)){   # length(Level) is the number of intervals, here: 3
    if(j == 1){coeff = 0}else{coeff = 1}
    Position_min[i,j] = coeff*sum(Level[1:(j-1)]) +
      which(X[i, (coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])] ==
              min(X[i, (coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])]))
  }
}
It works fine but I would prefer a solution with better performance. Any ideas?

This will remove the outer level of the loop:
Level1 = c(0, cumsum(Level))
for(j in 1:(length(Level1)-1)){
  Position_min[, j] = max.col(-X[, (Level1[j]+1):Level1[j+1]]) + Level1[j]
}
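For readers unfamiliar with the max.col() trick: it returns the column index of the row-wise maximum, so applying it to -X gives the position of the row-wise minimum. A tiny sketch (note that max.col() breaks ties at random by default, while which.min() takes the first occurrence; pass ties.method = "first" if you need them to match exactly):
m <- rbind(c(2, -1, 3), c(0.5, 4, 0.25))
max.col(-m, ties.method = "first")   # 2 3 : positions of the row minima
apply(m, 1, which.min)               # same result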

Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
  series <- rep(1:length(level), level)
  x <- split(x, factor(series))
  minsSplit <- as.numeric(sapply(x, which.min))
  minsSplit + c(0, cumsum(level[-length(level)]))
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
You can get better performance by making your matrix into a list, and then using parallel's mclapply():
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8
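If you want to compare these approaches, a rough timing sketch like the one below can help; it assumes the microbenchmark package is installed and that X, Level and findmins() are as defined above (by_maxcol simply wraps the max.col() loop, here with ties.method = "first" for deterministic results). On the tiny 10-row example the differences are negligible, so increase rows before drawing conclusions.
library(microbenchmark)

by_maxcol <- function(X, Level) {
  Level1 <- c(0, cumsum(Level))
  out <- matrix(0, nrow(X), length(Level))
  for (j in seq_along(Level)) {
    out[, j] <- max.col(-X[, (Level1[j] + 1):Level1[j + 1]], ties.method = "first") +
      Level1[j]
  }
  out
}
by_apply <- function(X, Level) t(apply(X, 1, findmins, Level))

microbenchmark(by_maxcol(X, Level), by_apply(X, Level), times = 100L)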

Related

Calculating moving differences across columns per row in r

I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise how to make this code shorter and/or suggest any other packages/functions to check out that will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but it is a long way around, so I would like to simplify the code and make it generic for any number of observations in my data frame and any number of columns (i.e. more than 3). I would like to "automate" the steps below, given a known number of observations (i.e. rows).
row1 = diff(as.numeric(test.data[1,]), lag = 1)
row2 = diff(as.numeric(test.data[2,]), lag = 1)
row3 = diff(as.numeric(test.data[3,]), lag = 1)
row4 = diff(as.numeric(test.data[4,]), lag = 1)
row5 = diff(as.numeric(test.data[5,]), lag = 1)
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
An apply-based solution, applying diff row-wise, can be written as:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
                        tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40

Create n by n matrix with unique values from 1:n

I want to generate a random n by n matrix in R with discrete values ranging from 1 to n. The tricky part is that I want each value to be unique both in its row and in its column.
For example, if n=3 the matrix could look like:
1 2 3
2 3 1
3 1 2
or it could look like this:
2 3 1
1 2 3
3 1 2
Does anyone have an idea of how to generate this kind of matrix?
What you want is called a Latin square. Here's one function (from the Cookbook for R; a number of other references can be found online) for generating them:
latinsquare <- function(len, reps=1, seed=NA, returnstrings=FALSE) {
  # Save the old random seed and use the new one, if present
  if (!is.na(seed)) {
    if (exists(".Random.seed")) { saved.seed <- .Random.seed }
    else                        { saved.seed <- NA }
    set.seed(seed)
  }
  # This matrix will contain all the individual squares
  allsq <- matrix(nrow=reps*len, ncol=len)
  # Store a string id of each square if requested
  if (returnstrings) { squareid <- vector(mode = "character", length = reps) }
  # Get a random element from a vector (the built-in sample function annoyingly
  # has different behavior if there's only one element in x)
  sample1 <- function(x) {
    if (length(x)==1) { return(x) }
    else              { return(sample(x,1)) }
  }
  # Generate each of n individual squares
  for (n in 1:reps) {
    # Generate an empty square
    sq <- matrix(nrow=len, ncol=len)
    # If we fill the square sequentially from top left, some latin squares
    # are more probable than others. So we have to do it in random order,
    # all over the square.
    # The rough procedure is:
    # - randomly select a cell that is currently NA (call it the target cell)
    # - find all the NA cells sharing the same row or column as the target
    # - fill the target cell
    # - fill the other cells sharing the row/col
    # - If it ever is impossible to fill a cell because all the numbers
    #   are already used, then quit and start over with a new square.
    # In short, it picks a random empty cell, fills it, then fills in the
    # other empty cells in the "cross" in random order. If we went totally
    # randomly (without the cross), the failure rate is much higher.
    while (any(is.na(sq))) {
      # Pick a random cell which is currently NA
      k <- sample1(which(is.na(sq)))
      i <- (k-1) %% len + 1        # Get the row num
      j <- floor((k-1) / len) + 1  # Get the col num
      # Find the other NA cells in the "cross" centered at i,j
      sqrow <- sq[i,]
      sqcol <- sq[,j]
      # A matrix of coordinates of all the NA cells in the cross
      openCell <- rbind(cbind(which(is.na(sqcol)), j),
                        cbind(i, which(is.na(sqrow))))
      # Randomize fill order
      openCell <- openCell[sample(nrow(openCell)),]
      # Put center cell at top of list, so that it gets filled first
      openCell <- rbind(c(i,j), openCell)
      # There will now be three entries for the center cell, so remove duplicated entries
      # Need to make sure it's a matrix -- otherwise, if there's just
      # one row, it turns into a vector, which causes problems
      openCell <- matrix(openCell[!duplicated(openCell),], ncol=2)
      # Fill in the center of the cross, then the other open spaces in the cross
      for (c in 1:nrow(openCell)) {
        # The current cell to fill
        ci <- openCell[c,1]
        cj <- openCell[c,2]
        # Get the numbers that are unused in the "cross" centered on ci,cj
        freeNum <- which(!(1:len %in% c(sq[ci,], sq[,cj])))
        # Fill in this location on the square
        if (length(freeNum)>0) { sq[ci,cj] <- sample1(freeNum) }
        else {
          # Failed attempt - no available numbers
          # Re-generate empty square
          sq <- matrix(nrow=len, ncol=len)
          # Break out of loop
          break;
        }
      }
    }
    # Store the individual square into the matrix containing all squares
    allsqrows <- ((n-1)*len) + 1:len
    allsq[allsqrows,] <- sq
    # Store a string representation of the square if requested. Each unique
    # square has a unique string.
    if (returnstrings) { squareid[n] <- paste(sq, collapse="") }
  }
  # Restore the old random seed, if present
  # (any() is used because saved.seed may be the full .Random.seed vector)
  if (!is.na(seed) && !any(is.na(saved.seed))) { .Random.seed <- saved.seed }
  if (returnstrings) { return(squareid) }
  else               { return(allsq) }
}
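A minimal usage sketch for the function above; seeding externally keeps the example independent of the function's own seed-handling logic, and the check confirms each of 1:4 appears exactly once per row and column.
set.seed(42)                  # reproducible draw
sq <- latinsquare(4)          # one 4 x 4 Latin square
sq
all(apply(sq, 1, sort) == 1:4) && all(apply(sq, 2, sort) == 1:4)   # should be TRUE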
mats below is a list of such matrices. The code uses r2dtable to generate N random n x n matrices whose elements are chosen from 0, 1, ..., n-1 and whose row and column margins are each given by margins. It then keeps only those for which every column contains one each of 0:(n-1), and adds one to each remaining matrix to give the result. The number of matrices returned can vary, and as n grows you have to generate a huge number of tries N to get just a few. When I tried n <- 3 below, mats was a list of 24 matrices out of 100, but with n <- 4 it only found 1 out of 100.
set.seed(123)
N <- 100 # no of tries
n <- 3 # rows of matrix (= # cols)
check <- function(x) all(apply(x, 2, sort) == seq_len(nrow(x))-1)
margin <- sum(seq_len(n))-n
margins <- rep(margin, n)
L <- r2dtable(N, r = margins, c = margins)
mats <- lapply(Filter(check, L), "+", 1)
Here is an attempt:
x <- c(1,2,3)
out <- NULL
for(i in 1:3){
  y <- c(x[1 + (i+0) %% 3], x[1 + (i+1) %% 3], x[1 + (i+2) %% 3])
  out <- rbind(out,y)
}
This gives:
> out
[,1] [,2] [,3]
y 2 3 1
y 3 1 2
y 1 2 3
For the general case:
n <- 4
x <- 1:n
out <- NULL
for(i in 1:n){
  y <- x[1 + ((i+0:(n-1))%%n)]
  out <- rbind(out,y)
}
If I'm not wrong this is the expected result:
> out
[,1] [,2] [,3] [,4]
y 2 3 4 1
y 3 4 1 2
y 4 1 2 3
y 1 2 3 4
Shorter:
n <- 4
x <- 1:n
vapply(x, function(i) x[1 + ((i+0:(n-1))%%n)], numeric(n))
(This returns the transpose of out above, which still has the required property.)
Here is one version that generates all possible rows for such a matrix and then takes them one by one, restricting the selection to the valid choices each time:
n <- 9
allrows <- combinat::permn(n)
takerows <- function(taken, all) {
  available <- rep(TRUE, length(all))
  for(i in 1:nrow(taken)) {
    available <- sapply(all, function(x) all((x-taken[i,])!=0)) & available
  }
  matrix(all[[which(available)[sample(sum(available), 1)]]], nrow=1)
}
magicMat <- takerows(matrix(rep(0, n), ncol=n), allrows)
for(i in 1:(n-1)) {
  magicMat <- rbind(magicMat, takerows(magicMat, allrows))
}
> magicMat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 5 3 1 4 2 8 6 7 9
[2,] 9 8 6 2 1 3 7 4 5
[3,] 4 5 7 8 9 2 3 6 1
[4,] 3 9 2 1 6 7 5 8 4
[5,] 1 6 5 3 8 4 2 9 7
[6,] 7 2 4 9 3 5 8 1 6
[7,] 6 4 8 5 7 1 9 3 2
[8,] 8 1 9 7 5 6 4 2 3
[9,] 2 7 3 6 4 9 1 5 8

Ranking and Counting Matrix Elements in R

I know there are similar questions but I couldn't find an answer to my question. I'm trying to rank elements in a matrix and then extract data of 5 highest elements.
Here is my attempt.
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
start<-d[1,1]
for (i in 1:10) {
  for (j in 1:10) {
    if (start < d[i,j]) {
      start  <- d[i,j]   # keep track of the running maximum
      high   <- d[i,j]
      rowind <- i
      colind <- j
    }
  }
}
Although this gives me the data of the highest element, including row and column numbers, I can't think of a way to do the same for elements ranked from 2 to 5. I also tried
rank(d, ties.method="max")
But it wasn't helpful because it just spits out the rank in vector format.
What I ultimately want is a data frame (or any sort of table) that contains
rank, column name, row name, and the data(number) of highest 5 elements in matrix.
Edit
set.seed(20)
d<-matrix(rnorm(100),nrow=10,ncol=10)
d[1,2]<-5
d[2,1]<-5
d[1,3]<-4
d[3,1]<-4
Thanks for the answers. Those worked perfectly for my purpose, but as I'm running this code for a correlation chart, where there will be duplicate numbers for every pair, I want to count only one of the two numbers for ranking purposes. Is there any way to do this? Thanks.
Here's a very crude way:
DF = data.frame(row = c(row(d)), col = c(col(d)), v = c(d))
DF[order(DF$v, decreasing=TRUE), ][1:5, ]
row col v
91 1 10 2.208443
82 2 9 1.921899
3 3 1 1.785465
32 2 4 1.590146
33 3 4 1.556143
It would be nice to only have to partially sort, but in ?order, it looks like this option is only available for sort, not for order.
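As a rough sketch of that idea (not part of the original answer): sort() does take a partial argument, so you can find the 5th-largest value cheaply and then subset DF on that cutoff; ties at the cutoff may return more than 5 rows, and 'top' is just a throwaway name.
cutoff <- -sort(-c(d), partial = 5)[5]   # 5th-largest value of d
top <- DF[DF$v >= cutoff, ]
top[order(top$v, decreasing = TRUE), ]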
If the matrix has row and col names, it might be convenient to see them instead of numbers. Here's what I might do:
dimnames(d) <- list(letters[1:10], letters[1:10])
DF = data.frame(as.table(d))
DF[order(DF$Freq, decreasing=TRUE), ][1:5, ]
Var1 Var2 Freq
91 a j 2.208443
82 b i 1.921899
3 c a 1.785465
32 b d 1.590146
33 c d 1.556143
The column names don't make much sense here, unfortunately, but you can change them with names(DF) <- as usual.
Here is one option with the Matrix package:
library(Matrix)
m1 <- summary(Matrix(d, sparse=TRUE))
head(m1[order(-m1[,3]),],5)
# i j x
#93 3 10 2.359634
#31 1 4 2.234804
#23 3 3 1.980956
#55 5 6 1.801341
#16 6 2 1.678989
Or use melt
library(reshape2)
m2 <- melt(d)
head(m2[order(-m2[,3]), ], 5)
Here is something quite simple in base R.
# set.seed(20)
# d <- matrix(rnorm(100), nrow = 10, ncol = 10)
d.rank <- matrix(rank(-d), nrow = 10, ncol = 10)
which(d.rank <= 5, arr.ind=TRUE)
row col
[1,] 3 1
[2,] 2 4
[3,] 3 4
[4,] 2 9
[5,] 1 10
d[d.rank <= 5]
[1] 1.785465 1.590146 1.556143 1.921899 2.208443
Results can (easily) be made clearer (see comment from Frank):
cbind(which(d.rank <= 5, arr.ind=TRUE), v = d[d.rank <= 5], rank = rank(-d[d.rank <= 5]))
row col v rank
[1,] 3 1 1.785465 3
[2,] 2 4 1.590146 4
[3,] 3 4 1.556143 5
[4,] 2 9 1.921899 2
[5,] 1 10 2.208443 1
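Regarding the edit about correlation-style matrices, where every off-diagonal value appears twice: one possible sketch (not taken from the answers above) is to blank out the lower triangle and the diagonal before building the lookup table, so each pair is counted only once. DF2 is just an illustrative name here.
ds <- d                                  # assumes d is symmetric
ds[lower.tri(ds, diag = TRUE)] <- NA     # keep one copy of each pair, drop the diagonal
DF2 <- data.frame(row = c(row(ds)), col = c(col(ds)), v = c(ds))
DF2 <- DF2[!is.na(DF2$v), ]
head(DF2[order(DF2$v, decreasing = TRUE), ], 5)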

creating a sum matrix from counts

I have one matrix of mutation counts, say "counts". This matrix has column names V1, V2, ..., Vi, ..., Vn, where not every index i is present, so the names can jump, e.g. V1, V2, V5. Further, most of the columns contain a 0.
I need to create a sum matrix, called "answer", where element i, j is the sum of the counts at both i and j. The i, i element just shows the count at i.
Here's a quick data set up. I already have the correct dimensioned matrix set up in my code called "answer". Thus what I would need to automate are the last several lines where I fill in the matrix.
counts <- matrix(data = c(0,2,0,5,0,6,0), nrow = 1, ncol = 7, dimnames=list("",c("V1","V2","V3","V4","V5","V6","V7")))
answer <- matrix(data =0, nrow = 3, ncol = 3, dimnames = list(c("V2","V4","V6"),c("V2","V4","V6")))
answer[1,1] <- 2
answer[1,2] <- 7
answer[1,3] <- 8
answer[2,1] <- 7
answer[2,2] <- 5
answer[2,3] <- 11
answer[3,1] <- 8
answer[3,2] <- 11
answer[3,3] <- 6
I understand I can do this with 2 nested for loops, but surely there must be a better way no? Thanks!
This could be done with the right use of expand.grid and rowSums:
n = counts[, counts > 0]
answer = matrix(rowSums(expand.grid(n, n)), nrow=length(n), dimnames=list(names(n), names(n)))
diag(answer) = n
To show how it works, n would end up being:
V2 V4 V6
 2  5  6
and expand.grid(n, n) would be:
Var1 Var2
1 2 2
2 5 2
3 6 2
4 2 5
5 5 5
6 6 5
7 2 6
8 5 6
9 6 6
The last line (diag) is necessary because otherwise the diagonal would be twice the original vector (adding 2+2, 5+5, or 6+6).
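An equivalent way to build the same pairwise-sum table, shown here only as a sketch, is outer(), which carries the names of n into the dimnames automatically; the diagonal still has to be reset afterwards (n and answer as defined above).
answer2 <- outer(n, n, "+")
diag(answer2) <- n
all(answer2 == answer)   # should be TRUE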

Calculate cumulative sums of certain values

Assume you have a data frame like this:
df <- data.frame(Nums = c(1,2,3,4,5,6,7,8,9,10), Cum.sums = NA)
> df
Nums Cum.sums
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
and you want an output like this:
Nums Cum.sums
1 1 0
2 2 0
3 3 0
4 4 3
5 5 5
6 6 7
7 7 9
8 8 11
9 9 13
10 10 15
The 4th element of the column Cum.sums is the sum of 1 and 2, the 5th element is the sum of 2 and 3, and so on...
That is, I want to build a running sum over the Nums column and store it in the Cum.sums column. However, I don't want the normal cumulative sum, but rather the sum of the element 2 rows above the current row plus the element 3 rows above the current row.
I already tried to play around a little with the sum and cumsum functions, but I failed.
Any ideas?
Thanks!
You could use the embed function to create the appropriate lags, rowSums to sum, then lag appropriately (I used head).
df$Cum.sums[-(1:3)] <- head(rowSums(embed(df$Nums,2)),-2)
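A small sketch of what embed() produces here: each row pairs an element of Nums with the element before it, so rowSums() gives Nums[k] + Nums[k-1], and head(..., -2) drops the last two sums so everything lines up with rows 4 to 10.
head(embed(df$Nums, 2), 3)
##      [,1] [,2]
## [1,]    2    1
## [2,]    3    2
## [3,]    4    3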
You don't need any special function, just use normal vector operations (these solutions are all equivalent):
df$Cum.sums[-(1:3)] <- head(df$Nums, -3) + head(df$Nums[-1], -2)
or
with(df, Cum.sums[-(1:3)] <- head(Nums, -3) + head(Nums[-1], -2))
or
df$Cum.sums[-(1:3)] <- df$Nums[1:(nrow(df)-3)] + df$Nums[2:(nrow(df)-2)]
I believe the first 3 sums SHOULD be NA, not 0, but if you prefer zeroes, you can initialize the sums first:
df$Cum.sums <- 0
Another solution, elegant and general, uses matrix multiplication, and is therefore very inefficient for large data. It's not very practical, but a nice exercise:
len <- nrow(df)
sr <- 2     # number of rows to sum
lag <- 3
mat <- matrix(
  head(c(
    rep(0, lag * len),
    rep(rep(1:0, c(sr, len - sr + 1)), len)
  ), len * len),
  nrow = len, byrow = TRUE
)
mat %*% df$Nums
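As a quick check (sketch), plugging the product back into the data frame reproduces the desired Cum.sums column, with zeroes in the first three rows because the first three rows of mat are all zero.
df$Cum.sums <- c(mat %*% df$Nums)
df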
