Enhancing speed / vectorization of a for loop including the sample function in R

I am looking for a fast way to create a matrix of integer values where each value is chosen with a given probability. Given a vector L = c(3,4,2) and a probability vector Prob = c(0.4,0.35,0.25,0.1,0.25,0.4,0.25,0.6,0.4) with sum(L) elements, I want to choose, for example, an element of 1:L[1] = 1:3 with probabilities Prob[1:L[1]] = c(0.4,0.35,0.25). This should be performed for every element of L and repeated a number of times determined by the parameter rows, with the results stored in a matrix named POP.
My solution is very slow because of its two for loops, and I am searching for a solution with much better performance through vectorization or other techniques.
My solution to this problem looks as follows:
L = c(3,4,2)
L_cum = c(0, cumsum(L))  # offsets used to slice vector sections out of Prob
Prob = c(0.4,0.35,0.25,0.1,0.25,0.4,0.25,0.6,0.4)  # probability vector for sum(L) elements
rows = 5  # number of rows of matrix POP
POP = matrix(0, rows, length(L))
for (i in 1:rows) {
  for (j in 1:length(L)) {
    POP[i, j] = sample(1:L[j], 1, prob = Prob[(L_cum[j] + 1):L_cum[j + 1]])
  }
}

I'd just try:
set.seed(1234)
# set the number of extractions (rows of the result)
n <- 10
vapply(split(Prob, rep(seq_along(L), L)),
       function(x) sample(length(x), n, replace = TRUE, prob = x),
       integer(n))
# 1 2 3
# [1,] 1 4 1
# [2,] 2 2 1
# [3,] 2 3 1
# [4,] 2 1 1
# [5,] 3 3 1
# [6,] 2 4 2
# [7,] 1 3 1
# [8,] 1 3 2
# [9,] 2 3 2
#[10,] 2 3 1
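The key is the grouping vector: rep(seq_along(L), L) labels each element of Prob with the index of the section it belongs to, and split() then cuts Prob into one probability vector per column, so sample() is called once per column instead of once per cell. The intermediate pieces look like this:
rep(seq_along(L), L)
# [1] 1 1 1 2 2 2 2 3 3
split(Prob, rep(seq_along(L), L))
# $`1`
# [1] 0.40 0.35 0.25
#
# $`2`
# [1] 0.10 0.25 0.40 0.25
#
# $`3`
# [1] 0.6 0.4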


Count instances of overlap in two vectors in R

I am hoping to create a matrix that counts, for each pair of levels of a grouping variable, the instances of overlapping values of a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
metas studies
1 1 1
2 1 3
3 1 2
4 2 1
5 3 2
6 3 3
I would like it to return:
v1 v2 v3
1 3 1 2
2 1 1 0
3 2 0 2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.
I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
             function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
# debug at #1: [1] 1
# Browse[2]>
a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]>
b
# [1] 1 1 1 2 2 2 3 3 3
# Browse[2]>
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing each comparison twice (once for 1 and 3, once for 3 and 1, with the same result), so this is a little inefficient ... it would be better to compute only the upper (or lower) triangle and mirror it to the other half.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating self-intersections):
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ]  # mirror the upper triangle; the transpose keeps the pairing correct beyond 3x3
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
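As a side note, not part of the answers above: because each row of df records one study's membership in one meta-analysis, the whole symmetric count matrix can also be read off an incidence table in a single step. A minimal sketch, assuming each study appears at most once per meta-analysis:
tab <- as.matrix(table(df$metas, df$studies))  # incidence matrix: metas x studies
tcrossprod(tab)  # tab %*% t(tab) counts shared studies per pair
#   1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2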
Same idea as @r2evans, also base R (and a bit less eloquent) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Test for equality between the values in the metas vector and the rest of
# the values in the data frame, counting the matches per unique value:
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
# Construct a symmetric matrix from the vector:
m1 <- diag(v1)
m1[, 1] <- m1[1, ] <- v1
# Coerce the matrix to a data frame, setting the names as desired and dropping non-matches:
keep_r <- which(rowSums(m1) > 0)
keep_c <- which(colSums(m1) > 0)
df_2 <- setNames(data.frame(m1[keep_r, keep_c]), paste0("v", seq_along(keep_c)))

multiple loops on matrix

I have the information below:
coordinate <- read.table(text = " 18.915 13.462 31.598
17.898 14.453 32.160
18.220 15.420 32.853
19.208 12.313 32.573
20.393 11.524 32.110
20.344 10.809 31.085
21.595 16.610 29.912")
amnumber <- c(1,1,2,3,3,3,4)
atname <- as.data.frame(c("A","B","A","C","D","C","H"), stringsAsFactors = FALSE)
library(geometry)
tri <- delaunayn(coordinate)
tri
[,1] [,2] [,3] [,4]
[1,] 1 3 7 2
[2,] 4 1 6 2
[3,] 4 1 3 2
[4,] 4 1 3 7
[5,] 5 4 3 7
[6,] 5 1 6 7
[7,] 5 4 1 7
[8,] 5 4 1 6
I want to loop over the tri matrix such that value 1 in the first row is related to each of the other values in that row (3, 7, and 2), so the output matrix gets a 1 at those index pairs. Then value 3 of the first row is related to the two remaining values (7 and 2), and so on. The output would be a matrix that contains only 0/1 values. To this end I wrote the loops below:
for (k in 1:nrow(tri)) {
  for (i in 1:4) {
    for (j in i + 1) {
      c <- abs(amnumber[tri[k,i]] - amnumber[tri[k,j]])
      if (c >= 1) {
        if (!((atname[tri[k,i],] %in% "N") && (atname[tri[k,j],] %in% "C") && (c %in% 1) ||
              (atname[tri[k,i],] %in% "C") && (atname[tri[k,j],] %in% "N") && (c %in% 1))) {
          d <- sqrt(sum((coordinate[tri[k,i],] - coordinate[tri[k,j],])^2))
          if (d <= tridist) {
            adj_tri[tri[k,i],tri[k,j]] <- 1
            adj_tri[tri[k,j],tri[k,i]] <- 1
            adj_tri[is.na(adj_tri)] <- 0
          }
        }
      }
    }
  }
}
But it did not work and I got an error: when i equals the number of columns of tri, j = i + 1 runs past the last column, so I think the problem is in the third loop. However, I could not fix it. Any help would be appreciated.
Besides, this is too slow; would you please help me change it to lapply to speed things up?
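A minimal sketch of a fix, assuming a hypothetical distance cutoff tridist (the question does not define one) and a pre-allocated adj_tri: the key change is that j loops over (i + 1):ncol(tri) while i stops one column short, so every vertex pair within a tetrahedron is compared and j never leaves the matrix:
tridist <- 6  # hypothetical cutoff; not given in the question
n <- nrow(coordinate)
adj_tri <- matrix(0, n, n)  # pre-allocating zeros removes the need for NA cleanup
for (k in 1:nrow(tri)) {
  for (i in 1:(ncol(tri) - 1)) {    # stop one column early ...
    for (j in (i + 1):ncol(tri)) {  # ... so j covers every later column
      a <- tri[k, i]; b <- tri[k, j]
      dif <- abs(amnumber[a] - amnumber[b])
      if (dif >= 1) {
        bonded <- (atname[a, ] == "N" && atname[b, ] == "C" && dif == 1) ||
                  (atname[a, ] == "C" && atname[b, ] == "N" && dif == 1)
        if (!bonded) {
          d <- sqrt(sum((coordinate[a, ] - coordinate[b, ])^2))
          if (d <= tridist) adj_tri[a, b] <- adj_tri[b, a] <- 1
        }
      }
    }
  }
}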

How to identify all possible permutations of a time series according to order of permutation

I am trying to figure out a way to translate a financial time series into a symbolic time series that accounts for all "meaningful" permutations according to a given order (in R):
Example:
Given a time series ts = c(1,2,3,4,5)
If Order=2 I would like to extract the following patterns:
1) 1 1 (ts[i]==ts[i+1])
2) 1 2 (ts[i]<ts[i+1])
3) 2 1 (ts[i]>ts[i+1])
(pattern 2 2 is redundant since equality is accounted for via pattern 1 1)
If Order=3 I would like to extract the following patterns:
1) 1 2 3 (ts[i]<ts[i+1]<ts[i+2])
2) 1 2 2 (ts[i]<ts[i+1]==ts[i+2])
3) 1 2 1 (ts[i]<ts[i+1]>ts[i+2])
4) 2 2 3 (ts[i]==ts[i+1]<ts[i+2])
5) 2 2 2 (ts[i]==ts[i+1]==ts[i+2])
6) 2 2 1 (ts[i]==ts[i+1]>ts[i+2])
7) 3 2 1 (ts[i]>ts[i+1]>ts[i+2])
8) 3 2 2 (ts[i]>ts[i+1]==ts[i+2])
9) 3 2 3 (ts[i]>ts[i+1]<ts[i+2])
What I am looking for is a scalable (in terms of Order being 2, 3, 4, 5, and so on) and automated (function-wise) way to do this.
I have been struggling with packages such as "permute", "gtools", and "combinat", but to no avail. I think what I seek is a special case of permutations. Can anyone help me with this problem?
My quest started from reading papers on "Permutation Entropy"; a Google Scholar search will provide relevant bibliography for anyone further interested.
Try this:
library(zoo)
ts <- c(1,3,2,4,5,4,3,3,2)
rollapply(ts, 2, rank, ties='min')
[,1] [,2]
[1,] 1 2
[2,] 2 1
[3,] 1 2
[4,] 1 2
[5,] 2 1
[6,] 2 1
[7,] 1 1
[8,] 2 1
When order = 3:
rollapply(ts, 3, rank, ties='min')
[,1] [,2] [,3]
[1,] 1 3 2
[2,] 2 1 3
[3,] 1 2 3
[4,] 1 3 1
[5,] 3 2 1
[6,] 3 1 1
[7,] 2 2 1
That's not quite what you want but it is close. The main issue is seen in the first two rows where you don't wish to distinguish the ranks of the first and third values when both are higher or lower than the middle observation. Here is a fix.
z <- rollapply(ts, 3, rank, ties='min')
lohilo <- z[,1] < z[,2] & z[,3] < z[,2]
hilohi <- z[,1] > z[,2] & z[,3] > z[,2]
z[lohilo,] <- rep(c(1,2,1),rep(sum(lohilo),3))
z[hilohi,] <- rep(c(2,1,2),rep(sum(hilohi),3))
z
[,1] [,2] [,3]
[1,] 1 2 1
[2,] 2 1 2
[3,] 1 2 3
[4,] 1 2 1
[5,] 3 2 1
[6,] 3 1 1
[7,] 2 2 1
Permutations of a time series are computed within the function below, specifically made for permutation entropy (Source):
# Function to compute the ordinal patterns for a given time series.
# Input (2 arguments; null arguments are not valid)
#   x   = Given time series (type = numeric vector)
#   dim = Embedding dimension (type = numeric)
#         Commonly used values of dim range from 3 to 7
# Output is a numeric vector of size = (dim)!
ordinal_pattern <- function(x, dim) {
  # Generate ordinal numbers to assign. For example, if dim = 3, then
  # the ordinal numbers are 0, 1, 2
  ordinal_numbers <- seq(0, (dim - 1), by = 1)
  # Compute all possible permutations of the ordinal numbers.
  # Maximum size of possible_pattern = dim!
  possible_pattern <- combinat::permn(ordinal_numbers)
  # Initialize result. Result is the output.
  result <- 0
  result[1:length(possible_pattern)] <- 0
  # Loop for computation of ordinal patterns
  for (i in 1:(length(x) - (dim - 1))) {
    temp <- x[i:(i + (dim - 1))]
    tempseq <- seq(0, dim - 1, by = 1)
    tempdata <- data.frame(temp, tempseq)
    tempdata <- tempdata[order(temp), ]
    for (j in 1:length(possible_pattern)) {
      if (all(possible_pattern[[j]] == tempdata$tempseq)) {
        result[j] <- result[j] + 1
      }
    }
  }
  return(result)
}
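A brief usage sketch, assuming the function above and the combinat package are installed. Each window of length dim matches exactly one permutation (order() breaks ties by position), so the counts sum to length(x) - dim + 1; the normalized permutation entropy the question alludes to is then the Shannon entropy of the pattern distribution divided by log2(dim!):
x <- c(1,3,2,4,5,4,3,3,2)
counts <- ordinal_pattern(x, dim = 3)
sum(counts)  # 7, i.e. length(x) - 3 + 1 windows
p <- counts[counts > 0] / sum(counts)
-sum(p * log2(p)) / log2(factorial(3))  # normalized permutation entropy in [0, 1]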

Printing the sorted elements of a matrix in descending order with array indices in the fastest fashion

This seems like a simple problem but I am having trouble doing this in a fast manner.
Say I have a matrix and I want to sort its elements in descending order while storing the indices of each element. Is there a quick way to do this? Right now, I am extracting the maximum, storing the result, overwriting that entry with -2, and then extracting the next maximum in a for loop, which is probably the most inefficient way to do it.
My problem actually requires me to work on a 20,000 X 20,000 matrix. Memory is not an issue. Any ideas about the fastest way to do it would be great.
For example if I have a matrix
m <- matrix(c(1,4,2,3), 2, 2)
m
     [,1] [,2]
[1,]    1    2
[2,]    4    3
I want the result to indicate the numbers in descending order:
row col val
2 1 4
2 2 3
1 2 2
1 1 1
Here's a possible data.table solution
library(data.table)
rows <- nrow(m) ; cols <- ncol(m)
res <- data.table(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
setorder(res, -val)
res
# row col val
# 1: 2 1 4
# 2: 2 2 3
# 3: 1 2 2
# 4: 1 1 1
Edit: a base R alternative
res <- cbind(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
res[order(-res[, 3]),]
# row col val
# [1,] 2 1 4
# [2,] 2 2 3
# [3,] 1 2 2
# [4,] 1 1 1
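Another base R option worth sketching: let order() sort the values once and recover the array indices with arrayInd(), which avoids building the row/col columns by hand:
ord <- order(m, decreasing = TRUE)  # positions of the values, sorted descending
out <- cbind(arrayInd(ord, dim(m)), m[ord])
colnames(out) <- c("row", "col", "val")
out
#      row col val
# [1,]   2   1   4
# [2,]   2   2   3
# [3,]   1   2   2
# [4,]   1   1   1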

Converting a frequency matrix to a list of ordered pairs in R

Let's say I have a 3x3 matrix of frequency values. Each cell in this matrix represents the number of entities detected at a certain location (the specifics about these locations don't matter for our purposes). For example, there were 0 entities detected in Row 1, Column 1; there were 3 entities detected in Row 2, Column 3; etc.
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 2 3
[3,] 0 1 1
I want to convert this matrix to an nx2 data frame, where n is the total number of entities detected in all locations. In this case, there should be 9 rows because there were a total of 9 entities detected. In the data frame we create, column 1 should contain the row index of the entity, and column 2 should contain the column index of the entity.
We can use the following code to accomplish this, but this method is pretty slow for larger matrices.
mat <- matrix(c(0,1,0,1,2,3,0,1,1), nrow = 3, byrow = TRUE)
x <- rep(NA, 9)
y <- rep(NA, 9)
count <- 0
for (i in 1:3) {
  for (j in 1:3) {
    while (mat[i,j] > 0) {
      count <- count + 1
      x[count] <- i
      y[count] <- j
      mat[i,j] <- mat[i,j] - 1  # note: this zeroes out mat as a side effect
    }
  }
}
df <- data.frame(x, y)
This code gives us the following 9x2 data frame:
x y
1 2
2 1
2 2
2 2
2 3
2 3
2 3
3 2
3 3
The way we interpret this data frame is by saying there was one entity detected at location (1,2), one entity detected at location (2,1), two entities detected at location (2,2), and so on. This output is correct, but I would prefer to use a faster method to obtain it.
Is there a better way to do this?
The arr.ind parameter delivers the starting point, namely the indices that have non-zero entries, and then you can rep()-eat them by the entry values:
dat <- matrix(c(0,1,0,1,2,3,0,1,1), nrow = 3, byrow = TRUE)  # a fresh copy (the loop above zeroed out mat)
idxs <- which(dat > 0, arr.ind = TRUE)
idxs[ rep(1:nrow(idxs), dat[dat > 0]), ]
row col
[1,] 2 1
[2,] 1 2
[3,] 2 2
[4,] 2 2
[5,] 3 2
[6,] 2 3
[7,] 2 3
[8,] 2 3
[9,] 3 3
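One caveat: which() scans column-major, so the pairs above come out in column order, while the question's loop emits them row-major. A short sketch to reproduce the question's ordering exactly:
res <- idxs[ rep(1:nrow(idxs), dat[dat > 0]), ]
res[order(res[, "row"], res[, "col"]), ]
#       row col
#  [1,]   1   2
#  [2,]   2   1
#  [3,]   2   2
#  [4,]   2   2
#  [5,]   2   3
#  [6,]   2   3
#  [7,]   2   3
#  [8,]   3   2
#  [9,]   3   3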
