So I have a rather complex (at least for me) problem in R.
I want to calculate distances between pairs of distributions, for nearly 10k pairs.
I have a distance function from the package philentropy, which takes two vectors x and y and calculates the distance between them, like so:
d <- distance(x, y, method="desired_method")
Another option is to create a matrix with each row representing a distribution, so that the function will calculate all pairwise distances among all distributions in the matrix:
d <- distance(x, method="desired_method")
I have two correlation matrices a and b with nearly 10k rows each, corresponding to 10k correlation distributions. Both matrices have the same number of rows, and my goal is to contrast the first row of matrix a with the first row of matrix b, the second row of a with the second row of b, and so on, iteratively.
I can select each desired pair of rows and use the first form of distance, or I can stack the two matrices with rbind and compute all pairwise distances with the second form.
The problem is that, with the first approach, I do not know how to write a for loop that iteratively takes the nth row of each matrix, performs the distance calculation, and stores the result in a vector.
Additionally, if I go with the second option, I do not want all pairwise distances, just the distances corresponding to:
d[i,i+nrow(a)]
and to do so iteratively, generating a corresponding vector of nrow(a) values.
Any help?
If you have two matrices, mat_x and mat_y, each with the same number of rows, then the for loop would be:
answer <- vector(mode = 'numeric', length = 10000L)
for (i in 1:10000) {
  answer[[i]] <- distance(mat_x[i, ], mat_y[i, ], method = "desired_method")
}
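If you would rather use the second form of distance on the stacked matrix, the d[i, i + nrow(a)] entries can be pulled out in one indexing step rather than a loop. A sketch, assuming distance returns the full symmetric pairwise matrix when given the stacked matrix (note this computes far more distances than you need, so it costs much more time and memory than the loop above):
d_all  <- distance(rbind(mat_x, mat_y), method = "desired_method")
n      <- nrow(mat_x)
# matrix indexing: row i of mat_x versus row i of mat_y
answer <- d_all[cbind(seq_len(n), seq_len(n) + n)]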
I would like to extract, from a set of reference coordinates, the coordinate that is closest to a test coordinate.
The task is very similar to a previously posted question (Find the approximate value in the vector), but adapted to n-dimensional cases and with multiple inputs.
In other words, given:
test = t(data.frame(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
))
reference = t(data.frame(
  c(1, 0, 0.5),
  c(2, 2, 2),
  c(3.3, 3.3, 3.3),
  c(9, 9, 9),
  c(10, 11, 12)
))
result <- approximate(test, reference)
# expected result:
#  1  0  0.5
#  9  9  9
# 10 11 12
I programmed a function using Euclidean distances and old-school loops, but when the input data frames are big it results in very long execution times.
Can anyone figure out a more efficient way of doing this?
Thank you in advance.
PS: This is the function I created; it works but takes a while (in case someone finds it useful):
approximate_function <- function(approximate, reference){
  # Function that returns, for each row of `approximate`, the closest row of `reference`.
  # It uses Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the data frame is the dimension of the points.

  # Helper function to calculate the Euclidean distance between two points
  distance_function <- function(a, b){
    squaresum <- 0
    for(id in 1:length(a)){
      squaresum <- squaresum + (a[id] - b[id])^2
    }
    result <- sqrt(squaresum)
    return(result)
  }

  result <- data.frame()
  # Take one item to approximate at a time
  for(id_approximate in 1:nrow(approximate)){
    distance <- c()
    # Compare the point with the reference points and choose the one with the smallest distance
    for(id_reference in 1:nrow(reference)){
      distance[id_reference] <- distance_function(approximate[id_approximate, ], reference[id_reference, ])
    }
    result <- rbind(
      result,
      reference[which.min(distance), ]
    )
  }
  return(result)
}
This way the calculation is done instantly.
approximate_function <- function(approximate, reference){
  # Function that returns, for each row of `approximate`, the closest row of `reference`.
  # It uses Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the data frame is the dimension of the points.
  results <- data.frame()
  # Take one item to approximate at a time
  for(id in 1:nrow(approximate)){
    # Calculate the Euclidean distances, regardless of the dimension
    sumsquares <- rep(0, nrow(reference))
    for(dim in 1:ncol(approximate)){
      sumsquares <- sumsquares + (approximate[id, dim] - reference[, dim])^2
    }
    distances <- sqrt(sumsquares)
    results <- rbind(
      results,
      reference[which.min(distances), ]
    )
  }
  return(results)
}
You've got a few calculations that will be slow.
First:
test = t(data.frame(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
))
This one probably doesn't matter, but it would be better as
test = rbind(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
)
Same for setting up reference.
Second and third: You set up result as a dataframe, then add rows to it one at a time. Dataframes are much slower for row operations than matrices, and gradually growing structures in R is slow. So set it up as a matrix from the beginning at the right size, and assign results into specific rows.
EDITED to add:
Fourth: there's no need for the inner loop. You can calculate all the squared differences in one big matrix, then use rowSums or colSums to get the squared distances. This is easiest if you work with matrix columns instead of rows, because vectors are recycled across columns automatically.
Fifth: There's no need to take the square root; if the squared distance is minimized, so is the distance.
Here's the result:
approximate <- function(test, reference){
  # transpose the reference so each reference point is a column
  reference <- t(reference)
  # set up the result, not transposed
  result <- test * NA
  # take one item to approximate at a time
  for(id in seq_len(nrow(test))){
    squareddist <- colSums((test[id, ] - reference)^2)
    result[id, ] <- reference[, which.min(squareddist)]
  }
  return(result)
}
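A quick check with the toy data from the question, built with rbind as suggested above, should reproduce the expected result:
test <- rbind(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
)
reference <- rbind(
  c(1, 0, 0.5),
  c(2, 2, 2),
  c(3.3, 3.3, 3.3),
  c(9, 9, 9),
  c(10, 11, 12)
)
approximate(test, reference)
# should return the rows (1, 0, 0.5), (9, 9, 9) and (10, 11, 12)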
Given two matrices MAT1 and MAT2 that contain a set of column vectors with different coordinates for three elements (F, L, G), I would like to test which of the two matrices has the higher entropy. In other words, the data points of each vector are sometimes very close to each other for the same element, and in other cases too far from each other. When a vector is balanced, the points of one class (F, L, or G) are in the same position or very close together.
What is the right way to test this, given that, for example, the number of vector columns may vary from one experiment to another? Some toy data:
MAT1 <- matrix(runif(32), 8, 4)
row.names(MAT1) <- c("F","G","G","L","F","F","L","G")
MAT2 <- matrix(runif(24), 8, 3)
row.names(MAT2) <- c("F","G","G","L","F","F","L","G")
I've been using the R function DescTools::MutInf(MAT1) but as you can see, I am not an expert in this field!
Thank you,
I am trying to generate a data frame that contains all of the results of possible combinations. I'm using the function
combn(x,m)
x <- 17
m <- one value at a time from some range of the numbers between 2 and 16
in a loop where each iteration corresponds to a new value of m. Each iteration of the loop returns a vector of length choose(n, k), where n corresponds to x and k corresponds to m. I want to append each resulting vector as a column in a data frame that contains all of the results, but this is not straightforward since the length of each vector varies. I have been able to accomplish this by first establishing a data frame of NA values (data.frame) that is then incrementally filled with the values of new.vector using the loop below:
n <- max(nrow(data.frame), length(new.vector))
for (l in seq_len(n)) {
  data.frame[l, j-1] <- new.vector[l]
}
I have two questions:
Is there a better way to append a new column whose length differs from the previous columns in the data frame, one that uses the power of R and vector operations rather than doing this via a loop?
Since this method works, I can go with it, but I've struggled to find a way to set the maximum number of rows in the data frame that I initialize. It should be the maximum of choose(n,k1), choose(n,k2), choose(n,k3), ..., choose(n,kn). I'm currently using the code below to initialize the data frame, but it generates the absolute maximum for a given n, which may be more rows than necessary depending on the range of k values.
dataframe <- data.frame(matrix(NA, nrow = ncol(combn(n, floor(n/2))),
                               ncol = max.n - min.n + 1))
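For the row count, the maximum described above (the largest choose(n, k) over the k values actually used) can be computed directly, since choose is vectorized. A small sketch, assuming min.n and max.n bound the m values used in the loop:
n     <- 17
min.n <- 2
max.n <- 16
# largest number of combinations over the k range actually used
n_rows <- max(choose(n, min.n:max.n))
dataframe <- data.frame(matrix(NA, nrow = n_rows, ncol = max.n - min.n + 1))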
I've got two data frames, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one data frame with every row in the other (i.e. 150 x 150 correlations) and plot the distribution of the resulting 22,500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500), 150))
DF2 <- as.data.frame(matrix(runif(1500), 150))

# transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)

# use outer to get all combinations of row numbers and apply a function to them
# 22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              # you need a vectorized function
              # Vectorize takes care of that, but it is just a hidden loop (slow for huge row numbers)
              FUN = Vectorize(function(i, j) cor(m1[i, ], m2[j, ])))

hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
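For example, reusing m1 and m2 from above, this should reproduce the same 150 x 150 matrix of row-wise correlations in a single call (cor correlates columns, so the transposes turn rows into columns), and the distribution can then be plotted directly:
cors2 <- cor(t(m1), t(m2))  # cors2[i, j] is the correlation of row i of m1 with row j of m2
hist(cors2)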
I have a list like this:
list1 <- list(c(12,45,12,0,0), c(12,45,12,0,1), c(14,45,12,0,2), c(12,15,12,0,3), c(12,45,17,0,4))
I want to iterate through this list using foreach in R. The goal is to compare a random vector such as c(1,1,2,0,6) with the vectors in the list. By "compare", I mean I need to calculate the Euclidean distance between these vectors and find the one closest to my random vector.
The most efficient way to calculate the distances is probably with the dist function.
# a random vector
rvec <- c(1, 1, 2, 0, 6)

# a list of coordinates
list1 <- list(c(12, 45, 12, 0, 0),
              c(12, 45, 12, 0, 1),
              c(14, 45, 12, 0, 2),
              c(12, 15, 12, 0, 3),
              c(12, 45, 17, 0, 4))

# calculate distances between the random vector and the list elements:
dist(rbind(rvec, t(matrix(unlist(list1), length(list1)))))[seq_along(list1)]
[1] 46.82948 46.71188 47.12749 20.63977 47.81213
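Since the question also asks for the closest vector itself, a short follow-up sketch, reusing rvec and list1 from above and storing the distances first:
# distances from rvec to every element of list1
d <- dist(rbind(rvec, t(matrix(unlist(list1), length(list1)))))[seq_along(list1)]
# index and value of the closest list element
which.min(d)           # 4
list1[[which.min(d)]]  # 12 15 12  0  3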