How to calculate the distance between an array and a matrix in R

Consider a matrix A and an array b. I would like to calculate the distance between b and each row of A. For instance consider the following data:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
I would expect as output some array of the form:
distance_array = c(0, 11.18, 22.36)
where the value 11.18 comes from the Euclidean distance between A[2,] and b:
sqrt(sum((A[2,] - b)^2))
This seems pretty basic, but so far all the R functions I have found compute distance matrices between all pairs of rows of a single matrix, not this vector-to-matrix calculation.

I would recommend putting the rows of A in a list instead of a matrix, as it might allow for faster processing. But here's how I would do it for your example:
A <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), 3, 5, byrow=TRUE)
b <- c(1, 2, 3, 4, 5)
apply(A, 1, function(x) sqrt(sum((x - b)^2)))
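An equivalent vectorized alternative (a sketch, not part of the original answer) avoids apply entirely by subtracting b from every row of A with sweep:
# subtract b from each row, square, sum across each row, take the square root
sqrt(rowSums(sweep(A, 2, b)^2))
# [1]  0.00000 11.18034 22.36068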

Related

R function for calculating difference for points in a vector

I am trying to create a function to calculate differences between points in a vector. Ultimately, I want to apply the cumulative distribution function (pnorm) to find the probability that a given point (i) is different from another point (j), and then sum that probability over all observations where i does not equal j. Specifically, for each i:

sum over all j != i of pnorm((a[i] - a[j]) / (b[i] + b[j]))
I would like to implement that formula for each observation in my dataset, so that each observation has a summed probability associated with it. Using the following simple example:
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
dat <- cbind(a, b)
For i = 1 and j = 2, this would be:
pnorm((1 - 2) / (6 + 7))
And then repeat for j equal to element 2 (2, 7) through j equal to element 5 (5, 10). That would give the result for i = 1, which I'd then repeat for i = 2:5.
Thank you!
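One possible implementation (a hedged sketch, assuming the sum runs over all j not equal to i) builds every pairwise term at once with outer and zeroes out the diagonal:
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
p <- pnorm(outer(a, a, "-") / outer(b, b, "+"))   # p[i, j] = pnorm((a[i] - a[j]) / (b[i] + b[j]))
diag(p) <- 0        # drop the i == j terms
rowSums(p)          # one summed probability per observation i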

Calculating a value to determine the trend in a set of numeric values

I have a requirement where I have a set of numeric values, for example: 2, 4, 2, 5, 0.
As we can see, the trend in the above set of numbers is mixed, but since the latest number is 0, I would consider the value to be going DOWN. Is there any way to measure the trend (whether it is going up or down)?
Is there any R package available for that?
Thanks
Suppose your vector is c(2, 4, 2, 5, 0) and you want to know whether the last value is increasing, constant, or decreasing relative to the previous one. Then you could use the diff function with a lag of 1 and check the sign of the last difference. Below is an example.
MyVec <- c(2, 4, 2, 5, 0)
Lagged_vec <- diff(MyVec, lag = 1)           # differences between consecutive values
last_diff <- Lagged_vec[length(Lagged_vec)]  # change from the second-to-last value to the last
if (last_diff < 0) {
  print("Decreasing")
} else if (last_diff == 0) {
  print("Constant")
} else {
  print("Increasing")
}
Please let me know if this is what you wanted.
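If what you want instead is a single overall trend for the whole series, one common alternative (a sketch, not part of the answer above, and needing no extra package) is the sign of a least-squares slope:
MyVec <- c(2, 4, 2, 5, 0)
slope <- coef(lm(MyVec ~ seq_along(MyVec)))[2]   # fitted linear trend per step
if (slope < 0) print("Downward") else if (slope > 0) print("Upward") else print("Flat")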

Interaction of sample and rnorm function in R?

I am setting up my R code for a Monte Carlo simulation, and I need a sample of one number from a random distribution. To test how the sample function behaves in R, I ran the code below; however, I do not understand the reason for the different results.
x <- rnorm(1,8,0)
x
#8
y <- sample(x = rnorm(1, 8, 0), size = 1)
y
#4
Quoting ?sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x.
Since rnorm(1, 8, 0) evaluates to exactly 8 (the standard deviation is 0), you're actually drawing from 1:8, i.e. c(1, 2, 3, 4, 5, 6, 7, 8), and not from c(8).
However, it works if we draw from a "character" vector, which is never interpreted as 1:x:
as.numeric(sample(as.character(rnorm(1,8,0)), size=1))
# [1] 8
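As an aside, the examples in ?sample suggest a safe wrapper for exactly this situation, avoiding the 1:x surprise when the input may have length 1:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(rnorm(1, 8, 0), size = 1)
# [1] 8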

Bray-Curtis Pairwise Analysis in R

I am trying to calculate and visualize the Bray-Curtis dissimilarity between communities at paired/pooled sites using the Vegan package in R.
Below is a simplified example dataframe:
Site = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
PoolNumber = c(1, 3, 4, 2, 4, 1, 2, 3, 4, 4)
Sp1 = c(3, 10, 7, 0, 12, 9, 4, 0, 4, 3)
Sp2 = c(2, 1, 17, 1, 2, 9, 3, 1, 6, 7)
Sp3 = c(5, 12, 6, 10, 2, 4, 0, 1, 3, 3)
Sp4 = c(9, 6, 4, 8, 13, 5, 2, 20, 13, 3)
df = data.frame(Site, PoolNumber, Sp1, Sp2, Sp3, Sp4)
"Site" is a variable indicating the location where each sample was taken
The "Sp" columns indicate abundance values of species at each site.
I want to compare pairs of sites that have the same "PoolNumber" and get a dissimilarity value for each comparison.
Most examples suggest I should create a matrix with only the "Sp" columns and use this code:
matrix <- df[,3:6]
braycurtis = vegdist(matrix, "bray")
hist(braycurtis)
However, I'm not sure how to tell R which rows to compare if I eliminate the columns with "PoolNumber" and "Site". Would this involve organizing by "PoolNumber", using this as a row name and then writing a loop to compare every 2 rows?
I am also finding the output difficult to interpret. Lower Bray-Curtis values indicate more similar communities (closer to a value of 0), while higher values (closer to 1) indicate more dissimilar communities, but is there a way to tell directionality, which one of the pair is more diverse?
I am a beginner R user, so I apologize for any misuse of terminology/formatting. All suggestions are appreciated.
Thank you
Do you mean that you want to get a subset of dissimilarities with equal PoolNumber? The vegdist function will get you all dissimilarities, and you can pick your pairs from those. This is easiest when you first transform dissimilarities into a symmetric matrix and then pick your subset from that symmetric matrix:
braycurtis <- vegdist(df[,3:6])
as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4]
as.dist(as.matrix(braycurtis)[df$PoolNumber==4,df$PoolNumber==4])
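To repeat this for every pool at once, a possible sketch (not part of the original answer) splits the row indices by PoolNumber and subsets the square matrix in a single pass:
bc <- as.matrix(braycurtis)
# one within-pool dissimilarity object per PoolNumber
lapply(split(seq_len(nrow(df)), df$PoolNumber),
       function(idx) as.dist(bc[idx, idx, drop = FALSE]))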
If you only want to have averages, vegan::meandist function will give you those:
meandist(braycurtis, df$PoolNumber)
Here diagonal values will be mean dissimilarities within PoolNumber and off-diagonal mean dissimilarities between different PoolNumbers. Looking at the code of vegan::meandist you can see how this is done.
Bray-Curtis dissimilarities (like all normal dissimilarities) are a symmetric measure with no concept of which member of a pair is more diverse. You can assess how diverse each site is, but first you need to define what you mean by "diverse" (diversity in the strict sense, or something else?). Then you just use those values in your calculations.
If you just want to look at the number of items (species), the following function will give you the differences in the lower triangle (the upper triangle values will be the same with the sign switched):
designdist(df[,3:6], "A-B", "binary")
Alternatively you can work with row-wise statistics and see their differences. This is an example with Shannon-Weaver diversity index:
H <- diversity(df[,3:6])
outer(H, H, "-")
To get the subsets, work in the same way as with the Bray-Curtis index above.

Simple example calculating Mahalanobis distance between two groups in R

I'm trying to reproduce this example using Excel to calculate the Mahalanobis distance between two groups.
To my mind the example provides a good explanation of the concept; however, I'm not able to reproduce it in R.
The result obtained in the example using Excel is Mahalanobis(g1, g2) = 1.4104.
Following the answer given here for R, I applied it to the data above as follows:
# dataset used in the Excel example
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3), ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)
# function adopted from R example
D.sq <- function(g1, g2) {
  # difference between the group mean vectors
  dbar <- as.vector(colMeans(g1) - colMeans(g2))
  S1 <- cov(g1)
  S2 <- cov(g2)
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  # pooled covariance matrix (unbiased estimate)
  V <- as.matrix((1 / (n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
  # squared Mahalanobis distance
  D.sq <- t(dbar) %*% solve(V) %*% dbar
  res <- list()
  res$D.sq <- D.sq
  res$V <- V
  res
}
D.sq(g1,g2)
and executing the function on the data returns the following output:
$D.sq
         [,1]
[1,] 1.724041

$V
          [,1]      [,2]
[1,] 3.5153846 0.3153846
[2,] 0.3153846 2.2230769
As far as I know, $D.sq represents the squared distance, and 1.724 differs substantially from the 1.4104 result in the Excel example. As I'm new to the concept of the Mahalanobis distance, I was wondering whether I did something wrong and/or whether there's a better way to calculate this, e.g. using mahalanobis()?
The reasons why you get a different result are:
1. The Excel algorithm differs from the R algorithm in how the pooled covariance matrix is calculated: the R version gives you the unbiased estimate of the covariance matrix, while the Excel version gives you the MLE estimate. The R version calculates the matrix as ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2), while the Excel version uses ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2).
2. The last calculation step in the Excel post you refer to is incorrect; the result should be 1.989278 instead.
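A quick way to check this (a hedged sketch reusing g1, g2 and the 1.724041 result from above) is to redo the pooling with the n1 + n2 divisor and recompute the quadratic form:
n1 <- nrow(g1); n2 <- nrow(g2)
dbar <- colMeans(g1) - colMeans(g2)
# pooled covariance the "Excel way": MLE divisor n1 + n2
V.mle <- (((n1 - 1) * cov(g1)) + ((n2 - 1) * cov(g2))) / (n1 + n2)
drop(t(dbar) %*% solve(V.mle) %*% dbar)
# [1] 1.989278  -- the corrected value mentioned above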
Edit:
The unbiased estimator for pooled covariance matrix is the standard way, as is in the Wikipedia page: https://en.wikipedia.org/wiki/Pooled_variance . A related fact is that in R, when you use cov or var, you get an unbiased estimator instead of MLE estimator for covariance matrix.
Edit2:
The mahalanobis function in R calculates the Mahalanobis distance from points to a distribution. It does not calculate the Mahalanobis distance between two samples.
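For example, a hedged illustration of that point, reusing g1 and g2 from above:
# squared Mahalanobis distances of each row of g1 from the distribution
# described by g2's mean vector and covariance matrix
mahalanobis(g1, center = colMeans(g2), cov = cov(g2))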
Conclusion: the most standard way to calculate the Mahalanobis distance between two samples is the R code in the original post, which uses the unbiased estimator of the pooled covariance matrix.
