Covariance with colinear vectors - r

I'm trying to calculate the covariance of a matrix which has two colinear vectors. I have read that this is impossible with the "cov" function in R.
Does a different function exist in R to calculate the covariance of a matrix which has two colinear vectors (since it works in Matlab and Excel)?
Thank you in advance for your answers

Please consider providing a reproducible example with sample of your data and the corresponding code. Broadly speaking, a covariance matrix can be created with use of the code below:
# Vectors (all of the same length)
V1 <- 1:4
V2 <- 5:8
V3 <- runif(n = 4)
V4 <- runif(n = 4)
# Create matrix
M <- cbind(V1, V2, V3, V4)
# Covariance
cov(M)
I'm guessing that you may be getting the following warning, which cbind raises when the vectors differ in length:
number of rows of result is not a multiple of vector length (arg 1)
You could first try to use the cov function as discussed here.
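For what it's worth, cov() itself does not reject colinear columns; what colinearity affects is the rank of the result. A minimal sketch with made-up data (the names a, a2 and b are illustrative and not taken from the question):

```r
set.seed(42)
a <- rnorm(10)
b <- rnorm(10)

# a2 is an exact multiple of a, so the two columns are colinear
M <- cbind(a = a, a2 = 2 * a, b = b)

# cov() still returns a full 3 x 3 covariance matrix
C <- cov(M)
print(C)

# The matrix is rank-deficient (rank 2 instead of 3), so it cannot be inverted
rank_C <- qr(C)$rank
print(rank_C)
```

If a later step such as solve(C) fails, the cause is the singular matrix rather than cov().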

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I either get an error, get only 1 value back (so it's computing the distance between the entire vectors), or end up with a matrix when all I need is a 4th column with the distance between each pair (columns 'P' and 'S'). I'm a bit confused by matrices so I'm not sure how to work with that result.
I tried making a function and applying it to the 2 columns, but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum(P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Using cbind() instead returns a matrix
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this if you would just like to add another column to your dataframe. Arithmetic on columns is vectorized, so it is computed row by row:
testdf$distance <- with(testdf, sqrt(P^2 + S^2))
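A short sketch of why the apply() attempt in the question fails and how it relates to the one-liner above. With MARGIN = 1, apply() hands each row to the function as a single vector, so the function should take one argument. The column names distance_apply and abs_diff are made up for illustration, and which quantity is wanted depends on what P and S represent:

```r
set.seed(1)
testdf <- data.frame(ID = sample(1:10, 10, replace = FALSE),
                     P  = runif(10, min = 1, max = 3),
                     S  = runif(10, min = 1, max = 3))

# Vectorized version: one value per row, no matrix needed
testdf$distance <- with(testdf, sqrt(P^2 + S^2))

# Equivalent apply() version: each row arrives as one vector c(P, S)
testdf$distance_apply <- apply(testdf[, c("P", "S")], 1,
                               function(row) sqrt(sum(row^2)))

# Assumption: if P and S are two points on a line rather than the x and y
# coordinates of one point, the distance between them is the absolute difference
testdf$abs_diff <- abs(testdf$P - testdf$S)

head(testdf)
```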

correlation matrix of a bunch of categorical variables in R

I have about 20 variables about different cities, stored as factors labeled "Y" or "N". The variables are things like "has co-op" and such. I want to find some correlations and possibly use the corrplot package to display the connections between all these variables. But for some reason I cannot coerce the variables into a form that corrplot or even cor() accepts, so that I can get them into a matrix. I tried:
M <- cor(model.matrix(~.-1,data=mydata[c(25:44)]))
but the results in corrplot came out really weird. Does anyone have a fast way to turn a bunch of Y/N answers into a correlation matrix? Thanks!
You can use the sjp.corr function or sjt.corr function for graphical or tabular output, both from the sjPlot-package.
DF <- data.frame(v1 = sample(c("Y","N"), 100, TRUE),
                 v2 = sample(c("Y","N"), 100, TRUE),
                 v3 = sample(c("Y","N"), 100, TRUE),
                 v4 = sample(c("Y","N"), 100, TRUE),
                 v5 = sample(c("Y","N"), 100, TRUE),
                 stringsAsFactors = TRUE) # as.integer() below needs factors, not character
DF[] <- lapply(DF, as.integer)
library(sjPlot)
sjp.corr(DF)
sjt.corr(DF)
[Images omitted: the plot, and the table as shown in the RStudio viewer pane.]
You can use many parameters to modify the appearance of the plot or table, see some examples here.
For binary variables, you might consider cross tabs (the table function in R).
However, getting the correlation matrix is pretty straightforward:
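# example data
set.seed(1)
DF <- data.frame(x = sample(c("Y","N"), 100, TRUE),
                 y = sample(c("Y","N"), 100, TRUE),
                 stringsAsFactors = TRUE) # as.integer() below needs factors, not character
# how to get correlation
DF[] <- lapply(DF, as.integer)
cor(DF)
# (output from the original answer; exact values vary with the R version's RNG)
#            x          y
# x  1.0000000 -0.0369479
# y -0.0369479  1.0000000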
# visualize it
library(corrplot)
corrplot(cor(DF))
When you convert to integer in this example, "N" is 1 and "Y" is 2. I'm not sure if that holds generally (for R's storage of factors). To have a look at the mapping for your data, try lapply(DF,levels) before converting to integer.
To me, the plot makes sense. If you have questions about the statistical interpretation of correlations in this context, you should consider having a look at http://stats.stackexchange.com
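If relying on the factor's internal integer codes feels fragile, one option is to make the mapping explicit. The sketch below uses made-up data and codes "Y" as 1 and "N" as 0; the names DF and num are illustrative. Because this is only a shift of the factor codes, the correlations are the same as with as.integer():

```r
set.seed(1)
DF <- data.frame(x = sample(c("Y", "N"), 100, TRUE),
                 y = sample(c("Y", "N"), 100, TRUE),
                 stringsAsFactors = TRUE)

# Inspect how the levels are ordered before converting anything
lapply(DF, levels)

# Explicit coding: 1 for "Y", 0 for "N", independent of level order
num <- as.data.frame(lapply(DF, function(col) as.integer(col == "Y")))

M <- cor(num)
print(M)
```

The matrix M can then be passed to corrplot() in the same way as cor(DF) above.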

Using mat2listw function in R to create spatial weights matrix

I am attempting to create a weights object in R with the mat2listw function. I have a very large spatial weights matrix (roughly 22,000x22,000) that was created in Excel and read into R, and I'm now trying to implement:
library(spdep)
SW=mat2listw(matrix)
I am getting the following error:
Error in if (any(x<0)) stop ("values in x cannot be negative"): missing value where TRUE/FALSE needed.
What's going wrong here? My current matrix is all 0's and 1's, with no missing values and no negative elements. What am I missing?
I'd appreciate any advice. Thanks in advance for your help!
Here is a simple test to your previous comment:
library(spdep)
m1 <-matrix(rbinom(100, 1, 0.5), ncol =10, nrow = 10) #create a random 10 * 10 matrix
m2 <- m1 # create a duplicate of the first matrix
m2[5,4] <- NA # assign an NA value in the second matrix
SW <- mat2listw(m1) # create weight list matrix
SW2 <- mat2listw(m2) # create weight list matrix
The first matrix does not fail, but the second one does. The real question is now why your weights matrix contains NAs. Have you considered creating the spatial weights matrix in R, using dnearneigh or another function?
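To see why a single NA produces that particular message, and to locate the offending cells, something like the following base R check may help. The 3 x 3 matrix m is a made-up stand-in for the real weights matrix:

```r
# Toy 0/1 matrix with one missing value at row 2, column 2
m <- matrix(c(0, 1, 1,
              0, NA, 1,
              1, 1, 0), nrow = 3, ncol = 3)

# With an NA present and no negative values, any() returns NA rather than
# FALSE, which is what makes the if () inside mat2listw fail
neg_check <- any(m < 0)
print(neg_check)

# Quick test for missing values anywhere in the matrix
has_na <- anyNA(m)
print(has_na)

# Row and column positions of every NA
na_pos <- which(is.na(m), arr.ind = TRUE)
print(na_pos)
```

On a 22,000 x 22,000 matrix, anyNA() is a cheap first check; blank cells or non-numeric entries in the Excel file are one possible source of NAs after import.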

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numeric features) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
# apply() returns each row's result as a column, so t() is needed
# to get one test observation per row again
NN <- t(apply(d[test.set, design.set], 1, order))
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.
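The same idea can be sketched in base R only, with the Hamming distance computed directly on categorical columns. The toy data and the function name knn_hamming are hypothetical and only illustrate the approach; ties in the vote are resolved by taking the first class in table() order:

```r
# Hypothetical toy data: three categorical features, two classes
train <- data.frame(color = c("red", "red", "blue", "blue", "red", "blue"),
                    shape = c("round", "round", "square", "square", "round", "square"),
                    size  = c("S", "M", "L", "L", "S", "M"),
                    stringsAsFactors = FALSE)
train_y <- c("a", "a", "b", "b", "a", "b")
test <- data.frame(color = c("red", "blue"),
                   shape = c("round", "square"),
                   size  = c("M", "L"),
                   stringsAsFactors = FALSE)

knn_hamming <- function(train, train_y, test, k = 3) {
  train_m <- as.matrix(train)
  test_m  <- as.matrix(test)
  vapply(seq_len(nrow(test_m)), function(i) {
    # Hamming distance: number of features that differ from test row i
    d  <- apply(train_m, 1, function(row) sum(row != test_m[i, ]))
    nn <- order(d)[seq_len(k)]
    # Majority vote among the k nearest training rows
    names(which.max(table(train_y[nn])))
  }, character(1))
}

pred <- knn_hamming(train, train_y, test, k = 3)
print(pred)
```

Here names(which.max(table(...))) is one way to pick the most common value; when the classes are character labels it needs no conversion back to integer.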

dmnorm() function with multiple means

I've been searching for the answer but didn't find any information about this function except the official R docs.
If I want to calculate the values of a one-dimensional normal distribution at the same x with different means or standard deviations, I'll just call
dnorm(x, mu, sigma)
where mu and sigma will be arrays with the desired means and sigmas.
Is there any way to perform the same trick with the dmnorm function from the mnormt module, when x and mu are vectors and sigma is a covariance matrix?
P.S.: Sorry for my English, thanks for answers.
In R, collections of functions are called "packages". If a function is not vectorized in its parameters, you can pass it one parameter as a vector with sapply, or several parameters as parallel lists with mapply. You should also consider the mathematical issue: the 'mean' is no longer a single number but a vector, and sigma (which dmnorm calls 'varcov') is no longer a single number but a matrix. The first example on the help page gives you the densities of 21 different (x, y, z) points for a single mean vector and sigma matrix.
Using that example as a starting point, make a list of 7 (x, y, z) points and 7 varying means and sigmas, and then mapply dmnorm over them:
library(mnormt)
x <- seq(-2, 4)
y <- 2*x + 10
z <- x + cos(y)
mu <- c(1, 12, 2)
Sigma <- matrix(c(1,2,0,2,5,0.5,0,0.5,3), 3, 3)
# 7 shifted copies of Sigma and mu, one per point
lsig <- lapply(seq(-2,4)/10, "+", Sigma)
lmean <- lapply(seq(-2,4)/10, "+", mu)
# each column of the transposed data frame is one (x, y, z) point
mapply(dmnorm, x=as.data.frame(t(cbind(x,y,z)[1:7,])), mean=lmean, varcov=lsig)
#        V1        V2        V3        V4        V5        V6        V7
# 6.177e-06 6.365e-04 5.364e-03 3.309e-02 2.205e-02 6.898e-03 1.077e-03
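The mapply pattern itself does not depend on mnormt. The sketch below uses a hand-written density, mvn_density, as a hypothetical stand-in for dmnorm so that it runs with base R only; the points, means and covariance matrices are made up:

```r
# Hypothetical helper: multivariate normal density at a single point x
mvn_density <- function(x, mean, varcov) {
  k   <- length(x)
  dev <- x - mean
  # squared Mahalanobis distance of x from the mean
  q <- drop(t(dev) %*% solve(varcov, dev))
  exp(-0.5 * q) / sqrt((2 * pi)^k * det(varcov))
}

# Parallel lists: the i-th point is paired with the i-th mean and covariance
xs     <- list(c(0, 0), c(1, 2), c(-1, 0.5))
means  <- list(c(0, 0), c(1, 1), c(0, 0))
sigmas <- list(diag(2), diag(2), matrix(c(2, 0.5, 0.5, 1), 2, 2))

dens <- mapply(mvn_density, x = xs, mean = means, varcov = sigmas)
print(dens)
```

Each list element is passed as a whole to the function, which is why the means are kept as a list of vectors and the covariances as a list of matrices rather than being stacked into one array.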
