Weighted Pearson's Correlation with one Object - r

I want to create a correlation matrix from my data, but weighted based on the significant edges.
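m <- matrix(data = rnorm(36), nrow = 6, ncol = 6)
x <- LETTERS[1:6]
# y was never initialised before the loop; build the labels c1..c6 directly
y <- paste0("c", seq_along(x))
mCor <- cor(t(m))
# seq(0.5, 0.8, by = 0.01) has only 31 values, so draw 36 with replacement
w <- sample(x = seq(0.5, 0.8, by = 0.01), size = 36, replace = TRUE)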
The object w holds the weights for mCor. I know of other packages that compute a weighted correlation, but they expect the input vectors x and y to be the same length. I want to calculate a pairwise weighted Pearson correlation table, using the data for each row across all columns.
I just want to make sure my approach is correct: I thought about computing a weighted cor for each pair of rows, say A and B, by multiplying each value by the given weight. Typically you need three vectors of the same length: two for the data and one for the weights.
I am using the data.table package, so speedy solutions are welcome. I am also not sure whether I should pass a table with two columns for the connections and one for the weights. Do the existing functions preserve order, or do they match automatically?
library(data.table)
# y is written out to the full length (36) instead of relying on recycling
weight <- data.table(x = rep(LETTERS[1:3], each = 12), y = rep(LETTERS[4:6], times = 12), w = w)
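For reference, the "three vectors of the same length" form of a weighted Pearson correlation can be sketched in base R as below. This is only a sketch under an assumption the question does not confirm: that there is one weight per observation (per column of m). The names weighted_cor and w_obs are made up for illustration. Base R's cov.wt() computes the same quantity for all pairs at once.

```r
# Weighted Pearson correlation of x and y with one weight per observation.
weighted_cor <- function(x, y, w) {
  stopifnot(length(x) == length(y), length(x) == length(w))
  w  <- w / sum(w)                 # normalise weights to sum to 1
  mx <- sum(w * x)                 # weighted means
  my <- sum(w * y)
  sum(w * (x - mx) * (y - my)) /
    sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}

set.seed(1)
m <- matrix(rnorm(36), nrow = 6, dimnames = list(LETTERS[1:6], NULL))
w_obs <- runif(6, 0.5, 0.8)        # hypothetical: one weight per column of m

# All row pairs at once: cov.wt() treats the rows of its input as
# observations, so pass t(m) to correlate the rows of m.
wCor <- cov.wt(t(m), wt = w_obs, cor = TRUE)$cor

# Single pair, for comparison with wCor["A", "B"]
weighted_cor(m["A", ], m["B", ], w_obs)
```

If the weights are instead attached to pairs of rows (edges), as the 36 values in w suggest, this formula does not apply directly and the weighting scheme would need to be defined first.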

Related

R: Applying function to every element of matrix using elements of different matrix as function input

I wish to apply a custom function to each element of a matrix whilst also using elements of a different matrix as inputs to the function.
Specifically, my function generates random samples from a von Mises distribution (circular normal distribution), calling the Rfast package's rvonmises function.
I have one matrix (radians) which records the angle I wish to use for the central tendency of the random generation (similar to the mean), and another matrix (kappa) which records the concentration parameter of the von Mises I wish to use (similar to standard deviation).
I wish to use (for example) element [1, 1] of the radians matrix together with element [1, 1] of the kappa matrix in a call to the von Mises random generator. So, my call for one element would be:
rvonmises(n = 1, m = radians[1, 1], k = kappa[1, 1])
But of course I want this applied across all elements of the matrices. (The rvonmises function doesn't accept multiple m or k values, so for example I couldn't use rvonmises(4, m = c(1, 2, 3, 4), k = c(1, 1.2, 1.4, 1.6)).)
To summarise: I am basically after a more principled (and faster!) way of doing this:
for(i in 1:nrow(radians)){
  for(j in 1:ncol(radians)){
    result[i, j] <- Rfast::rvonmises(1, radians[i, j], kappa[i, j])
  }
}
What I have tried
Based on this post, I have tried to use mapply:
library(Rfast)
set.seed(42)
# random radians to use as input
radians <- matrix(data = runif(12, 0, 2 * pi),
                  ncol = 4)
# random concentration parameters of the von Mises distribution
kappa <- matrix(data = rgamma(12, 70, 30),
                ncol = 4)
# function to generate one random von Mises sample with mean angle m and
# concentration parameter k
my_function <- function(m, k){
  Rfast::rvonmises(1, m, k)
}
# my attempt
out <- matrix(mapply(my_function, m = as.data.frame(radians), k = kappa),
              ncol = 4, byrow = TRUE)
However, I don't think this is working. For example, if I test it by the following (where the central tendency in test_radians increases steadily and I use large values for kappa which leads to precise estimates):
test_radians <- matrix(data = seq(from = 1, to = 2 * pi, length.out = 12),
                       ncol = 4)
test_kappa <- matrix(data = rep(20, times = 12),
                     ncol = 4)
test <- matrix(mapply(my_function, m = as.data.frame(test_radians),
                      k = test_kappa),
               ncol = 4, byrow = TRUE)
test[1, 1] should be the smallest (on average), and test[3, 4] should be the largest. (I know that due to random variability this won't always be the case, but I've tried it with many replications.)
So, the mapping and matching between matrices isn't working as I had anticipated.
Any guidance welcomed.
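One likely cause of the mismatch: as.data.frame(radians) makes mapply iterate over the 4 columns (each call receives a whole column as m), while kappa is iterated over its 12 elements, so the pairs no longer line up. Passing both matrices directly lets mapply walk them element by element in column-major order. The sketch below uses a stand-in sampler, draw_one, so that it runs without Rfast; draw_one and map_elementwise are hypothetical names, and Rfast::rvonmises(1, m, k) could be swapped in for the body of draw_one.

```r
# Stand-in for a sampler that only accepts scalar parameters
# (hypothetical; replace the body with Rfast::rvonmises(1, m, k)).
draw_one <- function(m, k) {
  stopifnot(length(m) == 1, length(k) == 1)
  rnorm(1, mean = m, sd = 1 / sqrt(k))
}

# Apply f to matching elements of two equally sized matrices.
map_elementwise <- function(f, a, b) {
  stopifnot(identical(dim(a), dim(b)))
  out <- mapply(f, a, b)  # iterates over elements in column-major order
  dim(out) <- dim(a)      # restore the shape; no byrow needed
  out
}

set.seed(42)
test_radians <- matrix(seq(from = 1, to = 2 * pi, length.out = 12), ncol = 4)
test_kappa   <- matrix(20, nrow = 3, ncol = 4)
test <- map_elementwise(draw_one, test_radians, test_kappa)
```

Because the result is filled in the same column-major order in which the inputs were read, test[i, j] is drawn from test_radians[i, j] and test_kappa[i, j].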
You cannot compute the mean of circular observations by simply calling "mean". This is wrong. The correct way is to compute the means of the cosines and sines of the angles and then take the arc tangent. See packages for directional or circular data for this.
Secondly, you gave us an idea, to return a matrix of von Mises generated data. But, since brms does this job for you, at the moment I would go there.
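The circular mean described above can be sketched in base R as follows. The function name circular_mean is made up for illustration; dedicated packages for circular data provide their own versions.

```r
# Mean direction of angles given in radians; result lies in (-pi, pi].
circular_mean <- function(theta) {
  atan2(mean(sin(theta)), mean(cos(theta)))
}

# Two angles just either side of 0 (equivalently 2 * pi)
angles <- c(0.1, 2 * pi - 0.1)
mean(angles)           # arithmetic mean points in the opposite direction (pi)
circular_mean(angles)  # circular mean is approximately 0
```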

How can I find numerical intervals of k-means clusters?

I'm trying to discretize a numerical variable using Kmeans.
It worked pretty well but I'm wondering how I can find the intervals in my cluster.
I work with FactoMineR to do my kmeans.
I found 3 clusters according to the following graph:
My point now is to identify the intervals of my numerical variable within the clusters.
Is there any option or method in FactoMineR or another package to do it?
I can do it manually, but as I have to do it for a number of variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns, x and y. You can split the original data by the cluster each row belongs to and then extract values from each group. I am not sure whether my example data resembles yours, but in the code below I have simply used the min value of column x and the max value of column y, together with their difference, as the boundaries of a potential interval (whether this makes sense depends on the use case). Does that help you?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
              matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")
cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))
lapply(split(data, data$cluster), function(x) {
  min_x <- min(x$x)
  max_y <- max(x$y)
  diff <- max_y - min_x
  c(min_x = min_x, max_y = max_y, diff = diff)
})
# $`1`
#      min_x      max_y       diff
# -0.6906124  0.5123950  1.2030074
#
# $`2`
#     min_x     max_y      diff
# 0.2052112 1.6941800 1.4889688
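For the single-variable case in the question, the same split-by-cluster idea gives one interval per cluster directly: take the range of the variable within each cluster. This is a minimal sketch on simulated data using stats::kmeans rather than FactoMineR; the data and the object names are assumptions for illustration.

```r
set.seed(1)
# One numeric variable with three well separated groups (simulated)
x <- c(rnorm(50, mean = 0, sd = 0.3),
       rnorm(50, mean = 3, sd = 0.3),
       rnorm(50, mean = 6, sd = 0.3))

cl <- kmeans(x, centers = 3, nstart = 10)

# Range of x within each cluster: one row per cluster
intervals <- t(sapply(split(x, cl$cluster), range))
colnames(intervals) <- c("min", "max")

# Sort by lower bound so the intervals read left to right
intervals <- intervals[order(intervals[, "min"]), ]
intervals
```

To repeat this for several variables, the last three steps can be wrapped in a function and applied to each column with lapply.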

Association matrix in r

corrplot lets you plot a correlation matrix in R.
Any idea how I can plot an association matrix in R,
where the measure of association is a user-specified method such as Cramer's V?
The answer to your question strongly depends on the data you have and the specific correlation method. I assume you have a bunch of nominal variables and want to see whether they are associated, using Cramer's V in the correlation plot. In that case, one way to do this is as follows:
1. Calculate the Cramer's V coefficient for every pair of variables. I used the vcd library, as it has a method to calculate Cramer's V.
2. Put these coefficients together, which basically gives a correlation matrix.
3. Visualize the matrix.
Ugly but working code to do this is listed below. I played around with outer, the clearest and most precise way to work with row and column indexes, but ran into problems indexing columns in df using the row and column index from m: for some reason it just didn't want to get the variable from df.
install.packages("vcd")
library(vcd)
library(corrplot)  # provides corrplot()
# Simulate some data or paste your own
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))
# Initialize empty matrix to store coefficients
empty_m <- matrix(ncol = length(df),
                  nrow = length(df),
                  dimnames = list(names(df),
                                  names(df)))
# Function that accepts matrix for coefficients and data and returns a correlation matrix
calculate_cramer <- function(m, df) {
  for (r in seq(nrow(m))){
    for (c in seq(ncol(m))){
      m[[r, c]] <- assocstats(table(df[[r]], df[[c]]))$cramer
    }
  }
  return(m)
}
# pass the simulated data frame df (not the base function `data`)
cor_matrix <- calculate_cramer(empty_m, df)
corrplot(cor_matrix)
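The three steps above can also be sketched without vcd, assuming the usual definition V = sqrt(chi^2 / (n * (min(rows, cols) - 1))) and taking the chi-squared statistic from base R's chisq.test. The helper name cramers_v is made up for illustration, and the sketch assumes every variable has at least two distinct levels.

```r
# Cramer's V for two categorical vectors, from the chi-squared statistic.
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  # small expected counts trigger a warning that is irrelevant here
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(1)
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))

# Square matrix of pairwise associations, named by variable
vars  <- names(df)
v_mat <- sapply(vars, function(i) {
  sapply(vars, function(j) cramers_v(df[[i]], df[[j]]))
})
v_mat
```

The resulting v_mat has values between 0 and 1 with ones on the diagonal, so it can be passed to a plotting function in the same way as cor_matrix.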
Building upon the example by Alexey Knorre:
library(DescTools)
library(corrplot)
# Simulate data
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
x2 = sample(letters[1:5], 20, replace = TRUE),
x3 = sample(letters[1:5], 20, replace = TRUE))
# Use CramerV as input for corrplot
corrplot::corrplot(DescTools::PairApply(df, DescTools::CramerV))
library(vcd)
library(corrplot)
I would suggest corrplot(PairApply(df, cramerV), diag = FALSE, is.corr = FALSE) to change the color scale from [-1, 1] (is.corr = TRUE) to [0, 1] (is.corr = FALSE).

removing specific columns in R

I am using the findCorrelation function in R:
highCorr <- findCorrelation(correlations, cutoff = .60, names = FALSE)
The function returns the numbers/names of the columns that are correlated at 0.6 or above.
I want to remove these columns.
I don't know how to do this: first, if I remove them one at a time the column numbers change, and second, I want to try a few cutoff thresholds and would like to do this automatically.
If your original data are a correlation matrix you can do the following:
library(caret) # findCorrelation comes from this library
set.seed(1)
# create simulated data for correlation matrix
mydata <- matrix(data = rnorm(100, mean = 100, sd = 3), nrow = 10, ncol = 10)
# create correlation matrix
correlations <- cor(mydata)
# index correlations at cutoff
corr_ind <- findCorrelation(correlations, cutoff = .2)
# remove columns from original data based on index value
# (select the columns to keep, so an empty corr_ind drops nothing)
keep <- setdiff(seq_len(ncol(mydata)), corr_ind)
remove_corrs <- mydata[, keep, drop = FALSE]
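To try several cutoffs automatically, the same steps can be wrapped in lapply. This is a sketch only: the cutoff values and the names cutoffs and reduced are arbitrary choices for illustration. All flagged columns are removed in one indexing step, so the shifting column numbers mentioned in the question do not come into play.

```r
library(caret)

set.seed(1)
mydata <- matrix(rnorm(100, mean = 100, sd = 3), nrow = 10, ncol = 10)
correlations <- cor(mydata)

cutoffs <- c(0.2, 0.4, 0.6)  # arbitrary thresholds to compare

# One reduced data set per cutoff
reduced <- lapply(cutoffs, function(co) {
  drop_ind <- findCorrelation(correlations, cutoff = co)
  keep     <- setdiff(seq_len(ncol(mydata)), drop_ind)
  mydata[, keep, drop = FALSE]
})
names(reduced) <- paste0("cutoff_", cutoffs)

# Number of columns remaining at each cutoff
sapply(reduced, ncol)
```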

automation of subset process

It is probably easy, but I can't figure it out.
I have a data frame with over 70 variables. I make predictions using all those variables. For sensitivity analysis I would like to subset the data frame automatically to see how the prediction performs on this specific subset.
I have done this manually but with over 100 different subset options it is very tedious.
Here is the data/code and my desired solution:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
a = c(1.7, 3.3, 5.1)
df = data.frame(n, s, b, a)
df
To calculate the accuracy of prediction a:
df$calc <- df$a - df$n
df$difference <- sqrt(df$calc * df$calc)
With these values I can now calculate the mean and SD:
Mean <- mean(df$difference)
SD <- sd(df$difference)
Let's say I would like to get an overview of the prediction accuracy for all cases where b = TRUE. (Or other subsets of the data)
Ideally I would like a data frame to look like this:
subset = c("b=TRUE", "b=FALSE", "s=aa")
amount = c(2, 1, 1) # count the number this subset occurs
Mean = c(0.22, 0.3, 0.1)
SD = c(0.1, 0.2, 0.5)
OV = data.frame(subset, amount, Mean, SD)
OV
Considering that I have more than 100 different subsets that I would like to create, I need a fast solution that generates an overview like the OV data frame. I tried a loop, but I have trouble defining a vector for subsetting the data.
Thanks!
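One possible approach, sketched here on the example data: describe each subset as a named logical vector, then loop over the names and summarise df$difference for the selected rows. The list name subsets and its three conditions are taken from the desired OV table; with more than 100 subsets the list could be built programmatically instead of by hand.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE),
                 a = c(1.7, 3.3, 5.1))
df$difference <- abs(df$a - df$n)  # same as sqrt(calc * calc)

# Each subset is a logical vector with one entry per row of df
subsets <- list("b=TRUE"  = df$b,
                "b=FALSE" = !df$b,
                "s=aa"    = df$s == "aa")

# One summary row per subset, stacked into a single data frame
OV <- do.call(rbind, lapply(names(subsets), function(nm) {
  d <- df$difference[subsets[[nm]]]
  data.frame(subset = nm, amount = length(d), Mean = mean(d), SD = sd(d))
}))
OV
```

Note that sd() returns NA for a subset with a single row, so SD is only meaningful for subsets with at least two cases.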
