Is there a way in R to compute a pairwise-weighted correlation matrix?

I have a survey with a lot of numeric variables (both continuous and dummy/binary) and more than 800 observations. Of course, there is missing data in most of the variables (at different rates). I need a weighted correlation table because some samples represent more of the population than others. I also want to minimize the number of discarded samples, so that each pair of variables keeps the maximum number of observations. I know how to compute a pairwise correlation matrix (e.g., cor(data, use="pairwise.complete.obs")), and I know how to compute a weighted correlation matrix (e.g., cov.wt(data %>% select(-weight), wt=data$weight, cor=TRUE)). However, I haven't found a way to do both at once. Is there a way to compute a pairwise-weighted correlation matrix in R? I'd appreciate any help or recommendations.

Good question. Here is how I do it. It is not fast, but it is faster than looping.
df_correlation is a data frame containing only the variables I want to correlate, and newdf is my original data frame with the weight column (BalancingWeights) and the other variables.
library(purrr)   # for map() and map_dbl()
library(wCorr)   # for weightedCorr()
# Every pair of variables, with the name of the weight column appended
data_list <- combn(names(df_correlation), 2, simplify = FALSE)
data_list <- map(data_list, ~ c(., "BalancingWeights"))
# Empty matrix to hold the results
dimension <- length(names(df_correlation))
allcorr <- matrix(data = NA_real_, nrow = dimension, ncol = dimension)
rownames(allcorr) <- names(df_correlation)
colnames(allcorr) <- names(df_correlation)
# Pairwise-complete weighted Pearson correlation for one pair of variables
myfunction <- function(data, x, y, weight) {
  indice <- !(is.na(data[[x]]) | is.na(data[[y]]))
  wCorr::weightedCorr(data[[x]][indice], data[[y]][indice],
                      method = "Pearson",
                      weights = data[[weight]][indice],
                      ML = FALSE, fast = TRUE)
}
b <- map_dbl(data_list, ~ myfunction(newdf, .[1], .[2], .[3]))
# Fill the matrix symmetrically by pair names (combn() does not enumerate pairs
# in the same order as upper.tri(), so name-based indexing is safer)
idx <- t(combn(names(df_correlation), 2))   # one row per variable pair
allcorr[idx] <- b
allcorr[idx[, c(2, 1)]] <- b
diag(allcorr) <- 1
View(allcorr)
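If you want to test the snippet above without the original survey, a minimal toy setup could look like this (the variable names and the BalancingWeights column are assumptions for illustration); define these objects first and then run the code above:
set.seed(42)
newdf <- data.frame(
  a = c(rnorm(95), rep(NA, 5)),         # some missing values at the end
  b = c(rep(NA, 10), rnorm(90)),        # some missing values at the start
  c = rnorm(100),
  BalancingWeights = runif(100, 0.5, 2)
)
df_correlation <- newdf[, c("a", "b", "c")]
# After running the code above, allcorr is a symmetric 3 x 3 weighted correlation matrix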

Related

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
data = (Dataframe1 %>%
gather(key = "Column1", value = "response", "Column1", "Column2")),
paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both data frames will have the same number of observations (the same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB).
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
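To make that concrete, here is a tiny made-up example (not from the original question) showing that the formula interface and the two-vector interface give the same result:
long <- data.frame(response = c(1, 2, 3, 2, 3, 4),
                   group = rep(c("A", "B"), each = 3))
# Formula interface: t.test() splits response by group internally
t.test(response ~ group, data = long, var.equal = TRUE)
# Equivalent call with the two vectors extracted by hand
t.test(long$response[long$group == "A"],
       long$response[long$group == "B"], var.equal = TRUE)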
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did to turn it into a Welch's test, I'd say it was setting var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have unequal numbers of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the group means (which you can get with m1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)), and n1 and n2 are the sample sizes (n2 = length(DataframeB$Column2)).
lx <- n1 - 1           # Number of observations in group 1
ly <- n2 - 1           # Number of observations in group 2
md <- abs(m1 - m2)     # Mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd / (lx + ly)
csd <- sqrt(csd)       # Common (pooled) sd
cd <- md / csd         # Cohen's d
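Put end to end, a sketch of the whole calculation (assuming DataframeA$Column1 and DataframeB$Column2 hold the two groups, with NAs dropped) might look like this:
g1 <- na.omit(DataframeA$Column1)
g2 <- na.omit(DataframeB$Column2)
m1 <- mean(g1); s1 <- sd(g1); n1 <- length(g1)
m2 <- mean(g2); s2 <- sd(g2); n2 <- length(g2)
csd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))   # pooled sd
cd <- abs(m1 - m2) / csd                                           # Cohen's d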
This should work for you
res = t.test(DataFrameA$Column1, DataFrameB$Column2, alternative = "two.sided", var.equal = FALSE)

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr)   # for sample_n()
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result; in this case cor.test returns a list of length 8, so replicate simplifies the results to a matrix with 8 rows and 10 columns (one column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
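If you want to keep more than just the p-value, here is a sketch (still using the mtcars example above) that stores both the Kendall tau estimate and the p-value from each replication in a data frame:
set.seed(1)
result_df <- do.call(rbind, replicate(n = 10, simplify = FALSE, expr = {
  newdata <- sample_n(mtcars, 10)
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  data.frame(tau = unname(ct$estimate), p.value = ct$p.value)
}))
head(result_df)
# write.csv(result_df, "correlation_results.csv")   # save to a file if needed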

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different variables/vectors, each with about 80 entries. I'm trying to find significant correlations: I want to pick one variable and let R calculate its correlations with the other 39 variables.
I tried to do this by using a linear model with one explanatory variable, i.e. Y = a*X + b.
The lm() command then gives me an estimate for a and a p-value for that estimate. I would then go on and use one of the other variables as X and try again, until I find a p-value that's really small.
I'm sure this is a common problem. Is there some sort of package or function that can try all these possibilities (brute force), show them, and maybe even sort them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
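Since the goal is to rank one variable's correlations against all the others, one way (a sketch, assuming the demo data above with a response called y) is to pull out the row of p-values for y and sort it:
p_for_y <- sort(correlations$P["y", ])   # p-values of y vs. every other variable, smallest first
head(p_for_y, 5)                         # the five smallest p-values
correlations$r["y", names(p_for_y)[1:5]] # the corresponding correlation coefficients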
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
This produces a correlogram plot of all pairwise correlations.
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from @Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import the library and find the correlations (as in @Carlos's answer)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the upper triangle of the p-value matrix, printing each significant correlation once:
for (i in 1:(m - 1)) {
  for (j in (i + 1):m) {
    if (!is.na(correlations$P[i, j]) && correlations$P[i, j] < 0.05) {
      print(paste(rownames(correlations$P)[i], "-",
                  colnames(correlations$P)[j], ":", correlations$P[i, j]))
    }
  }
}
Warning
You should not use this to draw any serious conclusions; it is only useful for exploratory analysis and for formulating hypotheses. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that adjust for running multiple tests, e.g. the Bonferroni correction: https://en.wikipedia.org/wiki/Bonferroni_correction.
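If you do screen this many tests, one option (a sketch using base R's p.adjust() on the rcorr output from above) is to adjust the p-values for multiple testing before filtering:
p_raw <- correlations$P[upper.tri(correlations$P)]   # unique pairwise p-values
p_bonf <- p.adjust(p_raw, method = "bonferroni")     # Bonferroni correction
p_bh <- p.adjust(p_raw, method = "BH")               # less conservative Benjamini-Hochberg
sum(p_bonf < 0.05)   # how many pairs survive Bonferroni
sum(p_bh < 0.05)     # how many survive Benjamini-Hochberg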
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
  the_data[, -1],
  function(x) cor(the_data[, 1], x),
  numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated with which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable, then you should take the one that is most strongly correlated with y.
To do this, just use which.max(abs(cor(x, y))). If you want to use more than one variable in your model, then you should consider something like the lasso estimator (a rough sketch follows below).
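The answer doesn't name a package; one common choice for the lasso is glmnet, so a hedged sketch on the demo data from above could be:
library(glmnet)
x_mat <- as.matrix(the_data[, -1])             # predictor matrix
y_vec <- the_data$y                            # response
cv_fit <- cv.glmnet(x_mat, y_vec, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")                 # variables with non-zero coefficients are selected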
One option is to run a correlation matrix:
cor_result <- cor(data)
write.csv(cor_result, file = "cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical data) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each = 50)                          # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100 * 20) < 1.5  # Binary dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distances and rank the design-set neighbours of each test point
library(e1071)
d <- hamming.distance(x)
NN <- apply(d[test.set, design.set], 1, order)    # columns = test points, rows = neighbour ranks
# Predict class membership of the test set by majority vote among the k nearest neighbours
k <- 5
pred <- apply(NN[1:k, , drop = FALSE], 2, function(nn) {
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)])          # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as a matrix in the case k=1; without it, the subset would be converted to a vector and apply would throw an error.

Random sample from given bivariate discrete distribution

Suppose I have a bivariate discrete distribution, i.e. a table of probability values P(X=i,Y=j), for i=1,...,n and j=1,...,m. How do I generate a random sample (X_k,Y_k), k=1,...,N, from such a distribution? Maybe there is a ready-made R function like:
sample(100, prob=biprob)
where biprob is a two-dimensional matrix?
One intuitive way to sample is the following. Suppose we have a data.frame
dt=data.frame(X=x,Y=y,P=pij)
Where x and y come from
expand.grid(x=1:n,y=1:m)
and pij are the P(X=i,Y=j).
Then we get our sample (Xs,Ys) of size N, the following way:
set.seed(1000)
Xs <- sample(dt$X,size=N,prob=dt$P)
set.seed(1000)
Ys <- sample(dt$Y,size=N,prob=dt$P)
I use set.seed() to simulate the "bivariateness". Intuitively I should get something similar to what I need, but I am not sure that this is the correct way. Hence the question :)
Another way would be Gibbs sampling; the marginal distributions are easy to compute.
I tried googling, but nothing really relevant came up.
You are almost there. Assuming you have the data frame dt with the x, y, and pij values, just sample the rows!
dt <- expand.grid(X=1:3, Y=1:2)
dt$p <- runif(6)
dt$p <- dt$p / sum(dt$p) # get fake probabilities
idx <- sample(1:nrow(dt), size=8, replace=TRUE, prob=dt$p)
sampled.x <- dt$X[idx]
sampled.y <- dt$Y[idx]
It's not clear to me why you should care that it is bivariate. The probabilities sum to one and the outcomes are discrete, so you are just sampling from a categorical distribution. The only difference is that you are indexing the observations using rows and columns rather than a single position. This is just notation.
In R, you can therefore easily sample from your distribution by reshaping your data and sampling from a categorical distribution. Sampling from a categorical can be done using rmultinom and using which to select the index, or, as Aniko suggests, using sample to sample the rows of the reshaped data. Some bookkeeping can take care of your exact case.
Here's a solution:
library(reshape)
# Reshape data to long format.
data <- matrix(data = c(.25,.5,.1,.4), nrow=2, ncol=2)
pmatrix <- melt(data)
# Sample categorical n times.
rcat <- function(n, pmatrix) {
rows <- which(rmultinom(n,1,pmatrix$value)==1, arr.ind=TRUE)[,'row']
indices <- pmatrix[rows, c('X1','X2')]
colnames(indices) <- c('i','j')
rownames(indices) <- seq(1,nrow(indices))
return(indices)
}
rcat(3,pmatrix)
This returns 3 random draws from your matrix, reporting the i and j of the rows and columns:
  i j
1 1 1
2 2 2
3 2 2
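As a quick sanity check (my own addition, not part of the original answer): with many draws the empirical cell frequencies should approach the normalized probability matrix, since rmultinom rescales the probabilities internally.
draws <- rcat(10000, pmatrix)
table(draws$i, draws$j) / 10000   # empirical cell frequencies
data / sum(data)                  # target probabilities after normalization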
