Association matrix in R

corrplot lets you plot a correlation matrix in R.
Any idea how I can plot an association matrix in R,
where the measure of association is a user-specified method such as Cramer's V?

The answer depends on the data you have and on the specific association measure. I assume you have a set of nominal variables and want to see how strongly they are associated, using Cramer's V in a correlation-style plot. In that case, one way to do it is:
1. Calculate Cramer's V for every pair of variables. I used the vcd library, as it has a function (assocstats) that calculates Cramer's V.
2. Put these coefficients together to get a correlation-style matrix.
3. Visualize the matrix.
Ugly but working code is listed below. I played around with outer, which would be the clearest way to work with row and column indexes, but ran into problems indexing columns of df with the row and column indexes from m: for some reason it would not pick the variable out of df.
# install.packages(c("vcd", "corrplot"))  # run once if the packages are missing
library(vcd)       # assocstats()
library(corrplot)  # corrplot()
# Simulate some data or paste your own
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))
# Initialize empty matrix to store coefficients
empty_m <- matrix(ncol = length(df),
                  nrow = length(df),
                  dimnames = list(names(df),
                                  names(df)))
# Function that accepts a matrix for the coefficients and the data,
# and returns a matrix of pairwise Cramer's V values
calculate_cramer <- function(m, df) {
  for (r in seq(nrow(m))) {
    for (c in seq(ncol(m))) {
      m[[r, c]] <- assocstats(table(df[[r]], df[[c]]))$cramer
    }
  }
  return(m)
}
cor_matrix <- calculate_cramer(empty_m, df)
corrplot(cor_matrix)
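The trouble with outer described above usually comes from the fact that outer calls its function once with whole vectors of row and column indexes, not once per pair, so df[[r]] receives a vector instead of a single index. A minimal sketch of one workaround is to wrap the pairwise function in Vectorize. The helper name cramers_v is hypothetical, and it computes the coefficient in base R from chisq.test rather than with vcd::assocstats:

```r
set.seed(1)
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))

# Hypothetical helper: Cramer's V for two categorical vectors,
# V = sqrt(chi2 / (n * (min(rows, cols) - 1))), no continuity correction
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# outer() passes vectors of indexes, so vectorize over the index pairs
pair_v <- Vectorize(function(i, j) cramers_v(df[[i]], df[[j]]))
v_matrix <- outer(seq_along(df), seq_along(df), pair_v)
dimnames(v_matrix) <- list(names(df), names(df))
v_matrix
```

The resulting v_matrix can be passed to corrplot in the same way as cor_matrix.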

Building upon the example by Alexey Knorre:
library(DescTools)
library(corrplot)
# Simulate data
df <- data.frame(x1 = sample(letters[1:5], 20, replace = TRUE),
                 x2 = sample(letters[1:5], 20, replace = TRUE),
                 x3 = sample(letters[1:5], 20, replace = TRUE))
# Use CramerV as input for corrplot
corrplot::corrplot(DescTools::PairApply(df, DescTools::CramerV))

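library(DescTools)
library(corrplot)
I would suggest corrplot(PairApply(df, CramerV), diag = FALSE, is.corr = FALSE). With is.corr = FALSE the color scale is no longer fixed to the -1 to 1 range of a correlation (is.corr = TRUE) and instead follows the values in the matrix, which for Cramer's V lie between 0 and 1.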

Related

How to create correlation matrix after mice multiple imputation

I'm using the mice package to create multiple imputations. I want to create a correlation matrix (and a matrix of p-values for the correlation coefficients). I use miceadds::micombine.cor to do this. But this gives a data frame with the variable names in the first two columns, followed by a number of columns containing r, p, t-values, and the like.
I'm looking for a way to turn this data frame into a "good old" matrix with the correlation coefficient between x and y in position [x,y], plus a matrix of p-values. Does anyone have an easy way to do this?
Here's some code to reproduce:
library(missForest)  # prodNA()
library(mice)
mt.mis <- prodNA(mtcars, noNA = 0.1)
imputed <- mice(mt.mis, m = 5, maxit = 5, method = "pmm")
correlations <- miceadds::micombine.cor(mi.res = imputed, variables = c(1:3))
What I'm looking for is something like the output from cor(mtcars). Who can help?
I ended up writing my own function. It can probably be done much more efficiently, but this is what I made.
cormatrix <- function(r, N) {
  # Assumes the first N*(N-1)/2 entries of r hold the pairs in
  # row-wise upper-triangle order (1-2, 1-3, ..., 2-3, ...)
  x <- 1
  m <- matrix(nrow = N, ncol = N)  # create empty matrix
  for (i in 1:N) {
    for (j in i:N) {
      if (j > i) {
        m[i, j] <- r[x]
        m[j, i] <- r[x]
        x <- x + 1
      }
    }
  }
  diag(m) <- 1
  m
}
You can call it with the r column of the micombine.cor output and the number of variables in your model as arguments, for example cormatrix(correlations$r, ncol(df)).
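Because the function above fills the matrix by position, it depends on the order of the entries in r. A minimal sketch of an order-independent alternative, assuming the combined results are a long-format data frame with the variable names in two columns plus columns r and p. The data frame long_res and the helper long_to_matrix are hypothetical stand-ins with made-up numbers, not actual micombine.cor output:

```r
# Hypothetical long-format results: one row per variable pair (made-up values)
long_res <- data.frame(variable1 = c("mpg", "mpg", "cyl"),
                       variable2 = c("cyl", "disp", "disp"),
                       r = c(-0.85, -0.84, 0.90),
                       p = c(0.001, 0.002, 0.0005),
                       stringsAsFactors = FALSE)

# Fill a symmetric matrix by variable name rather than by position
long_to_matrix <- function(res, value, diag_value) {
  v1 <- as.character(res$variable1)
  v2 <- as.character(res$variable2)
  vars <- unique(c(v1, v2))
  m <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  m[cbind(v1, v2)] <- res[[value]]  # upper or lower half, whichever is listed
  m[cbind(v2, v1)] <- res[[value]]  # mirror image
  diag(m) <- diag_value
  m
}

r_matrix <- long_to_matrix(long_res, "r", diag_value = 1)
p_matrix <- long_to_matrix(long_res, "p", diag_value = NA)
r_matrix
```

Indexing by name also keeps working if each pair appears twice in the input, once in each order.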

For Loop t.test, Comparing Means by Factor Class in R

I want to run a lot of one-sided t.tests in a loop, comparing mean crop harvest value by pattern for a set of different crops.
My data is structured like this:
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("mono", "inter"), 15),
stringsAsFactors = TRUE)
I would like the output to provide results from a t.test, comparing mean harvest of each crop by pattern (i.e. compare harvest of mono-cropped potatoes to intercropped potatoes), where the alternative is greater value for the intercropped pattern.
Help!
Here's an example using base R.
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Create a list which will hold the output of the test for each crop
crops <- unique(df$crop)
test_output <- vector('list', length = length(crops))
names(test_output) <- crops
# For each crop, save the output of a one-sided t-test
for (crop in crops) {
  # Filter the data to include only observations for the particular crop
  crop_data <- df[df$crop == crop, ]
  # Save the results of a t-test with a one-sided alternative
  test_output[[crop]] <- t.test(formula = value ~ pattern,
                                data = crop_data,
                                alternative = 'greater')
}
It's important to note that when calling t.test with the formula interface (e.g. y ~ x), where the independent variable is a factor, the setting alternative = 'greater' tests whether the mean in the lower factor level (in the case of your data, "inter") is greater than the mean in the higher factor level (here, "mono").
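The direction of the one-sided test can be checked directly. A minimal sketch on simulated data (the data frame d and the column pattern_mono_first are made up for illustration) that compares the formula interface with an explicit x/y call, and shows one way to flip the direction with relevel:

```r
set.seed(42)
d <- data.frame(value = rnorm(20),
                pattern = factor(rep(c("mono", "inter"), 10)))
levels(d$pattern)  # alphabetical order: "inter" comes before "mono"

# Formula interface: tests mean(inter) > mean(mono)
by_formula <- t.test(value ~ pattern, data = d, alternative = "greater")

# The same test written with explicit x and y vectors
by_vectors <- t.test(x = d$value[d$pattern == "inter"],
                     y = d$value[d$pattern == "mono"],
                     alternative = "greater")
all.equal(by_formula$p.value, by_vectors$p.value)

# To test mean(mono) > mean(inter) instead, make "mono" the first level
d$pattern_mono_first <- relevel(d$pattern, ref = "mono")
flipped <- t.test(value ~ pattern_mono_first, data = d, alternative = "greater")
flipped$p.value
```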
Here's the elegant "tidyverse" approach, which makes use of the tidy function from broom which allows you to store the output of a t-test as a data frame.
Instead of a formal for loop, the group_by and do functions from the dplyr package are used to accomplish the same thing as a for loop.
library(dplyr)
library(broom)
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
"value" = rnorm(n = 30),
"pattern" = rep(c("inter", "mono"), 15),
stringsAsFactors = TRUE)
# Group the data by crop, and run a t-test for each subset of data.
# Use the tidy function from the broom package
# to capture the t.test output as a data frame
df %>%
  group_by(crop) %>%
  do(tidy(t.test(formula = value ~ pattern,
                 data = .,
                 alternative = 'greater')))
Consider by, an object-oriented wrapper to tapply designed to subset a data frame by factor(s) and run operations on the subsets:
t_test_list <- by(df, df$crop, function(sub)
t.test(formula = value ~ pattern,
data = sub, alternative = 'greater')
)
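The object returned by by behaves like a list with one t.test result per crop, so the pieces of interest can be pulled out with sapply. A minimal sketch using the same simulated data layout as above; summary_table is a name chosen for illustration:

```r
set.seed(1)
df <- data.frame(crop = rep(c("Beans", "Corn", "Potatoes"), 10),
                 value = rnorm(30),
                 pattern = rep(c("inter", "mono"), 15),
                 stringsAsFactors = TRUE)

t_test_list <- by(df, df$crop, function(sub)
  t.test(value ~ pattern, data = sub, alternative = "greater"))

# Collect one row per crop: test statistic and one-sided p-value
summary_table <- data.frame(
  crop      = names(t_test_list),
  statistic = sapply(t_test_list, function(res) unname(res$statistic)),
  p_value   = sapply(t_test_list, function(res) res$p.value),
  row.names = NULL
)
summary_table
```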

Weighted Pearson's Correlation with one Object

I want to create a correlation matrix using data but weighted based on significant edges.
m <- matrix(data = rnorm(36), nrow = 6, ncol = 6)
x <- LETTERS[1:6]
y <- paste0("c", seq_along(x))  # "c1" ... "c6"
mCor <- cor(t(m))
w <- sample(x = seq(0.5, 0.8, by = 0.01), size = 36, replace = TRUE)
The object w represents the weights for mCor. I know other packages that provide correlation for input data that has to be the same length for vectors x and y. I want to calculate a pairwise weighted Pearson's correlation table, using data for each row across all columns.
I just want to make sure it's correct, but I thought about using a weighted cor for each row A and B by multiplying each value by the given weight. You typically need three vectors all the same length, two for data, and one for the weights.
I am using the data.table package so speedy solutions are welcomed. Also, not sure if I should pass a table with two columns for connections and one for weights. Do the existing functions preserve order or automatically match?
weight <- data.table(x = rep(LETTERS[1:3], each = 12), y = rep(LETTERS[4:6], times = 3), w = w)
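One reading of the three-vector description above is a weighted Pearson correlation with one weight per observation. A minimal sketch of that reading in base R, cross-checked against stats::cov.wt. The helper name weighted_cor is hypothetical, and how the 36 weights in w map onto the rows and columns of m is left open by the question, so this example assumes one weight per column and draws its own six weights (obs_w):

```r
set.seed(1)
m <- matrix(rnorm(36), nrow = 6, ncol = 6, dimnames = list(LETTERS[1:6], NULL))
obs_w <- runif(6, min = 0.5, max = 0.8)  # assumption: one weight per column

# Hypothetical helper: weighted Pearson correlation of two equal-length vectors
weighted_cor <- function(a, b, w) {
  w  <- w / sum(w)
  da <- a - sum(w * a)  # deviations from the weighted mean
  db <- b - sum(w * b)
  sum(w * da * db) / sqrt(sum(w * da^2) * sum(w * db^2))
}

# Pairwise weighted correlation between the rows of m
rows <- seq_len(nrow(m))
w_cor <- outer(rows, rows,
               Vectorize(function(i, j) weighted_cor(m[i, ], m[j, ], obs_w)))
dimnames(w_cor) <- list(rownames(m), rownames(m))

# Cross-check: cov.wt() treats rows as observations, hence the transpose
reference <- cov.wt(t(m), wt = obs_w / sum(obs_w), cor = TRUE)$cor
all.equal(w_cor, reference, check.attributes = FALSE)
```

If the weights are instead meant per pair of rows (one weight per edge), a different definition is needed, and this sketch does not cover it.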

removing specific columns in R

I am using findCorrelation function in R:
highCorr <- findCorrelation(correlations, cutoff = .60,names = FALSE)
The function returns the column numbers/names of the variables whose correlation is 0.6 or above.
I want to remove these columns.
I don't know how to do this: if I remove them one at a time, the column numbers change. I also want to try a few cutoff thresholds and would like to do this automatically.
If the correlation matrix was computed from your original data, you can do the following:
library(caret) #findCorrelation comes from this library
set.seed(1)
#create simulated data for correlation matrix
mydata <- matrix(data = rnorm(100,mean = 100, sd = 3), nrow = 10, ncol = 10)
#create correlation matrix
correlations <- cor(mydata)
#index correlations at cutoff
corr_ind <- findCorrelation(correlations, cutoff = .2)
#remove columns from original data based on index value
#(note the comma: without it, single elements of the matrix are dropped, not columns)
remove_corrs <- mydata[, -corr_ind]
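Negative indexing drops all flagged columns in one step, so the shifting column numbers mentioned in the question never come into play. A minimal sketch in base R that wraps this in a loop over several cutoffs. Here flag_correlated is a hypothetical, simplified stand-in for caret::findCorrelation (it flags the later column of every pair above the cutoff, which is not caret's selection rule), used only so the example runs without extra packages. drop_columns is likewise a name chosen for illustration:

```r
set.seed(1)
mydata <- matrix(rnorm(100, mean = 100, sd = 3), nrow = 10, ncol = 10)
correlations <- cor(mydata)

# Hypothetical stand-in for findCorrelation(): column indexes to drop
flag_correlated <- function(cor_mat, cutoff) {
  high <- abs(cor_mat) > cutoff & upper.tri(cor_mat)
  which(colSums(high) > 0)
}

# Drop columns by index; setdiff() also handles an empty index safely,
# whereas data[, -integer(0)] would return zero columns
drop_columns <- function(data, idx) {
  data[, setdiff(seq_len(ncol(data)), idx), drop = FALSE]
}

cutoffs <- c(0.2, 0.4, 0.6)
reduced <- lapply(cutoffs, function(k) {
  drop_columns(mydata, flag_correlated(correlations, k))
})
names(reduced) <- paste0("cutoff_", cutoffs)
sapply(reduced, ncol)  # number of columns kept at each cutoff
```

With caret installed, flag_correlated(correlations, k) could be swapped for findCorrelation(correlations, cutoff = k), and the rest of the loop would stay the same.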

how to sample distributions, given n, distribution name, and parameters in a dataframe?

I have a dataframe:
priors <- data.frame(dist = c('lnorm', 'beta', 'gamma'),
a = c(0.5, 1, 10),
b = c(0.4, 25, 4),
n = c(100, 100, 100)
)
and I would like to take n samples from the distribution with parameters a and b.
I have written this function:
pr.samp <- function(n,dist,a,b) {eval (parse (
text =
paste("r",dist,"(",n,",",a,",",b,")",sep = "")
))}
I would like to know:
is there a better approach?
how would I use one of the apply functions to run this on each row?
do I have to convert the dataframe to a matrix to do this?
Thanks in advance!
See ?do.call:
pr.samp <- function(n, dist, a, b) {
  do.call(paste('r', dist, sep = ""), list(n, a, b))
}
Using apply is difficult, as you have mixed character and numeric columns in your data frame. apply first converts the data frame to a matrix, which here is a character matrix, so each row arrives as a character vector and the numeric parameters cause errors. I'd do something like:
sapply(1:nrow(priors),function(x){
pr.samp(priors$n[x],priors$dist[x],priors$a[x],priors$b[x])})
Alternatively, Joshua's solution is cleaner:
sapply(1:nrow(priors), function(x) do.call(pr.samp,as.list(priors[x,])))
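Another way to iterate over the rows without converting to a matrix is Map, which walks the columns in parallel and keeps each column's type. A minimal sketch along the lines of the do.call approach above. The helper name sample_prior is hypothetical, and match.fun is used here to look up the r* function by name:

```r
priors <- data.frame(dist = c("lnorm", "beta", "gamma"),
                     a = c(0.5, 1, 10),
                     b = c(0.4, 25, 4),
                     n = c(100, 100, 100),
                     stringsAsFactors = FALSE)

# Hypothetical helper: draw n samples from r<dist>(n, a, b)
sample_prior <- function(dist, a, b, n) {
  rfun <- match.fun(paste0("r", dist))
  rfun(n, a, b)
}

set.seed(1)
samples <- Map(sample_prior, priors$dist, priors$a, priors$b, priors$n)
names(samples) <- priors$dist
str(samples)
```

Unlike sapply, Map always returns a list, which is convenient when the rows ask for different sample sizes.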
