How to create correlation matrix after mice multiple imputation - r

I'm using the mice package to create multiple imputations. I want to create a correlation matrix (and a matrix of p-values for the correlation coefficients). I use miceadds::micombine.cor to do this, but that gives a data frame with the variable pairs in the first two columns, followed by a number of columns containing r, p, t-values, and the like.
I'm looking for a way to turn this data frame into a "good old" matrix with the correlation coefficient between x and y in position [x,y], plus a matching matrix of p-values. Does anyone have an easy way to do this?
Here's some code to reproduce:
library(mice)
library(miceadds)
library(missForest)   # prodNA() lives here

mt.mis <- prodNA(mtcars, noNA = 0.1)   # knock 10% of values out of mtcars
imputed <- mice(mt.mis, m = 5, maxit = 5, method = "pmm")
correlations <- miceadds::micombine.cor(mi.res = imputed, variables = c(1:3))
What I'm looking for is something like the output from cor(mtcars). Who can help?

I ended up writing my own function. It can probably be done much more efficiently, but this is what I made:
cormatrix <- function(r, N){
  x <- 1
  cormatrix <- matrix(nrow = N, ncol = N)  # create empty matrix
  # assumes r holds one value per pair (i, j) with i < j,
  # ordered row by row along the upper triangle
  for (i in 1:N) {
    for (j in i:N) {
      if (j > i) {
        cormatrix[i, j] <- r[x]
        cormatrix[j, i] <- r[x]   # mirror, since correlation is symmetric
        x <- x + 1
      }
    }
  }
  diag(cormatrix) <- 1   # each variable correlates perfectly with itself
  cormatrix
}
You can call it with the r column of the micombine.cor output and the number of variables in your model as arguments, for example cormatrix(correlations$r, ncol(df)).
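For what it's worth, the same reshaping can be done without explicit loops. Here is a minimal sketch that makes the same assumption as the function above, namely that r holds one value per pair (i, j) with i < j, ordered row by row along the upper triangle:

cormatrix2 <- function(r, N){
  m <- matrix(0, N, N)
  m[lower.tri(m)] <- r   # column-by-column lower-triangle fill consumes
                         # the pairs (1,2), (1,3), ..., (2,3), ... in order
  m <- m + t(m)          # mirror into the upper triangle
  diag(m) <- 1
  m
}

The same call works for the p-value column, e.g. cormatrix2(correlations$p, ncol(df)), with the diagonal set to NA instead of 1.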

Related

Find slope by linear regression of 2 matrices (R)

I have 2 matrices. One contains the quantities of each product a client bought (quantitymatrix); the other contains the unit prices at which the client bought those products (pricematrix).
How can I run a linear regression with the matrices so that I obtain the slope for each product?
Your data:
quantity <- matrix(c(4,2,6, 9,4,3, 1,1,2, 3,1,5), 3, 4)
price <- matrix(c(1,0.5,8, 4.2,1.2,2, 2,5,2, 1,2.5,1), 3, 4)
First, transform your two matrices into a single data frame. (You can avoid this if you want, but I think it makes things much more straightforward:)
df <- data.frame(quantity = as.numeric(quantity),
price = as.numeric(price),
product = rep(1:4, each = 3), ID = 1:3)
Then, run the linear models by groups:
lms <- by(df, df$product, FUN = function(x) lm(price~quantity, data = x))
And get the slopes:
slopes <- sapply(lms, coef)[2,]
If, however, you want to keep the original matrices as they are, you can run a simple loop:
slopes <- numeric(dim(price)[2])   # one slope per product column
for (i in 1:dim(price)[2]) {
  model <- lm(price[,i] ~ quantity[,i])
  slopes[i] <- coef(model)[2]
}
NB: this solution assumes that the two matrices have identical dimensions.
And if you want to avoid loops, the following solution may be faster:
f <- function(x, y) coef(lm(y ~ x))[2]   # slope of y regressed on x
l <- function(m) lapply(seq_len(ncol(m)), function(i) m[,i])   # split matrix into columns
mapply(f, l(quantity), l(price))
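As a quick sanity check, the grouped-lm and mapply routes should agree on the sample data above:

slopes_by     <- sapply(lms, coef)[2, ]              # from the by() approach
slopes_mapply <- mapply(f, l(quantity), l(price))    # from the mapply approach
all.equal(unname(slopes_by), unname(slopes_mapply))  # should be TRUE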

How do I add new columns to a data set for each regression loop iteration?

I'm trying to test the predictive power of a model by breaking the observations into test (1/4) and train (3/4) groups, running a first-order regression on the train sample, using those coefficients to produce predicted values from the independent-variable test sample, and then adding these predicted values as new columns to the dependent-variable test data, one per loop iteration.
For context: TSIP500 is the full sample; iv is the independent variable; dv is the dependent variable; a maximum of 50 iterations is simply a test that isn't too large in quantity of iterations.
I was having trouble with the predict function, so I did the equation manually. My code is below:
for(i in 1:50){
  test_index <- sample(nrow(TSIP500iv), (1/4)*nrow(TSIP500iv), replace=FALSE)
  train_500iv <- TSIP500[-test_index,"distance"]
  test_500iv <- TSIP500[test_index,"distance"]
  train_500dv <- TSIP500[-test_index,"percent_of_max"]
  test_500dv <- TSIP500[test_index,"percent_of_max"]
  reg_model <- lm(train_500dv~train_500iv)
  int <- reg_model$coeff[1]
  B1 <- reg_model$coeff[2]
  predicted <- (int + B1*test_500iv)
  predicted <- data.frame(predicted)
  test_500dv <- data.frame(test_500dv)
  test_500dv[,i] <- apply(predicted)
}
I've tried different approaches for the last line, but I always just get a singular column added. Any help would be tremendously appreciated.
You can build a name for each iteration's predictions with paste() and assign(), then collect the named objects afterwards:
for(i in 1:50){
  test_index <- sample(nrow(TSIP500), (1/4)*nrow(TSIP500), replace = FALSE)
  train_500iv <- TSIP500[-test_index, "distance"]
  test_500iv <- TSIP500[test_index, "distance"]
  train_500dv <- TSIP500[-test_index, "percent_of_max"]
  test_500dv <- TSIP500[test_index, "percent_of_max"]
  reg_model <- lm(train_500dv ~ train_500iv)
  int <- reg_model$coeff[1]
  B1 <- reg_model$coeff[2]
  # build a name like "pred_1" and store that iteration's predictions under it
  temp_results <- paste('pred', i, sep = '_')
  assign(temp_results, as.data.frame(int + B1*test_500iv))
}
# fetch the 50 prediction objects by name with mget() and bind them as columns;
# cbind'ing the name string itself is what yields a single (text) column
predictions <- do.call(cbind, mget(paste('pred', 1:50, sep = '_')))
test_500dv <- cbind(data.frame(test_500dv), predictions)
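For what it's worth, the assign()/get() bookkeeping can be avoided by preallocating a prediction matrix, and fitting with a formula on the data frame also makes predict() behave. A minimal sketch, assuming TSIP500 has the distance and percent_of_max columns described in the question:

n_test <- floor(nrow(TSIP500) / 4)
pred_mat <- matrix(NA_real_, nrow = n_test, ncol = 50,
                   dimnames = list(NULL, paste0("pred_", 1:50)))
for (i in 1:50) {
  test_index <- sample(nrow(TSIP500), n_test, replace = FALSE)
  fit <- lm(percent_of_max ~ distance, data = TSIP500[-test_index, ])
  # with a formula + data frame fit, predict() handles the coefficients for you
  pred_mat[, i] <- predict(fit, newdata = TSIP500[test_index, ])
}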

Computing Spearman's rho for increasing subsets of rows in for Loop

I am trying to write a for loop in R that runs correlations for multiple subsets of a data frame and then stores the results in a vector.
What I have is a data frame with 2 columns, x and y, and 30 rows of different continuous measurement values in each column. The process should be repeated 100 times. The data can be invented.
What I need is to compute Spearman's rho between x and y for the first five rows, and then for increasing subsets (e.g., the first six rows, the first seven rows, etc.). Then I'd need to store the rho results in a vector that I can use further on.
What I had in mind (but does not work):
sortvector <- 1:30
for (i in 1:100) {
  sortvector <- sample(sortvector, replace = F)
  xtemp <- x[sortvector]
  rho <- cor.test(xtemp, y, method = "spearman")$estimate
}
The problem is that the code gives me one value of rho for the whole data frame, but I need it for increasing subsets.
How can I get rho for subsets of increasing size in a for loop? And how can I store the coefficients in a vector that I can use afterwards?
Any help would be much appreciated, thanks.
Cheers
The easiest approach is to convert the for loop into an sapply call, which returns a vector of rho's as the result of your bootstrapping:
sortvector <- 1:30
x <- rnorm(30)
y <- rnorm(30)
rho <- sapply(1:100, function(i) {
  sortvector <- sample(sortvector, replace = F)
  xtemp <- x[sortvector]
  cor.test(xtemp, y, method = "spearman")$estimate
})
head(rho)
head(rho)
Output:
rho rho rho rho rho rho
0.014460512 -0.239599555 0.003337041 -0.126585095 0.007341491 0.264516129
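Note that the question also asks for rho over increasing subsets of rows; that part can be handled the same way, reusing x and y from above:

ns <- 5:30   # first 5 rows, first 6 rows, ..., all 30 rows
rho_inc <- sapply(ns, function(n) {
  cor.test(x[1:n], y[1:n], method = "spearman")$estimate
})
names(rho_inc) <- ns   # label each rho with its subset size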

How to calculate correlation matrix between binary variables in r?

I have a data frame of 10 binary variables that looks like this:
V1 V2 V3...
0 1 1
1 1 0
1 0 1
0 0 1
I need to get the correlation matrix then I can do factor analysis.
psych::corr.test can calculate the correlation matrix, but it only has the Pearson, Spearman, and Kendall methods, which aren't meant for binary data.
So how can I calculate the correlation matrix of this data frame?
# create data: 20 rows x 10 binary columns
m <- matrix(sample(x = 0:1, size = 200, replace = T), ncol = 10)
colnames(m) <- LETTERS[1:10]
m
# create a matching matrix: entry [i, j] is the proportion of rows
# in which columns i and j take the same value
res <- data.frame()
for (i in seq(ncol(m))) {
  z <- m[,i]
  z <- apply(m, 2, function(x){ sum(x == z) / length(z) })
  res <- rbind(res, z)
}
colnames(res) <- colnames(m)
rownames(res) <- colnames(m)
res <- as.matrix(res)
res
Correlation methods like these are designed for continuous data: https://www.quora.com/Is-it-possible-to-calculate-correlations-between-binary-variables
You could try non-parametric methods instead: http://www.cedar.buffalo.edu/papers/articles/CVPRIP03_propbina.pdf
You can still get to a factor analysis: calculate the % match between variables and remove one of each pair matching more than some threshold x%. This way you reduce the dimensionality of the data.
You can also use hierarchical clustering on the columns. hclust needs a dissimilarity matrix, e.g. from dist() with its binary method:
hclust(dist(t(m), method = "binary"))
and you can pick an agglomeration method from "ward.D", "ward.D2", "single", "complete", and so on via its method argument:
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust
Another solution is to visualize your binary matrix as a heatmap, where variables with common features show up as similar columns.
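For completeness: Pearson's r computed on 0/1 variables is the phi coefficient, so plain cor() is defensible here, and for factor analysis on binary items the tetrachoric correlation is a common choice. A minimal sketch, assuming the psych package is installed and m is the 0/1 matrix from above:

phi <- cor(m)                      # Pearson on 0/1 columns = phi coefficient
tet <- psych::tetrachoric(m)$rho   # tetrachoric correlation matrix for FA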

Computing linear regressions for every possible permutation of matrix columns

I have a (k x n) matrix. So far I have managed to regress column 1 on each and every other column (using the lm function), extracting only the coefficients.
fore.choose <- matrix(0, 1, NCOL(assets))
for (i in seq(1, NCOL(assets), 1)) {
  abc <- lm(assets[,1] ~ assets[,i])$coefficients
  fore.choose[1,i] <- abc[2:length(abc)]
}
The coefficients are placed in the fore.choose matrix.
What I now need to do is to linearly regress column 2 with each and every other column, and then column 3 and so on and so forth and extract only the coefficients.
The output will be a square matrix of OLS univariate coefficients. Kind of similar to a correlation matrix, but it is the beta coefficients I am interested in.
fore.choose <- matrix(0, 1, NCOL(assets))
will initially need to become
fore.choose <- matrix(0, NCOL(assets), NCOL(assets))
I'd just compute the coefficients directly from the correlation matrix, using the fact that the slope from regressing y on x is cor(x, y) * sd(y) / sd(x), like this:
# set up some sample data
set.seed(1)
d <- matrix(rnorm(50), ncol=5)
# get the coefficients: entry [i, j] is the slope from lm(d[,i] ~ d[,j])
s <- apply(d, 2, sd)
cor(d)*outer(s, s, "/")
You could also use lsfit to get the coefficients of one term on all the others at once and then only have one loop to do:
sapply(1:ncol(d), function(i) {
coef(lsfit(d[,i], d))[2,]
})
I'm sure there must be a more elegant way than nested loops, but this straightforward version works:
fore.choose <- matrix(NA, NCOL(assets), NCOL(assets))
abc <- NULL
for (i in seq_len(ncol(assets))) {    # loop over "dependent" columns
  for (j in seq_len(ncol(assets))) {  # loop over "independent" columns
    abc <- lm(assets[,i] ~ assets[,j])$coefficients
    fore.choose[i,j] <- abc[-1]
  }
}
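As a quick check that the closed-form shortcut and the regressions agree, here is a sketch using the sample matrix d from above in place of assets:

s <- apply(d, 2, sd)
b_direct <- cor(d) * outer(s, s, "/")   # closed form
b_lsfit <- sapply(1:ncol(d), function(i) coef(lsfit(d[,i], d))[2,])
all.equal(b_direct, b_lsfit, check.attributes = FALSE)   # should be TRUE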
