matrix of correlations - r

I'm new to using R and I am trying to create a matrix of correlations. I have three independent variables (x1,x2,x3) and one dependent variable (y).
I've been trying to use cor to make a matrix of the correlations, but so far I have been unable to find a formula for doing this.

x1=rnorm(20)
x2=rnorm(20)
x3=rnorm(20)
y=rnorm(20)
data=cbind(y,x1,x2,x3)
cor(data)

If I have understood correctly, you have a matrix with 3 columns (say x1 to x3) and many rows (one per y value). You can proceed as follows:
foo = matrix(runif(30), ncol=3) # creating a matrix of 3 columns
cor(foo)
If your values are already in 3 vectors x1 to x3, you can build foo like this: foo=data.frame(x1,x2,x3)
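Applied to the variables in the question, the same idea can be sketched as follows. The seed and the name dat are assumptions added for reproducibility; putting y in the data frame as well means its correlation with each predictor appears in the first row and column.

```r
# Minimal sketch: correlation matrix of y and the three predictors.
set.seed(42)                      # assumption: seed added only for reproducibility
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- rnorm(20)
y  <- rnorm(20)

dat <- data.frame(y, x1, x2, x3)  # 'dat' is an illustrative name
cm  <- cor(dat)                   # 4 x 4 Pearson correlation matrix

round(cm, 2)                      # full matrix, rounded for reading
cm["y", c("x1", "x2", "x3")]      # correlations of y with each predictor
```

The result is symmetric with ones on the diagonal, so either the row or the column for y can be read.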

Correct me if I'm wrong, but assuming this is related to a regression problem, this might be what you're looking for:
#Set the number of data points and build 3 independent variables
set.seed(0)
numdatpoi <- 7
x1 <- runif(numdatpoi)
x2 <- runif(numdatpoi)
x3 <- runif(numdatpoi)
#Build the dependent variable with some added noise
noisig <- 10
yact <- 2 + (3 * x1) + (5 * x2) + (10 * x3)
y <- yact + rnorm(n=numdatpoi, mean=0, sd=noisig)
#Fit a linear model
rmod <- lm(y ~ x1 + x2 + x3)
#Build the variance-covariance matrix. This matrix is typically what is wanted.
(vcv <- vcov(rmod))
#If needed, convert the variance-covariance matrix to a correlation matrix
(cm <- cov2cor(vcv))
From the above, here's the variance-covariance matrix of the coefficient estimates:
            (Intercept)        x1        x2        x3
(Intercept)    466.5773   14.3368 -251.1715 -506.1587
x1              14.3368  452.9569 -170.5603 -307.7007
x2            -251.1715 -170.5603  387.2546  255.9756
x3            -506.1587 -307.7007  255.9756  873.6784
And here's the associated correlation matrix:
            (Intercept)          x1         x2         x3
(Intercept)  1.00000000  0.03118617 -0.5908950 -0.7927735
x1           0.03118617  1.00000000 -0.4072406 -0.4891299
x2          -0.59089496 -0.40724064  1.0000000  0.4400728
x3          -0.79277352 -0.48912986  0.4400728  1.0000000
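To make the conversion step concrete, here is a minimal sketch of what cov2cor does: each covariance is divided by the product of the two corresponding standard deviations. The matrix is typed in from the output above; the names nm, s and cm_manual are illustrative.

```r
# Variance-covariance matrix copied from the printed output above
nm  <- c("(Intercept)", "x1", "x2", "x3")
vcv <- matrix(c( 466.5773,   14.3368, -251.1715, -506.1587,
                  14.3368,  452.9569, -170.5603, -307.7007,
                -251.1715, -170.5603,  387.2546,  255.9756,
                -506.1587, -307.7007,  255.9756,  873.6784),
              nrow = 4, byrow = TRUE, dimnames = list(nm, nm))

s         <- sqrt(diag(vcv))      # standard errors of the coefficients
cm_manual <- vcv / outer(s, s)    # cov(i, j) / (s_i * s_j)

all.equal(cm_manual, cov2cor(vcv))  # should agree with the built-in conversion
```

Note that this is the correlation between coefficient estimates, which is a different object from the correlation between the variables themselves given by cor.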

Related

How can I solve multicollinearity?

I constructed a linear model and tried to calculate the VIF of the variables but I get the following error:
vif(lm_model3101)
Error in vif.default(lm_model3101) :
there are aliased coefficients in the model
To check which numeric variables are correlated, I calculated the correlations among the numeric variables used, and there is no perfect or nearly perfect correlation between any of them:
cor(multi)
                        mydata..CRU.Index. mydata..GDP.per.capita. mydata.price_per_unit mydata.price_discount mydata..AC..Volume.
mydata..CRU.Index.             1.000000000             0.006036169             0.1646463          -0.097077238        -0.006590327
mydata..GDP.per.capita.        0.006036169             1.000000000             0.1526220           0.008135387        -0.137733119
mydata.price_per_unit          0.164646319             0.152621974             1.0000000          -0.100344865        -0.310770525
mydata.price_discount         -0.097077238             0.008135387            -0.1003449           1.000000000         0.339961760
mydata..AC..Volume.           -0.006590327            -0.137733119            -0.3107705           0.339961760         1.000000000
What could the problem be? Any help or suggestions? The rest of our explanatory variables are factors, so they cannot be correlated.
Having aliased coefficients doesn't necessarily mean two predictors are perfectly correlated. It means that they are linearly dependent, that is, at least one term is a linear combination of the others. They could be factors or continuous variables. To find them, use the alias function. For example:
y <- runif(10)
x1 <- runif(10)
x2 <- runif(10)
x3 <- x1 + x2
alias(y~x1+x2+x3)
Model :
y ~ x1 + x2 + x3
Complete :
   (Intercept) x1 x2
x3 0           1  1
This identifies x3 as being the sum of x1 and x2
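Since the question involves factors, here is a minimal sketch of the same linear dependence arising purely from categorical predictors. The variables region and store are hypothetical: every store belongs to exactly one region, so the region dummy is a sum of store dummies even though no pairwise correlation is perfect.

```r
# Hypothetical nested factors: stores a, b are in "north"; c, d are in "south"
set.seed(1)
region <- factor(rep(c("north", "south"), each = 6))
store  <- factor(rep(c("a", "b", "c", "d"), each = 3))
y      <- rnorm(12)

# The design matrix has more columns than its rank, so one term is aliased
X <- model.matrix(~ region + store)
ncol(X)
qr(X)$rank

# alias() reports which term is a linear combination of the others
a <- alias(y ~ region + store)
a$Complete
```

Comparing qr(X)$rank with ncol(X) is a quick check for this kind of dependence before calling vif.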

subset data frame according to a criteria based on correlation

I have a data frame and would like to create a function that keeps only variables with low correlation. This means looking at the pairwise correlation of each variable with all the others; if at least one correlation coefficient for a variable is greater than 0.4, then that variable and the one it is highly correlated with are removed from the data frame.
For example suppose I have a data frame:
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
cor(data, use="pairwise.complete.obs")
            x1         x2          x3         x4
x1  1.00000000 -0.3325757  0.08567911  0.2651721
x2 -0.33257569  1.0000000 -0.18761301  0.4660056
x3  0.08567911 -0.1876130  1.00000000 -0.3321003
x4  0.26517210  0.4660056 -0.33210031  1.0000000
I would then like to return a data frame keeping only x1 and x3 (given that x2 and x4 have a correlation of 0.46).
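Calculate the absolute correlation matrix and flag every entry greater than 0.4 in the logical matrix cd.
Then drop the flagged columns, ignoring the diagonal (where row == col):
cd <- abs(cor(data, use="pairwise.complete.obs")) > 0.4
bad <- unique(col(cd)[cd & row(cd) != col(cd)])
# setdiff avoids data[-integer(0)], which would return no columns when nothing is flagged
data[setdiff(seq_along(data), bad)]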
You could try:
set.seed(50)
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
mycor <- cor(data, use="pairwise.complete.obs")
data[, !apply(mycor, 2, function (x) max(x[-which.max(x)]) >.4 | min(x[which.min(x)]) < -.4) ]
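Since the question asks for a function, the approaches above can be wrapped up roughly as follows. This is a sketch only: the name drop_correlated, its threshold argument and the demo data frame are assumptions, not something from the answers.

```r
# Hypothetical helper: keep only columns whose absolute correlation with
# every other column is at or below the threshold.
drop_correlated <- function(df, threshold = 0.4) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  diag(cm) <- 0                                   # ignore self-correlations
  keep <- apply(cm, 2, function(x) all(x <= threshold, na.rm = TRUE))
  df[, keep, drop = FALSE]                        # drop = FALSE keeps a data frame
}

# Deterministic demo: x2 is an exact linear function of x1, x3 is an
# alternating sequence that is only weakly correlated with both.
demo <- data.frame(x1 = 1:10,
                   x2 = 2 * (1:10) + 1,
                   x3 = rep(c(1, -1), 5))
names(drop_correlated(demo))
```

Setting the diagonal to zero avoids the special-casing of self-correlations, and drop = FALSE keeps the result a data frame even when a single column survives.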

Subset of predictors using coefplot()

I'd like to do a plot of coefficients using coefplot() that only takes into account a subset of the predictors that I'm using. For example, if you have the code
y1 <- rnorm(1000,50,23)
x1 <- rnorm(1000,50,2)
x2 <- rbinom(1000,1,prob=0.63)
x3 <- rpois(1000, 2)
fit1 <- lm(y1 ~ x1 + x2 + x3)
and then ran
coefplot(fit1)
it would give you a plot displaying the coefficients of the intercept, x1, x2 and x3. How can I modify this so I only get the coefficients for say, x1 and x2?
You can use the argument predictors and it will only plot the coefficients you need:
library(coefplot)
coefplot(fit1, predictors=c('x1','x2'))

Simultaneous imputation of multiple binary variables in R

I have a dataset with multiple correlated binary variables (0/1). Can anyone point me towards a way to impute values that are missing completely at random, based on the information in the other variables?
Below, I provide some code to create a simplified dataset with just 3 correlated binary variables.
# create correlated random binary (0/1) variables
x1 <- runif(100,0,1) # U(0,1)
x2 <- x1 * runif(100,0,1) # x1 scaled by U(0,1)
x3 <- x2 * runif(100,0,1)+0.2 # x2 scaled by U(0,1), shifted by 0.2
x1 <- round(x1)
x2 <- round(x2)
x3 <- round(x3)
#introduce random missing (MCAR)
x1[seq(1,100,7)]<-NA
x2[seq(2,100,7)]<-NA
x3[seq(3,100,7)]<-NA
# how can I impute missing values in this dataframe?
df <- as.data.frame(cbind(x1,x2,x3))
cor(df,use="pairwise.complete.obs")
Thanks so much,
Micha
You could use the mice package.
> library(mice)
Loading required package: Rcpp
mice 2.21 2014-02-05
> df.imputed <- complete(mice(df))
# mice output deleted
> nrow(df) == sum(complete.cases(df.imputed))
[1] TRUE
> cor(df.imputed)
          x1        x2        x3
x1 1.0000000 0.4645345 0.2914986
x2 0.4645345 1.0000000 0.6787420
x3 0.2914986 0.6787420 1.0000000

optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
In other words, there is a quadratic relationship between Y and X1 and a linear relationship between Y and X2.
Now here are my questions:
1. How do I find the optimal value of (x1,x2), that is, the pair at which the fitted model reaches its maximum?
2. Assuming X2 has to be fixed at some particular value, how do I find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum fitted value is 119.2 and occurs at (X1,X2) = (19,0.930). When X2 is constrained to 0, the maximum fitted value is 35.1 and occurs at (X1,X2) = (12,0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials by default, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit, e.g. a + b × x + c × x^2, you are better off doing that explicitly with Y~X1 +I(X1^2)+X2, or using raw=T in the call to poly(...).
credit to #sashkello
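The resolution limit noted above (the maximum can only be found at an observed data point) can be avoided by maximizing the fitted surface with a one-dimensional optimizer. A minimal sketch, assuming the explicit I(X1^2) form of the model, X2 held at 0, and the same simulated data; the names fit2, pred_at, opt and vertex are illustrative.

```r
# Same simulated data as in the answer above
set.seed(1)
X1 <- 1:100
X2 <- sin(2 * pi / 100 * (1:100))
df <- data.frame(Y = 3 + 5 * X1 - 0.2 * X1^2 + 100 * X2 + rnorm(100, 0, 5), X1, X2)

# Explicit quadratic term, so the coefficients are directly interpretable
fit2 <- lm(Y ~ X1 + I(X1^2) + X2, data = df)

# Fitted value as a function of X1, with X2 fixed (0 here, an assumption)
pred_at <- function(x1, x2 = 0) {
  predict(fit2, newdata = data.frame(X1 = x1, X2 = x2))
}

# Search only within the observed range of X1 to avoid extrapolating
opt <- optimize(pred_at, interval = range(df$X1), maximum = TRUE)
opt$maximum     # X1 at which the fitted value peaks
opt$objective   # the fitted value there

# Cross-check against the closed-form vertex of the fitted parabola
b <- coef(fit2)
vertex <- -b[["X1"]] / (2 * b[["I(X1^2)"]])
```

Because the search goes through predict rather than hand-built formulas, the same pattern carries over to models with many predictors; for more than one free variable, optim would take the place of optimize.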
Basically, I would have to extract the coefficients from the lm object and multiply them by the corresponding terms to build the formula before proceeding.
I think this is not very efficient. What if this is a regression with hundreds of predictors?

Resources