I have a set of data in R and I want to run a regression to test for correlation using custom coefficients.
Example:
x = lm(a ~ b + c + d, data=data, weights=weights)
That gives me coefficients for b, c, and d, but I just want to give b, c, and d my own coefficients and find, for example, the r^2. How would I do so?
Let's assume your predetermined coefficients are stored in a numeric vector named vec (intercept first, then the slopes for b, c, and d), and that none of b, c, or d are factors or character vectors:
(x = lm(a ~ 1, data = data, offset = apply(data[c("b", "c", "d")], 1, function(x) { sum(c(1, x) * vec) })))
This should produce a model whose fitted values are built from your specified coefficients (they enter through the offset term). You will probably also want to look at:
summary(x)
As always... if you want tested code, then provide a dataset for testing. With the mtcars dataframe:
m1 = lm(mpg ~ carb + wt, data=mtcars)
vec <- coef(m1)
(x = lm(mpg ~ 1, data = mtcars,
        offset = apply(mtcars[c("carb", "wt")], 1,
                       function(x) { sum(c(1, x) * vec) })))
Call:
lm(formula = mpg ~ 1, data = mtcars, offset = apply(mtcars[c("carb",
"wt")], 1, function(x) {
sum( c(1, x) * vec)
}))
Coefficients:
(Intercept)
-7.85e-17
So the offset model (with the coefficients used in the offset) is essentially an exact fit to the m1 model.
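As a quick sanity check (a small sketch reusing m1 and the offset model x from above), the fitted values of the two models should agree up to numerical noise:
all.equal(unname(fitted(x)), unname(fitted(m1)))
# should return TRUE, since the offset carries all of m1's coefficients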
@BondedDust's method will be more efficient in the long run, but just for illustration, here's a simple example of how to create your own function to calculate R-squared for any regression coefficients you choose. We'll use the mtcars data set, which is built into R.
Assume a regression model that predicts "mpg" using the independent variables "carb" and "wt". Here a, b, and c are the three regression parameters (the intercept and the two slopes) that we need to provide to the function.
# Function to calculate R-squared
R2 = function(a, b, c) {
  # Calculate the residual sum of squares from the regression model
  SSresid = sum(((a + b * mtcars$carb + c * mtcars$wt) - mtcars$mpg)^2)
  # Calculate the total sum of squares
  SStot = sum((mtcars$mpg - mean(mtcars$mpg))^2)
  # Calculate and return the R-squared for the regression model
  return(1 - SSresid / SStot)
}
Now let's run the function. First, let's check that it matches the R-squared calculated by lm: we'll fit a regression model in R, feed that model's coefficients to our function, and compare the result to the lm output:
# Create regression model
m1 = lm(mpg ~ carb + wt, data=mtcars)
summary(m1)
Call:
lm(formula = mpg ~ carb + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5206 -2.1223 -0.0467 1.4551 5.9736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.7300 1.7602 21.435 < 2e-16 ***
carb -0.8215 0.3492 -2.353 0.0256 *
wt -4.7646 0.5765 -8.265 4.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.839 on 29 degrees of freedom
Multiple R-squared: 0.7924, Adjusted R-squared: 0.7781
F-statistic: 55.36 on 2 and 29 DF, p-value: 1.255e-10
From the summary, we can see that the R-squared is 0.7924. Let's see what we get from the function we just created. All we need to do is feed our function the three regression coefficients listed in the summary above. We can hard-code those numbers, or we can extract the coefficients from the model object m1 (which is what I've done below):
R2(coef(m1)[1], coef(m1)[2], coef(m1)[3])
[1] 0.7924425
Now let's calculate the R-squared for other choices of the regression coefficients:
a = 37; b = -1; c = -3.5
R2(a, b, c)
[1] 0.5277607
a = 37; b = -2; c = -5
R2(a, b, c)
[1] 0.0256494
To check lots of values of a parameter at once, you can, for example, use sapply. The code below will return the R-squared for values of c ranging from -7 to -3 in increments of 0.1 (with the other two parameters set to the values returned by lm):
sapply(seq(-7,-3,0.1), function(x) R2(coef(m1)[1],coef(m1)[2],x))
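If you want to reuse this idea beyond mtcars, here is a hypothetical generalization (my own sketch, not part of the original answer) that takes any coefficient vector (intercept first) and any set of predictor columns:
# R-squared for user-supplied coefficients: coefs = c(intercept, slopes ...),
# predictors = data frame of columns in the same order as the slopes
R2_custom = function(coefs, response, predictors) {
  pred = coefs[1] + as.matrix(predictors) %*% coefs[-1]
  SSresid = sum((response - pred)^2)
  SStot = sum((response - mean(response))^2)
  1 - SSresid / SStot
}
# should reproduce the lm() R-squared from above
R2_custom(coef(m1), mtcars$mpg, mtcars[c("carb", "wt")])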
I am trying to build a fixed effects regression with the plm package in R. I am using country level panel data with year and country fixed effects.
My problem concerns two explanatory variables: one is an interaction term of two variables and the other is the square of one of the variables.
The model is basically:
y = x1 + x1^2 + x2 + x1*x2 + ... + xn, with all variables in log form.
It is central to the model to include the squared term, but when I run the regression it always gets excluded because of "singularities", as x1 and x1^2 are obviously correlated.
Meaning the regression runs and I get estimates for the other variables, just not for x1^2 and x1*x2.
How do I circumvent this?
library(plm)
fe_reg<- plm(log(y) ~ log(x1)+log(x2)+log(x2^2)+log(x1*x2)+dummy,
data = df,
index = c("country", "year"),
model = "within",
effect = "twoways")
summary(fe_reg)
I have tried defining the interaction and squared terms as vectors, which helped with the interaction term but not the squared term.
df1.pd <- df1 %>% mutate_at(c('x1'), ~(scale(.) %>% as.vector))
df1.pd <- df1.pd %>% mutate_at(c('x2'), ~(scale(.) %>% as.vector))
I am pretty new to R, so apologies if this is not a very well structured question.
You just found two properties of the logarithm function:
log(x^2) = 2 * log(x)
log(x*y) = log(x) + log(y)
Then, obviously, log(x) is collinear with 2*log(x) and one of the two collinear variables is dropped from the estimation. Same for log(x*y) and log(x) + log(y).
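A quick numeric check of those two identities:
x <- c(2, 5, 10); y <- c(3, 4, 6)
all.equal(log(x^2), 2 * log(x))        # TRUE
all.equal(log(x * y), log(x) + log(y)) # TRUE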
So, the model you want to estimate is not estimable by linear regression methods. You might want to consider data transformations other than the log, or use the original variables.
See also the reproducible example below, where I just used log(x^2) = 2*log(x). Linear dependence can be detected, e.g., via the function detect.lindep from package plm (see also below). Dropped coefficients also hint at collinear columns in the model's estimation matrix. At times, linear dependence appears only after the data transformations involved in the estimation functions; for an example involving the within transformation, see the Examples section of the help page ?detect.lindep.
library(plm)
data("Grunfeld")
pGrun <- pdata.frame(Grunfeld)
pGrun$lvalue <- log(pGrun$value) # log(x)
pGrun$lvalue2 <- log(pGrun$value^2) # log(x^2) == 2 * log(x)
mod <- plm(inv ~ lvalue + lvalue2 + capital, data = pGrun, model = "within")
summary(mod)
#> Oneway (individual) effect Within Model
#>
#> Call:
#> plm(formula = inv ~ lvalue + lvalue2 + capital, data = pGrun,
#> model = "within")
#>
#> Balanced Panel: n = 10, T = 20, N = 200
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -186.62916 -20.56311 -0.17669 20.66673 300.87714
#>
#> Coefficients: (1 dropped because of singularities)
#> Estimate Std. Error t-value Pr(>|t|)
#> lvalue 30.979345 17.592730 1.7609 0.07988 .
#> capital 0.360764 0.020078 17.9678 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 2244400
#> Residual Sum of Squares: 751290
#> R-Squared: 0.66525
#> Adj. R-Squared: 0.64567
#> F-statistic: 186.81 on 2 and 188 DF, p-value: < 2.22e-16
detect.lindep(mod) # run on the model
#> [1] "Suspicious column number(s): 1, 2"
#> [1] "Suspicious column name(s): lvalue, lvalue2"
detect.lindep(pGrun) # run on the data
#> [1] "Suspicious column number(s): 6, 7"
#> [1] "Suspicious column name(s): lvalue, lvalue2"
I have an issue when calculating logistic regression in R that, to me, makes no sense.
I have one parameter in the model, positive numbers (molecular weight).
I have a binary response variable, let's say either A or B.
My data table is called df1.
str(df1)
'data.frame': 1015 obs. of 2 variables:
$ Protein_Class: chr "A" "A" "A" "B" ...
$ MW : num 47114 29586 26665 34284 104297 ...
I make the model:
summary(glm(as.factor(df1[,1]) ~ df1[,2],family="binomial"))
The results are:
Call:
glm(formula = as.factor(df1[, 1]) ~ df1[, 2], family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5556 -1.5516 0.8430 0.8439 0.8507
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.562e-01 1.251e-01 6.842 7.8e-12 ***
df1[, 2] -1.903e-07 3.044e-06 -0.063 0.95
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1239.2 on 1013 degrees of freedom
AIC: 1243.2
Number of Fisher Scoring iterations: 4
That's all fine and good until this point.
But, when I take the logarithm of my variable:
summary(glm(as.factor(df1[,1]) ~ log10(df1[,2]),family="binomial"))
Call:
glm(formula = as.factor(df1[, 1]) ~ log10(df1[, 2]), family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8948 -1.4261 0.8007 0.8528 1.0469
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7235 1.1169 -2.438 0.01475 *
log10(df1[, 2]) 0.8038 0.2514 3.197 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1228.9 on 1013 degrees of freedom
AIC: 1232.9
Number of Fisher Scoring iterations: 4
The p-value has changed!
How can this be? And more importantly, which one to use?
My understanding was that logistic regression is based on ranks, and all I do is a monotone transformation. Note, that the AUROC curve of the model remains the same.
There are no zero or negative values that are lost during the transformation.
Did I miss something here?
Any advice?
There are a couple of things to think about. First, you can probably constrain your search to one side or the other of 1. Transformations that decrease the power on x (square root, log, inverse, etc.) all have a similar type of effect, to differing degrees: they pull in big values and spread out small values. Powers greater than 1 do the opposite, increasing the spread among big values and decreasing the spread among small values (all assuming your variable has no non-positive values). So this is really a question of what kind of transformation you want, and then how severe it needs to be.
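As a quick numeric illustration of that compressing effect (my own addition, not part of the worked example that follows):
v <- c(1, 10, 100, 1000)
rbind(sqrt = sqrt(v), log = log(v), neg_inverse = -1/v)
# each transformation pulls the large values in, to differing degrees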
First, what kind of transformation do you need? I made some fake data to illustrate the point:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1234)
x <- runif(1000, 1, 10000)
y.star <- -6 + log(x)
y <- rbinom(1000, 1, plogis(y.star) )
df <- tibble(
y=y,
x=x,
ystar=y.star)
Next, since this is just a bivariate relationship, we could plot it out with a loess curve. In particular, though, we want to know what the log-odds of y look like with respect to x. We can do this by transforming the predictions from the loess curve with the logistic quantile function, qlogis() - this takes the probabilities and puts them in log-odds form. Then, we could make the plot.
lo <- loess(y ~ x, span=.75)
df <- df %>% mutate(fit = predict(lo),
fit = case_when(
fit < .01 ~ .01,
fit > .99 ~ .99,
TRUE ~ fit))
ggplot(df) +
geom_line(aes(x=x, y=qlogis(fit)))
This looks like a classic log relationship. We could then implement a few different transformations and plot those - square root, log and negative inverse.
lo1 <- loess(y ~ sqrt(x), span=.5)
lo2 <- loess(y ~ log(x), span=.5)
lo3 <- loess(y ~ I(-(1/x)), span=.5)
df <- df %>% mutate(fit1 = predict(lo1),
fit1 = case_when(
fit1 < .01 ~ .01,
fit1 > .99 ~ .99,
TRUE ~ fit1))
df <- df %>% mutate(fit2 = predict(lo2),
fit2 = case_when(
fit2 < .01 ~ .01,
fit2 > .99 ~ .99,
TRUE ~ fit2))
df <- df %>% mutate(fit3 = predict(lo3),
fit3 = case_when(
fit3 < .01 ~ .01,
fit3 > .99 ~ .99,
TRUE ~ fit3))
Next, we need to transform the data so the plotting will look right:
plot.df <- df %>%
tidyr::pivot_longer(cols=starts_with("fit"),
names_to="var",
values_to="vals") %>%
mutate(x2 = case_when(
var == "fit" ~ x,
var == "fit1" ~ sqrt(x),
var == "fit2" ~ log(x),
var == "fit3" ~ -(1/x),
TRUE ~ x),
var = factor(var, labels=c("Original", "Square Root", "Log", "Inverse")))
Then, we can make the plot:
ggplot(plot.df, aes(x=x2, y=vals)) +
geom_line() +
facet_wrap(~var, scales="free_x")
Here, it looks like the log is the most linear of the bunch - not surprising since we made the variable y.star with log(x). If we wanted to test between these different possibilities, we could use a paired sign test that Kevin Clarke, a political scientist at Rochester, proposed for evaluating the difference between non-nested models. There is a paper about it here. I wrote a package called clarkeTest that implements this in R. So, we could use this to test the various alternatives:
m0 <- glm(y ~ x, data=df, family=binomial)
m1 <- glm(y ~ sqrt(x), data=df, family=binomial)
m2 <- glm(y ~ log(x), data=df, family=binomial)
m3 <- glm(y ~ I(-(1/x)), data=df, family=binomial)
Testing the original against the square root:
library(clarkeTest)
clarke_test(m0, m1)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -296
# Observations: 1000
# Test statistic: 400 (40%)
#
# Model 2 is preferred (p = 2.7e-10)
This shows that the square root is better than the original un-transformed variable.
clarke_test(m0, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 462 (46%)
#
# Model 2 is preferred (p = 0.018)
The above shows that the log is better than the un-transformed variable.
clarke_test(m0, m3)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -292
# Observations: 1000
# Test statistic: 550 (55%)
#
# Model 1 is preferred (p = 0.0017)
The above shows that the un-transformed variable is preferred to the negative inverse. Then, we can compare the two transformations that beat the original against each other.
clarke_test(m1, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 536 (54%)
#
# Model 1 is preferred (p = 0.025)
This shows that the square root is better than the log transformation in terms of individual log-likelihoods.
Another option would be a grid search over possible transformations, looking at the AIC each time. We first have to write a function that handles the case where the transformation power = 0, where we substitute the log. Then we can fit a model for each transformation and collect the AICs.
grid <- seq(-1,1, by=.1)
trans <- function(x, power){
if(power == 0){
tx <- log(x)
}else{
tx <- x^power
}
tx
}
mods <- lapply(grid, function(p)glm(y ~ trans(x, p),
data=df,
family=binomial))
aic.df <- tibble(
power = grid,
aic = sapply(mods, AIC))
Next, we can plot the AICs as a function of the power.
ggplot(aic.df, aes(x=power, y=aic)) +
geom_line()
This tells us that about -.25 is the appropriate transformation parameter. Note that there is a discrepancy between the Clarke test results and the AIC because AIC is based on the overall log-likelihood and the Clarke test is based on differences in the individual log-likelihoods.
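To pull the minimizing power out of the grid programmatically rather than reading it off the plot (a small sketch using the aic.df tibble built above):
aic.df$power[which.min(aic.df$aic)]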
We would find that this new proposed transformation is also worse than the square root:
m4 <- glm(y ~ I(x^-.25), data=df, family=binomial)
clarke_test(m1, m4)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -283
# Observations: 1000
# Test statistic: 559 (56%)
#
# Model 1 is preferred (p = 0.00021)
So, if you have a couple of different candidates in mind and you like the idea behind the Clarke test, you could use that to find the appropriate transformation. If you don't have a candidate in mind, a grid search is always a possibility.
Say I have the data frame shown in the image below, and I want to split it into two new categories based on region, so one would be BC and the other NZ. How do I achieve this in R?
[image of the data frame]
Here is an example with the mtcars data where we use the transmission variable am to plot different groups in a scatterplot with the ggplot2 package.
We will create a scatterplot with the displacement variable on the x axis and miles per gallon on the y axis. Since cars with larger engine displacement typically consume more gasoline than those with smaller displacement, we expect to see a negative relationship (mpg is higher with low values of displacement) in the chart.
First, we convert am to a factor variable so the legend prints two categories instead of a continuum between 0 and 1. Then we use ggplot() and geom_point() to set the point color based on the value of am.
library(ggplot2)
mtcars$am <- factor(mtcars$am,labels = c("automatic","manual"))
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point(aes(color = am))
...and the output:
Separating charts by group with facets
We can use ggplot2 directly to generate separate charts by a grouping variable. In ggplot2 this is known as a facetted chart. We use facet_wrap() to split the data by values of am as follows.
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2)
...and the output:
Adding regression line and confidence intervals
Given the comments in the original question, we can add a regression line to the plot by using the geom_smooth() function, which defaults to loess smoothing.
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2) +
geom_smooth(span = 1)
...and the output:
To use a simple linear regression instead of loess smoothing, we use the method argument in geom_smooth() and set it to "lm".
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2) +
geom_smooth(method = "lm")
...and the output:
Generate regression models by group
Here we split the data frame by values of am, and use lapply() to generate regression models for each group.
carsList <- split(mtcars,mtcars$am)
lapply(carsList,function(x){
summary(lm(mpg ~ disp,data = x))
})
...and the output:
$automatic
Call:
lm(formula = mpg ~ disp, data = x)
Residuals:
Min 1Q Median 3Q Max
-2.7341 -1.6546 -0.8855 1.6032 5.0764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.157064 1.592922 15.79 1.36e-11 ***
disp -0.027584 0.005146 -5.36 5.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.405 on 17 degrees of freedom
Multiple R-squared: 0.6283, Adjusted R-squared: 0.6064
F-statistic: 28.73 on 1 and 17 DF, p-value: 5.194e-05
$manual
Call:
lm(formula = mpg ~ disp, data = x)
Residuals:
Min 1Q Median 3Q Max
-4.6056 -2.4200 -0.0956 3.1484 5.2315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.86614 1.95033 16.852 3.33e-09 ***
disp -0.05904 0.01174 -5.031 0.000383 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.545 on 11 degrees of freedom
Multiple R-squared: 0.6971, Adjusted R-squared: 0.6695
F-statistic: 25.31 on 1 and 11 DF, p-value: 0.0003834
NOTE: since this is an example illustrating the code necessary to generate a regression analysis with a split variable, we won't go into the details about whether the data here conforms to modeling assumptions for Ordinary Least Squares regression.
Modeling the groups in one regression model
As I noted in the comments, we can account for the differences between automatic and manual transmissions in one regression model if we specify the am effect as well as an interaction effect between am and disp.
summary(lm(mpg ~ disp + am + am * disp,data=mtcars))
We can demonstrate that this model generates the same predictions as the split model for manual transmissions by generating predictions from each model as follows.
data <- data.frame(am = c(1,1,0),
disp = c(157,248,300))
data$am <- factor(data$am,labels = c("automatic","manual"))
mod1 <- lm(mpg ~ disp + am + am * disp,data=mtcars)
predict(mod1,data)
mod2 <- lm(mpg ~ disp,data = mtcars[mtcars$am == "manual",])
predict(mod2,data[data$am == "manual",])
...and the output:
> data <- data.frame(am = c(1,1,0),
+ disp = c(157,248,300))
> data$am <- factor(data$am,labels = c("automatic","manual"))
> mod1 <- lm(mpg ~ disp + am + am * disp,data=mtcars)
> predict(mod1,data)
1 2 3
23.59711 18.22461 16.88199
> mod2 <- lm(mpg ~ disp,data = mtcars[mtcars$am == "manual",])
> predict(mod2,data[data$am == "manual",])
1 2
23.59711 18.22461
We subset the data prior to predict() for the split model so that we only generate predictions for observations with manual transmissions. Since the predictions match, this demonstrates that building separate models by transmission type is no different from a fully specified model that includes both the categorical am effect and an interaction effect for am * disp.
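To extend that check beyond three hand-picked rows (a small sketch assuming mod1 and mod2 from above are still in the workspace), the two models produce identical predictions for every manual-transmission car in mtcars:
manual <- mtcars[mtcars$am == "manual", ]
all.equal(unname(predict(mod1, manual)), unname(predict(mod2, manual)))
# should return TRUE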
I've been told, and have seen examples, that a linear model and a t test are basically the same test: the t test is just a linear model with a dummy-coded predictor. Is there a way to get lm to output the same t values, p values, confidence intervals, and standard errors as R's t.test function, whose var.equal argument defaults to FALSE?
For example, right now the outputs of lm and t.test are different:
data("mtcars")
#these outputs below give me different values
summary(lm(mpg ~ am, mtcars))
t.test(mpg ~ am, mtcars)
What I want is to make lm produce the same values as the t.test function, which by default runs a Welch t test. How would I do that?
First off, there exists a great post on CrossValidated How are regression, the t-test, and the ANOVA all versions of the general linear model?
that gives a lot of background information on the relationship between a t-test, linear regression and ANOVA.
In essence, the p-value from a t-test corresponds to the p-value of the slope parameter in a linear model.
In your case, you need to compare
t.test(mpg ~ am, mtcars, alternative = "two.sided", var.equal = T)
#
# Two Sample t-test
#
#data: mpg by am
#t = -4.1061, df = 30, p-value = 0.000285
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -10.84837 -3.64151
#sample estimates:
#mean in group 0 mean in group 1
# 17.14737 24.39231
fit <- lm(mpg ~ as.factor(am), mtcars)
summary(fit)
#
#Call:
#lm(formula = mpg ~ as.factor(am), data = mtcars)
#
#Residuals:
# Min 1Q Median 3Q Max
#-9.3923 -3.0923 -0.2974 3.2439 9.5077
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 17.147 1.125 15.247 1.13e-15 ***
#as.factor(am)1 7.245 1.764 4.106 0.000285 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 4.902 on 30 degrees of freedom
#Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
#F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Note that p-values agree.
Two comments:
as.factor(am) turns am into a categorical variable
To match the assumptions of the linear model (where the error term epsilon ~ N(0, sigma^2)), we need to use t.test with var.equal = T which assumes the variance to be the same for measurements from both groups.
The difference in the sign of the t value comes from the different definition of the reference level of "categorised" am.
To get the same group means in the linear model, we can remove the intercept
lm(mpg ~ as.factor(am) - 1, mtcars)
#
#Call:
#lm(formula = mpg ~ as.factor(am) - 1, data = mtcars)
#
#Coefficients:
#as.factor(am)0 as.factor(am)1
# 17.15 24.39
An assumption of linear regression is that the residuals are normally distributed with a mean of 0 and a constant variance. Therefore your t.test and regression summary will have consistent results only if you assume that the variances are equal.
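If you specifically want Welch-style (unequal-variance) standard errors in a regression framework, one option, offered here as a hedged sketch rather than as part of the answer above, is heteroskedasticity-consistent standard errors from the sandwich and lmtest packages. For a single binary predictor, the HC2 estimator reproduces the Welch standard error, although the default degrees of freedom differ from Welch's Satterthwaite approximation, so the p-values will not match exactly:
library(sandwich)
library(lmtest)
fit <- lm(mpg ~ as.factor(am), data = mtcars)
# robust (HC2) standard errors; the t value should match t.test(..., var.equal = FALSE)
coeftest(fit, vcov = vcovHC(fit, type = "HC2"))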
I am looking for a way to calculate the multiple correlation coefficient (http://en.wikipedia.org/wiki/Multiple_correlation) in R. Is there a built-in function to calculate it?
I have one dependent variable and three independent ones.
I am not able to find such a function online. Any ideas?
The easiest way to calculate what you seem to be asking for when you refer to 'the multiple correlation coefficient' (i.e. the correlation between one dependent variable and two or more independent variables taken together) is to fit a multiple linear regression, predicting the dependent variable from the independent ones, and then calculate the correlation between the predicted and observed values of the dependent variable.
Here, for example, we create a linear model called mpg.model, with mpg as the dependent variable and wt and cyl as the independent variables, using the built-in mtcars dataset:
> mpg.model <- lm(mpg ~ wt + cyl, data = mtcars)
Having created the above model, we correlate the observed values of mpg (which are embedded in the object, within the model data frame) with the predicted values for the same variable (also embedded):
> cor(mpg.model$model$mpg, mpg.model$fitted.values)
[1] 0.9111681
R will in fact do this calculation for you, but without telling you so, when you ask it to create the summary of a model (as in Brian's answer): the summary of an lm object contains R-squared, which is the square of the coefficient of correlation between the predicted and observed values of the dependent variable. So an alternative way to get the same result is to extract R-squared from the summary.lm object and take the square root of it, thus:
> sqrt(summary(mpg.model)$r.squared)
[1] 0.9111681
I feel that I should point out, however, that the term 'multiple correlation coefficient' is ambiguous.
The built-in function lm gives at least one version, not sure if this is what you are looking for:
fit <- lm(yield ~ N + P + K, data = npk)
summary(fit)
Gives:
Call:
lm(formula = yield ~ N + P + K, data = npk)
Residuals:
Min 1Q Median 3Q Max
-9.2667 -3.6542 0.7083 3.4792 9.3333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.650 2.205 24.784 <2e-16 ***
N1 5.617 2.205 2.547 0.0192 *
P1 -1.183 2.205 -0.537 0.5974
K1 -3.983 2.205 -1.806 0.0859 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.401 on 20 degrees of freedom
Multiple R-squared: 0.3342, Adjusted R-squared: 0.2343
F-statistic: 3.346 on 3 and 20 DF, p-value: 0.0397
More info on what's going on at ?summary.lm and ?lm.
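If what you are after is the multiple correlation itself rather than R-squared, you can take the square root, mirroring the previous answer (a small sketch reusing fit from above):
sqrt(summary(fit)$r.squared)   # multiple correlation between yield and the predictors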
Try this:
# load sample data
data(mtcars)
# calculate correlation coefficients between all variables in `mtcars` using
# the built-in function
M <- cor(mtcars)
# M is a matrix of correlation coefficients, which you can display just by
# running
print(M)
# If you want to plot the correlation coefficient
library(corrplot)
corrplot(M, method="number",type= "lower",insig = "blank", number.cex = 0.6)
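If you only need the correlations between your one dependent variable and the independent ones rather than the full matrix, you can subset M (a small sketch, with mpg standing in for the dependent variable):
M["mpg", ]   # correlations of every variable with mpg (including mpg itself, which is 1)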