Multiple linear regression and ANOVA in R [closed] - r

So, I have a table which consists of 20 football teams and 6 variables, the variables are X1, X2, X3, X4, X5 and X6.
X1 = % of shots at goal which result in a goal
X2 = goals scored outside the box
X3 = ratio of short to long passes
X4 = Number of ball crosses
X5 = Average number of goals conceded
X6 = # Yellow cards received
and then I have a Y column which is the number of league points each team has.
How would I perform multiple linear regression and ANOVA on this? I am at a complete loss with R.
Thanks
The data is this:

Multiple Linear Regression
The lm() function in base R does exactly what you want (there is no need to use glm() if you are only running a linear regression):
Reg = lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = mydata)
If Y and the X's are the only columns in your data.frame, you can use this much simpler syntax:
Reg = lm(Y ~ ., data = mydata)
The . means "all other columns".
To see regression output (as suggested by @Manassa Mauler):
summary(Reg)
Refer to ?lm and ?glm for more information.
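Since the question's data is not shown, here is a minimal sketch on simulated data. The data frame mydata, its value ranges, and the coefficients used to generate Y are all made up for illustration; only the column names X1 to X6 and Y follow the question.

```r
# Simulated stand-in for the question's table: 20 teams, predictors X1..X6, response Y.
# All numbers below are invented for illustration only.
set.seed(123)
mydata <- data.frame(
  X1 = runif(20, 5, 20),    # % of shots resulting in a goal
  X2 = rpois(20, 8),        # goals scored outside the box
  X3 = runif(20, 2, 6),     # ratio of short to long passes
  X4 = rpois(20, 400),      # number of crosses
  X5 = runif(20, 0.5, 2.5), # average goals conceded
  X6 = rpois(20, 60)        # yellow cards
)
mydata$Y <- 40 + 2 * mydata$X1 - 10 * mydata$X5 + rnorm(20, sd = 5)

# Y is the response; "." picks up every other column as a predictor.
Reg <- lm(Y ~ ., data = mydata)

summary(Reg)        # full regression output
coef(summary(Reg))  # only the coefficient matrix (estimate, SE, t value, p-value)
```

With real data you would replace the simulated mydata by your own data frame, for example one read in with read.csv().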
ANOVA
If you want to compare nested models with the "intercept-only" model, you can do something like the following:
fit0 = lm(Y ~ 1, data = mydata)
fit1 = update(fit0, . ~ . + X1)
fit2 = update(fit1, . ~ . + X2)
fit3 = update(fit2, . ~ . + X3)
fit4 = update(fit3, . ~ . + X4)
fit5 = update(fit4, . ~ . + X5)
fit6 = update(fit5, . ~ . + X6)
This successively adds an additional variable to the intercept-only model.
To compare them, use the anova() function:
anova(fit0, fit1, fit2, fit3, fit4, fit5, fit6, test = "F")
Refer to ?anova or ?anova.lm for more information.
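As a sketch of how the nested comparison relates to the ANOVA table of the full model alone, the following uses simulated data (mydata and the coefficients are assumptions, not the question's data). Calling anova() on the largest model gives the sequential table directly, with terms added in formula order.

```r
# Simulated data; column names follow the question, values are invented.
set.seed(123)
mydata <- as.data.frame(matrix(rnorm(20 * 6), ncol = 6,
                               dimnames = list(NULL, paste0("X", 1:6))))
mydata$Y <- 50 + 3 * mydata$X1 - 4 * mydata$X5 + rnorm(20)

# Nested models, each adding one predictor to the previous one.
fit0 <- lm(Y ~ 1, data = mydata)
fit1 <- update(fit0, . ~ . + X1)
fit2 <- update(fit1, . ~ . + X2)
fit3 <- update(fit2, . ~ . + X3)
fit4 <- update(fit3, . ~ . + X4)
fit5 <- update(fit4, . ~ . + X5)
fit6 <- update(fit5, . ~ . + X6)

# Model-by-model comparison versus the sequential table of the full model.
seq_tab  <- anova(fit0, fit1, fit2, fit3, fit4, fit5, fit6, test = "F")
full_tab <- anova(fit6)

seq_tab
full_tab
```

The "Sum of Sq" column of the first table and the "Sum Sq" column of the second should agree row by row, so anova(fit6) is a shortcut when the order of the terms in the formula is the order you care about.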

Just set up the multiple regression as follows using the glm function and then extract the results using summary.
model <- glm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = mydata)
summary(model)
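For what it is worth, glm() with its default family (gaussian) fits the same model as lm(). A small sketch on simulated data (the data frame and coefficients are made up) to check that the estimates agree:

```r
# Simulated data, for illustration only.
set.seed(1)
mydata <- data.frame(X1 = rnorm(20), X2 = rnorm(20))
mydata$Y <- 1 + 2 * mydata$X1 - mydata$X2 + rnorm(20)

fit_lm  <- lm(Y ~ X1 + X2, data = mydata)
fit_glm <- glm(Y ~ X1 + X2, data = mydata)  # family = gaussian() is the default

# The coefficient estimates should be the same up to numerical tolerance.
all.equal(coef(fit_lm), coef(fit_glm))
```

The summaries differ in what they report: summary() of the lm fit shows R-squared and the overall F-test, while the glm summary shows deviances instead.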

Related

How to loop an ANCOVA test across multiple variables? [duplicate]

I am new to R and I want to improve the following script with an *apply function (I have read about apply, but I couldn't manage to use it). I want to use lm function on multiple independent variables (which are columns in a data frame). I used
for (i in 1:3) {
assign(paste0('lm.',names(data[i])), lm(formula=formula(i),data=data))
}
formula(i) is defined as
formula=function(x)
{
as.formula ( paste(names(data[x]),'~', paste0(names(data[-1:-3]), collapse = '+')), env=parent.frame() )
}
Thank you.
If I don't get you wrong, you are working with a dataset like this:
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
x1, x2 and x3 are covariates, and y1, y2, y3 are three independent responses. You are trying to fit three linear models:
y1 ~ x1 + x2 + x3
y2 ~ x1 + x2 + x3
y3 ~ x1 + x2 + x3
Currently you are looping through y1, y2, y3, fitting one model at a time. You hope to speed the process up by replacing the for loop with lapply.
You are on the wrong track. lm() is an expensive operation. As long as your dataset is not small, the cost of the for loop is negligible. Replacing the for loop with lapply gives no performance gain.
Since you have the same RHS (right-hand side of ~) for all three models, the model matrix is the same for all three models. Therefore, the QR factorization needs to be done only once. lm allows this, and you can use:
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)
#Coefficients:
#             y1         y2         y3
#(Intercept) -0.081155   0.042049   0.007261
#x1          -0.037556   0.181407  -0.070109
#x2          -0.334067   0.223742   0.015100
#x3           0.057861  -0.075975  -0.099762
If you check str(fit), you will see that this is not a list of three linear models; instead, it is a single linear model with a single $qr object, but with multiple LHS. So $coefficients, $residuals and $fitted.values are matrices. The resulting linear model has an additional "mlm" class besides the usual "lm" class. I created a special mlm tag collecting some questions on the theme, summarized by its tag wiki.
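A short sketch that inspects the "mlm" fit and checks one of its coefficient columns against a separately fitted single-response model (the name fit_y1 is introduced here only for the comparison):

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

# One fit, three responses.
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)

class(fit)           # "mlm" "lm"
dim(coef(fit))       # 4 rows (intercept + 3 covariates), 3 columns (responses)
dim(residuals(fit))  # 30 observations by 3 responses

# Each column equals the coefficients of the corresponding separate model.
fit_y1 <- lm(y1 ~ x1 + x2 + x3, data = dat)
all.equal(coef(fit)[, "y1"], coef(fit_y1))
```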
If you have a lot more covariates, you can avoid typing or pasting formula by using .:
fit <- lm(cbind(y1, y2, y3) ~ ., data = dat)
#Coefficients:
#             y1         y2         y3
#(Intercept) -0.081155   0.042049   0.007261
#x1          -0.037556   0.181407  -0.070109
#x2          -0.334067   0.223742   0.015100
#x3           0.057861  -0.075975  -0.099762
Caution: Do not write
y1 + y2 + y3 ~ x1 + x2 + x3
This will treat y = y1 + y2 + y3 as a single response. Use cbind().
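To illustrate the caution, this sketch fits both versions on the same simulated data. The object names fit_sum and fit_mlm are chosen here for illustration.

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

# "+" on the LHS is plain arithmetic: one response, equal to y1 + y2 + y3.
fit_sum <- lm(y1 + y2 + y3 ~ x1 + x2 + x3, data = dat)

# cbind() on the LHS keeps the three responses separate.
fit_mlm <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)

is.matrix(coef(fit_sum))  # a plain vector: a single-response model
is.matrix(coef(fit_mlm))  # a matrix: one column per response

# By linearity, the summed-response coefficients are the row sums of the
# per-response coefficients.
all.equal(coef(fit_sum), rowSums(coef(fit_mlm)))
```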
Follow-up:
I am interested in a generalization. I have a data frame df, where the first n columns are dependent variables (y1, y2, y3, ...) and the next m columns are independent variables (x1, x2, x3, ...). For n = 3 and m = 3 it is fit <- lm(cbind(y1, y2, y3) ~ ., data = dat). But how can I do this automatically, using the structure of df? I mean something like for (i in 1:n) fit <- lm(cbind(df[something]) ~ df[something], data = dat). I have created that "something" with paste and paste0. Thank you.
So you are programming your formula, or want to dynamically generate / construct model formulae in the loop. There are many ways to do this, and many Stack Overflow questions are about this. There are commonly two approaches:
use reformulate;
use paste / paste0 and formula / as.formula.
I prefer reformulate for its neatness; however, it does not support multiple LHS in the formula. It also needs some special treatment if you want to transform the LHS. So in the following I will use the paste solution.
For your data frame df, you may do
paste0("cbind(", paste(names(df)[1:n], collapse = ", "), ")", " ~ .")
A nicer-looking way is to use sprintf and toString to construct the LHS:
sprintf("cbind(%s) ~ .", toString(names(df)[1:n]))
Here is an example using iris dataset:
string_formula <- sprintf("cbind(%s) ~ .", toString(names(iris)[1:2]))
# "cbind(Sepal.Length, Sepal.Width) ~ ."
You can pass this string formula to lm, as lm will automatically coerce it into formula class. Or you may do the coercion yourself using formula (or as.formula):
formula(string_formula)
# cbind(Sepal.Length, Sepal.Width) ~ .
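Putting the pieces together, here is a minimal sketch for a data frame whose first n columns are responses. The simulated df and the value n = 3 are assumptions made for the example.

```r
# Simulated data frame: first n columns are responses, the rest covariates.
set.seed(0)
df <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                 x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
n <- 3

# Build "cbind(y1, y2, y3) ~ ." from the column names.
string_formula <- sprintf("cbind(%s) ~ .", toString(names(df)[1:n]))
string_formula

# Coerce explicitly and fit; "." expands to the remaining columns.
fit <- lm(as.formula(string_formula), data = df)
coef(fit)
```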
Remark:
This multiple LHS formula is also supported elsewhere in R core:
the formula method for function aggregate;
ANOVA analysis with aov.
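A brief sketch of both uses with the built-in iris dataset (the choice of columns is only for illustration):

```r
# aggregate: group means of two columns at once, by Species.
agg <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                 data = iris, FUN = mean)
agg

# aov: the same kind of multiple-LHS formula gives a multi-response fit.
fit_aov <- aov(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)
class(fit_aov)    # includes "maov" in addition to "aov"
summary(fit_aov)  # summarises the fit for both responses
```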

Running all Combinations of Dummy Variables Through a Regression Equation

I have an issue that concerns itself with extracting output from a regression for all possible combinations of dummy variable while keeping the continuous predictor variables fixed.
The problem is that my model contains over 100 combinations of interactions and manually calculating all of these will be quite tedious. Is there an efficient method for iteratively calculating output?
The only way I can think of is to write a loop that generates all desired combinations to subsequently feed into the predict() function.
Some context:
I am trying to identify the regional differences of automobile resale prices by the model of car.
My model looks something like this:
lm(price ~ age + mileage + region_dummy_1 + ... + region_dummy_n + model_dummy_1 + ... + model_dummy_n + region_dummy_1 * model_dummy_1 + ... + region_dummy_1 * model_dummy_n, data = data)
My question is:
How do I produce a table of predicted prices for every model/region combination?
Use .*.
lm(price ~ .*., data = your_data)
Here's a small reproducible example:
> df <- data.frame(y = rnorm(100,0,1),
+ x1 = rnorm(100,0,1),
+ x2 = rnorm(100,0,1),
+ x3 = rnorm(100,0,1))
>
> lm(y ~ .*., data = df)
Call:
lm(formula = y ~ . * ., data = df)
Coefficients:
(Intercept)           x1           x2           x3        x1:x2        x1:x3
   -0.02036      0.08147      0.02354     -0.03055      0.05752     -0.02399
      x2:x3
    0.24065
How does it work?
. is shorthand for "all predictors", and * expands to the main effects plus their two-way interaction terms.
For example, consider a data frame with 3 columns: Y (the dependent variable) and 2 predictors (X1 and X2). The syntax lm(Y ~ X1*X2) is shorthand for lm(Y ~ X1 + X2 + X1:X2), where X1:X2 is the interaction term.
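The two-predictor shorthand can be checked without fitting anything, by asking terms() which term labels each formula expands to. The names f_short and f_long are just labels for this sketch.

```r
f_short <- Y ~ X1 * X2
f_long  <- Y ~ X1 + X2 + X1:X2

# Both formulas expand to the same set of terms.
attr(terms(f_short), "term.labels")
attr(terms(f_long),  "term.labels")
```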
Extending this simple case, imagine we have a data frame with 3 predictors, X1, X2, and X3. lm(Y ~ .*.) is equivalent to lm(Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3).
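For the table of predicted prices asked about in the question, one possible sketch is to build the grid of combinations with expand.grid() and pass it to predict(). This assumes region and model are stored as factors rather than as separate dummy columns, and that age and mileage are held at their sample means. The data frame cars, its column names, and all numbers are made up.

```r
# Simulated resale data; everything here is invented for illustration.
set.seed(1)
cars <- data.frame(
  age     = runif(200, 1, 10),
  mileage = runif(200, 10, 150),
  region  = factor(sample(c("north", "south", "west"), 200, replace = TRUE)),
  model   = factor(sample(c("A", "B"), 200, replace = TRUE))
)
cars$price <- 30 - 1.5 * cars$age - 0.05 * cars$mileage +
  2 * (cars$region == "south") + 3 * (cars$model == "B") + rnorm(200)

# region * model expands to both main effects plus their interaction.
fit <- lm(price ~ age + mileage + region * model, data = cars)

# Every region/model combination, with continuous predictors held fixed.
grid <- expand.grid(age     = mean(cars$age),
                    mileage = mean(cars$mileage),
                    region  = levels(cars$region),
                    model   = levels(cars$model))
grid$predicted_price <- predict(fit, newdata = grid)

# Reshape into a region-by-model table of predicted prices.
price_table <- xtabs(predicted_price ~ region + model, data = grid)
price_table
```

Letting lm() build the dummy variables from factors avoids writing out each interaction by hand, which is what makes the grid approach short.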

Running multiple linear regression in R

How would one run a multiple linear regression in R, with > 100 covariates?
Is there a faster way besides (y ~ x1 + x2 + x3 + ... + x100)?
lm(y ~ ., data = YourData)
You can use ., which takes all columns of the supplied data other than the response column as covariates.
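A small sketch with 100 simulated covariates (the data are random noise and the name YourData follows the line above; the column names x1 to x100 are assumptions):

```r
# 500 observations, 100 covariates named x1..x100, plus a response y.
set.seed(1)
X <- matrix(rnorm(500 * 100), nrow = 500,
            dimnames = list(NULL, paste0("x", 1:100)))
YourData <- data.frame(y = rnorm(500), X)

# "." expands to x1 + x2 + ... + x100.
fit <- lm(y ~ ., data = YourData)
length(coef(fit))  # intercept plus 100 slopes
```

If the data frame contains columns that should not be predictors (an ID column, say), they can be dropped in the formula, e.g. y ~ . - id, or removed from the data frame before fitting.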

How to add one variable each time into the regression model?

I have a question about how to add one variable each time into the regression model to evaluate the adjusted R squared.
For example,
lm(y~x1)
next time, I want to do
lm(y~x1+x2)
and then,
lm(y~x1+x2+x3)
I tried paste, it does not work. for example, lm(y~paste("x1","x2",sep="+")).
Any idea?
Assuming you fit 3 variables to your linear regression model: x1, x2 and x3
lm.fit1 = lm(y ~ x1 + x2 + x3)
Introducing an additional variable (x4) can be achieved by using the update function:
lm.fit2 = update(lm.fit1, .~. + x4)
You could even introduce an interaction term if required:
lm.fit3 = update(lm.fit2, .~. + x2:x3)
Further details on adding variables to regression models can be obtained here
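To collect the adjusted R squared at each step, as the question asks, one option is to build each formula with reformulate() and loop over the number of predictors. The attempt with paste inside lm() fails because the string is never turned into a formula; it has to be converted first. The data frame d and the predictor names below are simulated for illustration.

```r
# Simulated data; x3 is pure noise by construction.
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)

predictors <- c("x1", "x2", "x3")

# Fit y ~ x1, then y ~ x1 + x2, then y ~ x1 + x2 + x3.
adj_r2 <- vapply(seq_along(predictors), function(k) {
  f <- reformulate(predictors[1:k], response = "y")  # builds the formula
  summary(lm(f, data = d))$adj.r.squared
}, numeric(1))

# Label each value with the right-hand side that produced it.
names(adj_r2) <- vapply(seq_along(predictors), function(k) {
  paste(predictors[1:k], collapse = " + ")
}, character(1))

adj_r2
```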
