Linear Regression in R with variable number of explanatory variables [duplicate] - r

I have a vector of Y values and a matrix of X values that I want to perform a multiple regression on (i.e. Y = X[column 1] + X[column 2] + ... X[column N])
The problem is that the number of columns in my matrix (N) is not prespecified. I know in R, to perform a linear regression you have to specify the equation:
fit = lm(Y~X[,1]+X[,2]+X[,3])
But how do I do this if I don't know how many columns are in my X matrix?
Thanks!

Three ways, in increasing level of flexibility.
Method 1
Pass the matrix in directly; lm() accepts a matrix on the right-hand side of a formula and fits one coefficient per column:
fit <- lm(Y ~ X)
Method 2
Put all your data in one data.frame, not two:
dat <- cbind(data.frame(Y = Y), as.data.frame(X))
Then run your regression using the formula notation:
fit <- lm(Y ~ ., data = dat)
Method 3
Another way is to build the formula yourself:
model1.form.text <- paste("Y ~", paste(xvars, collapse = " + "))
model1.form <- as.formula(model1.form.text)
model1 <- lm(model1.form, data = dat)
In this example, xvars is a character vector containing the names of the variables you want to use.
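For instance, assuming Y and X as in the question, a minimal end-to-end sketch of Method 3 (the data here are made up for illustration):
X <- matrix(rnorm(40), ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))
Y <- rnorm(10)
dat <- cbind(data.frame(Y = Y), as.data.frame(X))
xvars <- colnames(X)  # "V1" "V2" "V3" "V4"
model1.form <- as.formula(paste("Y ~", paste(xvars, collapse = " + ")))
model1 <- lm(model1.form, data = dat)  # fits Y ~ V1 + V2 + V3 + V4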


Plotting a Regression Line Without Extracting Each Coefficient Separately [duplicate]

For my Stats class, we are using R to compute all of our statistics, and we are working with numeric data that also has a categorical factor. The way we currently plot fitted lines is to call lm(), read the coefficients off the summary manually, create a mesh, and then use the lines() function. I want an easier way to do this. I have seen the predict() function, but not how to use it along with categories.
For example, the data set found here has 2 numerical variables and one categorical. I want to be able to plot the line of best fit for men and women in this set without having to extract each coefficient individually, as in my current code below.
bank<-read.table("http://www.uwyo.edu/crawford/datasets/bank.txt",header=TRUE)
fit <- lm(salary ~ years*gender, data = bank)
summary(fit)
yearhat <- seq(0, max(bank$years), length = 1000)
salaryfemalehat <- fit$coefficients[1] + fit$coefficients[2]*yearhat
salarymalehat <- (fit$coefficients[1] + fit$coefficients[3]) + (fit$coefficients[2] + fit$coefficients[4])*yearhat
Using what you have, you can get the same predicted values with
yearhat <- seq(0, max(bank$years), length = 1000)
salaryfemalehat <- predict(fit, data.frame(years=yearhat, gender="Female"))
salarymalehat <- predict(fit, data.frame(years=yearhat, gender="Male"))
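To actually draw the fitted lines from these predictions (a minimal sketch of the plotting step; colours are arbitrary):
plot(salary ~ years, data = bank)
lines(yearhat, salaryfemalehat, col = "red")
lines(yearhat, salarymalehat, col = "blue")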
To supplement MrFlick's answer, in the case of more factor levels we can try:
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
fit <- lm(mpg ~ disp*cyl, data = dat)
plot(dat$disp, dat$mpg)
# sort by disp so lines() draws a smooth curve for each level instead of
# zigzagging between unsorted points
xs <- sort(dat$disp)
for (i in levels(dat$cyl)) {
  lines(xs, predict(fit, newdata = data.frame(disp = xs, cyl = i)),
        col = which(levels(dat$cyl) == i))
}
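A legend makes the levels easier to tell apart (an optional, illustrative addition):
legend("topright", legend = levels(dat$cyl), col = seq_along(levels(dat$cyl)), lty = 1)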

How to add second-degree terms for all variables in lm()? [duplicate]

I have a dataframe with 16 variables. When I do multiple linear regression I do the following:
fit <- lm(y ~ .,data=data)
Now, I know how to add a second degree term of one of the variables:
fit2 <- lm(y ~ poly(x1,2) + .,data=data)
But now I don't want to write this out for all of my 16 variables. How can I do this in an easy way for all my variables?
Assuming the first variable in data is our 'y', we can build the formula like this:
as.formula(
  paste('y ~', paste0('poly(', colnames(data)[-1], ', 2)', collapse = ' + '))
)
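A quick sketch of the result, using a few continuous columns of mtcars as stand-in data (poly() needs more unique values than the degree, so binary columns are left out):
data <- mtcars[, c("mpg", "disp", "hp", "wt")]
form <- as.formula(
  paste(names(data)[1], '~', paste0('poly(', names(data)[-1], ', 2)', collapse = ' + '))
)
# form is: mpg ~ poly(disp, 2) + poly(hp, 2) + poly(wt, 2)
fit2 <- lm(form, data = data)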

R: fit a dynamic number of explanatory variables into polynomial regression

Suppose I am given a data frame df at runtime. How do I fit a polynomial regression model in which each predictor is a column from df, raised to a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (though I do know the response variable is the first column). So when I call lm I do not know how to write the formula.
In the case of k = 1, I can simply write a generic linear formula:
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# create the formula: wrap poly( , 2) around every column name except the first
form <- formula(paste(names(dat)[1], " ~ ",
                      paste0("poly(", names(dat)[-1], ", 2)", collapse = "+")))
# run the regression
lm(form, data = dat)
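Since the question asks for an arbitrary degree k, the same paste() trick can be wrapped in a function; make_poly_formula is a name invented here for illustration:
# build the formula for any degree k supplied at runtime
make_poly_formula <- function(df, k) {
  formula(paste(names(df)[1], "~",
                paste0("poly(", names(df)[-1], ", ", k, ")", collapse = " + ")))
}
# e.g. a cubic fit on a smaller frame, so there are enough rows per term
dat2 <- data.frame(replicate(4, rnorm(30)))
lm(make_poly_formula(dat2, 3), data = dat2)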

Predict new data using new x values and polynomial regression in R [duplicate]

I need to find a high degree polynomial fit to a set of data, then use that relationship to predict y values given x values. Here is a simplified example of the premise of my problem. I must create a regression (we can just do 2nd degree here, but I need a technique that can handle polynomials of any degree), then predict new y values given new x values.
dfram <- data.frame('x'=c(1,2,3,4))
dfram$y <- c(1,4,9,16)
pred <- data.frame('x'=c(5,6))
# predict pred$y using n degree trend in dfram
Here is the skeleton:
dfram <- data.frame('x'=c(1,2,3,4))
dfram$y <- c(1,4,9,16)
pred <- data.frame('x'=c(5,6))
myFit <- lm(y ~ poly(x, 2), data = dfram)
predict(myFit, pred)
#  1  2
# 25 36
You can change the degree of the polynomial with the second argument to poly().
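For example, a cubic fit on the same data (a sketch; with only four points this uses all the degrees of freedom, but predict() works the same way):
myFit3 <- lm(y ~ poly(x, 3), data = dfram)
predict(myFit3, pred)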

using lm() in R for a series of independent fits

I want to use lm() in R to fit a series of (actually 93) separate linear regressions. According to the R lm() help manual:
"If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix."
This works fine as long as there are no missing data points in the Y response matrix. When there are missing points, instead of fitting each regression with the available data, every row that has a missing data point in any column is discarded. Is there any way to specify that lm() should fit all of the columns in Y independently and not discard rows where an individual column has a missing data point?
If you are looking to do n regressions between Y1, Y2, ..., Yn and X, you don't specify that through lm(); rather, you should use R's apply functions:
# create the response matrix and set some random values to NA
values <- runif(50)
values[sample(1:length(values), 10)] <- NA
Y <- data.frame(matrix(values, ncol=5))
colnames(Y) <- paste0("Y", 1:5)
# single regression term
X <- runif(10)
# create regression between each column in Y and X
lms <- lapply(colnames(Y), function(y) {
  form <- paste0(y, " ~ X")
  lm(form, data = Y)
})
# lms is a list of lm objects, can access them via [[]] operator
# or work with it using apply functions once again
sapply(lms, function(x) {
  summary(x)$adj.r.squared
})
#[1] -0.06350560 -0.14319796 0.36319518 -0.16393125 0.04843368
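To confirm that each fit keeps all rows that are complete for its own response, rather than dropping any row with an NA anywhere in Y as the matrix-response form does, compare how many observations each model actually used:
# counts differ across fits because each model drops only its own NAs
sapply(lms, nobs)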
