R: fit a dynamic number of explanatory variables in a polynomial regression

Suppose I am given a data frame df at runtime. How do I fit a polynomial regression model in which each predictor is a column of df and enters with a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (although I do know that the response variable is the first column). So when I call lm, I do not know how to write the formula.
In the case k = 1, I can simply write a generic linear formula:
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?

One rather convoluted way is to build the formula for the lm call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula: wrap every column name except the first in poly(..., 2)
form <- formula(paste(names(dat)[1], " ~ ",
paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form , data=dat)
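The same idea can be wrapped in a small helper so that the degree k becomes a parameter. This is only a sketch: poly_formula is a hypothetical name, and it assumes the response is the first column, every other column is numeric with more than k distinct values, and all column names are syntactically valid.

```r
# Hypothetical helper: build y ~ poly(x1, k) + poly(x2, k) + ... from a data frame
poly_formula <- function(df, k = 2) {
  response <- names(df)[1]                          # response is assumed to be column 1
  terms <- sprintf("poly(%s, %d)", names(df)[-1], k)
  reformulate(terms, response = response)
}

set.seed(1)
dat <- data.frame(replicate(4, rnorm(20)))  # columns X1..X4
form <- poly_formula(dat, k = 3)
fit <- lm(form, data = dat)
length(coef(fit))  # intercept plus 3 predictors times 3 degrees
```

Because poly() produces orthogonal polynomials by default, the individual coefficients are not the coefficients of x, x^2, and so on; passing raw = TRUE inside the poly() term would be needed for that.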

Related

Error: Variable length differs in lm regression using paste function

I have randomly generated a dataset and split it in two (L and I).
First I run the regression on L using all the covariates.
After identifying the set of variables whose coefficients are significantly different from zero, I want to run the regression on I using only this set of variables.
reg_L = lm(y ~ ., data = data)
S_hat = as.data.frame(round(summary(reg_L)$coefficients[,"Pr(>|t|)"], 3)<0.05)
S_hat_L = rownames(which(S_hat==TRUE, arr.ind = TRUE))
I then want to run the new model below, but it fails because of how the variable x is specified.
What am I doing wrong?
# Using the I proportion to construct the p-values
x = noquote(paste(S_hat_L, collapse = " + "))
reg_I = lm(y ~ x, data = data)
summary(reg_I)
A simpler way than manipulating a formula programmatically is to remove the unwanted predictors from the data:
# [-1] drops the intercept row, which has no matching column in data
pvals <- summary(reg_L)$coefficients[-1, "Pr(>|t|)"]
wanted <- names(pvals)[pvals < 0.05]
# keep the response too; assumes coefficient names match column names (numeric predictors)
reduced.data <- data[, c("y", wanted), drop = FALSE]
reg_S <- lm(y ~ ., data = reduced.data)
Note, however, that selecting variables with the LASSO is generally more robust with respect to out-of-sample performance. It yields a model in which some coefficients are set exactly to zero, while the remaining coefficients are shrunk in a way that tends to improve out-of-sample performance.
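As for why the original call fails: in lm(y ~ x), x is a single variable holding one character string, so it is not expanded into separate terms. If building the formula is still preferred, reformulate() turns a character vector of names into a formula. A minimal sketch on simulated data (the data frame and its column names are made up for illustration, and both fits use the same data rather than the L and I halves):

```r
set.seed(1)
n <- 200
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
data$y <- 2 * data$x1 - 3 * data$x2 + rnorm(n)  # x3 has no effect by construction

reg_L <- lm(y ~ ., data = data)
pvals <- summary(reg_L)$coefficients[-1, "Pr(>|t|)"]  # -1 drops the intercept row
S_hat_L <- names(pvals)[pvals < 0.05]

# Build y ~ <selected predictors> from the character vector of names
form <- reformulate(S_hat_L, response = "y")
reg_I <- lm(form, data = data)
```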

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks one predictor variable at a time from the data frame and fits a model. I would then like to extract the regression coefficient (not the intercept) and its sign, and store them in two vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternately the following where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
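For completeness, the error in the question arises because FAA_unique_with_duration_filtered[x] is a one-column data frame (a list) rather than a vector. Note also that append() returns a new vector and does not modify its argument, so its result has to be assigned. A loop that builds each formula from the column name might look like this, again using the anscombe data as a stand-in:

```r
a <- anscombe
predictors <- names(a)[1:4]
coefs <- numeric(0)
for (p in predictors) {
  # reformulate() builds y1 ~ <p> from the column name
  fit <- lm(reformulate(p, response = "y1"), data = a)
  coefs <- append(coefs, coef(fit)[[2]])  # assign the result of append()
}
names(coefs) <- predictors
signs <- sign(coefs)
```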

Regression Summaries in R

I've been using the glm function to do regression analysis, and it's treating me quite well. However, some of the things I want to regress involve a large number of regressors. I have two main questions:
Is it possible to give a text vector for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")
For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x, collapse = " + ")
# paste everything into a single string before converting it to a formula
form <- as.formula(paste("Dog ~", regressors))
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
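As a sketch of the partial-interaction case, using the column names from the question and assuming (purely for illustration) that Limbs and Tail should interact while Height and Weight enter additively:

```r
inter <- paste(c("Limbs", "Tail"), collapse = " * ")     # "Limbs * Tail"
main  <- paste(c("Height", "Weight"), collapse = " + ")  # "Height + Weight"
form  <- as.formula(paste("Dog ~", inter, "+", main))

# Inspect which terms the formula expands to
attr(terms(form), "term.labels")
```

The term labels show the four main effects plus the single Limbs:Tail interaction, which is a quick way to check that the pasted string means what was intended before fitting.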
For your second question, you need to convert the output of summary to a data structure that can be sorted, for instance a data frame. Let's say that the name of your glm model is model:
library(plyr)
# take the coefficient matrix by name rather than by a position such as [12]
coef.sort <- as.data.frame(coef(summary(model)))
names(coef.sort) <- c("Estimate", "SE", "Zval", "Pval")
# arrange() drops row names, so keep the term names in a column
coef.sort$Term <- rownames(coef.sort)
arrange(coef.sort, Pval)
Assign the result of arrange() to a variable, and continue with it as you like.
An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
First, you can use the character vector a to create a model formula with reformulate (writing Dog ~ a directly does not work, because a is then treated as a single variable):
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673

using lm() in R for a series of independent fits

I want to use lm() in R to fit a series (actually 93) separate linear regressions. According to the R lm() help manual:
"If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix."
This works fine as long as there are no missing data points in the Y response matrix. When there are missing points, instead of fitting each regression with the available data, every row that has a missing data point in any column is discarded. Is there any way to specify that lm() should fit all of the columns in Y independently and not discard rows where an individual column has a missing data point?
If you want to run n separate regressions of Y1, Y2, ..., Yn on X, you should not pass the response matrix to lm(); instead, use R's apply functions:
# create the response matrix and set some random values to NA
values <- runif(50)
values[sample(1:length(values), 10)] <- NA
Y <- data.frame(matrix(values, ncol=5))
colnames(Y) <- paste0("Y", 1:5)
# single regression term
X <- runif(10)
# create regression between each column in Y and X
lms <- lapply(colnames(Y), function(y) {
form <- paste0(y, " ~ X")
lm(form, data=Y)
})
# lms is a list of lm objects, can access them via [[]] operator
# or work with it using apply functions once again
sapply(lms, function(x) {
summary(x)$adj.r.squared
})
#[1] -0.06350560 -0.14319796 0.36319518 -0.16393125 0.04843368
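To confirm that each fit uses only the rows available for its own response column, the number of observations per model can be compared with the pattern of missing values. A small self-contained sketch with simulated data (the object names and the positions of the NA values are illustrative):

```r
set.seed(1)
Y <- data.frame(matrix(runif(50), ncol = 5))
colnames(Y) <- paste0("Y", 1:5)
Y$Y1[1:3] <- NA        # three missing values in the first response
Y$Y4[c(2, 9)] <- NA    # two missing values in the fourth
Y$X <- runif(10)       # keep the predictor in the same data frame

responses <- paste0("Y", 1:5)
lms <- lapply(responses, function(y) {
  lm(reformulate("X", response = y), data = Y)
})

# Rows actually used by each fit: only NAs in that fit's own response are dropped
n_used <- sapply(lms, nobs)
n_used
```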

Linear Regression in R with variable number of explanatory variables [duplicate]

I have a vector of Y values and a matrix of X values that I want to perform a multiple regression on (i.e. Y = X[column 1] + X[column 2] + ... X[column N])
The problem is that the number of columns in my matrix (N) is not prespecified. I know in R, to perform a linear regression you have to specify the equation:
fit = lm(Y~X[,1]+X[,2]+X[,3])
But how do I do this if I don't know how many columns are in my X matrix?
Thanks!
Three ways, in increasing level of flexibility.
Method 1
Run your regression using matrix notation, passing the matrix X directly:
fit <- lm( Y ~ X )
Method 2
Put all your data in one data.frame, not two:
dat <- cbind(data.frame(Y=Y),as.data.frame(X))
Then run your regression using the formula notation:
fit <- lm( Y~. , data=dat )
Method 3
Another way is to build the formula yourself:
model1.form.text <- paste("Y ~",paste(xvars,collapse=" + "),collapse=" ")
model1.form <- as.formula( model1.form.text )
model1 <- lm( model1.form, data=dat )
In this example, xvars is a character vector containing the names of the variables you want to use.
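A minimal end-to-end sketch of the third method, assuming simulated Y and X (the matrix has no column names here, so as.data.frame() supplies V1, V2, ...):

```r
set.seed(1)
X <- matrix(rnorm(60), ncol = 3)   # the number of columns is not known in advance
Y <- rnorm(20)

dat <- cbind(data.frame(Y = Y), as.data.frame(X))
xvars <- setdiff(names(dat), "Y")  # every column except the response
model1.form <- as.formula(paste("Y ~", paste(xvars, collapse = " + ")))
model1 <- lm(model1.form, data = dat)
```

Restricting xvars to a subset of the column names is what makes this method more flexible than Y ~ ., which always uses every remaining column.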
