Regression Summaries in R - r

I've been using the glm function to do regression analysis, and it's treating me quite well. I'm wondering though, some of the things I want to regress involve a large amount of regression factors. I have two main questions:
Is it possible to give a text vector for the regressors?
Can the p-value portion of summary(glm) be sorted at all? Preferably by the p-values of each regressor.
Ex.
A # sample data frame
names(A)
[1] Dog Cat Human Limbs Tail Height Weight Teeth.Count
a = names(A)[4:7]
glm( Dog ~ a, data = A, family = "binomial")

For your first question, see as.formula. Basically you want to do the following:
x <- names(A)[4:7]
regressors <- paste(x,collapse=" + ")
form <- as.formula(c("Dog ~ ",regressors))
glm(form, data = A, family = "binomial")
If you want interaction terms in your model, you need to make the structure somewhat more complex by using different collapse= arguments. That argument specifies which symbols are placed between the elements of your vector. For instance, if you specify "*" in the code above, you will have a saturated model with all possible interactions. If you just need some interactions, but not all, you will want to create the part of the formula containing all interactions first (using "*" as collapse argument), and then add the remaining terms in the separate paste function (using "+" as collapse argument). All in all, you want to create a character string that is identical to your formula, and then convert it to the formula class.
For your second question, you need to convert the output of summary to a data structure that can be sorted. For instance, a data frame. Let's say that the name of your glm model is model:
library(plyr)
coef <- summary(model)[12]
coef.sort <- as.data.frame(coef)
names(coef.sort) <- c("Estimate","SE","Tval","Pval")
arrange(coef.sort,Pval)
Assign the result of arrange() to a varable, and continue with it as you like.

An example data frame:
set.seed(42)
A <- data.frame(Dog = sample(0:1, 100, TRUE), b = rnorm(100), c = rnorm(100))
a <- names(A)[2:3]
Firstly, you can use the character vector a to create a model formula with reformulate:
glm(Dog ~ a, data = A, family = "binomial")
form <- reformulate(a, "Dog")
# Dog ~ b + c
model <- glm(form, data = A, family = "binomial")
Secondly, this is a way to sort the model summary by the p-values:
modcoef <- summary(model)[["coefficients"]]
modcoef[order(modcoef[ , 4]), ]
# Estimate Std. Error z value Pr(>|z|)
# b 0.23902684 0.2212345 1.0804232 0.2799538
# (Intercept) 0.20855908 0.2025642 1.0295951 0.3032001
# c -0.09287769 0.2191231 -0.4238608 0.6716673

Related

How to repeat univariate regression and extract P values?

I am using lapply to perform several glm regressions on one dependent variable by one independent variable at a time. but I'm not sure how to extract the P values at a time.
There are 200 features in my dataset, but the code below only gave me the P value of feature#1. How can I get a matrix of all P values of the 200 features?
valName<- as.data.frame(colnames(repeatData))
featureName<-valName[3,]
lapply(featureName,
function(var) {
formula <- as.formula(paste("outcome ~", var))
fit.logist <- glm(formula, data = repeatData, family = binomial)
summary(fit.logist)
Pvalue<-coef(summary(fit.logist))[,'Pr(>|z|)']
})
I
I simplified your code a little bit; (1) used reformulate() (not really different, just prettier) (2) returned only the p-value for the focal variable (not the intercept p-value). (If you leave out the 2, you'll get a 2-row matrix with intercept and focal-variable p-values.)
My example uses the built-in mtcars data set, with an added (fake) binomial response.
repeatData <- data.frame(outcome=rbinom(nrow(mtcars), size=1, prob=0.5), mtcars)
ff <- function(var) {
formula <- reformulate(var, response="outcome")
fit.logist <- glm(formula, data = repeatData, family = binomial)
coef(summary(fit.logist))[2, 'Pr(>|z|)']
}
## skip first column (response variable).
sapply(names(repeatData)[-1], ff)

How to use data frames to conduct ANOVAs in R

I am currently learning R and am playing around with a dataset that has four nominal variables (Hour.Of.Arrival, Mode, Unit, Weekday), and a continuous dependent variable (Overall). This is all imported from a .csv in a data frame named basic. What I am trying to do is run an ANOVA just using this data frame, without creating separate vectors (e.g. Mode<-basic$Mode). "Fit" holds the results of the ANOVA. Here is the code that I wrote:
Fit<-aov(basic["Overall"],basic["Unit"],data=basic)
However, I keep getting the error
"Error in terms.default(formula, "Error", data = data) : no terms
component nor attribute
I hope this question isn't too basic!!
Thanks :)
I think you want something more like Fit<-aov(Overall ~ Unit,data=basic). The Overall ~ Unit tells R to treat Overall as an outcome being predicted by Unit; you already specify that the dataframe to find these variables is basic.
Here's an example to show you how it works:
> y <- rnorm(100)
> x <- factor(rep(c('A', 'B', 'C', 'D'), each = 25))
> dat <- data.frame(x, y)
> aov(y ~ x, data = dat)
Call:
aov(formula = y ~ x, data = dat)
Terms:
x Residuals
Sum of Squares 2.72218 114.54631
Deg. of Freedom 3 96
Residual standard error: 1.092333
Estimated effects may be unbalanced
Note, you don't need to use the data argument, you could also use aov(dat$y ~ dat$x), but the first argument to the function should be a formula.

how to use loop to do linear regression in R

I wonder if I can use such as for loop or apply function to do the linear regression in R. I have a data frame containing variables such as crim, rm, ad, wd. I want to do simple linear regression of crim on each of other variable.
Thank you!
If you really want to do this, it's pretty trivial with lapply(), where we use it to "loop" over the other columns of df. A custom function takes each variable in turn as x and fits a model for that covariate.
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
mods <- lapply(df[, -1], function(x, dat) lm(crim ~ x, data = dat))
mods is now a list of lm objects. The names of mods contains the names of the covariate used to fit the model. The main negative of this is that all the models are fitted using a variable x. More effort could probably solve this, but I doubt that effort is worth the time.
If you are just selecting models, which may be dubious, there are other ways to achieve this. For example via the leaps package and its regsubsets function:
library("leapls")
a <- regsubsets(crim ~ ., data = df, nvmax = 1, nbest = ncol(df) - 1)
summa <- summary(a)
Then plot(a) will show which of the models is "best", for example.
Original
If I understand what you want (crim is a covariate and the other variables are the responses you want to predict/model using crim), then you don't need a loop. You can do this using a matrix response in a standard lm().
Using some dummy data:
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
we create a matrix or multivariate response via cbind(), passing it the three response variables we're interested in. The remaining parts of the call to lm are entirely the same as for a univariate response:
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)
mods
> mods
Call:
lm(formula = cbind(rm, ad, wd) ~ crim, data = df)
Coefficients:
rm ad wd
(Intercept) -0.12026 -0.47653 -0.26419
crim -0.26548 0.07145 0.68426
The summary() method produces a standard summary.lm output for each of the responses.
Suppose you want to have response variable fix as first column of your data frame and you want to run simple linear regression multiple times individually with other variable keeping first variable fix as response variable.
h=iris[,-5]
for (j in 2:ncol(h)){
assign(paste("a", j, sep = ""),lm(h[,1]~h[,j]))
}
Above is the code which will create multiple list of regression output and store it in a2,a3,....

Predict() + assign() loop across variables

I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
assign(
paste("df$", i, sep = ""),
predict(i, df)
)}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"" with the last word changing as I change class of the predicted object, e.g. predict(as.numeric(i),df).
Any ideas? Ideally I could vectorize this as well.
You should rarely have to use assign() and $ should not be used with variable names. The [[]] operator is better for dynamic subsetting than $. And it would be easier if you just made a list if the models rather than just their names. Here's an example
df<-data.frame(x=runif(30), y=runif(30), w=runif(30), z=runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a=a,b=b) # create vector of model objects
# 1) for loop
for (m in names(models)) {
df[[m]] <- predict(models[[m]], df)
}
Or rather than a for loop, you could generate all the values with Map and then append with cdbind afterward
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))

R:fit dynamic number of explanatory variable into polynomial regression

Suppose I was given an data frame df on runtime, how do I fit a polynomial model using polynomial regression, with each predictor is a column from df and has a degree of a constant k >= 2
The difficulty is, 'df' is read during runtime so the number and names of its columns are unknown when the script is written.(but I do know the response variable is the 1st column) So when I call lm I do not know how to write the formula.
In case of k = 1, then I can simply write a generic linear formula
names(df)[1] <- "y"
lm(y ~ ., data = df)
is there something similar I can do for polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula - apply f function to all columns names excluding the first
form <- formula(paste(names(dat)[1], " ~ ",
paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form , data=dat)

Resources