I am using summary() to create a summary of my regression. The output prints my variable names as they are, underscores included.
Is there any way to change the printed variable names so that I see, e.g., "Age of dog" instead of dog_age?
I cannot rename the variables themselves, since variable names cannot contain spaces.
Something like this?
> x <- summary(lm(mpg ~ cyl+wt, mtcars))
> rownames(x$coef) <- c("YOUR", "NAMES", "HERE")
> x$coef
# Estimate Std. Error t value Pr(>|t|)
# YOUR 39.6863 1.7150 23.141 < 2e-16
# NAMES -1.5078 0.4147 -3.636 0.001064
# HERE -3.1910 0.7569 -4.216 0.000222
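If relabelling by position feels fragile, one option is to map each coefficient name to a display label. This is only a minimal sketch: the labels lookup vector and its label text are made up for illustration.

```r
# Fit the same model as above and pull out the coefficient table
fit <- lm(mpg ~ cyl + wt, data = mtcars)
coefs <- coef(summary(fit))

# Hypothetical lookup: original coefficient name -> label to print
labels <- c("(Intercept)" = "Intercept",
            cyl = "Number of cylinders",
            wt = "Weight")

# Relabel by name, so the order of the rows does not matter
rownames(coefs) <- unname(labels[rownames(coefs)])
coefs
```

Because the lookup is by name, adding or reordering predictors does not silently attach a label to the wrong row.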
Or you could just change the names in the data before running regression
> names(mtcars)[1:3] <- rownames(x$coef)
> lm(YOUR ~ NAMES+HERE, mtcars)
# Call:
# lm(formula = YOUR ~ NAMES + HERE, data = mtcars)
# Coefficients:
# (Intercept) NAMES HERE
# 34.66099 -1.58728 -0.02058
You can use backtick ` to introduce spaces in variables:
dat = data.frame(`Age of dog`=1:10,`T`=1:10,check.names=FALSE)
summary(lm(T~`Age of dog`,data=dat))
I would like to analyse many x variables (400 variables) against one y variable. However, I do not want to write a new model for each and every x variable. Is it possible to write one model that then checks all x variables against y in RStudio?
Here is an approach that uses a function to regress a dependent variable on each of the other variables in a data frame, one at a time. The data frame and the name of the dependent variable are passed as arguments to the function.
We use lapply() to drive lm() because it will return the resulting model objects as a list, and we are able to easily name the resulting list so we can extract models by independent variable name.
regList <- function(dataframe,depVar) {
  indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
  modelList <- lapply(indepVars,function(x){
    lm(dataframe[[depVar]] ~ dataframe[[x]],data=dataframe)
  })
  # name list elements based on independent variable names
  names(modelList) <- indepVars
  modelList
}
We demonstrate the function with the mtcars data frame, assigning the mpg column as the dependent variable.
modelList <- regList(mtcars,"mpg")
At this point the modelList object contains 10 models, one for each variable in the mtcars data frame other than mpg. We can access the individual models by independent variable name, or by index.
# print the model where cyl is independent variable
summary(modelList[["cyl"]])
...and the output:
> summary(modelList[["cyl"]])
Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
dataframe[[x]] -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
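The Call: line and the coefficient row above show dataframe[[x]] rather than the variable name. A possible variant (regList2 is a hypothetical name, not part of the original answer) builds each formula with reformulate() so that the real column names appear in the output. It assumes the column names are syntactically valid R names.

```r
# Hypothetical variant of regList(): same idea, but the formula is built
# from the column names, so summaries show "cyl" instead of "dataframe[[x]]"
regList2 <- function(dataframe, depVar) {
  indepVars <- setdiff(names(dataframe), depVar)
  modelList <- lapply(indepVars, function(x) {
    lm(reformulate(x, response = depVar), data = dataframe)
  })
  # name list elements based on independent variable names
  names(modelList) <- indepVars
  modelList
}

models2 <- regList2(mtcars, "mpg")
names(coef(models2[["cyl"]]))
```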
Extracting the content
Saving the output in a list() enables us to do things like find the model with the highest R^2 without having to use vgrep.
First, we extract the r.squared value from each model summary and save the results to a vector.
r.squareds <- unlist(lapply(modelList,function(x) summary(x)$r.squared))
Because we used names() to name elements in the original list, R automatically saves the variable names to the element names of the vector. This comes in handy when we sort the vector by descending order of R^2 and print the first element of the resulting vector.
r.squareds[order(r.squareds,decreasing=TRUE)][1]
...and the winner (not surprisingly) is wt.
> r.squareds[order(r.squareds,decreasing=TRUE)][1]
wt
0.7528328
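If you want the whole ranking rather than just the winner, the R^2 values can be collected into a sorted data frame. A rough sketch, rebuilt from scratch so that it runs on its own; the names fits, r2 and ranking are illustrative only.

```r
# One simple regression of mpg on each other column of mtcars
predictors <- setdiff(names(mtcars), "mpg")
fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(fits) <- predictors

# vapply() checks that every model yields exactly one number
r2 <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))

# Table of variables sorted by descending R^2
ranking <- data.frame(variable = names(r2), r.squared = unname(r2))
ranking <- ranking[order(ranking$r.squared, decreasing = TRUE), ]
head(ranking, 3)
```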
If your data frame is DF,
regs <- list()
for (v in setdiff(names(DF), "y")) {
fm <- eval(parse(text = sprintf("y ~ %s", v)))
regs[[v]] <- lm(fm, data=DF)
}
Now you have all simple regression results in the regs list.
Example:
## Generate data
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0('x',j)]] <- rnorm(n)
## Now data ready
dim(DF)
# [1] 1000 401
head(names(DF))
# [1] "y" "x1" "x2" "x3" "x4" "x5"
tail(names(DF))
# [1] "x395" "x396" "x397" "x398" "x399" "x400"
regs <- list()
for (v in setdiff(names(DF), "y")) {
fm <- eval(parse(text = sprintf("y ~ %s", v)))
regs[[v]] <- lm(fm, data=DF)
}
head(names(regs))
# [1] "x1" "x2" "x3" "x4" "x5" "x6"
r2s <- sapply(regs, function(x) summary(x)$r.squared)
head(r2s, 3)
# x1 x2 x3
# 0.0000409755 0.0024376111 0.0005509134
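The same list can be mined for other quantities. Below is a small sketch that pulls the slope estimate and its p-value from every model into one matrix. It uses a smaller simulated data set (5 predictors, 100 rows) to stay quick, and as.formula() in place of eval(parse()); both choices are assumptions made for this example, not part of the answer above.

```r
# Small simulated data set: y plus five unrelated predictors
set.seed(1)
n <- 100
DF <- data.frame(y = rnorm(n))
for (j in seq(5)) DF[[paste0("x", j)]] <- rnorm(n)

# One simple regression per predictor, stored by predictor name
regs <- list()
for (v in setdiff(names(DF), "y")) {
  regs[[v]] <- lm(as.formula(sprintf("y ~ %s", v)), data = DF)
}

# Row 2 of each coefficient table is the slope; keep estimate and p-value
slopes <- t(sapply(regs, function(m) {
  coef(summary(m))[2, c("Estimate", "Pr(>|t|)")]
}))
slopes
```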
If you want to include them in the models separately, you can just loop over the x variables and add them into the model on each iteration. For example:
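x_variables = list("x_var1", "x_var2", "x_var3", "x_var4", ...)
for(x in x_variables){
  # x holds a column name as a string, so build the formula y_variable ~ <x> from it
  model <- lm(reformulate(x, response = "y_variable"), data = df)
  # summary() is not auto-printed inside a loop, so print it explicitly
  print(summary(model))
}
You can replace the ellipsis in the code above with your other x variables. I hope for your sake that there is some kind of naming convention you can exploit to select the variables with a tidyselect helper such as starts_with() or contains()!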
If you hope to include all the x variables in the same model, you just add them in as you normally would, joined with +. For example (assuming you want OLS, but the same idea works for other model types):
model <- lm(y_variable ~
x_var1 + x_var2 + x_var3 + x_var4 + ..., data = df)
summary(model)
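With hundreds of predictors, typing every term is impractical. Two ways to avoid it are sketched below, using mtcars as a stand-in for your data: the dot shorthand, which means "all other columns", and reformulate(), which builds the formula from a character vector of names.

```r
# "." on the right-hand side stands for every column except the response
fit_dot <- lm(mpg ~ ., data = mtcars)

# Equivalent formula built from a vector of column names
predictors <- setdiff(names(mtcars), "mpg")
fit_ref <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# Both fits give the same coefficients
all.equal(coef(fit_dot), coef(fit_ref))
```

The reformulate() version is handy when the predictor names come from a selection step, since any character vector of column names can be passed in.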
The following code
df <- data.frame(place = c("South","South","North","East"),
temperature = c(30,30,20,12),
outlookfine=c(TRUE,TRUE,FALSE,FALSE)
)
glm.fit <- glm(outlookfine ~ .,df , family= binomial)
coef.glm <-coef(summary(glm.fit))
coef.glm
outputs
Estimate Std. Error z value Pr(>|z|)
(Intercept) -23.56607 79462.00 -0.0002965703 0.9997634
placeNorth 0.00000 112376.25 0.0000000000 1.0000000
placeSouth 47.13214 97320.68 0.0004842972 0.9996136
I want to re-display the table without the intercept and without any rows whose names contain "South".
I thought of trying to name the index column and then subset on it but have had no success.
[Update]
I added more data to understand why George Sava's answer also stripped out "North"
df <- data.frame(place = c("South","South","North","East","West"),
temperature = c(30,30,20,12,15),
outlookfine=c(TRUE,TRUE,FALSE,FALSE,TRUE)
)
glm.fit <- glm(outlookfine ~ .,df, family= binomial )
coef.glm <-coef(summary(glm.fit))
coef.glm[!grepl(pattern = ("South|Intercept"), rownames(coef.glm)),]
outputs
Estimate Std. Error z value Pr(>|z|)
placeNorth 3.970197e-14 185277.1 2.142843e-19 1.0000000
placeWest 4.913214e+01 185277.2 2.651818e-04 0.9997884
To keep only rows that match (or do not match) a certain pattern, you can use:
coef.glm[!grepl("South|Intercept", rownames(coef.glm)),]
Note that when only one row is selected, the result is simplified to a vector.
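If you would rather keep a one-row matrix (with its row name and column names intact), drop = FALSE prevents the simplification. A small sketch, using an lm() fit on mtcars as a stand-in so that exactly one row survives the filter:

```r
# Coefficient table with two rows: (Intercept) and wt
cf <- coef(summary(lm(mpg ~ wt, data = mtcars)))
keep <- !grepl("Intercept", rownames(cf))

# Default indexing drops the matrix structure when one row remains
one_row_vec <- cf[keep, ]

# drop = FALSE keeps a 1 x 4 matrix with the row name "wt"
one_row_mat <- cf[keep, , drop = FALSE]
one_row_mat
```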
If you want to retain row names as a column then you could do something like:
library(tibble)
library(dplyr)
as.data.frame(coef.glm) %>%
rownames_to_column("x") %>%
filter(!grepl("Intercept|South", x))
Output
x Estimate Std. Error t value Pr(>|t|)
1 placeNorth -1.281975e-16 3.140185e-16 -0.4082483 0.7532483
In order to use the for loop, I'm trying to replace the arguments in this function by variables:
lm(mpg~cylinders, data=Auto)
So I did this:
var1='cylinders'
lm((paste('mpg ~',var1)), data = Auto)
It worked fine.
Now I wonder how to replace the arguments cylinders+acceleration with var1 and var2.
So I tried the same method, replacing this:
lm(mpg~cylinders+acceleration, data=Auto)
by
var1='cylinders'
var2 = 'acceleration'
lm((paste('mpg ~',var1+var2)), data = Auto)
But I got an error message:
Error in var1 + var2 : non-numeric argument to binary operator
So I want to learn how I can work with var1 and var2 in order to use for loop afterwards.
Use reformulate to generate the formula.
var1 <- 'cyl'
var2 <- 'disp'
fo <- reformulate(c(var1, var2), "mpg")
lm(fo, mtcars)
Alternatively, you could write it as below. The fitted model is the same; the only difference is that the code above shows fo literally in the Call: line of the output, whereas the code below expands fo into the full formula there.
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = mpg ~ cyl + disp, data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
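Since the goal was to use this inside a for loop, here is a rough sketch of how reformulate() might be combined with one. The predictor_sets list and the way the results are named are illustrative choices, not part of the answer above.

```r
# Each element is one set of predictors to try (illustrative choices)
predictor_sets <- list(c("cyl", "disp"), c("cyl", "wt"), c("hp", "wt"))

fits <- list()
for (vars in predictor_sets) {
  # Build mpg ~ var1 + var2 from the character vector
  fo <- reformulate(vars, response = "mpg")
  # Store each model under a readable key such as "cyl+disp"
  fits[[paste(vars, collapse = "+")]] <- lm(fo, data = mtcars)
}

names(fits)
coef(fits[["cyl+disp"]])
```

The first model is the same mpg ~ cyl + disp fit shown above, so its coefficients should match that output.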
Background: I have the following data that I run a glm function on:
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
#make into df
df = data.frame(location, count)
#poisson
summary(glm(count ~ location, family=poisson))
Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.6376 0.1622 22.424 < 2e-16 ***
locationBos 0.4055 0.2094 1.936 0.0529 .
locationDH 1.4744 0.1798 8.199 2.43e-16 ***
Problem: I would like to change the (Intercept) so I can get all my values relative to Bos
I looked at "Change reference group using glm with binomial family" and "How to force R to use a specified factor level as reference in a regression?". I tried their method and it did not work, and I am not sure why.
Tried:
df1 <- within(df, location <- relevel(location, ref = 1))
#poisson
summary(glm(count ~ location, family=poisson, data = df1))
Desired Output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) ...
locationBeth ...
locationDH ...
Question: How do I solve this problem?
I think your problem is that you are modifying the data frame, but in your model you are not using the data frame. Use the data argument in the model to use the data in the data frame.
location = c("DH", "Bos", "Beth")
count = c(166, 57, 38)
# make into df
df = data.frame(location, count)
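Note that location by itself is a character vector. In R versions before 4.0.0, data.frame() coerces it to a factor by default; from R 4.0.0 onward it stays a character column (stringsAsFactors defaults to FALSE), and relevel() only accepts factors. Wrapping the column in factor() works in either case, after which relevel() can set the reference level.
df$location = relevel(factor(df$location), ref = "Bos") # set Bos as reference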
summary(glm(count ~ location, family=poisson, data = df))
# Call:
# glm(formula = count ~ location, family = poisson, data = df)
# ...
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 4.0431 0.1325 30.524 < 2e-16 ***
# locationBeth -0.4055 0.2094 -1.936 0.0529 .
# locationDH 1.0689 0.1535 6.963 3.33e-12 ***
# ...
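To confirm that the reference level really changed, it can help to check the factor levels and to exponentiate the coefficients. A small sketch of that check, written to run on its own:

```r
location <- c("DH", "Bos", "Beth")
count <- c(166, 57, 38)
df <- data.frame(location, count)

# factor() makes this work whether or not data.frame() created a factor
df$location <- relevel(factor(df$location), ref = "Bos")
levels(df$location)  # the reference level is listed first

fit <- glm(count ~ location, family = poisson, data = df)

# exp(intercept) is the fitted count for Bos; the other terms are
# rate ratios relative to Bos (Beth/Bos and DH/Bos)
exp(coef(fit))
```

With one observation per location the model reproduces the counts, so the exponentiated intercept should be 57 and the ratios 38/57 and 166/57.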
Is there a function that can extract two or more columns from a coeftest object? This is easy one coeftest object at a time, but can I do the same to a list (other than a for() loop)?
> # meaningless data
> temp <- data.frame(a = rnorm(100, mean = 5), b = rnorm(100, mean = 1),
+ c = 1:100)
> formulas <- list(a ~ b, a ~ c)
> models <- lapply(formulas, lm, data = temp)
> library(lmtest)
> cts <- lapply(models, coeftest)
> # easy to extract columns one object at a time
> cts[[1]][, 1:2]
Estimate Std. Error
(Intercept) 5.0314196 0.1333705
b -0.1039264 0.0987044
> # but more difficult algorithmically
> # either one column
> lapply(cts, "[[", 1)
[[1]]
[1] 5.03142
[[2]]
[1] 5.312007
> # or two
> lapply(cts, "[[", 1:2)
Error in FUN(X[[1L]], ...) : attempt to select more than one element
Maybe the more fundamental question is if there is a way to turn the meat of the coeftest object into a data frame, which would allow me to extract columns singly, then use mapply(). Thanks!
Edit: I would like to end up with matrices (or data frames) containing the first and second columns.
[[1]]
Estimate Std. Error
(Intercept) 5.0314196 0.1333705
b -0.1039264 0.0987044
[[2]]
Estimate Std. Error
(Intercept) 5.312007153 0.199485363
c -0.007378529 0.003429477
[[ is the wrong subset function in this case. Note that when you lapply() over a list, what you are operating on are the components of the list, the bits you would get with list[[i]] where i is the ith component.
As such, you only need the [, 1:2] bit of cts[[1]][, 1:2] in the lapply() call. It is a little bit trickier because of the arguments for [, but easily doable with lapply():
> lapply(cts, `[`, , 1:2)
[[1]]
Estimate Std. Error
(Intercept) 4.926679544 0.1549482
b -0.001967657 0.1062437
[[2]]
Estimate Std. Error
(Intercept) 4.849041327 0.204342067
c 0.001494454 0.003512972
Note the <space>, before 1:2; this is the equivalent of [ , 1:2].
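On the question of getting data frames, a possible follow-up is sketched below. To stay in base R it uses the coefficient matrices from coef(summary()) as stand-ins for the coeftest objects; that substitution is an assumption of this example, made because both are indexed like matrices in the code above.

```r
# meaningless data, as in the question (seed added for reproducibility)
set.seed(42)
temp <- data.frame(a = rnorm(100, mean = 5), b = rnorm(100, mean = 1),
                   c = 1:100)
formulas <- list(a ~ b, a ~ c)
models <- lapply(formulas, lm, data = temp)

# Stand-ins for the coeftest objects: plain coefficient matrices
mats <- lapply(models, function(m) coef(summary(m)))

# Same column selection written as an anonymous function
first_two <- lapply(mats, function(m) m[, 1:2, drop = FALSE])

# Convert each matrix to a data frame; row names are preserved
as_df <- lapply(first_two, as.data.frame)
as_df[[2]]
```

The anonymous function does the same job as passing `[` with an empty argument, and some readers find it easier to follow.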
I'm not sure if this is what you want, but how about:
> do.call("rbind", cts)[, 1:2]
Estimate Std. Error
(Intercept) 4.8200993881 0.142381642
b -0.0421189130 0.092620363
(Intercept) 4.7459340076 0.206372906
c 0.0005770324 0.003547885