How do I extract the coefficient names from the lm coefficients? (R)

I have the following code, which displays some coefficients from lm:
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
cf <- coef(summary(fit, complete = TRUE))
colnames(cf)[4] <- "pval"
cf <- data.frame(cf)
cf <- cf[cf$pval < 0.05, ]
cf <- cf[order(-cf$pval), ]
head(cf)
cf[1, 1]
I want to extract the names in the left column, i.e. (Intercept) and Petal.Length.
I thought I could use cf[1, 1], but that returns the estimate instead.

Those are extracted using rownames():
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
cf <- coef(summary(fit, complete = TRUE))
rownames(cf)
#[1] "(Intercept)"  "Petal.Length"

The tidyverse solution would be to use broom:
library(broom)
tidy_fit <- tidy(fit)
Results:
# A tibble: 2 x 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -0.363   0.0398      -9.13 4.70e-16
2 Petal.Length    0.416   0.00958     43.4  4.68e-86
Then it's easy to extract the components you want, and the resulting code is more readable: e.g. tidy_fit$term gives the variable names ((Intercept) and Petal.Length), as in the sketch below.
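For instance, to pull only the terms that pass the question's significance filter (a small sketch using tidy_fit from above):
tidy_fit$term[tidy_fit$p.value < 0.05]
# [1] "(Intercept)"  "Petal.Length"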

Related

How to use results from different regression models in a scatterplot built using group_by in R?

I would like to add two different regression curves, coming from two different models, to a scatter plot.
Let's use the example below:
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
The scatter plot looks like this:
library(ggplot2)
plot_df <- ggplot(df, aes(x = Weight, y = Increment, color = Id)) + geom_point()
I tested a linear and an exponential regression model and could extract the results following loki's answer there:
library(dplyr)
library(tidyr)
library(broom)
linear_df <- df %>% group_by(Id) %>% do(model = glance(lm(Increment ~ Weight, data = .))) %>% unnest(model)
exp_df <- df %>% group_by(Id) %>% do(model = glance(lm(log(Increment) ~ Weight, data = .))) %>% unnest(model)
The linear model fits better for the Ga group, the exponential one for the Aa group, and neither fits the Za group:
> linear_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.656 0.637 15.1 34.4 1.50e- 5 1 -81.6 169. 172. 4106. 18 20
2 Ga 1.00 1.00 0.243 104113. 6.10e-32 1 1.01 3.98 6.65 0.942 16 18
3 Za 0.0471 -0.0262 26.7 0.642 4.37e- 1 1 -69.5 145. 147. 9283. 13 15
> exp_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.999 0.999 0.0624 24757. 1.05e-29 1 28.2 -50.3 -47.4 0.0700 18 20
2 Ga 0.892 0.885 0.219 132. 3.86e- 9 1 2.87 0.264 2.94 0.766 16 18
3 Za 0.00444 -0.0721 0.941 0.0580 8.14e- 1 1 -19.3 44.6 46.7 11.5 13 15
Now, how can I draw the exponential regression curve for the Aa group, the linear regression line for the Ga group, and no curve for the Za group? There is this approach, but it applies to different regressions built inside the same model type. How can I combine my different model objects?
The formula shown below gives the same fitted values as do 3 separate fits (one per Id), so create the lm objects for the two models and then plot the points and the lines for each. The straight solid lines are the linear model and the curved dashed lines are the exponential model.
library(ggplot2)
fm.lin <- lm(Increment ~ Id/Weight + 0, df)
fm.exp <- lm(log(Increment) ~ Id/Weight + 0, df)
df %>%
  ggplot(aes(Weight, Increment, color = Id)) +
  geom_point() +
  geom_line(aes(y = fitted(fm.lin))) +
  geom_line(aes(y = exp(fitted(fm.exp))), lty = 2, lwd = 1)
To show only the exponential fit for the Aa group and the linear fit for the Ga group, NA out the portions that are not wanted. In this case solid lines are used for both fitted models.
df %>%
  ggplot(aes(Weight, Increment, color = Id)) +
  geom_point() +
  geom_line(aes(y = ifelse(Id == "Ga", fitted(fm.lin), NA))) +
  geom_line(aes(y = ifelse(Id == "Aa", exp(fitted(fm.exp)), NA)))
Added
Regarding the questions in the comments, the formula used above nests Weight within Id and effectively uses a model matrix which, modulo column order, is a block diagonal matrix whose blocks are the model matrices of the 3 individual models. Look at the model matrix to see it.
model.matrix(fm.lin)
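As a quick sanity check (a sketch using df and fm.lin from above), the fitted values from the single nested model should match three separate per-group fits:
sep_fits <- lapply(split(df, df$Id), function(d) fitted(lm(Increment ~ Weight, data = d)))
all.equal(unname(fitted(fm.lin)), unname(unlist(sep_fits)))
# should be TRUE (split() orders the groups Aa, Ga, Za, matching df's row order)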
Since this is a single model rather than three models, the summary statistics will be pooled. To get separate summary statistics use lmList from the nlme package (which comes with R, so it does not have to be installed; just issue a library statement). The statements below give objects of class lmList that can be used in place of the ones above, as they have a fitted method that returns the same fitted values.
library(nlme)
fm.lin2 <- lmList(Increment ~ Weight | Id, df, pool = FALSE)
fm.exp2 <- lmList(log(Increment) ~ Weight | Id, df, pool = FALSE)
In addition, they can be used to get individual summary statistics. Internally, an lmList object is a list of lm objects (three of them here) with some attributes, so summary statistics can be extracted from each component.
library(broom)
sapply(fm.lin2, glance)
sapply(fm.exp2, glance)
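If one tidy table per model type is more convenient than the sapply matrix, the same idea works with lapply and bind_rows (a sketch assuming fm.lin2 and fm.exp2 from above):
library(dplyr)
bind_rows(lapply(fm.lin2, broom::glance), .id = "Id")
bind_rows(lapply(fm.exp2, broom::glance), .id = "Id")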
One caveat is that common statistical tests between models using different dependent variables, Increment vs. log(Increment), are invalid.
Possible solution
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
library(tidyverse)
df_model <- df %>%
  group_nest(Id) %>%
  mutate(
    # formulas are matched by position to the alphabetical group order: Aa, Ga, Za
    formula = c(
      "lm(log(Increment) ~ Weight, data = .x)",
      "lm(Increment ~ Weight, data = .x)",
      "lm(Increment ~ 0, data = .x)"
    ),
    transform = c(
      "exp(fitted(.x))",
      "fitted(.x)",
      "fitted(.x)"
    )
  ) %>%
  mutate(model = map2(data, formula, .f = ~ eval(parse(text = .y)))) %>%
  mutate(fit = map2(model, transform, ~ eval(parse(text = .y)))) %>%
  select(Id, data, fit) %>%
  unnest(c(data, fit))
ggplot(df_model) +
  geom_point(aes(Weight, Increment, color = Id)) +
  geom_line(aes(Weight, fit, color = Id))
Created on 2021-10-06 by the reprex package (v2.0.1)
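A variant of the same idea that matches models to groups by name rather than by position, and avoids eval(parse()), might look like this (a sketch assuming the df from the question):
library(tidyverse)
fit_fun <- list(
  Aa = function(d) exp(fitted(lm(log(Increment) ~ Weight, data = d))),  # exponential model
  Ga = function(d) fitted(lm(Increment ~ Weight, data = d)),            # linear model
  Za = function(d) rep(NA_real_, nrow(d))                               # no curve
)
df_model2 <- df %>%
  group_nest(Id) %>%
  mutate(fit = map2(Id, data, ~ fit_fun[[.x]](.y))) %>%
  unnest(c(data, fit))
df_model2 plots exactly like df_model above; geom_line() drops the NA rows, so Za keeps its points but gets no curve.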

Export coefficients from a loop for multiple cox regression

I need your help! I have a data set with 100,000 cases and 81 variables, and I run a loop of Cox regressions in R, one for each variable, adjusted for age and sex:
covariates <- c("var1", "var2", ... "var81")
purrr::map(covariates, ~ coxph(as.formula(paste("Surv(Time, Event) ~ Age + Sex +", .x)), data = mydata))
The output includes the coefficients for age, sex, and each variable, like this:
        coef exp(coef) se(coef)      z      p
Age   0.0000    0.0000
Sex
Var1
I was wondering if there is a way to export to Excel only the coefficient row for each variable, i.e. only the third row, and not all three of them.
Thank you so much for your help in advance!
Using mtcars as an example -
library(dplyr)
library(survival)
covariates <- c('mpg', 'cyl')
purrr::map_df(covariates, ~ {
  mod <- coxph(as.formula(paste("Surv(disp, am) ~ hp +", .x)), data = mtcars)
  summary(mod)$coefficients[.x, ]
}) %>%
  mutate(covariate = covariates, .before = 1) -> result
result
#  covariate   coef `exp(coef)` `se(coef)`     z `Pr(>|z|)`
#  <chr>      <dbl>       <dbl>      <dbl> <dbl>      <dbl>
#1 mpg        0.614       1.85       0.167  3.68   0.000238
#2 cyl       -2.17        0.114      0.704 -3.08   0.00208
Write the output to Excel -
writexl::write_xlsx(result, 'data.xlsx')
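Translated back to the original problem, the same pattern would look roughly like this (a sketch; mydata, Time, Event, Age, Sex and the var names are placeholders taken from the question):
library(survival)
library(dplyr)
covariates <- paste0("var", 1:81)  # hypothetical column names
result <- purrr::map_df(covariates, ~ {
  mod <- coxph(as.formula(paste("Surv(Time, Event) ~ Age + Sex +", .x)), data = mydata)
  # keep only the row for the covariate itself (works for numeric covariates;
  # factor terms get level-suffixed row names)
  summary(mod)$coefficients[.x, ]
}) %>%
  mutate(covariate = covariates, .before = 1)
writexl::write_xlsx(result, "coefficients.xlsx")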

R loop for linear regression lm(y~x) and save model output as a dataset

I would like to make a regression loop lm(y~x) with a dataset with one y and several x, run the regression for each x, and then store the results (estimates, p-values) in a data.frame() so I don't have to copy them manually (especially as my real data set is much bigger).
I think this should not be too difficult, but I struggle a lot to make it work and appreciate your help:
Here is my sample data set:
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
#yname <- "fit"
#xnames <- c("Xbeta", "Xgamma", "Xalpha", "Xdelta")
The simple regression with the first independent variable Xbeta would look like this: lm(fit ~ Xbeta, data = sample_data). I would like to run the regression for each variable starting with an "X" and then store the results (estimate, p-value).
I have found a code that allows me to select variables that start with "X" and then use it for the model, but the code gives me an error from mutate() onwards (indicated by #).
library(tidyverse)
library(tsibble)
sample_data %>%
gather(stock, return, starts_with("X")) %>%
group_nest(stock)
# %>%
# mutate(model = map(data,
# ~lm(formula = "fit~ return",
# data = .x))
# ),
# resid = map(model, residuals)
# ) %>%
# unnest(c(data,resid)) %>%
# summarise(sd_residual = sd(resid))
For storing the regression results I have also found the following approach using the broom package: r for loop for regression lm(y~x)
sample_data%>%
group_by(y,x)%>% # get combinations of y and x to regress
do(tidy(lm(fRS_relative~xvalue, data=.)))
But I always get errors from group_by() and do().
I really appreciate your help!
One option would be to use lapply to perform a regression with each of the independent variables. Use tidy() from the broom package to store the results in a tidy format.
xnames <- c("Xbeta", "Xgamma", "Xalpha", "Xdelta")
test <- lapply(xnames, function(x) {
  broom::tidy(lm(as.formula(paste0("fit ~ ", x)), data = sample_data))
})
and then combine all the results into a single dataframe:
do.call('rbind', test)
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 1.05 0.133 7.89 0.0000995
# 2 Xbeta -0.156 0.0958 -1.62 0.148
# 3 (Intercept) 0.968 0.147 6.57 0.000313
# 4 Xgamma 0.0712 0.107 0.662 0.529
# 5 (Intercept) 1.09 0.131 8.34 0.0000697
# 6 Xalpha -0.0999 0.0508 -1.96 0.0902
# 7 (Intercept) 0.998 0.175 5.72 0.000723
# 8 Xdelta -0.0114 0.0909 -0.125 0.904
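To keep track of which predictor each block of rows came from, name the list before binding (building on test and xnames from above):
names(test) <- xnames
dplyr::bind_rows(test, .id = "predictor")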
Step one
Your data is messy, let us tidy it up.
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
tidyframe <- data.frame(
  fit = sample_data$fit,
  X = c(sample_data$Xbeta, sample_data$Xgamma, sample_data$Xalpha, sample_data$Xdelta),
  type = rep(c("beta", "gamma", "alpha", "delta"), each = 9)
)
Created on 2020-07-13 by the reprex package (v0.3.0)
Step two
Iterate over each type, and get the P-value, using this nifty function
# From https://stackoverflow.com/a/5587781/3212698
lmp <- function(modelobject) {
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}
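For example, on the whole sample (assuming sample_data from above):
lmp(lm(fit ~ Xbeta, data = sample_data))
# returns ~0.148, the same p-value that tidy() reported for the Xbeta slope above
# (for a simple regression, the overall F-test p-value equals the slope's t-test p-value)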
Then do some clever piping
library(dplyr)
tidyframe %>%
  group_by(type) %>%
  summarise(type = type, p = lmp(lm(formula = fit ~ X))) %>%
  unique()
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> # A tibble: 4 x 2
#> # Groups: type [4]
#> type p
#> <fct> <dbl>
#> 1 alpha 0.0902
#> 2 beta 0.148
#> 3 delta 0.904
#> 4 gamma 0.529
Created on 2020-07-13 by the reprex package (v0.3.0)

How to pull the coefficient values from a logistic regression into a dataframe in R? [duplicate]

This question already has answers here:
Extract regression coefficient values (4 answers)
Closed 4 years ago.
I ran a logistic regression model in R on a dataset. I can see the coefficients per predictor via summary(model_fit), but now I need to store them in a data frame. Below are the values as I see them via summary():
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.387e+00 2.734e+00 -1.605 0.1086
GDP_PER_CAP -6.888e-05 3.870e-05 -1.780 0.0751 .
CO2_PER_CAP 1.816e-01 7.255e-02 2.503 0.0123 *
PERC_ACCESS_ELECTRICITY -5.973e-03 1.291e-02 -0.463 0.6437
ATMS_PER_1E5 -5.749e-03 8.181e-03 -0.703 0.4822
PERC_INTERNET_USERS -2.146e-02 2.106e-02 -1.019 0.3083
SCIENTIFIC_ARTICLES_PER_YR 3.319e-05 1.650e-05 2.011 0.0443 *
PERC_FEMALE_SECONDARY_EDU 1.559e-01 6.428e-02 2.426 0.0153 *
PERC_FEMALE_LABOR_FORCE -1.265e-02 1.470e-02 -0.860 0.3896
PERC_FEMALE_PARLIAMENT -4.802e-02 2.087e-02 -2.301 0.0214 *
library(caret)
library(dplyr)
dataframe <- dataframe0 %>%
  mutate(EQUAL_PAY = relevel(factor(EQUAL_PAY), "YES"))
set.seed(1)
trn_index = createDataPartition(y = dataframe$EQUAL_PAY, p = 0.80, list = FALSE)
trn_equalpay = dataframe[trn_index, ]
tst_equalpay = dataframe[-trn_index, ]
equalpay_lgr = train(EQUAL_PAY ~ . -EQUAL_WORK -COUNTRY, method = "glm",
                     family = binomial(link = "logit"), data = trn_equalpay,
                     trControl = trainControl(method = 'cv', number = 10))
???? coefficients <- summary(equalpay_lgr)
You should definitely check out the broom package, which does lots of stuff like this. You can find an introduction to that package here.
For your question, the solution is tidy from broom. Using the example from the link above:
library(broom)
lmfit <- lm(mpg ~ wt, data = mtcars)  # the example model from the linked introduction
tidy(lmfit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.3 1.88 19.9 8.24e-19
## 2 wt -5.34 0.559 -9.56 1.29e-10
As you can see, tidy returns a tibble (a type of data frame) that contains a column estimate. This is the coefficient you're looking for.
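In the question's case the fitted glm is stored inside the caret object, so the same call should work on its finalModel component (a sketch assuming equalpay_lgr from the question):
library(broom)
coefficients_df <- tidy(equalpay_lgr$finalModel)  # one row per predictor, as a tibble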

How to get r.squared for each regression?

I'm working with a huge data frame with a structure similar to the following. I use output_reg to store the slope and intercept for each treatment, but I need to add the r.squared for each lm(y ~ x) and store it in another column besides the other two. Any hint on that?
library(plyr)
field <- c('t1','t1','t1', 't2', 't2','t2', 't3', 't3','t3')
predictor <- c(4.2, 5.3, 5.4,6, 7,8.5,9, 10.1,11)
response <- c(5.1, 5.1, 2.4,6.1, 7.7,5.5,1.99, 5.42,2.5)
my_df <- data.frame(field, predictor, response, stringsAsFactors = F)
output_reg <- list()
B <- unique(my_df$field)
for (i in 1:length(B)) {
  index <- my_df[my_df$field == B[i], ]
  x <- index$predictor
  y <- index$response
  output_reg[[i]] <- lm(y ~ x)  # gets estimates for each field
}
Thanks
The r.squared can be accessed via the summary of the model; try this:
m <- lm(y ~ x)
rs <- summary(m)$r.squared
The summary object of the linear regression result contains almost everything you need:
output_reg <- list()
B <- unique(my_df$field)
for (i in 1:length(B)) {
  index <- my_df[my_df$field == B[i], ]
  x <- index$predictor
  y <- index$response
  m <- lm(y ~ x)
  s <- summary(m)  # get the summary of the model
  # extract everything you need from the summary object
  output_reg[[i]] <- c(s$coefficients[, "Estimate"], r.squared = s$r.squared)
}
output_reg
output_reg
#[[1]]
#(Intercept) x r.squared
# 10.7537594 -1.3195489 0.3176692
#[[2]]
#(Intercept) x r.squared
# 8.8473684 -0.3368421 0.1389040
#[[3]]
#(Intercept) x r.squared
#-0.30500000 0.35963455 0.03788593
To bind the result together:
do.call(rbind, output_reg)
# (Intercept) x r.squared
# [1,] 10.753759 -1.3195489 0.31766917
# [2,] 8.847368 -0.3368421 0.13890396
# [3,] -0.305000 0.3596346 0.03788593
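do.call(rbind, ...) drops the field labels; to keep them alongside the estimates (using B and output_reg from above):
data.frame(field = B, do.call(rbind, output_reg), check.names = FALSE)
#   field (Intercept)          x  r.squared
# 1    t1   10.753759 -1.3195489 0.31766917
# 2    t2    8.847368 -0.3368421 0.13890396
# 3    t3   -0.305000  0.3596346 0.03788593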
Check out the broom package and sprinkle in some dplyr (see this vignette):
library(broom)
library(dplyr)
my_df %>%
  group_by(field) %>%
  do(glance(lm(predictor ~ response, data = .)))  # also see do(tidy(...))
# field r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
# 1 t1 0.31766917 -0.3646617 0.7778175 0.46556474 0.6188153 2 -1.855107 9.710214 7.006051 0.605000 1
# 2 t2 0.13890396 -0.7221921 1.6513038 0.16131065 0.7568653 2 -4.113593 14.227185 11.523022 2.726804 1
# 3 t3 0.03788593 -0.9242281 1.3894755 0.03937779 0.8752903 2 -3.595676 13.191352 10.487189 1.930642 1
Alternatively, save the regressions first:
regressions <- my_df %>% group_by(field) %>% do(fit = lm(predictor ~ response, data = .))
regressions %>% tidy(fit)
regressions %>% glance(fit)
You can do the following using purrr:
library(purrr)  # note: slice_rows() and by_slice() exist only in older purrr versions (0.2.x)
my_df %>%
  slice_rows("field") %>%
  by_slice(partial(lm, predictor ~ response), .labels = FALSE) %>%
  flatten %>%
  map(~ c(coef(.), r.squared = summary(.)$r.squared))
Which gives you:
[[1]]
(Intercept) response r.squared
5.9777778 -0.2407407 0.3176692
[[2]]
(Intercept) response r.squared
9.8195876 -0.4123711 0.1389040
[[3]]
(Intercept) response r.squared
9.68534163 0.10534562 0.03788593
If you want a data.frame back instead, use this as the last line:
map_df(~ as.data.frame(t(c(coef(.), r.squared = summary(.)$r.squared))))
You can create a data frame with model stats like this:
model_stats <- data.frame(model$coefficients)
model_stats <- rbind(model_stats, r.sq = summary(model)$r.squared)
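For example, with one of the fields from the question (assuming my_df from above):
model <- lm(response ~ predictor, data = subset(my_df, field == "t1"))
model_stats <- data.frame(model$coefficients)
model_stats <- rbind(model_stats, r.sq = summary(model)$r.squared)
model_stats
#             model.coefficients
# (Intercept)         10.7537594
# predictor           -1.3195489
# r.sq                 0.3176692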
