I would like to add 2 different regression curves, coming from different models, in a scatter plot.
Let's use the example below:
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
The scatter plot looks like this:
plot_df <- ggplot(df, aes(x = Weight, y = Increment, color=Id)) + geom_point()
I tested a linear and an exponential regression model and could extract the results following loki's answer there:
linear_df <- df %>% group_by(Id) %>% do(model = glance(lm(Increment ~ Weight,data = .))) %>% unnest(model)
exp_df <- df %>% group_by(Id) %>% do(model = glance(lm(log(Increment) ~ Weight,data = .))) %>% unnest(model)
The linear model fits better for the Ga group, the exponential one for the Aa group, and nothing for the Za one:
> linear_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.656 0.637 15.1 34.4 1.50e- 5 1 -81.6 169. 172. 4106. 18 20
2 Ga 1.00 1.00 0.243 104113. 6.10e-32 1 1.01 3.98 6.65 0.942 16 18
3 Za 0.0471 -0.0262 26.7 0.642 4.37e- 1 1 -69.5 145. 147. 9283. 13 15
> exp_df
# A tibble: 3 x 13
Id r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Aa 0.999 0.999 0.0624 24757. 1.05e-29 1 28.2 -50.3 -47.4 0.0700 18 20
2 Ga 0.892 0.885 0.219 132. 3.86e- 9 1 2.87 0.264 2.94 0.766 16 18
3 Za 0.00444 -0.0721 0.941 0.0580 8.14e- 1 1 -19.3 44.6 46.7 11.5 13 15
Now, how can I draw the linear regression line for the Aa group, the exponential regression curve for the Ga group, and no curve for the Za group? There is this, but it applies for different regressions built inside the same model type. How can I combine my different objects?
The formula shown below gives the same fitted values as does 3 separate fits for each Id so create the lm objects for each of the two models and then plot the points and the lines for each. The straight solid lines are the linear model and the curved dashed lines are the exponential model.
library(ggplot2)
fm.lin <- lm(Increment ~ Id/Weight + 0, df)
fm.exp <- lm(log(Increment) ~ Id/Weight + 0, df)
df %>%
ggplot(aes(Weight, Increment, color=Id)) +
geom_point() +
geom_line(aes(y = fitted(fm.lin))) +
geom_line(aes(y = exp(fitted(fm.exp))), lty = 2, lwd = 1)
To only show the Aa fitted lines for the linear model and Ga fitted lines for the exponential model NA out the portions not wanted. In this case we used solid lines for the fitted models.
df %>%
ggplot(aes(Weight, Increment, color=Id)) +
geom_point() +
geom_line(aes(y = ifelse(Id == "Aa", fitted(fm.lin), NA))) +
geom_line(aes(y = ifelse(Id == "Ga", exp(fitted(fm.exp)), NA)))
Added
Regarding the questions in the comments, the formula used above nests Weight within Id and effectively uses a model matrix which, modulo column order, is a block diagonal matrix whose blocks are the model matrices of the 3 individual models. Look at this to understand it.
model.matrix(fm.lin)
Since this is a single model rather than three models the summary statistics will be pooled. To get separate summary statistics use lmList from the nlme package (which comes with R so it does not have to be installed -- just issue a library statement). The statements below will give objects of class lmList that can be used in place of the ones above as they have a fitted method that will return the same fitted values.
library(nlme)
fm.lin2 <- lmList(Increment ~ Weight | Id, df, pool = FALSE)
fm.exp2 <- lmList(log(Increment) ~ Weight | Id, df, pool = FALSE)
In addition, they can be used to get individual summary statistics. Internally the lmList objects consist of a list of 3 lm objects with attributes in this case so we can extract the summary statistics by extracting the summary statistics from each component.
library(broom)
sapply(fm.lin2, glance)
sapply(fm.exp2, glance)
One caveat is that common statistical tests between models using different dependent variables, Increment vs. log(Increment), are invalid.
possible solution
Weight=c(12.6,12.6,16.01,17.3,17.7,10.7,17,10.9,15,14,13.8,14.5,17.3,10.3,12.8,14.5,13.5,14.5,17,14.3,14.8,17.5,2.9,21.4,15.8,40.2,27.3,18.3,10.7,0.7,42.5,1.55,46.7,45.3,15.4,25.6,18.6,11.7,28,35,17,21,41,42,18,33,35,19,30,42,23,44,22)
Increment=c(0.55,0.53,16.53,55.47,80,0.08,41,0.1,6.7,2.2,1.73,3.53,64,0.05,0.71,3.88,1.37,3.8,40,3,26.3,29.7,10.7,35,27.5,60,43,31,21,7.85,63,9.01,67.8,65.8,27,40.1,31.2,22.3,35,21,74,75,12,19,4,20,65,46,9,68,74,57,57)
Id=c(rep("Aa",20),rep("Ga",18),rep("Za",15))
df=data.frame(Id,Weight,Increment)
library(tidyverse)
df_model <- df %>%
group_nest(Id) %>%
mutate(
formula = c(
"lm(log(Increment) ~ Weight, data = .x)",
"lm(Increment ~ Weight,data = .x)",
"lm(Increment ~ 0,data = .x)"
),
transform = c("exp(fitted(.x))",
"fitted(.x)",
"fitted(.x)")
) %>%
mutate(model = map2(data, formula, .f = ~ eval(parse(text = .y)))) %>%
mutate(fit = map2(model, transform, ~ eval(parse(text = .y)))) %>%
select(Id, data, fit) %>%
unnest(c(data, fit))
ggplot(df_model) +
geom_point(aes(Weight, Increment, color = Id)) +
geom_line(aes(Weight, fit, color = Id))
Created on 2021-10-06 by the reprex package (v2.0.1)
I would like to make a regression loop lm(y~x) with a dataset with one y and several x, and run the regression for each x, and then also store the results (estimate, p-values) in a data.frame() so I don't have to copy them manually (especially as my real data set it much bigger).
I think this should not be too difficult, but I struggle a lot to make it work and appreciate your help:
Here is my sample data set:
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
#yname <- ("fit")
#xnames <- c("Xbeta ","Xgamma", "Xalpha", "Xdelta")
The simple regression with the first independant variable Xbeta would look like this lm(fit~Xbeta, data= sample_data)and I would like to run the regression for each variable starting with an "X" and then store the result (estimate, p-value).
I have found a code that allows me to select variables that start with "X" and then use it for the model, but the code gives me an error from mutate() onwards (indicated by #).
library(tidyverse)
library(tsibble)
sample_data %>%
gather(stock, return, starts_with("X")) %>%
group_nest(stock)
# %>%
# mutate(model = map(data,
# ~lm(formula = "fit~ return",
# data = .x))
# ),
# resid = map(model, residuals)
# ) %>%
# unnest(c(data,resid)) %>%
# summarise(sd_residual = sd(resid))
For then storing the regression results I have also found the following appraoch using the R package "broom": r for loop for regression lm(y~x)
sample_data%>%
group_by(y,x)%>% # get combinations of y and x to regress
do(tidy(lm(fRS_relative~xvalue, data=.)))
But I always get an error for group_by() and do()
I really appreciate your help!
One option would be to use lapply to perform a regression with each of the independent variables. Use tidy from broom library to store the results into a tidy format.
lapply(1:length(xnames),
function(i) broom::tidy(lm(as.formula(paste0('fit ~ ', xnames[i])), data = sample_data))) -> test
and then combine all the results into a single dataframe:
do.call('rbind', test)
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 1.05 0.133 7.89 0.0000995
# 2 Xbeta -0.156 0.0958 -1.62 0.148
# 3 (Intercept) 0.968 0.147 6.57 0.000313
# 4 Xgamma 0.0712 0.107 0.662 0.529
# 5 (Intercept) 1.09 0.131 8.34 0.0000697
# 6 Xalpha -0.0999 0.0508 -1.96 0.0902
# 7 (Intercept) 0.998 0.175 5.72 0.000723
# 8 Xdelta -0.0114 0.0909 -0.125 0.904
Step one
Your data is messy, let us tidy it up.
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
tidyframe = data.frame(fit = sample_data$fit,
X = c(sample_data$Xbeta,sample_data$Xgamma,sample_data$Xalpha,sample_data$Xdelta),
type = c(rep("beta",9),rep("gamma",9),rep("alpha",9),rep("delta",9)))
Created on 2020-07-13 by the reprex package (v0.3.0)
Step two
Iterate over each type, and get the P-value, using this nifty function
# From https://stackoverflow.com/a/5587781/3212698
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
Then do some clever piping
tidyframe %>% group_by(type) %>%
summarise(type = type, p = lmp(lm(formula = fit ~ X))) %>%
unique()
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> # A tibble: 4 x 2
#> # Groups: type [4]
#> type p
#> <fct> <dbl>
#> 1 alpha 0.0902
#> 2 beta 0.148
#> 3 delta 0.904
#> 4 gamma 0.529
Created on 2020-07-13 by the reprex package (v0.3.0)
Im working with a huge data frame with structure similar to the followings. I use output_reg to store slope and intercept for each treatment but I need to add r.squared for each lm (y~x) and store it in another column besides the other two. Any hint on that?
library(plyr)
field <- c('t1','t1','t1', 't2', 't2','t2', 't3', 't3','t3')
predictor <- c(4.2, 5.3, 5.4,6, 7,8.5,9, 10.1,11)
response <- c(5.1, 5.1, 2.4,6.1, 7.7,5.5,1.99, 5.42,2.5)
my_df <- data.frame(field, predictor, response, stringsAsFactors = F)
output_reg<-list()
B<-(unique(my_df$field))
for (i in 1:length(B)) {
index <- my_df[my_df$field==B[i],]
x<- index$predictor
y<- index$response
output_reg[[i]] <- lm (y ~ x) # gets estimates for each field
}
Thanks
r.squared can be accessed via the summary of the model, try this:
m <- lm(y ~ x)
rs <- summary(m)$r.squared
The summary object of the linear regression result contains almost everything you need:
output_reg<-list()
B<-(unique(my_df$field))
for (i in 1:length(B)) {
index <- my_df[my_df$field==B[i],]
x<- index$predictor
y<- index$response
m <- lm (y ~ x)
s <- summary(m) # get the summary of the model
# extract every thing you need from the summary object
output_reg[[i]] <- c(s$coefficients[, 'Estimate'], r.squared = s$r.squared)
}
output_reg
#[[1]]
#(Intercept) x r.squared
# 10.7537594 -1.3195489 0.3176692
#[[2]]
#(Intercept) x r.squared
# 8.8473684 -0.3368421 0.1389040
#[[3]]
#(Intercept) x r.squared
#-0.30500000 0.35963455 0.03788593
To bind the result together:
do.call(rbind, output_reg)
# (Intercept) x r.squared
# [1,] 10.753759 -1.3195489 0.31766917
# [2,] 8.847368 -0.3368421 0.13890396
# [3,] -0.305000 0.3596346 0.03788593
Check-out the broom package and sprinkle in some dplyr (see this vignette):
library(broom)
library(dplyr)
my_df %>%
group_by(field) %>%
do(glance(lm(predictor ~ response, data = .))) #also see do(tidy(...))
# field r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
# 1 t1 0.31766917 -0.3646617 0.7778175 0.46556474 0.6188153 2 -1.855107 9.710214 7.006051 0.605000 1
# 2 t2 0.13890396 -0.7221921 1.6513038 0.16131065 0.7568653 2 -4.113593 14.227185 11.523022 2.726804 1
# 3 t3 0.03788593 -0.9242281 1.3894755 0.03937779 0.8752903 2 -3.595676 13.191352 10.487189 1.930642 1
Alternatively, save the regressions first:
regressions <- my_df %>% group_by(field) %>% do(fit = lm(predictor ~ response, data = .))
regressions %>% tidy(fit)
regressions %>% glance(fit)
You can do the following using purrr
require(purrr)
my_df %>%
slice_rows("field") %>%
by_slice(partial(lm, predictor ~ response), .labels = FALSE) %>%
flatten %>%
map(~c(coef(.), r.squared=summary(.)$r.squared))
Which gives you:
[[1]]
(Intercept) response r.squared
5.9777778 -0.2407407 0.3176692
[[2]]
(Intercept) response r.squared
9.8195876 -0.4123711 0.1389040
[[3]]
(Intercept) response r.squared
9.68534163 0.10534562 0.03788593
If you want a data.frame back instead use this as last line:
map_df(~as.data.frame(t(c(coef(.), r.squared=summary(.)$r.squared))))
You can create a data frame with model stats like this:
model_stats <- data.frame(model$coefficients)
model_stats <- rbind(model_stats, r.sq = summary(model)$r.squared)