I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression several times, each time using a different column as the independent variable while the dependent variable stays the same. After each run I want to extract the R squared from the regression. My first thought was to use a simple for loop, but I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can build the formula inside a loop and then pass it to lm. For instance, to regress mpg on each of cyl, wt and hp in mtcars, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
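If a named numeric vector is more convenient than a list, the same idea can be sketched in base R with vapply(). This is a minimal sketch that rebuilds the three-row example data frame from the question; the object names `predictors` and `r2` are illustrative, not part of the original answer.

```r
# Example data copied from the question (three rows only)
df <- data.frame(
  num     = c(1, 2, 3),
  value   = c(229, 203, 244),
  person1 = c(29, 17, 62),
  person2 = c(81, 75, 0),
  person3 = c(0, 0, 55)
)

predictors <- c("person1", "person2", "person3")

# Fit value ~ <predictor> for each column and keep only R squared
r2 <- vapply(predictors, function(col) {
  fit <- lm(reformulate(col, response = "value"), data = df)
  summary(fit)$r.squared
}, numeric(1))

round(r2, 3)
```

Because vapply() keeps the predictor names, `r2[["person1"]]` looks up a single value directly.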
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
I tried doing this
fits = list(fit0)
for(i in 1:5)
{
temp = assign(paste0("fit", i), lm(formula = y ~ poly(x, degree = i, raw = TRUE)))
fits = append(fits, temp)
}
which seems like it should work and I don't get any errors initially. The problem though, is that instead of creating a list of lists of length 6, where each element is itself a list (as lm objects are lists), it seems to be taking the elements of each list and making them each a separate element in temp. When I do length(fits) it gives 61. And when I do View(fits) it shows me this:
which certainly looks like it took all the elements of each individual list and made them the elements of fits, though I don't understand why.
Oddly though, if I just do fits[1] in the console it gives
which is the exact same output I get if I type fit0 in the console. So it seems like it's in some way storing each lm object as one thing.
The problem though, is that if I then try to get the R^2 value for, e.g., fit0, it works fine if I do summary(fit0)$r.squared, but if try to do it for fits[1] it does this:
I don't understand what's going on here. I thought maybe the problem was using append, since I'd previously only used it with vectors so I Googled "how to create list of lists in R" but the examples I found used append so that doesn't seem to be the issue.
I assume it's something to do with the intricacy of lm objects, but the documentation isn't actually helpful (on a side note, why IS R's documentation so terrible anyway? Compared to Python, or even C++ (which is a far more complicated language to work with overall), it's so much harder to glean the details of how the different functions and data types work because the documentation always seems to give the bare minimum, if that, of information) so I don't know what I'm doing wrong.
I've tried Googling how to create a list of lm objects and I found the lmlist data type documentation, but that seems to be for when you want to create a single regression using data grouped by categories in a data.frame, which isn't what I'm trying to do here. I also found this post: Populating a list with lm objects, but I don't really understand the example code the OP asks about, as I'm unsure what they mean by a "random name" or how it even makes sense for them to be trying to access a named element in what looks like an empty list, and the only answer does the same thing. I did make note of the comment mentioning using double brackets, but I get the same error whether I use double brackets or not:
I'm quite confused here, so any guidance would be greatly appreciated.
Showing how to use a for loop for this:
DF <- data.frame(x = rnorm(100), y = rnorm(100))
fits <- list(fit0 = lm(y ~ 1, data = DF))
for(i in 1:5)
{
fits[[paste0("fit", i)]] <- lm(formula = y ~ poly(x, degree = i, raw = TRUE), data = DF)
}
sapply(fits, \(x) summary(x)$r.squared)
# fit0 fit1 fit2 fit3 fit4 fit5
#0.00000000 0.06441347 0.07915820 0.08547018 0.08547089 0.08569820
From the perspective of a statistician, you should not do this.
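As for why the original `append(fits, temp)` call produced so many elements: an lm object is itself a list, and append() concatenates two lists element by element, so each model's components end up as separate entries of `fits`. That would account for a length of 61 if each of the five fits has 12 components (1 + 5 × 12). Wrapping the model in list(), or assigning with `[[ ]]` as in the loop above, keeps it as a single element. A minimal sketch with simulated data and illustrative names:

```r
set.seed(1)
DF <- data.frame(x = rnorm(20), y = rnorm(20))

fit0 <- lm(y ~ 1, data = DF)
fit1 <- lm(y ~ x, data = DF)

flat   <- append(list(fit0), fit1)        # splices fit1's components in one by one
nested <- append(list(fit0), list(fit1))  # adds fit1 as a single element

length(flat)    # 1 + length(fit1)
length(nested)  # 2

# [[ ]] extracts the model itself; [ ] returns a list of length one
class(nested[[2]])  # "lm"
class(nested[2])    # "list"
summary(nested[[2]])$r.squared
```

This is also why `summary(fits[1])` does not behave like `summary(fit0)`: single brackets hand summary() a list rather than an lm object.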
lm objects in R are indeed complicated. The broom package offers a consistent way to convert model objects into a "tidy" output format that can be easier to work with downstream.
For instance, we can use broom::glance to get a table with the lm stats as a data frame:
fit <- lm(mpg ~ wt, data = mtcars)
broom::glance(fit)
Result
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170. 278. 30 32
We could extend this to an example where we group the mtcars dataset by gear, nest the associated data for each gear group, run lm on each one, glance each of those, and finally unnest to get the results into a table. That seems to demonstrate what you're describing -- we can see how the r.squared varied for the lm run on each group.
library(tidyverse); library(broom)
mtcars %>%
group_by(gear) %>%
nest() %>%
mutate(fit = map(data, ~lm(.x$mpg~.x$wt)),
tidy = map(fit, glance)) %>%
unnest(tidy)
# Groups: gear [3]
gear data fit r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<dbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 4 <tibble [12 × 10]> <lm> 0.677 0.645 3.14 21.0 0.00101 1 -29.7 65.4 66.8 98.9 10 12
2 3 <tibble [15 × 10]> <lm> 0.608 0.578 2.19 20.2 0.000605 1 -32.0 69.9 72.1 62.3 13 15
3 5 <tibble [5 × 10]> <lm> 0.979 0.972 1.11 141. 0.00128 1 -6.34 18.7 17.5 3.69 3 5
Or, if you already have a list of lm objects, you could feed it into map_dfr(glance) to get a table with r.squared:
fit1 <- lm(mpg~wt, mtcars)
fit2 <- lm(mpg~cyl+wt, mtcars)
list(fit1, fit2) %>%
map_dfr(glance)
# A tibble: 2 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170. 278. 30 32
2 0.830 0.819 2.57 70.9 6.81e-12 2 -74.0 156. 162. 191. 29 32
I've figured out how to run a 2-way ANOVA on several variables in my data frame, but I'm not sure how to get the results into a format that could be easily exported to a CSV or Excel file. Ideally, each of my several hundred dependent variables would be in its own row, with the p-values and F-values.
I've made an example using the titanic dataset. In this case I've set Sex & Embarked as my categorical variables, and would like the output for the effects of Sex Embarked and ~Interaction somehow saved to a file. I'm open to suggestions on how to output this -- just want to be able to easily identify what values are significant, ideally with each dependent variable on its own line.
library(titanic)
library(tidyverse)
df1 <-
titanic_train %>%
select(Sex, Embarked, (1:10)) %>%
select(!("Name" | "Ticket")) %>%
filter(Embarked != "") # deleting empty Embarked status
names(df1)
df1$Sex<- factor(df1$Sex)
df1$Embarked <-factor(df1$Embarked)
#store all formulae in a list
formulae <- lapply(colnames(df1)[3:ncol(df1)], function(x) as.formula(paste0(x, " ~ Sex * Embarked")))
#go through list and run aov()
results <- lapply(formulae, function(x) summary(aov(x, data = df1)))
names(results) <- format(formulae)
results
You can extract the relevant statistics from the summary or store the model in a list and use broom::tidy on it to get all the stats together in a dataframe. Use map functions to run it on list of models.
library(purrr)
library(broom)
results <- lapply(formulae, function(x) aov(x, data = df1))
names(results) <- format(formulae)
data <- map_df(results, tidy, .id = 'formulae')
data
# A tibble: 28 x 7
# formulae term df sumsq meansq statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 PassengerId ~… Sex 1 1.09e+5 1.09e+5 1.65 1.99e- 1
# 2 PassengerId ~… Embark… 2 5.50e+4 2.75e+4 0.416 6.60e- 1
# 3 PassengerId ~… Sex:Em… 2 7.73e+4 3.86e+4 0.584 5.58e- 1
# 4 PassengerId ~… Residu… 883 5.84e+7 6.61e+4 NA NA
# 5 Survived ~ Se… Sex 1 6.16e+1 6.16e+1 376. 4.44e-70
# 6 Survived ~ Se… Embark… 2 3.32e+0 1.66e+0 10.1 4.39e- 5
# 7 Survived ~ Se… Sex:Em… 2 4.85e-1 2.43e-1 1.48 2.28e- 1
# 8 Survived ~ Se… Residu… 883 1.45e+2 1.64e-1 NA NA
# 9 Pclass ~ Sex … Sex 1 1.01e+1 1.01e+1 16.2 6.12e- 5
#10 Pclass ~ Sex … Embark… 2 5.83e+1 2.91e+1 46.8 4.74e-20
# … with 18 more rows
Write data to csv (write_csv() comes from readr, which library(tidyverse) loads).
readr::write_csv(data, 'data.csv')
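If one row per dependent variable is preferred, the F and p-values can also be pulled out of summary(aov()) directly. The following is a base R sketch under stated assumptions: it uses mtcars with `am` and `vs` as stand-in factors rather than the titanic data, and `responses`, `anova_row`, `anova_table` and `csv_path` are hypothetical names.

```r
# Illustrative data only: mtcars with two factors, not the titanic data above
dat <- transform(mtcars, am = factor(am), vs = factor(vs))
responses <- c("mpg", "hp", "wt")  # hypothetical dependent variables

# Build one row of F and p-values for a single dependent variable
anova_row <- function(resp) {
  fit <- aov(as.formula(paste(resp, "~ am * vs")), data = dat)
  tab <- summary(fit)[[1]]  # ANOVA table as a data frame
  # Row names are padded with spaces; ":" is replaced to get plain column names
  terms <- gsub(":", "_x_", trimws(rownames(tab)))
  keep <- terms != "Residuals"
  f <- setNames(tab[keep, "F value"], paste0("F_", terms[keep]))
  p <- setNames(tab[keep, "Pr(>F)"], paste0("p_", terms[keep]))
  data.frame(response = resp, as.list(c(f, p)))
}

anova_table <- do.call(rbind, lapply(responses, anova_row))

csv_path <- file.path(tempdir(), "anova_wide.csv")  # hypothetical output path
write.csv(anova_table, csv_path, row.names = FALSE)
```

Each row then holds the F and p-values for both main effects and the interaction, which makes it straightforward to filter for significant results in a spreadsheet.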
I have a data table with the results of an experiment that evaluated 2 factors: Light and Day_light at 2 different temperatures (Temperature).
I performed a 2-way ANOVA for each Temperature using the rstatix package.
The result of the 2-way ANOVA is presented as a data table, with a column called SSn. I would like to divide each SSn value by the sum of all SSn values for each Temperature. For that, I used an approach similar to the one that the rstatix package uses, but I was not successful. Below, I present the code I used and a brief graphic explanation of what I want to accomplish.
library(rstatix)
# Data frame
Temperature <- factor(c(rep("cold", times = 8),
rep("hot", times = 8)),
levels = c("cold", "hot"))
Light <- factor(rep(c(rep("blue", times = 4),
rep("yellow", times = 4)),
times = 2),
levels = c("blue", "yellow"))
Day_light <- factor(rep(c(rep("Day", times = 2),
rep("Night", times = 2)),
times = 4),
levels = c("Day", "Night"))
Result <- c(90.40, 85.20, 21.70, 25.30,
75.12, 77.36, 6.11, 10.8,
85.14, 88.96, 30.21, 35.15)
Data <- data.frame(Temperature, Light, Day_light, Result)
# ANOVA
ANOVA <- Data %>%
group_by(Temperature) %>%
anova_test(Result ~ Light * Day_light,
detailed = TRUE)
ANOVA
# Calculations within the ANOVA data frame (not running)
Calculations <- ANOVA %>%
group_by(Temperature) %>%
ANOVA$SSn/sum(ANOVA$SSn)*100
Calculations
> ANOVA
# A tibble: 6 x 10
Temperature Effect SSn SSd DFn DFd F p `p<.05` ges
* <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 cold Light 354. 33.5 1 4 42.2 0.003 "*" 0.914
2 cold Day_light 8723. 33.5 1 4 1041. 0.0000055 "*" 0.996
3 cold Light:Day_light 6.07 33.5 1 4 0.725 0.442 "" 0.153
4 hot Light 773. 23.1 1 4 134. 0.000318 "*" 0.971
5 hot Day_light 5014. 23.1 1 4 869. 0.00000788 "*" 0.995
6 hot Light:Day_light 37.0 23.1 1 4 6.41 0.065 "" 0.616
I have already solved it partially, but I still don't know how to separate the calculation by Temperature
ANOVA$Calculations <- ANOVA$SSn/sum(ANOVA$SSn)*100
Graphical representation of my question
I tend to stick to data.table personally because it has some nice benefits, although there is quite a learning curve involved.
The dplyr way is also shown:
library(data.table)
ANOVA <- as.data.table(ANOVA)
ANOVA[, Calculations := SSn / sum(SSn) * 100, by = Temperature]
ANOVA
## and the dplyr way:
library(dplyr)
ANOVA %>% group_by(Temperature) %>%
mutate(Calculations = SSn / sum(SSn) * 100)
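The same per-group calculation can be sketched in base R with ave(), which returns the group total repeated on every row of that group. This sketch rebuilds a plain data frame from the SSn values as printed (rounded) in the question's ANOVA table, so the percentages are approximate.

```r
# SSn values as printed (rounded) in the ANOVA table from the question
ANOVA <- data.frame(
  Temperature = rep(c("cold", "hot"), each = 3),
  Effect      = rep(c("Light", "Day_light", "Light:Day_light"), times = 2),
  SSn         = c(354, 8723, 6.07, 773, 5014, 37.0)
)

# ave() returns the group sum repeated for every row of that group
ANOVA$Calculations <- ANOVA$SSn / ave(ANOVA$SSn, ANOVA$Temperature, FUN = sum) * 100

# Each Temperature group should now sum to 100
tapply(ANOVA$Calculations, ANOVA$Temperature, sum)
```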
My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot seem to adapt the code to my problem.
#Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated
library(MASS) # load MASS first so it does not mask dplyr::select()
library(tidyverse)
library(broom)
# We first create a formula object
my_formula <- as.formula(paste("nps_score ~", paste(df %>% dplyr::select(-site, -nps_score) %>% names(), collapse = " + ")))
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
group_by(site) %>%
do(tidy(rlm(formula(my_formula), data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function and don't know whether it provides p-values in the first place. At least the tidy function doesn't offer p-values for rlm. If an ordinary linear regression would suit your needs, you could replace rlm with lm, in which case a sixth column with p-values would be added.
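For the lm variant mentioned in the note, a base R sketch without any packages could look like this. It swaps lm in for rlm (so it is not a robust fit), uses an arbitrary seed for the simulated scores, and the names `per_site` and `coef_table` are illustrative.

```r
set.seed(42)  # arbitrary seed so the simulated scores are reproducible
df <- data.frame(
  site          = c(rep("A", 95), rep("B", 110), rep("C", 250)),
  nps_score     = floor(runif(455, min = 0, max = 10)),
  service_score = floor(runif(455, min = 0, max = 10)),
  food_score    = floor(runif(455, min = 0, max = 10)),
  clean_score   = floor(runif(455, min = 0, max = 10))
)

# Fit one lm per site and turn each coefficient table into a data frame
per_site <- lapply(split(df, df$site), function(d) {
  fit <- lm(nps_score ~ service_score + food_score + clean_score, data = d)
  ct  <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    site      = d$site[1],
    term      = rownames(ct),
    estimate  = ct[, "Estimate"],
    std.error = ct[, "Std. Error"],
    statistic = ct[, "t value"],
    p.value   = ct[, "Pr(>|t|)"],
    row.names = NULL
  )
})

# Stack the per-site tables into one data frame
coef_table <- do.call(rbind, per_site)
rownames(coef_table) <- NULL
coef_table
```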
I tried running the following code, but I get the errors shown in the pictures below. I'm quite new to R, so I don't know if this is relevant, but the first column in my data frame called "data" contains dates. I get "Error in as.Date.numeric(value): 'origin' must be supplied". My intuition says it has something to do with the date column, but then again, I'm a newbie. Just in case: the date column is not supposed to be part of coef.vec.
v1 <- 2:7
coef.vec <- data.frame(NULL) # create object to keep results
for (i in seq_along(v1)) {
m <- summary(lm(data[,v1[i]] ~ data[,8])) # run model
coef.vec[i, 1] <- names(data)[v1[i]] # print variable name
coef.vec[i, 2] <- m$coefficients[1,1] # intercept
coef.vec[i, 3] <- m$coefficients[2,1] # coefficient
coef.vec[i, 4] <- mean(data[[i]]) # means of variables
}
names(coef.vec) <- c("y.variable", "intercept", "coef.x","variable.mean")
Try this approach using lapply for column 2 to 7 of your data.
coef.vec <- do.call(rbind, lapply(names(data)[2:7], function(x) {
m <- summary(lm(data[[x]] ~ data[[8]]))
data.frame(y.variable = x,
intercept = m$coefficients[1,1],
coef.x = m$coefficients[2,1],
variable.mean = mean(data[[x]]))
}))
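One plausible reading of the error, offered as an assumption rather than a diagnosis: in the original loop, `mean(data[[i]])` indexes columns 1 to 6 instead of 2 to 7, so the first iteration takes the mean of the date column and stores a Date in `coef.vec`; a later numeric assignment into that Date column can then raise the `'origin' must be supplied` message on older R versions. The sketch below uses a hypothetical stand-in for `data` to show the indexing difference.

```r
set.seed(1)
# Hypothetical stand-in for `data`: dates first, responses in 2:7, predictor in 8
data <- data.frame(date = as.Date("2020-01-01") + 0:29)
for (j in 1:6) data[[paste0("y", j)]] <- rnorm(30)
data$x <- rnorm(30)

v1 <- 2:7
class(data[[1]])      # "Date": what data[[i]] picks up when i is 1
class(data[[v1[1]]])  # "numeric": the intended first response column

# Index by v1 (or by name) so the date column is never touched
variable.mean <- vapply(data[v1], mean, numeric(1))
variable.mean
```

Indexing by column name, as the lapply version above does, avoids the problem in the same way.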
We can construct the formula with reformulate, fit the model with lm, tidy the summary output with tidy from broom, and combine everything into a single dataset:
library(dplyr)
library(purrr)
library(broom)
map_dfr(names(data)[2:7], ~
tidy(lm(reformulate(names(data)[8], response = .x), data = data)))
Or this can be done in a single step without any loop
tidy(lm(cbind(iris[,1], iris[,2]) ~ Species, iris))
Or
tidy(lm(as.matrix(iris[1:2]) ~ Species, iris))
# A tibble: 6 x 6
# response term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 Sepal.Length (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Sepal.Length Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Sepal.Length Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 Sepal.Width (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Sepal.Width Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Sepal.Width Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10
and check the output from the loop
map_dfr(names(iris)[1:2], ~ tidy(lm(reformulate('Species', response = .x), data = iris)))
# A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10
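The agreement between the single multi-response fit and the per-column fits can also be checked in base R without broom. This is a small sketch on iris; `mfit` and `sfit` are illustrative names.

```r
# One lm call with a matrix response fits both models at once
mfit <- lm(as.matrix(iris[1:2]) ~ Species, data = iris)
coef(mfit)  # rows are terms, columns are responses

# Separate single-response fit for comparison
sfit <- lm(Sepal.Length ~ Species, data = iris)
all.equal(coef(mfit)[, "Sepal.Length"], coef(sfit))
```

coef() on the multi-response fit returns a matrix, so a given response's coefficients are simply one column of it.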