Save R-squared from lm summary as a data frame

I want to save the results of an lm model into a data frame. I generated an empty data frame (Startframe) where I want to save the results.
My data frame containing the data is called testdata in this case. It contains the Date in the first column and then several Stations in the rest of the columns.
So far this code works to get the Estimate, Std. Error, t value and Pr(>|t|):
for(i in 2:ncol(testdata)) {
  x <- testdata[, 1]
  y <- testdata[, i]
  mod <- lm(y ~ x)
  Startframe[i, ] <- c(i,
                       summary(mod)[['coefficients']]['(Intercept)', 'Estimate'],
                       summary(mod)[['coefficients']]['x', 'Estimate'],
                       summary(mod)[['coefficients']]['x', 'Std. Error'],
                       summary(mod)[['coefficients']]['x', 't value'],
                       summary(mod)[['coefficients']]['x', 'Pr(>|t|)'])
}
But how can I also extract the r.squared?
I tried adding summary(mod)[['r.squared']] to the list, but it gives me the wrong numbers.
I know str(summary(mod)) gives me an overview, but I can't figure out how to add it to my loop.
Thanks for your help.

A nice way to work with the same model on different datasets is the tidyverse approach using the broom package.
In this example I'm using the diamonds dataset to test how carat and depth affect the diamonds' price in different diamond cuts.
require(tidyverse)
require(broom)

diamonds %>%
  nest(-cut) %>%
  mutate(model = purrr::map(data, function(x) {
           lm(price ~ carat + depth, data = x)
         }),
         values = purrr::map(model, glance),
         r.squared = purrr::map_dbl(values, "r.squared"),
         pvalue = purrr::map_dbl(values, "p.value")) %>%
  select(-data, -model, -values)
  cut       r.squared pvalue
  <ord>         <dbl>  <dbl>
1 Ideal         0.867      0
2 Premium       0.856      0
3 Good          0.851      0
4 Very Good     0.859      0
5 Fair          0.746      0
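If you would rather stay with the original loop from the question, r.squared can be pulled straight off the summary object; a minimal sketch, assuming Startframe has a seventh column to hold it:

for(i in 2:ncol(testdata)) {
  x <- testdata[, 1]
  y <- testdata[, i]
  s <- summary(lm(y ~ x))  # compute the summary once per station
  Startframe[i, ] <- c(i,
                       s[['coefficients']]['(Intercept)', 'Estimate'],
                       s[['coefficients']]['x', 'Estimate'],
                       s[['coefficients']]['x', 'Std. Error'],
                       s[['coefficients']]['x', 't value'],
                       s[['coefficients']]['x', 'Pr(>|t|)'],
                       s[['r.squared']])  # r.squared is a plain numeric element of the summary
}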

Related

Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis-testing textbook at the moment. It explains that, for a linear model where the independent variables are two categorical variables with 2 and 3 levels respectively and the dependent variable is continuous, the coefficients should be interpreted as the difference between the overall mean of the dependent variable (the mean across all categories and levels) and the mean of the dependent variable within a given level of one of the categorical variables. I hope that's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)

# Transforming Data to a Tibble and Changing Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
  as_tibble() %>%
  mutate(dose = as.factor(dose))

# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)

# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#>  -5.25

# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
   summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#>    Mean
#>   <dbl>
#> 1  18.8

## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
   group_by(supp) %>%
   summarise(Mean = mean(len))) %>%
  filter(supp == "VC")
#> # A tibble: 1 x 2
#>   supp   Mean
#>   <fct> <dbl>
#> 1 VC     17.0

## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85

# Testing if supp_VC coefficient and the difference between Overall Mean and Supp VC Mean are near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#>  FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is the lm actually calculating regarding the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!
The lm() function uses dummy coding, so all the coefficients in your model are compared to the reference group's mean. The reference group here is the first level of each factor, so supp=OJ and dose=0.5.
You can then do this verification like so:
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] ==
  mean_table %>% filter(supp == 'VC' & dose == 0.5) %>% pull(M)

(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] +
   coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) ==
  mean_table %>% filter(supp == 'VC' & dose == 1) %>% pull(M)
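The mean_table object referenced above is not shown in the answer; one plausible construction (an assumption about its shape, with one cell mean M per supp/dose combination) would be:

library(dplyr)

# Hypothetical mean_table: the mean length M for every supp/dose cell
mean_table <- tooth_growth_reprex %>%
  group_by(supp, dose) %>%
  summarise(M = mean(len), .groups = "drop")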
You can read more about the differences between coding schemes here.

I have two different classes in one column. How to test normality of each of them?

A newbie in R.
Considering this is my situation (actually my real situation is much more complex):

set.seed(100)
df = data.frame(SEX = sample(c("M", "F"), 100, replace = TRUE),
                BW = rnorm(100, 80, 2))

One column is SEX (male and female), the other one is BW (body weight).
I want to test the normality of male body weight and of female body weight. Then I can test equality of variances for each. Finally, a t-test or another test suited to this situation.
But shapiro.test can't be used in this situation (like shapiro.test(BW ~ SEX, data = df)).
What should I do? I don't want to separate the data frame or make new subsets.
Thanks in advance~!
A "tidyverse" solution to this problem is described in detail here: Running a model on separate groups.
Briefly, using your data:
library(dplyr) # for mutate
library(tidyr) # for nest/unnest
library(purrr) # for map
library(broom) # for glance

df %>%
  nest(data = c(BW)) %>%
  mutate(model = map(data, ~ shapiro.test(.x$BW)),
         g = map(model, glance)) %>%
  unnest(g)
Result:
# A tibble: 2 x 6
  SEX   data           model   statistic p.value method
  <fct> <list<df[,1]>> <list>      <dbl>   <dbl> <chr>
1 F     [50 x 1]       <htest>     0.982   0.639 Shapiro-Wilk normality test
2 M     [50 x 1]       <htest>     0.980   0.535 Shapiro-Wilk normality test
Oh, I just figured it out by myself, using this code:
with(df, shapiro.test(BW[SEX == "M"]))
with(df, shapiro.test(BW[SEX == "F"]))
I am glad I can learn more!
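For the follow-up steps the question mentions (testing equality of variances, then a t-test), the same formula interface carries over in base R; a minimal sketch using the df defined above:

# F test for equality of the two groups' variances
var.test(BW ~ SEX, data = df)

# Two-sample t-test; set var.equal = TRUE only if the variances look equal
t.test(BW ~ SEX, data = df, var.equal = TRUE)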

Hundreds of linear regressions that run by group in R [closed]

I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. Doing each regression individually and recording the summary values would take hours, if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg <- lmList(logpop ~ avgp | id, data = data)
groupreg
A sample of the resulting list is below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
        (Intercept)          avgp
Adams 6   4.0073332            NA
Adams 7   6.5177389 -7.342443e+00
Adams 8   4.7449321            NA
Adams 9          NA            NA
This table does not include any significance statistics, however; I still need the p-value and r-squared statistic for each group. Code to do it all in one go for all group values, or code to just pull the remaining values, would be helpful.
Is there also a way to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression; I think you'll be able to apply the same principle to models from the lme4 package.
Let's create a toy data set, where we've measured the IQ score for three different groups at two different points in time.
library(tidyverse)
library(broom)

df <- tibble(
  id = seq_len(90),
  IQ = rnorm(90, 100, 15),
  group = rep(c("A", "B", "C"), each = 30),
  time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point of time, we only need five lines of code.
df %>%
  nest(-group) %>%
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results = map(fit, glance)) %>%
  unnest(results) %>%
  select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
  group r.squared p.value
  <chr>     <dbl>   <dbl>
1 A       0.0141    0.532
2 B       0.0681    0.164
3 C       0.00432   0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables id, IQ and time. Then mutate() adds a new column fit, in which you fit a regression model for each group, and a new column containing the results, which we unnest() shortly after to properly access the values glance() returned. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
  nest(-group) %>%
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results1 = map(fit, glance),
         results2 = map(fit, tidy)) %>%
  unnest(results1) %>%
  unnest(results2) %>%
  select(group, term, estimate, r.squared, p.value) %>%
  mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
  group term        estimate r.squared p.value
  <chr> <chr>          <dbl>     <dbl>   <dbl>
1 A     (Intercept) 3.34e+46   0.0141    0.532
2 A     timeT2      3.31e- 2   0.0141    0.532
3 B     (Intercept) 1.17e+47   0.0681    0.164
4 B     timeT2      1.34e- 3   0.0681    0.164
5 C     (Intercept) 8.68e+43   0.00432   0.730
6 C     timeT2      1.25e- 1   0.00432   0.730
Note that the estimates are already exponentiated. Without the exponentiation you can double-check the slope and p-value with base R by calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 that return p-values (at least for mixed models, which I have worked with).
A word of warning about using glance() with lme4 models: the maintainers of the broom package are trying a new concept where they outsource the summary statistics to the particular package developers responsible for each model type.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(disp:wt), funs(
    r.sqr = summary(lm(mpg ~ .))$r.squared,
    intercept = summary(lm(mpg ~ .))$coefficients[[1]],
    slope = summary(lm(mpg ~ .))$coefficients[[2]],
    p.value = summary(lm(mpg ~ .))$coefficients[[8]]
  ))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>%
  group_by(cyl) %>%
  summarise(
    r.sqr = summary(lm(mpg ~ wt))$r.squared,
    intercept = summary(lm(mpg ~ wt))$coefficients[[1]],
    slope = summary(lm(mpg ~ wt))$coefficients[[2]],
    p.value = summary(lm(mpg ~ wt))$coefficients[[8]]
  )
This actually runs the regression 4 times (once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extracts the info you want. The problem is that extracting values this way can be a pain:
df$model[[1]]$coefficients[[1]]
[1] 39.5712
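One way to ease that extraction is to loop over the stored summaries with sapply; a small sketch continuing from the df defined just above:

# Pull R-squared and the slope p-value out of each stored summary
data.frame(cyl     = df$cyl,
           r.sqr   = sapply(df$model, function(s) s$r.squared),
           p.value = sapply(df$model, function(s) s$coefficients[2, 4]))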
While the code given by AndS will work, it runs the lm function 4 times for each group, which makes it a bit inefficient. You can use the following; I am trying to break it into simpler steps.
Assuming your data frame (df) has three variables: "Group", "Dep", "Indep":
#Getting the unique list of groups
groups <- unique(df$Group)

#Creating a list to collect the model summary of each model
model_summaries <- list()

#Running the models
for(i in 1:length(groups)){
  model <- lm(Dep ~ Indep, df[df$Group == groups[i], c("Dep", "Indep")])
  model_summaries[[i]] <- summary(model)  # [[ ]] keeps each summary object intact
}
Each model summary contains the elements you need, such as r.squared and coefficients (which contains the p-values and the intercept too).
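For example, to pull those pieces out of the first group's summary (continuing from the loop above):

model_summaries[[1]]$r.squared           # R-squared of the first group's model
model_summaries[[1]]$coefficients[2, 4]  # p-value of the Indep slope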
Let me know if this helps.

A loop to create multiple data frames from a population data frame

Suppose I have a data frame called pop, and I wish to split this data frame by a categorical variable called replicate. This replicate consists of 110 categories, and I wish to perform analyses on each resulting data frame, then combine the outputs into a new data frame. In other words, for replicate i I wish to create data frame i, run a logistic regression on it, and save its beta 0. All the beta 0 values should then be combined into a table covering replicates 1-110.
I know that's a mouthful, but thanks in advance.
Since you didn't give any sample data I will use mtcars. You can use split to split a data.frame on a categorical value. Combining this with map and tidy from the purrr and broom packages, you can create a data frame with all the betas in one go.
So what happens is: 1) split the data.frame, 2) run the regression model, 3) tidy the data to get the coefficients out and create a data frame of them.
You will need to adjust this to your data.frame and replicate variable. broom can handle logistic regression, so everything should work out.
library(purrr)
library(broom)

my_lms <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dfr(~ tidy(.))
my_lms
         term  estimate std.error statistic      p.value
1 (Intercept) 39.571196 4.3465820  9.103980 7.771511e-06
2          wt -5.647025 1.8501185 -3.052251 1.374278e-02
3 (Intercept) 28.408845 4.1843688  6.789278 1.054844e-03
4          wt -2.780106 1.3349173 -2.082605 9.175766e-02
5 (Intercept) 23.868029 3.0054619  7.941551 4.052705e-06
6          wt -2.192438 0.7392393 -2.965803 1.179281e-02
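Since the question is about logistic regression, the same pipeline carries over with glm; a sketch under the question's assumed setup (a data frame pop with a hypothetical binary outcome y, a predictor x, and the replicate factor; adjust the names to your data):

library(dplyr)
library(purrr)
library(broom)

# Hypothetical columns: y (binary outcome), x (predictor), replicate (110 levels)
beta0_table <- pop %>%
  split(.$replicate) %>%
  map(~ glm(y ~ x, data = .x, family = binomial)) %>%
  map_dfr(tidy, .id = "replicate") %>%
  filter(term == "(Intercept)") %>%   # keep only beta 0 for each replicate
  select(replicate, beta0 = estimate)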
EDIT
my_lms <- lapply(split(mtcars, mtcars$cyl), function(x) lm(mpg ~ wt, data = x))
my_coefs <- as.data.frame(sapply(my_lms, coef))
my_coefs

                    4         6         8
(Intercept) 39.571196 28.408845 23.868029
wt          -5.647025 -2.780106 -2.192438

# Or transpose the coefficients if you want column results.
t(my_coefs)

  (Intercept)        wt
4    39.57120 -5.647025
6    28.40884 -2.780106
8    23.86803 -2.192438

Running several linear regressions from a single dataframe in R

I have a dataset of export trade data for a single country with 21 columns. The first column indicates the years (1962-2014) while the other 20 are trading partners. I am trying to run a linear regression for the years column against each of the other columns. I have tried the method recommended here: Running multiple, simple linear regressions from dataframe in R, which entails using
combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)
However, this only yields the intercept for each pair, which is less important to me than the slope of the regressions.
Additionally, I have tried to use my dataset as a time series. However, when I try to run
lm(dimnames ~ ., brazilts, na.action = na.exclude)
(where brazilts is my dataset as a time series from "1962" to "2014") it returns:
Error in model.frame.default(formula = dimnames ~ ., data = brazilts, :
  object is not a matrix
I therefore tried the same method with a matrix, but then it returned the error:
Error in model.frame.default(formula = . ~ YEAR, data = brazilmatrix, :
  'data' must be a data.frame, not a matrix or an array
(where brazilmatrix is my dataset as a data.matrix which includes a column for the years).
Really, I am not very proficient in R at this point. The ultimate goal is to create a loop that I can use to run regressions for a significantly larger dataset of gross exports by country pair per year for 28 countries. Perhaps I am attacking this in entirely the wrong way, so any help or criticism is welcome. Bear in mind that the years (1962-2014) are in effect my explanatory variable and the value of gross exports is my dependent variable, which may be throwing off my syntax in the above examples. Thanks in advance!
Just to add an alternative, I would propose going down this route:
library(reshape2)
library(dplyr)
library(broom)

df <- melt(data.frame(x = 1962:2014,
                      y1 = rnorm(53),
                      y2 = rnorm(53),
                      y3 = rnorm(53)),
           id.vars = "x")

df %>% group_by(variable) %>% do(tidy(lm(value ~ x, data = .)))
Here, I just melt the data so that all relevant columns become groups of rows, to be able to use dplyr's grouped actions. This gives the following data frame as output:
Source: local data frame [6 x 6]
Groups: variable [3]

  variable        term     estimate    std.error  statistic   p.value
    (fctr)       (chr)        (dbl)        (dbl)      (dbl)     (dbl)
1       y1 (Intercept) -3.646666114 18.988154862 -0.1920495 0.8484661
2       y1           x  0.001891627  0.009551103  0.1980533 0.8437907
3       y2 (Intercept) -8.939784046 16.206935047 -0.5516024 0.5836297
4       y2           x  0.004545156  0.008152140  0.5575415 0.5795966
5       y3 (Intercept) 21.699503502 16.785586452  1.2927462 0.2019249
6       y3           x -0.010879271  0.008443204 -1.2885240 0.2033785
This is a pretty convenient form to continue working with the coefficients. All that is required is to melt the dataframe so that all columns are rows in the dataset, and then to use dplyr's group_by to carry out the regression in all subsets. broom::tidy puts the regression output into a nice dataframe. See ?broom for more information.
In case you need to keep the models to do adjustments of some sort (which are implemented for lm objects), then you can also do the following:
df %>% group_by(variable) %>% do(mod = lm(value ~ x, data = .))

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
  variable      mod
*   <fctr>   <list>
1       y1 <S3: lm>
2       y2 <S3: lm>
3       y3 <S3: lm>
Here, for each variable, the lm object is stored in the dataframe. So, if you want to get the model output for the first, you can just access it as you would access any normal dataframe, e.g.
tmp <- df %>% group_by(variable) %>% do(mod = lm(value ~ x, data = .))
tmp[tmp$variable == "y1", ]$mod

[[1]]

Call:
lm(formula = value ~ x, data = .)

Coefficients:
(Intercept)            x
  -1.807255     0.001019
This is convenient if you want to apply some methods to all lm objects since you can use the fact that tmp$mod gives you a list of them, which makes it easy to pass to e.g. lapply.
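For instance, to collect the R-squared of every stored model in one vector (a small sketch continuing from tmp above):

# summary() each stored lm object and pull out its r.squared
sapply(tmp$mod, function(m) summary(m)$r.squared)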
Quite aside from the statistical justification for doing this, the programming problem is an interesting one. Here is a solution, but probably not the most elegant one. First, create a sample data set:
x  = c(1962:2014)
y1 = c(rnorm(53))
y2 = c(rnorm(53))
y3 = c(rnorm(53))
mydata = data.frame(x, y1, y2, y3)
attach(mydata)
head(mydata)

#     x         y1          y2         y3
#1 1962 -0.9884054 -1.68208217  0.5980446
#2 1963 -1.0741098  0.51309753  1.0986366
#3 1964  0.1357549 -0.23427820  0.1482258
#4 1965 -0.8846920 -0.60375400  0.7162992
#5 1966 -0.5529187  0.85573739  0.5541827
#6 1967  0.4881922 -0.09360152 -0.5379037
Next, use a for loop to run several regressions (note that this puts the year x on the left-hand side; if the years should be the predictor, as in your question, use mydata[, i] ~ x instead):
for(i in 2:4){
  reg = lm(x ~ mydata[, i])
  print(reg)
}
Call:
lm(formula = x ~ mydata[, i])

Coefficients:
(Intercept)  mydata[, i]
  1988.0088      -0.1341

Call:
lm(formula = x ~ mydata[, i])

Coefficients:
(Intercept)  mydata[, i]
    1987.87         2.07

Call:
lm(formula = x ~ mydata[, i])

Coefficients:
(Intercept)  mydata[, i]
   1987.304       -4.101
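To keep the results instead of just printing them, the loop can fill a list first; a sketch under the same mydata, with the formula flipped so that the year is the predictor (an assumption matching the question's setup):

# Store each fit, then collect the slopes afterwards
fits <- list()
for(i in 2:4){
  fits[[names(mydata)[i]]] <- lm(mydata[, i] ~ x, data = mydata)
}
slopes <- sapply(fits, function(m) coef(m)[2])
slopes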
