Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis-testing textbook at the moment. It explains that the coefficients from a linear model, where the independent variables are two categorical variables with 2 and 3 levels respectively and the dependent variable is continuous, should be interpreted as the difference between the overall mean of the dependent variable (the mean across all groups) and the mean of the dependent variable within a given level of one of the categorical variables. I hope that is understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, standard errors, t-values or p-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
  as_tibble() %>%
  mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
# note: the trailing filter() only affects what is printed; supp_vc_summary
# itself keeps both rows (OJ and VC), hence Mean[2] below
(supp_vc_summary <- tooth_growth_reprex %>%
   group_by(supp) %>%
   summarise(Mean = mean(len))) %>%
  filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing whether the suppVC coefficient and the difference between the overall mean and the supp VC mean are nearly identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is lm() actually calculating for the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!

The lm() function uses dummy coding, so all the coefficients in your model are compared to the reference group's mean. The reference group here is defined by the first level of each factor, so supp = OJ and dose = 0.5.
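The mean_table used in the verification below is not defined in the answer; presumably it is the table of cell means of len for every supp/dose combination, reconstructed here as an assumption rather than the answerer's original code:
# assumed definition of mean_table: cell means of len per supp/dose combination
mean_table <- tooth_growth_reprex %>%
  group_by(supp, dose) %>%
  summarise(M = mean(len), .groups = "drop")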
You can then do this verification like so:
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] ==
  mean_table %>% filter(supp == 'VC' & dose == 0.5) %>% pull(M)
(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) ==
  mean_table %>% filter(supp == 'VC' & dose == 1) %>% pull(M)
You can read into the differences here
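Not part of the original answer, but if you want coefficients expressed as deviations from the grand mean (the textbook's parameterization), one option is to refit with sum-to-zero (effects) coding. With a balanced design like ToothGrowth the intercept then equals the overall mean, and supp1 is the OJ marginal mean minus the grand mean, so the VC deviation you computed is simply its negative:
# sketch: effects (sum-to-zero) coding instead of the default dummy coding
tg_lm_sum <- lm(len ~ supp * dose,
                data = tooth_growth_reprex,
                contrasts = list(supp = contr.sum, dose = contr.sum))
coef(tg_lm_sum)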


Correlation between groups and ranks over different samples with R

I'm studying some scores on DNA for which each position has a score.
I would like to find a method to determine whether some samples tend to have higher scores than others, not in general, but position by position.
Some positions are not defined in all samples, and some samples don't have a score for a given position.
data.frame('pos'=c(1,2,3,1,2,3,1,2,5), 'sample'=c('A','A','A','B','B','B','C','C','C'), 'score'=c(1,10,5,20,40,10,0.1,5,4))
I'd like to use a Spearman correlation (I'm looking at rankings because there is no real biological reason to compare, for instance, position 1 and position 2) to determine whether some samples are more likely to have the "top" scoring values.
My difficulty is that I actually have two qualitative variables, the sample ID and the position, and only one quantitative variable. I can't figure out how to tell R that I want to group the data by position, rank within each position, and then study the correlation of those rankings.
Finally, I'd like a Spearman correlation score showing that, in this dataset, sample B is the top scorer at most positions.
Any idea on how to achieve that?
Thanks a lot!
Maybe this points in a helpful direction.
library(tidyverse)
df = data.frame('pos' = c(1,2,3,1,2,3,1,2,3), # Using 3 as the last position
                'sample' = c('A','A','A','B','B','B','C','C','C'),
                'score' = c(1,10,5,20,40,10,0.1,5,4))
# Compute rank of each sample within each position
ranked = df %>%
  group_by(pos) %>%
  mutate(rank = rank(score, ties.method = 'min')) %>%
  ungroup()
# B seems to consistently score higher
ggplot(ranked, aes(pos, rank, color=sample)) +
geom_point(size=5)
# Kruskal-Wallis rank sum test of the null hypothesis that the rankings
# are from the same distribution for all samples.
kruskal.test(ranked$rank, ranked$sample)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: ranked$rank and ranked$sample
#> Kruskal-Wallis chi-squared = 8, df = 2, p-value = 0.01832
# Pairwise Wilcoxon test for B vs C
df %>% filter(sample!='A') %>%
group_by(pos) %>%
mutate(rank=rank(score, ties.method='min')) %>%
ungroup() %>%
pivot_wider(id_cols='pos', names_from='sample', values_from='rank') %>%
{wilcox.test(.$B, .$C, paired=TRUE)}
#> Warning in wilcox.test.default(.$B, .$C, paired = TRUE): cannot compute exact p-
#> value with ties
#>
#> Wilcoxon signed rank test with continuity correction
#>
#> data: .$B and .$C
#> V = 6, p-value = 0.1489
#> alternative hypothesis: true location shift is not equal to 0
If the scores are all drawn from the same distribution, I think you could do these same tests on the scores directly, without ranking.
Created on 2020-01-10 by the reprex package (v0.3.0)
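To illustrate that last sentence (a small sketch, reusing the df defined above), the same test on the raw scores would be:
# Kruskal-Wallis on the raw scores instead of the within-position ranks
kruskal.test(score ~ sample, data = df)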

Save R-squared from lm summary as a dataframe

I want to save the result of a lm model into a dataframe. I generated an empty dataframe (Startframe), where I want to save the results.
My dataframe containing the data is called testdata in this case. It contains the Date in the first column and then several Stations in the rest of the columns.
So far this code is working to get the Estimate, Std. Error, t value and Pr(>|t|).
for(i in 2:ncol(testdata)) {
  x <- testdata[, 1]
  y <- testdata[, i]
  mod <- lm(y ~ x)
  summary(mod)
  Startframe[i, ] <- c(i,
                       summary(mod)[['coefficients']]['(Intercept)', 'Estimate'],
                       summary(mod)[['coefficients']]['x', 'Estimate'],
                       summary(mod)[['coefficients']]['x', 'Std. Error'],
                       summary(mod)[['coefficients']]['x', 't value'],
                       summary(mod)[['coefficients']]['x', 'Pr(>|t|)'])
}
But how can I also extract the r.squared?
I tried to add summary(mod)[['r.squared']] to the list, but it gives me the wrong numbers.
I know str(summary(mod)) gives me an overview, but I can't figure out how to add it into my loop.
Thanks for your help.
A nice way to work with the same model on different datasets is the tidyverse approach using the broom package.
In this example I'm using the diamonds dataset to test how carat and depth affect the diamonds' price in the different diamond cuts.
require(tidyverse)
require(broom)
diamonds %>%
  nest(-cut) %>%
  mutate(model = purrr::map(data, function(x) {
           lm(price ~ carat + depth, data = x)
         }),
         values = purrr::map(model, glance),
         r.squared = purrr::map_dbl(values, "r.squared"),
         pvalue = purrr::map_dbl(values, "p.value")) %>%
  select(-data, -model, -values)
cut r.squared pvalue
<ord> <dbl> <dbl>
1 Ideal 0.867 0
2 Premium 0.856 0
3 Good 0.851 0
4 Very Good 0.859 0
5 Fair 0.746 0
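If you would rather keep the original loop, note that the R-squared is not part of the coefficients matrix but is its own element of the summary object. A sketch of the amended loop body (assuming Startframe has one extra column for it):
# inside the loop, after mod <- lm(y ~ x)
s <- summary(mod)
Startframe[i, ] <- c(i,
                     s$coefficients['(Intercept)', 'Estimate'],
                     s$coefficients['x', 'Estimate'],
                     s$coefficients['x', 'Std. Error'],
                     s$coefficients['x', 't value'],
                     s$coefficients['x', 'Pr(>|t|)'],
                     s$r.squared)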

Keeping your statistician happy: Stata vs. R Student's t-test

Chapter 1: mean age by gender
I work a lot with epidemiologists and statisticians who have very specific requirements for their statistical output, and I frequently fail to reproduce exactly the same thing in R (our epidemiologist works in Stata).
Let's start with an easy example, a Student's t-test. What we are interested in is the difference in mean age at first diagnosis and a confidence interval.
a) create some sample data in R
set.seed(41)
cohort <- data.frame(
  id = seq(1, 100),
  gender = sample(c(rep(1, 33), rep(2, 67)), 100),
  age = sample(seq(0, 50), 100, replace = TRUE)
)
# save to import into Stata
# write.csv(cohort, "cohort.csv", row.names = FALSE)
b) import data and run t-test in Stata
import delimited "cohort.csv"
ttest age, by(gender)
What we want is the absolute difference in means = 3.67 years and the combined confidence interval = 95% CI: 24.59 - 30.57
c) run t-test in R
t.test(age~gender, data=cohort)
t.test(cohort$age[cohort$gender == 1])
t.test(cohort$age[cohort$gender == 2])
t.test(cohort$age)
There surely must be another way instead of running 4 t-tests in R!
You can try to put everything in one function with some tidyverse magic. The output can of course be edited to your needs. broom's tidy() will be used for nice output.
foo <- function(df, x, y){
  require(tidyverse)
  require(broom)
  a1 <- df %>%
    select(ep = !!x, gr = !!y) %>%
    mutate(gr = as.character(gr)) %>%
    bind_rows(mutate(., gr = "ALL")) %>%
    split(.$gr) %>%
    map(~ tidy(t.test(.$ep))) %>%
    bind_rows(., .id = "gr") %>%
    mutate_if(is.factor, as.character)
  tidy(t.test(as.formula(paste(x, " ~ ", y)), data = df)) %>%
    mutate_if(is.factor, as.character) %>%
    mutate(gr = "vs") %>%
    select(gr, estimate, statistic, p.value, parameter, conf.low, conf.high, method, alternative) %>%
    bind_rows(a1, .)
}
foo(cohort, "age", "gender")
gr estimate statistic p.value parameter conf.low conf.high method alternative
1 1 25.121212 9.545737 6.982763e-11 32.00000 19.76068 30.481745 One Sample t-test two.sided
2 2 28.791045 15.699854 5.700541e-24 66.00000 25.12966 32.452428 One Sample t-test two.sided
3 ALL 27.580000 18.301678 1.543834e-33 99.00000 24.58985 30.570147 One Sample t-test two.sided
4 vs -3.669833 -1.144108 2.568817e-01 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
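If you only need the two numbers from the question, they can be pulled from that output (a sketch; res is just a name I'm using for the returned tibble):
res <- foo(cohort, "age", "gender")
abs(res$estimate[res$gr == "vs"])                 # absolute mean difference, 3.669833
res[res$gr == "ALL", c("conf.low", "conf.high")]  # combined 95% CI, 24.59 to 30.57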
I recommend starting from the beginning using this:
foo <- function(df){
a1 <- broom::tidy(t.test(age~gender, data=df))
a2 <- broom::tidy(t.test(df$age))
a3 <- broom::tidy(t.test(df$age[df$gender == 1]))
a4 <- broom::tidy(t.test(df$age[df$gender == 2]))
list(rbind(a2, a3, a4), a1)
}
foo(cohort)
[[1]]
estimate statistic p.value parameter conf.low conf.high method alternative
1 27.58000 18.301678 1.543834e-33 99 24.58985 30.57015 One Sample t-test two.sided
2 25.12121 9.545737 6.982763e-11 32 19.76068 30.48174 One Sample t-test two.sided
3 28.79104 15.699854 5.700541e-24 66 25.12966 32.45243 One Sample t-test two.sided
[[2]]
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1 -3.669833 25.12121 28.79104 -1.144108 0.2568817 63.37702 -10.07895 2.739284 Welch Two Sample t-test two.sided
You can make your own function:
tlimits <- function(data, group){
  error <- qt(0.975, df = length(data) - 1) * sd(data) / sqrt(length(data))
  mean <- mean(data)
  means <- tapply(data, group, mean)
  c(abs(means[1] - means[2]), mean - error, mean + error)
}
tlimits(cohort$age, cohort$gender)
1
3.669833 24.589853 30.570147
What we want is the absolute difference in means = 3.67 years and the combined confidence interval = 95% CI: 24.59 - 30.57
Notice that R's t.test does a t-test, whereas you want a mean difference and "combined confidence intervals" (which is CI around the mean ignoring the grouping variable). So you don't want a t-test but something else.
You can get the mean difference using, e.g.:
diff(with(cohort, tapply(age, gender, mean)))
# 3.669833
# no point in using something more complicated e.g., t-test or lm
... and the CI using, e.g.:
confint(lm(age~1, data=cohort))
# 2.5 % 97.5 %
# (Intercept) 24.58985 30.57015
And obviously, you can easily combine the two steps into one function if you need it often.
doit <- function(a,b) c(diff= diff(tapply(a,b,mean)), CI=confint(lm(a~1)))
with(cohort, doit(age,gender))
# diff.2 CI1 CI2
# 3.669833 24.589853 30.570147
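For what it's worth, the "combined" CI is just the one-sample CI for age ignoring gender, so it can also be read off t.test() directly:
# same combined 95% CI as confint(lm(age ~ 1, data = cohort)) above
t.test(cohort$age)$conf.int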

Hundreds of linear regressions that run by group in R [closed]

I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. Doing each regression individually and recording the summary variables would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
A sample of the resulting list is below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however; I still need the p-value and r-squared statistic for each group. Code that does it all in one go for all group values, or code that just pulls the remaining values, would be helpful.
Is there also a way to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression; I think you'll be able to apply the same principle to models from the lme4 package.
Let's create a toy data set where we've measured the IQ score for three different groups at two different points in time.
library(tidyverse)
library(broom)
df <- tibble(
  id = seq_len(90),
  IQ = rnorm(90, 100, 15),
  group = rep(c("A", "B", "C"), each = 30),
  time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point in time, we only need five lines of code.
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results = map(fit, glance)) %>%
unnest(results) %>%
select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables of id, IQ and time. Then you add a new column fit with mutate() where you apply a regression model for each group and a new column containing the results, which we unnest() shortly after to access the values glance() returned properly. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results1 = map(fit, glance),
results2 = map(fit, tidy)) %>%
unnest(results1) %>%
unnest(results2) %>%
select(group, term, estimate, r.squared, p.value) %>%
mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 that return p-values (at least for mixed models, which I have already worked with).
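A minimal sketch of that (using lme4's built-in sleepstudy data, since it is not part of the example above):
library(lmerTest)  # masks lme4::lmer with a version whose summary() adds p-values
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)$coefficients  # now includes df and Pr(>|t|) columns (Satterthwaite)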
A word of warning about using glance() for lme4 models: the maintainers of the broom package are trying out a new concept in which the summary statistics are outsourced to the developers of the particular modelling package responsible for the model.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
summarise_at(vars(disp:wt), funs(
r.sqr = summary(lm(mpg~.))$r.squared,
intercept = summary(lm(mpg~.))$coefficients[[1]],
slope = summary(lm(mpg~.))$coefficients[[2]],
p.value = summary(lm(mpg~.))$coefficients[[8]]
))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
summarise(
r.sqr = summary(lm(mpg~wt))$r.squared,
intercept = summary(lm(mpg~wt))$coefficients[[1]],
slope = summary(lm(mpg~wt))$coefficients[[2]],
p.value = summary(lm(mpg~wt))$coefficients[[8]]
)
This is actually running the regression 4 times (once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extracts the info you want. The problem is that extracting values this way can be a pain:
df$model[[1]]$coefficients[[1]]
[1] 39.5712
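One way to take some of the pain out of that extraction is purrr on the list column created above (a sketch, not part of the original answer):
library(purrr)
df %>%
  mutate(r.sqr   = map_dbl(model, "r.squared"),
         slope   = map_dbl(model, ~ .x$coefficients[2, 1]),
         p.value = map_dbl(model, ~ .x$coefficients[2, 4])) %>%
  select(-model)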
While the code given by AndS will work, it runs the lm function four times for each group, which makes it a bit inefficient. You can use the following; I am trying to break it into simpler steps.
Assuming your data frame (df) has three variables, "Group", "Dep", and "Indep":
# Getting the unique list of groups
groups <- unique(df$Group)
# Creating a list to collect the model summary of each model
model_summaries <- list()
# Running the models
for(i in 1:length(groups)){
  model <- lm(Dep ~ Indep, df[df$Group == groups[i], c("Dep", "Indep")])
  model_summaries[[i]] <- summary(model)
}
Each model summary contains, among other elements, the R-squared (r.squared) and the coefficients matrix (which includes the intercept and the p-values too).
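For example, the pieces could then be pulled out of each summary like this (a small sketch reusing the list built above):
# R-squared and slope p-value for the first group
model_summaries[[1]]$r.squared
model_summaries[[1]]$coefficients["Indep", "Pr(>|t|)"]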
Let me know if this helps.

How to estimate residuals of subgroup with lme4 in R

I'd like to reproduce the results reported in Hoffman & Rovine's work (Multilevel models for the experimental psychologist: foundations and illustrative examples) with the lme4 package in R.
In their first example they compared reaction time between older and younger people. Each of their participants completed many task trials, so at the individual level participants' reaction times were affected by various variables related to the manipulation of the trials. At the second level, participants' age and age group would affect their reaction time.
In Hoffman's model 2B, they estimate first-level residuals for older and younger people separately, using two dummy variables for the young and old groups.
Hoffman's level-1 equation is shown as an image in the original post.
I'd like to know how to estimate two residuals in lme4 package.
Hoffman's article and example data can be found on Hoffman's website.
I've successfully replicated their result for model 2A, where the variances of young and old people are assumed to be the same, with the following code.
lmer(lg_rt ~ c_mean + c_sal + (1|Item) + oldage + yrs65 + (1|id), Ex1, REML = F)
You can handle heteroscedasticity in lme4 using the modular fitting functions. Here is an example with two groups, which should be extendable to other types of heteroscedasticity. Note that although the weights are estimated, the uncertainty about the weights is not taken into account in the standard errors of the parameters in the final fit. This issue should be possible to solve using the delta method, see e.g. the first equation in Section 2.3.3 of https://doi.org/10.3102/1076998611417628.
set.seed(1234)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(lme4)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
n <- 100 # number of level-2 units
m <- 3 # number of repeated observations per unit
sd_b <- .3 # random intercept standard deviation
sd_eps1 <- .1 # residual standard deviation in group 1
sd_eps2 <- .3 # residual standard deviation in group 2
# Simulate data
dat <- tibble(
# unique ID
id = seq_len(n),
# explanatory variable, constant over repetitions
x = runif(n),
# random intercept
b = rnorm(n, sd = sd_b),
# group membership
grp = sample(1:2, n, replace = TRUE)
) %>%
uncount(3) %>%
mutate(
# residual
eps = rnorm(nrow(.), sd = c(sd_eps1, sd_eps2)[grp]),
# response, fixed effect is beta=1
y = x + b + eps
)
# now optimize over residual weights, fixing the group 1 weight to 1.
# optimize() would be sufficient, but I show it with optim() because it
# then can be directly extended to a larger number of groups
opt <- optim(
# initial value for group 2 residual relative to group 1
par = 1,
fn = function(weight){
# Compute weights from group variable
df <- dat %>%
mutate(weight = c(1, weight)[grp])
## 1. Parse the data and formula:
lmod <- lFormula(y ~ x + (1|id), data = df, weights = df$weight)
## 2. Create the deviance function to be optimized:
devfun <- do.call(mkLmerDevfun, lmod)
## 3. Optimize the deviance function:
opt <- optimizeLmer(devfun)
# return the deviance
opt$fval
},
# Use a method that allows box constraints
method = "L-BFGS-B",
# Weight cannot be negative
lower = 0.01
)
# The weight estimates the following ratio, and it is pretty close
sd_eps1^2/sd_eps2^2
#> [1] 0.1111111
opt$par
#> [1] 0.1035914
# We can now fit the final model at the chosen weights
df <- dat %>%
mutate(weight = c(1, opt$par)[grp])
mod <- lmer(y ~ x + (1|id), data = df, weights = df$weight)
# Our estimate of sd_eps1
sigma(mod)
#> [1] 0.09899687
# True value
sd_eps1
#> [1] 0.1
# Our estimate of sd_eps2
sigma(mod) * sqrt(1/opt$par)
#> [1] 0.307581
# True value
sd_eps2
#> [1] 0.3
Created on 2021-02-10 by the reprex package (v1.0.0)
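The extension to more than two groups mentioned above could look roughly like this (a hypothetical sketch: it assumes a data set with the same structure as dat but with a grp column taking values 1..k, which the two-group simulation above does not have):
# optimize k - 1 relative residual weights, with group 1 fixed at weight 1
k <- 3
opt_k <- optim(
  par = rep(1, k - 1),
  fn = function(weights) {
    df <- dat %>% mutate(weight = c(1, weights)[grp])
    lmod <- lFormula(y ~ x + (1 | id), data = df, weights = df$weight)
    devfun <- do.call(mkLmerDevfun, lmod)
    optimizeLmer(devfun)$fval
  },
  method = "L-BFGS-B",
  lower = rep(0.01, k - 1)
)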
