I have a data frame of values from individuals linked to groups. I want to identify those groups who have mean values greater than the mean value plus one standard deviation for the whole data set. To do this, I'm calculating the mean value and standard deviation for the entire data frame and then running pairwise t-tests to compare to each group mean. I'm running into trouble outputting the results.
> head(df)
individual group value
1 11559638 75 0.371
2 11559641 75 0.367
3 11559648 75 0.410
4 11559650 75 0.417
5 11559652 75 0.440
6 11559654 75 0.395
> allvalues <- data.frame(mean=rep(mean(df$value), length(df$individual)), sd=rep(sd(df$value), length(df$individual)))
> valueplus <- with(df, by(df, df$individual, function(x) t.test(allvalues$mean + allvalues$sd, df$value, data=x)))
> tmpplus
--------------------------------------------------------------------------
df$individuals: 10
NULL
--------------------------------------------------------------------------
df$individuals: 20
NULL
--------------------------------------------------------------------------
df$individuals: 21
Welch Two Sample t-test
data: allvalues$mean + allvalues$sd and df$value
t = 84.5217, df = 4999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.04676957 0.04899068
sample estimates:
mean of x mean of y
0.4719964 0.4241162
How do I get the results into a data frame? I'd expect the output to look something like this:
groups t df p-value mean.x mean.y
1 10 NULL NULL NULL NULL NULL
2 20 NULL NULL NULL NULL NULL
3 21 84.5217 4999 2.2e-16 0.4719964 0.4241162
From a purely programming perspective, you are asking how to get the output of t.test into a data.frame. Try the following, using mtcars:
library(broom)
tidy(t.test(mtcars$mpg))
estimate statistic p.value parameter conf.low conf.high
1 20.09062 18.85693 1.526151e-18 31 17.91768 22.26357
Or for multiple groups:
library(dplyr)
mtcars %>% group_by(vs) %>% do(tidy(t.test(.$mpg)))
# A tibble: 2 x 9
# Groups: vs [2]
vs estimate statistic p.value parameter conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0 16.6 18.3 1.32e-12 17 14.7 18.5 One Sample t-test two.sided
2 1 24.6 17.1 2.75e-10 13 21.5 27.7 One Sample t-test two.sided
Needless to say, you'll need to adjust the code to fit your specific setting.
Related
I borrow a dataset from SPSS prepared by Julie Pallant's SPSS Survival Manual and run it on R.
I select three columns to run correlation and significance test: toptim, tnegaff, sex. I select the columns using select: df <- survey %>% select(toptim, tnegaff, sex).
Then, problems emerge.
I'd like to know the correlation between toptim and tnegaff by sex. But I can't use cor and resort to correlate. Why is there error and any difference between the two methods?
df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff)) <- OK (male = 0.22 female = 0.394)
df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff)) <- failed, returns with NA
I failed to obtain the test of significance with cor.test (The answer should be p = 0.0488)
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.
Then I try to follow past examples and use broom::tidy, but no output for p-values....
> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
column n mean sd median trimmed mad min max range skew kurtosis se
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 toptim 435 22.1 4.43 22 22.3 3 7 30 23 NA NA 0.212
2 tnegaff 435 19.4 7.07 18 18.6 4 10 39 29 NA NA 0.339
3 sex 439 1.58 0.494 2 1.58 0 1 2 1 -0.318 1.10 0.0236
How can I get the result? May I know the reason for such failure?
Thank you for your answers in advance.
It's trying to use all values and coming across NAs I presume. If you set to use "complete.obs" then it should work. For the cor.test part wrap the output in a list function to use the tibble's capabilities to have a column of a vector of objects.
For the final tidying and getting p-values, use map(cor.test, broom::tidy) then tidyr::unnest() to get a full and tidy dataframe.
That's a few steps to go through but hope it helps!
df <- haven::read_sav("survey.sav")
library(tidyverse)
df %>%
group_by(sex) %>%
summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
cor.test = list(cor.test(toptim, tnegaff))) %>%
mutate(tidy_out = map(cor.test, broom::tidy)) %>%
unnest(tidy_out)
#> # A tibble: 2 × 11
#> sex cor cor.t…¹ estim…² stati…³ p.value param…⁴ conf.…⁵ conf.…⁶ method
#> <dbl+l> <dbl> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 [MAL… -0.220 <htest> -0.220 -3.04 2.73e- 3 182 -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest> -0.394 -6.75 1.06e-10 248 -0.494 -0.284 Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> # ¹cor.test, ²estimate, ³statistic, ⁴parameter, ⁵conf.low, ⁶conf.high
Edit - examining difference in correlation
Borrowing the function from here you can examine the difference in correlation coefficients between sexes like this:
cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])
I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
I'm looking at age- and sex-adjusted group differences in a continuous variable of interest. As done in other studies in my field, I want to calculate Cohen's d based on contrasts extracted from a multiple linear regression model.
The original formula (Nakagawa & Cuthill, 2007) is as follows:
n1 = sample size in Group 1
n2 = sample size in Group 2
df' = degrees of freedom used for a corresponding t value in a linear model
t = t-statistic corresponding to the contrast of interest
So far I've attempted to apply this in R, but the results are looking strange (much larger effect sizes than expected).
Here's some simulated data:
library(broom)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006,1007, 1008, 1009, 1010),
Group = as.numeric(c('0','1','0','0','1','1','0','1','0','1')),
age = as.numeric(c('23','28','30','15','7','18','29','27','14','22')),
sex = as.numeric(c('1','0','1','0','0','1','1','0','0','1')),
test_score = as.numeric(c('18','20','19','15','20','23','19','25','10','14')))
# run lm and extract regression coefficients
model <- lm(test_score ~ Group + age + sex, data = df)
tidy_model <- tidy(model)
tidy_model
# A tibble: 4 x 5
#term estimate std.error statistic p.value
#<chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 11.1 4.41 2.52 0.0451
# 2 Group 4.63 2.65 1.75 0.131
# 3 age 0.225 0.198 1.13 0.300
# 4 sex 0.131 2.91 0.0452 0.965
t_statistic <- tidy_model[2,4] # = 1.76
n <- 5 #(equal n of participants in Group1 as in Group2)
cohens_d <- t_statistic*(n + n)/(sqrt(n * n) * sqrt(1)) # 1 dof for 1 estimated parameter (group contrast)
cohens_d # = 3.518096
Could you please flag up where I'm going wrong?
You have set the degrees of freedom to 1. However, you actually have 6 degrees of freedom which you can see if you type: summary(model).
If you set your degrees of freedom to 6 your Cohen's d will be ~1.7 which should be more inline with what you expect.
I am trying to write a .csv file that appends the important information from the summary of a glmer analysis (from the package lme4).
I have been able to isolate the coefficients, AIC, and random effects , but I have not been able to isolate the scaled residuals (Min, 1Q, Median, 3Q, Max).
I have tried using $residuals, but I get a very long output, not the information shown in the summary.
> library(lme4)
> setwd("C:/Users/Arthur Scully/Dropbox/! ! ! ! PHD/Chapter 2 Lynx Bobcat BC/ResourceSelection")
> #simple vectors
>
> x <- c("a","b","b","b","b","d","b","c","c","a")
>
> y <- c(1,1,0,1,0,1,1,1,1,0)
>
>
> # Simple data frame
>
> aes.samp <- data.frame(x,y)
> aes.samp
x y
1 a 1
2 b 1
3 b 0
4 b 1
5 b 0
6 d 1
7 b 1
8 c 1
9 c 1
10 a 0
>
> # Simple glmer
>
> aes.glmer <- glmer(y~(1|x),aes.samp,family ="binomial")
boundary (singular) fit: see ?isSingular
>
> summary(aes.glmer)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ (1 | x)
Data: aes.samp
AIC BIC logLik deviance df.resid
16.2 16.8 -6.1 12.2 8
I can isolate information above by using the call summary(aes.glmer)$AIC
Scaled residuals:
Min 1Q Median 3Q Max
-1.5275 -0.9820 0.6546 0.6546 0.6546
I do not know the call to isolate the above information
Random effects:
Groups Name Variance Std.Dev.
x (Intercept) 0 0
Number of obs: 10, groups: x, 4
I can isolate this information using the ranef function
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.6901 1.228 0.22
And I can isolate the information above using summary(aes.glmer)$coefficient
convergence code: 0
boundary (singular) fit: see ?isSingular
>
> #Pull important
> ##write call to select important output
> aes.glmer.coef <- summary(aes.glmer)$coefficient
> aes.glmer.AIC <- summary(aes.glmer)$AIC
> aes.glmer.ran <-ranef(aes.glmer)
>
> ##
> data.frame(c(aes.glmer.coef, aes.glmer.AIC, aes.glmer.ran))
X0.847297859077025 X0.690065555425105 X1.22785125618255 X0.219502810378876 AIC BIC logLik deviance df.resid X.Intercept.
a 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
b 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
c 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
d 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
If anyone knows what call I can use to isolate the "scaled residuals" I would be very greatful.
I haven't got your data, so we'll use example data from the lme4 vignette.
library(lme4)
library(lattice)
library(broom)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
This is for the residuals. tidy from the broom package puts it in to a tibble, which you can then export to a csv.
x <- tidy(quantile(residuals(gm1, "pearson", scaled = TRUE)))
x
# A tibble: 5 x 2
names x
<chr> <dbl>
1 0% -2.38
2 25% -0.789
3 50% -0.203
4 75% 0.514
5 100% 2.88
Also here are some of the other bits that you might find useful, using glance from broom.
y <- glance(gm1)
y
# A tibble: 1 x 6
sigma logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 -92.0 194. 204. 73.5 51
And
z <- tidy(gm1)
z
# A tibble: 5 x 6
term estimate std.error statistic p.value group
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) -1.40 0.231 -6.05 1.47e-9 fixed
2 period2 -0.992 0.303 -3.27 1.07e-3 fixed
3 period3 -1.13 0.323 -3.49 4.74e-4 fixed
4 period4 -1.58 0.422 -3.74 1.82e-4 fixed
5 sd_(Intercept).herd 0.642 NA NA NA herd
I am new to R . I've spent hours trying to figure this out and searching Google and SO, but can't seem to find anything that's exactly what I'm looking for. Hopefully you can help?
I have a data set that looks like:
Site(factor) Species Date Mass GDD
1 cockerelli 0017-03-14 2.73 252.1
2 doddsii 0017-01-12 3.73 583.4
4 cockerelli 0017-03-14 2.71 385.4
4 doddsii 0018-05-16 2.22 783.2
1 infrequens 0018-05-16 2.89 583.0
etc.
I split() my data frame into a list of data frames that I can then pass into apply() functions.
splitdata = split(data, paste(data$Species,data$Site))
However, when I use code such as:
grmodel = lapply(splitdata, function(x){
grmodel = aov(x$Mass~x$GDD)
print(summary(grmodel))
})
I get a huge list of ANOVA summaries (like those below), but I don't know which species and site they belong to.
Df Sum Sq Mean Sq F value Pr(>F)
x$GDD 1 0.000022 0.0000216 0.044 0.838
Residuals 9 0.004396 0.0004884
1 observation deleted due to missingness
Df Sum Sq Mean Sq F value Pr(>F)
x$GDD 1 0.0002526 0.0002526 0.65 0.451
Residuals 6 0.0023319 0.0003887
1 observation deleted due to missingness
I was wondering if anyone knows how to alter the code to tell me which species and site the ANOVA table belong to? I've found some answers that talk about paste() and other functions, but nothing I've tried works.
Thanks so much in advance!
As far as I can tell the names should be visible, I'm not sure what you're seeing exactly but maybe a reprex would be useful.
You could also try to use tidy::broom to see clearer:
lapply(split(iris,iris$Species),
function(x) aov(Petal.Length ~ Petal.Width,x))
# $`setosa`
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 0.1625262 1.3152738
# Deg. of Freedom 1 48
#
# Residual standard error: 0.1655341
# Estimated effects may be unbalanced
#
# $versicolor
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 6.695921 4.124079
# Deg. of Freedom 1 48
#
# Residual standard error: 0.2931183
# Estimated effects may be unbalanced
#
# $virginica
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 1.548503 13.376297
# Deg. of Freedom 1 48
#
# Residual standard error: 0.5278947
# Estimated effects may be unbalanced
With tidy::broom :
lapply(split(iris,iris$Species),
function(x) aov(Petal.Length ~ Petal.Width,x) %>% broom::tidy())
# $`setosa`
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 0.1625262 0.16252620 5.93128 0.01863892
# 2 Residuals 48 1.3152738 0.02740154 NA NA
#
# $versicolor
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 6.695921 6.69592109 77.93357 1.271916e-11
# 2 Residuals 48 4.124079 0.08591831 NA NA
#
# $virginica
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 1.548503 1.5485033 5.556707 0.02253577
# 2 Residuals 48 13.376297 0.2786728 NA NA
The names of the result of split are the values of the second argument coerced to character-class and lapply preserves these names, so you should not need to add back any names, but rather just look at:
names(grmodel)
Perhaps you want to do this for output:
for( i in names(grmodel) ){ cat(i);
cat( : : :\n");
print(grmodel[[i]]);
cat("\n\n")}
.... which just prints the names and a bit of spacing for each item in the grmodels list.