I'd like to group multiple t-test results into one table. Originally, my code looked like this:
tt_data <- iris %>%
  group_by(Species) %>%
  summarise(p = t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)$p.value,
            estimate = t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)$estimate)
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, based on the idea that I should only perform the statistical test once, is there a way to run the t-test once per group and still collect the intended table? I think some combination of broom and purrr can do this, but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
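You can use broom::tidy() to transform the result of t.test() into a tidy tibble, running the test once per group inside group_modify():
library(dplyr)
library(broom)
iris %>%
  group_by(Species) %>%
  group_modify(~ {
    t.test(.$Sepal.Length, .$Petal.Length, alternative = "two.sided", paired = TRUE) %>%
      tidy()
  }) %>%
  select(estimate, p.value)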
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)
You can also store the tidied t.test() result (via broom::tidy) in a list column, use purrr::map to keep only the desired values, and then unnest, i.e.
iris %>%
  group_by(Species) %>%
  summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)))) %>%
  mutate(p.value = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
  select(-p) %>%
  tidyr::unnest(p.value)
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04
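If installing broom is not an option, the same "run the test once per group" idea can be sketched with dplyr alone. This is only a minimal sketch: it stores each htest object in a list column (called tt here, an arbitrary name) and then pulls out the two fields with vapply.

```r
library(dplyr)

tt_data <- iris %>%
  group_by(Species) %>%
  # one t.test() call per group, kept as a list column of htest objects
  summarise(tt = list(t.test(Sepal.Length, Petal.Length,
                             alternative = "two.sided", paired = TRUE)),
            .groups = "drop") %>%
  # extract the fields of interest from each stored result
  mutate(p = vapply(tt, function(x) x$p.value, numeric(1)),
         estimate = vapply(tt, function(x) unname(x$estimate), numeric(1))) %>%
  select(-tt)

tt_data
```

The list column keeps the full test object around until the final select(), so other fields (for example conf.int) could be extracted the same way.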
cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
this code uses cdata$WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
this code uses WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?
When there is a grouping variable, cdata$WEIGHT bypasses the grouping: it extracts the whole column from the original data frame, so the sum is taken over every row and repeated for each group. A bare WEIGHT is evaluated within each group, so it returns only that group's values.
If we really want to use $, we can use the .data pronoun:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick()):
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas .$ or iris$ extracts the whole column and ignores the grouping:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.
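The same effect can be reproduced on a tiny data set where the sums are easy to check by hand. The tibble cdata_demo below is made up for illustration (it is not the asker's cdata); only the column names state and WEIGHT are borrowed from the question.

```r
library(dplyr)

# Hypothetical stand-in for cdata: two rows in state "A", one in "B"
cdata_demo <- tibble(state = c("A", "A", "B"),
                     WEIGHT = c(1, 2, 10))

res <- cdata_demo %>%
  group_by(state) %>%
  summarise(n = n(),
            whole_column = sum(cdata_demo$WEIGHT),  # ignores groups: 1 + 2 + 10
            per_group = sum(WEIGHT),                # respects groups
            .groups = "drop")

res
```

Here whole_column is 13 in both rows, while per_group is 3 for "A" and 10 for "B".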
I am trying to make a data.frame which displays the average time an individual displays a behaviour.
I have been using group_by and summarise to calculate the averages across groups, but the output is in long format, with one row per combination. See an example using the iris dataset:
x <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean(Sepal.Length))
I would like to get an output that has, for this example, one row per 'Species' and a column of averages per 'Petal.Length'.
I have resorted to creating multiple outputs and then using left_join to combine them into the desired data.frame. See example below...
a <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.1) %>%
  summarise(mean(Sepal.Length))
b <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == 0.2) %>%
  summarise(mean(Sepal.Length))
left_join(a, b, by = "Species")
However, doing this twelve or more times at a time is tedious and I am sure there must be an easy way to get the mean(Sepal.Length) for the 'Petal.Length' 0.1, and 0.2, and 0.3 (etc) in the one output.
n.b. in my data Petal.Length would actually be characters that represent behaviours and Sepal.Length would be the duration of time
Some ideas:
mutate(iris, Petal.Length_discrete = cut(Petal.Length, 5)) %>%
  group_by(Species, Petal.Length_discrete) %>%
  summarise(mean(Sepal.Length))
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 7 x 3
#> # Groups: Species [3]
#> Species Petal.Length_discrete `mean(Sepal.Length)`
#> <fct> <fct> <dbl>
#> 1 setosa (0.994,2.18] 5.01
#> 2 versicolor (2.18,3.36] 5
#> 3 versicolor (3.36,4.54] 5.81
#> 4 versicolor (4.54,5.72] 6.43
#> 5 virginica (3.36,4.54] 4.9
#> 6 virginica (4.54,5.72] 6.32
#> 7 virginica (5.72,6.91] 7.25
iris %>%
  group_split(Species, Petal.Length) %>%
  map(~ summarise(.x, mean(Sepal.Length))) %>%
  head(3)
#> [[1]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.6
#> [[2]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 4.3
#> [[3]]
#> # A tibble: 1 x 1
#> `mean(Sepal.Length)`
#> <dbl>
#> 1 5.4
Created on 2021-06-28 by the reprex package (v2.0.0)
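For the layout the question asks for (one row per Species, one column of averages per Petal.Length value), a reshape after the grouped summary may be closer to the goal. The sketch below assumes tidyr is available and uses pivot_wider(); the column name mean_sepal and the prefix "PL_" are arbitrary choices made here.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  group_by(Species, Petal.Length) %>%
  summarise(mean_sepal = mean(Sepal.Length), .groups = "drop") %>%
  # spread each Petal.Length value into its own column of averages
  pivot_wider(names_from = Petal.Length,
              values_from = mean_sepal,
              names_prefix = "PL_")

dim(wide)
```

Combinations that do not occur in the data come out as NA. With character behaviour labels in place of Petal.Length, the same call should apply unchanged.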
I'd like to be able to use dplyr's group_by to group by multiple columns, which is simple enough. The complication is that I want to create a function where one or more columns are always in the group_by and the user can select an additional column to group by. What I've tried so far uses the non-string specification for the columns that are always grouped and a string for the column the user selects, but nothing I've tried works. This combination seems to work fine in select(), but not in group_by(). Ideally, I'd rather not switch to all strings, because I want to keep the dplyr functionality that lets me select a range of columns. Below is an example.
To make a simple example, I started with the iris data set and added a couple more columns, their exact meanings are not important.
test_tbl <- iris %>%
mutate(extra_var1 = ifelse(Sepal.Length >= 5.0, "Yes", "No"),
extra_var2 = "What")
Here's an example that uses the non-string specification for all variables, which works just fine:
test_tbl %>%
select(Species, extra_var1, Sepal.Length, Petal.Width) %>%
group_by(Species, extra_var1) %>%
summarize(average.Sepal.Length = mean(Sepal.Length),
average.Petal.Width = mean(Petal.Width))
But, I'd like to be able to, within a function, have the user specify whether they want to group by extra_var1 or extra_var2. Here's my attempt, which doesn't work. Again, I believe the select part works fine, but the group_by part does not.
group_and_summarize <- function(var) {
  test_tbl %>%
    select(Species, var, Sepal.Length, Petal.Width) %>%
    group_by(Species, var) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width))
}
This would be one way to do it:
group_and_summarize <- function(var) {
  test_tbl %>%
    select(Species, {{var}}, Sepal.Length, Petal.Width) %>%
    group_by(Species, {{var}}) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(extra_var1)
#> `summarise()` regrouping output by 'Species' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: Species [3]
#> Species extra_var1 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <dbl> <dbl>
#> 1 setosa No 4.67 0.195
#> 2 setosa Yes 5.23 0.28
#> 3 versicolor No 4.9 1
#> 4 versicolor Yes 5.96 1.33
#> 5 virginica No 4.9 1.7
#> 6 virginica Yes 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
If you want the user to enter strings then we can use !!! syms():
group_and_summarize <- function(vars) {
  test_tbl %>%
    select(Species, !!! syms(vars), Sepal.Length, Petal.Width) %>%
    group_by(Species, !!! syms(vars)) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width))
}
group_and_summarize(c("extra_var1", "extra_var2"))
#> `summarise()` regrouping output by 'Species', 'extra_var1' (override with `.groups` argument)
#> # A tibble: 6 x 5
#> # Groups: Species, extra_var1 [6]
#> Species extra_var1 extra_var2 average.Sepal.Length average.Petal.Width
#> <fct> <chr> <chr> <dbl> <dbl>
#> 1 setosa No What 4.67 0.195
#> 2 setosa Yes What 5.23 0.28
#> 3 versicolor No What 4.9 1
#> 4 versicolor Yes What 5.96 1.33
#> 5 virginica No What 4.9 1.7
#> 6 virginica Yes What 6.62 2.03
Created on 2021-05-11 by the reprex package (v0.3.0)
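With dplyr 1.0.0 or later, string input can also be handled without rlang's syms(), by wrapping the character vector in across(all_of(...)) inside group_by(). This is a sketch of that alternative, not the answer's original method; the function name group_and_summarize_chr is made up here to avoid clashing with the versions above.

```r
library(dplyr)

test_tbl <- iris %>%
  mutate(extra_var1 = ifelse(Sepal.Length >= 5.0, "Yes", "No"),
         extra_var2 = "What")

# Hypothetical variant: `vars` is a character vector of extra grouping columns
group_and_summarize_chr <- function(vars) {
  test_tbl %>%
    group_by(Species, across(all_of(vars))) %>%
    summarize(average.Sepal.Length = mean(Sepal.Length),
              average.Petal.Width = mean(Petal.Width),
              .groups = "drop")
}

res_chr <- group_and_summarize_chr("extra_var1")
res_chr
```

Because all_of() is a tidyselect helper, a misspelled column name raises an error instead of silently grouping by nothing.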
Suppose I have the following function:
SlowFunction = function(vector){
  return(list(
    mean = mean(vector),
    sd = sd(vector)
  ))
}
And I would like to use dplyr::summarise to write the results to a data frame:
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(
    mean = SlowFunction(Sepal.Length)$mean,
    sd = SlowFunction(Sepal.Length)$sd
  )
Does anyone have a suggestion how I can do this by calling "SlowFunction" once instead of twice? (In my code "SlowFunction" is a slow function that I have to call many times.) Without splitting "SlowFunction" in two parts of course. So actually I would like to somehow fill multiple columns of a dataframe in one statement.
Without changing your current SlowFunction, one way is to use do:
iris %>%
  group_by(Species) %>%
  do(data.frame(SlowFunction(.$Sepal.Length)))
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
Or with group_split + purrr::map_dfr:
bind_cols(Species = unique(iris$Species), iris %>%
  group_split(Species) %>%
  purrr::map_dfr(~ SlowFunction(.x$Sepal.Length)))
Another option is to store the output of SlowFunction in a list column of data frames and then use unnest:
iris %>%
  group_by(Species) %>%
  summarise(res = list(as.data.frame(SlowFunction(Sepal.Length)))) %>%
  tidyr::unnest(res)
## A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
We can use group_modify() (dplyr 0.8.1 or later; in dplyr 0.8.0 this was what group_map() did, and group_map() has since returned a list instead). The output from SlowFunction needs to be converted to a data frame.
iris %>%
  group_by(Species) %>%
  group_modify(~ SlowFunction(.x$Sepal.Length) %>% as.data.frame())
# # A tibble: 3 x 3
# # Groups: Species [3]
# Species mean sd
# <fct> <dbl> <dbl>
# 1 setosa 5.01 0.352
# 2 versicolor 5.94 0.516
# 3 virginica 6.59 0.636
We can change SlowFunction to return a tibble
SlowFunction = function(vector){
  tibble(
    mean = mean(vector),
    sd = sd(vector)
  )
}
and then unnest the list column produced by summarise
iris %>%
  group_by(Species) %>%
  summarise(out = list(SlowFunction(Sepal.Length))) %>%
  unnest(out)
# A tibble: 3 x 3
# Species mean sd
# <fct> <dbl> <dbl>
#1 setosa 5.01 0.352
#2 versicolor 5.94 0.516
#3 virginica 6.59 0.636
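With dplyr 1.0.0 or later there is one more route: an unnamed expression inside summarise() that returns a data frame is unpacked into several columns, so no list column or unnest step is needed. The sketch below assumes that behaviour and adds a counter (calls, introduced here purely for illustration) to show how often the function actually runs.

```r
library(dplyr)

calls <- 0  # illustration only: counts invocations of SlowFunction

SlowFunction <- function(vector) {
  calls <<- calls + 1
  list(mean = mean(vector), sd = sd(vector))
}

res <- iris %>%
  group_by(Species) %>%
  # unnamed data-frame result: each of its columns becomes a summary column
  summarise(as.data.frame(SlowFunction(Sepal.Length)), .groups = "drop")

res
calls
```

After the pipeline runs, calls should equal the number of groups (three for iris), compared with twice that for the original two-call version.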
I would like to pre-assign my column name and use that within a dplyr pipe
Here's an example. I want to do this:
iris %>%
group_by(Species) %>%
summarise(Var = mean(Petal.Length[Sepal.Width > 3]))
But with the column name assigned outside of the pipe, like this
col_name <- "Petal.Length"
iris %>%
group_by(Species) %>%
summarise(Var = mean(!!col_name[Sepal.Width > 3]))
We can convert to symbol (sym) and then do the evaluation (!!)
iris %>%
group_by(Species) %>%
summarise(Var = mean((!!rlang::sym(col_name))[Sepal.Width >3]))
# A tibble: 3 x 2
# Species Var
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
If we need to use only dplyr, we can pass the variable to summarise_at (note that funs() has been deprecated since dplyr 0.8.0 and the _at variants are superseded by across())
iris %>%
group_by(Species) %>%
summarise_at(vars(col_name), funs(mean(.[Sepal.Width > 3])))
# A tibble: 3 x 2
# Species Petal.Length
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
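Another dplyr-only possibility, assuming a version that exports the .data pronoun, is to subset it with the string. This is a sketch of that approach rather than part of the original answer:

```r
library(dplyr)

col_name <- "Petal.Length"

res <- iris %>%
  group_by(Species) %>%
  # .data[[col_name]] looks the column up by its string name within each group
  summarise(Var = mean(.data[[col_name]][Sepal.Width > 3]), .groups = "drop")

res
```

The result should match the sym() version above, and .data[[...]] fails with an error if col_name does not name a column in the data.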