tidyverse and a $ subsetting peculiarity? Intent for such behavior? - r

cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
This code uses cdata$WEIGHT:
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
This code uses WEIGHT:
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?

When there is a grouping variable, cdata$WEIGHT extracts the whole column from the original object, ignoring the groups, so the sum is computed over every row and repeated for each group. Using only WEIGHT refers to the column within the current group, so the sum is computed per group.
If we really want to use $, use the .data pronoun, which respects the grouping:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data() (deprecated since dplyr 1.1.0 in favor of pick()):
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas if we use .$ or iris$, the whole column is extracted without regard to the grouping, so every group gets the same overall total:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.
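To connect this back to the weighted count in the question, here is a minimal sketch. The cdata tibble below is a made-up stand-in (the real data came from a .sav file that is not shown), and the column names whole_column, per_group and via_pronoun are only illustrative.

```r
library(dplyr)

# Hypothetical stand-in for the asker's cdata: two states, five rows
cdata <- tibble(
  state  = c("A", "A", "B", "B", "B"),
  WEIGHT = c(1, 2, 3, 4, 5)
)

res <- cdata %>%
  group_by(state) %>%
  summarise(
    n            = n(),
    whole_column = sum(cdata$WEIGHT),  # ignores groups: total over all rows
    per_group    = sum(WEIGHT),        # data masking: current group only
    via_pronoun  = sum(.data$WEIGHT),  # same as per_group, but with $
    .groups = "drop"
  )

res
```

In this toy data whole_column is the same in both rows, while per_group and via_pronoun differ by state, which is the pattern seen in the question.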

Related

Dplyr: Count number of observations in group and summarise?

I'm wondering if there is a more elegant way to perform this.
Right now, I am grouping all observations by Species. Then I summarize the median values.
median <- iris %>%
group_by(Species) %>%
summarise(medianSL = median(Sepal.Length),
medianSW = median(Sepal.Width),
medianPL = median(Petal.Length),
medianPW = median(Petal.Width))
I also wanted a column (n) that shows the number of flowers in each group:
median_n <- iris %>%
group_by(Species) %>%
tally()
Can I combine these two code chunks? So that way the above code chunk will generate a table with the median lengths AND the total n for each row?
We can use across inside summarise to loop over the numeric columns and get the medians, and add a frequency count with n() outside the across call:
library(dplyr)
library(stringr)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric),
~ median(.x, na.rm = TRUE),
.names = "median{str_remove_all(.col, '[a-z.]+')}"),
n = n(), .groups = "drop")
Output:
# A tibble: 3 × 6
Species medianSL medianSW medianPL medianPW n
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 setosa 5 3.4 1.5 0.2 50
2 versicolor 5.9 2.8 4.35 1.3 50
3 virginica 6.5 3 5.55 2 50
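If the abbreviated names are not needed, a simpler variant without stringr might look like the sketch below. The "median_{.col}" naming pattern is an illustrative choice, not the asker's requested output. Note that n = n() is placed after across: as far as I know, a column created earlier in the same summarise call is visible to where(is.numeric), so putting n first would also compute a median of n.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # one median per numeric column, named median_<original column name>
    across(where(is.numeric),
           ~ median(.x, na.rm = TRUE),
           .names = "median_{.col}"),
    # group size, added after across() so it is not picked up by where()
    n = n(),
    .groups = "drop"
  )

res
```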

Group t test result into columns within tidyverse

I'd like to group multiple t test result into one table. Originally my code looks like this:
tt_data <- iris %>%
group_by(Species) %>%
summarise(p = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$p.value,
estimate = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$estimate
)
tt_data
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, based on the idea that I should only perform the statistical test once, is there a way for me to run the t test once per group and collect the intended table? I think there is some combination of broom and purrr that does this, but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
tt_data
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
You can use broom::tidy() to transform the result of the t.test to a tidy 'tibble':
library(dplyr)
library(broom)
iris %>%
group_by(Species) %>%
group_modify(~{
t.test(.$Sepal.Length,.$Petal.Length,alternative="two.sided",paired=T) %>%
tidy()
}) %>%
select(estimate, p.value)
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)
You can use map to select the desired values from the list generated by t.test and by tidying it up to a data frame via broom::tidy, i.e.
library(dplyr)
library(tidyr) # unnest() comes from tidyr, not dplyr
iris %>%
group_by(Species) %>%
summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)))) %>%
mutate(p.value = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
select(-p) %>%
unnest(p.value)
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04
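For comparison, the same "run the test once per group" idea can be sketched without broom or purrr, by keeping the htest object in a list column and pulling out the pieces afterwards. This is only one possible arrangement; the intermediate column name tt is an arbitrary choice.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  # run the paired t test once per group and keep the whole result object
  summarise(
    tt = list(t.test(Sepal.Length, Petal.Length,
                     alternative = "two.sided", paired = TRUE)),
    .groups = "drop"
  ) %>%
  # extract the two values of interest from each stored test
  mutate(
    p        = vapply(tt, function(x) x$p.value, numeric(1)),
    estimate = vapply(tt, function(x) unname(x$estimate), numeric(1))
  ) %>%
  select(-tt)

res
```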

How to collapse multiple variables by column in csv file with R

I've written code to collapse my table, and it works; however, whenever I write to a new csv, the table does not seem to be in the collapsed state. Perhaps this has to do with the summarize function I am using? I am collapsing variables d1, d2, d3 for region_ID. Is there a way for me to put all 3 variables together and save to a new csv?
#Collapse the bsu_re_dec.csv table
HI_csv<- read.csv("C:\\filepath\\bsu_re_dec.csv")
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d1))
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d2))
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d3))
We can use summarise_at to summarise multiple variables
library(dplyr)
HI_csv %>%
group_by(region_ID) %>%
summarise_at(vars(matches('^d\\d+$')), sum)
In the devel version of dplyr (across was later released in dplyr 1.0.0), another option is across with summarise:
HI_csv %>%
group_by(region_ID) %>%
summarise(across(matches('^d\\d+$'), sum))
Or with a reproducible example
iris %>%
group_by(Species) %>%
summarise(across(everything(), sum))
# A tibble: 3 x 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 250. 171. 73.1 12.3
#2 versicolor 297. 138. 213 66.3
#3 virginica 329. 149. 278. 101.
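On the csv part of the question: the summarised table is only printed unless it is assigned, so writing HI_csv afterwards would write the original rows. A minimal sketch of assigning and saving follows. The small HI_csv data frame, the name collapsed and the temporary output path are all made up for illustration, since the real file is not available.

```r
library(dplyr)

# Hypothetical stand-in for the data read from bsu_re_dec.csv
HI_csv <- data.frame(
  region_ID = c(1, 1, 2, 2),
  d1 = 1:4,
  d2 = 5:8,
  d3 = 9:12
)

# Assign the collapsed table to a new object instead of only printing it
collapsed <- HI_csv %>%
  group_by(region_ID) %>%
  summarise(across(matches("^d\\d+$"), sum), .groups = "drop")

# Write the collapsed object (not HI_csv) to disk; the path is a placeholder
out_path <- tempfile(fileext = ".csv")
write.csv(collapsed, out_path, row.names = FALSE)

# Read it back to confirm the file holds one row per region
check <- read.csv(out_path)
check
```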

How to use "summarise" from dplyr with dynamic column names?

I am summarizing group means from a table using the summarize function from the dplyr package in R. I would like to do this dynamically, using a column name string stored in another variable.
The following is the "normal" way and it works, of course:
myTibble <- group_by( iris, Species)
summarise( myTibble, avg = mean( Sepal.Length))
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
However, I would like to do something like this instead:
myTibble <- group_by( iris, Species)
colOfInterest <- "Sepal.Length"
summarise( myTibble, avg = mean( colOfInterest))
I've read the Programming with dplyr page, and I've tried a bunch of combinations of quo, enquo, !!, .dots=(...), etc., but I haven't figured out the right way to do it yet.
I'm also aware of this answer, but 1) when I use the standard-evaluation function summarise_, R tells me that it's deprecated, and 2) that answer doesn't seem elegant at all. So, is there a good, easy way to do this?
Thank you!
1) Use !!sym(...) like this:
library(dplyr)
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
# sym() turns the string into a symbol; !! unquotes it inside mean()
summarize(avg = mean(!!sym(colOfInterest))) %>%
ungroup()
giving:
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2) A second approach is:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(.data[[colOfInterest]])) %>%
ungroup()
Of course, this is straightforward in base R:
aggregate(list(avg = iris[[colOfInterest]]), iris["Species"], mean)
Another solution:
iris %>%
group_by(Species) %>%
summarise_at(vars("Sepal.Length"), mean) %>%
ungroup()
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
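The .data[[...]] approach also wraps naturally into a function that takes column names as strings. The sketch below is one way to do it; the function name group_mean and its argument names are hypothetical, and across(all_of(...)) is used for the grouping column so that both inputs are plain strings.

```r
library(dplyr)

# Hypothetical helper: mean of value_col within each level of group_col,
# with both columns given as character strings
group_mean <- function(df, group_col, value_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%
    summarise(avg = mean(.data[[value_col]]), .groups = "drop")
}

colOfInterest <- "Sepal.Length"
res <- group_mean(iris, "Species", colOfInterest)
res
```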

How does one summarize with conditions into a single variable in R?

I would like to use summarise() from dplyr after grouping data to compute a new variable. But, I would like it to use one equation for some of the data and a second equation for the rest of the data.
I have tried using group_by() and summarise() with if_else() but it isn't working.
Here's an example. Let's say--for some reason--I wanted to find a special value for sepal length. For the species 'setosa' this special value is twice the mean of the sepal length. For all of the other species it is simply the mean of sepal length. This is the code I've tried, but it doesn't work with summarise()
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This idea works with mutate() but I would need to re-format the tibble to be the dataset I am looking for.
library(dplyr)
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa", mean(Sepal.Length)*2, mean(Sepal.Length)))
This is how I want the resulting tibble to be laid out:
library(dplyr)
iris %>%
group_by(Species)%>%
summarise(sepal_mean = mean(Sepal.Length))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
#>
But my result would show the value for setosa x 2
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa **10.02**
#2 versicolor 5.94
#3 virginica 6.59
#>
Suggestions? I feel like I've really searched for ways to use if_else() with summarise() but can't find it anywhere, which means there must be a better way.
Thanks!
After the mutate step, use summarise to get the first element of 'sepal_special' for each 'Species'
iris %>%
group_by(Species) %>%
mutate(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))) %>%
summarise(sepal_special = first(sepal_special))
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Or, instead of calling mutate, apply if_else inside summarise and take the first value of its result:
iris %>%
group_by(Species) %>%
summarise(sepal_special = if_else(Species == "setosa",
mean(Sepal.Length)*2, mean(Sepal.Length))[1])
# A tibble: 3 x 2
# Species sepal_special
# <fctr> <dbl>
#1 setosa 10.0
#2 versicolor 5.94
#3 virginica 6.59
Another option: since twice the mean is the same as the mean of twice the values, you can double the sepal lengths for setosa and then summarise:
iris %>%
mutate(Sepal.Length = ifelse(Species == "setosa", 2*Sepal.Length, Sepal.Length)) %>%
group_by(Species) %>%
summarise(sepal_special = mean(Sepal.Length))
# A tibble: 3 x 2
Species sepal_special
<fct> <dbl>
1 setosa 10.0
2 versicolor 5.94
3 virginica 6.59
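Since the condition depends only on the group, another possible arrangement is to build a length-one multiplier from the first Species value in each group, so that summarise receives a single value directly. This is a sketch of that idea, not a replacement for the answers above.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # first(Species) is a single value per group, so if_else() returns
    # one multiplier: 2 for setosa, 1 for every other species
    sepal_special = mean(Sepal.Length) *
      if_else(first(Species) == "setosa", 2, 1),
    .groups = "drop"
  )

res
```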
