How to collapse multiple variables by column in csv file with R

How to collapse multiple variables by column in csv file with R - r

I've written code to collapse my table, and it works, however whenever I write to a new csv the table seems to not be in the collapsed state. Perhaps this has to do with the summarize function I am using? I am collapsing variables d1, d2, d3 for region_ID. Is there a way for me put all 3 variables together and save to a new csv?
#Collapse the bsu_re_dec.csv table
HI_csv<- read.csv("C:\\filepath\\bsu_re_dec.csv")
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d1))
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d2))
HI_csv %>%
group_by(region_ID) %>%
summarize(sum(d3))

We can use summarise_at to summarise multiple variables
library(dplyr)
HI_csv %>%
group_by(region_ID) %>%
summarise_at(vars(matches('^d\\d+$')), sum)
In the devel version of dplyr, another option is across with `summarise
HI_csv %>%
group_by(region_ID) %>%
summarise(across(matches('^d\\d+'), sum))
Or with a reproducible example
iris %>%
group_by(Species) %>%
summarise(across(everything(), sum))
# A tibble: 3 x 5
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 250. 171. 73.1 12.3
#2 versicolor 297. 138. 213 66.3
#3 virginica 329. 149. 278. 101.

Related

Group by or sum more column [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 10 months ago.
I would like to know if exist a method "automatic" for calculted more column in the same time.
library(dplyr)
abc <- iris %>%
group_by(Species) %>%
summarise(abc = sum(Petal.Width)) %>%
ungroup()
abc2 <- iris %>%
group_by(Species) %>%
summarise(abc = sum(Sepal.Width)) %>%
ungroup()
I can use this code for each column but if i need to do more column (in this dataset the first four), how can I do? And I need in the same dataset, it is possible?

Try this:
iris %>%
group_by(Species) %>%
summarise(across(Sepal.Length:Petal.Width, sum))
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 250. 171. 73.1 12.3
2 versicolor 297. 138. 213 66.3
3 virginica 329. 149. 278. 101.

tidyverse and a $ subsetting pecularity? intent for such behavior?

cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
this code uses cdata$WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
this code uses WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?

When we having a grouping variable, cdata$WEIGHT extracts the whole column and thus the sum is from the whole column whereas if we use only WEIGHT, it returns only the data from the column for each group
If we really wanted to use $, then use the pronoun .data
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data()
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas if we use .$ or iris$, it extracts the whole column breaking the group attributes
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.

Group t test result into columns within tidyverse

I'd like to group multiple t test result into one table. Originally my code looks like this:
tt_data <- iris %>%
group_by(Species) %>%
summarise(p = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$p.value,
estimate = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$estimate
)
tt_data
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, base on the idea that I should only perform the statistical test once, is there a way for me to run t test once per group and collect the intended table? I think there are some combination of broom and purrr but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
tt_data
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036

You can use broom::tidy() to transform the resut of the t.test to a tidy 'tibble':
library(dplyr)
library(broom)
iris %>%
group_by(Species) %>%
group_modify(~{
t.test(.$Sepal.Length,.$Petal.Length,alternative="two.sided",paired=T) %>%
tidy()
}) %>%
select(estimate, p.value)
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)

You can use map to select the desired values from the list generated by t.test and by tidying it up to a data frame via broom::tidy, i.e.
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = T)))) %>%
mutate(p.value = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
select(-p) %>%
unnest()
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04

How to use "summarise" from dplyr with dynamic column names?

I am summarizing group means from a table using the summarize function from the dplyr package in R. I would like to do this dynamically, using a column name string stored in another variable.
The following is the "normal" way and it works, of course:
myTibble <- group_by( iris, Species)
summarise( myTibble, avg = mean( Sepal.Length))
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
However, I would like to do something like this instead:
myTibble <- group_by( iris, Species)
colOfInterest <- "Sepal.Length"
summarise( myTibble, avg = mean( colOfInterest))
I've read the Programming with dplyr page, and I've tried a bunch of combinations of quo, enquo, !!, .dots=(...), etc., but I haven't figured out the right way to do it yet.
I'm also aware of this answer, but, 1) when I use the standard-evaluation function standardise_, R tells me that it's depreciated, and 2) that answer doesn't seem elegant at all. So, is there a good, easy way to do this?
Thank you!

1) Use !!sym(...) like this:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(!!sym(colOfInterest))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2) A second approach is:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(.data[[colOfInterest]])) %>%
ungroup
Of course this is straight forward in base R:
aggregate(list(avg = iris[[colOfInterest]]), iris["Species"], mean)

Another solution:
iris %>%
group_by(Species) %>%
summarise_at(vars("Sepal.Length"), mean) %>%
ungroup()
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59

Using pre-assigned column name within a pipe

I would like to pre-assign my column name and use that within a dplyr pipe
Here's an example. I want to do this:
iris %>%
group_by(Species) %>%
summarise(Var = mean(Petal.Length[Sepal.Width > 3]))
But with the column name assigned outside of the pipe, like this
col_name <- "Petal.Length"
iris %>%
group_by(Species) %>%
summarise(Var = mean(!!col_name[Sepal.Width > 3]))

We can convert to symbol (sym) and then do the evaluation (!!)
iris %>%
group_by(Species) %>%
summarise(Var = mean((!!rlang::sym(col_name))[Sepal.Width >3]))
# A tibble: 3 x 2
# Species Var
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72
If we need to use only dplyr, then can pass the variable object in summarise_at
iris %>%
group_by(Species) %>%
summarise_at(vars(col_name), funs(mean(.[Sepal.Width > 3])))
# A tibble: 3 x 2
# Species Petal.Length
# <fct> <dbl>
#1 setosa 1.48
#2 versicolor 4.65
#3 virginica 5.72

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to collapse multiple variables by column in csv file with R - r

Related

Group by or sum more column [duplicate]

tidyverse and a $ subsetting pecularity? intent for such behavior?

Group t test result into columns within tidyverse

How to use "summarise" from dplyr with dynamic column names?

Using pre-assigned column name within a pipe

Categories

Resources