Summarise multiple columns that have to be grouped tidyverse - r

I have a data frame containing data that looks something like this:
df <- data.frame(
  group1 = c("High","High","High","Low","Low","Low"),
  group2 = c("male","female","male","female","male","female"),
  one = c("yes","yes","yes","yes","no","no"),
  two = c("no","yes","no","yes","yes","yes"),
  three = c("yes","no","no","no","yes","yes")
)
I want to summarise the counts of yes/no in the variables one, two, and three. For a single column I would normally do df %>% group_by(group1, group2, one) %>% summarise(n()). Is there any way I can summarise all three columns and then bind them into one output data frame without having to repeat the code for each column? I've tried using a for loop, but I can't get group_by() to recognise the column name I give it as input.

Get the data in long format and count:
library(dplyr)
library(tidyr)
df %>% pivot_longer(cols = one:three) %>% count(group1, group2, value)
# group1 group2 value n
# <chr> <chr> <chr> <int>
#1 High female no 1
#2 High female yes 2
#3 High male no 3
#4 High male yes 3
#5 Low female no 2
#6 Low female yes 4
#7 Low male no 1
#8 Low male yes 2
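If you do want the loop-style approach the asker attempted, the usual trick is to refer to each column through the .data pronoun instead of a bare name; a minimal sketch (not from the original answer), assuming purrr is available:
library(dplyr)
library(purrr)
# Count each column in turn; naming the counted column "value" inside count()
# gives every piece the same columns, so the results stack cleanly.
map_dfr(c("one", "two", "three"), function(col) {
  df %>%
    count(group1, group2, value = .data[[col]]) %>%
    mutate(variable = col)
})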

This can be done in dplyr alone (no need for tidyr::pivot_*), though it gives a slightly different output format. (It works even without rowwise(), apparently because inside a grouped summarise() c_across() collects all values of the selected columns for each group.)
df <- data.frame(
  group1 = c("High","High","High","Low","Low","Low"),
  group2 = c("male","female","male","female","male","female"),
  one = c("yes","yes","yes","yes","no","no"),
  two = c("no","yes","no","yes","yes","yes"),
  three = c("yes","no","no","no","yes","yes")
)
library(dplyr)
df %>%
  group_by(group1, group2) %>%
  summarise(yes_count = sum(c_across(everything()) == 'yes'),
            no_count = sum(c_across(one:three) == 'no'), .groups = 'drop')
#> # A tibble: 4 x 4
#> group1 group2 yes_count no_count
#> <chr> <chr> <int> <int>
#> 1 High female 2 1
#> 2 High male 3 3
#> 3 Low female 4 2
#> 4 Low male 2 1
Created on 2021-05-12 by the reprex package (v2.0.0)

Using data.table
library(data.table)
melt(setDT(df), id.var = c('group1', 'group2'))[, .(n = .N),
                                                .(group1, group2, value)]
-output
group1 group2 value n
1: High male yes 3
2: High female yes 2
3: Low female yes 4
4: Low male no 1
5: Low female no 2
6: High male no 3
7: Low male yes 2
8: High female no 1
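As a side note (not part of the answer above), the counts can be reshaped so yes/no become separate columns, e.g. with dcast():
library(data.table)
counts <- melt(setDT(df), id.var = c('group1', 'group2'))[, .(n = .N),
                                                          .(group1, group2, value)]
dcast(counts, group1 + group2 ~ value, value.var = "n", fill = 0)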
With base R, we can use by and table
by(df[3:5], df[1:2], function(x) table(unlist(x)))

Related

How to get tally (rolling sum) by group in R?

I would like to create a column called "tally" in my dataset that holds the running sum of count within each type, following the rank order.
type <- c("A","A","A","B","B","C")
rank <- c("low", "med", "high","med", "high", "low")
count <- c(9,20,31,2,4,14)
df <- data.frame(type, rank, count)
My desired output would be:
type rank count tally
1 A low 9 9
2 A med 20 29
3 A high 31 60
4 B med 2 2
5 B high 4 6
6 C low 14 14
I guess another way to describe it would be a rolling sum (where it takes into account the low to high order)? I have looked around but I can't find any good functions to do this. Ideally, I could have a for loop that would allow me to get this "rolling sum" by type.
We can use cumsum after grouping by 'type'
library(dplyr)
df <- df %>%
  group_by(type) %>%
  mutate(tally = cumsum(count)) %>%
  ungroup()
-output
# A tibble: 6 x 4
type rank count tally
<chr> <chr> <dbl> <dbl>
1 A low 9 9
2 A med 20 29
3 A high 31 60
4 B med 2 2
5 B high 4 6
6 C low 14 14
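The same running total can also be computed without dplyr; a small base R sketch using ave() on the original df:
# cumsum() applied within each type, keeping the original row order
df$tally <- ave(df$count, df$type, FUN = cumsum)
df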

Summarize and Transpose rows to columns in R

This is my input data:
Program = c("A","A","A","B","B","C")
Age = c(10,30,30,12,32,53)
Gender = c("F","F","M","M","M","F")
Language = c("Eng","Eng","Kor","Kor","Other","Other")
df = data.frame(Program,Age,Gender,Language)
I would like to output a table like this:
Program  MEAN AGE  ENG  KOR  FEMALE  MALE
A
B
C
Where MEAN AGE is the average age, ENG,KOR,FEMALE,MALE are counts.
I have tried using dplyr and t(), but in this case I feel completely lost as to what the steps are (my first post, new to this). Thank you in advance!
You can take the following approach:
library(dplyr)
df %>%
  group_by(Program) %>%
  summarise(
    `Mean Age` = mean(Age),
    ENG = sum(Language == "Eng"),
    KOR = sum(Language == "Kor"),
    Female = sum(Gender == "F"),
    Male = sum(Gender == "M"),
    .groups = "drop"
  )
Output:
# A tibble: 3 x 6
Program `Mean Age` ENG KOR Female Male
<chr> <dbl> <int> <int> <int> <int>
1 A 23.3 2 1 2 1
2 B 22 0 1 0 2
3 C 53 0 0 1 0
Note: .groups is a special argument of dplyr's summarise(), not a column. The way it's used here is equivalent to adding %>% ungroup() after the calculation. Any other name you type inside summarise() is assumed to be a new column name.
In base R you could do:
df1 <- cbind(df[1:2], stack(df[3:4])[-2])
cbind(aggregate(Age~Program, df, mean),as.data.frame.matrix(table(df1[-2])))
Program Age Eng F Kor M Other
A A 23.33333 2 2 1 1 0
B B 22.00000 0 0 1 2 1
C C 53.00000 0 1 0 0 1
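If there are too many levels to hard-code each sum(), one alternative is to count the levels in long form and spread them back out; a sketch using tidyr (the count columns are then named after the data values, e.g. Eng/Kor/F/M, rather than the relabelled headers above):
library(dplyr)
library(tidyr)
# counts of every Gender and Language level per Program
counts <- df %>%
  pivot_longer(c(Gender, Language), values_to = "level") %>%
  count(Program, level) %>%
  pivot_wider(names_from = level, values_from = n, values_fill = 0)
# join the mean age back on
df %>%
  group_by(Program) %>%
  summarise(`Mean Age` = mean(Age), .groups = "drop") %>%
  left_join(counts, by = "Program")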

Group data hierarchically on two levels, then compute relative frequencies in R using dplyr [duplicate]

This question already has answers here:
Relative frequencies / proportions with dplyr
(10 answers)
Closed 3 years ago.
I want to do something which appears simple, but I don't have a good feel for R yet, it is a maze of twisty passages, all different.
I have a table with several variables, and I want to group on two variables ... I want a two-level hierarchical grouping, also known as a tree. This can evidently be done using the group_by function of dplyr.
And then I want to compute marginal statistics (in this case, relative frequencies) based on group counts for level 1 and level 2.
In pictures: given a table of 18 rows, I want a table of 6 rows.
Is there a simple way to do this in dplyr? (I can do it in SQL, but ...)
Edited for example
For example, based on the nycflights13 package:
library(dplyr)
install.packages("nycflights13")
require(nycflights13)
data(flights) # contains information about flights, one flight per row
ff <- flights %>%
  mutate(approx_dist = floor((distance + 999)/1000)*1000) %>%
  select(carrier, approx_dist) %>%
  group_by(carrier, approx_dist) %>%
  summarise(n = n()) %>%
  arrange(carrier, approx_dist)
This creates a tbl ff with the number of flights for each pair of (carrier, inter-airport-distance-rounded-to-1000s):
# A tibble: 33 x 3
# Groups: carrier [16]
carrier approx_dist n
<chr> <dbl> <int>
1 9E 1000 15740
2 9E 2000 2720
3 AA 1000 9146
4 AA 2000 17210
5 AA 3000 6373
And now I would like to compute the relative frequencies for the "approx_dist" values in each "carrier" group, for example, I would like to get:
carrier approx_dist n rel_freq
<chr> <dbl> <int>
1 9E 1000 15740 15740/(15740+2720)
2 9E 2000 2720 2720/(15740+2720)
If I understood your problem correctly, here is what you can do. This isn't run on your exact data, but it should give you some hints:
library(dplyr)
d <- data.frame(col1 = rep(c("a", "a", "a", "b", "b", "b"), 2),
                col2 = rep(c("a1", "a2", "a3", "b1", "b2", "b3"), 2),
                stringsAsFactors = F)

d %>%
  group_by(col1) %>%
  mutate(count_g1 = n()) %>%
  ungroup() %>%
  group_by(col1, col2) %>%
  summarise(rel_freq = n()/unique(count_g1)) %>%
  ungroup()
# # A tibble: 6 x 3
# col1 col2 rel_freq
# <chr> <chr> <dbl>
# 1 a a1 0.333
# 2 a a2 0.333
# 3 a a3 0.333
# 4 b b1 0.333
# 5 b b2 0.333
# 6 b b3 0.333
Update: @TimTeaFan's suggestion on how to rewrite the code above using prop.table:
d %>% group_by(col1, col2) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
Update: Running this trick on the ff table given in the question's example, which has everything set up except the last mutate:
ff %>% mutate(rel_freq = prop.table(n))
# A tibble: 33 x 4
# Groups: carrier [16]
carrier approx_dist n rel_freq
<chr> <dbl> <int> <dbl>
1 9E 1000 15740 0.853
2 9E 2000 2720 0.147
3 AA 1000 9146 0.279
4 AA 2000 17210 0.526
5 AA 3000 6373 0.195
6 AS 3000 714 1
7 B6 1000 24613 0.450
8 B6 2000 22159 0.406
9 B6 3000 7863 0.144
10 DL 1000 20014 0.416
# … with 23 more rows
...or, since ff is still grouped by carrier after the summarise() step, simply:
ff %>% mutate(rel_freq = n/sum(n))
Fake data for demonstration:
library(dplyr)
df <- data.frame(stringsAsFactors = F,
                 col1 = rep(c("A","B"), each = 9),
                 col2 = rep(1:3),
                 value = 1:18)
#> df
# col1 col2 value
#1 A 1 1
#2 A 2 2
#3 A 3 3
#4 A 1 4
#5 A 2 5
#6 A 3 6
#7 A 1 7
#8 A 2 8
#9 A 3 9
#10 B 1 10
#11 B 2 11
#12 B 3 12
#13 B 1 13
#14 B 2 14
#15 B 3 15
#16 B 1 16
#17 B 2 17
#18 B 3 18
Solution
df %>%
  group_by(col1, col2) %>%
  summarise(col2_ttl = sum(value)) %>%              # Count is boring for this data, but you
  mutate(share_of_col1 = col2_ttl / sum(col2_ttl))  # ... could use `n()` for that
## A tibble: 6 x 4
## Groups: col1 [2]
# col1 col2 col2_ttl share_of_col1
# <chr> <int> <int> <dbl>
#1 A 1 12 0.267
#2 A 2 15 0.333
#3 A 3 18 0.4
#4 B 1 39 0.310
#5 B 2 42 0.333
#6 B 3 45 0.357
First we group by both columns. In this case the ordering makes a difference, because the groups are created hierarchically, and each summary collapses the innermost layer of grouping. So the summarise line (summarise() also has the US-spelled alias summarize()) sums up the values in each col1-col2 combination, leaving a residual grouping by col1 which we can use in the next line. (Try stopping the pipe after the summarise() step to see what is produced at that stage.)
In the last line, col2_ttl is divided by the sum of all the col2_ttl values in its group, i.e. the total for each col1.
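For reference, this is the intermediate result if the pipe stops after the summarise() step: the per-group totals, still grouped by col1, which is exactly what the final mutate() divides by.
df %>%
  group_by(col1, col2) %>%
  summarise(col2_ttl = sum(value))
## A tibble: 6 x 3
## Groups: col1 [2]
# col1 col2 col2_ttl
# <chr> <int> <int>
#1 A 1 12
#2 A 2 15
#3 A 3 18
#4 B 1 39
#5 B 2 42
#6 B 3 45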

Gather multiple variables of different classes at the same time using Tidyverse

This is a question for all the Tidyverse experts out there. I have a dataset with lots of different classes (datetime, integer, factor, etc.) and want to use tidyr to gather multiple variables at the same time. In the reproducible example below I would like to gather time_, factor_ and integer_ at once, while id and gender remain untouched.
I am looking for the current best practice solution using any of the Tidyverse functions.
(I'd prefer if the solution isn't too "hacky" as I have a dataset with dozens of different key variables and around five hundred thousand rows).
Example data:
library("tidyverse")
data <- tibble(
  id = c(1, 2, 3),
  gender = factor(c("Male", "Female", "Female")),
  time1 = as.POSIXct(c("2014-03-03 20:19:42", "2014-03-03 21:53:17", "2014-02-21 12:13:06")),
  time2 = as.POSIXct(c("2014-05-28 15:26:49 UTC", NA, "2014-05-24 10:53:01 UTC")),
  time3 = as.POSIXct(c(NA, "2014-09-26 00:52:40 UTC", "2014-09-27 07:08:47 UTC")),
  factor1 = factor(c("A", "B", "C")),
  factor2 = factor(c("B", NA, "C")),
  factor3 = factor(c(NA, "A", "B")),
  integer1 = c(1, 3, 2),
  integer2 = c(1, NA, 4),
  integer3 = c(NA, 5, 2)
)
Desired outcome:
# A tibble: 9 x 5
id gender Time Integer Factor
<dbl> <fct> <dttm> <dbl> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 2 Female 2014-03-03 21:53:17 3 B
3 3 Female 2014-02-21 12:13:06 2 C
4 1 Male 2014-05-28 15:26:49 1 B
5 2 Female NA NA NA
6 3 Female 2014-05-24 10:53:01 4 C
7 1 Male NA NA NA
8 2 Female 2014-09-26 00:52:40 5 A
9 3 Female 2014-09-27 07:08:47 2 B
P.S. I did find a couple of threads that scratch the surface of gathering multiple variables, but none deal with the issue of gathering different classes or describe the current state-of-the-art tidyverse solution.
Probably too repetitive for what you want, but using mutate_at() to re-coerce multiple variables at the end may be an option when dealing with a large number of variables.
Changing them all to character at the start maintains the time data; it then needs to be converted back to date-time at the end.
data %>%
  mutate_all(funs(as.character)) %>%
  gather(key = variable, value = value, -id, -gender, convert = T) %>%
  mutate(wave = readr::parse_number(variable),
         variable = gsub("\\d", "", x = variable)) %>%
  spread(variable, value, convert = T) %>%
  mutate(time = as.POSIXct(time),
         factor = factor(factor),
         gender = factor(gender)) %>%
  select(1, 2, 6, 5, 4)
# A tibble: 9 x 5
id gender time integer factor
<chr> <fct> <dttm> <int> <fct>
1 1 Male 2014-03-03 20:19:42 1 A
2 1 Male 2014-05-28 15:26:49 1 B
3 1 Male NA NA NA
4 2 Female 2014-03-03 21:53:17 3 B
5 2 Female NA NA NA
6 2 Female 2014-09-26 00:52:40 5 A
7 3 Female 2014-02-21 12:13:06 2 C
8 3 Female 2014-05-24 10:53:01 4 C
9 3 Female 2014-09-27 07:08:47 2 B
(I'm rewriting basically all of my previous answer, but keeping it as this post to preserve the comments.)
You can use some of the tidyselect helper functions, namely starts_with, to select batches of columns to gather, and then drop superfluous ones. This handles some of the data-type issues with gathering, because you're gathering sets of columns of the same type together, but it still requires re-coercing Factor into a factor afterwards because of the different factor levels present when gathering (see the warning message).
What I had trouble grasping was how the gathered columns would "move" while keeping some pattern with the id and gender columns. Doing a series of gather calls doesn't keep the pattern you want, but you can do each gather call and join them back together.
Here's one:
library(tidyverse)
data %>%
select(id, gender, starts_with("time")) %>%
gather(key = key_time, value = Time, starts_with("time"))
#> # A tibble: 9 x 4
#> id gender key_time Time
#> <dbl> <fct> <chr> <dttm>
#> 1 1 Male time1 2014-03-03 20:19:42
#> 2 2 Female time1 2014-03-03 21:53:17
#> 3 3 Female time1 2014-02-21 12:13:06
#> 4 1 Male time2 2014-05-28 15:26:49
#> 5 2 Female time2 NA
#> 6 3 Female time2 2014-05-24 10:53:01
#> 7 1 Male time3 NA
#> 8 2 Female time3 2014-09-26 00:52:40
#> 9 3 Female time3 2014-09-27 07:08:47
To do all of these, you can map over the prefixes ("time", "factor", and "integer") and reduce-join the results together. The trick is that you need a unique identifier for each row in order to join properly; for this, I add a column with row_number(), use it as a joining column, then drop it.
map(c("time", "factor", "integer"), function(p) {
val_name <- str_to_title(p)
data %>%
select(id, gender, starts_with(p)) %>%
gather(key = key, value = !!val_name, starts_with(p)) %>%
select(-key) %>%
mutate(row = row_number())
}) %>%
reduce(left_join) %>%
select(-row)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Joining, by = c("id", "gender", "row")
#> Joining, by = c("id", "gender", "row")
#> # A tibble: 9 x 5
#> id gender Time Factor Integer
#> <dbl> <fct> <dttm> <chr> <dbl>
#> 1 1 Male 2014-03-03 20:19:42 A 1
#> 2 2 Female 2014-03-03 21:53:17 B 3
#> 3 3 Female 2014-02-21 12:13:06 C 2
#> 4 1 Male 2014-05-28 15:26:49 B 1
#> 5 2 Female NA <NA> NA
#> 6 3 Female 2014-05-24 10:53:01 C 4
#> 7 1 Male NA <NA> NA
#> 8 2 Female 2014-09-26 00:52:40 A 5
#> 9 3 Female 2014-09-27 07:08:47 B 2
It's a little ugly, and won't fit well in a piped workflow already underway, but you could easily enough wrap it in a function:
gather_by_prefix <- function(.data, prefix) {
  map(prefix, function(p) {
    val_name <- str_to_title(p)
    .data %>%
      select(id, gender, starts_with(p)) %>%
      gather(key = key, value = !!val_name, starts_with(p)) %>%
      select(-key) %>%
      mutate(row = row_number())
  }) %>%
    reduce(left_join) %>%
    select(-row)
}
Calling it like so gets the same output as above:
data %>%
gather_by_prefix(c("time", "factor", "integer"))
As for keeping factor levels, I think unfortunately you'll need to coerce it back afterwards. There are other questions on possible ways around it; here's one.
It's also worth noting that the tidyr github has several issues filed on work being done to implement a multi_gather-type of function, likely for use cases like yours. Not sure if those would cover factor conversion.
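Since this question was asked, that functionality has landed in tidyr: pivot_longer() can gather several sets of columns in one go via the ".value" sentinel, keeping each set's type (the factor levels are combined across the gathered columns). A sketch of the newer approach:
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(
    cols = -c(id, gender),
    names_to = c(".value", "wave"),    # "time1" -> column "time", wave "1"
    names_pattern = "([a-z]+)(\\d+)"
  )
# id, gender and wave are kept; time stays POSIXct, factor stays a factor,
# integer stays numeric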

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).
My data is similar as the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the dplyr package, but that only takes the years into account and not the different variables:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
  group_by(Year) %>%
  summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
  gather(var, count, -Year) %>%
  spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Just to let the OP know, since they have ~200 explanatory variables to select: you can use other selection options of summarise_at. If the columns are ordered consecutively in the data, you can simply use first:last selection, for example:
data %>%
group_by(Year) %>%
summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
group_by(Year) %>%
summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
group_by(Year) %>%
summarise_at(vars, ~sum(!is.na(.)))
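In current dplyr, summarise_at() has been superseded by across(); an equivalent sketch:
data %>%
  group_by(Year) %>%
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x))), .groups = "drop")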
data %>%
  gather(cat, val, -(1:3)) %>%
  filter(complete.cases(.)) %>%
  group_by(Year, cat) %>%
  summarize(n = n()) %>%
  spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
Should do it. You start by stacking the data into long format, then simply calculate n for each combination of year and explanatory variable. If you want the data back in wide format, use spread; even without the spread step you already get the counts for both variables.
Using base R:
do.call(cbind,by(data[3:5], data$Year,function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(.~Year,data[3:5],function(x) sum(!is.na(x)),na.action = function(x)x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
               ExplanatoryVariable2 = data$ExplanatoryVariable2),
          list(Year = data$Year),
          function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
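If you also want the transposed layout from the question (variables as rows, years as columns), the aggregate result can be flipped; a small base R sketch:
agg <- aggregate(cbind(ExplanatoryVariable1, ExplanatoryVariable2) ~ Year, data,
                 function(x) sum(!is.na(x)), na.action = na.pass)
out <- t(agg[-1])        # drop Year, transpose so variables become rows
colnames(out) <- agg$Year
out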
