Compute row-wise counts in subsets of columns in dplyr

Compute row-wise counts in subsets of columns in dplyr - r

I want to count the number of instances of some text (or factor level) row wise, across a subset of columns using dplyr.
Here's the input:
> input_df
num_col_1 num_col_2 text_col_1 text_col_2
1 1 4 yes yes
2 2 5 no yes
3 3 6 no <NA>
And here's the desired output:
> output_df
num_col_1 num_col_2 text_col_1 text_col_2 sum_yes
1 1 4 yes yes 2
2 2 5 no yes 1
3 3 6 no <NA> 0
In sum_yes we have counted the number of "yes" in that row.
I have tried two methods:
Attempted solution 1:
text_cols = c("text_col_1","text_col_2")
df = input_df %>% mutate(sum_yes = rowSums( select(text_cols) == "yes" ), na.rm = TRUE)
Errors with:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'select_' applied to an object of class "character".
Attempted solution 2:
text_cols = c("text_col_1","text_col_2")
df = input_df %>% select(text_cols) %>% rowsum("yes", na.rm = TRUE)
Errors with:
Error in rowsum.data.frame(., "yes", na.rm = TRUE) :
incorrect length for 'group'

We can use mutate and take sum of number of "yes" for each row.
library(dplyr)
df %>% mutate(sum_yes = rowSums(.[text_cols] == "yes"))
# num_col_1 num_col_2 text_col_1 text_col_2 sum_yes
#* <int> <int> <fct> <fct> <int>
#1 1 4 yes yes 2
#2 2 5 no yes 1
#3 3 6 no <NA> 0
Inspired from this answer.
rowwise with c_across :
df %>%
rowwise() %>%
mutate(sum_yes = sum(c_across(all_of(text_cols)) == "yes"))
do with rowwise
df %>%
rowwise() %>%
do((.) %>% as.data.frame %>%
mutate(sum_yes = sum(.=="yes")))
without do and rowwise
df %>%
select(text_cols) %>%
mutate(sum_yes = rowSums(. == "yes"))
In base R, it is actually more simple
df$sum_yes <- rowSums(df[text_cols] == "yes")

We can also use reduce with map
library(tidyverse)
df %>%
select(text_cols) %>%
map(~ .x == "yes" & !is.na(.x)) %>%
reduce(`+`) %>%
bind_cols(df, sum_yes = .)
# num_col_1 num_col_2 text_col_1 text_col_2 sum_yes
#1 1 4 yes yes 2
#2 2 5 no yes 1
#3 3 6 no <NA> 0

Related

Aggregate string variable using summarise and across function

df_input is the input file, and the ideal output file is df_output.
df_input <- data.frame(id = c(1,2,3,4,4,5,5,5,6,7,8,9,10),
party = c("A","B","C","D","E","F","G","H","I","J","K","L","M"),
winner= c(1,1,1,1,1,1,1,1,1,1,1,1,1))
df_output <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
party = c("A","B","C","D,E","F_G_H","I","J","K","L","M"),
winner_sum = c(1,1,1,2,3,1,1,1,1,1))
Previously the code worked using the "summarise_at" function as follows:
df_output <- df_input %>%
dplyr::group_by_at(.vars = vars(id)) %>%
{left_join(
dplyr::summarise_at(., vars(party), ~ str_c(., collapse = ",")),
dplyr::summarise_at(., vars(winner), funs(sum))
)}
But it no longer works as it seems both "summarise_at" and "funs" has been deprecated.
I am trying to replicate using across with dplyr (1.0.10), but I am getting an error. Here is my attempt:
df_output <- df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=T)) %>%
summarise(across(party, str_c(., collapse = ",")))
I have multiple numeric and character variables,s not just one, as in the example. Thanks a lot.

We don't need across if we need to apply different functions on single columns
library(dplyr)
library(stringr)
df_input %>%
group_by(id) %>%
summarise(party = str_c(party, collapse = ","),
winner_sum = sum(winner))
-output
# A tibble: 10 × 3
id party winner_sum
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 3 C 1
4 4 D,E 2
5 5 F,G,H 3
6 6 I 1
7 7 J 1
8 8 K 1
9 9 L 1
10 10 M 1
If there are multiple 'party', 'winner' columns, loop across them in a single summarise as after the first summarise we have only the summarised column with the group column
df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=TRUE),
across(party, ~ str_c(.x, collapse = ",")), .groups = "drop")
-output
# A tibble: 10 × 3
id winner party
<dbl> <dbl> <chr>
1 1 1 A
2 2 1 B
3 3 1 C
4 4 2 D,E
5 5 3 F,G,H
6 6 1 I
7 7 1 J
8 8 1 K
9 9 1 L
10 10 1 M
NOTE: If the columns have a simplar prefix then use starts_with to select all those columns i.e. across(starts_with("party"), or if there are different column names - across(c(party, othercol), or if the functions applied are based on their type - across(where(is.numeric), sum,, na.rm = TRUE)
df_input %>%
group_by(id) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE),
across(where(is.character), str_c, collapse = ","),
.groups = 'drop')

Fill cells dataframe based on multiple conditions

How to fill cells based on multiple conditions?
There are a lot of players (columns) in this game, but I only included 2 for the sake of this example. I want to loop over a lot of players.
Every row represents a game round.
Conditions:
IF player00[i] score = 0 &
IF lossallowed00[i] = "no"
THEN Fill flag00[i] with "FLAG"
df <-data.frame(
player001 = c(1,0,3),
player002 = c(1,0,5),
lossallowed001 = c("no", "yes", "no"),
lossallowed002 = c("no", "no", "yes"),
flag001 = NA,
flag002 = NA
)
#desired output:
#player001 player002 lossallowed001 lossallowed002 flag001 flag002
# 1 1 no no NA NA
# 0 0 yes no NA FLAG
# 3 5 no yes NA NA

If you use a method of reshaping to long format, splitting out the IDs based on the pattern of column names being variables made of letters and IDs being made of numbers, you can do the operation all at once in a couple lines and reshape back to wide. Using regex means you're not bound by either the number of players or the names of columns. I added an ID column for the games to differentiate rows; you could drop it afterward.
The reshaping itself is covered pretty extensively already (Reshaping multiple sets of measurement columns (wide format) into single columns (long format) for example) but is useful for problems that need to scale like this.
library(dplyr)
df %>%
tibble::rowid_to_column(var = "game") %>%
tidyr::pivot_longer(-game, names_to = c(".value", "num"),
names_pattern = "(^[a-z]+)(\\d+$)") %>%
mutate(flag = ifelse(player == 0 & lossallowed == "no", "FLAG", NA_character_)) %>%
tidyr::pivot_wider(id_cols = game, names_from = num, values_from = player:flag,
names_glue = "{.value}{num}")
#> # A tibble: 3 × 7
#> game player001 player002 lossallowed001 lossallowed002 flag001 flag002
#> <int> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 1 1 no no <NA> <NA>
#> 2 2 0 0 yes no <NA> FLAG
#> 3 3 3 5 no yes <NA> <NA>

A possible solution:
library(tidyverse)
df <-data.frame(player001 = c(1,0,3), player002 = c(1,0,5),lossallowed001 = c("no", "yes", "no"), loseallowed002 = c("no", "no", "yes"),flag001 = NA, flag002 = NA)
df %>%
rownames_to_column("id") %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(cols = -id) %>%
group_by(str_extract(name, "\\d{3}$"), id) %>%
mutate(value = if_else(row_number() == 3 & first(value) == "0" &
nth(value, 2) == "no", "FLAG", value)) %>%
ungroup %>% select(name, value) %>%
pivot_wider(names_from = name, values_from = value, values_fn = list) %>%
unnest(cols = everything()) %>% type.convert(as.is = TRUE)
#> # A tibble: 3 × 6
#> player001 player002 lossallowed001 loseallowed002 flag001 flag002
#> <int> <int> <chr> <chr> <lgl> <chr>
#> 1 1 1 no no NA <NA>
#> 2 0 0 yes no NA FLAG
#> 3 3 5 no yes NA <NA>

You can do this. First reshape the data, and then add the column. Use bind_cols if you want the data to be merged back.
library(purrr)
library(dplyr)
map(set_names(paste0("00", 1:2)), ~ select(df, ends_with(.x))) %>%
map(., ~ mutate(., newcol = ifelse(.[[1]] == 0 & .[[2]] == "no", "FLAG", NA)))
$`001`
player001 lossallowed001 flag001 newcol
1 1 no NA NA
2 0 yes NA NA
3 3 no NA NA
$`002`
player002 loseallowed002 flag002 newcol
1 1 no NA <NA>
2 0 no NA FLAG
3 5 yes NA <NA>

Here's a solution in the tidyverse. While I arrived at this solution independently, this is likely a duplicate of #camille's solution here, which was posted shortly before mine.
library(tidyverse)
# ...
# Code to generate 'df'.
# ...
df %>%
# Index the matches.
mutate(match_id = row_number()) %>%
# Pivot to get a row for each player {001, 002, ...} and match.
pivot_longer(
# Target columns whose names end with a separate suffix of 3+ digits.
matches("^(.*\\D)(\\d{3,})$"),
names_pattern = "^(.*\\D)(\\d{3,})$",
# Index the players by their suffixes; and give each the following three columns:
# 'player' (score), 'lossallowed', and 'flag'.
names_to = c(".value", "player_id")
) %>%
# Flag the appropriate cases.
mutate(
flag = if_else(player == 0 & lossallowed == "no", "FLAG", NA_character_)
) %>%
# Return to original, wide format.
pivot_wider(
names_from = player_id,
values_from = !c(match_id, player_id),
names_glue = "{.value}{player_id}"
) %>%
arrange(match_id) %>% select(!match_id)

mutate if feature exists else NA_real_

A dataframe:
exdf <- data.frame(
a = 1:3,
b = c(2,2,2)
)
Sometimes b is present, in which case one can do this:
exdf %>% mutate(c = a / b)
But, sometimes feature b will not be present, in which case:
exdf %>% select(-b) %>% mutate(c = a / b)
Error: Problem with `mutate()` input `c`.
x object 'b' not found
ℹ Input `c` is `a/b`.
I want to tell dplyr to try the mutation, else if something goes wrong just make new feature c all NA_real_ as opposed to a / b.
Can this be done?

We can use a condition with if/else on exists
library(dplyr)
exdf %>%
select(-b) %>%
mutate(c = if(exists('b')) a/b else NA_real_)

Set up a simple if else statement within mutate which checks whether the column name is in the data.frame or not.
> exdf %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 3
# Rowwise:
a b q
<int> <dbl> <dbl>
1 1 2 0.5
2 2 2 1
3 3 2 1.5
> exdf %>%
... dplyr::select(-b) %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 2
# Rowwise:
a q
<int> <dbl>
1 1 NA
2 2 NA
3 3 NA

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number())) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3), fill = "extend") %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?

Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Due to argument partial = TRUE, also the rows that are below the desired window of length three are summed.

Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or since you know the window size before hand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

separate_columns for tidyr

Let's say I had a survey question that read:
What did you eat?
[ ] apple
[ ] pear
[x] banana
[x] grapes
Now, I have the endorsed options as comma-separated strings in one variable.
I wrote myself a little helper to turn this comma-separated list of answers into boolean dummies showing whether each box was checked.
df <- data.frame(
x = 1:5,
ate = c("apple", "apple, pear, banana", "banana, grapes", NA_character_, ""),
stringsAsFactors = FALSE
)
separate_columns <- function(df, col, convert = TRUE, sep = ", ") {
colname <- deparse(substitute(col))
# sorry about this ugly non-rlang approach, hoping not to reuse this
df$.splitcol <- df %>% pull(colname)
separate_rows(df, .splitcol, convert = convert, sep = sep) %>%
mutate(.splitcol = stringr::str_c(colname, "_", .splitcol), value = 1) %>%
mutate(.splitcol = if_else(is.na(.splitcol), stringr::str_c(colname, "_nonresponse"), .splitcol)) %>%
spread(.splitcol, value, fill = 0) %>%
select(-colname)
}
separate_columns(df, ate)
Gets me to this:
x ate_apple ate_banana ate_grapes ate_nonresponse ate_pear
1 1 0 0 0 0
2 1 1 0 0 1
3 0 1 1 0 0
4 0 0 0 1 0
5 0 0 0 1 0
Writing the helper felt clunky, and I feel like I'm missing a more tidyverse way of accomplishing the same transformation (despite lots of searching).
Also, I found no easy way for missings to propagate using this method (I'd prefer if all dummies would be missing if the response was NA, but 0 if it was an empty string). So, I'd rather get this
x ate_apple ate_banana ate_grapes ate_pear
1 1 0 0 0
2 1 1 0 1
3 0 1 1 0
4 NA NA NA NA
5 0 0 0 0
Is there a nicer tidyverse way?

After changing into 'long' format by splitting the 'ate' column by the delimiter ,, create a column of 1 and spread from 'long' to 'wide'
library(tidyverse)
df %>%
separate_rows(ate, sep=", ", convert = TRUE) %>%
mutate(ate = replace(ate, is.na(ate), "NA"),
n = paste(NA ^ (ate == "NA")),
ate = paste0("ate_", replace(ate, ate == "", "nonresponse" ))) %>%
spread(ate, n, fill = "0") %>%
mutate_at(vars(-x, -ate_NA),
funs(replace(as.integer(.), ate_NA=="NA", NA_integer_))) %>%
select(-ate_NA)
# x ate_apple ate_banana ate_grapes ate_nonresponse ate_pear
#1 1 1 0 0 0 0
#2 2 1 1 0 0 1
#3 3 0 1 1 0 0
#4 4 NA NA NA NA NA
#5 5 0 0 0 1 0

I take a different approach, by first extracting want was there to eat and then matching it in the data:
total_eat_list <- map(df$ate, str_split, patter = ",") %>%
unlist() %>%
str_trim() %>%
na.exclude() %>%
unique()
Remove empty strings:
total_eat_list <- total_eat_list[total_eat_list != ""]
total_eat_list
# [1] "apple" "pear" "banana" "grapes"
Now lets map everything in the original data:
map_df(total_eat_list, ~
df %>%
mutate(ate_what = str_c("ate_", .x),
ind = case_when(str_detect(string = df$ate, .x) ~ 1,
!str_detect(string = df$ate, .x) ~ 0,
TRUE ~ NA_real_))) %>%
spread(ate_what, ind) %>%
select(-ate)
# A tibble: 5 x 5
# x ate_apple ate_banana ate_grapes ate_pear
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 0 0
# 2 2 1 1 0 1
# 3 3 0 1 1 0
# 4 4 NA NA NA NA
# 5 5 0 0 0 0
The nice thing is that NAs are infectious for the str_-functions.
As function:
who_ate_what <- function(data, col) {
col <- enquo(col)
col_name <- quo_name(col)
match_list <- data %>%
select(!!col) %>%
map(str_split, patter = ",") %>%
unlist() %>%
str_trim() %>%
na.exclude() %>%
unique()
match_list <- match_list[match_list != ""]
map_df(match_list, ~
data %>%
mutate(matches = str_c(!!col_name, "_", .x),
ind = case_when(str_detect(string = !!col, .x) ~ 1,
!str_detect(string = !!col, .x) ~ 0,
TRUE ~ NA_real_)
)) %>%
spread(matches, ind) %>%
select(-!!col)
}

This is way too verbose I'm sure, but I guess its a start.
library(tidyverse)
df <- data.frame(
x = 1:5,
ate = c("apple", "apple, pear, banana", "banana, grapes", NA_character_, ""),
stringsAsFactors = FALSE
)
df %>%
nest(-x) %>%
mutate(data = map(data, ~str_split(.x$ate, ",") %>% unlist())) %>%
unnest() %>%
group_by(x, data) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(data, n, fill = NA) %>%
select(-`<NA>`) %>%
mutate(rs = rowSums(.[2:ncol(.)],na.rm = TRUE)) %>%
gather(nm, val, -x, -rs) %>%
mutate(val = case_when(
is.na(val) & rs > 0 ~ "0",
is.na(val) & rs == 0 ~ "NA",
!is.na(val) ~ as.character(val)
), val = as.numeric(val)) %>%
spread(nm, val, fill = NA) %>%
select(-rs, -V1)
#> # A tibble: 5 x 6
#> x ` banana` ` grapes` ` pear` apple banana
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 1 0
#> 2 2 1 0 1 1 0
#> 3 3 0 1 0 0 1
#> 4 4 NA NA NA NA NA
#> 5 5 0 0 0 0 0
EDIT
Lets wrap this into a function and take care of the name issue. I adopted the splitting from your original function to make the use of quosures easier.
my_sep_fun <- function(data, col){
col <- enquo(col)
col_name <- quo_name(col)
data %>%
separate_rows(!!col, sep =', ', convert = TRUE) %>%
group_by(x, !!col) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(!!col, n, fill = NA) %>%
select(-`<NA>`) %>%
mutate(rs = rowSums(.[2:ncol(.)],na.rm = TRUE)) %>%
gather(nm, val, -x, -rs) %>%
mutate(val = case_when(
is.na(val) & rs > 0 ~ "0",
is.na(val) & rs == 0 ~ "NA",
!is.na(val) ~ as.character(val)
), val = as.numeric(val)) %>%
spread(nm, val, fill = NA) %>%
select(-rs, -V1) %>%
rename_at(vars(2:ncol(.)), funs(paste0(!!col_name,"_", .)))
}
my_sep_fun(df, ate)
#> # A tibble: 5 x 5
#> x ate_apple ate_banana ate_grapes ate_pear
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 0 0
#> 2 2 1 1 0 1
#> 3 3 0 1 1 0
#> 4 4 NA NA NA NA
#> 5 5 0 0 0 0
Created on 2018-08-20 by the reprex
package (v0.2.0).

One solution, much less verbose, in just three lines. Once you have the dataframe:
First, separate the values in each cell:
df <- separate_rows_(df, 'ate')
Second, dummify every answer using the function dummify from DataExplorer:
df <- DataExplorer::dummify(df, 'ate')
Third, aggregate the redundant rows like that:
df <- aggregate(df[,2:6], by=df$x, FUN= sum)
(you could also apply a max function here since you want to capture all 1's in the columns).
Done!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Compute row-wise counts in subsets of columns in dplyr - r

We can also use reduce with map library(tidyverse) df %>% select(text_cols) %>% map(~ .x == "yes" & !is.na(.x)) %>% reduce(`+`) %>% bind_cols(df, sum_yes = .) # num_col_1 num_col_2 text_col_1 text_col_2 sum_yes #1 1 4 yes yes 2 #2 2 5 no yes 1 #3 3 6 no <NA> 0

Related

Aggregate string variable using summarise and across function

Fill cells dataframe based on multiple conditions

mutate if feature exists else NA_real_

rollsumr with window-length>1: filling missing values

separate_columns for tidyr

Categories

Resources