R dplyr::c_across() strange behaviour in rowSums

I'm trying to work out how to apply rowSums() to specific columns only.
Here is a reprex:
library(dplyr)
df <- tibble(
  "ride" = c("bicycle", "motorcycle", "car", "other"),
  "A" = c(1, NA, 1, NA),
  "B" = c(NA, 2, NA, 2)
)
I can get the desired result by selecting the columns by index, [2:3]:
df %>%
  mutate(total = rowSums(.[2:3], na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 1
2 motorcycle NA 2 2
3 car 1 NA 1
4 other NA 2 2
However, if I try to specify the columns by name, I get strange results:
df %>%
  mutate(total = sum(c_across(c("A":"B")), na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 6
2 motorcycle NA 2 6
3 car 1 NA 6
4 other NA 2 6
What am I doing wrong?
I can achieve what I want with something like this:
df %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  mutate(total = A + B)
but I'd like to specify the column names by passing a vector, so that I can switch to a different combination of columns in future.
Something like this is what I'd like to achieve:
cols_to_sum <- c("A","B")
df %>%
  mutate(total = sum(across(cols_to_sum), na.rm = TRUE))

You may use select to specify the columns you want to sum.
library(dplyr)
cols_to_sum <- c("A","B")
df %>%
  mutate(total = rowSums(select(., all_of(cols_to_sum)), na.rm = TRUE))
# ride A B total
# <chr> <dbl> <dbl> <dbl>
#1 bicycle 1 NA 1
#2 motorcycle NA 2 2
#3 car 1 NA 1
#4 other NA 2 2
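c_across() is designed to be used with rowwise(). Without it, sum() adds up every value in the selected columns at once (1 + 1 + 2 + 2 = 6), which is why every row shows 6. With rowwise():
df %>%
  rowwise() %>%
  mutate(total = sum(c_across(all_of(cols_to_sum)), na.rm = TRUE)) %>%
  ungroup()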
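To make the cause concrete, here is a small sketch, assuming dplyr 1.0.0 or later (where across() is available). It first reproduces the grand total of 6 by hand, then sums row by row without rowwise() by handing rowSums() the data frame that across() returns. The name grand_total is only illustrative.

```r
library(dplyr)

df <- tibble(
  ride = c("bicycle", "motorcycle", "car", "other"),
  A = c(1, NA, 1, NA),
  B = c(NA, 2, NA, 2)
)
cols_to_sum <- c("A", "B")

# Without rowwise(), all values of A and B reach sum() together,
# so there is a single total that gets recycled to every row.
grand_total <- sum(c(df$A, df$B), na.rm = TRUE)  # 1 + 1 + 2 + 2

# across() with only a column selection returns a data frame,
# which rowSums() can sum row by row.
res <- df %>%
  mutate(total = rowSums(across(all_of(cols_to_sum)), na.rm = TRUE))
```

On dplyr 1.1.0 or later, pick(all_of(cols_to_sum)) can be used in place of across() here. Either way, rowSums() is vectorised, so this tends to be faster than rowwise() on larger data.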

Related

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that, and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
  group_by(id) %>%
  # lambda form: passing na.rm through across()'s ... is deprecated since dplyr 1.1.0
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If a group can have more than one non-NA value per column, max may not be what you want; you can use sum instead.
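A small sketch of that difference, using a made-up data frame (df2 is hypothetical, not the asker's data) in which id 1 has two non-NA values in column A:

```r
library(dplyr)

# Hypothetical data: id 1 has two real values in A (3 and 4).
df2 <- data.frame(
  id = c(1, 1, 2),
  A  = c(3, 4, 1),
  B  = c(NA, 5, 2)
)

# max() keeps only the largest value per group and column.
by_max <- df2 %>%
  group_by(id) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))

# sum() adds all non-NA values per group and column.
by_sum <- df2 %>%
  group_by(id) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
```

For id 1, by_max holds 4 in column A while by_sum holds 7, so the choice depends on what a repeated value is supposed to mean.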
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, fill each column within the group and keep the first row (fill() comes from tidyr):
library(dplyr); library(tidyr)
data.frame(id, A, B, C) %>% group_by(id) %>% fill(c(A, B, C), .direction = "downup") %>%
  slice_head(n = 1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2
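Another way to express "keep whichever value is not NA" is to pick the first non-NA entry per column directly. This is only a sketch: first_non_na is a hypothetical helper, not a dplyr function, and it returns NA when a column has no values at all in a group.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1),
  A  = c(3, NA, NA),
  B  = c(NA, 5, NA),
  C  = c(NA, NA, 2)
)

# Hypothetical helper: drop the NAs, then take the first remaining element.
# Indexing an empty vector with [1] yields NA, so all-NA input gives NA.
first_non_na <- function(x) x[!is.na(x)][1]

res <- df %>%
  group_by(id) %>%
  summarise(across(everything(), first_non_na))
```

Unlike max or sum, this does not combine values, so it only suits data where each group has at most one meaningful value per column.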

summarise by group returns 0 instead of NA if all values are NA

library(dplyr)
dat <-
  data.frame(id = rep(c(1, 2, 3, 4), each = 3),
             value = c(NA, NA, NA, 0, 1, 2, 0, 1, NA, 1, 2, 3))
dat %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(value_sum = sum(value, na.rm = TRUE))
# A tibble: 4 x 2
id value_sum
1 0
2 3
3 1
4 6
Is there any way to return NA if all the entries in a group are NA? For example, id 1 has only NA entries, so I want its value_sum to be NA as well:
# A tibble: 4 x 2
id value_sum
1 NA
2 3
3 1
4 6
One way is to use an if/else expression: if all values are NA, return NA; otherwise return sum():
dat %>%
  dplyr::group_by(id) %>%
  summarise(number = if (all(is.na(value))) NA_real_ else sum(value, na.rm = TRUE))
id number
<dbl> <dbl>
1 1 NA
2 2 3
3 3 1
4 4 6
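If the same rule is needed for several columns, the if/else can be wrapped in a small function. The sketch below assumes numeric columns; sum_or_na is a hypothetical name, not an existing function.

```r
library(dplyr)

dat <- data.frame(
  id = rep(c(1, 2, 3, 4), each = 3),
  value = c(NA, NA, NA, 0, 1, 2, 0, 1, NA, 1, 2, 3)
)

# Hypothetical helper: NA when every entry is NA, otherwise the NA-ignoring sum.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# across() lets the same helper be applied to more columns later on;
# .names controls the name of the output column.
res <- dat %>%
  group_by(id) %>%
  summarise(across(value, sum_or_na, .names = "{.col}_sum"))
```

Here res has one row per id, with value_sum equal to NA for id 1 and to 3, 1 and 6 for the others.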
We could use fsum from collapse, which returns NA for a group whose values are all NA:
library(collapse)
fsum(dat$value, g = dat$id)
1 2 3 4
NA 3 1 6
Or with dplyr
library(dplyr)
dat %>%
  group_by(id) %>%
  summarise(number = fsum(value))
# A tibble: 4 × 2
id number
<dbl> <dbl>
1 1 NA
2 2 3
3 3 1
4 4 6

Replace NA values per group with concatenated values from the same column

I wish to achieve the following:
For each Group, when the ID column is NA, then fill the corresponding NA value in Name with the concatenation of the other values of Name while ignoring other NA values in Name
My data frame looks as follows:
x <- data.frame(Group = c("A","A","A","A","B","B"),ID = c(1,2,3,NA,NA,5),Name = c("Bob","Jane",NA,NA,NA,"Tim"))
This is what I wish to achieve:
y <- data.frame(Group = c("A","A","A","A","B","B"),ID = c(1,2,3,NA,NA,5),Name = c("Bob","Jane",NA,"Bob Jane","Tim","Tim"))
If there's a way to achieve this in the tidyverse I would be very grateful for any pointers.
I've tried the following, but it can't find the object 'Name':
x %>% group_by(Group) %>% replace_na(list(Name = paste(unique(.Name))))
We may use a conditional expression with replace:
library(dplyr)
library(stringr)
x %>%
  group_by(Group) %>%
  mutate(Name = replace(Name, is.na(ID),
                        str_c(Name[!is.na(Name)], collapse = ' '))) %>%
  ungroup()
Output:
# A tibble: 6 × 3
Group ID Name
<chr> <dbl> <chr>
1 A 1 Bob
2 A 2 Jane
3 A 3 <NA>
4 A NA Bob Jane
5 B NA Tim
6 B 5 Tim
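To see what replace() does inside that mutate(), here is the same step on plain vectors for group A only, in base R. The vector names id, name and filler are illustrative.

```r
# Group A from the question, as plain vectors.
id   <- c(1, 2, 3, NA)
name <- c("Bob", "Jane", NA, NA)

# Concatenate the non-missing names of the group into one string.
filler <- paste(name[!is.na(name)], collapse = " ")

# replace(x, condition, value): overwrite only the positions where ID is NA.
out <- replace(name, is.na(id), filler)
```

Only the fourth element is overwritten, because it is the only one whose ID is NA. The third element stays NA since its ID is 3.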
Does this work?
library(dplyr)
x %>%
  group_by(Group) %>%
  mutate(Name = case_when(is.na(ID) ~ paste(Name[!is.na(Name)], collapse = ' '),
                          TRUE ~ Name))
# A tibble: 6 x 3
# Groups: Group [2]
Group ID Name
<chr> <dbl> <chr>
1 A 1 Bob
2 A 2 Jane
3 A 3 NA
4 A NA Bob Jane
5 B NA Tim
6 B 5 Tim

Why does case_when() compute false condition?

I have a data.frame with a group variable and an integer variable, with missing data.
df<-data.frame(group=c(1,1,2,2,3,3),a=as.integer(c(1,2,NA,NA,1,NA)))
I want to compute the maximum available value of variable a within each group: in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
df %>%
  group_by(group) %>%
  mutate(max.a = case_when(sum(!is.na(a)) == 0 ~ NA_integer_,
                           T ~ max(a, na.rm = T)))
The above code generates an error, seemingly because in group 2 all values of a are missing so max(a,na.rm=T) is set to -Inf, which is not an integer.
Why is this case computed for group 2 although the condition is false, as the following verification confirms?
df %>% group_by(group) %>% mutate(test=sum(!is.na(a))==0)
I found a workaround by converting a to double, but I still get a warning, and I'm not satisfied with it as a solution.
case_when() evaluates the right-hand side of every condition, whether or not that condition is satisfied, hence the error. You may use hablar::max_, which returns NA if all the values are NA.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(max.a = hablar::max_(a)) %>%
  ungroup()
# group a max.a
# <dbl> <int> <int>
#1 1 1 2
#2 1 2 2
#3 2 NA NA
#4 2 NA NA
#5 3 1 1
#6 3 NA 1
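The eager evaluation can be checked with a right-hand side that records when it runs. This is a minimal sketch: noisy_value and rhs_was_run are hypothetical names used only for the demonstration.

```r
library(dplyr)

rhs_was_run <- FALSE

# Hypothetical helper that records the fact that it was called.
noisy_value <- function() {
  rhs_was_run <<- TRUE
  99L
}

# The first condition already matches, so the result comes from the first
# branch, yet the second right-hand side is still evaluated.
out <- case_when(
  TRUE ~ 1L,
  TRUE ~ noisy_value()
)
```

After this runs, out is 1 but rhs_was_run is TRUE. In the question this means max(a, na.rm = TRUE) is computed for group 2 even though the NA branch is the one that is kept.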
Instead of case_when, I would suggest an if () statement, like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
  group_by(group) %>%
  mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = T))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group a max.a
#> <dbl> <int> <dbl>
#> 1 1 1 2
#> 2 1 2 2
#> 3 2 NA NA
#> 4 2 NA NA
#> 5 3 1 1
#> 6 3 NA 1
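Note that max.a comes out as a double above because of NA_real_. If the column should keep the type of a, one option is a helper that returns a typed NA. This is a sketch only; max_or_na is a hypothetical name.

```r
library(dplyr)

df <- data.frame(group = c(1, 1, 2, 2, 3, 3),
                 a = as.integer(c(1, 2, NA, NA, 1, NA)))

# Hypothetical helper: x[NA_integer_] is a length-one NA of the same type as x,
# so integer input gives NA_integer_ and double input gives NA_real_.
max_or_na <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else max(x, na.rm = TRUE)
}

res <- df %>%
  group_by(group) %>%
  mutate(max.a = max_or_na(a)) %>%
  ungroup()
```

Because max() is never called on the all-NA group, no warning is raised and no -Inf appears.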
This code runs, but it gives a warning and returns -Inf instead of NA for the all-NA group:
library(dplyr)
df %>%
  group_by(group) %>%
  dplyr::summarise(max.a = max(a, na.rm = TRUE))
Output:
group max.a
<dbl> <dbl>
1 1 2
2 2 -Inf
3 3 1

How to use summarise?

Here is the dataframe
df <- data.frame(number = c(1,1,2,2,2,3,3),
heahache = c(1,1,na,na,na,1,na),
pain = c(na,1,1,na,1,na,na),
futigue = c(na,na,1,na,1,1,1))
headache pain futigue
1 1 na na
1 1 1 na
2 na 1 1
2 na na na
2 na 1 1
3 1 na 1
3 na na 1
The first result I want is how many times each symptom appeared, like this:
headache pain futigue
1 2 1 0
2 0 2 2
3 1 0 2
The second result is how many distinct symptoms each person had, like this:
symptoms
1 2
2 2
3 2
Since the real data set has 50+ columns describing different symptoms, is there a way to do this that scales to a large data set? Thank you.
First, tidy your data (note the corrections of typos: na should be NA, heahache should be headache and futigue should be fatigue):
library(tidyverse)
df <- data.frame(number = c(1,1,2,2,2,3,3),
headache = c(1,1,NA,NA,NA,1,NA),
pain = c(NA,1,1,NA,1,NA,NA),
fatigue = c(NA,NA,1,NA,1,1,1))
longDF <- df %>%
  pivot_longer(
    cols=c(headache, pain, fatigue),
    names_to="Symptom",
    values_to="Present"
  ) %>%
  replace_na(list(Present=0))
Then to count appearances:
longDF %>%
  group_by(number, Symptom) %>%
  summarise(Count=sum(Present)) %>%
  pivot_wider(
    names_from=Symptom,
    values_from=Count
  )
# A tibble: 3 x 4
# Groups: number [3]
number fatigue headache pain
<dbl> <dbl> <dbl> <dbl>
1 1 0 2 1
2 2 2 0 2
3 3 2 1 0
and the number of symptoms experienced by each number:
longDF %>%
  filter(Present == 1) %>%
  group_by(number) %>%
  summarise(symptoms=length(unique(Symptom)))
# A tibble: 3 x 2
number symptoms
* <dbl> <int>
1 1 2
2 2 2
3 3 2
Note that this final calculation will omit numbers who do not experience any symptoms. Including them requires a little more work. To show the problem, add a number who experienced no symptoms:
newDF <- longDF %>%
  add_row(number=4, Symptom="headache", Present=0) %>%
  add_row(number=4, Symptom="fatigue", Present=0) %>%
  add_row(number=4, Symptom="pain", Present=0)
Demonstrate the problem:
newDF %>%
  filter(Present == 1) %>%
  group_by(number) %>%
  summarise(symptoms=length(unique(Symptom)))
# A tibble: 3 x 2
number symptoms
* <dbl> <int>
1 1 2
2 2 2
3 3 2
And solve it:
newDF %>%
  filter(Present == 1) %>%
  group_by(number) %>%
  summarise(symptoms=length(unique(Symptom))) %>%
  right_join(newDF %>% distinct(number), by="number") %>%
  replace_na(list(symptoms=0))
# A tibble: 4 x 2
number symptoms
<dbl> <dbl>
1 1 2
2 2 2
3 3 2
4 4 0
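A possible shortcut, sketched under the same long layout: subsetting inside summarise() instead of calling filter() first keeps every number in the result, so no join is needed. The data frame is rebuilt here (as long_df, an illustrative name) so the block runs on its own, with number 4 as the symptom-free case.

```r
library(dplyr)
library(tidyr)

df <- data.frame(number   = c(1, 1, 2, 2, 2, 3, 3, 4),
                 headache = c(1, 1, NA, NA, NA, 1, NA, NA),
                 pain     = c(NA, 1, 1, NA, 1, NA, NA, NA),
                 fatigue  = c(NA, NA, 1, NA, 1, 1, 1, NA))

long_df <- df %>%
  pivot_longer(-number, names_to = "Symptom", values_to = "Present") %>%
  replace_na(list(Present = 0))

# Counting distinct symptoms among the rows where Present == 1 gives 0
# for a group with no such rows, so number 4 is not dropped.
counts <- long_df %>%
  group_by(number) %>%
  summarise(symptoms = n_distinct(Symptom[Present == 1]))
```

The result has four rows, with symptoms equal to 2 for numbers 1 to 3 and 0 for number 4.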
We can just use summarise from dplyr, which doesn't need any additional packages. For a larger dataset, reshaping could be costly. I would recommend summarising first and then using rowSums (vectorized and efficient) to create the 'Symptoms' column.
library(dplyr)
df %>%
  group_by(number) %>%
  summarise(across(everything(), ~ sum(!is.na(.))))
Output:
# A tibble: 3 x 4
number headache pain fatigue
* <dbl> <int> <int> <int>
1 1 2 1 0
2 2 0 2 2
3 3 1 0 2
If we need the Symptoms column:
df %>%
  group_by(number) %>%
  summarise(across(everything(), ~ sum(!is.na(.)))) %>%
  mutate(Symptoms = rowSums(.[-1] > 0))
# A tibble: 3 x 5
# number headache pain fatigue Symptoms
#* <dbl> <int> <int> <int> <dbl>
#1 1 2 1 0 2
#2 2 0 2 2 2
#3 3 1 0 2 2
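For comparison, the same two results can be sketched in base R with rowsum(), which sums the rows of a matrix by group. This assumes, as above, that every column except number is a symptom column; the names present, counts and n_symptoms are illustrative.

```r
df <- data.frame(number   = c(1, 1, 2, 2, 2, 3, 3),
                 headache = c(1, 1, NA, NA, NA, 1, NA),
                 pain     = c(NA, 1, 1, NA, 1, NA, NA),
                 fatigue  = c(NA, NA, 1, NA, 1, 1, 1))

# 1 where a symptom was recorded, 0 where it was NA (an integer matrix).
present <- !is.na(as.matrix(df[-1])) * 1L

# Sum the rows of the matrix within each number: one row per person.
counts <- rowsum(present, group = df$number)

# Number of distinct symptoms per person: columns with a count above zero.
n_symptoms <- rowSums(counts > 0)
```

counts is a matrix with one row per number and one column per symptom, so it handles 50+ symptom columns without listing them by name.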
data
df <- structure(list(number = c(1, 1, 2, 2, 2, 3, 3), headache = c(1,
1, NA, NA, NA, 1, NA), pain = c(NA, 1, 1, NA, 1, NA, NA), fatigue = c(NA,
NA, 1, NA, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-7L))
