Why does case_when() evaluate a case whose condition is false? - r

I have a data.frame with a group variable and an integer variable, with missing data.
df<-data.frame(group=c(1,1,2,2,3,3),a=as.integer(c(1,2,NA,NA,1,NA)))
I want to compute the maximum available value of variable a within each group: in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
df %>% group_by(group) %>% mutate(max.a=case_when(sum(!is.na(a))==0 ~ NA_integer_,
T ~ max(a,na.rm=T)))
The above code generates an error, seemingly because in group 2 all values of a are missing, so max(a,na.rm=T) returns -Inf, which is not an integer.
Why is this case computed for group 2 even though the condition is false, as the following check confirms?
df %>% group_by(group) %>% mutate(test=sum(!is.na(a))==0)
I found a workaround by converting a to double, but I still get a warning and am dissatisfied not to have found a better solution.

case_when() evaluates every right-hand side, regardless of whether its condition is satisfied, hence the error. You may use hablar::max_(), which returns NA if all the values are NA.
library(dplyr)
df %>%
group_by(group) %>%
mutate(max.a= hablar::max_(a)) %>%
ungroup
# group a max.a
# <dbl> <int> <int>
#1 1 1 2
#2 1 2 2
#3 2 NA NA
#4 2 NA NA
#5 3 1 1
#6 3 NA 1
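
To see where the type clash comes from, here is a minimal base-R sketch (no packages assumed; the variable names are made up for illustration). It shows that max() on an all-missing integer vector returns a double -Inf, which is the value that cannot be combined with NA_integer_.

```r
# All values missing, as in group 2 of the question
all_missing <- as.integer(c(NA, NA))

# With na.rm = TRUE nothing is left, so max() warns and returns -Inf
res <- suppressWarnings(max(all_missing, na.rm = TRUE))

is.integer(res)   # FALSE: -Inf is a double
is.infinite(res)  # TRUE

# With at least one non-missing value the result stays integer
is.integer(max(as.integer(c(1, NA)), na.rm = TRUE))  # TRUE
```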

Instead of case_when(), I would suggest using an if () statement, like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
group_by(group) %>%
mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = T))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group a max.a
#> <dbl> <int> <dbl>
#> 1 1 1 2
#> 2 1 2 2
#> 3 2 NA NA
#> 4 2 NA NA
#> 5 3 1 1
#> 6 3 NA 1
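
The same if/else idea can be wrapped in a small helper and applied per group in base R with ave(). This is only a sketch: safe_max is a hypothetical name, not a function from dplyr or hablar, and it returns NA_integer_ on the assumption that the input is an integer vector, so the result stays integer.

```r
df <- data.frame(group = c(1, 1, 2, 2, 3, 3),
                 a = as.integer(c(1, 2, NA, NA, 1, NA)))

# Hypothetical helper: max() is only called when there is something to take the max of
safe_max <- function(x) {
  if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)
}

# ave() applies the helper within each group and recycles the result to every row
df$max.a <- ave(df$a, df$group, FUN = safe_max)
df$max.a
```

Because if () evaluates only the branch it takes, max() never sees the all-missing group, so no warning is raised.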

This code gives a warning, but it works. Note that group 2 comes back as -Inf rather than NA.
library(dplyr)
df %>%
group_by(group) %>%
dplyr::summarise(max.a = max(a, na.rm=TRUE))
Output:
group max.a
<dbl> <dbl>
1 1 2
2 2 -Inf
3 3 1
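
If NA is wanted instead of -Inf, one option is to replace the infinite values after the fact. The following base-R sketch assumes the same df as in the question and silences the warning explicitly; max_by_group is a name chosen here for illustration.

```r
df <- data.frame(group = c(1, 1, 2, 2, 3, 3),
                 a = as.integer(c(1, 2, NA, NA, 1, NA)))

# Per-group max; the all-missing group yields -Inf (warning silenced here)
max_by_group <- sapply(split(df$a, df$group),
                       function(x) suppressWarnings(max(x, na.rm = TRUE)))

# Replace the infinite placeholder with NA
max_by_group[is.infinite(max_by_group)] <- NA
max_by_group
```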

Related

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should work also when the dataframe is bigger and contains more ids but in the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(everything(), max, na.rm = TRUE))
id A B C
1 1 3 5 2
If an id has multiple non-missing values in a column, max may not be what you want; you can use sum instead.
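
A small base-R sketch of how the two choices differ, using aggregate() rather than dplyr. The toy data frame df2 is made up for illustration and is not the asker's data.

```r
# Toy data: id 1 has two non-missing values in column A
df2 <- data.frame(id = c(1, 1, 1),
                  A  = c(3, NA, 4),
                  B  = c(NA, 5, NA))

by_max <- aggregate(df2[-1], by = list(id = df2$id), FUN = max, na.rm = TRUE)
by_sum <- aggregate(df2[-1], by = list(id = df2$id), FUN = sum, na.rm = TRUE)

by_max$A  # 4: only the largest value survives
by_sum$A  # 7: both values are added
```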
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, please check the code below:
library(dplyr)
library(tidyr)
data.frame(id,A,B,C) %>% group_by(id) %>% fill(c(A,B,C), .direction = 'downup') %>%
slice_head(n=1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2
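
Filling down and up and then keeping the first row amounts to taking the first non-missing value of each column per id. A base-R sketch of that idea follows; first_non_na is a hypothetical helper, and the extra id 2 row is added here only to show the all-missing case.

```r
df <- data.frame(id = c(1, 1, 1, 2),
                 A = c(3, NA, NA, NA),
                 B = c(NA, 5, NA, 7))

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

res <- aggregate(df[-1], by = list(id = df$id), FUN = first_non_na)
res
```

Unlike max with na.rm = TRUE, this leaves NA (not -Inf) where a column has no value for an id.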

fill NA values per group based on first value of a group

I am trying to fill NA values of my dataframe. However, I would like to fill them based on the first value of each group.
#> df = data.frame(
group = c(rep("A", 4), rep("B", 4)),
val = c(1, 2, NA, NA, 4, 3, NA, NA)
)
#> df
group val
1 A 1
2 A 2
3 A NA
4 A NA
5 B 4
6 B 3
7 B NA
8 B NA
#> fill(df, val, .direction = "down")
group val
1 A 1
2 A 2
3 A 2 # -> should be 1
4 A 2 # -> should be 1
5 B 4
6 B 3
7 B 3 # -> should be 4
8 B 3 # -> should be 4
Can I do this with tidyr::fill()? Or is there another (more or less elegant) way to do this? I need to use this in a longer chain (%>%) operation.
Thank you very much!
Use tidyr::replace_na() and dplyr::first() (or val[[1]]) inside a grouped mutate():
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val))) %>%
ungroup()
#> # A tibble: 8 × 2
#> group val
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 1
#> 4 A 1
#> 5 B 4
#> 6 B 3
#> 7 B 4
#> 8 B 4
PS - @richarddmorey points out the case where the first value for a group is NA. The above code would keep all NA values as NA. If you'd like to instead replace with the first non-missing value per group, you could subset the vector using !is.na():
df %>%
group_by(group) %>%
mutate(val = replace_na(val, first(val[!is.na(val)]))) %>%
ungroup()
Created on 2022-11-17 with reprex v2.0.2
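
For comparison, the "first non-missing value per group" variant can be sketched in base R with ave(). The helper name fill_with_first and the toy data (where group A starts with NA) are assumptions made for this illustration.

```r
df <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
                 val = c(NA, 2, NA, 5, 4, NA))

# Hypothetical helper: fill NAs with the first non-missing value of the vector
fill_with_first <- function(v) {
  v[is.na(v)] <- v[!is.na(v)][1]
  v
}

df$val <- ave(df$val, df$group, FUN = fill_with_first)
df$val
```

A group with no non-missing value at all stays NA, because v[!is.na(v)][1] is then NA.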
This should work; it uses dplyr's case_when():
library(dplyr)
df %>%
group_by(group) %>%
mutate(val = case_when(
is.na(val) ~ val[1],
TRUE ~ val
))
Output:
group val
<chr> <dbl>
1 A 1
2 A 2
3 A 1
4 A 1
5 B 4
6 B 3
7 B 4
8 B 4

Including missing values in summarise output

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9))
I then need to summarize by all ids, where id is a row and there is a column for the first seq_num and second one. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the answers in this answer, but they are not working.
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
outputs
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.
How can I do this?
If there is at most one observation per 'seq_num' for each 'id', indexing with [1] coerces the result to NA where there are no cases:
library(dplyr)
dat %>%
group_by(id) %>%
summarise(seq_0_time = time[seq_num ==0][1],
seq_1_time = time[seq_num == 1][1], .groups = 'drop')
-output
# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA
This works because indexing a length-0 vector with [1] returns NA, so the result has length 1 instead of 0. Similarly, indexing with 1:2, 1:3, etc. returns one NA for each index that does not exist:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA
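
The "at most one observation" condition matters because [1] also truncates. A short sketch with made-up vectors shows both effects: a missing match becomes NA, and a second match is silently dropped.

```r
time    <- c(4, 5, 9)
seq_num <- c(0, 0, 1)

matches <- time[seq_num == 0]
length(matches)          # 2 matches for seq_num 0
matches[1]               # 4; the second match (5) is dropped

no_match <- time[seq_num == 2][1]
no_match                 # NA, because nothing matched
```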
This can actually be pretty easily solved using reshape.
> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
id time.0 time.1 time.2
1 1 4 5 NA
3 2 6 7 8
6 3 9 NA NA
My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9)) %>%
complete(seq_num = seq_num,
id = id)
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#> id seq_0_time seq_1_time
#> <dbl> <dbl> <dbl>
#> 1 1 4 5
#> 2 2 6 7
#> 3 3 9 NA
Created on 2022-04-20 by the reprex package (v2.0.1)
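
The effect of complete() can be approximated in base R by building every id/seq_num combination with expand.grid() and left-joining the data onto it. This is a sketch of the idea, not a drop-in replacement for tidyr; grid and full are names chosen here.

```r
dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9))

# Every combination of id and seq_num that appears in the data
grid <- expand.grid(id = unique(dat$id), seq_num = unique(dat$seq_num))

# Left join: combinations without an observation get time = NA
full <- merge(grid, dat, all.x = TRUE)

full$time[full$id == 3 & full$seq_num == 1]  # NA
```

After this step every id has exactly one row per seq_num, so time[seq_num == 1] always has length 1.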

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lead and lag to get the next and previous values, respectively, and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could create your own rleid function, then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
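
To make the grouping step concrete, here is what the run-length helper returns for the Type column of the question. The cumsum() variant is an alternative added here for comparison, not part of either answer.

```r
Type <- c(1, 2, 2, 1, 1, 2)

# Same logic as my_func above: one id per run of equal values
run_id <- function(x) {
  len <- rle(x)$lengths
  rep(seq_along(len), times = len)
}

run_id(Type)                        # 1 2 2 3 3 4

# Alternative: start a new run whenever the value changes
cumsum(c(TRUE, diff(Type) != 0))    # 1 2 2 3 3 4
```

Rows 2 and 3 share run 2, which is why first(IDabove) and last(IDbelow) spread the neighbouring Type 1 IDs across the whole run.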

Moving rows to columns in R using identifier

I have a dataset in R with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move those that share an identifier from rows into columns. There are currently 600 rows, but there should be 400.
Can anyone share R code that might do this? I am new to R, and have tried the reshape (cast) programme, but I can't really follow it, and am not sure it's exactly what I'm trying to do.
Any help gratefully appreciated.
UPDATE:
Current
ID Age Sex
1 3 1
1 5 1
1 6 1
1 7 1
2 1 2
2 12 2
2 5 2
3 3 1
Expected output
ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4
1 3 1 5 1 6 1 7 1
2 1 2 12 2 5 2
3 3 1
UPDATE 2:
So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:
x <- melt(example, id.vars = "ID")
x$time <- ave(x$ID, x$ID, FUN = seq_along)
example2 <- dcast (x, ID ~ time, value.var = "value")
and here is the output using that code:
ID A B C D E F G H (for clarity I have labelled these)
1 3 5 6 7 1 1 1 1
2 1 12 5 2 2 2
3 3 1
So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?
Here's an approach using gather, spread and unite from the tidyr package:
suppressPackageStartupMessages(library(tidyverse))
x <- tribble(
~ID, ~Age, ~Sex,
1, 3, 1,
1, 5, 1,
1, 6, 1,
1, 7, 1,
2, 1, 2,
2, 12, 2,
2, 5, 2,
3, 3, 1
)
x %>% group_by(ID) %>%
mutate(grp = 1:n()) %>%
gather(var, val, -ID, -grp) %>%
unite("var_grp", var, grp, sep ='') %>%
spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Age2 Age3 Age4 Sex1 Sex2 Sex3 Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 5 6 7 1 1 1 1
#> 2 2 1 12 5 2 2 2
#> 3 3 3 1
If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').
Other questions which might help with this include:
R spreading multiple columns with tidyr
How can I spread repeated measures of multiple variables into wide format?
I have recently come across a similar issue in my data, and wanted to provide an update using the tidyr 1.0 functions, as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so I hope this updated solution is useful to people.
library(tidyr)
library(dplyr)
x %>%
group_by(ID) %>%
mutate(grp = 1:n()) %>%
pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
unite("var_grp", var, grp, sep = "") %>%
pivot_wider(names_from = var_grp, values_from = val)
#> # A tibble: 3 x 9
#> # Groups: ID [3]
#> ID Age1 Sex1 Age2 Sex2 Age3 Sex3 Age4 Sex4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 5 1 6 1 7 1
#> 2 2 1 2 12 2 5 2 NA NA
#> 3 3 3 1 NA NA NA NA NA NA
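
For readers without tidyr, the same within-ID counter idea can be sketched with base R's reshape(). This assumes a plain data.frame rather than a tibble, and the column names it produces (Age.1, Sex.1, ...) use reshape()'s default "." separator rather than the names in the answers above.

```r
x <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 2, 3),
                Age = c(3, 5, 6, 7, 1, 12, 5, 3),
                Sex = c(1, 1, 1, 1, 2, 2, 2, 1))

# Within-ID counter, playing the role of grp in the answers above
x$grp <- ave(x$ID, x$ID, FUN = seq_along)

# One row per ID, one Age/Sex pair of columns per counter value
wide <- reshape(x, idvar = "ID", timevar = "grp", direction = "wide")
wide
```

The counter is computed per ID on the original rows, so Age and Sex are never mixed in one column, which was the problem with the melt/dcast attempt in the question.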
