Function breaks when looped within dplyr::case_when() - r

I have a function that extracts the minimum or maximum of a range of values (within a character string), and it appears to work fine on individual cases.
However, when I try to use it within case_when() it does not behave as expected.
Reproducible example
library(dplyr)
library(tibble)
library(stringr)
val_from_range <- function(.str, .fun = "min") {
  str_extract_all(.str, "\\d*\\.?\\d+") |>
    unlist() |>
    as.numeric() |>
    (\(x) if (.fun == "min") x |> min()
          else if (.fun == "max") x |> max())()
}
tibble(x = c("5-6", "4", "6-9", "5", "NA")) |>
  mutate(min = case_when(str_detect(x, "-") ~ val_from_range(x, "min"))) |>
  mutate(max = case_when(str_detect(x, "-") ~ val_from_range(x, "max")))
# A tibble: 5 x 3
  x       min   max
  <chr> <dbl> <dbl>
1 5-6       4     9
2 4        NA    NA
3 6-9       4     9
4 5        NA    NA
5 NA       NA    NA
However, I want:
# A tibble: 5 x 3
  x       min   max
  <chr> <dbl> <dbl>
1 5-6       5     6
2 4        NA    NA
3 6-9       6     9
4 5        NA    NA
5 NA       NA    NA
The function performs as expected on individual cases:
> val_from_range("5-6", "min")
[1] 5
> val_from_range("5-6", "max")
[1] 6
> val_from_range("5-6-8-10", "max")
[1] 10
Any help would be greatly appreciated. Thanks in advance.

A couple of changes are required. The function works only for one value at a time: if you pass in a vector, it flattens all the values together and returns a single result.
val_from_range("5-6", "min")
#[1] 5
val_from_range(c("5-6", "8-10"), "min")
#[1] 5
To pass them one by one you can use rowwise(). Secondly, case_when() still evaluates the right-hand side for rows that do not satisfy the condition, hence it returns a warning for the "NA" value. We can use if/else here to avoid that.
library(dplyr)
library(stringr)
tibble(x = c("5-6", "4", "6-9", "5", "NA")) %>%
  rowwise() %>%
  mutate(min = if (str_detect(x, "-")) val_from_range(x, "min") else NA,
         max = if (str_detect(x, "-")) val_from_range(x, "max") else NA) %>%
  ungroup()
#  x       min   max
#  <chr> <dbl> <dbl>
#1 5-6       5     6
#2 4        NA    NA
#3 6-9       6     9
#4 5        NA    NA
#5 NA       NA    NA
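Alternatively, the function itself can be vectorised so that rowwise() isn't needed at all. A minimal sketch (val_from_range2 is a hypothetical name, not from the question): str_extract_all() already returns one character vector per input string, so mapping over that list instead of unlist()-ing keeps each input's values separate, and inputs with no digits map to NA instead of raising a warning.
library(dplyr)
library(tibble)
library(stringr)

val_from_range2 <- function(.str, .fun = "min") {
  f <- match.fun(.fun)  # "min" or "max"
  vapply(
    str_extract_all(.str, "\\d*\\.?\\d+"),  # one character vector per input
    function(v) if (length(v)) f(as.numeric(v)) else NA_real_,
    numeric(1)
  )
}

tibble(x = c("5-6", "4", "6-9", "5", "NA")) |>
  mutate(min = if_else(str_detect(x, "-"), val_from_range2(x, "min"), NA_real_),
         max = if_else(str_detect(x, "-"), val_from_range2(x, "max"), NA_real_))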

Related

Including missing values in summarise output

I am trying to keep all ids as rows in a summarise output even when one of the seq_num values does not exist for an id. I have a data frame that looks like this:
dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9))
I then need to summarise across all ids, with one row per id and a column each for the time of the first seq_num and the second one. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the approaches from this answer, but they are not working.
dat %>%
  group_by(id, .drop = FALSE) %>%
  summarise(seq_0_time = time[seq_num == 0],
            seq_1_time = time[seq_num == 1])
outputs
     id seq_0_time seq_1_time
  <dbl>      <dbl>      <dbl>
1     1          4          5
2     2          6          7
I would still like a third row, though, with seq_0_time = 9 and seq_1_time = NA, since it doesn't exist.
How can I do this?
If there is at most one observation per 'seq_num' for each 'id', then it is possible to coerce zero-length subsets to NA by indexing with [1]:
library(dplyr)
dat %>%
  group_by(id) %>%
  summarise(seq_0_time = time[seq_num == 0][1],
            seq_1_time = time[seq_num == 1][1], .groups = 'drop')
-output
# A tibble: 3 × 3
     id seq_0_time seq_1_time
  <dbl>      <dbl>      <dbl>
1     1          4          5
2     2          6          7
3     3          9         NA
The point is that indexing a length-0 vector with [1] returns a length-1 NA. Similarly, this can be used to fill with NAs for indexes 2, 3, etc., by specifying an index that didn't occur:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA
This can actually be pretty easily solved using reshape.
> reshape(dat, timevar = 'seq_num', idvar = 'id', direction = 'wide')
  id time.0 time.1 time.2
1  1      4      5     NA
3  2      6      7      8
6  3      9     NA     NA
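For reference, the tidyr analogue of that reshape() call would be something along these lines (a sketch; the column naming is controlled by names_prefix):
library(tidyr)
pivot_wider(dat, id_cols = id, names_from = seq_num,
            values_from = time, names_prefix = "time.")
which yields the same wide table, with NA for the id/seq_num combinations that have no row.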
My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9)) %>%
  complete(seq_num = seq_num,
           id = id)
dat %>%
  group_by(id, .drop = FALSE) %>%
  summarise(seq_0_time = time[seq_num == 0],
            seq_1_time = time[seq_num == 1])
#> # A tibble: 3 x 3
#>      id seq_0_time seq_1_time
#>   <dbl>      <dbl>      <dbl>
#> 1     1          4          5
#> 2     2          6          7
#> 3     3          9         NA
Created on 2022-04-20 by the reprex package (v2.0.1)
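To see why this works: after complete(), dat contains every seq_num/id combination, with NA time for combinations that had no row, so the subsets inside summarise() are never length zero. Roughly:
dat
#> seq_num id time
#>       0  1    4
#>       0  2    6
#>       0  3    9
#>       1  1    5
#>       1  2    7
#>       1  3   NA
#>       2  1   NA
#>       2  2    8
#>       2  3   NA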

compare sets of columns in R dataframe and keep one value from each set of two columns

Basically, I have a large dataset with many different variables. The data comes in pairs of columns (2019 and 2020); for some variables data is available in neither year, for some only in 2019, and for some only in 2020. I would like the 2020 data to 'override' the 2019 data whenever it is available. If no data is available for either year, the value should stay missing. I currently do this with a little helper function, but it should be more scalable, so that I can do it for 200+ column pairs. What am I missing in mutate(across(...))?
# Create data
mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(T, F, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, F))
# create little helper function. this is good because
# it could be made more complex in the future,
# for example for numeric variables keeping the larger of the two
which_to_keep_f <- function(x, y) {
  if (is.na(x) && is.na(y)) {
    output <- NA
  }
  if (is.na(x) && !is.na(y)) {
    output <- y
  }
  if (!is.na(x) && is.na(y)) {
    output <- x
  }
  if (!is.na(x) && !is.na(y)) {
    output <- y
  }
  output
}
# vectorize it
which_to_keep_f_vec <- Vectorize(which_to_keep_f)
# use function inside mutate
mydf %>%
  mutate(var1 = which_to_keep_f_vec(var1_2019, var1_2020)) %>%
  mutate(var2 = which_to_keep_f_vec(var2_2019, var2_2020)) %>%
  mutate(var3 = which_to_keep_f_vec(var3_2019, var3_2020)) %>%
  select(-contains("_20"))
Solution
Thanks to TarJae and micahkimel I got to 99% of the solution. This is the complete solution, including dropping the variables that are no longer needed and renaming the variables to their desired format:
mydf %>%
  mutate(across(ends_with('_2019'),
                ~ which_to_keep_f_vec(.,
                    get(stringr::str_replace(cur_column(), "_2019$", "_2020")))) %>%
           tidyr::unnest(cols = c())) %>%
  select(-contains("_2020")) %>%
  rename_all(~ stringr::str_replace(., stringr::regex("_2019$", ignore_case = TRUE), ""))
Update: thanks to micahkimel, list() was removed so that the data is not duplicated.
Is this what you are looking for? Here we apply your function to each pair of columns:
library(dplyr)
library(stringr)
library(tidyr)  # for unnest()
mydf %>%
  mutate(across(ends_with('_2019'),
                ~ which_to_keep_f_vec(.,
                    get(str_replace(cur_column(), "_2019$", "_2020")))) %>%
           unnest(cols = c()))
     ID var1_2019 var1_2020 var2_2019 var2_2020 var3_2019 var3_2020
  <int>     <dbl>     <dbl> <chr>     <chr>     <lgl>     <lgl>
1     1         9        NA A         NA        TRUE      NA
2     2        NA        NA B         B         FALSE     NA
3     3         3         3 NA        NA        NA        NA
4     4         2         2 R         R         NA        NA
5     5         4         4 C         NA        FALSE     FALSE
Here's an approach that results in just one variable for each pair of variables in your input table. First, use pivot_longer() to collapse the pairs into single variables, and add year as a column (with twice as many observations).
mydf_long <- mydf %>%
  pivot_longer(cols = matches("_20"), names_to = c(".value", "year"),
               names_sep = "_")
      ID year   var1 var2  var3
   <int> <chr> <dbl> <chr> <lgl>
 1     1 2019      9 A     TRUE
 2     1 2020     NA NA    NA
 3     2 2019     NA B     FALSE
 4     2 2020     NA B     NA
 5     3 2019      3 NA    NA
 6     3 2020      3 NA    NA
 7     4 2019      2 D     NA
 8     4 2020      2 R     NA
 9     5 2019     NA C     NA
10     5 2020      4 NA    FALSE
Next, use fill() to populate later NA values with earlier non-missing values. Then we can just filter to the most recent year (2020). For each variable, that year will have its own value if it had one before; otherwise, it will carry over the value from the previous year.
mydf_long %>%
  group_by(ID) %>%
  fill(var1, var2, var3) %>%
  filter(year == 2020)
     ID year   var1 var2  var3
  <int> <chr> <dbl> <chr> <lgl>
1     1 2020      9 A     TRUE
2     2 2020     NA B     FALSE
3     3 2020      3 NA    NA
4     4 2020      2 R     NA
5     5 2020      4 C     FALSE
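Since the helper simply prefers the 2020 value whenever it is non-missing, dplyr::coalesce() can replace it entirely. A sketch, under the same assumption that every *_2019 column has a *_2020 partner:
library(dplyr)
library(stringr)

mydf %>%
  mutate(across(ends_with("_2019"),
                ~ coalesce(get(str_replace(cur_column(), "_2019$", "_2020")), .))) %>%
  select(-ends_with("_2020")) %>%
  rename_with(~ str_remove(., "_2019$"))
coalesce(y, x) returns y where y is non-missing and x otherwise, which matches all four branches of which_to_keep_f().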

Why does case_when() compute the false condition's branch?

I have a data.frame with a group variable and an integer variable, with missing data.
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
I want to compute the maximum available value of variable a within each group: in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
df %>%
  group_by(group) %>%
  mutate(max.a = case_when(sum(!is.na(a)) == 0 ~ NA_integer_,
                           TRUE ~ max(a, na.rm = TRUE)))
The above code generates an error, seemingly because in group 2 all values of a are missing, so max(a, na.rm = TRUE) returns -Inf, which is not an integer.
Why is this branch computed for group 2 when its condition is false, as the following verification confirms?
df %>% group_by(group) %>% mutate(test = sum(!is.na(a)) == 0)
I found a workaround by converting a to double, but I still get a warning, and I am dissatisfied not to have found a better solution.
case_when() evaluates all the RHS expressions irrespective of whether their conditions are satisfied, hence you get an error. You may use hablar::max_, which returns NA if all the values are NA.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(max.a = hablar::max_(a)) %>%
  ungroup()
#  group     a max.a
#  <dbl> <int> <int>
#1     1     1     2
#2     1     2     2
#3     2    NA    NA
#4     2    NA    NA
#5     3     1     1
#6     3    NA     1
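If you would rather not add a dependency, a rough equivalent of hablar::max_ (an assumption about its behaviour, not the package's actual source) is:
max_ <- function(x) {
  # indexing with NA returns a single NA of x's own type
  if (all(is.na(x))) x[NA_integer_] else max(x, na.rm = TRUE)
}

max_(c(NA_integer_, NA_integer_))
#> [1] NA
max_(c(1L, NA, 2L))
#> [1] 2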
Instead of making use of case_when() I would suggest using an if () statement, like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
  group_by(group) %>%
  mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups:   group [3]
#>   group     a max.a
#>   <dbl> <int> <dbl>
#> 1     1     1     2
#> 2     1     2     2
#> 3     2    NA    NA
#> 4     2    NA    NA
#> 5     3     1     1
#> 6     3    NA     1
This code gives a warning, and for the all-NA group it returns -Inf rather than NA, but it is compact.
library(dplyr)
df %>%
  group_by(group) %>%
  dplyr::summarise(max.a = max(a, na.rm = TRUE))
Output:
  group max.a
  <dbl> <dbl>
1     1     2
2     2  -Inf
3     3     1
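If you want NA instead of -Inf here, one option (a sketch) is to convert afterwards with dplyr::na_if():
df %>%
  group_by(group) %>%
  summarise(max.a = suppressWarnings(max(a, na.rm = TRUE))) %>%
  mutate(max.a = na_if(max.a, -Inf))  # group 2 now gets NA instead of -Inf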

exists function doesn't work as expected with dplyr transmute

I would like to create the output shown for df3 with dplyr transmute(). But somehow it just takes the first row of the data frame columns a and b, and not the whole column. Any ideas?
df <- data.frame(a = 1:10, b = 2:11)
df2 <- df %>%
  transmute(
    newcol  = ifelse(exists("a", df) == TRUE, a, NA),
    newcol2 = ifelse(exists("b", df) == TRUE, b, NA),
    newcol3 = ifelse(exists("c", df) == TRUE, c, NA)
  )
df2
df3 <- data.frame(newcol = 1:10, newcol2 = 2:11, newcol3 = NA)
df3
The problem is that exists("a", df) returns a length-1 logical vector, so the ifelse() returns a length-1 result. This is then recycled, which is why the first number of each column gets repeated down the whole column. You can use if (condition) a else NA instead:
df <- data.frame(a = 1:10, b = 2:11)
df2 <- df %>%
  transmute(
    newcol  = if (exists("a", df)) a else NA,
    newcol2 = if (exists("b", df)) b else NA,
    newcol3 = if (exists("c", df)) c else NA
  )
df2
#>    newcol newcol2 newcol3
#> 1       1       2      NA
#> 2       2       3      NA
#> 3       3       4      NA
#> 4       4       5      NA
#> 5       5       6      NA
#> 6       6       7      NA
#> 7       7       8      NA
#> 8       8       9      NA
#> 9       9      10      NA
#> 10     10      11      NA
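The length-1 recycling is easy to see in isolation: ifelse() returns a result the same length as its test, so a scalar test yields a scalar result, which transmute() then recycles down the whole column.
ifelse(exists("a", df), df$a, NA)  # scalar test, so only df$a[1] comes back
#> [1] 1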
An option with map
library(dplyr)
library(purrr)
map_dfc(c('a', 'b', 'c'),
        ~ if (exists(.x, df)) df %>% select(.x) else df %>% transmute(!!.x := NA))

Using cummean with group_by and ignoring NAs

df <- data.frame(category = c("cat1", "cat1", "cat2", "cat1", "cat2", "cat2", "cat1", "cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))
I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried
df <- df %>% group_by(category) %>% mutate(new_col = cummean(value))
but cummean just doesn't know what to do with NAs.
EDIT: I do not want to count NAs as 0.
You could use ifelse to treat NAs as 0 for the cummean call:
library(dplyr)
df <- data.frame(category = c("cat1", "cat1", "cat2", "cat1", "cat2", "cat2", "cat1", "cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))
df %>%
  group_by(category) %>%
  mutate(new_col = cummean(ifelse(is.na(value), 0, value)))
Output:
# A tibble: 8 x 3
# Groups:   category [2]
  category value new_col
  <fct>    <dbl>   <dbl>
1 cat1       NA     0.
2 cat1        2.    1.00
3 cat2        3.    3.00
4 cat1        4.    2.00
5 cat2        5.    4.00
6 cat2       NA     2.67
7 cat1        7.    3.25
8 cat2        8.    4.00
EDIT: Now I see this isn't the same as ignoring NAs.
Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
  group_by(category, isna = is.na(value)) %>%
  mutate(new_col = ifelse(isna, NA, cummean(value)))
Output:
# A tibble: 8 x 4
# Groups:   category, isna [4]
  category value isna  new_col
  <fct>    <dbl> <lgl>   <dbl>
1 cat1       NA  TRUE    NA
2 cat1        2. FALSE    2.00
3 cat2        3. FALSE    3.00
4 cat1        4. FALSE    3.00
5 cat2        5. FALSE    4.00
6 cat2       NA  TRUE    NA
7 cat1        7. FALSE    4.33
8 cat2        8. FALSE    5.33
An option is to remove the NA values before calculating cummean. With this method, rows with an NA value are not included in the cummean calculation at all. (It isn't clear whether the OP wants NA values counted as 0.)
df %>%
  mutate(rn = row_number()) %>%
  filter(!is.na(value)) %>%
  group_by(category) %>%
  mutate(new_col = cummean(value)) %>%
  ungroup() %>%
  right_join(mutate(df, rn = row_number()), by = "rn") %>%
  select(category = category.y, value = value.y, new_col) %>%
  as.data.frame()
# category value new_col
# 1 cat1 NA NA
# 2 cat1 2 2.000000
# 3 cat2 3 3.000000
# 4 cat1 4 3.000000
# 5 cat2 5 4.000000
# 6 cat2 NA NA
# 7 cat1 7 4.333333
# 8 cat2 8 5.333333
I needed something similar, but could not replace NAs with 0, so I created this simple function, which works with dplyr. Hope this helps.
cummean.na <- function(x, na.rm = TRUE) {
  n <- length(x)
  out <- rep(NA_real_, n)
  # running mean of x[1:i] with NAs skipped; NA inputs stay NA in the output
  for (i in seq_len(n)) {
    if (!is.na(x[i])) out[i] <- mean(x[1:i], na.rm = na.rm)
  }
  out
}
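For example, with the function above, NA inputs stay NA and are skipped in the running mean:
cummean.na(c(NA, 2, 4))
#> [1] NA  2  3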
Custom function to calculate "cummean", ignoring NA's and carrying forward the previous cumulative mean value to the next NA value:
cummean.na <- function(x) {
  tmp_ind <- cumsum(!is.na(x))
  tmp_ind[tmp_ind == 0] <- NA  # guard: leading NAs have no previous mean to carry forward
  x_nona <- x[!is.na(x)]
  cummean(x_nona)[tmp_ind]
}
Example output:
> cummean.na(1:5)
[1] 1.0 1.5 2.0 2.5 3.0
> cummean.na(c(1, 2, 3, NA, 4, 5))
[1] 1.0 1.5 2.0 2.0 2.5 3.0
