Using cummean with group_by and ignoring NAs - r

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried
df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))
but cummean just doesn't know what to do with NAs.
EDIT: I do not want to count NAs as 0.

You could use ifelse to treat NAs as 0 for the cummean call:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category) %>%
mutate(new_col = cummean(ifelse(is.na(value), 0, value)))
Output:
# A tibble: 8 x 3
# Groups: category [2]
category value new_col
<fct> <dbl> <dbl>
1 cat1 NA 0.
2 cat1 2. 1.00
3 cat2 3. 3.00
4 cat1 4. 2.00
5 cat2 5. 4.00
6 cat2 NA 2.67
7 cat1 7. 3.25
8 cat2 8. 4.00
EDIT: Now I see this isn't the same as ignoring NAs.
Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:
library(dplyr)
df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
df %>%
group_by(category, isna = is.na(value)) %>%
mutate(new_col = ifelse(isna, NA, cummean(value)))
Output:
# A tibble: 8 x 4
# Groups: category, isna [4]
category value isna new_col
<fct> <dbl> <lgl> <dbl>
1 cat1 NA TRUE NA
2 cat1 2. FALSE 2.00
3 cat2 3. FALSE 3.00
4 cat1 4. FALSE 3.00
5 cat2 5. FALSE 4.00
6 cat2 NA TRUE NA
7 cat1 7. FALSE 4.33
8 cat2 8. FALSE 5.33
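The same result can also be sketched without the extra grouping column, by dividing a running sum by a running count of non-NA values. This is only an illustration in base R; the helper name cummean_skip_na is made up here and is not part of dplyr.

```r
df <- data.frame(category = c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))

# Hypothetical helper: cumulative mean over the non-NA values seen so far.
cummean_skip_na <- function(x) {
  running_sum <- cumsum(ifelse(is.na(x), 0, x))  # NAs add nothing to the sum
  running_n   <- cumsum(!is.na(x))               # NAs are not counted
  out <- running_sum / running_n
  out[is.na(x)] <- NA                            # rows that were NA stay NA
  out
}

# ave() applies the helper within each category and keeps the row order.
df$new_col <- ave(df$value, df$category, FUN = cummean_skip_na)
df
```

Inside a dplyr pipeline the same helper could be called from mutate() after group_by(category).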

An option is to remove the NA values before calculating cummean. With this method, rows with an NA value are left out of the cummean calculation and end up with NA in the new column. I'm not sure whether the OP wants NA values treated as 0 in the calculation; this approach does not do that.
df %>% mutate(rn = row_number()) %>%
filter(!is.na(value)) %>%
group_by(category) %>%
mutate(new_col = cummean(value)) %>%
ungroup() %>%
right_join(mutate(df, rn = row_number()), by="rn") %>%
select(category = category.y, value = value.y, new_col) %>%
as.data.frame()
# category value new_col
# 1 cat1 NA NA
# 2 cat1 2 2.000000
# 3 cat2 3 3.000000
# 4 cat1 4 3.000000
# 5 cat2 5 4.000000
# 6 cat2 NA NA
# 7 cat1 7 4.333333
# 8 cat2 8 5.333333

I needed something similar, but could not replace NAs with 0, so I wrote this simple function, which works with dplyr. Hope this helps.
cummean.na <- function(x, na.rm = TRUE)
{
# x = c(NA, seq(1, 10, 1)); na.rm = TRUE
n <- length(x)
op <- rep(NA_real_, n)
# position i gets the mean of x[1:i], or NA when x[i] itself is NA
for(i in seq_len(n)) {op[i] <- if (is.na(x[i])) NA else mean(x[1:i], na.rm = na.rm)}
return(op)
}

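Custom function to calculate "cummean", ignoring NAs and carrying the previous cumulative mean forward to the next NA value (requires dplyr for cummean):
cummean.na <-
function(x) {
# number of non-NA values seen so far at each position
tmp_ind <- cumsum(!is.na(x))
# an index of 0 would silently drop leading NAs; use NA so they stay NA
tmp_ind[tmp_ind == 0] <- NA
x_nona <- x[!is.na(x)]
out <- cummean(x_nona)[tmp_ind]
return(out)
}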
Example output:
> cummean.na(1:5)
[1] 1.0 1.5 2.0 2.5 3.0
> cummean.na(c(1, 2, 3, NA, 4, 5))
[1] 1.0 1.5 2.0 2.0 2.5 3.0
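For comparison, here is a minimal base R sketch of the same carry-forward idea, written as a running sum divided by a running count. The name cummean_carry is hypothetical, and the choice to return NA for NAs that come before any observed value is an assumption about the desired behaviour.

```r
# Hypothetical helper: cumulative mean that skips NAs and carries the last
# mean forward over them.
cummean_carry <- function(x) {
  n_seen <- cumsum(!is.na(x))                # non-NA values seen so far
  total  <- cumsum(ifelse(is.na(x), 0, x))   # running sum, NAs contribute 0
  # before the first observed value there is no mean to carry, so return NA
  ifelse(n_seen == 0, NA_real_, total / n_seen)
}

cummean_carry(c(1, 2, 3, NA, 4, 5))
cummean_carry(c(NA, 1, 2, 3, NA, 4, 5))
```

Because the output always has the same length as the input, it can be used directly inside mutate().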

Related

Function breaks when looped within dplyr::case_when()

I have a function that extracts the min or max of a range of values (within a character string) that appears to work fine on individual cases.
However, when I try to use it within case_when() it does not behave as expected.
Reproducible example
library(dplyr)
library(tibble)
library(stringr)
val_from_range <- function(.str, .fun = "min"){
str_extract_all(.str, "\\d*\\.?\\d+") |>
unlist() |>
as.numeric() |>
(\(x) if (.fun == "min") x |> min()
else if (.fun == "max") x |> max())()
}
tibble(x = c("5-6", "4", "6-9", "5", "NA")) |>
mutate(min = case_when(str_detect(x, "-") ~ val_from_range(x, "min"))) |>
mutate(max = case_when(str_detect(x, "-") ~ val_from_range(x, "max")))
# A tibble: 5 x 3
x min max
<chr> <dbl> <dbl>
1 5-6 4 9
2 4 NA NA
3 6-9 4 9
4 5 NA NA
5 NA NA NA
However, I want:
# A tibble: 5 x 3
x min max
<chr> <dbl> <dbl>
1 5-6 5 6
2 4 NA NA
3 6-9 6 9
4 5 NA NA
5 NA NA NA
The function performs as expected on individual cases
> val_from_range("5-6", "min")
[1] 5
> val_from_range("5-6", "max")
[1] 6
> val_from_range("5-6-8-10", "max")
[1] 10
Any help would be greatly appreciated. Thanks in advance.
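A couple of changes are required. The function works only for one value at a time. If you pass in more than one string, it pools the numbers from all of them and returns a single overall min (or max), which is why every matching row in your output shows 4 and 9.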
val_from_range("5-6", "min")
#[1] 5
val_from_range(c("5-6", "8-10"), "min")
#[1] 5
To pass the values one by one, you can use rowwise. Secondly, case_when still executes the function for values that do not satisfy the condition, hence it returns a warning for the "NA" value. We can use if/else here to avoid that.
library(dplyr)
library(stringr)
tibble(x = c("5-6", "4", "6-9", "5", "NA")) %>%
rowwise() %>%
mutate(min = if(str_detect(x, "-")) val_from_range(x, "min") else NA,
max = if(str_detect(x, "-")) val_from_range(x, "max") else NA) %>%
ungroup
# x min max
# <chr> <dbl> <dbl>
#1 5-6 5 6
#2 4 NA NA
#3 6-9 6 9
#4 5 NA NA
#5 NA NA NA
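As an alternative to rowwise, the function itself could be made vectorised so that it returns one value per input string. The sketch below uses only base R; val_from_range_vec and its fun argument are hypothetical names, and it assumes (as in the question) that only strings containing "-" count as ranges.

```r
# Hypothetical vectorised variant: one result per element of `strings`.
val_from_range_vec <- function(strings, fun = min) {
  # one character vector of number-like tokens per input string
  matches <- regmatches(strings, gregexpr("[0-9]*\\.?[0-9]+", strings))
  has_range <- grepl("-", strings, fixed = TRUE)
  out <- rep(NA_real_, length(strings))
  # apply fun (e.g. min or max) separately to each string that is a range
  out[has_range] <- vapply(matches[has_range],
                           function(m) fun(as.numeric(m)),
                           numeric(1))
  out
}

x <- c("5-6", "4", "6-9", "5", "NA")
data.frame(x = x,
           min = val_from_range_vec(x, min),
           max = val_from_range_vec(x, max))
```

Since non-range strings are never passed to min or max, this version should not trigger the warning for the "NA" entry.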

Sum values from rows with conditions in R

I need to create a new column with the sum values in several other columns, but with conditions.
My data is
ID <- c(A,B,C,D,E,F)
Q1 <- c(0,1,7,9,na,3)
Q2 <- c(0,3,2,2,na,3)
Q3 <- c(0,0,7,9,na,3)
dta <- as.data.frame (ID,Q1,Q2,Q3)
I need to sum values from the columns only if the values are < 4. If there is any value in any column that is > 4, the result should be dismissed. And I need to preserve the rows with only "na".
The result should look like
Result
0
4
na
na
na
9
I have tried :
library(dplyr)
dta %>% filter(Q1 < 4) %>% mutate(Result = rowSums(.[2:4]))
but then all the rows with values > 4 disappear, and I was only able to filter one column at a time. I have also tried:
dta$Result <- ifelse(c("Q1", "Q2", "Q3") < 4, rowSums(.[2:4]), NA)
but then all my results are "na"
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
You have to switch the sum and ifelse statement.
dta %>%
rowwise() %>%
mutate(result = sum(ifelse(c(Q1, Q2, Q3)<4, c(Q1, Q2, Q3), NA)))
You can use the following solution:
library(dplyr)
dta %>%
rowwise() %>%
mutate(Result = ifelse(any(c_across(Q1:Q3) > 4), NA, Reduce(`+`, c_across(Q1:Q3))))
# A tibble: 6 x 5
# Rowwise:
ID Q1 Q2 Q3 Result
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 0 0
2 B 1 3 0 4
3 C 7 2 7 NA
4 D 9 2 9 NA
5 E NA NA NA NA
6 F 3 3 3 9
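For larger data, a vectorised sketch in base R avoids rowwise altogether. It assumes the same rule as the answer above: a row is dismissed when any of Q1 to Q3 is greater than 4, and rows that are entirely NA stay NA.

```r
dta <- data.frame(ID = c("A", "B", "C", "D", "E", "F"),
                  Q1 = c(0, 1, 7, 9, NA, 3),
                  Q2 = c(0, 3, 2, 2, NA, 3),
                  Q3 = c(0, 0, 7, 9, NA, 3))

q <- dta[, c("Q1", "Q2", "Q3")]

# rowSums(q > 4) counts the values above 4 in each row; for an all-NA row the
# count is NA, so ifelse() returns NA for that row as well.
dta$Result <- ifelse(rowSums(q > 4) > 0, NA, rowSums(q))
dta
```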

exists function doesn't work as expected with dplyr transmute

I would like to create the same output as df3 using dplyr transmute. But somehow it just takes the first row of the dataframe columns a and b and not the column itself. Any ideas?
df = data.frame(a=1:10, b=2:11)
df2 <- df %>%
transmute(
newcol = ifelse(exists("a", df)==TRUE,a, NA),
newcol2 = ifelse(exists("b", df)==TRUE,b, NA),
newcol3 = ifelse(exists("c", df)==TRUE,c, NA),
)
df2
df3 = data.frame(newcol=1:10, newcol2=2:11, newcol3 = NA)
df3
The problem is that exists("a", df) returns a length-1 logical vector, so the ifelse returns a length-1 numeric vector. This is then recycled, which is why the first number in each column is repeated down the whole column. You can use if(condition) a else NA instead:
df = data.frame(a=1:10, b=2:11)
df2 <- df %>%
transmute(
newcol = if(exists("a", df)) a else NA,
newcol2 = if(exists("b", df)) b else NA,
newcol3 = if(exists("c", df)) c else NA
)
df2
#> newcol newcol2 newcol3
#> 1 1 2 NA
#> 2 2 3 NA
#> 3 3 4 NA
#> 4 4 5 NA
#> 5 5 6 NA
#> 6 6 7 NA
#> 7 7 8 NA
#> 8 8 9 NA
#> 9 9 10 NA
#> 10 10 11 NA
An option with map
library(dplyr)
library(purrr)
map_dfc(c('a', 'b', 'c'),
~ if(exists(.x, df)) df %>% select(all_of(.x)) else df %>% transmute(!! .x := NA))
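The same idea can be sketched in base R by checking the column names directly instead of calling exists. The helper pick_or_na and the wanted vector are hypothetical names used only for this illustration.

```r
df <- data.frame(a = 1:10, b = 2:11)

# Hypothetical helper: return the whole column if present, otherwise a column of NA.
pick_or_na <- function(data, col) {
  if (col %in% names(data)) data[[col]] else rep(NA, nrow(data))
}

# names are the new column names, values are the source columns to look up
wanted <- c(newcol = "a", newcol2 = "b", newcol3 = "c")
df2 <- as.data.frame(lapply(wanted, pick_or_na, data = df))
df2
```

The test col %in% names(data) is a single TRUE or FALSE, so a plain if/else is the right tool here, for the same reason given in the answer above.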

Select a maximum value across rows and columns with grouped data

The data below have an IndID field as well as three columns containing numbers, including NA in some instances, with a varying number of rows for each IndID.
library(dplyr)
n = 10
set.seed(123)
dat <- data.frame(IndID = sample(c("AAA", "BBB", "CCC", "DDD"), n, replace = T),
Num1 = c(2,4,2,4,4,1,3,4,3,2),
Num2 = sample(c(1,2,5,8,7,8,NA), n, replace = T),
Num3 = sample(c(NA, NA,NA,8,7,9,NA), n, replace = T)) %>%
arrange(IndID)
head(dat)
IndID Num1 Num2 Num3
1 AAA 1 NA 7
2 BBB 2 NA NA
3 BBB 2 7 7
4 BBB 2 NA NA
5 CCC 3 2 8
6 CCC 3 5 NA
For each IndID, I would like to make a new column Max that contains the maximum value for Num1:Num3. In most instances this involves finding the max value across multiple rows and columns. Within dplyr I am missing the final step (below) and would appreciate any suggestions.
dat %>%
group_by(IndID) %>%
mutate(Max = "???")
An option is pmax to get the rowwise maxs
dat %>%
mutate(Max = pmax(Num1, Num2, Num3, na.rm = TRUE))
If there are many columns, we can get the column names, convert it to symbol and then evaluate (!!!)
dat %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE))
# A tibble: 10 x 5
# Groups: IndID [4]
# IndID Num1 Num2 Num3 Max
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 AAA 1 NA 7 7
# 2 BBB 2 NA NA 2
# 3 BBB 2 7 7 7
# 4 BBB 2 NA NA 2
# 5 CCC 3 2 8 8
# 6 CCC 3 5 NA 5
# 7 DDD 4 8 7 8
# 8 DDD 4 7 NA 7
# 9 DDD 4 1 7 7
#10 DDD 4 1 7 7
If this is to get the max of all 'Num' column grouped by 'IndID', there are multiple ways.
1) From the above step, we can extend it to group by 'IndID' and then take the max of row maxs ('Max')
dat %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE)) %>%
group_by(IndID) %>%
mutate(Max = max(Max))
2) Another option is to convert the 'wide' format to 'long' with gather, then grouped by 'IndID', get the max of 'val' column and right_join with the original dataset
library(tidyverse)
gather(dat, key, val, -IndID) %>%
group_by(IndID) %>%
summarise(Max = max(val,na.rm = TRUE)) %>%
right_join(dat)
3) Or another option without reshaping into 'long' format would be to nest the dataset after grouping by 'IndID', unlist and get the max of the 'Num' columns
dat %>%
group_by(IndID) %>%
nest() %>%
mutate(data = map(data, ~ .x %>%
mutate(Max = max(unlist(.), na.rm = TRUE)))) %>%
unnest(cols = data)
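Option 1 can also be sketched in base R: take the row-wise maximum with pmax, then the maximum of those values within each IndID. The small data frame below copies the first six rows shown in the question's head(dat) output rather than regenerating the random sample.

```r
dat <- data.frame(IndID = c("AAA", "BBB", "BBB", "BBB", "CCC", "CCC"),
                  Num1 = c(1, 2, 2, 2, 3, 3),
                  Num2 = c(NA, NA, 7, NA, 2, 5),
                  Num3 = c(7, NA, 7, NA, 8, NA))

# Row-wise maximum across all columns except IndID, ignoring NAs.
row_max <- do.call(pmax, c(dat[-1], na.rm = TRUE))

# Maximum of the row maxima within each IndID, repeated on every row of the group.
dat$Max <- ave(row_max, dat$IndID, FUN = max)
dat
```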

Cumulative mean non including the current observation - using cummean and group_by while ignoring NAs

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
value=c(NA,2,3,4,5,NA,7,8))
I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column up to the prior observation (ie not including the current observation) and not taking into account NAs. I've tried
df %>%
group_by(category, isna = is.na(value)) %>%
mutate(new_col = ifelse(isna, NA, cummean(lag(value))))
but cummean just doesn't know what to do with NAs and unfortunately lag generates them.
I do not want to count NAs as 0.
One can work out cummean first and then take its lag.
library(dplyr)
df %>%
group_by(category, isna = is.na(value)) %>%
mutate(new_col = lag(cummean(value))) %>%
ungroup() %>%
select(-isna)
# # A tibble: 8 x 3
# category value new_col
# <fctr> <dbl> <dbl>
# 1 cat1 NA NA
# 2 cat1 2.00 NA
# 3 cat2 3.00 NA
# 4 cat1 4.00 2.00
# 5 cat2 5.00 3.00
# 6 cat2 NA NA
# 7 cat1 7.00 3.00
# 8 cat2 8.00 4.00
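The lagged cumulative mean can also be computed directly, as the sum and count of the prior non-NA observations. The base R sketch below is one way to do that; lagged_cummean is a hypothetical helper name, and keeping NA for rows whose own value is NA is an assumption chosen to match the output above.

```r
df <- data.frame(category = c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))

# Hypothetical helper: mean of all non-NA values strictly before each position.
lagged_cummean <- function(x) {
  observed  <- !is.na(x)
  filled    <- ifelse(observed, x, 0)
  prior_sum <- cumsum(filled) - filled       # sum excluding the current value
  prior_n   <- cumsum(observed) - observed   # count excluding the current value
  out <- ifelse(prior_n == 0, NA_real_, prior_sum / prior_n)
  out[!observed] <- NA                       # rows that were NA stay NA
  out
}

df$new_col <- ave(df$value, df$category, FUN = lagged_cummean)
df
```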
