Using dplyr summarise with conditions - r

I am currently trying to apply the summarise function in order to isolate the relevant observations from a large data set. A simple reproducible example is given here:
df <- data.frame(c(1,1,1,2,2,2,3,3,3), as.logical(c(TRUE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)),
as.numeric(c(0,5,0,0,0,0,7,0,7)))
colnames(df) <- c("ID", "Status", "Price")
ID Status Price
1 1 TRUE 0
2 1 FALSE 5
3 1 TRUE 0
4 2 TRUE 0
5 2 TRUE 0
6 2 TRUE 0
7 3 FALSE 7
8 3 TRUE 0
9 3 FALSE 7
I would like to group the table by ID and set the status to TRUE only if all three observations are TRUE (figured out). I then want the price corresponding to that status (i.e. 5 for ID 1, which is FALSE; 0 for ID 2, which is TRUE; and 7 for ID 3, which is FALSE).
From Summarize with conditions in dplyr I have figured out that I can, as usual, specify the conditions in square brackets. My code so far looks like this:
library(dplyr)
result <- df %>%
group_by(ID) %>%
summarize(Status = all(Status), Test = ifelse(all(Status) == TRUE,
first(Price[Status == TRUE]), first(Price[Status == FALSE])))
# This is what I get:
# A tibble: 3 x 3
ID Status Test
<dbl> <lgl> <dbl>
1 1. FALSE 0.
2 2. TRUE 0.
3 3. FALSE 7.
But as you can see, for ID = 1 it gives an incorrect price. I have been trying this forever, so I would appreciate any hint as to where I have been going wrong.

The problem is that Status = all(Status) comes first in summarise, so the later expressions see the summarised value instead of the original column. We could move all(Status) to the end of summarise (or give it a different column name). Also, since all(Status) returns a single TRUE/FALSE per group, this can be done with a plain if/else:
df %>%
group_by(ID) %>%
summarise( Test = if(all(Status)) first(Price[Status]) else
first(Price[!Status]), Status = all(Status))
# A tibble: 3 x 3
# ID Test Status
# <dbl> <dbl> <lgl>
#1 1 5 FALSE
#2 2 0 TRUE
#3 3 7 FALSE
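NOTE: It is better not to use ifelse when its arguments have unequal lengths. ifelse returns a result the same length as its test, so a length-1 test silently keeps only the first element of the chosen branch.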
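The note above can be illustrated with a small base R sketch. The vectors here are made up for illustration and are not taken from the question:

```r
# ifelse() shapes its result like `test`, so a length-1 test
# keeps only the first element of `yes`.
truncated <- ifelse(TRUE, c(10, 20, 30), 0)

# if/else returns the chosen branch unchanged.
full <- if (TRUE) c(10, 20, 30) else 0

print(truncated)
print(full)
```

This is why if/else is the safer choice when the condition is a single TRUE/FALSE per group.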

Could do:
df %>%
group_by(ID) %>%
mutate(status = Status) %>%
summarise(
Status = all(Status),
Test = ifelse(Status == TRUE,
first(Price),
first(Price[status == FALSE]))
)
Output:
# A tibble: 3 x 3
ID Status Test
<dbl> <lgl> <dbl>
1 1 FALSE 5
2 2 TRUE 0
3 3 FALSE 7
The issue is that the Test column needs the original Status values, but by that point Status has already been overwritten with its summarised value.
Make a copy first (saved here as status) and run ifelse on the copy, and it will work fine.
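A minimal sketch of that behaviour, using a single made-up group: summarise evaluates its arguments in order, so any expression after Status = all(Status) sees the one-value summary. The column names n_before and n_after are hypothetical and only there to show the length change.

```r
library(dplyr)

df <- data.frame(
  ID     = c(1, 1, 1),
  Status = c(TRUE, FALSE, TRUE),
  Price  = c(0, 5, 0)
)

out <- df %>%
  group_by(ID) %>%
  summarise(
    n_before = length(Status),  # original column: three values
    Status   = all(Status),     # overwrites Status with a single value
    n_after  = length(Status)   # now sees the summarised value
  )

print(out)
```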

Related

R: How to create a is.a column for a list of given columns

let's say I have the following tibble:
df <- tibble(a=c(NA, 1, 0), b=c(2, 0, NA))
I now want to replace all NA with zero. However, in order to distinguish between a NA-zero and an actual zero, there should be an a_is_na and b_is_na column.
I know that this can easily be done with mutate, like mutate(a_is_na = is.na(a)). However, I have about 20 columns and want to automatically create the correct column names and values. What is a clever way in R to do so?
Use across:
library(dplyr)
df %>%
mutate(across(a:b, is.na, .names = "{.col}_is_na")) %>%
replace(is.na(.), 0)
# A tibble: 3 × 4
a b a_is_na b_is_na
<dbl> <dbl> <lgl> <lgl>
1 0 2 TRUE FALSE
2 1 0 FALSE FALSE
3 0 0 FALSE TRUE
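As a variation on the answer above, the replacement step can also be written with across, which limits it to the chosen columns. This is only a sketch; using coalesce for the replacement is my assumption, not part of the original answer. The flags must be created before the NAs are replaced.

```r
library(dplyr)

df <- tibble(a = c(NA, 1, 0), b = c(2, 0, NA))

out <- df %>%
  # Step 1: record where the NAs were, one flag column per input column.
  mutate(across(a:b, is.na, .names = "{.col}_is_na")) %>%
  # Step 2: replace NA with 0 in the original columns only.
  mutate(across(a:b, ~ coalesce(.x, 0)))

print(out)
```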

Writing other columns when matching a condition

I would like to create a new column, document it only when it matches a specific condition (here x > 2 ) and then directly overwrite another existing column (here auxiliary) for these rows where the condition (x > 2) returned TRUE.
df <- tibble(x = 1:5, y = 1:5, auxiliary = NA)
# A tibble: 5 x 3
x y auxiliary
<int> <dbl> <lgl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
I can do this successfully in two different calls within mutate() :
df %>%
mutate(result = if_else(condition = x > 2,
true = x+y,
false = NA_real_),
auxiliary = if_else(condition = x > 2,
true = "Calculation done",
false = NA_character_))
# A tibble: 5 x 4
x y auxiliary result
<int> <dbl> <chr> <dbl>
1 1 NA NA
2 2 NA NA
3 3 Calculation done 6
4 4 Calculation done 8
5 5 Calculation done 10
But the condition (x > 2) is repeated, which in more complex cases makes the code hard to read and error-prone, especially when there are multiple conditions.
Is there a way to simplify the code above so that the condition is not repeated? The steps would be:
Create new variable (mutate())
Document only if condition is matched (if_else or case_when())
Write another column's value only if the row's condition is matched. (I'm stuck here)
Something that would look like this :
df %>%
mutate(result = case_when(
x > 2 ~ x + y & auxiliary == "Calculation done", # we'd add the column reference here...
TRUE ~ NA_real & auxiliary = NA_character_))
Many thanks ! Any solution from the tidyverse would be ideal.
You can save the result of the condition in a column and use that to avoid evaluating the same condition again and again.
library(dplyr)
df <- tibble(x = 1:5, y = 1:5)
df %>%
mutate(condition = x > 2,
result = if_else(condition,
true = x+y,
false = NA_integer_),
auxiliary = if_else(condition,
true = "Calculation done",
false = NA_character_))
# x y condition result auxiliary
# <int> <int> <lgl> <int> <chr>
#1 1 1 FALSE NA NA
#2 2 2 FALSE NA NA
#3 3 3 TRUE 6 Calculation done
#4 4 4 TRUE 8 Calculation done
#5 5 5 TRUE 10 Calculation done
I would suggest saving the condition that is used multiple times as a string and then evaluating that string in the code, e.g.:
condition <- "x>2"
df %>%
mutate(result = ifelse(eval(parse(text=condition)),
x+y,
NA),
auxiliary = ifelse(eval(parse(text=condition)),
"Calculation done",
NA))
Note that I am using base ifelse to avoid the requirement that both branches have the same type ("dplyr::if_else is specifically written to force you to have the same type in your true and false arguments."). For more on that, see e.g. Different behavior of if else statement and if_else.
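The same idea of storing the condition once can be sketched without parsing a string, by capturing it as an expression and injecting it with !!. This is an alternative I am suggesting, not the answerer's method; it assumes the rlang package, which is installed alongside dplyr.

```r
library(dplyr)

df <- tibble(x = 1:5, y = 1:5)

# Capture the condition once as an unevaluated expression.
cond <- rlang::expr(x > 2)

out <- df %>%
  mutate(
    # !! injects the stored expression into each call.
    result    = if_else(!!cond, x + y, NA_integer_),
    auxiliary = if_else(!!cond, "Calculation done", NA_character_)
  )

print(out)
```

Compared with eval(parse(text = ...)), the condition stays ordinary R code, so syntax errors show up when it is defined.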
It is possible to achieve the kind of abstraction you would like, but it requires more setup. mutate is actually more flexible than you might think: you can pass it a braced block of code. Suppose you write something like A %>% mutate({...}). If the block {...} returns a data frame, its columns are created directly in A, or replace the existing columns in A if they share the same names. So you can do
df %>% mutate({
cond <- x > 2
out <- tibble(.rows = n())
mapply(
\(var, true, false) out[[var]] <<- if_else(cond, true, false),
var = c("result", "auxiliary"),
true = list(x + y, "Calculation done"),
false = list(NA_integer_, NA_character_)
)
out
})
Output
# A tibble: 5 x 4
x y auxiliary result
<int> <int> <chr> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 Calculation done 6
4 4 4 Calculation done 8
5 5 5 Calculation done 10
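For only two columns, the same mechanism can be sketched more directly by building the returned tibble by hand instead of filling it with mapply. This is a simplified variant under the same assumption as above, namely that an unnamed data frame returned inside mutate is unpacked into columns (dplyr 1.0 or later).

```r
library(dplyr)

df <- tibble(x = 1:5, y = 1:5)

out <- df %>%
  mutate({
    cond <- x > 2  # evaluated once, reused for both columns
    tibble(
      result    = if_else(cond, x + y, NA_integer_),
      auxiliary = if_else(cond, "Calculation done", NA_character_)
    )
  })

print(out)
```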

How to use condition statement in pipe in r

I am trying to use a conditional statement in a pipe, but it fails.
The data looks like this:
group = rep(letters[1:3], each = 3)
status = c(T,T,T, T,T,F, F,F,F)
value = c(1:9)
df = data.frame(group = group, status = status, value = value)
> df
group status value
1 a TRUE 1
2 a TRUE 2
3 a TRUE 3
4 b TRUE 4
5 b TRUE 5
6 b FALSE 6
7 c FALSE 7
8 c FALSE 8
9 c FALSE 9
For each group, I want the row with the max value, subject to this condition: if any status in the group is TRUE, then filter(status == T) %>% slice_max(value); otherwise just slice_max(value).
What I have tried is this:
# way 1
df %>%
group_by(group) %>%
if(any(status) == T) {
filter(status == T) %>% slice_max(value)
} else {
slice_max(value)
}
# way 2
df %>%
group_by(group) %>%
when(any(status) == T,
filter(status == T) %>% slice_max(value),
slice_max(value))
My expected output looks like this:
> expected_df
group status value
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Any help will be highly appreciated!
Try arranging the data by status, then by value, and then taking the first row of each group:
df %>%
group_by(group) %>%
arrange(!status, desc(value)) %>%
slice(1)
Since we arrange by status, a TRUE row comes first if the group has one; if not, you just get the largest value. Combining pipes and if statements is generally a bit awkward. If that's something you want to look into, it is covered in an existing question, but a plain if statement does not work per group after group_by.
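If a per-group if/else is still wanted, one way to sketch it is with group_modify, which runs a function on each group's rows. This is my own illustration rather than something from the answer; the name candidates is hypothetical.

```r
library(dplyr)

df <- data.frame(
  group  = rep(letters[1:3], each = 3),
  status = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  value  = 1:9
)

out <- df %>%
  group_by(group) %>%
  group_modify(~ {
    # .x holds the rows of the current group (without the group column).
    candidates <- if (any(.x$status)) filter(.x, status) else .x
    slice_max(candidates, value, n = 1)
  }) %>%
  ungroup()

print(out)
```

Here the if/else sees one group at a time, which is what the piped if statement in the question could not do.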
A bit more verbose :
library(dplyr)
df %>%
group_by(group) %>%
filter(if(any(status)) value ==max(value[status]) else value == max(value)) %>%
ungroup
# group status value
# <chr> <lgl> <int>
#1 a TRUE 3
#2 b TRUE 5
#3 c FALSE 9
df %>%
group_by(group) %>%
slice(which.max(value*(all(!status)|status)))
# A tibble: 3 x 3
# Groups: group [3]
group status value
<chr> <lgl> <int>
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
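That said, arranging the data is the better approach. Note that the multiplication trick above assumes the values are positive, because excluded rows are turned into 0.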

For rows in a tibble, how to count the greatest number of TRUE values between FALSE values?

So I have a tibble in the form
passengerId FlightChain
1 1 c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
2 2 TRUE
3 3 c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
and I'm trying to get the highest count of "TRUE"s between "FALSE"s as its own column.
so in this case:
passengerId fullFlightChain
1 1 3
2 2 1
3 3 2
I first had the tibble in format:
passengerId flightTo
<int> <lgl>
1 1 TRUE
2 1 TRUE
3 1 FALSE
4 1 TRUE
5 1 TRUE
6 1 TRUE
7 1 FALSE
8 1 TRUE
9 2 TRUE
10 3 TRUE
11 3 FALSE
etc....
If it would actually be better to work from this format (grouping by passengerId), I'm all ears. From what I've heard, rle() is a function that might work, but I can't get it to work properly.
Thanks
Edit: now with code
df_q3 <- df1 %>%
group_by(passengerId) %>%
arrange(passengerId) %>%
mutate(flightToUK = if_else(to == "uk", FALSE, TRUE)) %>%
summarise(fullFlightChain = paste(flightToUK, collapse = "-")) %>%
mutate(fullFlightChainSplit = str_split(fullFlightChain, "-")) %>%
map(fullFlightChainSplit,rle(fullFlightChainSplit))) %>%
print()
The last line is where I'm trying to compute the count shown in the first table.
Taking as an input your initial tibble format, i.e.:
library(readr)
library(dplyr)
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE")
Here is one solution to your problem:
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 1
EDIT: Adding the missing rows to your initial tibble and producing the output:
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE
3 TRUE
3 TRUE
3 FALSE")
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 2
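The same rle idea can also be sketched with summarise, which returns one row per passenger directly, so no unique() step is needed. The helper name longest_true_run is hypothetical, and returning 0 for a passenger with no TRUE at all is my assumption about the desired behaviour.

```r
library(dplyr)

# Hypothetical helper: length of the longest run of TRUE in a logical vector.
longest_true_run <- function(x) {
  r <- rle(x)
  runs <- r$lengths[r$values]          # lengths of the TRUE runs only
  if (length(runs) == 0) 0L else max(runs)
}

df <- tibble(
  passengerId = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3),
  flightTo    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,
                  TRUE,
                  TRUE, FALSE, TRUE, TRUE, FALSE)
)

out <- df %>%
  group_by(passengerId) %>%
  summarise(fullFlightChain = longest_true_run(flightTo))

print(out)
```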
The rle function encodes a vector as values and run lengths, which lets you find the maximum length of a run of TRUE values. Something along these lines, although untested since the question has no reproducible example:
RLE <- rle(flightTo)
mxT <- max( RLE$lengths[RLE$values == TRUE] )
Or for multiple items in a list:
lapply(list_name, function(line) {
  RLE <- rle(line)  # encode each list element, not the whole column
  max(RLE$lengths[RLE$values == TRUE])
})
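Applied to the list-column format from the question, the approach can be sketched as follows. The list chains is a hypothetical stand-in for the FlightChain column, and the strings are converted to logical first.

```r
# Hypothetical list mirroring the FlightChain column from the question.
chains <- list(
  c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE"),
  "TRUE",
  c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
)

longest <- vapply(chains, function(v) {
  runs <- rle(as.logical(v))       # convert "TRUE"/"FALSE" strings first
  max(runs$lengths[runs$values])   # longest run of TRUE
}, integer(1))

print(longest)
```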
Here is both a reproducible example and the solution based on rle
library(tibble)
library(magrittr)
library(dplyr)
set.seed(4242)
tbl <- tibble(passID = sample(1:3, 20, replace = TRUE),
flightTO = sample(c(T, F), 20, replace = TRUE)) %>%
arrange(passID)
rle(tbl$flightTO)
tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
UPDATE
Simply use my code to create a temporary object, then join it to your main summarised object, keeping the critical "Max" column, which holds the maximum run length by passID. "tbl" is your "df1".
temp_obj <- tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
your_new_obj_where_you_summarise_other_stuff <- tbl %>%
group_by(passID) %>%
summarise(..other summary statistics you need..) %>%
inner_join(temp_obj, by = "passID")

Select groups which have at least one of a certain value

How to select groups based on a condition on the individual rows, say keep all groups that contain at least one (ANY) of a certain value, e.g. 4, (or any other condition that is TRUE at least once). Or phrased the other way around: if a group does not have any rows where condition is true, the entire group should be removed.
Let's take a very simple data set with two groups. I want to select the group that has at least one row with a Value of 4 (i.e. group B here):
library(dplyr)
df <- data.frame(Group = LETTERS[c(1,1,1,2,2,2)], Value=c(1:5, 4))
df
#   Group Value
# 1     A     1 # Group A has no values == 4 ~~> remove entire group
# 2     A     2
# 3     A     3
# 4     B     4 # Group B has at least one 4 ~~> keep the whole group
# 5     B     5
# 6     B     4
Doing group_by() and then filter (as in this post) will only select the individual rows that contain a value of 4, not the whole group:
df %>%
group_by(Group) %>%
filter(Value == 4)
#    Group Value
#   <fctr> <dbl>
# 1      B     4
# 2      B     4
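This turns out to be pretty easy: use the any() function in the filter call. After group_by(), filter() evaluates its expression once per group, and what matters is the length of the result:
filter(any(...)) yields a single TRUE/FALSE per group, which is recycled to every row of that group, so the group is kept or dropped as a whole,
filter(...) with a plain comparison such as Value == 4 yields one TRUE/FALSE per row, so rows are kept individually, even when preceded by group_by().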
Hence use:
df %>%
group_by(Group) %>%
filter(any(Value==4))
   Group Value
  <fctr> <dbl>
1      B     4
2      B     5
3      B     4
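The question also phrases the task the other way around: drop every group that has no matching row. As a sketch, keeping and dropping are complements of each other; the names kept and dropped are hypothetical.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[c(1, 1, 1, 2, 2, 2)], Value = c(1:5, 4))

# Keep whole groups that contain at least one 4.
kept <- df %>%
  group_by(Group) %>%
  filter(any(Value == 4)) %>%
  ungroup()

# The complement: whole groups that contain no 4 at all.
dropped <- df %>%
  group_by(Group) %>%
  filter(!any(Value == 4)) %>%
  ungroup()

print(kept)
print(dropped)
```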
Interestingly, the same happens with mutate; compare:
df %>%
group_by(Group) %>%
mutate(check1=any(Value==4),
check2=Value==4)
   Group Value check1 check2
  <fctr> <dbl>  <lgl>  <lgl>
1      A     1  FALSE  FALSE
2      A     2  FALSE  FALSE
3      A     3  FALSE  FALSE
4      B     4   TRUE   TRUE
5      B     5   TRUE  FALSE
6      B     4   TRUE   TRUE
A data.table option is
library(data.table)
setDT(df)[, if(any(Value==4)) .SD, by = Group]
# Group Value
#1: B 4
#2: B 5
#3: B 4
In base R, without performing any grouping operation we can do :
subset(df, Group %in% unique(Group[Value == 4]))
# Group Value
#4 B 4
#5 B 5
#6 B 4

Resources