I am trying to use a conditional statement inside a pipe, but it fails.
The data looks like this:
group = rep(letters[1:3], each = 3)
status = c(T,T,T, T,T,F, F,F,F)
value = c(1:9)
df = data.frame(group = group, status = status, value = value)
> df
group status value
1 a TRUE 1
2 a TRUE 2
3 a TRUE 3
4 b TRUE 4
5 b TRUE 5
6 b FALSE 6
7 c FALSE 7
8 c FALSE 8
9 c FALSE 9
Within each group, I want the row with the maximum value, subject to a condition: if any status in the group is TRUE, apply filter(status == T) %>% slice_max(value); otherwise apply slice_max(value) to the whole group.
What I have tried is this:
# way 1
df %>%
group_by(group) %>%
if(any(status) == T) {
filter(status == T) %>% slice_max(value)
} else {
slice_max(value)
}
# way 2
df %>%
group_by(group) %>%
when(any(status) == T,
filter(status == T) %>% slice_max(value),
slice_max(value))
The expected output should look like this:
> expected_df
group status value
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Any help will be highly appreciated!
Try arranging the data by status then value, then just taking the first result
df %>%
group_by(group) %>%
arrange(!status, desc(value)) %>%
slice(1)
Since we arrange by status, a row with a TRUE status comes first whenever the group has one; if it has none, you just get the row with the largest value. Generally it's a bit awkward to combine pipes and if statements. If that's something you want to look into, it's covered in this existing question, but note that a bare if statement doesn't work after group_by, since a plain if is not evaluated once per group.
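For readers who want to see the ordering idea without dplyr, here is a minimal base R sketch of the same approach: sort TRUE rows first, then by descending value, and keep the first row of each group. The names ordered and result are only illustrative, and the data is the example from the question.

```r
# Example data from the question
df <- data.frame(
  group  = rep(letters[1:3], each = 3),
  status = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  value  = 1:9
)

# Within each group, put TRUE rows first (!status sorts FALSE before TRUE),
# then sort by descending value
ordered <- df[order(df$group, !df$status, -df$value), ]

# Keep only the first row of each group
result <- ordered[!duplicated(ordered$group), ]
rownames(result) <- NULL
result
```

Because the sort key does the conditional work, no if statement is needed at all.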
A bit more verbose:
library(dplyr)
df %>%
group_by(group) %>%
filter(if (any(status)) value == max(value[status]) else value == max(value)) %>%
ungroup
# group status value
# <chr> <lgl> <int>
#1 a TRUE 3
#2 b TRUE 5
#3 c FALSE 9
df %>%
group_by(group) %>%
slice(which.max(value*(all(!status)|status)))
# A tibble: 3 x 3
# Groups: group [3]
group status value
<chr> <lgl> <int>
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Though arranging the data first is the best approach.
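The which.max() one-liner is terse, so here is a small base R sketch of what the mask does for a single group. The helper name pick_row is hypothetical. The trick multiplies value by a logical mask, so it quietly assumes the eligible values are positive; the last call shows how it can pick the wrong row otherwise.

```r
# Index of the row the one-liner would select within one group
pick_row <- function(value, status) {
  # TRUE for every row if the group has no TRUE status,
  # otherwise TRUE only for rows whose status is TRUE
  mask <- all(!status) | status
  # Rows that are masked out contribute 0 to the comparison
  which.max(value * mask)
}

pick_row(c(4, 5, 6), c(TRUE, TRUE, FALSE))     # group b: row 2 (value 5)
pick_row(c(7, 8, 9), c(FALSE, FALSE, FALSE))   # group c: row 3 (value 9)

# Caveat: with negative values the masked-out 0 is the largest product,
# so the FALSE row is selected
pick_row(c(-4, -5, -6), c(TRUE, TRUE, FALSE))  # row 3
```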
I have a dataframe df (tibble in my case) in R and several files in a given directory which have a loose correspondence with the elements of one of the columns in df. I want to track which rows in df correspond to these files by adding a column has_file.
Here's what I've tried.
# SETUP
library(tidyverse)
dir.create("temp")
setwd("temp")
LETTERS[1:4] %>%
str_c(., ".png") %>%
file.create()
df <- tibble(x = LETTERS[3:6])
file_list <- list.files()
# ATTEMPT
df %>%
mutate(
has_file = file_list %>%
str_remove(".png") %>%
is.element(x, .) %>%
any()
)
# RESULT
# A tibble: 4 x 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E TRUE
4 F TRUE
I would expect that only the rows with C and D get values of TRUE for has_file, but E and F do as well.
What is happening here, and how may I generate this correspondence in a column?
(Tidyverse solution preferred.)
We need to add rowwise() at the top, because any() is evaluated over the whole column: two elements are already TRUE, so any() returns a single TRUE, which is recycled to fill the entire column. With rowwise(), there is no need for any(), as is.element() returns a single TRUE/FALSE for each element of the 'x' column.
df %>%
rowwise %>%
mutate(
has_file = file_list %>%
str_remove(".png") %>%
is.element(x, .)) %>%
ungroup
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
To see this, compare the result with and without any():
> is.element(df$x, LETTERS[1:4])
[1] TRUE TRUE FALSE FALSE
> any(is.element(df$x, LETTERS[1:4]))
[1] TRUE
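The recycling can be seen in a minimal base R sketch, assuming the same four file names as in the question's setup (they are hard-coded here instead of read from disk). The variable names stems, per_element and collapsed are illustrative. The sketch strips the extension with an anchored, escaped pattern, because in a regular expression an unescaped "." matches any character.

```r
x <- c("C", "D", "E", "F")
file_list <- c("A.png", "B.png", "C.png", "D.png")

# Remove a literal ".png" at the end of each file name
stems <- sub("\\.png$", "", file_list)

# One logical value per element of x
per_element <- x %in% stems

# any() collapses that vector to a single TRUE
collapsed <- any(x %in% stems)

# The single value is recycled down the whole column,
# which is what happened in the original attempt
data.frame(x, per_element, collapsed)
```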
We may also use map to do this
library(purrr)
df %>%
mutate(has_file = map_lgl(x, ~ file_list %>%
str_remove(".png") %>%
is.element(.x, .)))
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
Or, for a vectorized option, skip is.element and use %in% directly
df %>%
mutate(has_file = x %in% str_remove(file_list, ".png"))
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
A dataframe:
exdf <- data.frame(
a = 1:3,
b = c(2,2,2)
)
Sometimes b is present, in which case one can do this:
exdf %>% mutate(c = a / b)
But, sometimes feature b will not be present, in which case:
exdf %>% select(-b) %>% mutate(c = a / b)
Error: Problem with `mutate()` input `c`.
x object 'b' not found
ℹ Input `c` is `a/b`.
I want to tell dplyr to try the mutation, else if something goes wrong just make new feature c all NA_real_ as opposed to a / b.
Can this be done?
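We can use an if/else condition on exists(). Note that exists() also finds any object named b in an enclosing environment (e.g. the global environment), so the check gives a false positive if such an object is defined.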
library(dplyr)
exdf %>%
select(-b) %>%
mutate(c = if(exists('b')) a/b else NA_real_)
Set up a simple if else statement within mutate which checks whether the column name is in the data.frame or not.
> exdf %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 3
# Rowwise:
a b q
<int> <dbl> <dbl>
1 1 2 0.5
2 2 2 1
3 3 2 1.5
> exdf %>%
... dplyr::select(-b) %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 2
# Rowwise:
a q
<int> <dbl>
1 1 NA
2 2 NA
3 3 NA
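The column-name check can also be wrapped in a small helper so the fallback lives in one place. This is only a base R sketch; the function name add_ratio is hypothetical and the column names a, b and c follow the question.

```r
exdf <- data.frame(a = 1:3, b = c(2, 2, 2))

# Add c = a / b when column b is present, otherwise fill c with NA_real_
add_ratio <- function(df) {
  # Look at the data frame's own column names, so an unrelated
  # object called b elsewhere cannot affect the result
  df$c <- if ("b" %in% names(df)) df$a / df$b else NA_real_
  df
}

with_b    <- add_ratio(exdf)       # c is a / b
without_b <- add_ratio(exdf["a"])  # b is missing, so c is all NA
```

Checking names(df) directly is a design choice: it tests for the one failure that is expected (a missing column) instead of swallowing every possible error.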
So I have a tibble in the form
passengerId FlightChain
1 1 c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
2 2 TRUE
3 3 c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
and I'm trying to get the longest run of consecutive "TRUE"s between "FALSE"s as its own column.
So in this case:
passengerId fullFlightChain
1 1 3
2 2 1
3 3 2
I first had the tibble in format:
passengerId flightTo
<int> <lgl>
1 1 TRUE
2 1 TRUE
3 1 FALSE
4 1 TRUE
5 1 TRUE
6 1 TRUE
7 1 FALSE
8 1 TRUE
9 2 TRUE
10 3 TRUE
11 3 FALSE
etc....
so if it would actually be better to work from there (grouping by passengerId), I'm all ears. From what I've heard, rle() is a function that might work, but I can't get it to work properly.
Thanks
Edit: now with code
df_q3 <- df1 %>%
group_by(passengerId) %>%
arrange(passengerId) %>%
mutate(flightToUK = if_else(to == "uk", FALSE, TRUE)) %>%
summarise(fullFlightChain = paste(flightToUK, collapse = "-")) %>%
mutate(fullFlightChainSplit = str_split(fullFlightChain, "-")) %>%
map(fullFlightChainSplit,rle(fullFlightChainSplit))) %>%
print()
Where the last line is where I'm trying to make the count as seen in the first table
Taking as an input your initial tibble format, i.e.:
library(readr)
library(dplyr)
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE")
You can compute the longest run of TRUE values per passenger with rle():
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 1
EDIT: Adding the missing rows to your initial tibble and producing the output:
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE
3 TRUE
3 TRUE
3 FALSE")
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 2
The rle function encodes a vector as runs of values and their lengths, which lets you find the maximum length of a run of TRUE values. Something along these lines, although untested in the absence of a reproducible example:
RLE <- rle(flightTo)
mxT <- max( RLE$lengths[RLE$values == TRUE] )
Or for multiple items in a list:
lapply( list_name, function(line){
RLE <- rle(line)
max( RLE$lengths[RLE$values == TRUE] ) })
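To make the run-length encoding concrete, here is a small base R sketch using the flights from the question's long-format table. The helper name longest_true_run is hypothetical, and the sketch assumes there are no missing values. It returns 0 when a passenger has no TRUE at all, which avoids calling max() on an empty vector.

```r
# Flights of passenger 1 from the question
flightTo <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

runs <- rle(flightTo)
runs$lengths  # 2 1 3 1 1
runs$values   # TRUE FALSE TRUE FALSE TRUE

# Longest run of TRUE values; 0 if there is no TRUE
longest_true_run <- function(x) {
  runs <- rle(as.logical(x))
  true_runs <- runs$lengths[runs$values]
  if (length(true_runs) == 0) 0L else max(true_runs)
}

# Apply it per passenger on the long-format data
passengerId <- c(rep(1, 8), 2, 3, 3)
allFlights  <- c(flightTo, TRUE, TRUE, FALSE)
per_passenger <- tapply(allFlights, passengerId, longest_true_run)
per_passenger
```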
Here is both a reproducible example and the solution based on rle
library(tibble)
library(magrittr)
library(dplyr)
set.seed(4242)
tbl <- tibble(passID = sample(1:3, 20, replace = TRUE),
flightTO = sample(c(T, F), 20, replace = TRUE)) %>%
arrange(passID)
rle(tbl$flightTO)
tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
UPDATE
Simply use my code to create a temporary object, which you then join to the main summarised object; this keeps the critical "Max" column, which holds the maximum run length by passID. Here "tbl" is your "df1".
temp_obj <- tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
your_new_obj_where_you_summarise_other_stuff <- tbl %>%
group_by(passID) %>%
summarise(..other summary statistics you need..) %>%
inner_join(temp_obj, by = "passID")
I'd like to mark my first top-ranked value with a marker using the tidyverse - if possible.
Assume the following data
test = tibble(group=c(1,1,1,1,2,2,2,2), values = c(1,2,3,4,7,6,5,2))
I'd now like to mark the top two values in each group, which would be the values 3 and 4 for group 1 and 7 and 6 for group 2, yielding:
# A tibble: 8 x 3
group values marker
<dbl> <dbl> <lgl>
1 1 1 FALSE
2 1 2 FALSE
3 1 3 TRUE
4 1 4 TRUE
5 2 7 TRUE
6 2 6 TRUE
7 2 5 FALSE
8 2 2 FALSE
I thought about ranking them and then doing a comparison to get the boolean values, or utilizing purrr, but I could not figure out how.
After grouping by 'group', rank the 'values' and check whether each rank is %in% the last 'n' elements of the sorted ranks to create a logical vector
library(tidyverse)
test %>%
group_by(group) %>%
mutate(marker = dense_rank(values),
marker = marker %in% tail(sort(marker), 2))
Or directly use order and %in% on the tail
test %>%
group_by(group) %>%
mutate(marker = values %in% tail(values[order(values)], 2))
Or
test %>%
group_by(group) %>%
mutate(marker = dense_rank(values) > n()-2)
Or it can be done in a single line with data.table
library(data.table)
setDT(test)[order(values), marker := values %in% tail(values, 2), group]
Another option: after grouping by 'group', get the top_n rows (n specified as 2, wt as 'values'), create a 'marker' column of TRUE values, right_join with the original dataset, and then replace the NA elements with FALSE
test %>%
group_by(group) %>%
top_n(2, values) %>%
mutate(marker = TRUE) %>%
right_join(test) %>%
mutate(marker = replace_na(marker, FALSE))
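The ranking idea behind these variants can be sketched in base R as well. The helper name is_top2 is hypothetical; it ranks the negated values so that rank 1 is the largest value in the group. With ties.method = "min", values tied for second place are all marked, so more than two rows per group can be TRUE when there are ties.

```r
test <- data.frame(
  group  = c(1, 1, 1, 1, 2, 2, 2, 2),
  values = c(1, 2, 3, 4, 7, 6, 5, 2)
)

# TRUE for the two largest values of a vector (rank 1 = largest)
is_top2 <- function(v) rank(-v, ties.method = "min") <= 2

# ave() applies the helper within each group and returns 0/1 values
# in the original row order, so convert back to logical
test$marker <- as.logical(ave(test$values, test$group, FUN = is_top2))
test
```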
I am currently trying to apply the summarise function in order to isolate the relevant observations from a large data set. A simple reproducible example is given here:
df <- data.frame(c(1,1,1,2,2,2,3,3,3), as.logical(c(TRUE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)),
as.numeric(c(0,5,0,0,0,0,7,0,7)))
colnames(df) <- c("ID", "Status", "Price")
ID Status Price
1 1 TRUE 0
2 1 FALSE 5
3 1 TRUE 0
4 2 TRUE 0
5 2 TRUE 0
6 2 TRUE 0
7 3 FALSE 7
8 3 TRUE 0
9 3 FALSE 7
I would like to summarise the table by ID, getting the status TRUE only if all three observations are TRUE (figured out), and then get the price corresponding to that status (i.e. 5 for ID 1 as FALSE, 0 for ID 2 as TRUE and 7 for ID 3 as FALSE).
From Summarize with conditions in dplyr I have figured out that I can, as usual, specify the conditions in square brackets. My code so far thus looks like this:
library(dplyr)
result <- df %>%
group_by(ID) %>%
summarize(Status = all(Status), Test = ifelse(all(Status) == TRUE,
first(Price[Status == TRUE]), first(Price[Status == FALSE])))
# This is what I get:
# A tibble: 3 x 3
ID Status Test
<dbl> <lgl> <dbl>
1 1. FALSE 0.
2 2. TRUE 0.
3 3. FALSE 7.
But as you can see, for ID = 1 it gives an incorrect price. I have been trying this forever, so I would appreciate any hint as to where I have been going wrong.
We could move all(Status) to be the second argument in summarise (or change the column name), so that the original 'Status' column is still available when 'Test' is computed. It can also be done with if/else, as the condition returns a single TRUE/FALSE based on whether all of the 'Status' values are TRUE or not
df %>%
group_by(ID) %>%
summarise( Test = if(all(Status)) first(Price[Status]) else
first(Price[!Status]), Status = all(Status))
# A tibble: 3 x 3
# ID Test Status
# <dbl> <dbl> <lgl>
#1 1 5 FALSE
#2 2 0 TRUE
#3 3 7 FALSE
NOTE: It is better not to use ifelse with unequal lengths for its arguments
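The note about ifelse can be checked with a short base R sketch. The variable names are illustrative. ifelse() returns a result shaped like its test argument, so a length-one test yields a length-one result and the extra elements of the other arguments are silently dropped; a plain if/else is the more natural tool for a single condition.

```r
price  <- c(0, 5, 0)
status <- c(TRUE, FALSE, TRUE)

# Length-one test: only the first element of the 'no' vector is returned
scalar_test <- ifelse(FALSE, 1, c(10, 20, 30))
scalar_test          # 10
length(scalar_test)  # 1

# Vector test: one result per element, which is what ifelse is meant for
vector_test <- ifelse(status, "ok", "not ok")
vector_test          # "ok" "not ok" "ok"

# Single condition: use if/else and pick the element explicitly
chosen <- if (all(status)) price[status][1] else price[!status][1]
chosen               # 5
```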
Could do:
df %>%
group_by(ID) %>%
mutate(status = Status) %>%
summarise(
Status = all(Status),
Test = ifelse(Status == TRUE,
first(Price),
first(Price[status == FALSE]))
)
Output:
# A tibble: 3 x 3
ID Status Test
<dbl> <lgl> <dbl>
1 1 FALSE 5
2 2 TRUE 0
3 3 FALSE 7
The issue is that you want to use Status for the Test column after you've already overwritten it, so it no longer contains the original values.
Make a copy beforehand (I've saved it in status), run ifelse on the copy, and it'll work fine.