I have a dataframe df with a set of IDs that may appear multiple times with a different Status for each row. I need to create a 0/1 indicator column for whether Status "B" ever appears for that ID. B_appears shows my desired result.
I have done something kind of related by creating a "Count" column that counts the number of times the Status listed in that row appears for that ID. But I can't figure out how to create the indicator variable that is specifically related to Status "B."
This is how I created the "Count" column, fwiw.
df <- ddply(df, .(ID, Status), transform, Count = length(ID))
Thanks in advance!
ID  Status  Count  B_appears
1   A       1      0
2   A       1      1
2   B       2      1
2   B       2      1
3   A       1      1
3   B       1      1
With the tidyverse, we group by 'ID', compute Count as the group size (n()), and create 'B_appears' by checking whether 'B' is %in% the Status values of that group, converting the logical result to 0/1 (with + or as.integer). Note that Count here is the number of rows per ID, not per ID and Status as in the question.
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(Count = n(),
B_appears = +('B' %in% Status)) %>%
# or may also create B_appears as
# B_appears = +(any(Status %in% 'B'))) %>%
ungroup
-output
# A tibble: 6 × 4
ID Status Count B_appears
<int> <chr> <int> <int>
1 1 A 1 0
2 2 A 3 1
3 2 B 3 1
4 2 B 3 1
5 3 A 2 1
6 3 B 2 1
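If Count should instead match the question's table (rows per ID and Status), a minimal base R sketch with ave() could look like the following. It is only an illustration on the same six rows, not part of the answer above.

```r
# Same six rows as the example data
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 3L, 3L),
  Status = c("A", "A", "B", "B", "A", "B")
)

# Count: number of rows sharing the same ID *and* Status
df$Count <- ave(seq_along(df$ID), df$ID, df$Status, FUN = length)

# B_appears: 1 if any row of that ID has Status "B", otherwise 0
df$B_appears <- as.integer(ave(df$Status == "B", df$ID, FUN = any))

df
```

ave() returns a vector as long as its input, so both columns line up with the original rows without a separate merge step.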
data
df <- structure(list(ID = c(1L, 2L, 2L, 2L, 3L, 3L), Status = c("A",
"A", "B", "B", "A", "B")), row.names = c(NA, -6L), class = "data.frame")
I would like to create a new column, fill it in only where a specific condition holds (here x > 2), and at the same time overwrite another existing column (here auxiliary) for the rows where that condition (x > 2) is TRUE.
df <- tibble(x = 1:5, y = 1:5, auxiliary = NA)
# A tibble: 5 x 3
x y auxiliary
<int> <dbl> <lgl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
I can do this successfully in two different calls within mutate() :
df %>%
mutate(result = if_else(condition = x > 2,
true = x+y,
false = NA_real_),
auxiliary = if_else(condition = x > 2,
true = "Calculation done",
false = NA_character_))
# A tibble: 5 x 4
x y auxiliary result
<int> <dbl> <chr> <dbl>
1 1 NA NA
2 2 NA NA
3 3 Calculation done 6
4 4 Calculation done 8
5 5 Calculation done 10
But the condition (condition = x > 2) is repeated, which in more complex cases makes the code hard to read and error-prone, especially when there are multiple conditions.
Is there a way to simplify the code above so that the condition is not repeated? The steps would be:
Create new variable (mutate())
Document only if condition is matched (if_else or case_when())
Write another column's value only if the row's condition is matched. (I'm stuck here)
Something that would look like this :
df %>%
mutate(result = case_when(
x > 2 ~ x + y & auxiliary == "Calculation done", # we'd add the column reference here...
TRUE ~ NA_real_ & auxiliary = NA_character_))
Many thanks ! Any solution from the tidyverse would be ideal.
You can save the result of the condition in a column and use that to avoid evaluating the same condition again and again.
library(dplyr)
df <- tibble(x = 1:5, y = 1:5)
df %>%
mutate(condition = x > 2,
result = if_else(condition,
true = x+y,
false = NA_integer_),
auxiliary = if_else(condition,
true = "Calculation done",
false = NA_character_))
# x y condition result auxiliary
# <int> <int> <lgl> <int> <chr>
#1 1 1 FALSE NA NA
#2 2 2 FALSE NA NA
#3 3 3 TRUE 6 Calculation done
#4 4 4 TRUE 8 Calculation done
#5 5 5 TRUE 10 Calculation done
I would suggest saving the condition which should be used multiple times as string and then using the string as variable in the code, e.g.:
condition <- "x>2"
df %>%
mutate(result = ifelse(eval(parse(text=condition)),
x+y,
NA),
auxiliary = ifelse(eval(parse(text=condition)),
"Calculation done",
NA))
Note that I am using base ifelse() to avoid the restriction that both branches must have the same type ("dplyr::if_else is specifically written to force you to have the same type in your true and false arguments."). For further information, see e.g. Different behavior of if else statement and if_else.
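As a variation on the same idea, the condition can be stored as an unevaluated expression with quote() instead of a string, and evaluated once against the data. This is only a base R sketch; the names cond_expr and cond are made up for the illustration.

```r
df <- data.frame(x = 1:5, y = 1:5)

# Store the condition once, as an unevaluated expression (illustrative name)
cond_expr <- quote(x > 2)

# Evaluate it a single time, using the columns of df
cond <- eval(cond_expr, df)

# Reuse the resulting logical vector for both columns
df$result    <- ifelse(cond, df$x + df$y, NA)
df$auxiliary <- ifelse(cond, "Calculation done", NA)

df
```

Keeping the condition as an expression avoids parsing text at run time, and the logical vector is computed only once.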
It is possible to achieve the kind of abstraction you would like, but it requires more setup. mutate is actually more flexible than you might think: you can pass it a braced block of code. Suppose you write something like A %>% mutate({...}). If the block {...} returns a data frame, its columns are created directly in A, or replace existing columns in A that share the same names. So you can do
df %>% mutate({
cond <- x > 2
out <- tibble(.rows = n())
mapply(
\(var, true, false) out[[var]] <<- if_else(cond, true, false),
var = c("result", "auxiliary"),
true = list(x + y, "Calculation done"),
false = list(NA_integer_, NA_character_)
)
out
})
Output
# A tibble: 5 x 4
x y auxiliary result
<int> <int> <chr> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 Calculation done 6
4 4 4 Calculation done 8
5 5 5 Calculation done 10
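The same mechanism (mutate() unpacking an unnamed data frame) can also be sketched with a small helper function, so the condition is written once at the call site. fill_when is a hypothetical helper invented for this illustration, and base ifelse() is used to keep the types relaxed.

```r
library(dplyr)

df <- tibble(x = 1:5, y = 1:5, auxiliary = NA)

# Hypothetical helper: receives the condition once and returns both columns
fill_when <- function(cond, value, label = "Calculation done") {
  tibble(
    result    = ifelse(cond, value, NA),
    auxiliary = ifelse(cond, label, NA_character_)
  )
}

# The unnamed tibble returned by fill_when() is unpacked into columns:
# 'auxiliary' is overwritten and 'result' is added
out <- df %>% mutate(fill_when(x > 2, x + y))
out
```

This relies on the behaviour described above: an unnamed data frame passed to mutate() is spliced into the result column by column.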
I have a set of observations of nurses conducting patient care, recording what they touch or do. This might look like:
df<-data.frame(ActivityID=rep(1:3, each=3),
Action=c("Door", "Hygiene", "Patient", "Door", "Patient", "Door", "Door", "Patient", "Hygiene"))
I'd like to check, for each ActivityID, whether they wash their hands before the first time they touch the patient, and count for how many ActivityIDs this occurs. Essentially, I'd like to know whether X happens before Y in each activity.
My thought was to use which to find the first occurrence for both Patient and Hygiene:
require(dplyr)
a=df%>%
group_by(ActivityID) %>%
which(Action=="Hygiene")
b=df%>%
group_by(ActivityID) %>%
which(Action=="Patient")
which(a<b)
But this doesn't seem to work in pipe form and sometimes, they don't touch the patient. Any help would be much appreciated.
Total unique activities can be calculated using :
library(dplyr)
total_Activities <- n_distinct(df$ActivityID)
total_Activities
#[1] 3
We can write a function to check if hands were washed anytime before touching the Patient for first time:
hands_washed_before_touch <- function(x) {
ind1 <- which(x == 'Hygiene')
ind2 <- which(x == 'Patient')
length(ind1) && length(ind2) && ind1[1] < ind2[1]
}
and use it by group :
df1 <- df %>%
group_by(ActivityID) %>%
summarise(hands_washed = hands_washed_before_touch(Action))
df1
# ActivityID hands_washed
# <int> <lgl>
#1 1 TRUE
#2 2 FALSE
#3 3 FALSE
To get the count, sum the hands_washed column, i.e. sum(df1$hands_washed).
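For reference, the same check and count can be sketched in base R with tapply(). This reuses the function defined above and additionally flags activities in which the patient was never touched, since the question mentions that this can happen. The names washed and touched are illustrative.

```r
df <- data.frame(
  ActivityID = rep(1:3, each = 3),
  Action = c("Door", "Hygiene", "Patient", "Door", "Patient", "Door",
             "Door", "Patient", "Hygiene")
)

# TRUE only if both actions occur and the first Hygiene precedes the first Patient
hands_washed_before_touch <- function(x) {
  ind1 <- which(x == "Hygiene")
  ind2 <- which(x == "Patient")
  length(ind1) > 0 && length(ind2) > 0 && ind1[1] < ind2[1]
}

# One logical value per ActivityID
washed  <- tapply(df$Action, df$ActivityID, hands_washed_before_touch)
touched <- tapply(df$Action == "Patient", df$ActivityID, any)

sum(washed)    # activities with hand washing before the first patient contact
sum(touched)   # activities in which the patient was touched at all
```

Comparing sum(washed) with sum(touched) rather than with the total number of activities keeps activities without patient contact out of the denominator, if that is the intended reading.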
Here is another alternative using case_when from dplyr package.
library(dplyr)
df1<- df %>%
group_by(ActivityID) %>%
mutate(hands_washed = case_when(
!any(Action == "Hygiene") ~ "False",
min(c(which(Action == "Hygiene"), Inf)) > which.max(Action == "Patient")~ "False",
TRUE ~ "True"))%>%
ungroup()
df1
# A tibble: 9 x 3
# ActivityID Action hands_washed
# <int> <fct> <chr>
#1 1 Door True
#2 1 Hygiene True
#3 1 Patient True
#4 2 Door False
#5 2 Patient False
#6 2 Door False
#7 3 Door False
#8 3 Patient False
#9 3 Hygiene False
I want to do a row-wise check of whether multiple columns are all equal. I came up with a convoluted approach that counts the occurrences of each value per group, but this seems somewhat... cumbersome.
sample data
sample_df <- data.frame(id = letters[1:6], group = rep(c('r','l'),3), stringsAsFactors = FALSE)
set.seed(4)
for(i in 3:5) {
sample_df[i] <- sample(1:4, 6, replace = TRUE)
sample_df
}
desired output
library(tidyverse)
sample_df %>%
gather(var, value, V3:V5) %>%
mutate(n_var = n_distinct(var)) %>% # get the number of columns
group_by(id, group, value) %>%
mutate(test = n_distinct(var) == n_var ) %>% # check how frequent values occur per "var"
spread(var, value) %>%
select(-n_var)
#> # A tibble: 6 x 6
#> # Groups: id, group [6]
#> id group test V3 V4 V5
#> <chr> <chr> <lgl> <int> <int> <int>
#> 1 a r FALSE 3 3 1
#> 2 b l FALSE 1 4 4
#> 3 c r FALSE 2 4 2
#> 4 d l FALSE 2 1 2
#> 5 e r TRUE 4 4 4
#> 6 f l FALSE 2 2 3
Created on 2019-02-27 by the reprex package (v0.2.1)
Does not need to be dplyr. I just used it for showing what I want to achieve.
There are a bunch of ways to check for equality row-wise. Two good ways:
# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)
# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
If you only want to test some columns, then use a subset of columns rather than the whole data frame:
cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
Note I use df[cols_to_test] instead of df[, cols_to_test] when I want to be sure the result is a data.frame even if cols_to_test has length 1.
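Applied to the question's data, the first approach might look like the sketch below. The values of V3 to V5 are typed in from the output shown in the question rather than regenerated with sample(), because the random draws can differ between R versions.

```r
# Values copied from the desired output in the question
sample_df <- data.frame(
  id = letters[1:6],
  group = rep(c("r", "l"), 3),
  V3 = c(3L, 1L, 2L, 2L, 4L, 2L),
  V4 = c(3L, 4L, 4L, 1L, 4L, 2L),
  V5 = c(1L, 4L, 2L, 2L, 4L, 3L),
  stringsAsFactors = FALSE
)

cols_to_test <- c("V3", "V4", "V5")

# Compare every tested column with the first one, then require all to match
first_col <- sample_df[[cols_to_test[1]]]
sample_df$test <- rowSums(sample_df[cols_to_test] == first_col) == length(cols_to_test)

sample_df
```

If the tested columns can contain NA, the comparison yields NA for those rows, so they would need separate handling.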
Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first events. For example, if seq = rows 8, 10, and 11 was counted, then seq = rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
full_join(df,by='id',suffix=c('1','2')) %>%
full_join(df,by='id') %>%
rename(event3 = event, time3 = time) %>%
filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
group_by(id,time1) %>%
slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
group_by(id) %>%
count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes it is possible to generalize this to arbitrary length sequences but requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function on a list of objects as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
df2 = df2 %>%
mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
filter(seq_string == paste0(seq,collapse=''))
time_diff = df2 %>% select(grep('time',names(df2))) %>%
t %>%
as.data.frame() %>%
lapply(diff) %>%
unlist %>% matrix(ncol=length(seq)-1,byrow=TRUE) %>% # one column per difference between consecutive events
as.data.frame
foreach(i=seq_along(time_diff),.combine=data.frame) %do%
{
time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
} %>%
Reduce(`&`,.) %>%
which %>%
slice(df2,.) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
group_by(id,time1) %>%
slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
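The join-based approach builds every combination of rows per id before filtering. As a rough alternative, the rules can also be checked directly with a small recursive search in base R. This is only a sketch: count_sequences and can_complete are made-up names, and the pattern vector is called pattern here to avoid masking base::seq.

```r
df <- data.frame(
  id    = c(rep(1, 7), rep(2, 5)),
  event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
  time  = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24)
)
pattern <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Counts the first events from which the whole pattern can be completed
count_sequences <- function(event, time, pattern, lb, ub) {
  # Can steps k, k+1, ... be matched, given the time of the previous event?
  can_complete <- function(prev_time, k) {
    if (k > length(pattern)) return(TRUE)
    gap <- time - prev_time
    candidates <- which(event == pattern[k] & gap >= lb[k] & gap <= ub[k])
    for (i in candidates) {
      if (can_complete(time[i], k + 1)) return(TRUE)
    }
    FALSE
  }
  # The first bound applies to the absolute time of the first event
  starts <- which(event == pattern[1] & time >= lb[1] & time <= ub[1])
  sum(vapply(starts, function(i) can_complete(time[i], 2), logical(1)))
}

counts <- sapply(split(df, df$id), function(d)
  count_sequences(d$event, d$time, pattern, time_LB, time_UB))
df1 <- data.frame(id = names(counts), count = unname(counts))
df1
```

Because each starting event is counted at most once, the rule that counted sequences must differ in their first event is satisfied by construction, and the pattern can have any length.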
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse='')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=T)[[1]]))
data.frame(id=names(df1), count=unlist(df1))
# id count
# 1 1 2
# 2 2 2
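One caveat with counting via length(): when nothing matches, gregexpr() returns -1, so the length is 1 rather than 0. A hedged sketch of a counting helper that handles this case follows. count_matches is a hypothetical name, and the approach still assumes positive integer time stamps, as the string encoding above does.

```r
df <- data.frame(
  id    = c(rep(1, 7), rep(2, 5)),
  event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
  time  = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24)
)

# Encode each id as a string: one character per time unit, "-" where nothing happens
df.str <- lapply(split(df, df$id), function(d) {
  z <- rep("-", max(d$time))
  z[d$time] <- as.character(d$event)
  paste(z, collapse = "")
})

# Hypothetical helper: number of match positions, 0 when there is no match
count_matches <- function(regex, s) {
  m <- gregexpr(regex, s, perl = TRUE)[[1]]
  sum(m > 0)
}

regex <- "(?=a.{1,7}b.{11,17}a)"
counts <- sapply(df.str, function(s) count_matches(regex, s))
data.frame(id = names(counts), count = unname(counts))

count_matches(regex, "a---b")   # no complete sequence in this string
```

The lookahead keeps matches zero-width, so overlapping sequences that start at different positions are still found.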