How to check multiple values using if condition [duplicate] - r

I have the data frame shown below:
Records:
ID Remarks Value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14
Using the data frame above, I want to add a new column Status to it.
If Remarks is ABC, AAB or ABB, then Status should be TRUE; for XYZ and ZZX it should be FALSE.
I am using the method below, but it doesn't work.
Records$Status<-ifelse(Records$Remarks %in% ("ABC","AAB","ABB"),"TRUE",
ifelse(Records$Remarks %in%
("XYZ","ZZX"),"FALSE"))
Based on Status, I want to derive the following output:
ID TRUE FALSE Sum
1 2 1 37
2 1 1 26

Records$Status<-ifelse(Records$Remarks %in% c("ABC","AAB","ABB"),TRUE,
ifelse(Records$Remarks %in%
c("XYZ","ZZX"),FALSE, NA))
You need to wrap your lists of strings in c(), and add an "else" value for the second ifelse (but see Roman's answer below for a better way of doing this with case_when). Also note that I changed "TRUE" and "FALSE" (character strings) to TRUE and FALSE (logical values), so that sum() can count them.
For the summary (using dplyr):
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Status), falses=sum(!Status), sum=sum(Value))
# A tibble: 2 x 4
ID trues falses sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Of course, if you don't really need the intermediate Status column but just want the summary table, you can skip the first step altogether:
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Remarks %in% c("ABC","AAB","ABB")),
falses=sum(Remarks %in% c("XYZ","ZZX")),
sum=sum(Value))

Since it makes sense to use dplyr for your second question (see @iod's answer), it is also a good opportunity to use the package's very straightforward case_when() function for the first part.
Records %>%
mutate(Status = case_when(Remarks %in% c("ABC", "AAB", "ABB") ~ TRUE,
Remarks %in% c("XYZ", "ZZX") ~ FALSE,
TRUE ~ NA))
ID Remarks Value Status
1 1 ABC 10 TRUE
2 1 AAB 12 TRUE
3 1 ZZX 15 FALSE
4 2 XYZ 12 FALSE
5 2 ABB 14 TRUE

This approach will scale to a large number of remarks.
Load the data and prepare a matching data frame
The second data frame maps each remark to its TRUE or FALSE value.
library(readr)
library(dplyr)
library(tidyr)
dtf <- read_table("id remarks value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14")
truefalse <- tibble(remarks = c("ABC", "AAB", "ABB", "ZZX", "XYZ"), # tibble() replaces the deprecated data_frame()
tf = c(TRUE, TRUE, TRUE, FALSE, FALSE))
Group by id and summarise
This is the format asked for in the question.
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id) %>%
summarise(true = sum(tf),
false = sum(!tf),
value = sum(value))
# A tibble: 2 x 4
id true false value
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Alternative proposal: group by id, tf and summarise
This option retains more details on the spread of value along the grouping variables id and tf.
dtf %>%
left_join(truefalse, by = "remarks") %>%
group_by(id, tf) %>%
summarise(n = n(),
value = sum(value))
# A tibble: 4 x 4
# Groups: id [?]
id tf n value
<int> <lgl> <int> <int>
1 1 FALSE 1 15
2 1 TRUE 2 22
3 2 FALSE 1 12
4 2 TRUE 1 14

In most cases, life is easier and lines are shorter without ifelse:
# short version
df$Status <- df$Remarks %in% c("ABC","AAB","ABB")
This version is OK for most purposes, but it has shortcomings: Status will be FALSE if Remarks is NA or, say, "garbage", whereas one might want NA in these cases and FALSE only if Remarks %in% c("XYZ", "ZZX"). So one can start from NA and fill in the two known groups by logical indexing:
df$Status <- NA
df$Status[df$Remarks %in% c("ABC","AAB","ABB")] <- TRUE
df$Status[df$Remarks %in% c("XYZ","ZZX")] <- FALSE
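To make the difference between FALSE and NA concrete, here is a minimal sketch using a named lookup vector. It is not part of the answer above; the `remarks` vector (with two unexpected entries) and the name `lookup` are made up for illustration.

```r
# Hypothetical input: the question's remarks plus two unexpected values
remarks <- c("ABC", "AAB", "ZZX", "XYZ", "ABB", "garbage", NA)

# Named logical vector mapping each known remark to its status
lookup <- c(ABC = TRUE, AAB = TRUE, ABB = TRUE, XYZ = FALSE, ZZX = FALSE)

# Indexing by name returns NA for any name that is not in the lookup
status <- unname(lookup[remarks])
status
```

Known remarks map to TRUE or FALSE, while "garbage" and NA both come out as NA, so unexpected values stay visible instead of being silently counted as FALSE.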
And the summary table with base R:
aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
Umm... perhaps some formatting would be useful:
t1 <- aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
t1 <- t1[, c(1,3,2)]
colnames(t1) <- c("ID", "", "Sum")
t1
# ID FALSE TRUE Sum
# 1 1 1 2 37
# 2 2 1 1 26

This one returns the correct result only if there are exactly two groups ("ABC", "AAB", "ABB" vs. "XYZ", "ZZX", ...). For me, @iod's solution is more R-like, but I've tried to avoid ifelse and do it another way:
Code:
library(tidyverse)
dt %>%
group_by(ID, Status = grepl("^A[AB][CB]$", Remarks)) %>%
summarise(N = n(), Sum = sum(Value)) %>%
spread(Status, N) %>%
summarize_all(sum, na.rm = TRUE) %>% # data still grouped by ID
select("ID", "TRUE", "FALSE", "Sum")
# A tibble: 2 x 4
ID `TRUE` `FALSE` Sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Data:
dt <- structure(
list(ID = c(1L, 1L, 1L, 2L, 2L),
Remarks = c("ABC", "AAB", "ZZX", "XYZ", "ABB"),
Value = c(10L, 12L, 15L, 12L, 14L)),
.Names = c("ID", "Remarks", "Value"), class = "data.frame", row.names = c(NA, -5L)
)

Related

R - create indicator column for whether a value appears within a group

I have a dataframe df with a set of IDs that may appear multiple times with a different Status for each row. I need to create a 0/1 indicator column for whether Status "B" ever appears for that ID. B_appears shows my desired result.
I have done something kind of related by creating a "Count" column that counts the number of times the Status listed in that row appears for that ID. But I can't figure out how to create the indicator variable that is specifically related to Status "B."
This is how I created the "Count" column, fwiw.
df <- ddply(df, .(ID, Status), transform, Count = length(ID)) # requires library(plyr)
Thanks in advance!
ID Status Count B_appears
1 A 1 0
2 A 1 1
2 B 2 1
2 B 2 1
3 A 1 1
3 B 1 1
With tidyverse, we group by 'ID', get the Count column as the group size (n()), and create 'B_appears' by checking whether 'B' is %in% the Status values of that group, then converting the logical to binary (with + or as.integer).
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(Count = n(),
B_appears = +('B' %in% Status)) %>%
# or may also create B_appears as
# B_appears = +(any(Status %in% 'B'))) %>%
ungroup
-output
# A tibble: 6 × 4
ID Status Count B_appears
<int> <chr> <int> <int>
1 1 A 1 0
2 2 A 3 1
3 2 B 3 1
4 2 B 3 1
5 3 A 2 1
6 3 B 2 1
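For comparison, the same two columns can be sketched in base R with ave(), which applies a function within each group and recycles the result across the group's rows. This is an alternative under the same example data, not the method used above.

```r
df <- data.frame(ID = c(1L, 2L, 2L, 2L, 3L, 3L),
                 Status = c("A", "A", "B", "B", "A", "B"))

# Number of rows per ID, repeated on every row of that ID
df$Count <- ave(seq_along(df$ID), df$ID, FUN = length)

# 1 if any row of the ID has Status "B", otherwise 0
df$B_appears <- as.integer(ave(df$Status == "B", df$ID, FUN = any))

df
```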
data
df <- structure(list(ID = c(1L, 2L, 2L, 2L, 3L, 3L), Status = c("A",
"A", "B", "B", "A", "B")), row.names = c(NA, -6L), class = "data.frame")

Writing other columns when matching a condition

I would like to create a new column, document it only when it matches a specific condition (here x > 2 ) and then directly overwrite another existing column (here auxiliary) for these rows where the condition (x > 2) returned TRUE.
df <- tibble(x = 1:5, y = 1:5, auxiliary = NA)
# A tibble: 5 x 3
x y auxiliary
<int> <dbl> <lgl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
I can do this successfully in two different calls within mutate() :
df %>%
mutate(result = if_else(condition = x > 2,
true = x+y,
false = NA_real_),
auxiliary = if_else(condition = x > 2,
true = "Calculation done",
false = NA_character_))
# A tibble: 5 x 4
x y auxiliary result
<int> <dbl> <chr> <dbl>
1 1 NA NA
2 2 NA NA
3 3 Calculation done 6
4 4 Calculation done 8
5 5 Calculation done 10
But there's some code repetition (condition = x > 2) which, in more complex cases, makes the code hard to read and error-prone, especially when there are multiple conditions.
Is there a way to simplify the code above so that the condition is not repeated? The steps are:
Create new variable (mutate())
Document only if condition is matched (if_else or case_when())
Write another column's value only if the row's condition is matched. (I'm stuck here)
Something that would look like this :
df %>%
mutate(result = case_when(
x > 2 ~ x + y & auxiliary == "Calculation done", # we'd add the column reference here...
TRUE ~ NA_real & auxiliary = NA_character_))
Many thanks ! Any solution from the tidyverse would be ideal.
You can save the result of the condition in a column and use that to avoid evaluating the same condition again and again.
library(dplyr)
df <- tibble(x = 1:5, y = 1:5)
df %>%
mutate(condition = x > 2,
result = if_else(condition,
true = x+y,
false = NA_integer_),
auxiliary = if_else(condition,
true = "Calculation done",
false = NA_character_))
# x y condition result auxiliary
# <int> <int> <lgl> <int> <chr>
#1 1 1 FALSE NA NA
#2 2 2 FALSE NA NA
#3 3 3 TRUE 6 Calculation done
#4 4 4 TRUE 8 Calculation done
#5 5 5 TRUE 10 Calculation done
I would suggest saving the condition which should be used multiple times as string and then using the string as variable in the code, e.g.:
condition <- "x>2"
df %>%
mutate(result = ifelse(eval(parse(text=condition)),
x+y,
NA),
auxiliary = ifelse(eval(parse(text=condition)),
"Calculation done",
NA))
Note that I am using the base ifelse function to avoid the restriction that both branches must have the same type ("dplyr::if_else is specifically written to force you to have the same type in your true and false arguments."). For further information, see e.g. "Different behavior of if else statement and if_else".
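The flip side of that flexibility is that base ifelse coerces silently. A small base R sketch of the behaviour (the `cond` vector and the "none" placeholder are made up for illustration):

```r
cond <- c(TRUE, FALSE, TRUE)

# Branches of different types: the result is coerced to character
mixed <- ifelse(cond, 1L, "none")
class(mixed)

# When every element is TRUE, the 'no' branch is never used,
# so the result keeps the type of the 'yes' branch
all_true <- ifelse(c(TRUE, TRUE), 1L, "none")
class(all_true)
```

The column type can therefore depend on the data, which is the kind of surprise the stricter if_else is designed to prevent.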
It is possible to achieve the kind of abstraction you would like, but it requires more setup. mutate is actually more flexible than you might think: you can pass a braced code block to it. Suppose you write something like A %>% mutate({...}). If the block {...} returns a data frame, then its columns will be created directly in A, or will replace the existing columns in A that share the same names. So you can do
df %>% mutate({
cond <- x > 2
out <- tibble(.rows = n())
mapply(
\(var, true, false) out[[var]] <<- if_else(cond, true, false),
var = c("result", "auxiliary"),
true = list(x + y, "Calculation done"),
false = list(NA_integer_, NA_character_)
)
out
})
Output
# A tibble: 5 x 4
x y auxiliary result
<int> <int> <chr> <int>
1 1 1 NA NA
2 2 2 NA NA
3 3 3 Calculation done 6
4 4 4 Calculation done 8
5 5 5 Calculation done 10

Count how many times a Nurse washes their hand before patient contact: Is X before Y, group_by(ID)?

I have a set of observed behaviour of nurses conducting patient care and record what they touch or do. This might look like:
df<-data.frame(ActivityID=rep(1:3, each=3),
Action=c("Door", "Hygiene", "Patient", "Door", "Patient", "Door", "Door", "Patient", "Hygiene"))
I'd like to check whether they wash their hands before the first time they touch the patient for each ActivityID and count for how many ActivityID's this occurs. Essentially I'd like to know if X happens before Y for each activity.
My thought was to use which to find the first occurrence for both Patient and Hygiene:
require(dplyr)
a=df%>%
group_by(ActivityID) %>%
which(Action=="Hygiene")
b=df%>%
group_by(ActivityID) %>%
which(Action=="Patient")
which(a<b)
But this doesn't seem to work in pipe form and sometimes, they don't touch the patient. Any help would be much appreciated.
Total unique activities can be calculated using :
library(dplyr)
total_Activities <- n_distinct(df$ActivityID)
total_Activities
#[1] 3
We can write a function to check if hands were washed anytime before touching the Patient for first time:
hands_washed_before_touch <- function(x) {
ind1 <- which(x == 'Hygiene')
ind2 <- which(x == 'Patient')
length(ind1) && length(ind2) && ind1[1] < ind2[1]
}
and use it by group :
df1 <- df %>%
group_by(ActivityID) %>%
summarise(hands_washed = hands_washed_before_touch(Action))
df1
# ActivityID hands_washed
# <int> <lgl>
#1 1 TRUE
#2 2 FALSE
#3 3 FALSE
To get the count, we can sum the hands_washed column, i.e. sum(df1$hands_washed).
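As a rough base R sketch of the same per-group check and count, tapply() can stand in for the dplyr pipeline. The function is repeated here only so the block runs on its own; the names `washed` and `n_washed` are illustrative.

```r
df <- data.frame(ActivityID = rep(1:3, each = 3),
                 Action = c("Door", "Hygiene", "Patient",
                            "Door", "Patient", "Door",
                            "Door", "Patient", "Hygiene"))

# TRUE only if both actions occur and the first Hygiene precedes the first Patient
hands_washed_before_touch <- function(x) {
  ind1 <- which(x == "Hygiene")
  ind2 <- which(x == "Patient")
  length(ind1) > 0 && length(ind2) > 0 && ind1[1] < ind2[1]
}

# One logical per ActivityID
washed <- tapply(df$Action, df$ActivityID, hands_washed_before_touch)

# Number of activities where hands were washed before first patient contact
n_washed <- sum(washed)
n_washed
```

Activities in which the patient is never touched count as FALSE here, which may or may not be what you want for the denominator.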
Here is another alternative using case_when from dplyr package.
library(dplyr)
df1<- df %>%
group_by(ActivityID) %>%
mutate(hands_washed = case_when(
!any(Action == "Hygiene") ~ "False",
min(c(which(Action == "Hygiene"), Inf)) > which.max(Action == "Patient")~ "False",
TRUE ~ "True"))%>%
ungroup()
df1
# A tibble: 9 x 3
# Groups: ActivityID [3]
# ActivityID Action hands_washed
# <int> <fct> <chr>
#1 1 Door True
#2 1 Hygiene True
#3 1 Patient True
#4 2 Door False
#5 2 Patient False
#6 2 Door False
#7 3 Door False
#8 3 Patient False
#9 3 Hygiene False

row wise test if multiple (not all) columns are equal

I want to do a row-wise check of whether multiple columns are all equal or not. I came up with a convoluted approach that counts the occurrences of each value per group. But this seems somewhat... cumbersome.
sample data
sample_df <- data.frame(id = letters[1:6], group = rep(c('r','l'),3), stringsAsFactors = FALSE)
set.seed(4)
for(i in 3:5) {
sample_df[i] <- sample(1:4, 6, replace = TRUE)
sample_df
}
desired output
library(tidyverse)
sample_df %>%
gather(var, value, V3:V5) %>%
mutate(n_var = n_distinct(var)) %>% # get the number of columns
group_by(id, group, value) %>%
mutate(test = n_distinct(var) == n_var ) %>% # check how frequent values occur per "var"
spread(var, value) %>%
select(-n_var)
#> # A tibble: 6 x 6
#> # Groups: id, group [6]
#> id group test V3 V4 V5
#> <chr> <chr> <lgl> <int> <int> <int>
#> 1 a r FALSE 3 3 1
#> 2 b l FALSE 1 4 4
#> 3 c r FALSE 2 4 2
#> 4 d l FALSE 2 1 2
#> 5 e r TRUE 4 4 4
#> 6 f l FALSE 2 2 3
Created on 2019-02-27 by the reprex package (v0.2.1)
Does not need to be dplyr. I just used it for showing what I want to achieve.
There are a bunch of ways to check for equality row-wise. Two good ways:
# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)
# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
If you only want to test some columns, then use a subset of columns rather than the whole data frame:
cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
Note I use df[cols_to_test] instead of df[, cols_to_test] when I want to be sure the result is a data.frame even if cols_to_test has length 1.
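A short sketch of that difference, on a small made-up data frame (the values and the name `one_col` are assumptions for illustration):

```r
df <- data.frame(id = c("a", "b", "c"),
                 V3 = c(3L, 1L, 4L),
                 V4 = c(3L, 4L, 4L),
                 V5 = c(1L, 4L, 4L))

one_col <- 2

class(df[one_col])                  # list-style indexing keeps a data.frame
class(df[, one_col])                # matrix-style indexing drops to a vector
class(df[, one_col, drop = FALSE])  # drop = FALSE also keeps a data.frame

# The row-wise test on several columns
cols_to_test <- c(2, 3, 4)
all_equal <- rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
all_equal
```

Only the third row has identical values in V3, V4 and V5, so it is the only TRUE.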

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first event. For example, if seq = rows 8, 10, and 11 was counted, then seq = rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
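To make the time constraints concrete, here is a minimal sketch that checks a single candidate sequence against the bounds. The helper `is_valid_sequence` is hypothetical, and it assumes the first bound applies to the first event's own time; the times are taken from the printed table above (with 42 in row 7).

```r
seq_events <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Hypothetical helper: TRUE if the events match and every step is within bounds
is_valid_sequence <- function(events, times) {
  # First element is the start time, the rest are gaps between consecutive events
  steps <- c(times[1], diff(times))
  identical(as.character(events), seq_events) &&
    all(steps >= time_LB & steps <= time_UB)
}

rows_1_3_5 <- is_valid_sequence(c("a", "b", "a"), c(1, 6, 24))   # gaps 5 and 18
rows_5_6_7 <- is_valid_sequence(c("a", "b", "a"), c(24, 30, 42)) # gaps 6 and 12
rows_1_2_5 <- is_valid_sequence(c("a", "b", "a"), c(1, 3, 24))   # gaps 2 and 21

c(rows_1_3_5, rows_5_6_7, rows_1_2_5)
```

The first two candidates satisfy both gaps, while rows 1, 2, 5 fail because the second gap (21) exceeds the upper bound of 18.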
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using dplyr?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
full_join(df,by='id',suffix=c('1','2')) %>%
full_join(df,by='id') %>%
rename(event3 = event, time3 = time) %>%
filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
group_by(id,time1) %>%
slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
group_by(id) %>%
count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes it is possible to generalize this to arbitrary length sequences but requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function on a list of objects as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
df2 = df2 %>%
mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
filter(seq_string == paste0(seq,collapse=''))
time_diff = df2 %>% select(grep('time',names(df2))) %>%
t %>%
as.data.frame() %>%
lapply(diff) %>%
unlist %>% matrix(ncol=length(seq)-1,byrow=TRUE) %>% # one column per gap between consecutive events
as.data.frame
foreach(i=seq_along(time_diff),.combine=data.frame) %do%
{
time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
} %>%
Reduce(`&`,.) %>%
which %>%
slice(df2,.) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
group_by(id,time1) %>%
slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse='')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=T)[[1]]))
data.frame(id=names(df1), count=unlist(df1))
# id count
# 1 1 2
# 2 2 2
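One caveat worth sketching: gregexpr() returns -1 when nothing matches, so taking length() of its result reports 1 for a string with no match. A hedged variant that counts only real match positions (the helper name `count_matches` is made up, and the id-1 string is rebuilt here from the event times):

```r
# Count match start positions, treating the "no match" marker (-1) as zero
count_matches <- function(s, pattern) {
  m <- gregexpr(pattern, s, perl = TRUE)[[1]]
  sum(m > 0)
}

pattern <- "(?=a.{1,7}b.{11,17}a)"

# Rebuild the string for id 1: events placed at their time index
z <- rep("-", 42)
z[c(1, 24, 42)] <- "a"
z[c(3, 6, 12, 30)] <- "b"
s1 <- paste(z, collapse = "")

n_id1  <- count_matches(s1, pattern)
n_none <- count_matches("bbb", pattern)
c(n_id1, n_none)
```

For the id-1 string this gives the same count as above, and a string without any valid sequence yields 0 instead of 1.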
