How to loop ifelse function through a grouped variable with dplyr - r

I'm trying to apply a rule for a group of IDs that, upon the first instance where the value for a variable in one row equals 1, all values for another variable in all subsequent rows in that group equal 1.
Essentially, here is what I am trying to do:
I have:
ID D
1 1
1 0
1 0
2 0
2 0
3 1
3 0
3 0
4 1
4 0
4 1
4 1
4 1
4 0
I want:
ID D PREV
1 1 0
1 0 1
1 0 1
2 0 0
2 0 0
3 1 0
3 0 1
3 0 1
4 1 0
4 0 1
4 1 1
4 1 1
4 1 1
4 0 1
I'm trying to use dplyr to iterate through a series of grouped rows, in each one applying an ifelse function. My code looks like this:
data$prev = 0
data <-
data %>%
group_by(id)%>%
mutate(prev = if_else(lag(prev) == 1 | lag(d) == 1, 1, 0))
But for some reason, this is not applying the ifelse function over the whole group, resulting in data that looks something like this:
ID D PREV
1 1 0
1 0 1
1 0 0
2 0 0
2 0 0
3 1 0
3 0 1
3 0 0
4 1 0
4 0 1
4 1 0
4 1 1
4 0 1
Can anyone help me with this?
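A note on why the code above stops after one row: mutate evaluates the whole right-hand side on the existing columns before assigning anything, so lag(prev) only ever sees the initial zeros and the result collapses to lag(d). The sketch below reproduces this in base R for the D values of ID 4; lag1 is a hypothetical stand-in for dplyr::lag, not part of the original code.

```r
# Base R stand-in for dplyr::lag(): shift by one, pad the front with NA
lag1 <- function(x) c(NA, head(x, -1))

d    <- c(1, 0, 1, 1, 1, 0)   # D values of ID 4 from the question
prev <- rep(0, length(d))     # prev as initialised by data$prev = 0

# The right-hand side is computed from the old prev (all 0), so
# lag1(prev) == 1 is never TRUE and the result is just lag1(d)
attempt <- ifelse(lag1(prev) == 1 | lag1(d) == 1, 1, 0)
attempt
# NA 1 0 1 1 1
```

The 0 in the third position is the value that should have been 1: nothing carries the earlier 1 forward, which is what a cumulative function such as cumsum provides.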

What about this:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(prev = +(cumsum(c(0, D[-length(D)])) > 0)) %>%
ungroup()
#> # A tibble: 14 x 3
#> ID D prev
#> <int> <int> <int>
#> 1 1 1 0
#> 2 1 0 1
#> 3 1 0 1
#> 4 2 0 0
#> 5 2 0 0
#> 6 3 1 0
#> 7 3 0 1
#> 8 3 0 1
#> 9 4 1 0
#> 10 4 0 1
#> 11 4 1 1
#> 12 4 1 1
#> 13 4 1 1
#> 14 4 0 1
To explain what it does, let's take a simple vector.
The calculation is the same for each group.
Let x be our vector:
x <- c(0,0,0,1,1,0,0,2,3,4)
Take the cumulative sum of x:
cumsum(x)
#> [1] 0 0 0 1 2 2 2 4 7 11
You are only interested in values above zero, therefore:
cumsum(x)>0
#> [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
You don't want a logical vector but a numeric one. A unary + does the trick:
+(cumsum(x)>0)
#> [1] 0 0 0 1 1 1 1 1 1 1
However, you want the 1s delayed by one position. Thus, we add a zero at the front of x:
+(cumsum(c(0,x))>0)
#> [1] 0 0 0 0 1 1 1 1 1 1 1
We need to keep the same length, so we remove the last value of x:
+(cumsum(c(0, x[-length(x)])) > 0)
#> [1] 0 0 0 0 1 1 1 1 1 1
And that does the trick.
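For readers without dplyr, the same shifted cumulative sum can be applied per ID with base R's ave(). This is only a sketch under that assumption; the helper name seen_before is made up for illustration.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  D  = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)

# Hypothetical helper: 1 once an earlier element of x was non-zero
seen_before <- function(x) +(cumsum(c(0, x[-length(x)])) > 0)

# ave() applies the helper within each ID and keeps the original row order
df$prev <- ave(df$D, df$ID, FUN = seen_before)
df
```

ave() plays the role of group_by() plus mutate() here: it splits D by ID, runs the helper on each piece, and puts the results back in place.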

We can use lag
library(dplyr)
df %>%
group_by(ID) %>%
mutate(prev = lag(cumsum(D) > 0, default = 0))
Output:
# A tibble: 14 x 3
# Groups: ID [4]
# ID D prev
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 0 1
# 3 1 0 1
# 4 2 0 0
# 5 2 0 0
# 6 3 1 0
# 7 3 0 1
# 8 3 0 1
# 9 4 1 0
#10 4 0 1
#11 4 1 1
#12 4 1 1
#13 4 1 1
#14 4 0 1
Data:
df <- data.frame(
ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)

You can use dplyr::group_modify, a newer dplyr function, to apply a function to each group:
df <- data.frame(
ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4),
D = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0)
)
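df %>% group_by(ID) %>% group_modify(
  function(x, y) {
    # x holds the rows of one group, y holds its key (unused here).
    # Only the first row is inspected, so this assumes that a group
    # containing a 1 has it in its first row, as in the sample data.
    if (x$D[1] == 1) {
      x$prev <- 1
      x$prev[1] <- 0
    } else {
      x$prev <- 0
    }
    x
  }
)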
# A tibble: 14 x 3
# Groups: ID [4]
ID D prev
<dbl> <dbl> <dbl>
1 1 1 0
2 1 0 1
3 1 0 1
4 2 0 0
5 2 0 0
6 3 1 0
7 3 0 1
8 3 0 1
9 4 1 0
10 4 0 1
11 4 1 1
12 4 1 1
13 4 1 1
14 4 0 1
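Because the function above only looks at the first row of each group, it matches the desired output for this sample but can differ from the cumsum-based answers on other data. A small base R sketch with a made-up group whose first 1 sits in the second row illustrates the difference:

```r
d <- c(0, 1, 0, 0)  # hypothetical group whose first 1 is in row 2

# Rule used in the function above: look at the first row only
first_row_rule <- if (d[1] == 1) c(0, rep(1, length(d) - 1)) else rep(0, length(d))

# Shifted cumulative sum: 1 on every row after the first 1, wherever it is
cumsum_rule <- +(cumsum(c(0, d[-length(d)])) > 0)

first_row_rule  # 0 0 0 0
cumsum_rule     # 0 0 1 1
```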

Related

How can I create a new column with values 1/0, where the value in the new column is 1 only if values in two other columns are both 1?

I have two columns within a DF, "wet" and "cold", each holding values of 1 or 0, e.g.:
Wet Cold
1 1
0 1
0 1
1 0
1 1
0 0
I would like to create a new column, wet&cold, where only if wet=1 and cold=1, then wet&cold=1. If any or both of them are 0 or not matching, then wet&cold=0.
I tried to work around with grepl, but without success.
Base R solution
df$`wet&cold` <- df$Wet*df$Cold
df
Wet Cold wet&cold
1 1 1 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 1 1
6 0 0 0
dplyr solution
df %>%
mutate(`wet&cold`=Wet*Cold)
Wet Cold wet&cold
1 1 1 1
2 0 1 0
3 0 1 0
4 1 0 0
5 1 1 1
6 0 0 0
Another option is to check whether every column in a row has the value 1 and convert the resulting TRUE/FALSE to 1/0 with as.integer, like this:
df$wet_cold = as.integer(rowSums(df == 1) == ncol(df))
df
#> Wet Cold wet_cold
#> 1 1 1 1
#> 2 0 1 0
#> 3 0 1 0
#> 4 1 0 0
#> 5 1 1 1
#> 6 0 0 0
Created on 2023-01-18 with reprex v2.0.2
The other solution works great with its clever multiplication. Here's perhaps a more general solution using ifelse(), which also works well for this two-column case.
df <- data.frame(
wet = c(1, 0, 0, 1, 1, 0),
cold = c(1, 1, 1, 0, 1, 0)
)
df$wet_cold <- ifelse(df$wet == 1 & df$cold == 1, 1, 0)
df
# df
# wet cold wet_cold
# 1 1 1 1
# 2 0 1 0
# 3 0 1 0
# 4 1 0 0
# 5 1 1 1
# 6 0 0 0
You can use & to check whether both are 1, and use + to convert TRUE or FALSE to 1 and 0.
DF["wet&cold"] <- +(DF$wet & DF$cold)
#DF
# wet cold wet&cold
#1 1 1 1
#2 0 1 0
#3 0 1 0
#4 1 0 0
#5 1 1 1
#6 0 0 0
Two more general approaches, which work for more than two columns and for conditions other than 1, are:
DF["wet&cold"] <- +(apply(DF==1, 1, all))
DF["wet&cold"] <- +(rowSums(DF != 1) == 0)
Data
DF <- data.frame(wet = c(1, 0, 0, 1, 1, 0), cold = c(1, 1, 1, 0, 1, 0))
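As a sketch of how these approaches extend beyond two columns, the example below adds a made-up third column (windy) and restricts the test to a named set of columns, so that a previously added result column cannot leak into the check. The variable names are illustrative only.

```r
DF <- data.frame(
  wet   = c(1, 0, 0, 1, 1, 0),
  cold  = c(1, 1, 1, 0, 1, 0),
  windy = c(1, 1, 0, 1, 0, 0)   # hypothetical extra column
)
cols <- c("wet", "cold", "windy")

# Three equivalent ways to test "all selected columns equal 1" per row
all_one_apply   <- +(apply(DF[cols] == 1, 1, all))
all_one_rowsums <- +(rowSums(DF[cols] != 1) == 0)
all_one_product <- Reduce(`*`, DF[cols])   # valid only for 0/1 data

unname(all_one_apply)
# 1 0 0 0 0 0
```

The product shortcut relies on the columns containing nothing but 0 and 1; the first two variants work for any comparison that yields a logical matrix.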

R: Count frequencies of levels across whole time series

I've created this dummy dataframe that represents my real data. For simplicity, I've dropped the Time column:
df <- tibble(ID = c(1, 2, 1, 3, 4, 1, 2, 3),
level = c(0, 0, 1, 2, 1, 2, 3, 0),
n_0 = 4,
n_1 = 0,
n_2 = 0,
n_3 = 0,
previous_level = c(0, 0, 0, 0, 0, 1, 0, 2))
ID level n_0 n_1 n_2 n_3 previous_level
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 4 0 0 0 0
2 2 0 4 0 0 0 0
3 1 1 4 0 0 0 0
4 3 2 4 0 0 0 0
5 4 1 4 0 0 0 0
6 1 2 4 0 0 0 1
7 2 3 4 0 0 0 0
8 3 0 4 0 0 0 2
Some words to explain this structure. The actual data comprises only the ID and level columns. A specific ID can only have one level at a time; however, this might change over time. All IDs start with level 0. Now I want columns that track how many of my IDs (here 4 in total) have levels 0, 1, 2 and 3. Therefore I've already created the count columns. Also, I think a column with the previous level might be helpful.
The following table shows the result I'm expecting:
ID level n_0 n_1 n_2 n_3 previous_level
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 4 0 0 0 0
2 2 0 4 0 0 0 0
3 1 1 3 1 0 0 0
4 3 2 2 1 1 0 0
5 4 1 1 2 1 0 0
6 1 2 1 1 2 0 1
7 2 3 0 1 2 1 0
8 3 0 1 1 1 1 2
Is there a sneaky way to do so in R?
You may try
fn <- function(df){
  lv <- sort(unique(df$level))
  res <- as.data.frame(matrix(0, ncol = length(lv), nrow = nrow(df)))
  # key[j] is the current level of ID j; every ID starts at level 0.
  # This indexes key by ID, so it assumes the IDs are 1, 2, ..., n.
  key <- factor(rep(0, length(unique(df$ID))), levels = lv)
  for (i in 1:nrow(df)){
    # record the new level of this row's ID, then count IDs per level
    key[df$ID[i]] <- df$level[i]
    res[i,] <- table(key)
  }
  names(res) <- paste0("n_", levels(key))
  cbind(df, res)
}
fn(df)
ID level previous_level n_0 n_1 n_2 n_3
1 1 0 0 4 0 0 0
2 2 0 0 4 0 0 0
3 1 1 0 3 1 0 0
4 3 2 0 2 1 1 0
5 4 1 0 1 2 1 0
6 1 2 1 1 1 2 0
7 2 3 0 0 1 2 1
8 3 0 2 1 1 1 1
library(dplyr)
library(magrittr)
n_states = 4L
state = vector(mode = 'numeric', length = n_states)
state[1L] = n_distinct(df$ID)
for (i in seq_len(nrow(df))) {
state[df[i, 'previous_level'] + 1] %<>% subtract(1)
state[df[i, 'level'] + 1] %<>% add(1)
df[i, paste0('n', seq_len(n_states) - 1L)] = state
}
# ID level previous_level n0 n1 n2 n3
# 1 1 0 0 4 0 0 0
# 2 2 0 0 4 0 0 0
# 3 1 1 0 3 1 0 0
# 4 3 2 0 2 1 1 0
# 5 4 1 0 1 2 1 0
# 6 1 2 1 1 1 2 0
# 7 2 3 0 0 1 2 1
# 8 3 0 2 1 1 1 1
Data:
df <- data.frame(
ID = c(1, 2, 1, 3, 4, 1, 2, 3),
level = c(0, 0, 1, 2, 1, 2, 3, 0),
previous_level = c(0, 0, 0, 0, 0, 1, 0, 2)
)
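Both loops above rely on the IDs being 1 to n or on a precomputed previous_level column. The following base R sketch avoids both by looking up each ID's current level by name. It still assumes, as stated in the question, that every ID starts at level 0; the variable names are made up for illustration.

```r
df <- data.frame(
  ID    = c(1, 2, 1, 3, 4, 1, 2, 3),
  level = c(0, 0, 1, 2, 1, 2, 3, 0)
)

ids  <- unique(df$ID)
lvls <- sort(unique(df$level))

# Current level of every ID, looked up by name; all IDs start at level 0
current <- setNames(rep(0, length(ids)), ids)

# One row of counts per observation, one column per level
counts <- matrix(0L, nrow = nrow(df), ncol = length(lvls),
                 dimnames = list(NULL, paste0("n_", lvls)))

for (i in seq_len(nrow(df))) {
  current[as.character(df$ID[i])] <- df$level[i]
  counts[i, ] <- tabulate(match(current, lvls), nbins = length(lvls))
}

res <- cbind(df, counts)
res
```

Each row of counts is a snapshot of how many IDs sit at each level after that row's update, so every row sums to the number of distinct IDs.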

Number of switches in a vector by group

Given the vector
vec <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
the number of switches can simply be calculated with
cumsum(diff(vec) == 1)
However given the df
group <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
df_test <- data.frame(vec,group)
and using
df_test <- df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(diff(vec) == 1))%>%
ungroup()
the calculation fails because n_switches has 2 rows (1 per group) less than vec.
How can I overcome this problem and generate n_switches per group? Any help would be much appreciated!
diff always returns a vector one element shorter than its input, while mutate expects a column with the same length as the group. We can prepend an element (here FALSE) so that the lengths match:
library(dplyr)
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(c(FALSE, diff(vec) == 1))) %>%
ungroup
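The length mismatch and its fix can be checked without dplyr. In this sketch ave() stands in for group_by() plus mutate(), and count_switches is a made-up helper name.

```r
vec   <- c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0)
group <- rep(c(1, 2), each = 10)

length(vec)        # 20
length(diff(vec))  # 19: one element shorter than the input

# Padding with a leading FALSE restores the length within each group
count_switches <- function(x) cumsum(c(FALSE, diff(x) == 1))
n_switches <- ave(vec, group, FUN = count_switches)
n_switches
# 0 0 1 1 1 1 1 2 2 2 0 0 1 1 2 2 2 2 3 3
```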
Here is another option besides diff, which applies gregexpr for positioning the shifts
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1))) %>%
ungroup()
# A tibble: 20 x 3
vec group n_switches
<dbl> <dbl> <dbl>
1 1 1 0
2 0 1 0
3 1 1 1
4 1 1 1
5 1 1 1
6 0 1 1
7 0 1 1
8 1 1 2
9 1 1 2
10 0 1 2
11 1 2 0
12 0 2 0
13 1 2 1
14 0 2 1
15 1 2 2
16 1 2 2
17 1 2 2
18 0 2 2
19 1 2 3
20 0 2 3
A similar realization but with data.table
setDT(df_test)[
,
n_switches := cumsum(replace(0 * vec, gregexpr("(?<=0)1", paste0(vec, collapse = ""), perl = TRUE)[[1]], 1)),
group
]
> df_test
vec group n_switches
1: 1 1 0
2: 0 1 0
3: 1 1 1
4: 1 1 1
5: 1 1 1
6: 0 1 1
7: 0 1 1
8: 1 1 2
9: 1 1 2
10: 0 1 2
11: 1 2 0
12: 0 2 0
13: 1 2 1
14: 0 2 1
15: 1 2 2
16: 1 2 2
17: 1 2 2
18: 0 2 2
19: 1 2 3
20: 0 2 3
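One caveat with the gregexpr approach: when a group contains no 0-to-1 switch, gregexpr returns -1, and using -1 as an index in replace means "every element except the first", which would produce spurious counts. It also relies on every value being a single character, so that string positions equal vector positions. A guarded variant might look like the sketch below; switches_regex is a hypothetical helper name.

```r
# Hypothetical guarded variant of the gregexpr approach
switches_regex <- function(x) {
  pos <- gregexpr("(?<=0)1", paste0(x, collapse = ""), perl = TRUE)[[1]]
  out <- 0 * x
  # gregexpr() reports "no match" as -1; skip the assignment in that case
  if (pos[1] != -1) out[pos] <- 1
  cumsum(out)
}

switches_regex(c(1, 0, 1, 1, 0, 1))  # 0 0 1 1 1 2
switches_regex(c(1, 1, 0))           # 0 0 0

# Unguarded version on the same no-switch input
cumsum(replace(0 * c(1, 1, 0), -1, 1))  # 0 1 2
```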
You can define a switch as a position where the current value is 1 and the previous value was 0, and then use cumsum to create the n_switches variable.
Using dplyr you can do this as:
library(dplyr)
df_test %>%
group_by(group) %>%
mutate(n_switches = cumsum(vec == 1 & lag(vec, default = 1) == 0))
# vec group n_switches
# <dbl> <dbl> <int>
# 1 1 1 0
# 2 0 1 0
# 3 1 1 1
# 4 1 1 1
# 5 1 1 1
# 6 0 1 1
# 7 0 1 1
# 8 1 1 2
# 9 1 1 2
#10 0 1 2
#11 1 2 0
#12 0 2 0
#13 1 2 1
#14 0 2 1
#15 1 2 2
#16 1 2 2
#17 1 2 2
#18 0 2 2
#19 1 2 3
#20 0 2 3
and same logic with data.table :
library(data.table)
setDT(df_test)[, n_switches:=cumsum(vec == 1 & shift(vec, fill = 1) == 0), group]

sum certain values between when the first and last observation equals a certain value

I have something like:
df<-data.frame(group=c(1, 1, 1, 1,1, 2, 2, 2, 2, 2, 3, 3, 3),
have=c(1, 0, 1, 0, 1, 1, 0,0,0,1, 1,0,0),
wantsum=c(2,2,2,2,2,3,3,3,3,3,0,0,0))
I want to sum the number of 0's per group, but only when the first and last observation is equal to 1. Something like:
# group have wantsum
#1 1 1 2
#2 1 0 2
#3 1 1 2
#4 1 0 2
#5 1 1 2
#6 2 1 3
#7 2 0 3
#8 2 0 3
#9 2 0 3
#10 2 1 3
#11 3 1 0
#12 3 0 0
#13 3 0 0
Thanks
After grouping by 'group', we can check whether both the first and the last observation in 'have' are 1; if so, we return the number of 0 values, otherwise we return 0.
library(dplyr)
df %>%
group_by(group) %>%
mutate(wantsum2 = if(all(c(first(have), last(have)) == 1)) sum(have == 0) else 0)
# A tibble: 13 x 4
# Groups: group [3]
# group have wantsum wantsum2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 2
# 2 1 0 2 2
# 3 1 1 2 2
# 4 1 0 2 2
# 5 1 1 2 2
# 6 2 1 3 3
# 7 2 0 3 3
# 8 2 0 3 3
# 9 2 0 3 3
#10 2 1 3 3
#11 3 1 0 0
#12 3 0 0 0
#13 3 0 0 0
There are many ways to do this. For example, the if/else can be replaced by multiplying the sum by the logical value returned by all: TRUE/FALSE are coerced to 1/0, so the sum is kept when the condition holds and becomes 0 otherwise.
df %>%
group_by(group) %>%
mutate(wantsum2 = sum(have == 0) * all(c(first(have), last(have)) == 1) )
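The same multiplication trick can be sketched in base R with ave(), which recycles a single value per group across that group's rows. The helper name zeros_if_bracketed is made up for illustration.

```r
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  have  = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0)
)

# Number of zeros in x, or 0 unless x both starts and ends with 1
zeros_if_bracketed <- function(x) {
  bracketed <- x[1] == 1 && x[length(x)] == 1
  sum(x == 0) * bracketed   # TRUE/FALSE act as 1/0 in the product
}

df$wantsum2 <- ave(df$have, df$group, FUN = zeros_if_bracketed)
df$wantsum2
# 2 2 2 2 2 3 3 3 3 3 0 0 0
```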

Identify and label repeated data in a series

I'm trying to identify cases in a dataset where a value occurs multiple times in a row, and once this is picked up, a row to the side of the nth occurrence confirms this with '1'.
df<-data.frame(user=c(1,1,1,1,2,3,3,3,4,4,4,4,4,4,4,4),
week=c(1,2,3,4,1,1,2,3,1,2,3,4,5,6,7,8),
updated=c(1,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1))
In this case, users are performing a task. If the task is performed, '1' appears for that week, if not '0' appears.
Is it possible, in the event that four or more 0s are encountered in a row, that an indicator is mutated into a new column identifying that this sequence has occurred? Something like this:
user week updated warning
1 1 1 1 0
2 1 2 0 0
3 1 3 1 0
4 1 4 1 0
5 2 1 1 0
6 3 1 1 0
7 3 2 1 0
8 3 3 1 0
9 4 1 1 0
10 4 2 1 0
11 4 3 0 0
12 4 4 0 0
13 4 5 0 0
14 4 6 0 1
15 4 7 1 0
16 4 8 1 0
Thanks!
Edit:
Apologies and thanks to @akrun for helping with this.
Additional example below, where on the 4th consecutive missed entry (missed equal to '1') the warning column is updated to show the event; a trigger will run off of that data.
df<-data.frame(user=c(1,1,1,1,2,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
week=c(1,2,3,4,1,1,2,3,1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,1,2,3,4,5,6,7,8),
missed=c(0,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,1,0,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,0,1))
user week missed warning
1 1 1 0 0
2 1 2 1 0
3 1 3 0 0
4 1 4 0 0
5 2 1 0 0
6 3 1 0 0
7 3 2 0 0
8 3 3 0 0
9 4 1 0 0
10 4 2 0 0
11 4 3 1 0
12 4 4 1 0
13 4 5 1 0
14 4 6 1 1
15 4 7 0 0
16 4 8 0 0
17 5 1 0 0
18 5 2 1 0
19 5 3 0 0
20 5 4 1 0
21 5 5 0 0
22 5 6 0 0
23 5 7 0 0
24 5 8 0 0
25 6 1 0 0
26 6 2 1 0
27 6 3 1 0
28 6 4 1 0
29 6 5 1 1
30 6 6 1 0
31 6 7 0 0
32 7 1 0 0
33 7 2 0 0
34 7 3 0 0
35 7 4 0 0
36 7 5 1 0
37 7 6 1 0
38 7 7 0 0
39 7 8 1 0
One option is to use rle (run-length encoding) to create the warning. Grouped by 'user', rle(updated) returns the 'values' and 'lengths' of each run of adjacent identical values as a list; we build a logical condition that is TRUE where values is 0 and lengths is greater than or equal to 4. The second mutate then keeps the flag only on the last row of each such run.
library(dplyr)
library(data.table)
df %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
group_by(grp = rleid(warning), add = TRUE) %>%
mutate(warning = if(all(warning)) rep(c(0, 1), c(n()-1, 1)) else 0) %>%
ungroup %>%
select(-grp)
# A tibble: 16 x 4
# user week updated warning
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 0
# 2 1 2 0 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 2 1 1 0
# 6 3 1 1 0
# 7 3 2 1 0
# 8 3 3 1 0
# 9 4 1 1 0
#10 4 2 1 0
#11 4 3 0 0
#12 4 4 0 0
#13 4 5 0 0
#14 4 6 0 1
#15 4 7 1 0
#16 4 8 1 0
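To see what the rle step contributes on its own, here is a small base R sketch using the updated values of user 4 from the example data. The variable names are illustrative.

```r
updated <- c(1, 1, 0, 0, 0, 0, 1, 1)  # user 4 from the example data

r <- rle(updated)
r$values   # 1 0 1
r$lengths  # 2 4 2

# TRUE for runs of zeros that are at least four long, expanded back to rows
in_long_zero_run <- rep(!r$values & r$lengths >= 4, r$lengths)
in_long_zero_run
# FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```

At this point every row of the qualifying run is flagged; the rleid step in the pipeline above is what narrows the flag to a single row.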
If we need to flag every row of a user who has any run of four or more 0s, then
df %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths)),
warning = as.integer(any(warning)))
# A tibble: 16 x 4
# Groups: user [4]
# user week updated warning
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 0
# 2 1 2 0 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 2 1 1 0
# 6 3 1 1 0
# 7 3 2 1 0
# 8 3 3 1 0
# 9 4 1 1 1
#10 4 2 1 1
#11 4 3 0 1
#12 4 4 0 1
#13 4 5 0 1
#14 4 6 0 1
#15 4 7 1 1
#16 4 8 1 1
I followed a different approach. I numbered the rows sequentially within each combination of user and rleid(updated), i.e. within each run of identical updated values. If a run of zeros reaches position 4, that means that there are 4 consecutive homeworks not done. The warning is thus created where that counter is equal to 4 and updated is 0.
library(data.table)
df[,
warning := {id <- 1:.N;
warning <- as.numeric(id == 4)},
by = .(user,
rleid(updated))][,
warning := ifelse(warning == 1 & updated == 0, 1, 0)][is.na(warning),
warning := 0]
What has been done there
warning := assigns the result of the sequence that is between the {} to warning.
Now, inside the sequence:
id <- 1:.N creates a temporary variable id with consecutive numbers within each user and run-length group of updated values.
warning <- as.numeric(id == 4) creates a temporary variable with 1 in case id is equal to 4 and zero otherwise.
The by = .(user, rleid(updated)) grouped by both user and run-length values of updated. Of course there were run-length values for updated == 1, so we get rid of them by the ifelse clause. The final [is.na(warning), warning := 0] (notice the chaining) just gets rid of the NA values in the resulting variable.
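The counting idea can also be sketched in base R, which may help when checking the logic without data.table. Here sequence(rle(x)$lengths) restarts the counter at every run, and ave() stands in for the by = user grouping. The helper fourth_zero and the second user (ID 9) are made up for illustration.

```r
# Flag the fourth row of every run of zeros within one user's series
fourth_zero <- function(x) {
  run_pos <- sequence(rle(x)$lengths)  # 1, 2, 3, ... restarting at each run
  as.numeric(x == 0 & run_pos == 4)
}

updated <- c(1, 1, 0, 0, 0, 0, 1, 1)   # user 4 from the example data
fourth_zero(updated)
# 0 0 0 0 0 1 0 0

# Grouped version with a hypothetical second user who misses five in a row
user <- c(rep(4, 8), rep(9, 6))
upd  <- c(updated, 0, 0, 0, 0, 0, 1)
warning_flag <- ave(upd, user, FUN = fourth_zero)
warning_flag
# 0 0 0 0 0 1 0 0 0 0 0 1 0 0
```

For a run longer than four, this flags the fourth zero rather than the last one, which is the behaviour shown for user 6 in the question's second example.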
Data used
> dput(df2)
structure(list(user = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4,
4, 4, 4, 5, 5, 5, 5, 5), week = c(1, 2, 3, 4, 1, 1, 2, 3, 1,
2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5), updated = c(1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)), row.names = c(NA,
-21L), class = c("data.table", "data.frame"))
Speed comparison
I just compared with @akrun's answer:
library(microbenchmark)
set.seed(1)
df <- data.table(user = sample(1:10, 100, TRUE), updated = sample(c(1, 0), 100, TRUE), key = "user")
df[, week := 1:.N, by = user]
akrun <- function(df4){
df4 %>%
group_by(user) %>%
mutate(warning = with(rle(updated), rep(!values & lengths >= 4, lengths))) %>%
group_by(grp = rleid(warning), add = TRUE) %>%
mutate(warning = if(all(warning)) rep(c(0, 1), c(n()-1, 1)) else 0) %>%
ungroup %>%
select(-grp)
}
pavo <- function(df4){
df4[, warning := {id <- 1:.N; warning <- as.numeric(id == 4)}, by = .(user, rleid(updated))][, warning := ifelse(warning == 1 & updated == 0, 1, 0)][is.na(warning), warning := 0]
}
microbenchmark(akrun(df), pavo(df), times = 100)
Unit: microseconds
expr min lq mean median uq max neval
akrun(df) 1920.278 2144.049 2405.0332 2245.1735 2308.0145 6901.939 100
pavo(df) 823.193 877.061 978.7166 928.0695 991.5365 4905.450 100

Resources