How to track changes in a long format with dplyr? - r

Let's say I have the following dataset in long format, where the id variable identifies the participants (grouping variable):
id wave car household
1  1    0   1
1  2    1   1
1  3    0   1
1  4    1   2
2  1    0   1
2  2    1   2
2  3    1   3
2  4    0   1
3  1    0   1
3  2    0   1
3  3    1   2
3  4    1   1
4  1    0   1
4  2    1   1
4  3    1   1
4  4    1   2
The variable "car" tells whether someone owns a car or not. The variable "household" indicates how many people live in the same household. As you can see, all participants start without owning a car and living alone in the household.
I now want to determine the changes longitudinally so that I end up with
a) only those subjects who own a car and
b) only those subjects who own a car + live with only one other person (not two or more people) in the household.
In each case, only the first change should be counted and as soon as the car is sold (or more than two people live in the household), the data points should be excluded.
So condition a) would be fulfilled, for example, by subject id 1 at wave 2. However, only this should be counted, since subject id 1 no longer owns a car at wave 3, and the subsequent car purchase at wave 4 represents the second purchase.
Condition b) would be fulfilled, for example, for subject id 2 at wave 2, but from wave 3 onwards there are three people in the household, which is why the data points from wave 3 onwards are to be excluded. Similarly, if another person moves into the household of someone who already has a car, condition b) should become a missing value.
Whether condition a) and/ or condition b) apply is to be calculated in two separate binary variables (yes/no), named, for instance, "cond-a" and "cond-b".
Does anyone know how to do this most cleverly, for example with dplyr (or other R packages)?
I would be extremely grateful for an answer!
I assume I can probably do this with the group_by() function from dplyr, right?
Here is the code of the data.frame used in this example:
id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,0,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,2,1,2)
df <- data.frame(id,wave,car,household)
The expected output should be like:
id wave car household Cond-A Cond-B
1  1    0   1         0      0
1  2    1   1         1      0
1  3    0   1         0      0
1  4    1   2         NA     NA
2  1    0   1         0      0
2  2    1   2         0      1
2  3    1   3         NA     NA
2  4    0   1         0      0
3  1    0   1         0      0
3  2    0   1         0      0
3  3    1   2         0      1
3  4    1   1         1      NA
4  1    0   1         0      0
4  2    1   1         1      0
4  3    1   1         1      0
4  4    1   2         NA     NA
Edit: Subject 1, Wave 4 is NA because she/ he had already owned a car before (see Wave 2). If car = 1 before and then car = 0 again in the meantime, the data points should be excluded from the second time car = 1 (for both Cond-A and Cond-B). Id 2, Wave 2 shows: If the change from car = 0 is not car = 1 & household = 1, but directly car = 1 & household = 2, then Cond-B shall apply, but not Cond-A. So Cond-A shall only apply if a change from car = 0 & household = 1 is to car = 1 & household = 1. I know this is a tricky question, but if anyone knows the answer, it's probably here! :)

I revised my approach; I think it now comes very close to the desired outcome.
The last piece of logic that I don't understand is id 4, wave 4: why the double NA?
The core logic is built into a temporary variable called car_id, which is either NA if someone doesn't have a car at wave t, or the id of the car-ownership spell (1, 2, etc.).
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(
    car_id = rank(
      ifelse(car == 1, data.table::rleid(car == 0), NA),
      ties.method = "min",
      na.last = "keep"
    ),
    condition_a = case_when(
      car_id == 1 & household == 1 ~ 1,
      car_id > 1 | household > 2 ~ NA_real_,
      (lag(car) == 0) & car_id == 1 &
        (lag(household) == 1) & household == 2 ~ 0,
      TRUE ~ 0
    ),
    condition_b = case_when(
      lag(household) != 2 & household == 2 & car_id == 1 ~ 1,
      car_id > 1 | household > 2 ~ NA_real_,
      lag(household) == 2 & household != 2 ~ NA_real_,
      household != 0 ~ 0
    )
  ) %>%
  select(!car_id)
#> # A tibble: 16 × 6
#> # Groups: id [4]
#> id wave car household condition_a condition_b
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 1 0 0
#> 2 1 2 1 1 1 0
#> 3 1 3 0 1 0 0
#> 4 1 4 1 2 NA NA
#> 5 2 1 0 1 0 0
#> 6 2 2 1 2 0 1
#> 7 2 3 1 3 NA NA
#> 8 2 4 0 1 0 0
#> 9 3 1 0 1 0 0
#> 10 3 2 0 1 0 0
#> 11 3 3 1 2 0 1
#> 12 3 4 1 1 1 NA
#> 13 4 1 0 1 0 0
#> 14 4 2 1 1 1 0
#> 15 4 3 1 1 1 0
#> 16 4 4 1 2 0 1
Created on 2023-01-22 with reprex v2.0.2
Data as used in the answer, taken from the table in the question (note that it differs slightly from the df object built by the question's code):
id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
wave <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
car <- c(0,1,0,1,0,1,1,0,0,0,1,1,0,1,1,1)
household <- c(1,1,1,2,1,2,3,1,1,1,2,1,1,1,1,2)
df <- data.frame(id,wave,car,household)
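To see what the temporary car_id variable does, here is the rleid()/rank() step in isolation (a small illustration only; the vector below is a made-up example, not one of the ids above):

```r
library(data.table)

# rleid() gives each maximal run of identical values its own run id,
# so consecutive waves of car ownership collapse into one "spell":
car <- c(0, 1, 0, 1, 1, 0, 1)
rleid(car == 0)
#> [1] 1 2 3 4 4 5 6

# Keep the run id only while a car is owned, then rank the spells;
# the spell numbers are not consecutive (1, 2, 2, 4), but the
# car_id == 1 / car_id > 1 checks only need the first spell to rank 1:
spell <- ifelse(car == 1, rleid(car == 0), NA)
rank(spell, ties.method = "min", na.last = "keep")
#> [1] NA  1 NA  2  2 NA  4
```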

Related

Grouping a column by variable, matching and assigning a value to a different column

I am trying to create similar_player_selected column. I have first 4 columns.
For row 1, player_id = 1 and the most similar player to player 1 is player 3. But player 3 (row 3) isn't selected for campaign 1 (player_selected = 0), so I assign a value of 0 to similar_player_selected for row 1. For row 2, player_id = 2 and the most similar player to player 2 is player 4. Player 4 is selected for campaign 1 (row 4), so I assign a value of 1 to similar_player_selected for row 2. Please note there are more than 1000 campaigns overall.
campaign_id player_id most_similar_player player_selected similar_player_selected
1           1         3                   1               0
1           2         4                   0               1
1           3         4                   0               ?
1           4         1                   1               ?
2           1         3                   1               ?
2           2         4                   1               ?
2           3         4                   0               ?
2           4         1                   0               ?
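The question shows the data only as a table; a minimal reconstruction of the data frame (values read off the table above, with similar_player_selected left out since it is the column to be computed) might be:

```r
# Hypothetical reconstruction of the example data from the table above.
df <- data.frame(
  campaign_id         = rep(1:2, each = 4),
  player_id           = rep(1:4, times = 2),
  most_similar_player = c(3, 4, 4, 1, 3, 4, 4, 1),
  player_selected     = c(1, 0, 0, 1, 1, 1, 0, 0)
)
```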
Using match() we can subset player_selected at the matched locations:
library(dplyr)

df |>
  group_by(campaign_id) |>
  mutate(
    similar_player_selected = player_selected[match(most_similar_player, player_id)]
  ) |>
  ungroup()
A faster base R alternative:
df$similar_player_selected <- lapply(
  split(df, df$campaign_id),
  \(x) with(x, player_selected[match(most_similar_player, player_id)])
) |>
  unlist()
campaign_id player_id most_similar_player player_selected similar_player_selected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 1 0
2 1 2 4 0 1
3 1 3 4 0 1
4 1 4 1 1 1
5 2 1 3 1 0
6 2 2 4 1 0
7 2 3 4 0 0
8 2 4 1 0 1
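The key step is match(): it returns, for each most_similar_player, the row position of that player within the campaign, and indexing player_selected by those positions looks up the selection status. A standalone illustration with campaign 1's values:

```r
player_id       <- c(1, 2, 3, 4)
player_selected <- c(1, 0, 0, 1)
most_similar    <- c(3, 4, 4, 1)

# Position of each similar player within the campaign:
match(most_similar, player_id)
#> [1] 3 4 4 1

# Their selection status, looked up by position:
player_selected[match(most_similar, player_id)]
#> [1] 0 1 1 1
```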

Counting Frequencies of Sequences

Suppose there are two students, each of whom takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). The student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2), results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0), result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9))
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2 :
My code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data, this does not happen at all.
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
You did not consider that you cannot shift the results across id boundaries without inserting NA.
. <- my_data[order(my_data$id, my_data$result_id),] #sort if needed
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)]))
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))
aggregate(list(tally=.$p1), .[c("id","results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
.
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
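The anonymous function c(NA, x[-length(x)]) used above is a hand-rolled one-step lag; because ave() applies it within each id, the NA lands at every group boundary:

```r
# Drop the last element and prepend NA: a one-observation lag.
x <- c(0, 1, 0, 1)
c(NA, x[-length(x)])
#> [1] NA  0  1  0
```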
An option would be to use stats::filter() to flag those who passed 3 times in a row.
cbind(., n=ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1), sides=1)))
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If only the overall count is wanted (the number of times a student passes an exam given that they passed the previous two exams):
sum(ave(.$results, .$id, FUN = \(x) filter(x, c(1,1,1))==3), na.rm=TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
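For reference, stats::filter() with coefficients c(1, 1, 1) is a moving sum; with sides = 1 the window is trailing, so each value is the current result plus the two before it (while the default sides = 2, used in the one-liner above, centers the window):

```r
x <- c(0, 1, 1, 1, 0)
# Trailing moving sum of width 3; the first two positions have no
# complete window and come back as NA:
as.numeric(stats::filter(x, c(1, 1, 1), sides = 1))
#> [1] NA NA  2  3  2
```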
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)

d <- my_data |>
  group_by(id) |>  # group to accumulate within each student only
  mutate(
    csum = cumsum(results),     # cumulative sum of results
    i = csum - lag(csum, 3, 0)  # subtract the cumulative sum from 3 observations
                                # before: the number of exams passed in the
                                # current and previous 2 observations
  )
# Ungroup to get global count
d |>
ungroup() |>
count(i == 3) # Count the number of cases where the number of exams passes within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
count(i == 3) # Count the number of cases where the number of exams passes within 3 observations equals 3
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as data.table, here is how to do the same in that ecosystem:
my_data[, csum := cumsum(results), by = id]
my_data[, i := csum - shift(csum, 3, fill = 0), by = id]  # shift() is data.table's lag
my_data[, .(n_cases = sum(i == 3)), by = id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
Here's an approach using dplyr. It uses the lag() function to look back 1 and 2 results. If their sum together with the current result is 3, then the condition is met. In the example you provided, the condition is only met once.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
  group_by(id) %>%
  mutate(threex = ifelse(results + lag(results, 1) + lag(results, 2) == 3, 1, 0)) %>%
  filter(threex == 1) %>%
  select(id) %>%
  summarize(count = n())
id count
<dbl> <int>
1 2 1

Select all the rows belong to the groups that meet several conditions

I have a panel data with the following structure:
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
2 1 0
2 2 1
2 3 0
2 4 1
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
4 4 0
where each ID has one row for each month, action indicates if this ID did this action in this month or not, 0 is no, 1 is yes.
I need to find the ID that has continuously had action=1 once they started the action (it does not matter in which month they started, but once started, in the following months the action should always be 1). I also wish to record all the rows that belong to these IDs in a new data frame.
How can I do this in R?
In my example, ID=1 consistently had action=1 since Month 3, so the final data frame I'm looking for should only have the rows belong to ID=1.
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
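For anyone wanting to run the answers below, the panel shown above can be reconstructed as follows (a hypothetical construction; the question provides no code):

```r
# Hypothetical reconstruction of the panel data from the table above.
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  Month  = c(1:4, 1:4, 1:3, 1:4),
  Action = c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)
```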
You could do something like:
library(dplyr)

df %>%
  group_by(ID) %>%
  filter(all(diff(Action) >= 0) & max(Action) > 0) -> newDF
This newDF includes only the IDs where (a) the Action is never decreasing (i.e., no 1 => 0) and (b) there is at least one Action == 1.
ID Month Action
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1
A base R approach using ave(), where we check whether all the values after the first occurrence of 1 are 1. The any() condition is there to remove entries that are all 0's.
df[with(df, as.logical(ave(Action, ID, FUN = function(x) {
inds = cumsum(x)
any(inds > 0) & all(x[inds > 0] == 1)
}))), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1
Or another option with same logic but in a little concise way would be
df[with(df, ave(Action == 1, ID, FUN = function(x)
all(x[which.max(x):length(x)] == 1)
)), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1
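The concise version works because which.max() on a logical vector returns the index of the first TRUE, so x[which.max(x):length(x)] is everything from the first action onward; for an all-zero ID, which.max() falls back to 1 and all() over an all-FALSE vector is FALSE, so those IDs are dropped as well:

```r
x <- c(0, 0, 1, 1) == 1   # an ID that keeps acting once started
which.max(x)              # index of the first TRUE
#> [1] 3
all(x[which.max(x):length(x)])
#> [1] TRUE

y <- c(0, 1, 0, 1) == 1   # an ID that stops after starting
all(y[which.max(y):length(y)])
#> [1] FALSE
```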

Deleting unnecessary rows after column shuffling in a data frame in R

I have a data frame as below. The Status of each ID recorded in different time points. 0 means the person is alive and 1 means dead.
ID Status
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
I want to shuffle the column Status so that each ID has a status of 1 at most one time. After the first 1 within an ID, the remaining rows should be NA. For instance, I want my data frame to look like the one below after shuffling:
ID Status
1 0
1 0
1 0
2 0
2 1
2 NA
3 0
3 1
3 NA
3 NA
From the data you posted and your example output, it looks like you want to randomly sample df$Status and then do the replacement. To get what you want in one step you could do:
set.seed(3)
df$Status <- ave(sample(df$Status), df$ID, FUN = function(x) replace(x, which(cumsum(x)>=1)[-1], NA))
df
# ID Status
#1 1 0
#2 1 0
#3 1 0
#4 2 1
#5 2 NA
#6 2 NA
#7 3 0
#8 3 0
#9 3 1
#10 3 NA
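The replacement logic can be seen in isolation: cumsum(x) >= 1 is TRUE from the first 1 onward, and dropping the first such position with [-1] keeps that 1 itself while blanking everything after it:

```r
x <- c(0, 1, 0, 0)
# Positions at or after the first 1:
which(cumsum(x) >= 1)
#> [1] 2 3 4
# Drop the first of those positions, NA out the rest:
replace(x, which(cumsum(x) >= 1)[-1], NA)
#> [1]  0  1 NA NA
```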
One option is to use a cumsum of cumsum to detect the first 1 appearing for an ID.
Note that I have modified the OP's sample dataframe to demonstrate the reshuffling logic.
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(Sum = cumsum(cumsum(Status))) %>%
  mutate(Status = ifelse(Sum > 1, NA, Status)) %>%
  select(-Sum)
# # A tibble: 10 x 2
# # Groups: ID [3]
# ID Status
# <int> <int>
# 1 1 0
# 2 1 0
# 3 1 1
# 4 2 0
# 5 2 1
# 6 2 NA
# 7 3 0
# 8 3 1
# 9 3 NA
# 10 3 NA
Data
df <- read.table(text =
"ID Status
1 0
1 0
1 1
2 0
2 1
2 0
3 0
3 1
3 0
3 0", header = TRUE)

Create New Column With Consecutive Count Of First Series Based on ID Column

I work in the healthcare industry and I'm using machine learning algorithms to develop a model to predict when patients will not show up for their appointments. I'm trying to create a new feature that will be the sum of each patient's most recent consecutive no-shows. I've looked around a lot on stackoverflow and other resources, but cannot find exactly what I'm looking for. As an example, if a patient has no-showed her past two most recent appointments, then every row of the new feature's column with her ID will be filled in with 2's. If she no-showed three times, but showed up for her most recent appointment, then the new column will be filled in with 0's.
I tried using plyr's ddply with cumsum, but it did not give me the results I'm looking for. I used:
ddply(a, .(ID), transform, ConsecutiveNoshows = cumsum(Noshow))
Here is an example data set ('1' signifies a no-show):
ID Noshow
1 1
1 1
1 0
1 0
1 1
2 0
2 1
2 1
3 1
3 0
3 1
3 1
3 1
This is my desired outcome:
ID Noshow ConsecutiveNoshows
1 1 2
1 1 2
1 0 2
1 0 2
1 1 2
2 0 0
2 1 0
2 1 0
3 1 1
3 0 1
3 1 1
3 1 1
3 1 1
I'll be very grateful for any help. Thank you.
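For completeness, the example data can be reconstructed as follows (hypothetical; the question provides the data only as a table):

```r
# Hypothetical reconstruction of the no-show data from the table above.
df <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Noshow = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1)
)
```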
The idea is, for each ID, to sum() the number of no-shows before the first 0 appears.
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(ConsecutiveNoshows = sum(!cumsum(Noshow == 0) >= 1))
Which gives:
#Source: local data frame [13 x 3]
#Groups: ID [3]
#
# ID Noshow ConsecutiveNoshows
# <int> <int> <int>
#1 1 1 2
#2 1 1 2
#3 1 0 2
#4 1 0 2
#5 1 1 2
#6 2 0 0
#7 2 1 0
#8 2 1 0
#9 3 1 1
#10 3 0 1
#11 3 1 1
#12 3 1 1
#13 3 1 1
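To unpack the logic above: cumsum(Noshow == 0) stays at 0 until the first show-up, so negating the >= 1 test flags exactly the leading run of no-shows, and summing the flags gives its length. For ID 1:

```r
x <- c(1, 1, 0, 0, 1)        # ID 1's Noshow column
# Zero until the first show-up (0), positive afterwards:
cumsum(x == 0)
#> [1] 0 0 1 2 2
# TRUE only for the leading no-shows; sum() counts them:
sum(!cumsum(x == 0) >= 1)
#> [1] 2
```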
