Select all the rows belonging to the groups that meet several conditions - r

I have panel data with the following structure:
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
2 1 0
2 2 1
2 3 0
2 4 1
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
4 4 0
where each ID has one row per month; Action indicates whether this ID performed the action in that month (0 = no, 1 = yes).
I need to find the IDs that have continuously had Action = 1 once they started the action (it does not matter in which month they started, but once started, the action should always be 1 in the following months). I also wish to collect all the rows belonging to these IDs in a new data frame.
How can I do this in R?
In my example, ID = 1 has consistently had Action = 1 since Month 3, so the final data frame I'm looking for should only contain the rows belonging to ID = 1.
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
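For reproducibility, the example data can be entered as follows (values read off the table above):

```r
# Rebuild the example panel data shown in the question
df <- data.frame(
  ID     = rep(1:4, c(4, 4, 3, 4)),
  Month  = c(1:4, 1:4, 1:3, 1:4),
  Action = c(0, 0, 1, 1,  0, 1, 0, 1,  0, 0, 0,  0, 1, 1, 0)
)
```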

You could do something like:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(diff(Action) >= 0) & max(Action) > 0) -> newDF
This newDF includes only the IDs where (a) Action never decreases (i.e., there is no 1 => 0 transition) and (b) there is at least one Action == 1.
ID Month Action
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1

A base R approach using ave, where we check whether all the values after the first occurrence of 1 are themselves 1. The any() condition removes IDs whose entries are all 0's.
df[with(df, as.logical(ave(Action, ID, FUN = function(x) {
  inds = cumsum(x)
  any(inds > 0) & all(x[inds > 0] == 1)
}))), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1
Another option with the same logic, but a little more concise:
df[with(df, ave(Action == 1, ID, FUN = function(x)
  all(x[which.max(x):length(x)] == 1)
)), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1

Related

Grouping a column by variable, matching and assigning a value to a different column

I am trying to create the similar_player_selected column; the first four columns are given.
For row 1, player_id = 1 and the most similar player to player 1 is player 3. But player 3 (row 3) isn't selected for campaign 1 (player_selected = 0), so I assign a value of 0 to similar_player_selected for row 1. For row 2, player_id = 2 and the most similar player to player 2 is player 4. Player 4 is selected for campaign 1 (row 4), so I assign a value of 1 to similar_player_selected for row 2. Please note there are more than 1000 campaigns overall.
campaign_id player_id most_similar_player player_selected similar_player_selected
1 1 3 1 0
1 2 4 0 1
1 3 4 0 ?
1 4 1 1 ?
2 1 3 1 ?
2 2 4 1 ?
2 3 4 0 ?
2 4 1 0 ?
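For reproducibility, the example data (with similar_player_selected still to be computed) can be constructed as:

```r
# Rebuild the example data from the table in the question
df <- data.frame(
  campaign_id         = rep(1:2, each = 4),
  player_id           = rep(1:4, times = 2),
  most_similar_player = c(3, 4, 4, 1,  3, 4, 4, 1),
  player_selected     = c(1, 0, 0, 1,  1, 1, 0, 0)
)
```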
Using match we can subset player_selected at the matched locations:
library(dplyr)
df |>
  group_by(campaign_id) |>
  mutate(
    similar_player_selected = player_selected[match(most_similar_player, player_id)]
  ) |>
  ungroup()
A faster base R alternative:
df$similar_player_selected <- lapply(split(df, df$campaign_id), \(x)
  with(x, player_selected[match(most_similar_player, player_id)])) |>
  unlist()
campaign_id player_id most_similar_player player_selected similar_player_selected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 1 0
2 1 2 4 0 1
3 1 3 4 0 1
4 1 4 1 1 1
5 2 1 3 1 0
6 2 2 4 1 0
7 2 3 4 0 0
8 2 4 1 0 1
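To see why match() does the work here, consider campaign 1 in isolation. match() returns the position of each most_similar_player within player_id, and those positions are then used to index player_selected:

```r
# Campaign 1 from the example
player_id           <- c(1, 2, 3, 4)
most_similar_player <- c(3, 4, 4, 1)
player_selected     <- c(1, 0, 0, 1)

pos <- match(most_similar_player, player_id)  # positions 3, 4, 4, 1
player_selected[pos]                          # 0, 1, 1, 1 as in the output above
```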

In R how to find the first minimum value of a dataframe

How do I find the first minimum value in one column of a dataframe and output a new dataframe with just that row?
For example, for a dataframe named "hospital", for each node, I want to find the minimum time at which "H" is >=1.
node time H
1 1 0
2 1 0
3 1 0
1 2 0
2 2 0
3 2 2
1 3 0
2 3 1
3 3 2
1 4 1
2 4 4
3 4 0
The result I want to be able to output is:
node time H
1 4 1
2 3 1
3 2 2
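For reproducibility, the example data can be recreated as follows (values read off the table above):

```r
# Rebuild the example data: 3 nodes observed at times 1 through 4
df <- data.frame(
  node = rep(1:3, times = 4),
  time = rep(1:4, each = 3),
  H    = c(0, 0, 0,  0, 0, 2,  0, 1, 2,  1, 4, 0)
)
```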
One way is to filter your dataframe, and then take the first minimum element for each group:
library(dplyr)
df %>%
  filter(H > 0) %>%
  group_by(node) %>%
  slice_min(time, n = 1)
node time H
<int> <int> <int>
1 1 4 1
2 2 3 1
3 3 2 2

How to add 2 values of a column with respect to a group and an indicator

Suppose
Household person loop utility indicator
1 1 1 3 1
1 1 1 4 0
1 1 1 5 0
1 1 1 6 1
1 1 2 3 0
1 1 2 3 0
1 2 1 2 1
1 2 1 7 1
1 2 1 8 1
2 1 1 3 0
2 1 1 3 0
Within each household and each person, if the indicator is 1, I want to add the utility values of the first and last rows of each loop (if the indicator is 1 for the first row of a loop, it is 1 for the last row as well; the indicator for the middle rows of a loop does not matter).
First person in the first family: the indicator is 1 in the first and last rows of loop 1, so I add 3 + 6; the indicator is 0 for his second loop, so I don't need it (you can put NA in the output).
output
Household person loop utility
1 1 1 3 +6
1 2 1 2 +8
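For reproducibility, the example data can be entered as:

```r
# Rebuild the example data from the table in the question
df1 <- data.frame(
  Household = c(rep(1, 9), 2, 2),
  person    = c(rep(1, 6), rep(2, 3), 1, 1),
  loop      = c(1, 1, 1, 1, 2, 2,  1, 1, 1,  1, 1),
  utility   = c(3, 4, 5, 6, 3, 3,  2, 7, 8,  3, 3),
  indicator = c(1, 0, 0, 1, 0, 0,  1, 1, 1,  0, 0)
)
```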
We can filter the groups where the first and last indicator values are 1, and sum the first and last utility values for them:
library(dplyr)
df1 %>%
  group_by(Household, person, loop) %>%
  filter(first(indicator) == 1 & last(indicator) == 1) %>%
  summarise(utility = first(utility) + last(utility))
# Household person loop utility
# <int> <int> <int> <int>
#1 1 1 1 9
#2 1 2 1 10
In base R, it is a bit lengthier:
aggregate(utility ~ Household + person + loop,
          subset(df1, ave(indicator == 1, Household, person, loop,
                          FUN = function(x) x[1L] & x[length(x)])),
          function(x) x[1L] + x[length(x)])

Create a sequence conditional on values in other columns in R

I am working with a dataframe similar to the following:
df = data.frame(ID1 = c(2, 2, 2, 2, 2, 2, 2),
                ID2 = c(1, 1, 1, 1, 1, 1, 1),
                flagTag = c(0, 0, 0, 0, 1, 0, 0))
I need to create a new field "newField" whose value increments whenever flagTag = 1 within each group of ID1 and ID2 (unique records are identified by the combination of ID1 and ID2). The resulting table should look like:
ID1 ID2 flagTag newField
1 2 1 0 1
2 2 1 0 1
3 2 1 0 1
4 2 1 0 1
5 2 1 1 2
6 2 1 0 2
7 2 1 0 2
I am trying to do this using dplyr but couldn't come up with the logic for such a manipulation. One way is to go record by record through the dataframe and update "newField" in a loop, which would be a slow procedure.
Let's use cumsum and mutate:
library(dplyr)
df %>%
  group_by(ID1, ID2) %>%
  mutate(newField = 1 + cumsum(flagTag))
ID1 ID2 flagTag newField
<dbl> <dbl> <dbl> <dbl>
1 2 1 0 1
2 2 1 0 1
3 2 1 0 1
4 2 1 0 1
5 2 1 1 2
6 2 1 0 2
7 2 1 0 2
Here is a base R option with ave:
df$newField <- with(df, ave(flagTag, ID1, ID2, FUN = cumsum) + 1)
df$newField
#[1] 1 1 1 1 2 2 2
Or using data.table:
library(data.table)
setDT(df)[, newField := cumsum(flagTag) + 1, .(ID1, ID2)]
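All three variants rest on the same observation: cumsum(flagTag) increases by one at every flagged row and stays flat otherwise, so adding 1 yields the desired counter. On the example vector alone:

```r
flagTag <- c(0, 0, 0, 0, 1, 0, 0)
1 + cumsum(flagTag)  # 1 1 1 1 2 2 2
```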

Create New Column With Consecutive Count Of First Series Based on ID Column

I work in the healthcare industry and I'm using machine learning algorithms to develop a model to predict when patients will not show up for their appointments. I'm trying to create a new feature that will be the sum of each patient's most recent consecutive no-shows. I've looked around a lot on stackoverflow and other resources, but cannot find exactly what I'm looking for. As an example, if a patient has no-showed her past two most recent appointments, then every row of the new feature's column with her ID will be filled in with 2's. If she no-showed three times, but showed up for her most recent appointment, then the new column will be filled in with 0's.
I tried using plyr's ddply with cumsum, but it did not give me the results I'm looking for. I used:
ddply(a, .(ID), transform, ConsecutiveNoshows = cumsum(Noshow))
Here is an example data set ('1' signifies a no-show):
ID Noshow
1 1
1 1
1 0
1 0
1 1
2 0
2 1
2 1
3 1
3 0
3 1
3 1
3 1
This is my desired outcome:
ID Noshow ConsecutiveNoshows
1 1 2
1 1 2
1 0 2
1 0 2
1 1 2
2 0 0
2 1 0
2 1 0
3 1 1
3 0 1
3 1 1
3 1 1
3 1 1
I'll be very grateful for any help. Thank you.
The idea is to sum(), for each ID, the number of Noshow entries that appear before the first 0.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(ConsecutiveNoshows = sum(!cumsum(Noshow == 0) >= 1))
Which gives:
#Source: local data frame [13 x 3]
#Groups: ID [3]
#
# ID Noshow ConsecutiveNoshows
# <int> <int> <int>
#1 1 1 2
#2 1 1 2
#3 1 0 2
#4 1 0 2
#5 1 1 2
#6 2 0 0
#7 2 1 0
#8 2 1 0
#9 3 1 1
#10 3 0 1
#11 3 1 1
#12 3 1 1
#13 3 1 1
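To see the logic on a single ID, take ID 1's Noshow vector: cumsum(Noshow == 0) stays at 0 until the first 0 appears, so negating the >= 1 test flags exactly the leading run of no-shows (note that in R, ! binds more loosely than >=, so the expression parses as !(... >= 1)):

```r
Noshow <- c(1, 1, 0, 0, 1)      # ID 1 from the example
cumsum(Noshow == 0)             # 0 0 1 2 2
sum(!cumsum(Noshow == 0) >= 1)  # 2 leading no-shows
```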
