Select first positive match per ID per date in R

I have a dataframe with different observations over time. As soon as an ID has a positive value for "Match", the rows with that ID on the dates that follow have to be removed. This is an example dataframe:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 5 1
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 5 0
2018-06-08 6 1
2018-06-08 7 1
2018-06-08 8 1
Desired output:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 6 1
2018-06-08 8 1
In other words, because ID=5 has a positive match on 2018-06-06, the rows with ID=5 are removed for the following days BUT the row with the first positive match for this ID is kept.
Reproducible example:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)
Thank you in advance

One way:
library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format
df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]
ID Date Match
1: 5 2018-06-06 1
2: 6 2018-06-06 0
3: 6 2018-06-07 0
4: 6 2018-06-08 1
5: 7 2018-06-07 1
6: 8 2018-06-08 1
We want to drop rows after the first Match == 1.
cumsum takes the cumulative sum of Match. It is zero until the first Match == 1. We want to keep that first matching row as well, so we check the cumsum on the preceding row, which shift provides.
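To see the cumsum-and-lag idea in isolation, here is a minimal base R sketch on a single made-up vector of Match values for one ID (the vector is illustrative, not taken from the question). The lag is built by hand and is only meant to mirror what shift(x, fill = 0) does in the answer above.

```r
# Match values for a single ID, ordered by date (illustrative vector)
match <- c(0, 0, 1, 0, 1)

# Running count of positive matches seen so far
running <- cumsum(match)            # 0 0 1 1 2

# Lag by one position, filling the first slot with 0
before <- c(0, head(running, -1))   # 0 0 0 1 1

# Keep a row only if no positive match occurred on any earlier row
keep <- before == 0                 # TRUE TRUE TRUE FALSE FALSE
match[keep]                         # 0 0 1
```

The first row with Match == 1 survives because its lagged count is still zero; everything after it is dropped.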

Here's an alternative approach, where we spot the minimum row number where Match = 1 (i.e. first row with positive match) for each ID and we filter on that:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))
library(dplyr)
df %>%
group_by(ID) %>% # for each ID
mutate(min_row = min(row_number()[Match == 1])) %>% # get the first row where you have 1
filter(row_number() <= min_row) %>% # keep previous rows and that row
ungroup() %>% # forget the grouping
select(-min_row) # remove unnecessary column
# # A tibble: 6 x 3
# Date ID Match
# <fct> <fct> <fct>
# 1 2018-06-06 5 1
# 2 2018-06-06 6 0
# 3 2018-06-07 6 0
# 4 2018-06-07 7 1
# 5 2018-06-08 6 1
# 6 2018-06-08 8 1
You can run the code step by step to see how it works. I've created min_row column to help you understand. You can re-write the above as
df %>%
group_by(ID) %>%
filter(row_number() <= min(row_number()[Match == 1])) %>%
ungroup()
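One thing to be aware of with this approach: if an ID never has Match == 1, min() is called on an empty vector, which returns Inf with a warning, so all rows of that ID are kept. A rough base R sketch that makes this case explicit is below; first_pos is a hypothetical helper name, not part of any package.

```r
df <- data.frame(
  Date  = c("2018-06-06", "2018-06-06", "2018-06-07", "2018-06-07", "2018-06-07",
            "2018-06-08", "2018-06-08", "2018-06-08", "2018-06-08"),
  ID    = c(5, 6, 5, 6, 7, 5, 6, 7, 8),
  Match = c(1, 0, 1, 0, 1, 0, 1, 1, 1)
)

# Hypothetical helper: position of the first 1, or the group length if there is none
first_pos <- function(m) {
  hits <- which(m == 1)
  if (length(hits) == 0) length(m) else hits[1]
}

# Per ID, flag rows up to and including the first positive match.
# ave() stores the logical flags as 0/1 because df$Match is numeric.
keep <- ave(df$Match, df$ID, FUN = function(m) seq_along(m) <= first_pos(m)) == 1
result <- df[keep, ]
result
```

Like the dplyr version, this assumes the rows are already sorted by Date within each ID.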

Inspired by @Frank's answer:
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag = cumsum(as.numeric(Match))) %>%
filter(Match==0 & Flag==0 | Match==1 & Flag==1)
# A tibble: 6 x 4
# Groups: ID [4]
Date ID Match Flag
<chr> <chr> <chr> <dbl>
1 2018-06-06 5 1 1
2 2018-06-06 6 0 0
3 2018-06-07 6 0 0
4 2018-06-07 7 1 1
5 2018-06-08 6 1 1
6 2018-06-08 8 1 1
Data
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)

I have another way to do it with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
# order(Date) works on the ISO-formatted strings, so Date need not be coerced to a Date object; this assumes rows are already sorted by Date within each ID
mutate(ord = order(Date), first_match = min(ord[Match > 0]), ind = seq_along(Date)) %>%
filter(ind <= first_match) %>%
select(Date:Match)
# A tibble: 6 x 3
# Groups: ID [4]
Date ID Match
<chr> <dbl> <dbl>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1

Here is another dplyr option:
library(dplyr)
df %>%
mutate(Date = as.Date(Date)) %>%
group_by(ID) %>%
mutate(first_match = min(Date[Match == 1])) %>%
filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>%
ungroup() %>%
select(-first_match)
# A tibble: 6 x 3
Date ID Match
<date> <fct> <fct>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1

Related

Subsetting first Observation per id and date in r

I want to subset the first date per observation per id. For example, just get the rows for the first date in which observations A and B appeared. If we have the following dataset:
df =
id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A
the outcome should look like this:
df =
id date Observation
1 3 A
1 2 B
2 5 B
2 3 A
thanks
If you don't mind the order being different, it can be accomplished using dplyr by grouping then slicing:
library(tidyverse)
df <- read_table("id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A")
df %>%
group_by(id, Observation) %>%
slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, Observation [4]
#> id date Observation
#> <dbl> <dbl> <chr>
#> 1 1 3 A
#> 2 1 2 B
#> 3 2 3 A
#> 4 2 5 B
Created on 2021-04-12 by the reprex package (v1.0.0)
library(dplyr)
df %>%
group_by(id, Observation) %>%
slice(1) %>%
ungroup()
# OR
df %>%
group_by(id, Observation) %>%
filter(row_number() == 1) %>%
ungroup()
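Note that slice(1) and row_number() == 1 take the first row in the current row order, which is only the earliest date if the data are already sorted that way. A base R sketch that sorts first and then keeps the first row per id/Observation pair with duplicated() could look like this (the data frame is retyped from the question):

```r
df <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  date        = c(3, 2, 8, 5, 3, 9),
  Observation = c("A", "B", "B", "B", "A", "A")
)

# Sort by date within id so that "first" means earliest date, not first row
sorted <- df[order(df$id, df$date), ]

# duplicated() marks every repeat of an id/Observation pair after its first occurrence
first_rows <- sorted[!duplicated(sorted[c("id", "Observation")]), ]
first_rows
```

The result has the same four rows as the dplyr answers, ordered by id and date.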

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables, and the NAs in Expected stand for whole rows of NAs in the dataframe. The actual dataframe also has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If you need Signal to be NA in the output for the missing combinations, you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
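For comparison, the same idea can be sketched in base R by building the full day-by-Signal grid and left-joining A onto it. This is only a sketch under the same assumption as above: a new day is detected whenever Signal decreases, so it would miss a day boundary if one day ends on a lower Signal than the next day starts with. The column name day is made up for the example.

```r
A <- data.frame(Signal = c(1, 2, 3, 5, 1, 2, 4, 5, 6), var1 = rep(1, 9))

# Day counter: a new day starts whenever Signal decreases
A$day <- cumsum(c(TRUE, diff(A$Signal) < 0))

# Full grid with every Signal (1 to 6) for every detected day
grid <- expand.grid(Signal = 1:6, day = unique(A$day))

# Left join: grid rows without a partner in A get NA in var1
filled <- merge(grid, A, by = c("day", "Signal"), all.x = TRUE)
filled <- filled[order(filled$day, filled$Signal), ]
filled
```

Dropping the day column afterwards gives the same shape as the complete() output.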

How to remove zero values until the first non-zero value occurs in an R dataframe?

The title says it all! I have grouped data where I'd like to remove rows up until the first non-zero value by id group.
Example code:
problem <- data.frame(
id = c(1,1,1,1,2,2,2,2,3,3,3,3),
value = c(0,0,2,0,0,8,4,2,1,7,6,5)
)
solution <- data.frame(
id = c(1,1,2,2,2,3,3,3,3),
value = c(2,0,8,4,2,1,7,6,5)
)
Here is a dplyr solution:
library(dplyr)
problem %>%
group_by(id) %>%
mutate(first_match = min(row_number()[value != 0])) %>%
filter(row_number() >= first_match) %>%
select(-first_match) %>%
ungroup()
# A tibble: 9 x 2
id value
<dbl> <dbl>
1 1 2
2 1 0
3 2 8
4 2 4
5 2 2
6 3 1
7 3 7
8 3 6
9 3 5
Or more succinctly per Tjebo's comment:
problem %>%
group_by(id) %>%
filter(row_number() >= min(row_number()[value != 0])) %>%
ungroup()
You can do this in base R:
subset(problem,ave(value,id,FUN=cumsum)>0)
# id value
# 3 1 2
# 4 1 0
# 6 2 8
# 7 2 4
# 8 2 2
# 9 3 1
# 10 3 7
# 11 3 6
# 12 3 5
Use abs(value) if you have negative values in your real case.
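To illustrate why negative values matter here: a running sum of the raw values can fall back to zero or below after the first non-zero value. One alternative to abs(value) is to take the cumulative sum of the value != 0 flag instead. A small sketch with made-up data containing a negative value:

```r
problem <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2),
  value = c(0, 0, -2, 2, 0, 8, 4, 2)
)

# Count, per id, how many non-zero values have been seen so far
seen <- ave(as.integer(problem$value != 0), problem$id, FUN = cumsum)

# Keep rows from the first non-zero value onwards
result <- problem[seen > 0, ]
result

# For comparison, cumsum on the raw values drops all of id 1 (running sums 0, 0, -2, 0)
naive <- subset(problem, ave(value, id, FUN = cumsum) > 0)
```

Here result keeps the rows with values -2, 2, 8, 4, 2, while naive loses both rows of id 1.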

Filter (subset) by conditions in 2 columns in R (dplyr or otherwise)

Given a dataset such as:
set.seed(134)
df<- data.frame(ID= rep(LETTERS[1:5], each=2),
condition=rep(0:1, 5),
value=rpois(10, 3)
)
df
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
4 B 1 2
5 C 0 3
6 C 1 1
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
For each ID, when the value for condition==0 is less than the value for condition==1, I want to keep both observations. When the value for condition==0 is greater than condition==1, I want to keep only the row for condition==0.
The subset returned should be this:
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
5 C 0 3
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
Using dplyr the first step is:
df %>% group_by(ID) %>%
But not sure where to go from there.
Translating fairly literally,
library(dplyr)
set.seed(134)
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
condition = rep(0:1, 5),
value = rpois(10, 3))
df %>% group_by(ID) %>%
filter(condition == 0 |
(condition == 1 & value > value[condition == 0]))
#> # A tibble: 8 x 3
#> # Groups: ID [5]
#> ID condition value
#> <fct> <int> <int>
#> 1 A 0 2
#> 2 A 1 3
#> 3 B 0 5
#> 4 C 0 3
#> 5 D 0 2
#> 6 D 1 4
#> 7 E 0 1
#> 8 E 1 5
This depends on each group having a single observation with condition == 0, but should otherwise be fairly robust.
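That assumption can be checked up front. The following base R sketch first verifies there is at most one condition == 0 row per ID, then looks up each row's baseline value with match(), so the result does not depend on row order. The values are typed in from the printed data frame in the question rather than regenerated with rpois.

```r
df <- data.frame(
  ID        = rep(LETTERS[1:5], each = 2),
  condition = rep(0:1, 5),
  value     = c(2, 3, 5, 2, 3, 1, 2, 4, 1, 5)
)

# The approach relies on a single condition == 0 row per ID
base_rows <- df[df$condition == 0, ]
stopifnot(!anyDuplicated(base_rows$ID))

# Baseline value (the condition == 0 value) looked up for every row by ID
baseline <- base_rows$value[match(df$ID, base_rows$ID)]

# Keep every condition == 0 row, and condition == 1 rows that exceed the baseline
keep <- df$condition == 0 | df$value > baseline
df[keep, ]
```

This keeps rows 1, 2, 3, 5, 7, 8, 9 and 10, matching the subset requested in the question.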
This may not be the easiest way, but it should work as you want.
library(dplyr)
library(reshape2)
df %>%
dcast(ID ~ condition, value.var = 'value') %>% # cast to wide format
mutate(`1` = ifelse(`1` > `0`, `1`, NA)) %>% # set the condition-1 value to NA unless it exceeds the condition-0 value
melt('ID') %>% # melt as long format
arrange(ID) %>% # sort by ID
filter(complete.cases(.)) # remove NA rows
Output:
ID variable value
1 A 0 2
2 A 1 3
3 B 0 5
4 C 0 3
5 D 0 2
6 D 1 4
7 E 0 1
8 E 1 5
You always want the value from the first row in each group. You only want the value from the second row in each group if it's larger than the first.
This works:
df %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))
Edit: as @alistaire points out, this method depends on the rows being in a particular order within each group, which it might be a good idea to guarantee as follows:
df %>%
arrange(ID, condition) %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))

Filter all rows of a group according to specific member of group [duplicate]

I want to filter an entire group based on a value at a specified row.
In the data below, I'd like to remove all rows of a group ID according to the value of Metric for Hour == '2'. (Note that I am not trying to filter based on two conditions here; I'm trying to filter based on one condition, but at a specific row.)
Sample data:
ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')
Metric <- c(3,4,1,6,7,8,8,3,6,1,1)
x <- data.frame(ID, Hour, Metric)
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
6 B 0 8
7 B 2 8
8 B 5 3
9 B 6 6
10 C 0 1
11 C 2 1
I want to filter each ID based on whether Metric > 5 for Hour == '2'. The result should look like this (all rows of ID B are removed):
ID Hour Metric
1 A 0 3
2 A 2 4
3 A 5 1
4 A 6 6
5 A 9 7
10 C 0 1
11 C 2 1
A dplyr-based solution would be preferred, but any help is much appreciated.
Adapting How to filter (with dplyr) for all values of a group if variable limit is reached?
we get:
x %>%
group_by(ID) %>%
filter(any(Metric[Hour == '2'] <= 5))
# # A tibble: 7 x 3
# # Groups: ID [2]
# ID Hour Metric
# <fctr> <fctr> <dbl>
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
This type of problem can also be solved by first creating a by-group intermediate variable that flags whether rows should be removed.
Method 1:
x %>%
group_by(ID) %>%
mutate(keep_group = (any(Metric[Hour == '2'] <= 5))) %>%
ungroup %>%
filter(keep_group) %>%
select(-keep_group)
Method 2:
groups_to_keep <-
x %>%
filter(Hour == '2', Metric <= 5) %>%
select(ID) %>%
distinct() # N.B. this sorts groups_to_keep by ID which may not be desired
# ID
# 1 A
# 2 C
x %>%
inner_join(groups_to_keep, by = 'ID')
# ID Hour Metric
# 1 A 0 3
# 2 A 2 4
# 3 A 5 1
# 4 A 6 6
# 5 A 9 7
# 6 C 0 1
# 7 C 2 1
Method 3 - as suggested by @thelatemail (safe with respect to duplicates in ID):
groups_not_to_keep <-
x %>%
filter(Hour == '2', Metric > 5) %>%
select(ID)
x %>%
anti_join(groups_not_to_keep, by = 'ID')
Negating %in% with !() should be useful here. Try this:
library(dplyr)
filter(x, Metric > 5 & Hour == '2')$ID # gives B
subset(x, !(ID %in% filter(x, Metric > 5 & Hour == '2')$ID))
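One difference between the "keep" and "remove" formulations above is worth spelling out: they disagree for a group that has no Hour == '2' row at all. The base R sketch below uses a small made-up data frame where group D lacks that row; which behaviour is wanted depends on the real data.

```r
x <- data.frame(
  ID     = c('A', 'A', 'B', 'B', 'D'),
  Hour   = c('0', '2', '0', '2', '0'),
  Metric = c(3, 4, 8, 8, 9)
)

# IDs whose Hour == '2' row fails or passes the Metric <= 5 test
bad_ids  <- unique(x$ID[x$Hour == '2' & x$Metric > 5])    # B
good_ids <- unique(x$ID[x$Hour == '2' & x$Metric <= 5])   # A

# "Remove the bad groups": D has no Hour == '2' row, so it is kept
removed_bad <- x[!(x$ID %in% bad_ids), ]

# "Keep the good groups": D has no Hour == '2' row, so it is dropped
kept_good <- x[x$ID %in% good_ids, ]
```

The anti_join and !(ID %in% ...) answers behave like removed_bad, while the any(...) filter and the inner_join behave like kept_good.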
