How to select continuous rows in R data frame based on conditions?

I have a dataframe df with columns Date, Group and Gap_Days. For each group I want to select all rows where Gap_Days is continuously 1, counting back from the latest date (max date). Once Gap_Days is not equal to 1, that row and everything before it are ignored. For reproducibility I have created the current df and the expected df:
df <- data.frame(
  Date = c("2018-10-15", "2018-10-16", "2018-10-17",
           "2018-10-14", "2018-10-15", "2018-10-16", "2018-10-18", "2018-10-19",
           "2018-10-18", "2018-10-21", "2018-10-23", "2018-10-24", "2018-10-27", "2018-10-28"),
  Group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "c"),
  Gap_Days = c(1, 1, 1, 1, 1, 2, 1, 1, 3, 2, 1, 3, 1, 1))
df_expected <- data.frame(
  Date = c("2018-10-15", "2018-10-16", "2018-10-17", "2018-10-18", "2018-10-19", "2018-10-27", "2018-10-28"),
  Group = c("a", "a", "a", "b", "b", "c", "c"),
  Gap_Days = c(1, 1, 1, 1, 1, 1, 1))

The only difference between my first comment and the working solution below is the introduction of grouping, which the question requires.
Base R (note that cumall() itself comes from dplyr, so that package still needs to be loaded):
do.call("rbind", by(df, df$Group, FUN = function(d) d[rev(cumall(rev(d$Gap_Days == 1))), ]))
# Date Group Gap_Days
# a.1 2018-10-15 a 1
# a.2 2018-10-16 a 1
# a.3 2018-10-17 a 1
# b.7 2018-10-18 b 1
# b.8 2018-10-19 b 1
# c.13 2018-10-27 c 1
# c.14 2018-10-28 c 1
Tidyverse:
df %>%
  group_by(Group) %>%
  filter(rev(cumall(rev(Gap_Days == 1)))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date Group Gap_Days
# <fct> <fct> <dbl>
# 1 2018-10-15 a 1
# 2 2018-10-16 a 1
# 3 2018-10-17 a 1
# 4 2018-10-18 b 1
# 5 2018-10-19 b 1
# 6 2018-10-27 c 1
# 7 2018-10-28 c 1
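The rev(cumall(rev(...))) idiom is what isolates the trailing run of 1s: reversing puts the latest date first, cumall() (from dplyr) stays TRUE only until the first non-1, and the outer rev() restores the original order. A quick trace on group "c"'s Gap_Days:
x <- c(3, 2, 1, 3, 1, 1)   # Gap_Days for group "c" in df
rev(x == 1)                # TRUE TRUE FALSE TRUE FALSE FALSE
cumall(rev(x == 1))        # TRUE TRUE FALSE FALSE FALSE FALSE
rev(cumall(rev(x == 1)))   # FALSE FALSE FALSE FALSE TRUE TRUE -- the last two rows survive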

Here is one method with tidyverse (borrowing rleid() from data.table):
library(dplyr)
library(data.table)
df %>%
  group_by(grp = rleid(Gap_Days),
           ind = any(Date == max(.data$Date))) %>%
  ungroup %>%
  filter(grp == max(grp) & ind) %>%
  select(-ind, -grp)
# A tibble: 3 x 2
# Date Gap_Days
# <date> <dbl>
#1 2018-10-19 1
#2 2018-10-20 1
#3 2018-10-21 1
If the 'Date' column is already ordered, then we just need to check the 1s in 'Gap_Days':
i1 <- inverse.rle(within.list(rle(df$Gap_Days == 1),
                              values[lengths < max(lengths) & values] <- FALSE))
df[i1, , drop = FALSE]
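To see what the rle()/inverse.rle() pair is doing: rle() splits the logical vector into runs, the assignment flips every TRUE run shorter than the longest run to FALSE, and inverse.rle() expands the runs back out. A small demonstration on a toy vector of my own (note this keeps the longest TRUE run, not specifically the trailing one, which happens to coincide here):
r <- rle(c(TRUE, FALSE, TRUE, TRUE))
r$values[r$lengths < max(r$lengths) & r$values] <- FALSE
inverse.rle(r)
# [1] FALSE FALSE  TRUE  TRUE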

Related

How to count exact matches across two data frames within IDs in R

I have two datasets similar to the ones below (but with 4m observations) and I want to count the number of matching sample days between the two data frames, i.e. rows where both ID and date agree (see example below).
DF1
ID date
1 1992-10-15
1 2010-02-17
2 2019-09-17
2 2015-08-18
3 2020-10-27
3 2020-12-23
DF2
ID date
1 1992-10-15
1 2001-04-25
1 2010-02-17
3 1990-06-22
3 2014-08-18
3 2020-10-27
Expected output
ID Count
1 2
2 0
3 1
I have tried the aggregate function (though I am unsure what to put in "which"):
test <- aggregate(date~ID, rbind(DF1, DF2), length(which(exact?)))
and the table function:
Y<-table(DF1$ID)
X <- table(DF2$ID)
Y2 <- DF1[Y %in% X,]
I am having trouble finding an example to help my situation.
Your help is appreciated!
In base R:
data.frame(table(factor(merge(DF1, DF2)$ID, unique(DF1$ID))))
Var1 Freq
1 1 2
2 2 0
3 3 1
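Unpacking that one-liner step by step (same logic, just with named intermediates):
matches <- merge(DF1, DF2)                          # inner join on the shared columns ID and date
ids <- factor(matches$ID, levels = unique(DF1$ID))  # explicit levels keep the zero-match IDs
data.frame(table(ids))                              # table() counts per level, including empty ones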
Using tidyverse:
library(dplyr)
library(tidyr)
inner_join(DF1, DF2) %>%
  complete(ID = unique(DF1$ID)) %>%
  reframe(Freq = sum(!is.na(date)), .by = "ID")
Output:
# A tibble: 3 × 2
ID Freq
<int> <int>
1 1 2
2 2 0
3 3 1
Here is one way to do it with 'dplyr' and 'tidyr':
library(dplyr)
library(tidyr)
DF1 %>%
  semi_join(DF2) %>%
  count(ID) %>%
  complete(ID = DF1$ID,
           fill = list(n = 0))
#> Joining with `by = join_by(ID, date)`
#> # A tibble: 3 × 2
#> ID n
#> <dbl> <int>
#> 1 1 2
#> 2 2 0
#> 3 3 1
data
DF1 <- tibble(ID = c(1,1,2,2,3,3),
date = c("1992-10-15", "2010-02-17", "2019-09-17",
"2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- tibble(ID = c(1,1,1,3,3,3),
date = c("1992-10-15", "2001-04-25", "2010-02-17",
"1990-06-22", "2014-08-18", "2020-10-27"))
Created on 2023-02-16 with reprex v2.0.2
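For completeness, the same count can be sketched in data.table as well (my addition, not one of the answers above; it assumes the DF1/DF2 from the data block):
library(data.table)
d1 <- as.data.table(DF1)
d2 <- as.data.table(DF2)
counts <- merge(d1, d2)[, .N, by = ID]              # inner join on ID + date, then count per ID
out <- counts[d1[, .(ID = unique(ID))], on = "ID"]  # join back onto all IDs in DF1
out[is.na(N), N := 0]                               # zero-match IDs get 0 instead of NA
out                                                 # ID 1 -> 2, ID 2 -> 0, ID 3 -> 1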

Subsetting first Observation per id and date in r

I want to subset the first date per observation per id, for example to get just the rows for the first date on which observations A and B appeared. If we have the following dataset:
df =
id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A
the outcome should look like this:
df =
id date Observation
1 3 A
1 2 B
2 5 B
2 3 A
thanks
If you don't mind the order being different, it can be accomplished using dplyr by grouping then slicing:
library(tidyverse)
df <- read_table("id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A")
df %>%
  group_by(id, Observation) %>%
  slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, Observation [4]
#> id date Observation
#> <dbl> <dbl> <chr>
#> 1 1 3 A
#> 2 1 2 B
#> 3 2 3 A
#> 4 2 5 B
Created on 2021-04-12 by the reprex package (v1.0.0)
library(dplyr)
df %>%
  group_by(id, Observation) %>%
  slice(1) %>%
  ungroup()
# OR
df %>%
  group_by(id, Observation) %>%
  filter(row_number() == 1) %>%
  ungroup()
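One caveat with slice(1): it keeps the first row in data order, which coincides with the earliest date only if the rows are already sorted. To pick the minimum date explicitly, a sketch assuming dplyr >= 1.0.0 for slice_min():
df %>%
  group_by(id, Observation) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%  # earliest date per id/Observation pair
  ungroup()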

Rank function in R after group by

How can I use R to create a rank column? Below is an example.
This is what I have:
Date group
12/5/2020 A
12/5/2020 A
11/7/2020 A
11/7/2020 A
11/9/2020 B
11/9/2020 B
10/8/2020 B
This is what I want:
Date group rank
12/5/2020 A 2
12/5/2020 A 2
11/7/2020 A 1
11/7/2020 A 1
11/9/2020 B 2
11/9/2020 B 2
10/8/2020 B 1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value, so that operations on the factor are faster than operations on the original strings. That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) plus a very short vector of strings, where the integers are indices into the small vector of strings. This small vector is called the factor's "levels".
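For instance, a quick illustration of that index/levels split:
f <- factor(c("12/5/2020", "11/7/2020", "12/5/2020"))
as.integer(f)  # 2 1 2: indices into the levels
levels(f)      # "11/7/2020" "12/5/2020": sorted lexicographically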
library(dplyr)
group_by(dat, group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
It relies on the lexicographic sorting of the Date column, which happens to be acceptable for this data sample but will fail in general. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = as.integer(factor(Date))) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
and
There really are better functions for ranking, such as dplyr::dense_rank (which @akrun put in an answer first ... I was building to it, honestly):
dat %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(group) %>%
  mutate(rank = dense_rank(Date)) %>%
  ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
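As an aside on why dense_rank() fits here: it assigns consecutive ranks with no gaps after ties, whereas min_rank() leaves gaps:
library(dplyr)
dense_rank(c(10, 10, 20))  # 1 1 2
min_rank(c(10, 10, 20))    # 1 1 3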
We can use dense_rank after converting the 'Date' to Date class
library(dplyr)
library(lubridate)
df1 %>%
  group_by(group) %>%
  mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to an actual date object, arrange the data by Date, and use match() with unique() to get the rank column.
library(dplyr)
df %>%
  mutate(Date = lubridate::mdy(Date)) %>%
  arrange(group, Date) %>%
  group_by(group) %>%
  mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
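The match()/unique() pairing works because, after arrange(), unique(Date) lists each group's distinct dates in ascending order, and match() returns each date's position in that list:
d <- c(5, 5, 7)       # dates within one group, already sorted
match(d, unique(d))   # 1 1 2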
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))

Getting observations until and including first different value (groups with "no switch" are allowed)

I have a slightly convoluted way to slice a data frame by group from the first row (it always starts with the same value) till (and including) the first different value.
I thought about using slice(1:min(which(value == new.value))), but there are groups where this switch does not happen, and this is what causes me a headache. I could split the data into the groups where a switch happens and those where it does not, and do the calculation on only those with a switch, but I would love to know if there are more elegant options out there. I am open to any package.
library(dplyr)
mydf <- data.frame(group = rep(letters[1:3], each = 4), value = c(1,2,2,2, 1, 1,1,1,1,1,2,2))
The following does not work, because there are groups without a "switch":
mydf %>% group_by(group) %>% slice(1:min(which(value == 2)))
#> Warning in min(which(value == 2)): no non-missing arguments to min; returning
#> Inf
#> Error in 1:min(which(value == 2)): result would be too long a vector
Doing the slice operation on only the groups with a switch and binding with the "no-switchers" works:
mydf_grouped <- mydf %>% group_by(group)
mydf_grouped %>%
  filter(any(value == 2)) %>%
  slice(1:min(which(value == 2))) %>%
  bind_rows(filter(mydf_grouped, !any(value == 2)))
#> # A tibble: 9 x 2
#> # Groups: group [3]
#> group value
#> <fct> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 c 1
#> 4 c 1
#> 5 c 2
#> 6 b 1
#> 7 b 1
#> 8 b 1
#> 9 b 1
Created on 2019-12-22 by the reprex package (v0.3.0)
Here, one option is to pass an if/else condition:
library(dplyr)
mydf %>%
  group_by(group) %>%
  slice(if (!2 %in% value) row_number() else seq_len(match(2, value)))
Or, more compactly, change the nomatch in match() to n():
mydf %>%
  group_by(group) %>%
  slice(seq_len(match(2, value, nomatch = n())))
# A tibble: 9 x 2
# Groups: group [3]
# group value
# <fct> <dbl>
#1 a 1
#2 a 2
#3 b 1
#4 b 1
#5 b 1
#6 b 1
#7 c 1
#8 c 1
#9 c 2
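To see why nomatch = n() handles the groups without a switch: match() returns the position of the first 2, and when there is no 2 the nomatch value (here the group size) is used instead, so seq_len() covers the whole group:
match(2, c(1, 2, 2))                        # 2: first 2 sits in position 2
match(2, c(1, 1, 1))                        # NA: no 2 in this group
match(2, c(1, 1, 1), nomatch = 3)           # 3: fall back to the group size
seq_len(match(2, c(1, 1, 1), nomatch = 3))  # 1 2 3, i.e. keep every row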
We want all rows having a value of 1 as well as the row with the first 2 in each group:
mydf %>%
  group_by(group) %>%
  filter(value == 1 | cumsum(value == 2) == 1) %>%
  ungroup
We can use rleid to create an index of changes in value, shift it by 1 position, and select all rows up to and including the first change.
library(data.table)
setDT(mydf)
mydf[, .SD[shift(rleid(value), fill = 1) == 1], group]
# group value
#1: a 1
#2: a 2
#3: b 1
#4: b 1
#5: b 1
#6: b 1
#7: c 1
#8: c 1
#9: c 2
The same logic in dplyr can be implemented by
library(dplyr)
mydf %>%
  group_by(group) %>%
  filter(lag(cumsum(value != lag(value, default = 1)), default = 0) == 0)
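A worked trace of that filter for group "a" (value = 1, 2, 2, 2), spelling out each intermediate step:
library(dplyr)
v <- c(1, 2, 2, 2)
v != lag(v, default = 1)                            # FALSE TRUE FALSE FALSE: where the value changes
cumsum(v != lag(v, default = 1))                    # 0 1 1 1: running count of changes
lag(cumsum(v != lag(v, default = 1)), default = 0)  # 0 0 1 1: shifted so the first changed row survives
# keeping rows where this equals 0 selects rows 1 and 2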

Select first positive match per ID per date in R

I have a dataframe with different observations over time. As soon as an ID has a positive value for "Match", the rows with that ID on the dates that follow have to be removed. This is an example dataframe:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 5 1
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 5 0
2018-06-08 6 1
2018-06-08 7 1
2018-06-08 8 1
Desired output:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 6 1
2018-06-08 8 1
In other words, because ID=5 has a positive match on 2018-06-06, the rows with ID=5 are removed for the following days BUT the row with the first positive match for this ID is kept.
Reproducible example:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)
Thank you in advance
One way:
library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format
df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]
ID Date Match
1: 5 2018-06-06 1
2: 6 2018-06-06 0
3: 6 2018-06-07 0
4: 6 2018-06-08 1
5: 7 2018-06-07 1
6: 8 2018-06-08 1
We want to drop rows after the first Match == 1.
cumsum takes the cumulative sum of Match. It is zero until the first Match == 1. We want to keep the latter row and so check cumsum on the preceding row with shift.
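A worked trace for ID = 5 (Match = 1, 1, 0 across its three dates):
library(data.table)
m <- c(1, 1, 0)             # Match for ID 5, in date order
cumsum(m)                   # 1 2 2
shift(cumsum(m), fill = 0)  # 0 1 2: only row 1 sees a 0, so only the first positive match is kept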
Here's an alternative approach, where we spot the minimum row number where Match = 1 (i.e. first row with positive match) for each ID and we filter on that:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))
library(dplyr)
df %>%
  group_by(ID) %>%                                    # for each ID
  mutate(min_row = min(row_number()[Match == 1])) %>% # get the first row where you have 1
  filter(row_number() <= min_row) %>%                 # keep previous rows and that row
  ungroup() %>%                                       # forget the grouping
  select(-min_row)                                    # remove unnecessary column
# # A tibble: 6 x 3
# Date ID Match
# <fct> <fct> <fct>
# 1 2018-06-06 5 1
# 2 2018-06-06 6 0
# 3 2018-06-07 6 0
# 4 2018-06-07 7 1
# 5 2018-06-08 6 1
# 6 2018-06-08 8 1
You can run the code step by step to see how it works. I've created min_row column to help you understand. You can re-write the above as
df %>%
  group_by(ID) %>%
  filter(row_number() <= min(row_number()[Match == 1])) %>%
  ungroup()
Inspired by @Frank's answer:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = cumsum(as.numeric(Match))) %>%
  filter(Match == 0 & Flag == 0 | Match == 1 & Flag == 1)
# A tibble: 6 x 4
# Groups: ID [4]
Date ID Match Flag
<chr> <chr> <chr> <dbl>
1 2018-06-06 5 1 1
2 2018-06-06 6 0 0
3 2018-06-07 6 0 0
4 2018-06-07 7 1 1
5 2018-06-08 6 1 1
6 2018-06-08 8 1 1
Data
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)
I have another way to do it with dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  # You can use order(Date) if you don't want to coerce Date into a date object
  mutate(ord = order(Date), first_match = min(ord[Match > 0]), ind = seq_along(Date)) %>%
  filter(ind <= first_match) %>%
  select(Date:Match)
# A tibble: 6 x 3
# Groups: ID [4]
Date ID Match
<chr> <dbl> <dbl>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1
Here is another dplyr option:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(ID) %>%
  mutate(first_match = min(Date[Match == 1])) %>%
  filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>%
  ungroup() %>%
  select(-first_match)
# A tibble: 6 x 3
Date ID Match
<date> <fct> <fct>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1
