Subsetting data frame with multiple date conditions for ranges in between - r

I need subsets between multiple dates.
Example data frame:
testdf <- data.frame(short_date = seq(as.Date("2007-03-01"),
as.Date("2008-09-01"), by = 'day'))
An example of data frame with values for date ranges:
dates_cut <- structure(list(emergence = structure(c(13627, 13997), class = "Date"), disease_onset = structure(c(13694, 14062), class = "Date")), .Names = c("emergence", "disease_onset"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Obviously this is just a sample; there are a number of years for which I need the subsets of data between the $emergence date and the $disease_onset date.
This works for one date range:
testdf %>% filter(short_date >= dates_cut$emergence[1], short_date <= dates_cut$disease_onset[1])
The problem is when there are multiple date ranges.
Thanks.

One option is to lapply over the rows of dates_cut, store each subset in a list, and then rbind them all together with do.call:
subsets <- lapply(seq_len(nrow(dates_cut)), function(i) {
  # `$` extracts a plain Date even when dates_cut is a tibble
  testdf[which(testdf$short_date >= dates_cut$emergence[i] &
               testdf$short_date <= dates_cut$disease_onset[i]), , drop = FALSE]})
# stack the per-range subsets into one data frame
res <- do.call(rbind, subsets)
head(res)
# short_date
#55 2007-04-24
#56 2007-04-25
#57 2007-04-26
#58 2007-04-27
#59 2007-04-28
#60 2007-04-29
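As a variation on the lapply approach above, here is a minimal sketch that builds one logical column per date range and keeps the rows that fall inside at least one of them. It assumes dates_cut is rebuilt as a plain data.frame (the question uses a tibble), and the name in_range is only illustrative.

```r
# Sample data rebuilt from the question (plain data.frame instead of a tibble)
testdf <- data.frame(short_date = seq(as.Date("2007-03-01"),
                                      as.Date("2008-09-01"), by = "day"))
dates_cut <- data.frame(
  emergence     = as.Date(c(13627, 13997), origin = "1970-01-01"),
  disease_onset = as.Date(c(13694, 14062), origin = "1970-01-01")
)

# One logical column per date range: TRUE where short_date lies inside it
in_range <- vapply(seq_len(nrow(dates_cut)), function(i) {
  testdf$short_date >= dates_cut$emergence[i] &
    testdf$short_date <= dates_cut$disease_onset[i]
}, logical(nrow(testdf)))

# Keep rows that fall inside at least one range
res <- testdf[rowSums(in_range) > 0, , drop = FALSE]
head(res)
```

Unlike stacking the subsets with rbind, this keeps each row at most once, which may matter if the date ranges ever overlap.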

Related

max(DF1$EFFDT) <= DF2$EFFDT [duplicate]

I have two dataframes: DF1 contains monthly data snapshots, and DF2 contains a particular date. I want to retrieve from DF1 only the row with the closest date that is on or before (<=) the date in DF2.
DF1
Account  Date
A1000001 1-JAN-2021
A1000002 1-FEB-2021
A1000003 1-MAR-2021
A1000004 1-APR-2021
DF2
Date
15-MAR-2021
Output Expected:
Account  Date
A1000003 1-MAR-2021
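Change the dates to the actual Date class; then, using sapply, you can find the closest date in df1 for each date in df2. Note that which.min(abs(...)) picks the nearest date in either direction, which coincides with the latest date on or before the target for this sample.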
df1$Date <- as.Date(df1$Date, '%d-%b-%Y')
df2$Date <- as.Date(df2$Date, '%d-%b-%Y')
result <- df1[sapply(df2$Date, function(x) which.min(abs(df1$Date - x))), ]
result
# Account Date
#3 A1000003 2021-03-01
data
It is easier to help if you provide data in a reproducible format
df1 <- structure(list(Account = c("A1000001", "A1000002", "A1000003",
"A1000004"), Date = c("1-JAN-2021", "1-FEB-2021", "1-MAR-2021",
"1-APR-2021")), row.names = c(NA, -4L), class = "data.frame")
df2 <- structure(list(Date = "15-MAR-2021"), row.names = c(NA, -1L),
class = "data.frame")
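Since the question asks for the latest DF1 date on or before the DF2 date, here is a minimal sketch that restricts the search to such dates. The helper name latest_on_or_before is hypothetical, and the sample dates are written in ISO format here (an assumption, chosen to avoid locale-dependent month parsing).

```r
# Sample data from the question, with dates written in ISO format
df1 <- data.frame(
  Account = c("A1000001", "A1000002", "A1000003", "A1000004"),
  Date    = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"))
)
df2 <- data.frame(Date = as.Date("2021-03-15"))

# Hypothetical helper: index of the latest candidate date that is <= target,
# or NA if every candidate is later than the target
latest_on_or_before <- function(target, candidates) {
  idx <- which(candidates <= target)
  if (length(idx) == 0L) return(NA_integer_)
  idx[which.max(as.numeric(candidates[idx]))]
}

# One df1 row index per date in df2
rows <- vapply(seq_along(df2$Date),
               function(i) latest_on_or_before(df2$Date[i], df1$Date),
               integer(1))
result <- df1[rows, , drop = FALSE]
result
```

For a target such as 2021-03-20, the nearest snapshot in absolute terms is 2021-04-01, whereas this helper still returns the 2021-03-01 row.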

identify observations based on 2 elements in 2 dataframes that do not match [duplicate]

I want to identify observations in 1 df that do not match that of another df using 2 indicators (id and date). Below is sample df1 and df2.
df1
id date n
12-40 12/22/2018 3
11-08 10/02/2016 11
df2
id date interval
12-40 12/22/2018 3
11-08 10/02/2016 32
22-22 11/10/2015 11
I want a df that outputs rows that are in df2, but not in df1, like so. Note that row 3 (based on id and date) of df2 is not in df1.
df3
id date interval
22-22 11/10/2015 11
I tried doing this in tidyverse and was not able to get the code to work. Does anyone have suggestions on how to do this?
We can use anti_join from dplyr, which is part of the tidyverse (the OP mentioned working with tidyverse). Here we join on both 'id' and 'date', as described in the OP's post; anti_join returns the rows of df2 that have no match in df1.
library(dplyr)
anti_join(df2, df1, by = c('id', 'date'))
# id date interval
#1 22-22 11/10/2015 11
Or a similar option with data.table, which should be very efficient:
library(data.table)
setDT(df2)[!df1, on = .(id, date)]
# id date interval
#1: 22-22 11/10/2015 11
data
df1 <- structure(list(id = c("12-40", "11-08"), date = c("12/22/2018",
"10/02/2016"), n = c(3L, 11L)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(id = c("12-40", "11-08", "22-22"), date = c("12/22/2018",
"10/02/2016", "11/10/2015"), interval = c(3L, 32L, 11L)), class = "data.frame",
row.names = c(NA,
-3L))
Try this (both options are base R and do not require any package):
#Code1
# build an "id date" key for each data frame and keep df2 rows whose key is not in df1
df3 <- df2[!paste(df2$id,df2$date) %in% paste(df1$id,df1$date),]
Output:
id date interval
3 22-22 11/10/2015 11
Alternatively, with subset:
#Code 2
df3 <- subset(df2,!paste(id,date) %in% paste(df1$id,df1$date))
Output:
id date interval
3 22-22 11/10/2015 11
Some data used:
#Data1
df1 <- structure(list(id = c("12-40", "11-08"), date = c("12/22/2018",
"10/02/2016"), n = c(3L, 11L)), class = "data.frame", row.names = c(NA,
-2L))
#Data2
df2 <- structure(list(id = c("12-40", "11-08", "22-22"), date = c("12/22/2018",
"10/02/2016", "11/10/2015"), interval = c(3L, 32L, 11L)), class = "data.frame", row.names = c(NA,
-3L))
Another base R option using merge + subset + complete.cases
df3 <- subset(
u <- merge(df1, df2, by = c("id", "date"), all.y = TRUE),
!complete.cases(u)
)[names(df2)]
which gives
> df3
id date interval
3 22-22 11/10/2015 11
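The paste-based options above can be wrapped in a small reusable helper. The following is only a sketch: the function name anti_join_base and the choice of separator are assumptions, not part of any package. It joins the key columns with a control character that is unlikely to occur in the data, so that two different id/date pairs cannot collapse into the same pasted key.

```r
# Sample data from the question
df1 <- data.frame(id = c("12-40", "11-08"),
                  date = c("12/22/2018", "10/02/2016"),
                  n = c(3L, 11L))
df2 <- data.frame(id = c("12-40", "11-08", "22-22"),
                  date = c("12/22/2018", "10/02/2016", "11/10/2015"),
                  interval = c(3L, 32L, 11L))

# Hypothetical helper: rows of x whose key columns have no match in y
anti_join_base <- function(x, y, by) {
  # paste the key columns together with a separator unlikely to be in the data
  key <- function(d) do.call(paste, c(unname(as.list(d[by])), sep = "\x1f"))
  x[!key(x) %in% key(y), , drop = FALSE]
}

df3 <- anti_join_base(df2, df1, by = c("id", "date"))
df3
```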

Unnest in R conditional on the cell's content

The main dataframe has a column "passings", which is its only nested variable. Each cell of that column holds a dataframe (an example of a nested cell is shown below). The number of rows varies between nested cells, but the columns are always the same: "date" and "title". For each row, I need to grab the date whose title is "Закон прийнято" ("A passed law" in translation) and put it in the main dataframe as a new variable.
I'm a newbie in coding.
I will appreciate your help!
[Image: the main dataframe]
[Image: an example of a dataframe within a nested cell]
Here is an option where we loop over the 'passings' list column with map (based on the image, it would be a list of 2 column data.frame), filter the rows where the 'title' is "Закон прийнято" (assuming only a single value per row) and pull the 'date' column to create a new column 'date' in the original dataset
library(dplyr)
library(purrr)
df1 %>%
mutate(date = map_chr(passings, ~ .x %>%
filter(title == "Закон прийнято") %>%
pull(date)))
# id passed passings date
#1 54949 TRUE 2015-06-10, 2015-06-08, abcb, Закон прийнято 2015-06-08
#2 55009 TRUE 2015-06-10, 2015-09-08, bcb, Закон прийнято 2015-09-08
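NOTE: This works as expected on the sample data below, where each nested data frame has exactly one matching row; map_chr requires each result to be a single string.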
data
df1 <- structure(list(id = c(54949, 55009), passed = c(TRUE, TRUE),
passings = list(structure(list(date = c("2015-06-10", "2015-06-08"
), title = c("abcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)), structure(list(date = c("2015-06-10", "2015-09-08"
), title = c("bcb", "Закон прийнято")), class = "data.frame", row.names = c(NA,
-2L)))), row.names = c(NA, -2L), class = "data.frame")
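If some nested data frames might contain no "Закон прийнято" row, a base R sketch along the following lines could return NA for those rows instead of failing. The third row of the sample data is hypothetical and was added only to show that case; when several rows match, this sketch takes the first one.

```r
# Sample data from the answer, plus one hypothetical row without a match
df1 <- data.frame(id = c(54949, 55009, 55010), passed = c(TRUE, TRUE, FALSE))
df1$passings <- list(
  data.frame(date = c("2015-06-10", "2015-06-08"),
             title = c("abcb", "Закон прийнято")),
  data.frame(date = c("2015-06-10", "2015-09-08"),
             title = c("bcb", "Закон прийнято")),
  data.frame(date = "2015-07-01", title = "xyz")
)

# For each nested data frame, take the date of the first matching title,
# or NA when no row matches
df1$date <- vapply(df1$passings, function(p) {
  hit <- p$date[p$title == "Закон прийнято"]
  if (length(hit) == 0L) NA_character_ else hit[1]
}, character(1))

df1[c("id", "passed", "date")]
```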

Conditional filter of rows based on the values of multiple variables in previous row

I am trying to subset a dataframe to only retain rows for which the value of two variables differ from the value of the previously retained row.
Starting with
df<-structure(list(x = c("ARM018", "ARM018", "ARM018", "ARM021",
"ARM021"), y = c("ARF014", "ARF027", "ARF028",
"ARF014", "ARF020")), class = "data.frame", row.names = c(NA,
-5L))
df
I would like to obtain
df_wanted <-structure(list(x = c("ARM018", "ARM021"), y = c("ARF014",
"ARF020")), class = "data.frame", row.names = c(NA, -2L))
df_wanted
because the values of both x and y differ across the two rows
I had assumed that the lag function from the dplyr package could help and that the following code would return df_wanted, yet it does not return the expected result:
library(dplyr)
df_attempt<-df %>%
filter(lag(x)!=x & lag(y)!=y)
Is there any solution to this using the lag function?
A combination of base R's cumsum and dplyr::lag could do the trick:
library(dplyr)
df %>% mutate_all(as.character) %>%
filter(cumsum(x != x[1] & y != y[1]) !=
lag(cumsum(x != x[1] & y != y[1]), default = -1))
x y
1 ARM018 ARF014
2 ARM021 ARF020
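The cumsum approach above compares every row with the first row. If the comparison should instead be made against the most recently retained row, as the question describes, an explicit loop is one way to express it. This is only a sketch, and the function name keep_changed is hypothetical.

```r
# Sample data from the question
df <- data.frame(
  x = c("ARM018", "ARM018", "ARM018", "ARM021", "ARM021"),
  y = c("ARF014", "ARF027", "ARF028", "ARF014", "ARF020")
)

# Hypothetical helper: keep the first row, then keep a row only if both
# x and y differ from the last row that was kept
keep_changed <- function(df) {
  keep <- logical(nrow(df))
  last <- 0L  # index of the most recently retained row
  for (i in seq_len(nrow(df))) {
    if (last == 0L || (df$x[i] != df$x[last] && df$y[i] != df$y[last])) {
      keep[i] <- TRUE
      last <- i
    }
  }
  df[keep, , drop = FALSE]
}

df_result <- keep_changed(df)
df_result
```

On the sample data this keeps rows 1 and 5, matching df_wanted apart from the row names.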

Extracting Uncommon values from 2 data frames in R

Given two data frames containing dates:
d1
# dates
# 2016-08-01
# 2016-08-02
# 2016-08-03
# 2016-08-04
d2
# dates
# 2016-08-02
# 2016-08-03
# 2016-08-04
# 2016-08-05
# 2016-08-06
How do I create a 3rd dataframe that would have the not-common values?
d3
# dates
# 2016-08-01
# 2016-08-05
# 2016-08-06
Data:
df1 <- structure(list(dates = structure(c(17014, 17015, 17016, 17017 ),
class = "Date")), .Names = "dates", row.names = c(NA, -4L), class =
"data.frame")
df2 <- structure(list(dates = structure(c(17015, 17016, 17017, 17018,
17019), class = "Date")), .Names = "dates", row.names = c(NA, -5L), class
= "data.frame")
Suppose you have two vectors x and y; the elements that are not shared are
c(x[!(x %in% y)], y[!(y %in% x)])
If you work with data frames, provided that your dates column is "character" or "Date" instead of "factor", you can do
rbind(subset(df1, !(df1$dates %in% df2$dates)),
subset(df2, !(df2$dates %in% df1$dates)))
Simple vector example
x <- 1:5
y <- 3:8
c(x[!(x %in% y)], y[!(y %in% x)])
# [1] 1 2 6 7 8
Vector of "Date"
x <- seq(from = as.Date("2016-01-01"), length = 5, by = 1)
y <- seq(from = as.Date("2016-01-03"), length = 5, by = 1)
c(x[!(x %in% y)], y[!(y %in% x)])
# [1] "2016-01-01" "2016-01-02" "2016-01-06" "2016-01-07"
Example data frame in your question
rbind(subset(df1, !(df1$dates %in% df2$dates)),
subset(df2, !(df2$dates %in% df1$dates)))
# dates
#1 2016-08-01
#4 2016-08-05
#5 2016-08-06
You could probably just use a join as others have shown. Personally I like using ?setops in base R. Something like this:
# if they are just character/factor variables
setdiff(d1$dates, d2$dates)
# if they are date variables
setdiff(as.character(d1$dates), as.character(d2$dates))
# then convert back to as.Date(setdiff(...))
Applying this, you could filter the data.frame based on the result, or like @ZheyuanLi has indirectly identified, use matching to exclude:
# If they are date variables
d2[!as.character(d2$dates) %in% as.character(d1$dates),]
# If they are character/factor variables
d2[!d2$dates %in% d1$dates,]
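Putting the setdiff idea together for both directions, a minimal sketch could look like the following. It assumes d1 and d2 hold Date columns as in the question, and it converts to character before setdiff and back to Date afterwards, as the comment above suggests.

```r
# Sample data from the question
d1 <- data.frame(dates = as.Date(c("2016-08-01", "2016-08-02",
                                   "2016-08-03", "2016-08-04")))
d2 <- data.frame(dates = as.Date(c("2016-08-02", "2016-08-03", "2016-08-04",
                                   "2016-08-05", "2016-08-06")))

# Dates that appear in only one of the two data frames
only_d1 <- as.Date(setdiff(as.character(d1$dates), as.character(d2$dates)))
only_d2 <- as.Date(setdiff(as.character(d2$dates), as.character(d1$dates)))

# Combine both one-sided differences into the third data frame
d3 <- data.frame(dates = c(only_d1, only_d2))
d3
```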

Resources