Adding column with Yes/No values based on date of another column - r

ID Date
1 2019-06-03
2 2018-05-04
3 2019-06-02
I want to add a column to this data frame that indicates Yes/No depending on whether the Date falls within the last year, counted back from the date the code is run.
ID Date YN
1 2019-06-03 Yes
2 2018-05-04 No
3 2019-06-02 No

You could do:
library(lubridate)
library(dplyr)
nw <- ymd("2020-06-03")
df %>%
mutate(Date = ymd(Date),
yn = if_else(nw > Date & Date >= nw - years(1), "Yes", "No"))
ID Date YN yn
1 1 2019-06-03 Yes Yes
2 2 2018-05-04 No No
3 3 2019-06-02 No No

You can use base R for this; no additional packages are needed. A year has approximately 365.25 days, but you need to add 1 day so that a date exactly one year back still counts as within the year. Take the difference between today, using Sys.Date(), and the values in d1[["Date"]]. difftime() can be applied to vectors. You'll need to get creative with leap years though.
I also realized that you don't specify whether the column Date is of class Date or just a character vector. If the latter, you need to convert it to class Date using as.Date(). inherits(x, 'Date') checks whether a vector x inherits from the class Date. Assume d1 is the name assigned to your data.frame object:
# in case 'Date' is a string, convert it to date:
if(!inherits(d1[["Date"]], "Date")) d1[["Date"]] <- as.Date(d1[["Date"]])
d1[["YN"]] <- ifelse(difftime(Sys.Date(), d1[["Date"]], units="days") <= 366.25, "Yes", "No")
Result:
> d1
ID Date YN
1 1 2019-06-03 Yes
2 2 2018-05-04 No
3 3 2019-06-02 No
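If the fixed 366.25-day threshold feels too loose, one alternative is to compute the exact calendar cutoff with seq(), which steps back one year from the reference date. This is only a sketch: the data frame d1 is rebuilt here from the dates shown in the expected output, and today is pinned to a fixed date (an assumption made for reproducibility; in practice you would use Sys.Date()).

```r
# Example data, rebuilt from the dates shown in the expected output
d1 <- data.frame(
  ID   = 1:3,
  Date = c("2019-06-03", "2018-05-04", "2019-06-02"),
  stringsAsFactors = FALSE
)

# Convert to Date only if the column is not already of class Date
if (!inherits(d1[["Date"]], "Date")) d1[["Date"]] <- as.Date(d1[["Date"]])

# Fixed reference date for reproducibility; replace with Sys.Date() in real use
today <- as.Date("2020-06-03")

# Same calendar day one year earlier (here 2019-06-03)
cutoff <- seq(today, length.out = 2, by = "-1 year")[2]

# "Yes" when the date lies between the cutoff and today, inclusive
d1[["YN"]] <- ifelse(d1[["Date"]] >= cutoff & d1[["Date"]] <= today, "Yes", "No")
print(d1)
```

Whether the two boundary dates should be inclusive is a design choice; adjust the comparison operators if you need a half-open interval.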

Related

adding two column of a data where col1 contains date and col2 contains days

I have a data frame with two columns, date and days. I want to add days to date and show the result in another column.
data frame-1
The date column is in mm/dd/yyyy format
date days
3/2/2019 8
3/5/2019 4
3/6/2019 4
3/21/2019 3
3/25/2019 7
and I want my output to look like this
date days new-date
3/2/2019 8 3/10/2019
3/5/2019 4 3/9/2019
3/6/2019 4 3/10/2019
3/21/2019 3 3/24/2019
3/25/2019 7 4/1/2019
I was trying this
as.Date("3/10/2019") +8
but I think it will only work for a single value
Convert to actual Date values and then add Days. You need to specify the actual format of date (read ?strptime) while converting it to Date.
as.Date(df$date, "%m/%d/%Y") + df$days
#[1] "2019-03-10" "2019-03-09" "2019-03-10" "2019-03-24" "2019-04-01"
If you want the output back in same format, we can use format
df$new_date <- format(as.Date(df$date, "%m/%d/%Y") + df$days, "%m/%d/%Y")
df
# date days new_date
#1 3/2/2019 8 03/10/2019
#2 3/5/2019 4 03/09/2019
#3 3/6/2019 4 03/10/2019
#4 3/21/2019 3 03/24/2019
#5 3/25/2019 7 04/01/2019
If the different date formats are confusing, lubridate's mdy() can parse the strings directly:
library(lubridate)
with(df, mdy(date) + days)
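Note that format(..., "%m/%d/%Y") pads months and days with leading zeros, while the output requested in the question has none. A minimal base R sketch that rebuilds the example data and drops the padding by converting the month and day to integers before pasting; the column name new_date_chr is just an illustrative choice:

```r
# Example data from the question
df <- data.frame(
  date = c("3/2/2019", "3/5/2019", "3/6/2019", "3/21/2019", "3/25/2019"),
  days = c(8, 4, 4, 3, 7),
  stringsAsFactors = FALSE
)

# Parse the mm/dd/yyyy strings and add the day offsets (vectorised)
new_date <- as.Date(df$date, format = "%m/%d/%Y") + df$days

# Rebuild m/d/yyyy strings without leading zeros
df$new_date_chr <- paste(
  as.integer(format(new_date, "%m")),
  as.integer(format(new_date, "%d")),
  format(new_date, "%Y"),
  sep = "/"
)
print(df)
```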

How to check for continuity minding possible gaps in dates

I have a big data frame with dates and I need to find the first date of each continuous run, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the date ranges overlap, and I need to return, by ID (not ID_2), the date at which the first gap (going backwards in time) appears. I've tried using a for loop, but it's extremely slow (the data frame has 150k rows). I've been experimenting with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap and filters the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr, tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate(across(c(END, BEG), as.Date)) %>%  # across() replaces the deprecated mutate_at()/funs() (dplyr >= 1.0)
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the highest row that satisfies your criteria, i.e. the highest row which is either a gap (TRUE), or if there is no gap it is the first row (meaning it has a missing value when checking for a gap, this is why is.na(BEG > ( lead(END) + 1 ))).
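The same logic can be sketched in base R, which may help when checking the steps by hand. This is only an illustration under the assumption that rows are already sorted newest-first within each ID, as in the question; first_gap_row() is a hypothetical helper name, and the shifted vector next_end plays the role of dplyr's lead().

```r
# Example data from the question (already sorted newest-first within each ID)
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END  = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26", "2014-09-30",
                   "2013-04-30", "2012-12-31", "2010-12-31", "2007-12-31",
                   "2005-12-31")),
  BEG  = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24", "2013-05-01",
                   "2011-01-01", "2011-01-01", "2008-01-01", "2006-01-01",
                   "1980-10-20"))
)

# Hypothetical helper: return the first row of one ID group that starts after a gap
first_gap_row <- function(g) {
  next_end <- c(g$END[-1], NA)        # END of the next (older) row; NA for the last row
  is_gap   <- g$BEG > next_end + 1    # TRUE when more than one day separates the rows
  # which.max() picks the first TRUE; the NA on the last row counts as TRUE,
  # so a group without gaps falls back to its oldest row
  g[which.max(is_gap | is.na(is_gap)), ]
}

result <- do.call(rbind, lapply(split(df, df$ID), first_gap_row))
print(result)
```

For 150k rows, the split/lapply approach should be far faster than a row-by-row for loop, though the dplyr version above is likely more convenient if the package is already in use.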
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet

Filter a data frame by two time series

Hi I am new to R and would like to know if there is a simple way to filter data over multiple dates.
I have data with dates from 07/03/2003 to 31/12/2016.
I need to split/filter the data by multiple time periods, as per below.
Dates required in the new data frame:
07/03/2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012
Let's take the following data.frame with dates:
library(lubridate)  # ymd()
library(dplyr)      # filter(), between()
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
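To keep both windows from the question in a single step, the two range tests can be combined with a logical OR. A minimal base R sketch, using a small made-up data frame (the dates and the value column are invented for illustration) that includes dates on each boundary:

```r
# Made-up sample data, including dates that sit exactly on the window boundaries
df <- data.frame(
  date  = as.Date(c("2003-03-07", "2004-06-15", "2005-03-06", "2005-03-07",
                    "2010-01-01", "2012-12-31", "2013-01-01", "2016-12-31")),
  value = 1:8
)

# First window: 07/03/2003 to 06/03/2005 (inclusive)
in_first  <- df$date >= as.Date("2003-03-07") & df$date <= as.Date("2005-03-06")
# Second window: 01/01/2013 to 31/12/2016 (inclusive)
in_second <- df$date >= as.Date("2013-01-01") & df$date <= as.Date("2016-12-31")

# Keep rows that fall in either window
kept <- df[in_first | in_second, ]
print(kept)
```

With dplyr, the same condition could be written as filter(df, between(date, a, b) | between(date, c, d)).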
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)  # in order to be reproducible
N <- 1000      # number of pseudorandom values to be generated
date1 <- dmy("07-03-2003")
date2 <- dmy("06-03-2005")
date3 <- dmy("01-01-2013")
date4 <- dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut:
before
>plot(my_dt)
my_dt<-my_dt[!(date_sample %within% forbidden_dates)]#applying the temporal cut
after
>plot(my_dt)

How to subset the most recent 12 months of data for each ID in a data frame?

I have a data frame representing 15 years of follow-up data from several hundred patients. I want to create a subset of the data frame including the most recent 12 months of data for each patient.
Here is a representative example of my data (including one missing value, because missing data abound in my actual dataset):
# Create example dataset.
example.dat <- data.frame(
ID = c(1,1,1,1,2,2,2,3,3,3), # patient ID numbers
Date = as.Date(c("2000-02-01", "2004-10-21", "2005-02-06", # follow-up dates
"2005-06-14", "2002-11-24", "2009-03-05",
"2009-07-20", "2005-09-02", "2006-01-15",
"2006-05-18")),
Cat = c("Yes", "Yes", "No", "Yes", "No", # responses to a categorical variable
"Yes", "Yes", NA, "No", "No")
)
example.dat
Which yields the following output:
ID Date Cat
1 1 2000-02-01 Yes
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
5 2 2002-11-24 No
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
I need to figure out how to subset, for each ID number, the most recent record and all records from the previous 12 months.
ID Date Cat
2 1 2004-10-21 Yes
3 1 2005-02-06 No
4 1 2005-06-14 Yes
6 2 2009-03-05 Yes
7 2 2009-07-20 Yes
8 3 2005-09-02 <NA>
9 3 2006-01-15 No
10 3 2006-05-18 No
Several questions have already been asked about subsetting by date in R, but they are generally concerned with subsetting data from a specific date or range of dates, not subsetting by ((variable end date) - (time interval)).
For the sake of completeness, here are two data.table approaches using either subsetting by groups or a non-equi join. In addition, lubridate is used to ensure a period of 12 months is picked even in the case of leap years.
Subsetting by groups
This is essentially the data.table version of docendo discimus' dplyr answer. However, lubridate functions are used for date arithmetic, because simply subtracting 365 days will not cover a period of 12 months, as requested by the OP, if the past year contains a leap day:
library(data.table)
library(lubridate)
setDT(example.dat)[, .SD[Date >= max(Date) %m-% years(1)], by = ID]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
Non-equi join
With version v1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins:
library(data.table)
library(lubridate)
mDT <- setDT(example.dat)[, max(Date) %m-% years(1), by = ID]
example.dat[example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]]
ID Date Cat
1: 1 2004-10-21 Yes
2: 1 2005-02-06 No
3: 1 2005-06-14 Yes
4: 2 2009-03-05 Yes
5: 2 2009-07-20 Yes
6: 3 2005-09-02 NA
7: 3 2006-01-15 No
8: 3 2006-05-18 No
mDT contains the start date of the 12-month period for each ID:
ID V1
1: 1 2004-06-14
2: 2 2008-07-20
3: 3 2005-05-18
The non-equi join returns the indices of the rows which fulfill the conditions
example.dat[mDT, on = .(ID, Date >= V1), which = TRUE]
[1] 2 3 4 6 7 8 9 10
which are then used to finally subset example.dat.
Comparison of date arithmetic methods
The answers posted so far employed three different methods to find a date 12 months earlier:
docendo discimus subtracts 365 days,
G. Grothendieck uses seq.Date(),
this answer uses years() and %m-%
The three methods differ in case a leap day is included in the period:
library(data.table)
library(lubridate)
mseq <- Vectorize(function(x) seq(x, length = 2L, by = "-1 year")[2L])
data.table(Date = as.Date("2016-02-28") + 0:2)[
, minus_365d := Date -365][
, minus_1yr := Date - years()][
, minus_1yr_m := Date %m-% years()][
, seq.Date := as_date(mseq(Date))][]
Date minus_365d minus_1yr minus_1yr_m seq.Date
1: 2016-02-28 2015-02-28 2015-02-28 2015-02-28 2015-02-28
2: 2016-02-29 2015-03-01 <NA> 2015-02-28 2015-03-01
3: 2016-03-01 2015-03-02 2015-03-01 2015-03-01 2015-03-01
If there is no leap day in the past period, all three methods return the same result (row 1).
If a leap day is included in the past period, subtracting 365 days does not fully cover 12 months (row 3) as a leap year has 366 days.
If the reference date is a leap date, the seq.Date() approach picks the next day, 1 March 2015, as there is no 29 February in 2015. Using lubridate's %m-% rolls the date to the last day of February, 28 Feb 2015, instead.
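For readers without lubridate, the roll-back behaviour of %m-% can be approximated in base R. The helper below, minus_one_year_clamped(), is a hypothetical name and not part of any package; it is a sketch that assumes a vector of Date values without NAs.

```r
# Hypothetical helper (not from any package): subtract one year and, when the
# result overflows into the next month (29 Feb -> 1 Mar), roll back to the
# last day of the intended month, similar in spirit to lubridate's %m-%.
minus_one_year_clamped <- function(x) {
  lt <- as.POSIXlt(x)
  lt$year <- lt$year - 1L
  shifted <- as.Date(lt)  # an invalid 2015-02-29 is normalised to 2015-03-01
  overflowed <- format(shifted, "%m") != format(x, "%m")
  # Subtracting the day of month lands on the last day of the previous month
  shifted[overflowed] <- shifted[overflowed] -
    as.integer(format(shifted[overflowed], "%d"))
  shifted
}

dates   <- as.Date("2016-02-28") + 0:2
clamped <- minus_one_year_clamped(dates)
print(data.frame(Date = dates, minus_1yr_clamped = clamped))
```

The result should match the minus_1yr_m column in the table above.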
Here is a base solution. We have ave operate on dates as numbers since if we were to use raw "Date" values ave would try to return "Date" values. Instead, ave returns 0/1 values and !! converts those to FALSE/TRUE.
in_last_yr <- function(x) {
max_date <- as.Date(max(x), "1970-01-01")
x > seq(max_date, length = 2, by = "-1 year")[2]
}
subset(example.dat, !!ave(as.numeric(Date), ID, FUN = in_last_yr))
A possible approach using dplyr
library(dplyr)
example.dat %>% group_by(ID) %>% filter(Date >= max(Date)-365)
#Source: local data frame [8 x 3]
#Groups: ID
#
# ID Date Cat
#1 1 2004-10-21 Yes
#2 1 2005-02-06 No
#3 1 2005-06-14 Yes
#4 2 2009-03-05 Yes
#5 2 2009-07-20 Yes
#6 3 2005-09-02 NA
#7 3 2006-01-15 No
#8 3 2006-05-18 No

Extracting last date of the year from a date object

I have following data set:
>d
x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012
I want:
> d
x date
1 1 31-12-2013
2 2 31-12-2010
3 3 31-12-2011
4 4 31-12-2012
i.e. Last day, last month and the year of the date object.
Please Help!
You can also just use the ceiling_date() function in the lubridate package.
You can do something like:
library(lubridate)
last_date <- ceiling_date(date,"year") - days(1)
ceiling_date(date,"year") gives you the first date of the next year; to get the last date of the current year, subtract 1 or days(1) from it. Here date must already be a Date object (e.g. dmy(d$date) for the data in the question).
Hope this helps.
Another option using lubridate package:
## using d from Roland answer
transform(d,last =dmy(paste0('3112',year(dmy(date)))))
x date last
1 1 1-3-2013 2013-12-31
2 2 2-4-2010 2010-12-31
3 3 2-5-2011 2011-12-31
4 4 1-6-2012 2012-12-31
d <- read.table(text="x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012", header=TRUE)
d$date <- as.Date(d$date, "%d-%m-%Y")
d$date <- as.POSIXlt(d$date)
d$date$mon <- 11
d$date$mday <- 31
d$date <- as.Date(d$date)
# x date
#1 1 2013-12-31
#2 2 2010-12-31
#3 3 2011-12-31
#4 4 2012-12-31
1) cut.Date Define cut_year to give the first day of the year. Adding 366 gets us to the next year and then applying cut_year again gets us to the first day of the next year. Finally subtract 1 to get the last day of the year. The code uses base functionality only.
cut_year <- function(x) as.Date(cut(as.Date(x), "year"))
transform(d, date = cut_year(cut_year(date) + 366) - 1)
2) format
transform(d, date = as.Date(format(as.Date(date), "%Y-12-31")))
3) zoo A "yearmon" class variable stores the date as a year plus 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec. Thus taking its floor and adding 11/12 gets one to Dec and as.Date.yearmon(..., frac = 1) uses the last of the month instead of the first.
library(zoo)
transform(d, date = as.Date(floor(as.yearmon(as.Date(date))) + 11 / 12, frac = 1))
Note: The inner as.Date in cut_year and in the other two solutions can be omitted if it is known that date is already of "Date" class.
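If the result should stay in the question's dd-mm-yyyy text format, the format approach can be adapted as sketched below. This assumes date is stored as a character column in d-m-Y order, as in the question; literal text in the format string is passed through unchanged, so only the year is taken from each date.

```r
# Example data from the question, with date as d-m-Y strings
d <- data.frame(
  x    = 1:4,
  date = c("1-3-2013", "2-4-2010", "2-5-2011", "1-6-2012"),
  stringsAsFactors = FALSE
)

# Parse with an explicit format, then emit 31-12-<year> as text
parsed <- as.Date(d$date, format = "%d-%m-%Y")
d$date <- format(parsed, "31-12-%Y")
print(d)
```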
