Extracting most recent date from rows in a data frame in R

I am working with a data frame that has multiple related date columns, and I need to extract the most recent date in each row. I have seen examples on here, but nothing that is exactly what I am looking for. My example data frame is as follows:
ID date1 date2 date3
1 01/12/15 02/04/07 07/06/16
2 03/29/12 02/16/16 09/01/10
3 12/01/15 07/07/07 11/13/12
But what I want is to get an output that gives me:
ID date1 date2 date3 max
1 01/12/15 02/04/07 07/06/16 07/06/16
2 03/29/12 02/16/16 09/01/10 02/16/16
3 12/01/15 07/07/07 11/13/12 12/01/15
I'm seeing people use plyr and dplyr but I am very unfamiliar with those packages. Any help is appreciated!
Edit: I was able to run what was given by @akrun, but I ran into the issue of empty date fields. I have provided an example as follows:
ID date1 date2 date3
1 01/12/15 NA 07/06/16
2 NA 02/16/16 09/01/10
3 12/01/15 07/07/07 NA
So with those empty spots I would still like the data frame to be transformed as follows:
ID date1 date2 date3 max
1 01/12/15 NA 07/06/16 07/06/16
2 NA 02/16/16 09/01/10 02/16/16
3 12/01/15 07/07/07 NA 12/01/15

We can convert to Date class and then use max.col to get the column index, cbind with the row index, extract the elements from 'df1' and create the 'max' column.
df1$max <- df1[cbind(1:nrow(df1), max.col(sapply(df1[-1], as.Date, format = "%m/%d/%y"))+1)]
df1
# ID date1 date2 date3 max
#1 1 01/12/15 02/04/07 07/06/16 07/06/16
#2 2 03/29/12 02/16/16 09/01/10 02/16/16
#3 3 12/01/15 07/07/07 11/13/12 12/01/15
Or another option is apply with MARGIN = 1
df1$max <- apply(df1[-1], 1, function(x) x[which.max(as.Date(x, "%m/%d/%y"))])
data
df1 <- structure(list(ID = 1:3, date1 = c("01/12/15", "03/29/12", "12/01/15"
), date2 = c("02/04/07", "02/16/16", "07/07/07"), date3 = c("07/06/16",
"09/01/10", "11/13/12")), .Names = c("ID", "date1", "date2",
"date3"), class = "data.frame", row.names = c("1", "2", "3"))

Use pmax after converting to Date objects:
dat[-1] <- lapply(dat[-1], as.Date, format="%m/%d/%y")
dat$max <- do.call(pmax, dat[-1])
# ID date1 date2 date3 max
#1 1 2015-01-12 2007-02-04 2016-07-06 2016-07-06
#2 2 2012-03-29 2016-02-16 2010-09-01 2016-02-16
#3 3 2015-12-01 2007-07-07 2012-11-13 2015-12-01
Using dat as:
dat <- structure(list(ID = 1:3, date1 = structure(1:3, .Label = c("01/12/15",
"03/29/12", "12/01/15"), class = "factor"), date2 = structure(1:3, .Label = c("02/04/07",
"02/16/16", "07/07/07"), class = "factor"), date3 = structure(1:3, .Label = c("07/06/16",
"09/01/10", "11/13/12"), class = "factor")), .Names = c("ID",
"date1", "date2", "date3"), class = "data.frame", row.names = c("1",
"2", "3"))
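For the missing dates described in the question's edit, one possible adaptation (a sketch, not part of the original answer) is to pass na.rm = TRUE to pmax so that NA values are skipped within each row. The name df_na is made up for this illustration, and the data reproduce the question's second example.

```r
# Hypothetical data frame reproducing the question's example with missing dates
df_na <- data.frame(
  ID = 1:3,
  date1 = c("01/12/15", NA, "12/01/15"),
  date2 = c(NA, "02/16/16", "07/07/07"),
  date3 = c("07/06/16", "09/01/10", NA),
  stringsAsFactors = FALSE
)

# Convert each date column to Date class; NA strings stay NA
dates <- lapply(df_na[-1], as.Date, format = "%m/%d/%y")

# Row-wise maximum that ignores NAs, formatted back to the original style
df_na$max <- format(do.call(pmax, c(dates, na.rm = TRUE)), "%m/%d/%y")
df_na$max
# [1] "07/06/16" "02/16/16" "12/01/15"
```

A row in which every date is NA would still yield NA in the max column.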

If you feel more comfortable using SQL, the sqldf package provides another way of getting the latest date:
data1<-data.frame(id=c("1","2","3"),
date1=as.Date(c("01/12/15","03/29/12","12/01/15"),"%m/%d/%y"),
date2=as.Date(c("02/04/07","02/16/16","07/07/07"),"%m/%d/%y"),
date3=as.Date(c("07/06/16","09/01/10","11/13/12"),"%m/%d/%y"))
library(sqldf)
data2 = sqldf("SELECT id,
max(date1,date2,date3) as 'max__Date'
FROM data1", method = "name__class")

Related

How to find the earliest date across multiple columns in R (Issue with NAs)

I have 3 date columns (class-date) and I want to create a new column that will have the earliest of the 3 dates. This is the code I used below:
df1 <- df %>% mutate(timeout= pmin(date1, date2, end_date))
In the case that date1 and date2 are NAs, then I would like the date in end_date to be returned in the timeout column and therefore timeout should not have any NAs. The code above is bringing back NAs. Any assistance will be greatly appreciated.
You can add na.rm = TRUE, then it will ignore the NAs in each row when calculating pmin.
library(dplyr)
df %>%
mutate(timeout = pmin(date1, date2, end_date, na.rm = TRUE))
Output
id date1 date2 end_date timeout
1 1 <NA> <NA> 2008-01-23 2008-01-23
2 1 2007-10-16 2007-11-01 2008-01-23 2007-10-16
3 2 2007-11-30 2007-11-30 2007-11-30 2007-11-30
4 3 2007-08-17 2007-12-17 2008-12-12 2007-08-17
5 3 2008-11-12 2008-12-12 2008-12-12 2008-11-12
Data
df <- structure(list(id = c(1L, 1L, 2L, 3L, 3L), date1 = structure(c(NA,
13802, 13847, 13742, 14195), class = "Date"), date2 = structure(c(NA,
13818, 13847, 13864, 14225), class = "Date"), end_date = as.Date(c("2008-01-23",
"2008-01-23", "2007-11-30", "2008-12-12", "2008-12-12"))), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))

Create new column based on whether date is between other dates, over multiple time periods

I have a table with multiple rows per person, and the date that each tax year ends:
df1 <- tibble::tribble(~ID, ~TAX_YEAR_END_DATE,
"01", "2009-04-06",
"01", "2010-04-06",
"01", "2011-04-06",
"02", "2010-04-06",
"02", "2011-04-06",
"02", "2012-04-06")
And another table, with multiple rows per person, giving the start date and end date for periods of work:
df2 <- tibble::tribble(~ID, ~START_DATE, ~END_DATE,
"01", "2007-09-11", "2010-04-06",
"02", "2008-06-06", "2010-04-06",
"02", "2011-09-09", "2014-04-06")
The END_DATE is always on 6th April, and everybody always has a START_DATE and END_DATE - there are no NULLs.
I want to add a new STATUS column to the first table, saying whether or not each person was EMPLOYED or NOT for each year. This is what it would look like for the above example:
ID TAX_YEAR_END_DATE STATUS
01 2009-04-06 EMPLOYED
01 2010-04-06 EMPLOYED
01 2011-04-06 NOT
02 2010-04-06 EMPLOYED
02 2011-04-06 NOT
02 2012-04-06 EMPLOYED
I've figured out that I can join the tables by ID, then apply some rules while using mutate() to create a new column: if the TAX_YEAR_END_DATE is between the START_DATE and END_DATE then the STATUS is EMPLOYED, and if it isn't then the STATUS is NOT.
Where I get stuck is with borrowers who have more than one period of employment in the second table. In these cases the rows in the first table get duplicated (or more) when I carry out the join, and I've not been able to figure out an alternative way of doing this.
I'm using R, would prefer data.table as it's normally quicker, but dplyr might be ok too.
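A solution using a join to associate the tables and then a summarise to collapse the duplicated rows:
library(dplyr)
df1 %>% left_join(df2, by = "ID") %>%
# explicit comparisons also work on dplyr versions where between() needs scalar bounds
mutate(employed = TAX_YEAR_END_DATE >= START_DATE & TAX_YEAR_END_DATE <= END_DATE) %>%
group_by(ID, TAX_YEAR_END_DATE) %>%
summarise(employed = any(employed))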
An option using non equi join in data.table:
DT1[, status := c("NOT","EMP")[
DT2[.SD, on=.(ID, START_DATE<=TAX_YEAR_END_DATE, END_DATE>=TAX_YEAR_END_DATE),
by=.EACHI, .N>0L]$V1 + 1L
]]
output:
ID TAX_YEAR_END_DATE status
1: 1 2009-04-06 EMP
2: 1 2010-04-06 EMP
3: 1 2011-04-06 NOT
4: 2 2010-04-06 EMP
5: 2 2011-04-06 NOT
6: 2 2012-04-06 EMP
data:
library(data.table)
DT1 <- fread("ID TAX_YEAR_END_DATE
01 2009-04-06
01 2010-04-06
01 2011-04-06
02 2010-04-06
02 2011-04-06
02 2012-04-06")[,
TAX_YEAR_END_DATE := as.IDate(TAX_YEAR_END_DATE)]
cols <- c("START_DATE", "END_DATE")
DT2 <- fread("ID START_DATE END_DATE
01 2007-09-11 2010-04-06
02 2008-06-06 2010-04-06
02 2011-09-09 2014-04-06")[,
(cols) := lapply(.SD, as.IDate), .SDcols=cols]
One dplyr and lubridate solution could be:
library(dplyr)
library(lubridate)
df1 %>%
left_join(df2) %>%
group_by(ID, TAX_YEAR_END_DATE) %>%
summarise(STATUS = any(int_overlaps(interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE),
interval(START_DATE, END_DATE))))
ID TAX_YEAR_END_DATE STATUS
<int> <chr> <lgl>
1 1 2009-04-06 TRUE
2 1 2010-04-06 TRUE
3 1 2011-04-06 FALSE
4 2 2010-04-06 TRUE
5 2 2011-04-06 FALSE
6 2 2012-04-06 TRUE
A base R option:
# Create a lookup data.frame for the durations in which ID was employed:
# dates_ro => data.frame
dates_ro <- data.frame(do.call("rbind", lapply(split(df2, rownames(df2)), function(x){
data.frame(id = x$ID,
emp_date = seq.Date(x$START_DATE, x$END_DATE, by = "days"))
}
)
),
row.names = NULL)
# Lookup whether or not the person is employed at end date, matching on ID and date together
# (matching ID and date separately would pick up another person's employment days)
# STATUS => character vector
df1$STATUS <- ifelse(
paste(df1$ID, df1$TAX_YEAR_END_DATE) %in% paste(dates_ro$id, dates_ro$emp_date),
"EMPLOYED", "NOT")
Data:
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TAX_YEAR_END_DATE = structure(c(14340,
14705, 15070, 14705, 15070, 15436), class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 2L, 2L), START_DATE = structure(c(13767,
14036, 15226), class = "Date"), END_DATE = structure(c(14705,
14705, 16166), class = "Date")), class = "data.frame", row.names = c(NA, -3L))
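The duplicated rows described in the question can also be avoided by not joining at all. The following is a minimal base R sketch, not taken from any of the answers, that checks each row of the first table against that person's employment periods. It assumes the dates have been converted to Date class and that the interval is inclusive at both ends; is_employed is a name introduced here for illustration.

```r
# The question's two tables, with the date columns converted to Date class
df1 <- data.frame(
  ID = c("01", "01", "01", "02", "02", "02"),
  TAX_YEAR_END_DATE = as.Date(c("2009-04-06", "2010-04-06", "2011-04-06",
                                "2010-04-06", "2011-04-06", "2012-04-06"))
)
df2 <- data.frame(
  ID = c("01", "02", "02"),
  START_DATE = as.Date(c("2007-09-11", "2008-06-06", "2011-09-09")),
  END_DATE = as.Date(c("2010-04-06", "2010-04-06", "2014-04-06"))
)

# For each row of df1, test whether the tax year end date falls inside
# any employment period recorded for the same ID (bounds inclusive)
is_employed <- vapply(seq_len(nrow(df1)), function(i) {
  periods <- df2[df2$ID == df1$ID[i], ]
  d <- df1$TAX_YEAR_END_DATE[i]
  any(d >= periods$START_DATE & d <= periods$END_DATE)
}, logical(1))

df1$STATUS <- ifelse(is_employed, "EMPLOYED", "NOT")
df1$STATUS
# [1] "EMPLOYED" "EMPLOYED" "NOT"      "EMPLOYED" "NOT"      "EMPLOYED"
```

This loops over rows, so on large tables the data.table non-equi join is likely to be the faster choice.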

Apportion number of days per month between two dates in R code?

I am very new to R programming and am trying to determine the number of days apportioned per month between two dates.
I have a dataset that has the following structure:
from_date
to_date
quantity
Example data:
2019-06-15 2019-09-10 55
2019-07-11 2019-10-05 17
I would like to call a function that returns a dataset or vector holding up to 3 values, as the difference between from_date and to_date is at most 3 months.
I have tried using lubridate::floor_date() to work backward from the to_date.
Not sure if you are looking for some result like below:
df$quantity <- with(df,as.Date(to_date)-as.Date(from_date))
or
df$quantity <- apply(df, 1, function(v) diff(as.Date(v)))
yielding
> df
from_date to_date quantity
1 2019-06-15 2019-09-10 87
2 2019-07-11 2019-10-05 86
Data
df <- structure(list(from_date = structure(1:2, .Label = c("2019-06-15",
"2019-07-11"), class = "factor"), to_date = structure(1:2, .Label = c("2019-09-10",
"2019-10-05"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
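The snippets above give the total number of days between the two dates. If a per-month breakdown is wanted, one possible sketch is to enumerate every day in the range and count the days by calendar month. The function name days_per_month is hypothetical, and the sketch assumes both end dates are included in the count.

```r
# Count the days falling in each calendar month between two dates,
# counting both from_date and to_date (an assumption of this sketch)
days_per_month <- function(from_date, to_date) {
  days <- seq(as.Date(from_date), as.Date(to_date), by = "day")
  tab <- table(format(days, "%Y-%m"))
  setNames(as.integer(tab), names(tab))
}

res <- days_per_month("2019-06-15", "2019-09-10")
res
# 2019-06 2019-07 2019-08 2019-09
#      16      31      31      10
```

Because both endpoints are counted, the total here is 88, one more than the plain date difference of 87. Note also that the first example row touches four calendar months, not three.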

How can I filter multiple factor levels by multiple dates in R?

I have multiple groups (identified by Line) and I want to filter each one by its own specific date.
mydata <- structure(list(ID = structure(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("Line", "Start", "End"), row.names = c(NA, -9L), class = "data.frame")
I could do it individually with the following but how do I tie this together into one line?
mydata %>% filter(Line == "A" & Start >= as.POSIXct("2013-01-01 00:00:00"))
mydata %>% filter(Line == "B" & Start >= as.POSIXct("2013-01-13 00:00:00"))
mydata %>% filter(Line == "C" & Start >= as.POSIXct("2013-01-23 00:00:00"))
If there are many dates, we can loop over the groups and dates with map2 and combine the conditions:
library(dplyr)
library(purrr)
v1 <- unique(mydata$Line)
dates <- as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23"))
mydata %>%
filter(map2(v1, dates, ~ Line== .x & Start >= .y) %>%
reduce(`|`))
If there are many dates, I suggest using a non-equi join, either with SQL (package sqldf) or with data.table.
For this, a table with filter conditions is created, e.g.,
fc <- data.frame(Line = LETTERS[1:3],
dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23")))
fc
Line dates
1 A 2013-01-01
2 B 2013-01-13
3 C 2013-01-23
(Note that dates is of type POSIXct to be in line with Start and End)
sqldf
library(sqldf)
sqldf("select mydata.* from mydata join fc on mydata.Line = fc.Line and mydata.Start >= fc.dates")
Line Start End
1 A 2013-01-01 12:01:00 2013-01-02 08:07:10
2 A 2013-01-03 14:51:14 2013-01-05 06:55:12
3 A 2013-01-05 08:07:24 2013-01-07 14:13:33
4 B 2013-01-14 09:22:43 2013-01-20 10:36:40
5 C 2013-01-30 10:33:20 2013-01-30 18:49:11
BTW,
sqldf("select mydata.* from mydata, fc where mydata.Line = fc.Line and mydata.Start >= fc.dates")
returns the same result.
data.table
library(data.table)
setDT(mydata)[mydata[fc, on = .(Line, Start >= dates ), which = TRUE]]
Line Start End
1: A 2013-01-01 12:01:00 2013-01-02 08:07:10
2: A 2013-01-03 14:51:14 2013-01-05 06:55:12
3: A 2013-01-05 08:07:24 2013-01-07 14:13:33
4: B 2013-01-14 09:22:43 2013-01-20 10:36:40
5: C 2013-01-30 10:33:20 2013-01-30 18:49:11
The expression
mydata[fc, on = .(Line, Start >= dates ), which = TRUE]
returns the indices of the rows of mydata which fulfill the conditions
[1] 1 2 3 6 9
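The same per-group cutoff can be applied in base R without a join by looking up each row's cutoff date with match. This is a sketch on a small made-up sample rather than the question's data (the sample values and the fixed UTC time zone are assumptions), and fc mirrors the filter-condition table used above.

```r
# Small illustrative sample (hypothetical values, fixed UTC time zone)
mydata <- data.frame(
  Line = c("A", "A", "B", "B", "C", "C"),
  Start = as.POSIXct(c("2013-01-01 12:01:00", "2013-01-03 14:51:14",
                       "2013-01-10 09:00:00", "2013-01-14 09:22:43",
                       "2013-01-20 09:00:00", "2013-01-30 10:33:20"),
                     tz = "UTC")
)

# One cutoff date per group, as in the filter-condition table
fc <- data.frame(
  Line = c("A", "B", "C"),
  dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23"), tz = "UTC")
)

# Look up the cutoff that applies to each row, then keep rows at or after it;
# which() drops rows whose Line has no cutoff (their comparison is NA)
cutoff <- fc$dates[match(mydata$Line, fc$Line)]
result <- mydata[which(mydata$Start >= cutoff), ]
rownames(result)
# [1] "1" "2" "4" "6"
```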

Find the smallest date for each ID

I have one data table with the following schema
id|smalltime
1 2199-08-02 20:00:00
2 2150-11-13 15:00:00
...
And I have another data table with the following schema
id|time
1 2199-08-02 20:10:00
1 2199-08-02 19:00:00
2 2150-11-13 15:10:00
...
For each id, I want to find the smallest time in the second data table that is later than that id's smalltime in the first data table.
With the previous example, I am looking for the following new data table:
id|time
1 2199-08-02 20:10:00
2 2150-11-13 15:10:00
Did you mean to have something like below?
library(lubridate)
library(dplyr)
df1$smalltime <- ymd_hms(df1$smalltime)
df2$time <- ymd_hms(df2$time)
df2 %>%
inner_join(df1, by="id") %>%
mutate(time_diff = time - smalltime) %>%
filter(time_diff > 0) %>%
group_by(id) %>%
summarise(time = time[which.min(time_diff)])
Output is:
id time
1 1 2199-08-02 20:10:00
2 2 2150-11-13 15:10:00
Sample data:
df1 <- structure(list(id = 1:2, smalltime = c("2199-08-02 20:00:00",
"2150-11-13 15:00:00")), .Names = c("id", "smalltime"), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(id = c(1L, 1L, 2L), time = c("2199-08-02 20:10:00",
"2199-08-02 19:00:00", "2150-11-13 15:10:00")), .Names = c("id",
"time"), class = "data.frame", row.names = c(NA, -3L))
You can try this way:
library(data.table)
library(lubridate)
library(purrr)
# convert to data.table and parse the date-times
setDT(df1)
setDT(df2)
df1[, smalltime := ymd_hms(smalltime)]
df2[, time := ymd_hms(time)]
# collect all times per id in df2 as a list column and join it onto df1
df1[df2[, list(list(time)), .(id)], on = 'id', z := i.V1]
# for each id, keep the times later than smalltime and take the earliest one
df1[, ans1 := map2_chr(z, smalltime, function(x, y) format(min(x[x > y]), "%Y-%m-%d %H:%M:%S"))]
print(df1[,.(id, ans1)])
id ans1
1: 1 2199-08-02 20:10:00
2: 2 2150-11-13 15:10:00
> A=strptime(df1$smalltime,"%F %T")
> B=strptime(df2$time,"%F %T")
> d=findInterval(B,sort(A))
> unname(by(B,list(d,df2$id),function(x)format(min(x),"%F %T"))[unique(d)])
[1] "2199-08-02 20:10:00" "2150-11-13 15:10:00"
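A base R variant of the same idea, given as a sketch under the assumption that both tables hold POSIXct times in a single time zone (UTC here): merge the tables on id, keep the rows later than smalltime, sort, and take the first row per id. The intermediate names m and later are introduced only for this illustration.

```r
# The question's sample data, parsed as POSIXct in UTC
df1 <- data.frame(
  id = 1:2,
  smalltime = as.POSIXct(c("2199-08-02 20:00:00", "2150-11-13 15:00:00"),
                         tz = "UTC")
)
df2 <- data.frame(
  id = c(1L, 1L, 2L),
  time = as.POSIXct(c("2199-08-02 20:10:00", "2199-08-02 19:00:00",
                      "2150-11-13 15:10:00"), tz = "UTC")
)

# Attach each id's smalltime to its rows in df2
m <- merge(df2, df1, by = "id")

# Keep only times strictly after smalltime, then sort by id and time
later <- m[m$time > m$smalltime, ]
later <- later[order(later$id, later$time), ]

# The first remaining row per id is the earliest qualifying time
result <- later[!duplicated(later$id), c("id", "time")]
format(result$time, "%Y-%m-%d %H:%M:%S")
# [1] "2199-08-02 20:10:00" "2150-11-13 15:10:00"
```

An id with no time later than its smalltime simply does not appear in the result.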
