I am trying to implement a way to filter this dataframe df
structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
I want to filter it so that, when a name appears more than once in the Name column (as Jim does), I keep the row with the earliest date and time in the EndDate column ONLY if the Progress value is greater than 70; otherwise I keep the row with the later date and time in the EndDate column.
Using dplyr, we first convert EndDate to a date-time object with parse_date_time() from lubridate. We then group_by(Name) and slice the row with the minimum EndDate if any Progress > 70 and the Name has more than one row, and the maximum EndDate otherwise; if there is only one row for the Name, we keep that one by default.
library(dplyr)
library(lubridate)
df %>%
  mutate(EndDate = parse_date_time(EndDate, c("%m-%d-%y %H:%M", "%Y-%m-%d %H:%M:%S"))) %>%
  group_by(Name) %>%
  slice(ifelse(n() > 1,
               ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)),
               1))
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
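Note that Progress is stored as character in the dput, so Progress > 70 coerces 70 to "70" and compares strings; that happens to give the right answer here because every value has two digits, but as.numeric(Progress) > 70 is the safer comparison.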
Based on the condition, we convert 'EndDate' to a DateTime class, then arrange by 'Name' and 'EndDate'. Grouped by 'Name', if the first element of 'Progress' is greater than 70 we pass index 1 to slice(), otherwise the last row index, to subset the rows.
library(tidyverse)
library(lubridate)
df %>%
  mutate(EndDate = mdy_hm(EndDate)) %>%
  # if there are multiple formats:
  # mutate(EndDate = anytime::anytime(EndDate)) %>%
  arrange(Name, EndDate) %>%
  group_by(Name) %>%
  slice(if (first(Progress) > 70) 1 else n())
# A tibble: 7 x 3
# Groups: Name [7]
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
NOTE: if there are multiple 'DateTime' formats, one option is anytime::anytime instead of mdy_hm
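For instance, a minimal sketch with anytime (assuming the anytime package is installed; addFormats() registers any pattern the built-in battery misses):
library(anytime)
addFormats("%m/%d/%Y %H:%M")  # minutes-only timestamps, in case the defaults miss them
anytime(c("11/25/2018 16:45", "2018-11-25 16:45:00"))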
And (of course) this can also be done using data.table
sample data
df <- structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
code
#create the data.table (can also be done using setDT(df) )
dt <- as.data.table( df )
#set the dates to a proper POSIXct-format
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
#order on EndDate (by reference!)
setorder( dt, EndDate )
#summarise by Name, if first Progress >70 then keep it, else keep last Progress
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
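If the matching EndDate should be kept as well, the same ordering logic can return the whole row instead (a sketch; as.numeric() guards against Progress being character in the sample data):
dt[, if (as.numeric(Progress[1]) > 70) .SD[1] else .SD[.N], by = Name]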
benchmarks
microbenchmark::microbenchmark(
data.table = {
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
setorder( dt, EndDate )
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
},
tidyverse1 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
},
tidyverse2 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590 100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902 100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180 100
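Note that the data.table expression modifies dt by reference, so after the first iteration EndDate is already POSIXct and the conversion is essentially a no-op, while the tidyverse variants re-parse the character column on every run; the timings should be read with that in mind.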
Related
I have a table with multiple rows per person, giving the date that each tax year ends:
df1 <- tibble::tribble(~ID, ~TAX_YEAR_END_DATE,
"01", "2009-04-06",
"01", "2010-04-06",
"01", "2011-04-06",
"02", "2010-04-06",
"02", "2011-04-06",
"02", "2012-04-06")
And another table, with multiple rows per person, giving the start date and end date for periods of work:
df2 <- tibble::tribble(~ID, ~START_DATE, ~END_DATE,
"01", "2007-09-11", "2010-04-06",
"02", "2008-06-06", "2010-04-06",
"02", "2011-09-09", "2014-04-06")
The END_DATE is always on 6th April, and everybody always has a START_DATE and END_DATE - there are no NULLs.
I want to add a new STATUS column to the first table, saying whether each person was EMPLOYED or NOT for each year. This is what it would look like for the above example:
ID TAX_YEAR_END_DATE STATUS
01 2009-04-06 EMPLOYED
01 2010-04-06 EMPLOYED
01 2011-04-06 NOT
02 2010-04-06 EMPLOYED
02 2011-04-06 NOT
02 2012-04-06 EMPLOYED
I've figured out that I can join the tables by ID, then apply some rules with mutate() to create a new column: if the TAX_YEAR_END_DATE is between the START_DATE and END_DATE then the STATUS is EMPLOYED, and if it isn't then the STATUS is NOT.
Where I get stuck is with people who have more than one period of employment in the second table. In these cases the rows in the first table get duplicated (or more) when I carry out the join, and I've not been able to figure out an alternative way of doing this.
I'm using R, would prefer data.table as it's normally quicker, but dplyr might be ok too.
A solution using a join to associate the tables and then a summarise. The sample dates are character, so they are converted to Date first to make the comparison chronological:
df1 %>%
  mutate(TAX_YEAR_END_DATE = as.Date(TAX_YEAR_END_DATE)) %>%
  left_join(mutate(df2, across(c(START_DATE, END_DATE), as.Date)), by = "ID") %>%
  mutate(employed = TAX_YEAR_END_DATE >= START_DATE & TAX_YEAR_END_DATE <= END_DATE) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(employed = any(employed))
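To get the literal EMPLOYED/NOT labels from the expected output, recode the logical afterwards (a minimal sketch; res is assumed to hold the result of the pipe above):
res %>%
  mutate(STATUS = ifelse(employed, "EMPLOYED", "NOT")) %>%
  select(-employed)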
An option using non equi join in data.table:
DT1[, status := c("NOT","EMP")[
DT2[.SD, on=.(ID, START_DATE<=TAX_YEAR_END_DATE, END_DATE>=TAX_YEAR_END_DATE),
by=.EACHI, .N>0L]$V1 + 1L
]]
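For each row of DT1 (passed in as .SD), the inner join counts how many employment periods in DT2 cover the tax-year end date; .N > 0L turns that count into TRUE/FALSE, and adding 1L indexes into c("NOT","EMP").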
output:
ID TAX_YEAR_END_DATE status
1: 1 2009-04-06 EMP
2: 1 2010-04-06 EMP
3: 1 2011-04-06 NOT
4: 2 2010-04-06 EMP
5: 2 2011-04-06 NOT
6: 2 2012-04-06 EMP
data:
library(data.table)
DT1 <- fread("ID TAX_YEAR_END_DATE
01 2009-04-06
01 2010-04-06
01 2011-04-06
02 2010-04-06
02 2011-04-06
02 2012-04-06")[,
TAX_YEAR_END_DATE := as.IDate(TAX_YEAR_END_DATE)]
cols <- c("START_DATE", "END_DATE")
DT2 <- fread("ID START_DATE END_DATE
01 2007-09-11 2010-04-06
02 2008-06-06 2010-04-06
02 2011-09-09 2014-04-06")[,
(cols) := lapply(.SD, as.IDate), .SDcols=cols]
One dplyr and lubridate solution could be:
df1 %>%
  left_join(df2) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(STATUS = any(int_overlaps(interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE),
                                      interval(START_DATE, END_DATE))))
ID TAX_YEAR_END_DATE STATUS
<int> <chr> <lgl>
1 1 2009-04-06 TRUE
2 1 2010-04-06 TRUE
3 1 2011-04-06 FALSE
4 2 2010-04-06 TRUE
5 2 2011-04-06 FALSE
6 2 2012-04-06 TRUE
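Here interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE) is a zero-length interval (a single instant), so int_overlaps() effectively tests whether the tax-year end date falls inside the employment interval.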
# Create a lookup data.frame for the durations in which each ID was employed:
# dates_ro => data.frame
dates_ro <- data.frame(
  do.call("rbind", lapply(split(df2, rownames(df2)), function(x) {
    data.frame(id = x$ID,
               emp_date = seq.Date(x$START_DATE, x$END_DATE, by = "days"))
  })),
  row.names = NULL)
# Look up whether the person was employed on the tax-year end date by
# matching the combined ID/date pair (matching ID and date in two
# independent match() calls can wrongly flag a date covered only by a
# different ID)
# STATUS => character vector
df1$STATUS <- ifelse(is.na(match(paste(df1$ID, df1$TAX_YEAR_END_DATE),
                                 paste(dates_ro$id, dates_ro$emp_date))),
                     "UNEMPLOYED", "EMPLOYED")
Data:
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TAX_YEAR_END_DATE = structure(c(14340,
14705, 15070, 14705, 15070, 15436), class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 2L, 2L), START_DATE = structure(c(13767,
14036, 15226), class = "Date"), END_DATE = structure(c(14705,
14705, 16166), class = "Date")), class = "data.frame", row.names = c(NA, -3L))
I have multiple groups (the Line column) and I want to filter each one with its own specific date.
mydata <- structure(list(
    Line = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
    Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
                        1357913412, 1358151763, 1358691675, 1358789411,
                        1359538400), class = c("POSIXct", "POSIXt"), tzone = ""),
    End = structure(c(1357110430, 1357365312, 1357564413, 1358230679,
                      1357978810, 1358674600, 1358853933, 1359531923,
                      1359568151), class = c("POSIXct", "POSIXt"), tzone = "")),
    row.names = c(NA, -9L), class = "data.frame")
I could do it individually with the following but how do I tie this together into one line?
mydata %>% filter(Line == "A" & Start >= as.POSIXct("2013-01-01 00:00:00"))
mydata %>% filter(Line == "B" & Start >= as.POSIXct("2013-01-13 00:00:00"))
mydata %>% filter(Line == "C" & Start >= as.POSIXct("2013-01-23 00:00:00"))
If there are many dates, the per-group conditions can be built with map2() and combined with reduce():
library(dplyr)
library(purrr)
v1 <- unique(mydata$Line)
dates <- as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23"))
mydata %>%
  filter(map2(v1, dates, ~ Line == .x & Start >= .y) %>%
           reduce(`|`))
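With dplyr >= 1.1.0 the same per-group cutoff can also be written as a non-equi join via join_by() (a sketch, assuming that dplyr version):
library(dplyr)
cutoffs <- data.frame(Line = c("A", "B", "C"),
                      dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23")))
inner_join(mydata, cutoffs, by = join_by(Line, Start >= dates)) %>%
  select(-dates)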
If there are many dates, I suggest using a non-equi join, either with SQL (package sqldf) or with data.table.
For this, a table with filter conditions is created, e.g.,
fc <- data.frame(Line = LETTERS[1:3],
dates = as.POSIXct(c("2013-01-01", "2013-01-13", "2013-01-23")))
fc
Line dates
1 A 2013-01-01
2 B 2013-01-13
3 C 2013-01-23
(Note that dates is of type POSIXct to be in line with Start and End)
sqldf
library(sqldf)
sqldf("select mydata.* from mydata join fc on mydata.Line = fc.Line and mydata.Start >= fc.dates")
Line Start End
1 A 2013-01-01 12:01:00 2013-01-02 08:07:10
2 A 2013-01-03 14:51:14 2013-01-05 06:55:12
3 A 2013-01-05 08:07:24 2013-01-07 14:13:33
4 B 2013-01-14 09:22:43 2013-01-20 10:36:40
5 C 2013-01-30 10:33:20 2013-01-30 18:49:11
BTW,
sqldf("select mydata.* from mydata, fc where mydata.Line = fc.Line and mydata.Start >= fc.dates")
returns the same result.
data.table
library(data.table)
setDT(mydata)[mydata[fc, on = .(Line, Start >= dates ), which = TRUE]]
Line Start End
1: A 2013-01-01 12:01:00 2013-01-02 08:07:10
2: A 2013-01-03 14:51:14 2013-01-05 06:55:12
3: A 2013-01-05 08:07:24 2013-01-07 14:13:33
4: B 2013-01-14 09:22:43 2013-01-20 10:36:40
5: C 2013-01-30 10:33:20 2013-01-30 18:49:11
The expression
mydata[fc, on = .(Line, Start >= dates ), which = TRUE]
returns the indices of the rows of mydata which fulfill the conditions
[1] 1 2 3 6 9
I have 2 tables. Below are the sample tables and the desired output.
Table1:
Start Date End Date Country
2017-01-04 2017-01-06 id
2017-02-13 2017-02-15 ng
Table2:
Transaction Date Country Cost Product
2017-01-04 id 111 21
2017-01-05 id 200 34
2017-02-14 ng 213 45
2017-02-15 ng 314 32
2017-02-18 ng 515 26
Output:
Start Date End Date Country Cost Product
2017-01-04 2017-01-06 id 311 55
2017-02-13 2017-02-15 ng 527 77
The problem is to merge the two tables where the transaction date lies between the start date and end date and the country matches, and then add up the Cost and Product values.
This calls for fuzzy joins. Below are two examples.
Using dplyr and fuzzyjoin packages:
library(dplyr)
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("Country" = "Country",
                       "Start_Date" = "Transaction_Date",
                       "End_Date" = "Transaction_Date"),
                match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(Country.x, Start_Date, End_Date) %>%
  summarise(Cost = sum(Cost),
            Product = sum(Product))
# A tibble: 2 x 5
# Groups: Country.x, Start_Date [?]
Country.x Start_Date End_Date Cost Product
<chr> <date> <date> <int> <int>
1 id 2017-01-04 2017-01-06 311 55
2 ng 2017-02-13 2017-02-15 527 77
Using data.table:
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt2[dt1, on=.(Country = Country,
Transaction_Date >= Start_Date,
Transaction_Date <= End_Date),
.(Cost = sum(Cost), Product = sum(Product)),
by=.EACHI]
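by = .EACHI evaluates the sums once per row of dt1, so the result has exactly one row per (Country, date-range) period; note that the two range columns come back named after dt2's Transaction_Date rather than Start_Date/End_Date, a quirk of data.table non-equi joins that setnames() can undo.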
data:
df1 <- structure(list(Start_Date = structure(c(17170, 17210), class = "Date"),
End_Date = structure(c(17172, 17212), class = "Date"), Country = c("id",
"ng")), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(Transaction_Date = structure(c(17170, 17171, 17211,
17212, 17215), class = "Date"), Country = c("id", "id", "ng",
"ng", "ng"), Cost = c(111L, 200L, 213L, 314L, 515L), Product = c(21L,
34L, 45L, 32L, 26L)), row.names = c(NA, -5L), class = "data.frame")
Not sure if you can use any of the merge operations here, but one way using mapply is to subset the rows based on the condition and take the sums of the Cost and Product columns.
df1[c("Cost", "Product")] <- t(mapply(function(x, y, z) {
inds <- df2$Transaction_Date >= x & df2$Transaction_Date <= y & df2$Country == z
c(sum(df2$Cost[inds]), sum(df2$Product[inds]))
},df1$Start_Date, df1$End_Date, df1$Country))
df1
# Start_Date End_Date Country Cost Product
#1 2017-01-04 2017-01-06 id 311 55
#2 2017-02-13 2017-02-15 ng 527 77
I have one data table with the following schema
id|smalltime
1 2199-08-02 20:00:00
2 2150-11-13 15:00:00
...
And I have another data table with the following schema
id|time
1 2199-08-02 20:10:00
1 2199-08-02 19:00:00
2 2150-11-13 15:10:00
...
For each id, I want to find the smallest time in the second data table that comes after that id's smalltime in the first data table.
With the previous example, I am looking for the following new data table:
id|time
1 2199-08-02 20:10:00
2 2150-11-13 15:10:00
Did you mean to have something like below?
library(lubridate)
library(dplyr)
df1$smalltime <- ymd_hms(df1$smalltime)
df2$time <- ymd_hms(df2$time)
df2 %>%
  inner_join(df1, by = "id") %>%
  mutate(time_diff = time - smalltime) %>%
  filter(time_diff > 0) %>%
  group_by(id) %>%
  summarise(time = time[which.min(time_diff)])
Output is:
id time
1 1 2199-08-02 20:10:00
2 2 2150-11-13 15:10:00
Sample data:
df1 <- structure(list(id = 1:2, smalltime = c("2199-08-02 20:00:00",
"2150-11-13 15:00:00")), .Names = c("id", "smalltime"), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(id = c(1L, 1L, 2L), time = c("2199-08-02 20:10:00",
"2199-08-02 19:00:00", "2150-11-13 15:10:00")), .Names = c("id",
"time"), class = "data.frame", row.names = c(NA, -3L))
You can try this way:
library(data.table)
library(purrr)
library(lubridate)
# the sample data are plain data.frames, so convert by reference first
setDT(df1); setDT(df2)
# convert to date-time format
df1[, smalltime := ymd_hms(smalltime)]
df2[, time := ymd_hms(time)]
# nest df2's times per id into a list column and merge it into df1
df1[df2[, list(list(time)), .(id)], on = 'id', z := i.V1]
# check if the time is greater than df1 time
df1[, ans := map2(z, smalltime, function(x, y) lapply(x, function(j) as.character(j[j > y])))]
# extract the time (answer)
df1[, ans1 := map_chr(ans, 1)]
print(df1[,.(id, ans1)])
id ans1
1: 1 2199-08-02 20:10:00
2: 2 2150-11-13 15:10:00
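For comparison, a data.table rolling join reaches the same result without nesting list columns (a sketch, assuming df1/df2 as in the sample data above):
library(data.table)
library(lubridate)
setDT(df1)[, smalltime := ymd_hms(smalltime)]
setDT(df2)[, time := ymd_hms(time)]
# roll = -Inf is "next observation carried backward": for each smalltime,
# take the matching id's smallest time that is >= smalltime
df2[df1, on = .(id, time = smalltime), roll = -Inf, .(id, time = x.time)]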
# base R: bin each df2 time against the sorted df1 times with findInterval(),
# then take the earliest time within each (interval, id) group
A <- strptime(df1$smalltime, "%F %T")
B <- strptime(df2$time, "%F %T")
d <- findInterval(B, sort(A))
unname(by(B, list(d, df2$id), function(x) format(min(x), "%F %T"))[unique(d)])
[1] "2199-08-02 20:10:00" "2150-11-13 15:10:00"
I am working with a data frame that has multiple date columns that relate to each other, but the bottom line is that I need to extract the most recent date from each row. I have seen examples on here, but nothing exactly like what I am looking for. My example data frame is as follows:
ID date1 date2 date3
1 01/12/15 02/04/07 07/06/16
2 03/29/12 02/16/16 09/01/10
3 12/01/15 07/07/07 11/13/12
But what I want is to get an output that gives me:
ID date1 date2 date3 max
1 01/12/15 02/04/07 07/06/16 07/06/16
2 03/29/12 02/16/16 09/01/10 02/16/16
3 12/01/15 07/07/07 11/13/12 12/01/15
I'm seeing people use plyr and dplyr but I am very unfamiliar with those packages. Any help is appreciated!
Edit: I was able to run what was given by @akrun, but I ran into an issue with empty (NA) dates. I have provided an example as follows:
ID date1 date2 date3
1 01/12/15 NA 07/06/16
2 NA 02/16/16 09/01/10
3 12/01/15 07/07/07 NA
So with those empty spots I would still like the data frame to be transformed as follows:
ID date1 date2 date3 max
1 01/12/15 NA 07/06/16 07/06/16
2 NA 02/16/16 09/01/10 02/16/16
3 12/01/15 07/07/07 NA 12/01/15
We can convert to Date class and then use max.col to get the column index, cbind with the row index, extract the elements from 'df1' and create the 'max' column.
df1$max <- df1[cbind(1:nrow(df1), max.col(sapply(df1[-1], as.Date, format = "%m/%d/%y"))+1)]
df1
# ID date1 date2 date3 max
#1 1 01/12/15 02/04/07 07/06/16 07/06/16
#2 2 03/29/12 02/16/16 09/01/10 02/16/16
#3 3 12/01/15 07/07/07 11/13/12 12/01/15
Or another option is apply with MARGIN = 1
df1$max <- apply(df1[-1], 1, function(x) x[which.max(as.Date(x, "%m/%d/%y"))])
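Since which.max() ignores NAs, this apply() variant also covers the NA cells from the edit without modification.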
data
df1 <- structure(list(ID = 1:3, date1 = c("01/12/15", "03/29/12", "12/01/15"
), date2 = c("02/04/07", "02/16/16", "07/07/07"), date3 = c("07/06/16",
"09/01/10", "11/13/12")), .Names = c("ID", "date1", "date2",
"date3"), class = "data.frame", row.names = c("1", "2", "3"))
Use pmax after converting to Date objects:
dat[-1] <- lapply(dat[-1], as.Date, format="%m/%d/%y")
dat$max <- do.call(pmax, dat[-1])
# ID date1 date2 date3 max
#1 1 2015-01-12 2007-02-04 2016-07-06 2016-07-06
#2 2 2012-03-29 2016-02-16 2010-09-01 2016-02-16
#3 3 2015-12-01 2007-07-07 2012-11-13 2015-12-01
Using dat as:
dat <- structure(list(ID = 1:3, date1 = structure(1:3, .Label = c("01/12/15",
"03/29/12", "12/01/15"), class = "factor"), date2 = structure(1:3, .Label = c("02/04/07",
"02/16/16", "07/07/07"), class = "factor"), date3 = structure(1:3, .Label = c("07/06/16",
"09/01/10", "11/13/12"), class = "factor")), .Names = c("ID",
"date1", "date2", "date3"), class = "data.frame", row.names = c("1",
"2", "3"))
If you feel more comfortable using SQL, the sqldf package provides another way of getting the latest date:
data1 <- data.frame(id = c("1", "2", "3"),
                    date1 = as.Date(c("01/12/15", "03/29/12", "12/01/15"), "%m/%d/%y"),
                    date2 = as.Date(c("02/04/07", "02/16/16", "07/07/07"), "%m/%d/%y"),
                    date3 = as.Date(c("07/06/16", "09/01/10", "11/13/12"), "%m/%d/%y"))
library(sqldf)
data2 <- sqldf("SELECT id,
                       max(date1, date2, date3) as 'max__Date'
                FROM data1", method = "name__class")
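The max__Date alias works together with method = "name__class": sqldf splits the returned column name on the double underscore and applies the suffix as the column's class, so the result comes back as a Date column named max rather than a bare number.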