I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
+ enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
+ '5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
+ '10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use left_join and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate it if someone could walk me through it. Thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
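# add next.date: the following enter.date for the same ID, or a far-future date for the last period
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
  df$next.date = as.Date('2100-12-31')
  if (nrow(df) > 1)
    df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
  df
}), df1$ID)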
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
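For comparison, the same logic can be sketched in base R without sqldf. This is only a sketch under the same assumption (one row per service period, sorted by ID and enter date); the far-future sentinel date is borrowed from the answer above, and the window is inclusive on both ends to mirror SQL's BETWEEN.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  enter.date = as.Date(c("2015-05-07", "2015-07-10", "2017-08-25", "2016-09-01",
                         "2018-01-05", "2016-05-01", "2017-04-08")),
  exit.date = as.Date(c("2015-07-01", "2015-10-15", "2017-09-03", "2016-09-30",
                        "2019-06-01", "2017-05-01", "2017-06-08")))
df2 <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  eval.date = as.Date(c("2015-10-30", "2016-10-10", "2019-09-10",
                        "2018-05-15", "2015-01-19")))

# make sure each ID's service periods are in chronological order
df1 <- df1[order(df1$ID, df1$enter.date), ]
n <- nrow(df1)

# next.date: the enter.date of the following row, or a sentinel when the
# following row belongs to a different ID (or there is no following row)
nxt <- c(df1$enter.date[-1], as.Date(NA))
nxt[c(df1$ID[-1] != df1$ID[-n], TRUE)] <- as.Date("2100-12-31")
df1$next.date <- nxt

# pair every service period with every eval of the same ID, then keep only
# the evals that fall between exit.date and next.date (inclusive)
m <- merge(df1, df2, by = "ID")
m <- m[m$eval.date >= m$exit.date & m$eval.date <= m$next.date, ]
m$time.diff <- as.numeric(m$exit.date - m$eval.date)

res <- aggregate(time.diff ~ ID, data = m, FUN = mean)
print(res)
```

Because merge() builds every period/eval pair per ID before filtering, this may be memory-hungry on large data; it is meant to show the logic rather than to be the fastest option.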
I must be missing something simple.
I have a data.frame of various date formats and I'm using lubridate which works great with everything except month names by themselves. I can't get the month names to convert to date time objects.
> head(dates)
From To
1 June August
2 January December
3 05/01/2013 10/30/2013
4 July November
5 06/17/2013 10/14/2013
6 05/04/2013 11/23/2013
Trying to change June into date time object:
> as_date(dates[1,1])
Error in charToDate(x) :
character string is not in a standard unambiguous format
> as_date("June")
Error in charToDate(x) :
character string is not in a standard unambiguous format
The actual year and day do not matter. I only need the month. zx8754 suggested using dummy day and year.
lubridate can handle converting the name or abbreviation of a month to its number when it's paired with the rest of the information needed to make a proper date, i.e. a day and year. For example:
lubridate::mdy("August/01/2013", "08/01/2013", "Aug/01/2013")
#> [1] "2013-08-01" "2013-08-01" "2013-08-01"
You can utilize that to write a function that appends "/01/2013" to any month names (I threw in abbreviations as well to be safe). Then apply that to all your date columns (dplyr::mutate_all is just one way to do that).
name_to_date <- function(x) {
lubridate::mdy(ifelse(x %in% c(month.name, month.abb), paste0(x, "/01/2013"), x))
}
dplyr::mutate_all(dates, name_to_date)
#> From To
#> 1 2013-06-01 2013-08-01
#> 2 2013-01-01 2013-12-01
#> 3 2013-05-01 2013-10-30
#> 4 2013-07-01 2013-11-01
#> 5 2013-06-17 2013-10-14
#> 6 2013-05-04 2013-11-23
The following is a crude example of how you could achieve that.
Given that dummy values are fine:
match(dates[1, 1], month.abb)
If dates[1, 1] held "Dec", the call above would return:
12
To put that month number together with a dummy year in a date-like string, I tried:
tmp = paste(match(dates[1, 1], month.abb), "2013", sep="/")
which gives us:
12/2013
and then lastly:
result = paste("01", tmp, sep="/")
which returns:
01/12/2013
I am sure there are more flexible approaches than this; but this is just an idea, which I just tried.
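Since only the month is needed, the match() idea can also be wrapped into one small base R helper that returns the month number for both kinds of input. This is a sketch, not part of the original answer: month_of is a hypothetical name, and it assumes every non-name entry is in %m/%d/%Y format, as in the sample data.

```r
month_of <- function(x) {
  # "June" -> 6 via its first three letters; NA for strings like "05/01/2013"
  by_name <- match(substr(x, 1, 3), month.abb)
  # "05/01/2013" -> 5; as.Date() with an explicit format returns NA for month names
  by_date <- as.integer(format(as.Date(x, "%m/%d/%Y"), "%m"))
  ifelse(is.na(by_name), by_date, by_name)
}

months <- month_of(c("June", "05/01/2013", "Dec", "10/30/2013"))
print(months)
```

Matching on the first three letters means full names and abbreviations are both handled by month.abb alone.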
Using a custom function:
# dummy data
df1 <- read.table(text = "
From To
1 June August
2 January December
3 05/01/2013 10/30/2013
4 July November
5 06/17/2013 10/14/2013
6 05/04/2013 11/23/2013", header = TRUE, as.is = TRUE)
# custom function
myFun <- function(x, dummyDay = "01", dummyYear = "2013"){
require(lubridate)
x <- ifelse(substr(x, 1, 3) %in% month.abb,
paste(match(substr(x, 1, 3), month.abb),
dummyDay,
dummyYear, sep = "/"), x)
#return date
mdy(x)
}
res <- data.frame(lapply(df1, myFun))
res
# From To
# 1 2013-06-01 2013-08-01
# 2 2013-01-01 2013-12-01
# 3 2013-05-01 2013-10-30
# 4 2013-07-01 2013-11-01
# 5 2013-06-17 2013-10-14
# 6 2013-05-04 2013-11-23
I have a csv file that contains indexes for various asset classes and most of them start on different dates. I would like to create new indexes for these asset classes that have the same base year. Below is a subset of the data I have.
indexes <- read.csv("AssetClassIndexes.csv")
indexes$Date <- as.Date(indexes$Date, '%m/%d/%Y')
indexes %>%
filter(Date > as.Date('2013-01-01')) %>%
select(Date, Large.Cap.Stocks, Mid.Cap.Stocks, Precious.Metals)
Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
1 2013-01-31 130.9160 58.13547 651.1803
2 2013-02-28 132.6932 58.70621 658.3433
3 2013-03-31 137.6696 61.51427 690.4047
4 2013-04-30 140.3220 61.90042 684.9505
5 2013-05-31 143.6044 63.29899 720.4309
6 2013-06-30 141.6760 62.13056 723.7449
7 2013-07-31 148.8850 65.97987 777.3744
8 2013-08-31 144.5731 63.50743 750.3217
9 2013-09-30 149.1068 66.81690 803.2194
10 2013-10-31 155.9609 69.29937 831.1715
11 2013-11-30 160.7137 70.21606 877.3015
12 2013-12-31 164.7823 72.38485 893.8825
13 2014-01-31 159.0851 70.84785 854.2835
14 2014-02-28 166.3623 74.30846 890.2488
15 2014-03-31 167.7607 74.58250 898.8842
16 2014-04-30 169.0008 73.41721 868.2323
17 2014-05-31 172.9679 74.72066 869.1005
18 2014-06-30 176.5410 77.81163 906.8195
19 2014-07-31 174.1063 74.48576 853.8612
20 2014-08-31 181.0715 78.27180 892.6265
21 2014-09-30 178.5322 74.71220 841.8361
What I would like to do is create multiple base indexes based on various dates.
BaseDates <-
c(
'1973-12-31',
'1981-06-30',
'1984-03-31',
'2001-03-31',
'2007-12-31'
)
I have the following line of code that allows me to create an index based on one date, but I can't figure out how to do all the base dates above. I'm guessing it involves some sort of apply function; any suggestions?
BaseIdx <- function(x, column, dte) {x / x[column == dte]}
indexes %>%
mutate_each(funs(BaseIdx(.,Date,as.Date('1984-06-30'))),-Date)
There are multiple approaches you can take. Your suggested approach moves across each column (mutate_each), dividing each value by the value on a single matching date. You can iterate this over all your dates with an *apply function or another looping construct.
An alternate approach below uses lapply to iterate across dates, dividing rows by a vector. The tricky part is the division of a dataframe by rows. Here, the dataframe is transposed (t) and divided by a vector (as.numeric), then retransposed back to the original format (additional methods here).
#indexes = the subsetted [21 x 4] data in your example
#cols is not defined in the snippet as posted; it is assumed here to be every column except Date
cols <- setdiff(names(indexes), "Date")
#Sample some dates based on your example data
BaseDates <- indexes[seq(1, 21, by=5), "Date"]
IndexThemALL <- lapply(BaseDates, function(z) { #z = each BaseDate
  data.frame(
    IndexDate = z,
    Date = indexes$Date,
    t(t(indexes[, cols])/as.numeric(indexes[indexes$Date == z, cols]))
  )
})
# Optional: collapse a list into a dataframe
# (rbind_all() was later replaced by bind_rows() in dplyr)
IndexThemALL <- dplyr::bind_rows(IndexThemALL)
#Source: local data frame [105 x 5]
#IndexDate Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
#1 2013-01-31 2013-01-31 1.000000 1.000000 1.000000
#2 2013-01-31 2013-02-28 1.013575 1.009817 1.011000
#3 2013-01-31 2013-03-31 1.051587 1.058119 1.060236
#4 2013-01-31 2013-04-30 1.071848 1.064762 1.051860
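The other route mentioned above, reusing the question's BaseIdx function and looping over the base dates, might look like the following sketch. The toy data and the column names A and B are made up for illustration, and every base date is assumed to be present in the Date column.

```r
# BaseIdx as defined in the question: divide a column by its value on the base date
BaseIdx <- function(x, column, dte) x / x[column == dte]

# toy stand-in for the real indexes data (hypothetical columns A and B)
indexes <- data.frame(
  Date = as.Date(c("2013-01-31", "2013-02-28", "2013-03-31")),
  A = c(100, 110, 121),
  B = c(50, 55, 44))
BaseDates <- as.Date(c("2013-01-31", "2013-02-28"))

value_cols <- setdiff(names(indexes), "Date")

# one rebased copy of the table per base date
rebased <- lapply(BaseDates, function(d) {
  out <- indexes
  out[value_cols] <- lapply(indexes[value_cols], BaseIdx,
                            column = indexes$Date, dte = d)
  data.frame(IndexDate = d, out)
})

# stack the per-date tables into one long data frame
rebased <- do.call(rbind, rebased)
print(rebased)
```

If a base date is missing from Date, x[column == dte] is empty and the division returns a zero-length vector, so it is worth checking BaseDates against indexes$Date first.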
I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns and vice versa.
For each value in data$id, I am trying to find whether there is an ns$date within the 14 days before data$date where data$id==ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE according to whether ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates (ymd() comes from lubridate)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: this automatically renames data$date to date.x and ns$date to date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
The question "Find if date is within 14 days if id matches between datasets in R" provides an idea, but the result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested, I added what the desired output should look like (the same number of rows as data, i.e. 10, with no rows for ids that appear in ns but not in data). I should have thought earlier that this might help.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
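For readers without data.table, the same result can be sketched in base R on a small subset of the dummy data. The window used here (0 to 13 days before data$date, so a same-day match counts) is an assumption read off the desired output, where id 87 has a difference of 0.

```r
# small subset of the question's dummy data, already as Date
data <- data.frame(
  id = c(8, 8, 50, 87, 141),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18",
                   "2009-04-28", "2010-11-29")))
ns <- data.frame(
  id = c(8, 50, 50, 87, 87),
  date = as.Date(c("2011-02-25", "2010-08-17", "2010-08-15",
                   "2009-04-28", "2009-05-16")))

# smallest gap (in days) to an ns date of the same id that falls
# 0 to 13 days before the data date; NA when there is none
data$dif <- vapply(seq_len(nrow(data)), function(k) {
  gaps <- as.numeric(data$date[k] - ns$date[ns$id == data$id[k]])
  gaps <- gaps[gaps >= 0 & gaps < 14]
  if (length(gaps) > 0) min(gaps) else NA_real_
}, numeric(1))

# TRUE/FALSE for ids present in ns, NA for ids that never appear in ns
data$received <- ifelse(data$id %in% ns$id, !is.na(data$dif), NA)
print(data)
```

Indexing by row number rather than passing the dates through mapply() keeps the Date class intact, which the subtraction relies on.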
I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:
> sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
id date value
1 73 2014-02-11 0.84197491
2 7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4 7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6 7 2014-01-06 0.66484848
Here we have 2 unique IDs with 3 dates and a value assigned to each. I know that I can use ddply, or data.table, or just a lapply to aggregate and find the mean for each ID.
What I'm really looking for is a way to quickly find the mean for each ID for the most recent two dates. For example, with sapply:
> sapply(split(sample_frame,sample_frame$id),function(x){
+ mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
7 73
0.3497359 0.6949492
I can't figure out how to get data.table to do this. Thoughts? Hints?
Why not use tail in your "data.table" aggregation step?
library(data.table)
set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
"date" = sample(seq(as.Date("2014-01-01"),
as.Date("2014-02-13"),by=1),6),
"value" = runif(6))
DT <- data.table(sample_frame, key = "id,date")
DT
# id date value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
# id V1
# 1: 27 0.6580684
# 2: 37 0.2229450
Since you require the mean of just two values, you can do it directly (without using mean). And you can use the internal variable .N instead of tail to get more speed-up. You just have to take care of the case where there's just 1 date. Basically, this should be much faster.
DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
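A small check of the single-date case may help here. This sketch uses made-up values and assumes the data.table package is installed; id 2 has only one row, so max(1L, .N - 1L) falls back to that same row and the result is just its value.

```r
library(data.table)

# toy data: id 1 has three dates, id 2 has a single date
DT <- data.table(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2014-01-01", "2014-01-05", "2014-01-09", "2014-01-03")),
  value = c(0.2, 0.4, 0.6, 0.9),
  key = c("id", "date"))

# mean of the last two values per id using .N
res <- DT[, .(mean_last2 = (value[.N] + value[max(1L, .N - 1L)]) / 2), by = id]

# the tail() version, for comparison
chk <- DT[, .(mean_last2 = mean(tail(value, 2))), by = id]
print(res)
```

Both versions depend on the table being keyed (sorted) by id and date, so that the last rows of each group are the most recent ones.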