I'm having difficulty calculating the average time between payment dates in my CSV. I have tried several methods I've seen online (converting to a data.table, using ddply) with no success.
WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
This is an example of my dataset. I want to calculate the average time between the PaymentDates (in number of days), grouped by WorkerID, in the simplest way possible.
Thank you!
This is a perfect job for aggregate(). It groups PaymentDate by WorkerID and applies the function mean(diff(.)) to each group.
tt <- read.table(text="
WorkerID PaymentDate
1 2015-06-18
1 2015-07-18
1 2015-08-18
2 2015-09-18
3 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
4 2015-12-16", header=TRUE)
tt$PaymentDate <- as.Date(tt$PaymentDate)
aggregate(PaymentDate ~ WorkerID, data=tt, FUN=function(x) mean(diff(x)))
# WorkerID PaymentDate
# 1 1 30.5
# 2 2 NaN
# 3 3 31.0
# 4 4 29.5
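If you would rather get NA than NaN for workers with a single payment date, a small variant of the same call (a sketch; as.numeric() keeps the result a plain number of days):

aggregate(PaymentDate ~ WorkerID, data=tt,
          FUN=function(x) if (length(x) > 1) mean(diff(as.numeric(x))) else NA_real_)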
As an alternative to AkselA's answer, one can use the data.table package if one prefers it over base R.
This is similar to using aggregate, but may sometimes give a speed boost. In the example below I've handled workers with a single payment date by setting the difference to 0, to illustrate how this can be achieved.
library(lubridate)
library(data.table)
df <- fread("WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18")
df[,PaymentDate := as.Date(PaymentDate)]
df[, {
  # mean gap in days between consecutive payments; a single payment gives 0
  if (length(PaymentDate) > 1) {
    mean(diff(as.numeric(PaymentDate)))
  } else {
    0
  }
}, by = WorkerID]
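For the five-row sample read in above, this should return something like the following (V1 is the unnamed mean gap in days, with 0 for the single-payment worker):

   WorkerID V1
1:        1 31
2:        3  0
3:        4 31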
My question concerns lagging data in R, where R should be aware of the time index. I hope the question has not been asked in another thread. Let's consider a simple setup:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7)
This code should generate a table like
        date value
1 1990-01-01     1
2 1990-02-01     2
3 1990-01-15     3
4 1990-03-01     4
5 1990-05-01     5
6 1990-07-01     6
My aim is to lag "value" by e.g. one month, such that computing the lagged value for "1990-05-01" (which would be the value at 1990-04-01, a date not present in the data) yields NA in that row. When I use the standard lag function, R is not aware of the time index and simply uses the value 4 from 1990-03-01, which is not what I want. Does anyone have an idea what I could do here?
Thanks in advance! :)
All the best,
Leon
You can try %m-% for the lagged month, like below:
library(lubridate)
transform(
df,
value_lag = value[match(date %m-% months(1), date)]
)
which gives
date value value_lag
1 1990-01-01 1 NA
2 1990-02-01 2 1
3 1990-01-15 3 NA
4 1990-03-01 4 2
5 1990-05-01 5 NA
6 1990-07-01 6 NA
7 1993-01-02 7 NA
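One reason %m-% is used here rather than subtracting a fixed number of days: it subtracts whole calendar months and rolls impossible dates back to the last valid day of the month. A quick illustration:

library(lubridate)
as.Date("1990-03-31") %m-% months(1)
# [1] "1990-02-28"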
For an example with multiple columns, let's consider:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7, value2=7:13)
I recently came up with the following solution:
library(dplyr)  # needed for as_tibble(), mutate(), across(); lubridate is loaded above
lags <- 1       # number of months to lag by
df %>%
  as_tibble() %>%
  mutate(across(2:ncol(df), .fns= function(x){x[match(date %m-% months(lags),date)]}, .names="{.col}_lag"))
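With lags <- 1 and the two-column df above, this should give something like:

# A tibble: 7 x 5
  date       value value2 value_lag value2_lag
1 1990-01-01     1      7        NA         NA
2 1990-02-01     2      8         1          7
3 1990-01-15     3      9        NA         NA
4 1990-03-01     4     10         2          8
5 1990-05-01     5     11        NA         NA
6 1990-07-01     6     12        NA         NA
7 1993-01-02     7     13        NA         NA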
Thanks for your code, @ThomasisCoding. :)
I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
+                   enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+                   exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
+                                 '5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
+                   '10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge, finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use left_join and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced with R and couldn't figure it out. I would really appreciate it if someone could walk me through it - thanks!
A few other descriptive notes about the data: each ID may have any number of rows in each dataframe. df1 has enter dates, which mark the beginning of a service delivery, and exit dates, which mark its end. Every enter has one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be none.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
# for each ID, record the next enter date (or a far-future sentinel) to bound the eval window
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
df$next.date = as.Date('2100-12-31')
if (nrow(df) > 1)
df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
df
}), df1$ID)
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
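To finish with the per-ID averages asked for in the question, assign the sqldf() result to a variable (say res) and reuse the question's own aggregate() call, which should give:

res <- sqldf('...')  # the query above
aggregate(res$time.diff, list(res$ID), mean)
#   Group.1      x
# 1       1  -15.0
# 2       2  -55.5
# 3       3 -379.0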
I'm working on time-series analyses and I'm hoping to develop multiple datasets with different units of analysis. Namely: the units in data set 1 will be districts in country X for 2-week periods within a span of 4 years (districtYearPeriodCode), the units in data set 2 will be districts in country X for 4-week periods within a span of 4 years, and so forth.
I have created a number of data frames containing start and end dates for each interval, as well as an interval ID. The one below is for the 2-week intervals.
library(lubridate)  # for ymd() and weeks()
begin <- seq(ymd('2004-01-01'),ymd('2004-06-30'), by = as.difftime(weeks(2)))
end <- seq(ymd('2004-01-14'),ymd('2004-06-30'), by = as.difftime(weeks(2)))
interval <- seq(1,13,1)
df2 <- data.frame(begin, end, interval)
begin end interval
1 2004-01-01 2004-01-14 1
2 2004-01-15 2004-01-28 2
3 2004-01-29 2004-02-11 3
4 2004-02-12 2004-02-25 4
5 2004-02-26 2004-03-10 5
6 2004-03-11 2004-03-24 6
7 2004-03-25 2004-04-07 7
8 2004-04-08 2004-04-21 8
9 2004-04-22 2004-05-05 9
10 2004-05-06 2004-05-19 10
11 2004-05-20 2004-06-02 11
12 2004-06-03 2004-06-16 12
13 2004-06-17 2004-06-30 13
In addition to this I have a data frame that contains observations for events, dates included. It looks something like this:
new.df3 <- data.frame(dates5, districts5)
new.df3
dates5 districts5
1 2004-01-01 d1
2 2004-01-02 d2
3 2004-01-03 d3
4 2004-01-04 d4
5 2004-01-05 d5
Is there a function I can write or a command I can use to end up with something like this?
dates5 districts5 interval5
1 2004-01-01 d1 1
2 2004-01-02 d2 1
3 2004-01-03 d3 1
4 2004-01-04 d4 1
5 2004-01-05 d5 1
I have been trying to find an answer in the lubridate package, or in other threads, but all answers seem to be tailored to finding out whether a date falls within a specific time interval, rather than identifying which interval, out of a group of intervals, a date falls into.
Much appreciated!
I used the purrr approach outlined by @alistair here. I reproduce it below:
library(purrr)  # assumes elements: a vector of dates; intervals: a data frame with start, end, phase
elements %>%
map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>%
# Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
map_chr(~ifelse(length(.x) == 0, NA, .x))
## [1] "a" "a" "a" NA "b" "b" "c"
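Applied to the data in this question, a base-R sketch with findInterval() (assuming df2 and new.df3 as built above, with dates5 of class Date): because the intervals in df2 are sorted and non-overlapping, findInterval() maps each date to the index of the last begin date not after it.

# Caveats: a date before the first begin gives index 0, and a date after the
# last end still maps to the last interval, so such cases need a separate check.
new.df3$interval5 <- df2$interval[findInterval(new.df3$dates5, df2$begin)]
new.df3
#       dates5 districts5 interval5
# 1 2004-01-01         d1         1
# 2 2004-01-02         d2         1
# 3 2004-01-03         d3         1
# 4 2004-01-04         d4         1
# 5 2004-01-05         d5         1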
I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan reuses 02-Jan's values, 06-Jan reuses 05-Jan's values, and so on).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are just 7 of them), but I'm not sure it would be the right way to go for dates as well, especially when dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
library(data.table)
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
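If the zero-default behaviour mentioned in the question is wanted instead of carrying values forward, a minimal sketch under the same setup (a plain join, then overwriting the NAs that the gap rows get):

filled <- NADayWiseOrders[J(all_dates)]      # no roll: gap rows come back as NA
for (col in c("orders", "amount", "guests")) # overwrite the NAs with zeros
  set(filled, i = which(is.na(filled[[col]])), j = col, value = 0)
filled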
Here is how you fill in the gaps within subgroups:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
I have a question that might be too basic, but here it is...
I want to extract monthly data from a dataset like this:
Date Obs
1 2001-01-01 120
2 2001-01-02 100
3 2001-01-03 150
4 2001-01-04 175
5 2001-01-05 121
6 2001-01-06 100
I just want to get the rows of the data for a certain month (e.g. January), and this works perfectly:
output=which(strftime(dataset[,1],"%m")=="01",dataset[,1])
However, when I try to create a loop to go through all the months using a variable declared as character, it doesn't work and I only get FALSE.
value=as.character(k)
output=which(strftime(dataset[,1],"%m")==value,dataset[,1])
Do not parse dates as strings. That is too error prone. Parse dates as dates, and do logical comparisons on them.
Here is one approach, creating January to March data and sub-setting February based on a comparison:
R> output <- data.frame(date=seq(as.Date("2011-01-01"), by=7, length=10),
+ value=cumsum(runif(10)*100))
R> output
date value
1 2011-01-01 8.29916
2 2011-01-08 44.82950
3 2011-01-15 72.08662
4 2011-01-22 134.19277
5 2011-01-29 221.67744
6 2011-02-05 245.77195
7 2011-02-12 314.82081
8 2011-02-19 396.34661
9 2011-02-26 437.14286
10 2011-03-05 442.41321
R> output[ output[,"date"] >= as.Date("2011-02-01") &
+ output[,"date"] <= as.Date("2011-02-28"), ]
date value
6 2011-02-05 245.772
7 2011-02-12 314.821
8 2011-02-19 396.347
9 2011-02-26 437.143
R>
Another approach uses the xts package:
R> library(xts)
R> oo <- xts(output[,"value"], order.by=output[,"date"])
R> oo
[,1]
2011-01-01 8.29916
2011-01-08 44.82950
2011-01-15 72.08662
2011-01-22 134.19277
2011-01-29 221.67744
2011-02-05 245.77195
2011-02-12 314.82081
2011-02-19 396.34661
2011-02-26 437.14286
2011-03-05 442.41321
R> oo["2011-02-01::2011-02-28"]
[,1]
2011-02-05 245.772
2011-02-12 314.821
2011-02-19 396.347
2011-02-26 437.143
R>
as xts has convenient date parsing for the index; see the package documentation for details.
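For the original loop over months, xts can also select a given calendar month across all years via its zero-based month index (a sketch using .indexmon()):

R> oo[.indexmon(oo) == 1]   # all February rows; .indexmon() is 0-based

which selects the same four February rows as above.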
I'm assuming k is an integer in 1:12. I suspect you may be better off using abbreviated month names:
value <- month.abb[k]
output <- which(strftime(dataset[,1],"%b")==value)
The reason your way isn't working is that the month number is zero-padded, and "1" != "01".
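If you would rather keep the "%m" comparison, zero-padding the loop counter also fixes it (a small sketch):

value <- sprintf("%02d", k)  # 1 -> "01", 11 -> "11"
output <- which(strftime(dataset[,1], "%m") == value)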
You can also use dates as dates with as.POSIXlt()$mon:
as.POSIXlt(output$date)$mon # Note that Jan = 0 and Feb=1
[1] 0 0 0 0 0 1 1 1 1 2
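So, to select a month k in 1:12 directly on the dates, a sketch:

k <- 2                                          # February
output[as.POSIXlt(output$date)$mon == k - 1, ]  # $mon is 0-based, hence k - 1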
There are several other packages such as chron, lubridate and gdata that provide date handling functions. I found the functions in lubridate particularly intuitive and less prone to errors in my clumsy hands.