Combine data sets based on date comparisons within groups - r

I have two data sets, 'Df_A' and 'Df_B':
Df_A
Date Info A Info B
9/19/18 23:00 36 48
9/18/18 23:00 47 30
9/17/18 23:00 51 3
8/14/18 23:00 45 16
8/6/18 23:00 37 13
8/5/18 23:00 42 66
7/11/18 23:00 42 53
7/4/18 23:00 38 10
Df_B
Released Info Event Value X
9/6/2018 22:30 Event A 51.8
8/6/2018 22:30 Event A 52
7/5/2018 22:30 Event A 50.6
6/6/2018 22:30 Event A 54
9/2/2018 22:30 Event C 48
7/31/2018 22:30 Event C 45
9/4/2018 22:30 Event D 58.7
8/2/2018 22:30 Event D 56.2
7/3/2018 22:30 Event D 57.3
6/4/2018 22:30 Event D 51.1
5/2/2018 22:30 Event D 54.2
4/4/2018 22:30 Event D 59.8
9/3/2018 1:30 Event E 61.8
8/6/2018 1:30 Event E 63
7/2/2018 1:30 Event E 65.2
Both 'Date' and 'Released Info' are factors.
I have a vector 'Events' which contains the Events in 'Df_B' that I need to parse, e.g.
Events <- c("Event A", "Event D")
For each 'Event' in 'Df_B', I would like to check, for each 'Date' in 'Df_A', whether 'Date' is greater than 'Released Info' in 'Df_B'. If so, I want to add the corresponding 'Value X' to 'Df_A', under new columns 'Event A' and 'Event D'.
The desired output:
Date Info A Info B Event A Event D
9/19/18 23:00 36 48 51.8 58.7
9/18/18 23:00 47 30 51.8 58.7
9/17/18 23:00 51 3 51.8 58.7
8/14/18 23:00 45 16 52 56.2
8/6/18 23:00 37 13 52 56.2
8/5/18 23:00 42 66 50.6 56.2
7/11/18 23:00 42 53 50.6 57.3
7/4/18 23:00 38 10 54 57.3
For example, for 9/19/18 23:00, 9/18/18 23:00 and 9/17/18 23:00 in 'Df_A', the closest prior date in 'Df_B' for the group 'Event A' is 9/6/2018 22:30. Thus, for these rows we pick the value 51.8 from 'Df_B'. And so on for all dates in 'Df_A', and for both 'Event A' and 'Event D' in 'Df_B'.
I would like to add n new columns to 'Df_A': in this example 'Event A' and 'Event D', but there could be more.
For this, I have been trying to create a variable for each of the dynamic number of events (the Events come from a csv as a matrix), with something like this:
#To Create a variable for each Event
ListEvents <- as.list(as.vector(Events))
names(ListEvents) <- paste("Variable", 1:length(ListEvents), sep = "")
list2env(ListEvents,envir = .GlobalEnv)
After creating a variable for each Event, I was thinking of writing a loop to create a subset for each event, compare the Date (Df_A) with the release date (Df_B), and add the result as a column in Df_A. But I know this is an unnecessarily complex and inefficient approach. Could someone help me?

The following reproduces your expected output:
events <- c("Event A", "Event D")
library(tidyverse)
library(lubridate)
map(events, ~ Df_A %>%
      mutate(Event = .x) %>%
      left_join(Df_B, by = "Event") %>%
      mutate(
        Date = mdy_hm(Date),
        Released.Info = mdy_hm(Released.Info)) %>%
      group_by(Date) %>%
      mutate(diff = difftime(Released.Info, Date, units = "days")) %>%
      filter(diff < 0) %>%
      filter(diff == max(diff)) %>%
      select(-Released.Info, -diff) %>%
      spread(Event, Value.X)) %>%
  reduce(left_join) %>%
  arrange(desc(Date))
## A tibble: 8 x 5
## Groups: Date [8]
# Date Info.A Info.B `Event A` `Event D`
# <dttm> <int> <int> <dbl> <dbl>
#1 2018-09-19 23:00:00 36 48 51.8 58.7
#2 2018-09-18 23:00:00 47 30 51.8 58.7
#3 2018-09-17 23:00:00 51 3 51.8 58.7
#4 2018-08-14 23:00:00 45 16 52 56.2
#5 2018-08-06 23:00:00 37 13 52 56.2
#6 2018-08-05 23:00:00 42 66 50.6 56.2
#7 2018-07-11 23:00:00 42 53 50.6 57.3
#8 2018-07-04 23:00:00 38 10 54 57.3
The idea is to add an Event column to Df_A for each entry of the vector events; we then left-join Df_A and Df_B, and keep only those rows where Released.Info precedes Date with the smallest gap (that's the filter(diff < 0) and filter(diff == max(diff)) part). The rest is reshaping and rearranging to reproduce your expected output.
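Note that spread() has since been superseded in tidyr. A sketch of the same idea with current verbs (pivot_wider() replacing spread(), slice_max() replacing the two filters), assuming the sample data below:
library(tidyverse)
library(lubridate)
map(events, ~ Df_A %>%
      mutate(Event = .x) %>%
      left_join(Df_B, by = "Event") %>%
      mutate(Date = mdy_hm(Date), Released.Info = mdy_hm(Released.Info)) %>%
      filter(Released.Info < Date) %>%
      group_by(Date) %>%
      slice_max(Released.Info, n = 1) %>%
      ungroup() %>%
      select(-Released.Info) %>%
      pivot_wider(names_from = Event, values_from = Value.X)) %>%
  reduce(left_join) %>%
  arrange(desc(Date))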
Sample data
Df_A <- read.table(text =
" Date 'Info A' 'Info B'
'9/19/18 23:00' 36 48
'9/18/18 23:00' 47 30
'9/17/18 23:00' 51 3
'8/14/18 23:00' 45 16
'8/6/18 23:00' 37 13
'8/5/18 23:00' 42 66
'7/11/18 23:00' 42 53
'7/4/18 23:00' 38 10", header = T)
Df_B <- read.table(text =
"'Released Info' Event 'Value X'
'9/6/2018 22:30' 'Event A' 51.8
'8/6/2018 22:30' 'Event A' 52
'7/5/2018 22:30' 'Event A' 50.6
'6/6/2018 22:30' 'Event A' 54
'9/2/2018 22:30' 'Event C' 48
'7/31/2018 22:30' 'Event C' 45
'9/4/2018 22:30' 'Event D' 58.7
'8/2/2018 22:30' 'Event D' 56.2
'7/3/2018 22:30' 'Event D' 57.3
'6/4/2018 22:30' 'Event D' 51.1
'5/2/2018 22:30' 'Event D' 54.2
'4/4/2018 22:30' 'Event D' 59.8
'9/3/2018 1:30' 'Event E' 61.8
'8/6/2018 1:30' 'Event E' 63
'7/2/2018 1:30' 'Event E' 65.2", header = T)

This can be done with a rolling join by group in data.table.
library(data.table)
# convert data to data.table
setDT(Df_A)
setDT(Df_B)
# convert times to POSIXct (note: Df_A uses two-digit years, Df_B four-digit)
Df_A[ , Date := as.POSIXct(Date, format = "%m/%d/%y %H:%M")]
Df_B[ , Released.Info := as.POSIXct(Released.Info, format = "%m/%d/%Y %H:%M")]
# select rows
db <- Df_B[Event %in% Events]
# rolling join: for each Event in db, join to Df_A by nearest preceding time
d2 <- db[ , .SD[Df_A, on = c(Released.Info = "Date"), roll = Inf], by = Event]
# Event Released.Info Value.X Info.A Info.B
# 1: Event A 2018-09-19 23:00:00 51.8 36 48
# 2: Event A 2018-09-18 23:00:00 51.8 47 30
# [snip]
# 7: Event A 2018-07-11 23:00:00 50.6 42 53
# 8: Event A 2018-07-04 23:00:00 54.0 38 10
# 9: Event D 2018-09-19 23:00:00 58.7 36 48
# 10: Event D 2018-09-18 23:00:00 58.7 47 30
# [snip]
# 15: Event D 2018-07-11 23:00:00 57.3 42 53
# 16: Event D 2018-07-04 23:00:00 57.3 38 10
That's basically it. If desired, cast the 'Event' column to wide format and join back to 'Df_A':
dcast(d2[ , .(Event, Released.Info, Value.X)],
      Released.Info ~ Event, value.var = "Value.X")[
        Df_A, on = c(Released.Info = "Date")]
# Released.Info Event A Event D Info.A Info.B
# 1: 2018-09-19 23:00:00 51.8 58.7 36 48
# 2: 2018-09-18 23:00:00 51.8 58.7 47 30
# 3: 2018-09-17 23:00:00 51.8 58.7 51 3
# 4: 2018-08-14 23:00:00 52.0 56.2 45 16
# 5: 2018-08-06 23:00:00 52.0 56.2 37 13
# 6: 2018-08-05 23:00:00 50.6 56.2 42 66
# 7: 2018-07-11 23:00:00 50.6 57.3 42 53
# 8: 2018-07-04 23:00:00 54.0 57.3 38 10
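As a side note, roll = Inf rolls the last observation forward: for each Date in Df_A it picks the db row with the latest Released.Info at or before that Date. A minimal standalone illustration with hypothetical data:
library(data.table)
lookup <- data.table(t = as.POSIXct(c("2018-01-01", "2018-02-01")), v = c(1, 2))
query  <- data.table(t = as.POSIXct("2018-01-15"))
lookup[query, on = "t", roll = Inf]
#             t v
# 1: 2018-01-15 1   (the value from 2018-01-01 is carried forward)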

Related

moving average on different size data frames in R

I have a set of data taken every 30 minutes consisting of the following structure:
>df1
Date X1
01/01/2017 0:00 1
01/01/2017 0:30 32
01/01/2017 1:00 65
01/01/2017 1:30 14
01/01/2017 2:00 25
01/01/2017 2:30 14
01/01/2017 3:00 85
01/01/2017 3:30 74
01/01/2017 4:00 74
01/01/2017 4:30 52
01/01/2017 5:00 25
01/01/2017 5:30 74
01/01/2017 6:00 45
01/01/2017 6:30 52
01/01/2017 7:00 21
01/01/2017 7:30 41
01/01/2017 8:00 74
01/01/2017 8:30 11
01/01/2017 9:00 2
01/01/2017 9:30 52
Another vector is given consisting of only dates, but with a different time frequency:
>V1
Date2
1/1/2017 1:30:00
1/1/2017 3:30:00
1/1/2017 5:30:00
1/1/2017 9:30:00
I would like to calculate the moving average of X1, but in the end the only values I really need are the ones associated with the dates in V1 (while preserving the smoothing generated by the moving average).
Would you recommend calculating the moving average of X1, then associating each value with the corresponding date in V1 and re-applying a moving average? Or do you know a function in R that would help me achieve this?
Thank you, I really appreciate your help!
Sofía
The base R function stats::filter is a convenient way to construct moving averages.
Assuming you want a simple arithmetic moving average, you'll need to decide how many elements you'd like to average together, and whether you'd like a one- or two-sided average. Arbitrarily, I'll pick 5 and one-sided.
elements <- 5
df1$x1.smooth <- stats::filter(df1$X1, filter = rep(1/elements, elements), sides = 1)
Note that x1.smooth will be NA for the first elements - 1 rows, because the moving average is undefined until there are elements items to average. (Using stats::filter explicitly also avoids a clash with dplyr::filter if dplyr is loaded.)
df1 is now
Date X1 x1.smooth
1 01/01/2017 0:00 1 NA
2 01/01/2017 0:30 32 NA
3 01/01/2017 1:00 65 NA
4 01/01/2017 1:30 14 NA
5 01/01/2017 2:00 25 27.4
6 01/01/2017 2:30 14 30.0
7 01/01/2017 3:00 85 40.6
8 01/01/2017 3:30 74 42.4
9 01/01/2017 4:00 74 54.4
10 01/01/2017 4:30 52 59.8
11 01/01/2017 5:00 25 62.0
12 01/01/2017 5:30 74 59.8
13 01/01/2017 6:00 45 54.0
14 01/01/2017 6:30 52 49.6
15 01/01/2017 7:00 21 43.4
16 01/01/2017 7:30 41 46.6
17 01/01/2017 8:00 74 46.6
18 01/01/2017 8:30 11 39.8
19 01/01/2017 9:00 2 29.8
20 01/01/2017 9:30 52 36.0
Now you need only merge the two data frames on Date = Date2, or else subset df1 to the rows where Date is %in% V1$Date2.
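A sketch of the subsetting route, assuming both time columns are parsed as POSIXct in the same time zone (formats inferred from the data shown):
# parse the timestamps so they compare exactly
df1$Date <- as.POSIXct(df1$Date, format = "%m/%d/%Y %H:%M")
V1$Date2 <- as.POSIXct(V1$Date2, format = "%m/%d/%Y %H:%M:%S")
# keep only the smoothed values at the times listed in V1
df1[df1$Date %in% V1$Date2, c("Date", "x1.smooth")]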
Another option could be to use the zoo package. One can use rollapply to calculate and add another column to the data frame holding the moving average of X1.
An implementation with a moving average of width 4 (every 2 hours, given half-hourly data):
library(zoo)
# add another column with the mean value
df1$mean <- rollapply(df1$X1, 4, mean, align = "right", fill = NA)
df1
# Date X1 mean
# 1 2017-01-01 00:00:00 1 NA
# 2 2017-01-01 00:30:00 32 NA
# 3 2017-01-01 01:00:00 65 NA
# 4 2017-01-01 01:30:00 14 28.00
# 5 2017-01-01 02:00:00 25 34.00
# 6 2017-01-01 02:30:00 14 29.50
# 7 2017-01-01 03:00:00 85 34.50
# 8 2017-01-01 03:30:00 74 49.50
# 9 2017-01-01 04:00:00 74 61.75
# 10 2017-01-01 04:30:00 52 71.25
# 11 2017-01-01 05:00:00 25 56.25
# 12 2017-01-01 05:30:00 74 56.25
# 13 2017-01-01 06:00:00 45 49.00
# 14 2017-01-01 06:30:00 52 49.00
# 15 2017-01-01 07:00:00 21 48.00
# 16 2017-01-01 07:30:00 41 39.75
# 17 2017-01-01 08:00:00 74 47.00
# 18 2017-01-01 08:30:00 11 36.75
# 19 2017-01-01 09:00:00 2 32.00
# 20 2017-01-01 09:30:00 52 34.75

Finding Fitted Forecast with ARIMA / Exponential Smoothing in R

I've written some code to sort claim dates and count them per month/year, and I am attempting to forecast them with either ARIMA or exponential smoothing.
See claims list:
2012-01-31 82
2012-02-29 65
2012-03-31 64
2012-04-30 73
2012-05-31 71
2012-06-30 79
2012-07-31 72
2012-08-31 82
2012-09-29 64
2012-10-31 72
2012-11-30 63
2012-12-31 80
2013-01-31 67
2013-02-27 65
2013-03-31 84
2013-04-30 68
2013-05-31 68
2013-06-29 66
2013-07-30 64
2013-08-31 69
2013-09-29 66
2013-10-31 65
2013-11-30 56
2013-12-31 76
2014-01-31 75
2014-02-28 58
2014-03-29 80
2014-04-30 76
2014-05-31 80
2014-06-28 68
2014-07-31 82
2014-08-30 79
2014-09-27 60
2014-10-31 85
2014-11-30 60
2014-12-31 76
2015-01-31 75
2015-02-28 84
2015-03-31 77
2015-04-30 79
2015-05-30 91
2015-06-30 82
2015-07-31 98
2015-08-31 65
2015-09-30 77
2015-10-31 115
2015-11-30 79
2015-12-31 80
2016-01-30 91
2016-02-29 105
2016-03-31 77
2016-04-30 107
2016-05-31 85
2016-06-30 89
2016-07-30 112
2016-08-31 88
2016-09-30 90
2016-10-30 79
2016-11-30 85
2016-12-31 66
The issue I'm facing with my code is that I am getting a mean forecast rather than my desired fitted data, similar to this example: https://stats.stackexchange.com/questions/115506/forecasting-a-seasonal-time-series-in-r
Please see the R code:
Sorting the claim dates and counting them
library(forecast)
library(ggplot2)
library(xts)
library(reshape2)
library(zoo)
library(lubridate)
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date(data$DISABILITYDATE, "%m/%d/%Y")
data
str(data)
as.Date(data[,1])
xts(x=data[,-1], order.by = data[,1])
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date(data$DISABILITYDATE, "%m/%d/%Y")
df <- xts(rep(1,length(data$DISABILITYDATE)),order.by=data$DISABILITYDATE)
df1 <- apply.monthly(df,function(x) length(x))
df1
t(df1)
str(df1)
df2 <- data.frame(df1=c("Jan 2012","Feb 2012","Mar 2012","Apr 2012","May 2012","Jun 2012","Jul 2012","Aug 2012","Sep 2012","Oct 2012","Nov 2012","Dec 2012","Jan 2013","Feb 2013","Mar 2013","Apr 2013","May 2013","Jun 2013","Jul 2013","Aug 2013","Sep 2013","Oct 2013","Nov 2013","Dec 2013","Jan 2014","Feb 2014","Mar 2014","Apr 2014","May 2014","Jun 2014","Jul 2014","Aug 2014","Sep 2014","Oct 2014","Nov 2014","Dec 2014","Jan 2015","Feb 2015","Mar 2015","Apr 2015","May 2015","Jun 2015","Jul 2015","Aug 2015","Sep 2015","Oct 2015","Nov 2015","Dec 2015","Jan 2016","Feb 2016","Mar 2016","Apr 2016","May 2016","Jun 2016","Jul 2016","Aug 2016","Sep 2016","Oct 2016","Nov 2016","Dec 2016"),score=c(df1))
df2
t(df2)
df2[-1]
2.1 Forecasting with ETS (Exponential Smoothing)
library(forecast)
x.ts <- as.ts(df2[2])
x.ts
x.ets <- ets(x.ts)
x.ets
x.fore <- forecast(x.ets$fitted, h=12)
x.fore
x <- ts(df2[2], start = 2012, frequency = 12)
plot(forecast(ets(x), 24))
x
plot(forecast(x, h=12))
date1 <- ymd("2012-01-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01")
abline(v=decimal_date(date1), col="blue")
2.2 Forecasting with ARIMA
ARIMAfit = auto.arima(x, approximation=FALSE,trace=FALSE)
summary(ARIMAfit)
plot(ARIMAfit)
pred = predict(ARIMAfit, n.ahead = 48)
round(as.numeric(pred$fitted,0))
pred
library(TSPred)
plotarimapred(pred$pred,x, xlim=c(2012, 2020), range.percent = 0.05)
My output is this: [screenshot in the original post, which also linked an image labeled "example of desired output"]
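A hedged sketch of one likely fix (not part of the original post): forecast(x.ets$fitted, h = 12) fits a new model to the fitted values rather than to the series itself. To show the in-sample fitted values alongside the forecast, forecast the model and overlay fitted(), assuming x is the monthly ts built above:
library(forecast)
fit <- ets(x)                    # fit ETS to the monthly counts
fc  <- forecast(fit, h = 12)     # forecast the model, not its fitted values
plot(fc)                         # point forecasts with prediction intervals
lines(fitted(fit), col = "red")  # overlay the in-sample fitted values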

Aggregate data by user defined time interval

I have a following dataframe:
df<-data.frame(timecol=as.POSIXct(c("2016-05-31 22:12:27 PDT","2016-05-31 22:25:03 PDT","2016-05-31 23:08:43 PDT","2016-05-31 23:24:10 PDT","2016-06-01 02:00:56 PDT","2016-06-01 03:00:56 PDT","2016-06-01 05:00:56 PDT","2016-06-01 22:12:27 PDT","2016-06-01 22:25:03 PDT","2016-06-01 23:08:43 PDT","2016-06-01 23:24:10 PDT","2016-06-02 02:00:56 PDT","2016-06-02 03:00:56 PDT","2016-06-02 05:00:56 PDT")),value=sample(1:100,14))
> df
timecol value
1 2016-05-31 22:12:27 100
2 2016-05-31 22:25:03 86
3 2016-05-31 23:08:43 39
4 2016-05-31 23:24:10 91
5 2016-06-01 02:00:56 32
6 2016-06-01 03:00:56 93
7 2016-06-01 05:00:56 53
8 2016-06-01 22:12:27 54
9 2016-06-01 22:25:03 76
10 2016-06-01 23:08:43 19
11 2016-06-01 23:24:10 56
12 2016-06-02 02:00:56 20
13 2016-06-02 03:00:56 3
14 2016-06-02 05:00:56 66
I need to aggregate the value column over a predefined time interval: from 7 pm one day to 7 am the next day. I was thinking something like this:
tm <- seq(as.POSIXct("2016-05-31 19:00:00 PDT"),as.POSIXct("2016-06-02 07:00:00 PDT"), by = "12 hours")
aggregate(df$value, list(day = cut(tm, "days")), sum)
but I can't figure out what's wrong.
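A hedged sketch of one likely fix (not part of the original post): the attempt cuts the break sequence tm itself rather than the data, so the grouping does not line up with df$value. Cutting df$timecol at those breaks gives one group per 12-hour window:
tm <- seq(as.POSIXct("2016-05-31 19:00:00"),
          as.POSIXct("2016-06-02 07:00:00"), by = "12 hours")
# bin each observation into the 12-hour window containing it, then sum
aggregate(df$value, list(interval = cut(df$timecol, breaks = tm)), sum)
# rows whose interval starts at 19:00 are the 7pm-to-7am windows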

How to calculate average time interval based on unique value?

I'm having trouble calculating the average time interval (in days) between appearances of the same value in another column.
My data looks like this:
dt subject_id
2016-09-13 77
2016-11-07 1791
2016-09-18 1332
2016-08-31 84
2016-08-23 89
2016-08-23 41
2016-09-15 41
2016-10-12 93
2016-10-05 93
2016-11-09 94
2016-10-25 94
2016-11-03 94
2016-10-09 375
2016-10-14 11
2016-09-27 11
2016-09-13 11
2016-08-23 11
2016-08-27 11
And I want to get something like this:
subject_id mean_day
41 23
93 7
94 7.5
11 13
I tried to use:
aggregate(dt~subject_id, data, mean)
But it can't calculate a mean from Date values. Any ideas?
My first approach would be something like this:
df$dt <- as.Date(df$dt)
library(dplyr)
df %>%
  group_by(subject_id) %>%
  summarise(mean_day = (max(dt) - min(dt)) / (n() - 1))
#  subject_id  mean_day
#       <int>    <time>
#1 11 13.0 days
#2 41 23.0 days
#3 77 NaN days
#4 84 NaN days
#5 89 NaN days
#6 93 7.0 days
#7 94 7.5 days
#8 375 NaN days
#9 1332 NaN days
#10 1791 NaN days
I think this gives you a starting point; you can modify it as you want.
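For instance, to match the desired output exactly, one possible refinement is to drop subjects that appear only once and name the column (a sketch):
df %>%
  group_by(subject_id) %>%
  filter(n() > 1) %>%
  summarise(mean_day = as.numeric(max(dt) - min(dt)) / (n() - 1))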

R, how to aggregate data with same date field in an R dataframe

Hi, I have an R data frame that looks like the following:
SURVEY.DATE A B C
1898 2010-05-13 38 34 21
1899 2010-05-13 38 33 21
1897 2010-05-14 37 34 21
1895 2010-05-21 38 29 21
1896 2010-05-21 39 32 21
1894 2010-05-23 39 32 21
I would like to average the rows with the same date so as to have only one observation per day. Ideally I would like to end up with an xts object that would look like:
SURVEY.DATE A B C
1898 2010-05-13 38 33.5 21
1897 2010-05-14 37 34 21
1896 2010-05-21 38.5 30.5 21
1894 2010-05-23 39 32 21
This seems to be a challenge for my newbie R skills... any help / pointers would be appreciated.
You could try
library(dplyr)
res <- df1 %>%
  group_by(SURVEY.DATE) %>%
  summarise_each(funs(mean))
Or
res1 <- aggregate(.~SURVEY.DATE, df1, mean)
and then convert it to xts
library(xts)
xts(res1[-1], order.by= as.Date(res1[,1]))
# A B C
#2010-05-13 38.0 33.5 21
#2010-05-14 37.0 34.0 21
#2010-05-21 38.5 30.5 21
#2010-05-23 39.0 32.0 21
Here's how I'd do this using data.table.
require(data.table)
setDT(df)[, lapply(.SD, mean), by=SURVEY.DATE]
# SURVEY.DATE A B C
# 1: 2010-05-13 38.0 33.5 21
# 2: 2010-05-14 37.0 34.0 21
# 3: 2010-05-21 38.5 30.5 21
# 4: 2010-05-23 39.0 32.0 21
Check the new HTML vignettes if you'd like to learn more.
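If the xts object is still wanted from the data.table route, the conversion works the same way as above (a sketch):
library(xts)
res2 <- setDT(df)[, lapply(.SD, mean), by = SURVEY.DATE]
xts(as.matrix(res2[, !"SURVEY.DATE"]), order.by = as.Date(res2$SURVEY.DATE))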
