Similarly to a question posted here, I want to compute the number of overlapping days between two periods, conditional on a third variable (location).
For each observation in the main dataset (DF) I have a start date, an end date, and a location (character) variable. The Events data contains each event's location, start date and end date. Multiple events in the same location with (partially) overlapping periods are allowed.
Thus, for each observation in DF, the period must be compared with the periods of the matching events in the event dataset (Events). This means that the count of overlapping days between one DF period and multiple Events periods must be net of any days on which two (or more) of those Events periods overlap each other, so that no day is counted twice.
The data structure of my two data sources can be reproduced in R with this code (note that the location variable has been set to an integer for simplicity):
set.seed(1)
DF <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 20),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 20),
location = sample(seq(1:5)),20)
Events <- data.frame(
start = sample(seq(as.Date('2018-01-01'), as.Date('2018-04-30'), by="day"), 30),
end = sample(seq(as.Date('2018-05-01'), as.Date('2018-10-30'), by="day"), 30),
location = sample(seq(1:5)), 30 )
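A note on the snippet above: the `20` and `30` sit outside the `sample()` call, so `location` is a permutation of 1:5 recycled down the rows and the stray number becomes an extra constant column (`X20`, `X30`), which is why that column appears in the printed output in this post. If independent random locations were intended (an assumption, not something the post states), a minimal sketch would look like this; `DF_alt` and `n` are names made up for the illustration, and using it changes the random draws, so the outputs shown in this post would no longer match:

```r
set.seed(1)
n <- 20  # number of observations (hypothetical helper variable)

# Build the same kind of data frame, but draw n locations from 1..5
# with replacement, with the size argument inside sample()
DF_alt <- data.frame(
  start    = sample(seq(as.Date("2018-01-01"), as.Date("2018-04-30"), by = "day"), n),
  end      = sample(seq(as.Date("2018-05-01"), as.Date("2018-10-30"), by = "day"), n),
  location = sample(1:5, n, replace = TRUE)
)
str(DF_alt)
```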
In the simple case where the Events data reduces to a single event (and we do not care about the location), counting the overlapping days for each observation in DF can be done easily with dplyr and the following code, taken from Matthew Lundberg's answer here. Note that I have created another data frame with a single event (One_event):
library(dplyr)
One_event <- data.frame(
start = as.Date('2018-01-01'),
end = as.Date('2018-07-30'))
DF %>%
mutate(overlap = pmax(pmin(One_event$end, end) - pmax(One_event$start, start) + 1,0))
resulting in:
start end location X20 overlap
1 2018-02-01 2018-10-19 5 20 180 days
2 2018-02-14 2018-06-08 3 20 115 days
3 2018-03-09 2018-08-26 4 20 144 days
4 2018-04-17 2018-05-23 2 20 37 days
5 2018-01-24 2018-06-17 1 20 145 days
6 2018-04-14 2018-07-08 5 20 86 days
7 2018-04-18 2018-05-03 3 20 16 days
8 2018-03-16 2018-07-07 4 20 114 days
9 2018-03-12 2018-09-30 2 20 141 days
10 2018-01-07 2018-06-29 1 20 174 days
11 2018-01-23 2018-07-23 5 20 182 days
12 2018-01-20 2018-08-12 3 20 192 days
13 2018-04-23 2018-07-24 4 20 93 days
14 2018-02-11 2018-06-01 2 20 111 days
15 2018-03-23 2018-09-17 1 20 130 days
16 2018-02-22 2018-08-21 5 20 159 days
17 2018-04-24 2018-09-10 3 20 98 days
18 2018-04-13 2018-05-18 4 20 36 days
19 2018-02-08 2018-08-28 2 20 173 days
20 2018-03-20 2018-10-23 1 20 133 days
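As a sanity check on the formula, here is the same computation written as a small standalone function and applied to two hand-picked periods. The name `overlap_days` is made up for this sketch; the first call uses the dates of row 1 above, the second a period that does not touch the event at all:

```r
# Inclusive number of overlapping days between [s1, e1] and [s2, e2];
# periods that do not overlap give 0 instead of a negative number.
overlap_days <- function(s1, e1, s2, e2) {
  pmax(as.numeric(pmin(e1, e2) - pmax(s1, s2)) + 1, 0)
}

ev_start <- as.Date("2018-01-01")
ev_end   <- as.Date("2018-07-30")

# Row 1 of DF: overlap runs from 2018-02-01 to 2018-07-30
overlap_days(as.Date("2018-02-01"), as.Date("2018-10-19"), ev_start, ev_end)
# → 180

# A period that starts after the event has ended
overlap_days(as.Date("2018-08-05"), as.Date("2018-08-20"), ev_start, ev_end)
# → 0
```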
Now back to the original problem.
To compare the period of each observation in DF with the matching event(s) by location, I think it would be reasonable to use the apply function, subset the Events dataset by the observation's location, and finally run mutate for each row against that subset of the Events data (temp):
apply(DF, 1, function(x) {
  temp = Events[Events$location %in% x["location"], ]
  x %>%
    mutate(overlap = pmax(pmin(temp$end, end) - pmax(temp$start, start) + 1, 0))
})
There are several issues with this last piece of code. First, it does not work and gives an error message:
(Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character")
Second, it does not account for two (or more) periods overlapping within the Events dataset.
Are you looking for this?
apply(DF, MARGIN = 1, function(x) {
  Events[Events$location == x["location"], ] %>%
    mutate(overlap = pmax(pmin(.data$end, x["end"]) - pmax(.data$start, x["start"])))
})
In my case this results in:
[[1]]
start end location X30 overlap
1 2018-02-01 2018-07-28 5 30 177 days
2 2018-04-14 2018-08-27 5 30 135 days
3 2018-01-23 2018-09-20 5 30 231 days
4 2018-02-22 2018-09-10 5 30 200 days
5 2018-04-04 2018-07-17 5 30 104 days
6 2018-02-06 2018-05-16 5 30 99 days
[[2]]
start end location X30 overlap
1 2018-01-24 2018-09-26 3 30 114 days
2 2018-01-07 2018-07-11 3 30 114 days
3 2018-03-23 2018-10-28 3 30 77 days
4 2018-03-20 2018-08-22 3 30 80 days
5 2018-01-26 2018-05-12 3 30 87 days
6 2018-01-31 2018-07-02 3 30 114 days
[[3]]
start end location X30 overlap
1 2018-03-09 2018-07-29 4 30 142 days
2 2018-03-16 2018-05-19 4 30 64 days
3 2018-04-23 2018-09-11 4 30 125 days
4 2018-04-13 2018-07-19 4 30 97 days
5 2018-03-05 2018-07-10 4 30 123 days
6 2018-02-05 2018-07-20 4 30 133 days
...
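The output above lists one overlap per matching event, so days covered by several events of the same location are counted more than once if the values are summed. One way to get the net count asked for in the question is to enumerate the covered days and count each day once. The following is only a sketch under assumptions: `net_overlap_days`, `obs` and `ev` are hypothetical names, the small data frames stand in for DF and Events, and periods are treated as inclusive of both end points.

```r
# Count the distinct days of [start, end] covered by at least one event.
net_overlap_days <- function(start, end, ev_start, ev_end) {
  if (length(ev_start) == 0L) return(0L)   # no event at this location
  lo <- pmax(ev_start, start)              # clip each event to the window
  hi <- pmin(ev_end, end)
  keep <- lo <= hi                         # events that really overlap
  if (!any(keep)) return(0L)
  # enumerate covered days as day numbers and count each one only once
  days <- unlist(Map(seq, as.integer(lo[keep]), as.integer(hi[keep])))
  length(unique(days))
}

# Toy stand-ins for DF and Events (made-up values)
obs <- data.frame(
  start    = as.Date(c("2018-01-01", "2018-01-01", "2018-03-01")),
  end      = as.Date(c("2018-01-31", "2018-01-31", "2018-03-10")),
  location = c(1, 3, 2)
)
ev <- data.frame(
  start    = as.Date(c("2018-01-05", "2018-01-08", "2018-01-25", "2018-01-01")),
  end      = as.Date(c("2018-01-10", "2018-01-15", "2018-02-10", "2018-01-31")),
  location = c(1, 1, 1, 2)
)

# Loop over row indices rather than apply(), which would coerce the
# dates to character
obs$net_overlap <- vapply(seq_len(nrow(obs)), function(i) {
  same_loc <- ev[ev$location == obs$location[i], ]
  net_overlap_days(obs$start[i], obs$end[i], same_loc$start, same_loc$end)
}, integer(1))
obs
```

For the first toy observation the three events at location 1 cover 2018-01-05 to 2018-01-15 and 2018-01-25 to 2018-01-31, so the net count is 18 days, whereas summing the three pairwise overlaps would give 21. Enumerating days is simple but memory-hungry; for very long periods or many events, merging the intervals first would scale better.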
I am using Prophet, the new package released by Facebook. It does time series predictions, and I want to apply it by group.
Scroll down to the R section:
https://facebookincubator.github.io/prophet/docs/quick_start.html
This is my attempt:
grouped_output = df %>% group_by(group) %>%
do(m = prophet(df[,c(1,3)])) %>%
do(future = make_future_dataframe(m, period = 7)) %>%
do(forecast = prophet:::predict.prophet(m, future))
grouped_output[[1]]
I then need to extract the results for each group from the resulting list, which I am having trouble doing.
Below is my original dataframe without the groups:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90)
y<-as.numeric(y)
df <- data.frame(ds, y)
df
ds y
1 2016-11-01 15
2 2016-11-02 17
3 2016-11-03 18
4 2016-11-04 19
5 2016-11-05 20
6 2016-11-06 54
7 2016-11-07 67
8 2016-11-08 23
9 2016-11-09 12
10 2016-11-10 34
11 2016-11-11 12
12 2016-11-12 78
13 2016-11-13 34
14 2016-11-14 12
15 2016-11-15 3
16 2016-11-16 45
17 2016-11-17 67
18 2016-11-18 89
19 2016-11-19 12
20 2016-11-20 111
21 2016-11-21 123
22 2016-11-22 112
23 2016-11-23 14
24 2016-11-24 566
25 2016-11-25 345
26 2016-11-26 123
27 2016-11-27 567
28 2016-11-28 56
29 2016-11-29 87
30 2016-11-30 90
The current code works when I apply it to a single group, as follows:
#install.packages('prophet')
library(prophet)
m<-prophet(df)
future <- make_future_dataframe(m, period = 7)
forecast <- prophet:::predict.prophet(m, future)
forecast$yhat
[1] -2.649032 -29.762095 128.169781 59.573684 -11.623727 107.473617 -29.949730 -42.862455 -62.378408 104.797639 46.868610
[12] -12.502864 119.282058 -4.914921 -4.402638 -10.643570 169.309505 123.321261 74.734746 215.856347 99.290218 105.508059
[23] 102.882915 284.245984 237.401258 185.688202 321.466962 197.451536 194.280518 180.535663 349.304365 288.684031 222.337210
[34] 342.968499 203.648851 185.377165
I now want to change this so that the prophet:::predict function is applied to each group. The new data frame, by group, looks like this:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30',
'2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90,
45,23,12,10,21,34,12,45,12,44,87,45,32,67,1,57,87,99,33,234,456,123,89,333,411,232,455,55,90,21)
y<-as.numeric(y)
group<-c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
df <- data.frame(ds,group, y)
df
ds group y
1 2016-11-01 A 15
2 2016-11-02 A 17
3 2016-11-03 A 18
4 2016-11-04 A 19
5 2016-11-05 A 20
6 2016-11-06 A 54
7 2016-11-07 A 67
8 2016-11-08 A 23
9 2016-11-09 A 12
10 2016-11-10 A 34
11 2016-11-11 A 12
12 2016-11-12 A 78
13 2016-11-13 A 34
14 2016-11-14 A 12
15 2016-11-15 A 3
16 2016-11-16 A 45
17 2016-11-17 A 67
18 2016-11-18 A 89
19 2016-11-19 A 12
20 2016-11-20 A 111
21 2016-11-21 A 123
22 2016-11-22 A 112
23 2016-11-23 A 14
24 2016-11-24 A 566
25 2016-11-25 A 345
26 2016-11-26 A 123
27 2016-11-27 A 567
28 2016-11-28 A 56
29 2016-11-29 A 87
30 2016-11-30 A 90
31 2016-11-01 B 45
32 2016-11-02 B 23
33 2016-11-03 B 12
34 2016-11-04 B 10
35 2016-11-05 B 21
36 2016-11-06 B 34
37 2016-11-07 B 12
38 2016-11-08 B 45
39 2016-11-09 B 12
40 2016-11-10 B 44
41 2016-11-11 B 87
42 2016-11-12 B 45
43 2016-11-13 B 32
44 2016-11-14 B 67
45 2016-11-15 B 1
46 2016-11-16 B 57
47 2016-11-17 B 87
48 2016-11-18 B 99
49 2016-11-19 B 33
50 2016-11-20 B 234
51 2016-11-21 B 456
52 2016-11-22 B 123
53 2016-11-23 B 89
54 2016-11-24 B 333
55 2016-11-25 B 411
56 2016-11-26 B 232
57 2016-11-27 B 455
58 2016-11-28 B 55
59 2016-11-29 B 90
60 2016-11-30 B 21
How do I use the prophet package to predict y-hat by group rather than in total?
Here is a solution that uses tidyr::nest to nest the data by group, fits the models in those groups using purrr::map, and then retrieves the y-hat as requested.
I took your code but incorporated it into mutate calls that compute new columns using purrr::map.
library(prophet)
library(dplyr)
library(purrr)
library(tidyr)
d1 <- df %>%
  nest(-group) %>%
  mutate(m = map(data, prophet)) %>%
  mutate(future = map(m, make_future_dataframe, periods = 7)) %>%
  mutate(forecast = map2(m, future, predict))
Here is the output at this point:
d1
# A tibble: 2 × 5
group data m future
<fctr> <list> <list> <list>
1 A <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
2 B <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
# ... with 1 more variables: forecast <list>
Then I use unnest() to retrieve the data from the forecast column and select the y-hat value as requested.
d <- d1 %>%
unnest(forecast) %>%
select(ds, group, yhat)
And here is the output for the newly forecasted values:
d %>% group_by(group) %>%
top_n(7, ds)
Source: local data frame [14 x 3]
Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-11-30 A 180.53422
2 2016-12-01 A 349.30277
3 2016-12-02 A 288.68215
4 2016-12-03 A 222.33501
5 2016-12-04 A 342.96654
6 2016-12-05 A 203.64625
7 2016-12-06 A 185.37395
8 2016-11-30 B 131.07827
9 2016-12-01 B 222.83703
10 2016-12-02 B 236.33555
11 2016-12-03 B 145.41001
12 2016-12-04 B 228.59687
13 2016-12-05 B 162.49244
14 2016-12-06 B 68.44477
I was looking for a solution to the same problem. I came up with the following code, which is a bit simpler than the accepted answer.
library(tidyr)
library(dplyr)
library(prophet)
data = df %>%
  group_by(group) %>%
  do({
    m <- prophet(.)  # fit once per group and reuse the model
    predict(m, make_future_dataframe(m, periods = 7))
  }) %>%
  select(ds, group, yhat)
And here are the predicted values:
data %>% group_by(group) %>%
top_n(7, ds)
# A tibble: 14 x 3
# Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-12-01 A 316.9709
2 2016-12-02 A 258.2153
3 2016-12-03 A 196.6835
4 2016-12-04 A 346.2338
5 2016-12-05 A 208.9083
6 2016-12-06 A 216.5847
7 2016-12-07 A 206.3642
8 2016-12-01 B 230.0424
9 2016-12-02 B 268.5359
10 2016-12-03 B 190.2903
11 2016-12-04 B 312.9019
12 2016-12-05 B 266.5584
13 2016-12-06 B 189.3556
14 2016-12-07 B 168.9791
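The same split-apply-combine idea can also be written in base R, without `do()` or list-columns. The sketch below is only an illustration of the pattern: `forecast_by_group()`, `lm_trend()` and `toy` are hypothetical names, and a plain `lm()` trend stands in for Prophet so that the snippet runs without the package installed. With Prophet available, the stand-in could presumably be swapped for a function that calls `prophet()`, `make_future_dataframe()` and `predict()` as in the answers above.

```r
# Apply a fit-and-forecast function to each group and stack the results.
forecast_by_group <- function(df, fit_and_forecast, horizon = 7) {
  pieces <- lapply(split(df, df$group), function(d) {
    out <- fit_and_forecast(d, horizon)
    out$group <- d$group[1]          # label the rows with their group
    out
  })
  res <- do.call(rbind, pieces)
  row.names(res) <- NULL
  res
}

# Stand-in model (NOT Prophet): a straight-line trend on the date.
lm_trend <- function(d, horizon) {
  fit <- lm(y ~ as.numeric(ds), data = d)
  future <- data.frame(ds = max(d$ds) + seq_len(horizon))
  future$yhat <- as.numeric(predict(fit, newdata = future))
  future
}

# Made-up data: group A rises by 1 per day, group B falls by 2 per day
toy <- data.frame(
  ds    = rep(seq(as.Date("2016-11-01"), by = "day", length.out = 10), 2),
  group = rep(c("A", "B"), each = 10),
  y     = c(1:10, seq(20, 2, by = -2))
)
res <- forecast_by_group(toy, lm_trend, horizon = 7)
head(res)
```

Because each group is fitted inside its own function call, the model object is created once per group and nothing is shared between groups.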
I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to wide format with time bands based on month (i.e. Fall = August-October; Winter = January-February; Spring = March-May). Some bands will include more than one observation per participant and will therefore need a "spill-over" band. An example transformation for the Fl scores is shown below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that "redundant" dates (i.e. more than one August-October observation) spill over into the Fall2Fl column. Dates that fall outside the desired bands (i.e. November, December, June, July) should be deleted. The final data set should include the corresponding columns for Er and Cmp as well as Fl.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
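## Packages used below: `month()` and `year()` are assumed to come from
## lubridate; `dcast()` comes from reshape2
library(lubridate)
library(reshape2)
## Convert dates to an actual date-time class (POSIXct stores cleanly in a data frame)
mydf$Date <- as.POSIXct(strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M"))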
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Use `dcast` from "reshape2" on just one variable.
## Repeat for other variables by changing the "value.var", which must match
## your column name (shown as "Fl" in the question's excerpt)
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
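The `Spring_2` and `Spring_3` columns above arise because the counter restarts per ID and calendar year, not per season. If the counter should instead restart within each season band, as in the desired output, a base-R sketch could look like the following. It is only an illustration under assumptions: it uses a hand-typed subset of the question's rows (IDs 1 and 5, Fl only), the names `long`, `band` and `wide` are made up, and it assumes each ID is observed over a single school year, so that ID and season are enough to define a band.

```r
# Hand-typed subset of the question's data (IDs 1 and 5, Fl scores only)
long <- data.frame(
  ID   = c(1, 1, 1, 1, 5, 5, 5, 5, 5),
  Date = c("9/24/2010 11:38", "1/11/2011 11:53", "1/15/2011 11:36",
           "3/7/2011 11:28", "9/12/2011 11:30", "9/15/2011 13:23",
           "11/15/2011 11:34", "1/9/2012 11:32", "4/20/2012 11:30"),
  Fl   = c(15, 39, 39, 95, 98, 55, 98, 55, 23)
)
long$Date <- as.POSIXct(long$Date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Month number -> season band; months outside the bands map to NA
band <- c("Winter", "Winter", "Spring", "Spring", "Spring", NA, NA,
          "Fall", "Fall", "Fall", NA, NA)
long$Season <- band[as.integer(format(long$Date, "%m"))]
long <- long[!is.na(long$Season), ]          # drop Jun, Jul, Nov, Dec
long <- long[order(long$ID, long$Date), ]

# Counter that restarts within each ID x Season (2 = spill-over band)
long$Times <- ave(seq_along(long$ID), long$ID, long$Season, FUN = seq_along)
long$col <- paste0(long$Season, long$Times, "Fl")

# One row per ID, one column per band; empty cells become NA
wide <- tapply(long$Fl, list(long$ID, long$col), function(v) v[1])
wide <- wide[, c("Fall1Fl", "Fall2Fl", "Winter1Fl", "Winter2Fl", "Spring1Fl")]
wide
```

Repeating the last three steps with Er or Cmp in place of Fl (and a matching column suffix) would give the remaining columns, which can then be combined with `cbind()`.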
I want to generate a row (with zero amount) for each missing month (up to the current month) in the following data frame. Can you please give me a hand with this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...
You can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
And if you need months up to the current one, use this:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
Note that it is sufficient to simply call seq instead of seq.Date if the arguments are of class Date.
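For reuse, the steps above can be wrapped into a small helper. This is just a sketch: `fill_missing_months()` and its `through` argument are hypothetical names, and it assumes the columns are called `trans_date` and `ammount` as in the question and that every date falls on the first of a month.

```r
# Add a zero-amount row for every month missing between the first
# transaction and `through` (defaults to the last transaction month).
fill_missing_months <- function(df, through = max(df$trans_date)) {
  all_months <- data.frame(
    trans_date = seq(min(df$trans_date), through, by = "month")
  )
  out <- merge(all_months, df, by = "trans_date", all.x = TRUE)
  out$ammount[is.na(out$ammount)] <- 0
  out
}

# First three rows of the question's data
small <- data.frame(
  trans_date = as.Date(c("2004-12-01", "2005-04-01", "2005-05-01")),
  ammount    = c(2968.91, 500.62, 434.30)
)
filled <- fill_missing_months(small)
filled
```

Calling `fill_missing_months(small, through = Sys.Date())` extends the sequence up to the current month.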
If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
library(xts)
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)