I've written some code to sort claim dates, count them per month/year, and I am now attempting to forecast the monthly counts with either ARIMA or exponential smoothing.
See claims list:
2012-01-31 82
2012-02-29 65
2012-03-31 64
2012-04-30 73
2012-05-31 71
2012-06-30 79
2012-07-31 72
2012-08-31 82
2012-09-29 64
2012-10-31 72
2012-11-30 63
2012-12-31 80
2013-01-31 67
2013-02-27 65
2013-03-31 84
2013-04-30 68
2013-05-31 68
2013-06-29 66
2013-07-30 64
2013-08-31 69
2013-09-29 66
2013-10-31 65
2013-11-30 56
2013-12-31 76
2014-01-31 75
2014-02-28 58
2014-03-29 80
2014-04-30 76
2014-05-31 80
2014-06-28 68
2014-07-31 82
2014-08-30 79
2014-09-27 60
2014-10-31 85
2014-11-30 60
2014-12-31 76
2015-01-31 75
2015-02-28 84
2015-03-31 77
2015-04-30 79
2015-05-30 91
2015-06-30 82
2015-07-31 98
2015-08-31 65
2015-09-30 77
2015-10-31 115
2015-11-30 79
2015-12-31 80
2016-01-30 91
2016-02-29 105
2016-03-31 77
2016-04-30 107
2016-05-31 85
2016-06-30 89
2016-07-30 112
2016-08-31 88
2016-09-30 90
2016-10-30 79
2016-11-30 85
2016-12-31 66
The issue with my code is that I am getting a mean forecast rather than the fitted, seasonal forecast I'm after, similar to this example: https://stats.stackexchange.com/questions/115506/forecasting-a-seasonal-time-series-in-r
Please see the R code:
Sorting the claim dates and counting them
library(forecast)
library(ggplot2)
library(xts)
library(reshape2)
library(zoo)
library(lubridate)
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date(data$DISABILITYDATE, "%m/%d/%Y")
data
str(data)
as.Date(data[,1])
xts(x=data[,-1], order.by = data[,1])
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date (data$DISABILITYDATE, "%m/%d/%Y")
df <- xts(rep(1,length(data$DISABILITYDATE)),order.by=data$DISABILITYDATE)
df1 <- apply.monthly(df,function(x) length(x))
df1
t(df1)
str(df1)
df2 <- data.frame(df1=c("Jan 2012","Feb 2012","Mar 2012","Apr 2012","May 2012","Jun 2012","Jul 2012","Aug 2012","Sep 2012","Oct 2012","Nov 2012","Dec 2012","Jan 2013","Feb 2013","Mar 2013","Apr 2013","May 2013","Jun 2013","Jul 2013","Aug 2013","Sep 2013","Oct 2013","Nov 2013","Dec 2013","Jan 2014","Feb 2014","Mar 2014","Apr 2014","May 2014","Jun 2014","Jul 2014","Aug 2014","Sep 2014","Oct 2014","Nov 2014","Dec 2014","Jan 2015","Feb 2015","Mar 2015","Apr 2015","May 2015","Jun 2015","Jul 2015","Aug 2015","Sep 2015","Oct 2015","Nov 2015","Dec 2015","Jan 2016","Feb 2016","Mar 2016","Apr 2016","May 2016","Jun 2016","Jul 2016","Aug 2016","Sep 2016","Oct 2016","Nov 2016","Dec 2016"),score=c(df1))
df2
t(df2)
df2[-1]
2.1 Forecasting with ETS (Exponential Smoothing)
library(forecast)
x.ts <- as.ts(df2[2])
x.ts
x.ets <- ets(x.ts)
x.ets
x.fore <- forecast(x.ets$fitted, h=12)
x.fore
x <- ts(df2[2], start = 2012, frequency = 12)
plot(forecast(ets(x), 24))
x
plot(forecast(x, h=12))
date1 <- ymd("2012-01-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01")
abline(v=decimal_date(date1), col="blue")
2.2 Forecasting with ARIMA
ARIMAfit = auto.arima(x, approximation=FALSE,trace=FALSE)
summary(ARIMAfit)
plot(ARIMAfit)
pred = predict(ARIMAfit, n.ahead = 48)
round(as.numeric(pred$fitted,0))
pred
library(TSPred)
plotarimapred(pred$pred,x, xlim=c(2012, 2020), range.percent = 0.05)
My output is this:
example of desired output
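For what it's worth, I suspect the flat forecast happens because as.ts(df2[2]) carries no monthly frequency, so ets() cannot fit a seasonal model, and forecast(x.ets$fitted, h=12) refits a fresh model to the fitted values instead of forecasting from the fitted model itself. Below is a minimal sketch of the pattern I believe is intended, reusing the monthly series x already built above (this is an assumption about the cause, not a verified fix):
library(forecast)
x <- ts(df2[, 2], start = c(2012, 1), frequency = 12)  # monthly counts with frequency 12
fit.ets <- ets(x)
plot(forecast(fit.ets, h = 12))                        # forecast the model, not its fitted values
fit.arima <- auto.arima(x)
plot(forecast(fit.arima, h = 12))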
This is the second time I'm posting about the same dataset. This time, I'm having trouble using the ceiling_date function from the lubridate package.
Here's a sample of my dataset:
> head(dataraw)
Time ACTIVITY_X ACTIVITY_Y ACTIVITY_Z
1: 6/19/18 10:40:00 60 74 95
2: 6/19/18 10:41:20 62 63 88
3: 6/19/18 10:42:40 60 56 82
4: 6/19/18 10:44:00 66 61 90
5: 6/19/18 10:45:20 60 53 80
6: 6/19/18 10:46:40 57 40 70
7: 6/19/18 10:48:00 54 41 68
8: 6/19/18 10:49:20 52 49 71
9: 6/19/18 10:50:40 61 49 78
10: 6/19/18 10:52:00 93 32 98
11: 6/19/18 10:53:20 80 54 97
12: 6/19/18 10:54:40 73 39 83
13: 6/19/18 10:56:00 47 37 60
14: 6/19/18 10:57:20 51 55 75
15: 6/19/18 10:58:40 51 60 79
16: 6/19/18 11:00:00 14 13 19
17: 6/19/18 11:01:20 0 0 0
18: 6/19/18 11:02:40 13 3 13
19: 6/19/18 11:04:00 20 10 22
20: 6/19/18 11:05:20 13 6 14
And this is how I would like to transform my data:
Time x y z
1: 2018-06-19 10:40:00 60 74 95
2: 2018-06-19 10:44:00 188 180 260
3: 2018-06-19 10:48:00 171 134 218
4: 2018-06-19 10:52:00 206 130 247
5: 2018-06-19 10:56:00 200 130 240
6: 2018-06-19 11:00:00 116 128 173
7: 2018-06-19 11:04:00 33 13 35
8: 2018-06-19 11:08:00 13 6 14
Here the time step is 240 seconds (4 minutes) instead of the 80 seconds (1:20 minutes) in dataraw, and the values of ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z are summed over each 4-minute interval.
Below is the code I've been using. It works for the sample I've posted, but when I run it on the full dataraw dataset I get the warning messages and errors shown below:
> sampleinput<-na.omit(dataraw)
> names(sampleinput)[1]<-"Time"
> sampleinput$Time <- as.numeric(as.character(sampleinput$Time))
Warning message:
NAs introduced by coercion
> X <- data.table(sampleinput)
> X$tgroup <- lubridate::ceiling_date(X$Time, '4 mins')
Error in UseMethod("reclass_date", orig) :
no applicable method for 'reclass_date' applied to an object of class "c('double', 'numeric')"
> X[, list( x = sum(ACTIVITY_X),
+ y = sum(ACTIVITY_Y),
+ z =sum(ACTIVITY_Z) ), by = list (tgroup)]
Error in eval(bysub, x, parent.frame()) : object 'tgroup' not found
Is this a syntax or coding error? If it helps, the full dataraw dataset is available here, as it's too big to post as dput() output.
Any help is appreciated!
You were almost there; the date parsing failed. See the solution below:
library(lubridate)
library(dplyr)
data <- read.table("41361_sensor_converted.txt", sep="\t", header=TRUE)
data %>%
  mutate(
    Time = mdy_hms(Time),
    TimeGroup = ceiling_date(Time, '4 mins')
  ) %>%
  group_by(TimeGroup) %>%
  summarise(
    x = sum(ACTIVITY_X),
    y = sum(ACTIVITY_Y),
    z = sum(ACTIVITY_Z)
  )
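If you would rather keep your original data.table approach, the key change is the same: parse Time with lubridate::mdy_hms() instead of coercing it to numeric, so that ceiling_date() receives a POSIXct column. A minimal sketch under that assumption, using the column names from your post:
library(data.table)
library(lubridate)
X <- data.table(na.omit(dataraw))
setnames(X, 1, "Time")
X[, Time := mdy_hms(Time)]                      # parse the timestamp instead of as.numeric()
X[, tgroup := ceiling_date(Time, '4 mins')]
X[, .(x = sum(ACTIVITY_X), y = sum(ACTIVITY_Y), z = sum(ACTIVITY_Z)), by = tgroup]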
I have an R time series at the weekly level, starting at Jan 7, 2013 and ending at May 23, 2016.
I created the time series using the following code:
start_date <- min(Orders_All_weekly$Week_Start)
Orders_Weekly.ts <- ts(Orders_All_weekly$Sales, start = decimal_date(ymd(start_date)), freq = (365.25/7))
Orders_Weekly.stl <- stl(Orders_Weekly.ts, s.window = 'periodic')
I am attempting to fit a Holt-Winters model to these data, and I am receiving the error
Orders_Weekly.hw <- HoltWinters(Orders_Weekly.stl)
Error in decompose(ts(x[1L:wind], start = start(x), frequency = f), seasonal) :
time series has no or less than 2 periods
I have seen several people post this error and the common response has been that the data did not, in fact, have at least two periods...which is necessary for this procedure. Unless I'm completely misunderstanding the meaning of this (which is possible) I have more than the required two periods. My data are at the weekly level, and I have 3+ years of observations.
Am I setting this up incorrectly? Or is the error essentially saying there is no seasonality?
ID Week_Start Sales
1 2013-04-08 932662.9
3 2013-05-13 1021574.4
4 2013-05-20 913812.9
5 2013-06-17 1086239.1
7 2013-08-26 762829.3
8 2013-11-18 1085033.0
9 2013-12-02 897158.4
10 2013-12-09 776733.7
11 2013-12-23 867362.8
12 2013-02-04 666362.0
13 2013-02-18 748603.2
15 2013-07-22 1005986.7
16 2013-09-02 896582.8
17 2013-10-28 868364.8
18 2014-01-06 814648.7
19 2014-02-10 847342.4
20 2014-02-17 869374.3
21 2014-03-17 827677.6
22 2014-03-24 897462.3
23 2014-03-31 850542.4
24 2014-04-21 1139619.4
25 2014-07-28 889043.3
26 2014-08-04 1097560.6
27 2014-09-08 1029379.4
28 2014-10-13 998094.8
29 2014-11-10 1238445.9
30 2014-12-15 1204006.6
31 2014-07-14 1106800.6
32 2014-09-01 730030.8
33 2014-10-06 1085331.8
34 2014-05-05 1072926.8
35 2014-05-19 863283.7
36 2015-01-19 1095186.1
37 2015-02-02 866258.2
38 2015-02-16 1006247.0
39 2015-03-23 1214339.7
40 2015-04-20 1181482.9
41 2015-05-18 1112542.4
42 2015-06-01 1188714.7
43 2015-07-20 1216050.4
45 2015-08-17 848302.8
46 2015-08-24 1081198.9
47 2015-09-14 916539.8
48 2015-09-28 957177.8
49 2015-10-26 964467.1
50 2015-11-02 1063949.1
51 2015-01-12 879343.9
53 2015-03-09 1245047.9
55 2015-11-16 913514.4
56 2015-02-09 1108247.6
57 2015-12-28 1014929.2
58 2016-01-25 946786.3
59 2016-02-01 891230.8
60 2016-02-29 1274039.8
61 2016-03-07 847501.8
62 2016-04-04 1057844.1
64 2016-04-11 1207347.4
65 2016-04-18 1159690.4
66 2016-05-02 1394727.6
67 2016-05-23 1044129.3
68 2013-03-04 1040017.1
69 2013-03-11 984574.2
70 2013-04-15 1054174.1
72 2013-04-29 952720.1
73 2013-05-06 1000977.1
74 2013-06-03 1091743.6
75 2013-07-01 955164.8
76 2013-08-12 808803.7
77 2013-09-23 960096.4
78 2013-09-30 814014.4
79 2013-10-14 743264.9
81 2013-01-28 956396.4
84 2013-10-21 959058.5
85 2013-11-11 915108.6
90 2013-01-14 867140.6
91 2014-01-27 910063.7
92 2014-03-10 963144.2
93 2014-04-07 975789.6
95 2014-04-28 1030313.7
97 2014-05-26 1139089.3
99 2014-06-09 1077980.6
100 2014-06-30 1019326.6
101 2014-09-15 666787.6
103 2014-11-03 1059089.4
105 2014-11-24 705428.6
106 2014-12-22 889368.8
108 2014-06-23 1046989.4
110 2015-02-23 1327066.4
112 2015-04-13 1110673.9
115 2015-06-08 1177799.1
116 2015-07-06 1314697.7
118 2015-07-27 1094805.6
119 2015-08-03 882394.2
120 2015-09-21 1159233.2
121 2015-10-19 1171636.9
122 2015-11-23 1036050.9
125 2015-12-21 984050.8
128 2016-01-04 1371348.3
129 2016-01-11 1086225.4
131 2016-02-22 1077692.4
137 2013-03-18 854699.1
141 2013-05-27 1011870.1
142 2013-08-05 893878.4
143 2013-12-16 801215.2
148 2013-10-07 805962.8
150 2013-11-04 801729.8
152 2013-08-19 726361.0
155 2014-02-24 979288.7
158 2014-04-14 1006729.5
161 2014-07-07 1102600.4
162 2014-08-11 979494.5
164 2014-10-20 901047.1
166 2014-10-27 1260062.0
169 2014-12-29 1022656.2
171 2014-08-18 976136.5
175 2015-03-02 897352.6
177 2015-03-30 1059103.8
178 2015-05-11 1033694.4
179 2015-06-29 1037959.4
182 2015-09-07 1230050.6
183 2015-10-12 975898.2
185 2015-12-07 1057603.4
186 2015-12-14 953718.2
189 2015-04-06 1233091.9
190 2015-04-27 1176994.2
192 2015-01-26 1256182.6
196 2016-01-18 955919.5
197 2016-02-15 954623.5
198 2016-03-14 740724.2
199 2013-01-07 924205.2
201 2013-02-11 672150.0
202 2013-03-25 769391.5
205 2013-06-10 870971.1
206 2013-06-24 1043166.2
208 2013-07-15 1106379.4
210 2013-09-09 916382.0
215 2013-04-22 934307.5
217 2013-12-30 974004.0
219 2014-01-13 972211.2
220 2014-01-20 952294.8
221 2014-02-03 946820.6
225 2014-06-02 1182837.6
228 2014-08-25 912550.8
234 2014-03-03 1013797.0
245 2015-06-15 946565.2
246 2015-07-13 1139633.6
248 2015-08-10 1080701.8
249 2015-08-31 1052796.2
253 2015-11-30 980493.4
259 2016-03-28 1105384.2
264 2016-02-08 897832.2
267 2013-02-25 766646.8
269 2013-04-01 954419.8
281 2013-11-25 852430.6
286 2013-09-16 997656.1
290 2014-07-21 1171519.8
294 2014-09-29 804772.4
298 2014-12-01 813872.0
299 2014-12-08 1005479.1
304 2014-06-16 981782.5
312 2015-03-16 1009182.7
315 2015-05-25 1166947.6
329 2015-01-05 903062.3
337 2016-03-21 1299648.7
338 2016-04-25 1132090.1
341 2013-01-21 818799.7
364 2014-05-12 1035870.7
367 2014-09-22 1234683.8
381 2015-06-22 990619.5
383 2015-10-05 1175100.6
385 2015-11-09 1095345.9
395 2016-05-16 1121192.5
399 2016-05-09 1175343.4
407 2013-07-08 1035513.8
430 2014-11-17 1024473.3
443 2015-05-04 1063411.6
476 2013-07-29 809045.3
I'm not sure if this completely answers the question, but I was able to get a result out of your data with the slightly modified code below. One point: I first sorted the data by date, assuming that was part of your intent. Hope this helps!
Orders_Sorted <- Orders_Weekly[order(Orders_Weekly$Week_Start),] # Sort by date (unless you want to keep the data out of date order for some reason)
Orders_Weekly.ts <- ts(Orders_Sorted$Sales, frequency = (365.25/7)) # Convert df to time series
Orders_Weekly.hw <- HoltWinters(x=Orders_Weekly.ts, beta = FALSE, gamma = FALSE) # Run HW with beta/gamma disabled, i.e. simple exponential smoothing
plot(Orders_Weekly.hw) # Show plot of HW output
This produces the plot below.
Plot of Holt-Winters exponential smoothing of data
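If you do want the seasonal (gamma) component rather than simple exponential smoothing, one option is to make the series regular first: HoltWinters() needs a whole number of observations per period and at least two complete periods, and the posted data has missing weeks. A hedged sketch, assuming Week_Start is already a Date column and using zoo::na.approx() purely to fill the gaps for illustration:
library(zoo)
all_weeks <- data.frame(Week_Start = seq(min(Orders_Sorted$Week_Start),
                                         max(Orders_Sorted$Week_Start), by = "week"))
filled <- merge(all_weeks, Orders_Sorted[, c("Week_Start", "Sales")], all.x = TRUE)
filled$Sales <- na.approx(filled$Sales)               # interpolate the missing weeks (illustrative choice)
Orders_Weekly.ts <- ts(filled$Sales, frequency = 52)  # whole number of weeks per year
Orders_Weekly.hw <- HoltWinters(Orders_Weekly.ts)     # level + trend + seasonal
plot(Orders_Weekly.hw)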
I have encountered the same error, except that I have been computing a moving average of my initial data.
Unfortunately, the decompose() function that HoltWinters() uses returns that error message if anything goes wrong, not just when there aren't enough periods. Look more closely at the data you're passing HoltWinters(), even if your initial data looks fine.
In your particular case, what you are passing in (Orders_Weekly.stl) is not a plain ts object: it is an stl decomposition with seasonal, trend, remainder, and weights components. I'm not very familiar with stl(), but when I try HoltWinters() on its $time.series component instead, it works just fine.
In my case, the moving average of my initial data introduced a bunch of NAs at the beginning of my time-series. After removing those, HoltWinters() worked.
The trick is to have at least two periods in your time series, and the series needs to be complete; by default, two full periods are required.
https://r.789695.n4.nabble.com/time-series-has-no-or-less-than-2-periods-td4677519.html
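A quick sanity check of that requirement, using the object names from the question (this only diagnoses the problem, it does not fix it):
frequency(Orders_Weekly.ts)                             # observations per period (about 52.18 here)
length(Orders_Weekly.ts) / frequency(Orders_Weekly.ts)  # how many full periods the series spans
sum(is.na(Orders_Weekly.ts))                            # NAs in the series will also make HoltWinters() fail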
I have a sample xts object with the some data:
dates <- seq.Date(from = as.Date("2010-01-01", format = "%Y-%m-%d"),
to = as.Date("2013-12-01", format = "%Y-%m-%d"), by = "month")
sample_data <- cbind(1:length(dates),length(dates):1)
xts_object <- xts(x = sample_data, order.by = dates)
I then use apply.yearly on it with the function cumsum:
apply.yearly(x = xts_object, FUN = cumsum)
The output is a transposed matrix, which is not what I intended it to return.
I would expect the snippet above to return the same output as:
rbind(apply(X = xts_object[1:12],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[13:24],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[25:36],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[37:48],MARGIN = 2,FUN = cumsum))
The problem with using apply is that it returns a matrix and not an xts object. While I could solve this by using as.xts, I would like to know whether there is something I am missing, or whether I am using apply.yearly incorrectly. Using plain apply seems more prone to hard-to-catch errors and bugs.
This might not be the most elegant solution, but it works:
# Split xts_object by year
xts_list = split(xts_object, "years")
# cumsum for each year
cumsum_list = lapply(xts_list, FUN = cumsum)
# rbind them together
do.call(rbind, cumsum_list)
# [,1] [,2]
# 2010-01-01 1 48
# 2010-02-01 3 95
# 2010-03-01 6 141
# 2010-04-01 10 186
# 2010-05-01 15 230
# 2010-06-01 21 273
# 2010-07-01 28 315
# 2010-08-01 36 356
# 2010-09-01 45 396
# 2010-10-01 55 435
# 2010-11-01 66 473
# 2010-12-01 78 510
# 2011-01-01 13 36
# 2011-02-01 27 71
# 2011-03-01 42 105
# 2011-04-01 58 138
# 2011-05-01 75 170
# 2011-06-01 93 201
# 2011-07-01 112 231
# 2011-08-01 132 260
# 2011-09-01 153 288
# 2011-10-01 175 315
# 2011-11-01 198 341
# 2011-12-01 222 366
# 2012-01-01 25 24
# 2012-02-01 51 47
# 2012-03-01 78 69
# 2012-04-01 106 90
# 2012-05-01 135 110
# 2012-06-01 165 129
# 2012-07-01 196 147
# 2012-08-01 228 164
# 2012-09-01 261 180
# 2012-10-01 295 195
# 2012-11-01 330 209
# 2012-12-01 366 222
# 2013-01-01 37 12
# 2013-02-01 75 23
# 2013-03-01 114 33
# 2013-04-01 154 42
# 2013-05-01 195 50
# 2013-06-01 237 57
# 2013-07-01 280 63
# 2013-08-01 324 68
# 2013-09-01 369 72
# 2013-10-01 415 75
# 2013-11-01 462 77
# 2013-12-01 510 78
class(do.call(rbind, cumsum_list))
# [1] "xts" "zoo"
The resulting object is still of class "xts".
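For what it's worth, the transposed-looking output seems to come from the fact that apply.yearly() (a thin wrapper around period.apply()) expects FUN to return a single value or row per period; when cumsum() returns a whole matrix for the year, the result is flattened into one column per year. If you want to avoid the split/rbind round trip, a hedged alternative is to compute the yearly cumulative sums column by column and rebuild the xts yourself:
yr <- format(index(xts_object), "%Y")                     # grouping key: calendar year
cums <- apply(coredata(xts_object), 2, function(col) ave(col, yr, FUN = cumsum))
xts(cums, order.by = index(xts_object))                   # same values as the rbind result above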
I'm having trouble calculating the average time interval (in days) between appearances of the same value in another column.
My data looks like this:
dt subject_id
2016-09-13 77
2016-11-07 1791
2016-09-18 1332
2016-08-31 84
2016-08-23 89
2016-08-23 41
2016-09-15 41
2016-10-12 93
2016-10-05 93
2016-11-09 94
2016-10-25 94
2016-11-03 94
2016-10-09 375
2016-10-14 11
2016-09-27 11
2016-09-13 11
2016-08-23 11
2016-08-27 11
And I want to get something like this:
subject_id mean_day
41 23
93 7
94 7.5
11 13
I tried to use:
aggregate(dt~subject_id, data, mean)
But it can't calculate mean from Date values. Any ideas?
My first approach would be something like this:
df$dt <- as.Date(df$dt)
library(dplyr)
df %>%
group_by(subject_id) %>%
summarise((max(dt) - min(dt))/(n()-1))
# <int> <time>
#1 11 13.0 days
#2 41 23.0 days
#3 77 NaN days
#4 84 NaN days
#5 89 NaN days
#6 93 7.0 days
#7 94 7.5 days
#8 375 NaN days
#9 1332 NaN days
#10 1791 NaN days
I think this is a starting point for you; you can modify it as you want.
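Since you started from aggregate(), here is a roughly equivalent base-R version, assuming df$dt has already been converted with as.Date() as above. The trick is to have the aggregation function return a plain number rather than asking for the mean of the dates themselves:
aggregate(dt ~ subject_id, data = df,
          FUN = function(d) as.numeric(max(d) - min(d)) / (length(d) - 1))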
Hi, I have an R data frame that looks like the following:
SURVEY.DATE A B C
1898 2010-05-13 38 34 21
1899 2010-05-13 38 33 21
1897 2010-05-14 37 34 21
1895 2010-05-21 38 29 21
1896 2010-05-21 39 32 21
1894 2010-05-23 39 32 21
I would like to average the rows with the same date so as to have only one observation per day. Ideally I would like to end up with an xts object that would look like:
SURVEY.DATE A B C
1898 2010-05-13 38 33.5 21
1897 2010-05-14 37 34 21
1896 2010-05-21 38.5 30.5 21
1894 2010-05-23 39 32 21
This seems to be a challenge for my newbie R skills... any help or pointers would be appreciated.
You could try
library(dplyr)
res <- df1 %>%
group_by(SURVEY.DATE) %>%
summarise_each(funs(mean))
Or
res1 <- aggregate(.~SURVEY.DATE, df1, mean)
and then convert it to xts
library(xts)
xts(res1[-1], order.by= as.Date(res1[,1]))
# A B C
#2010-05-13 38.0 33.5 21
#2010-05-14 37.0 34.0 21
#2010-05-21 38.5 30.5 21
#2010-05-23 39.0 32.0 21
Here's how I'd do this using data.table.
require(data.table)
setDT(df)[, lapply(.SD, mean), by=SURVEY.DATE]
# SURVEY.DATE A B C
# 1: 2010-05-13 38.0 33.5 21
# 2: 2010-05-14 37.0 34.0 21
# 3: 2010-05-21 38.5 30.5 21
# 4: 2010-05-23 39.0 32.0 21
Check the new HTML vignettes if you'd like to learn more.
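Since the original goal was an xts object, the data.table result can be converted the same way as the aggregate() result above. A small sketch, where res_dt is just a hypothetical name for the data.table output and A, B, C are the column names from the question:
library(xts)
res_dt <- setDT(df)[, lapply(.SD, mean), by = SURVEY.DATE]          # the data.table result from above
xts(as.matrix(res_dt[, .(A, B, C)]), order.by = as.Date(res_dt$SURVEY.DATE))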