I have been given a dataset for 118 days, and I'm supposed to forecast the values for the next 28 days. I've tried the code below, but I'm getting the same value for all 28 days. Can you help me find my mistake? Thank you.
library(forecast)
library(dplyr)
head(product)
ts_product = ts(product$Qty, start=1,frequency=1)
ts_product
plot(ts_product)
#predictions of 28 days
m_ets = ets(ts_product)
f_ets = forecast(m_ets, h=28)
plot(f_ets)
The data for Qty is given by:
Qty = c(53, 40, 37, 45, 69, 105, 62, 101, 104, 46, 92, 157, 133, 173,
139, 163, 145, 154, 245, 147, 85, 131, 228, 192, 240, 346, 267,
267, 243, 233, 233, 244, 241, 136, 309, 236, 310, 266, 280, 321,
349, 335, 410, 226, 391, 314, 250, 368, 282, 203, 250, 233, 233,
277, 338, 279, 279, 266, 253, 178, 238, 126, 279, 258, 350, 277,
226, 287, 180, 268, 191, 279, 214, 133, 292, 212, 307, 232, 165,
107, 121, 188, 198, 154, 128, 85, 106, 67, 63, 88, 107, 56, 41,
59, 27, 58, 80, 75, 93, 54, 14, 36, 107, 82, 83, 112, 37, 57,
9, 51, 47, 57, 68, 97, 25, 45, 69, 89)
This is the prediction I get.
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
119 69.53429 2.089823 136.9788 -33.61312 172.6817
120 69.53429 -2.569107 141.6377 -40.73834 179.8069
121 69.53429 -6.944751 146.0133 -47.43031 186.4989
122 69.53429 -11.083248 150.1518 -53.75959 192.8282
123 69.53429 -15.019428 154.0880 -59.77946 198.8480
124 69.53429 -18.780346 157.8489 -65.53129 204.5999
125 69.53429 -22.387517 161.4561 -71.04798 210.1166
126 69.53429 -25.858385 164.9270 -76.35622 215.4248
127 69.53429 -29.207323 168.2759 -81.47798 220.5466
128 69.53429 -32.446345 171.5149 -86.43163 225.5002
129 69.53429 -35.585612 174.6542 -91.23273 230.3013
130 69.53429 -38.633808 177.7024 -95.89454 234.9631
131 69.53429 -41.598429 180.6670 -100.42854 239.4971
132 69.53429 -44.485993 183.5546 -104.84468 243.9133
133 69.53429 -47.302214 186.3708 -109.15172 248.2203
134 69.53429 -50.052133 189.1207 -113.35736 252.4259
135 69.53429 -52.740222 191.8088 -117.46844 256.5370
136 69.53429 -55.370474 194.4391 -121.49106 260.5596
137 69.53429 -57.946468 197.0150 -125.43070 264.4993
138 69.53429 -60.471431 199.5400 -129.29230 268.3609
139 69.53429 -62.948280 202.0169 -133.08032 272.1489
140 69.53429 -65.379664 204.4482 -136.79880 275.8674
141 69.53429 -67.768000 206.8366 -140.45144 279.5200
142 69.53429 -70.115495 209.1841 -144.04163 283.1102
143 69.53429 -72.424177 211.4928 -147.57245 286.6410
144 69.53429 -74.695908 213.7645 -151.04676 290.1153
145 69.53429 -76.932409 216.0010 -154.46719 293.5358
146 69.53429 -79.135268 218.2038 -157.83618 296.9048
Also, do you think any model other than ets, which we have used here, would work for this problem?
Understanding ets()
The ets() function fits exponential smoothing state space models. By default, ets() will attempt to automatically select a model for the time series via model = 'ZZZ', using the supplied frequency= parameter. This is where things go wrong: an incorrectly specified frequency= causes a non-ideal model to be generated with respect to seasonality, yielding the flat forecasts you see.
Seasonalities
You may think that one should specify frequency=1 within a ts() object for daily data. However, that is not the correct way to go about it. The right way to specify frequency= is to understand R's "unique" definition:
The frequency is the number of observations per season.
Thus, we need to care about the seasonality of your data.
There are two guiding tables to consult.
The first is a macro view:
Data Frequency
Annual 1
Quarterly 4
Monthly 12
Weekly 52
The second is a micro view:
Data         Minute  Hour   Day     Week     Year
Daily                               7        365.25
Hourly                      24      168      8766
Half-hourly                 48      336      17532
Minutes              60     1440    10080    525960
Seconds      60      3600   86400   604800   31557600
There are two seasonalities (i.e. frequency= options) to consider with daily data:
7 (weekly) and 365.25 (annual)
For more information see: Seasonal periods
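If you want to declare both the weekly and the annual period at once, the forecast package provides msts() together with tbats(), which can accommodate multiple seasonal periods. A minimal sketch (not part of the original code; with only 118 days the annual period cannot really be estimated, so this is mainly useful once you have more data):
library(forecast)
# declare both candidate seasonal periods on the same series
ts_multi <- msts(product$Qty, seasonal.periods = c(7, 365.25))
m_tbats  <- tbats(ts_multi)   # tbats() can handle multiple seasonal periods
plot(forecast(m_tbats, h = 28))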
Revisiting the estimation
The reason why ets() is not working appropriately is the seasonality used (i.e. frequency = 1). By changing it based on the above, we get:
# Changed the frequency to 7
ts_product = ts(product$Qty, start=1, frequency=7)
# Predictions of 28 days
m_ets <- ets(ts_product)
f_ets <- forecast(m_ets, h = 28)
plot(f_ets)
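To confirm that the refit actually picked up a seasonal component, you can inspect the selected model; a quick check (the exact model string will depend on your data):
m_ets$method    # should now report a seasonal ETS model, e.g. "ETS(M,N,M)"
summary(m_ets)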
Alternative models
There are two other models worth looking into briefly: HoltWinters() and auto.arima(). Discussion of the former is available here: HoltWinters vs. ets
hw = HoltWinters(ts_product)
f_hw = predict(hw, n.ahead = 28, prediction.interval = T, level = 0.95)
plot(hw, f_hw)
The ARIMA generated by running auto.arima():
aa = auto.arima(ts_product)
f_aa = forecast(aa, h = 28)
plot(f_aa)
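If you want a more formal comparison of the candidate models, forecast::tsCV() computes rolling-origin cross-validation errors. A minimal sketch, assuming a reasonably recent version of the forecast package and the weekly-frequency ts_product from above (the wrapper functions are mine, not from the original answer):
f_ets_fun   <- function(x, h) forecast(ets(x), h = h)
f_arima_fun <- function(x, h) forecast(auto.arima(x), h = h)
# rolling-origin cross-validation errors for 7-step-ahead forecasts,
# keeping the first 60 observations as the initial training set
e_ets   <- tsCV(ts_product, f_ets_fun,   h = 7, initial = 60)
e_arima <- tsCV(ts_product, f_arima_fun, h = 7, initial = 60)
# compare pooled root mean squared errors (lower is better)
sqrt(mean(e_ets^2,   na.rm = TRUE))
sqrt(mean(e_arima^2, na.rm = TRUE))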
Misc data note
Briefly looking at your data under:
ts_product = ts(product$Qty, start=1, frequency=1)
plot(ts_product)
Note that there is a relatively large disturbance between times 18 and 85 that would lead the series to be considered non-stationary. You may wish to first try differencing it out via diff() and then repeat the above.
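A quick way to check how much differencing is suggested and what the differenced series looks like (a sketch, assuming the forecast package is still loaded):
ndiffs(ts_product)       # estimated number of first differences needed
plot(diff(ts_product))   # the differenced series should look more stable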
In addition, you may want to try to obtain a full year's worth of data instead of only 118 days.
Take a look at ?arima. For example:
mar = arima(product$Qty, order = c(1, 0, 1))
f_ar = forecast(mar, h = 28)
plot(f_ar)
Your data appears to have seasonality; try to use that information in the ets or arima models.
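For example, a seasonal ARIMA on the weekly-frequency series; the orders below are placeholders chosen only for illustration, and in practice you would let auto.arima() search for them:
ts_w <- ts(product$Qty, frequency = 7)
m_sarima <- arima(ts_w, order = c(1, 0, 1), seasonal = list(order = c(1, 0, 0), period = 7))
f_sarima <- forecast(m_sarima, h = 28)
plot(f_sarima)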
Related
Given a simple vector of 82 observations
x = c(102, 104, 89, 89, 76, 95, 88, 112, 81, 101, 101, 104, 94, 111, 108, 104, 93, 92, 86, 113, 93, 100, 92, 80, 92, 126, 102, 109, 104, 95, 84, 81, 103, 83, 103, 83, 58, 109, 89, 93, 104, 104, 123, 104, 93, 76, 103, 103, 100, 105, 108, 90, 122, 103, 114, 102, 87, 98, 88, 107, 102, 80, 81, 96, 107, 105, 113, 98, 93, 104, 94, 107, 107, 97, 102, 82, 90, 97, 124, 109, 96, 92)
I would like to perform EMA (Exponential Moving Average) of this vector in such a way:
The 1st element of the new vector should be NA
The 2nd element should be the first element of the original vector
The 3rd element should be the EMA of the first and second elements of the original vector
The 4th element should be the EMA of the first three elements of the original vector
...
The 82nd element should be the EMA of all the values of the original vector except for the last one
The idea is to place greater weight on the most recent vector elements, and also to have the last elements of the new vector be affected (although infinitesimally) by the first elements of the original vector.
I tried achieving this using the function EMA from the package TTR and lag from dplyr
> library(dplyr)
> library(TTR)
> lag(EMA(x, 1, ratio = 2/(81+1)))
[1] NA 102.00000 102.04878 101.73052 101.42002 100.80002 100.65855 100.34981 100.63396 100.15508 100.17569
[12] 100.19579 100.28858 100.13520 100.40020 100.58556 100.66884 100.48179 100.27492 99.92675 100.24561 100.06889
[23] 100.06721 99.87045 99.38580 99.20566 99.85918 99.91139 100.13307 100.22738 100.09989 99.70721 99.25093
[34] 99.34237 98.94378 99.04271 98.65143 97.65993 97.93651 97.71855 97.60346 97.75948 97.91168 98.52360
[45] 98.65717 98.51919 97.96994 98.09262 98.21231 98.25592 98.42041 98.65405 98.44298 99.01754 99.11468
[56] 99.47773 99.53925 99.23342 99.20333 98.93008 99.12691 99.19698 98.72876 98.29635 98.24035 98.45400
[67] 98.61365 98.96454 98.94102 98.79611 98.92304 98.80296 99.00289 99.19794 99.14433 99.21398 98.79413
[78] 98.57964 98.54111 99.16206 99.40201 99.31903
But that's definitely not the result I was looking for... what am I doing wrong?
I wasn't able to find any comprehensive documentation about the ratio argument online, and I'm not sure I've understood it correctly.
Can anyone please help me?
To make things clearer:
the result I have reached so far is the following:
> library(runner)
> mean_run(x, k = 7, lag = 1)
[1] NA 102.00000 103.00000 98.33333 96.00000 92.00000 92.50000 91.85714 93.28571 90.00000 91.71429
[12] 93.42857 97.42857 97.28571 100.57143 100.00000 103.28571 102.14286 100.85714 98.28571 101.00000 98.42857
[23] 97.28571 95.57143 93.71429 93.71429 99.42857 97.85714 100.14286 100.71429 101.14286 101.71429 100.14286
[34] 96.85714 94.14286 93.28571 90.28571 85.00000 88.57143 89.71429 88.28571 91.28571 91.42857 97.14286
[45] 103.71429 101.42857 99.57143 101.00000 100.85714 100.28571 97.71429 98.28571 97.85714 104.42857 104.42857
[56] 106.00000 106.28571 103.71429 102.28571 102.00000 99.85714 99.71429 94.85714 91.85714 93.14286 94.42857
[67] 96.85714 97.71429 97.14286 99.00000 102.28571 102.00000 102.00000 102.28571 100.00000 100.57143 99.00000
[78] 97.00000 97.42857 99.85714 100.14286 100.00000
So this is the Simple Moving Average (SMA) over k=7 observations I obtained with the mean_run function from the runner package.
Now I would like to "improve" this moving average by placing exponentially increasing weights on each observation and making sure that the last element is also affected by the first one (the weight for that observation should be as close as possible to 0). That means that the window sizes for the rolling average will be:
n=0 for the 1st element (i.e. NA)
n=1 for the 2nd element (i.e. the 1st element of the original vector)
n=2 for the 3rd element (i.e. the EMA of the 1st and 2nd elements)
n=3 for the 4th element (i.e. the EMA of the 1st, 2nd and 3rd elements)
...
n=81 for the 82nd element (i.e. the EMA of the first 81 elements)
I still wasn't able to find any good documentation about the ratio argument (i.e. alpha). I think it can be set arbitrarily, but I'm not sure about that.
Assuming that you intended what you wrote, i.e. a lagged exponential moving average and not the lagged weighted moving average defined in a comment, we define the iteration in iter and then use Reduce like this.
alfa <- 2/(81+1)   # smoothing factor alpha
iter <- function(y, x) alfa * x + (1-alfa) * y
# run the recursion, keep all intermediate values, then lag the result by one
ema <- c(NA, head(Reduce(iter, tail(x, -1), init = x[1], accumulate = TRUE), -1))
# check
identical(ema[1], NA_real_)
## [1] TRUE
identical(ema[2], x[1])
## [1] TRUE
identical(ema[3], alfa * x[2] + (1-alfa) * x[1])
## [1] TRUE
identical(ema[4], alfa * x[3] + (1-alfa) * ema[3])
## [1] TRUE
The lagged weighted moving average in the comment is not an exponential moving average and is unlikely to be what you want, but just to show how to implement it: if the second argument to rollapply is a list containing a vector, then that vector is regarded as the offsets to use.
library(zoo)
c(NA, x[1], rollapplyr(x, list(-seq(2)), weighted.mean, c(alfa, 1-alfa)))
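The same recursion can also be written with stats::filter(), which tends to be faster on long vectors. A sketch under the same definitions as above (alfa and x as before; stats:: is spelled out because dplyr masks filter):
# recursive filter: y[i] = alfa * x[i] + (1 - alfa) * y[i-1], started at x[1]
ema2 <- c(NA, head(stats::filter(alfa * x, 1 - alfa, method = "recursive", init = x[1]), -1))
# ema2 should agree with the Reduce result above (up to floating point)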
I am attempting an analysis that requires the extraction of some (2 or 3) consecutive values on which to perform further analysis later.
I have two vectors: a is the output from a machine of consecutive cellular signals; b is the same output, but shifted by 1. This setup is used to understand the variability between one signal and the next.
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
What I am trying to do is to identify the most homogeneous (stable) region (i.e. one value is similar to the following) in this set of data.
The idea I had was to perform a subtraction between a and b and consider the absolute value:
c <- abs(a-b)
which gives
c
[1] 20 5 45 2 8 2 7 25 30 20 10 50 1 51
Now, if I want the 3 closest consecutive points, I can clearly see that the sequence 2 8 2 is by far the one that I would consider, but I have no idea on how I can automatically extract these 3 values, especially from arrays of hundreds of data points.
Initial data:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
Find absolute difference between two vectors:
res <- abs(a - b)
For each element in res, get its neighbors and calculate the sum of absolute differences:
# with res[(x-1):(x+1)] we extract x and its neighbors
resSimilarity <- sapply(seq_along(res), function(x) sum(res[(x-1):(x+1)]))
resPosition <- which.min(resSimilarity)
# [1] 5
To extract values from original vectors use:
a[(resPosition - 1):(resPosition + 1)]
# [1] 180 182 190
b[(resPosition - 1):(resPosition + 1)]
# [1] 182 190 188
Here is one more alternative:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
res <- abs(a-b)
> which.min(diff(c(0, cumsum(res)), lag=3))
[1] 4
> res[(4):(4+2)]
[1] 2 8 2
The above code uses cumsum to get the cumulative sums of your absolute differences. Then it calls diff with lag=3 to get the differences between each element and the element 3 positions away from it. Finally, it takes the position where the increase in the cumulative sum over 3 successive elements is the smallest.
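Both answers hard-code a window of 3 points. On vectors of hundreds of values you may want to vary the window size, so the same idea can be wrapped in a small helper; a sketch (the function name most_stable_run and the use of zoo::rollsum are mine, not from the answers above):
library(zoo)
most_stable_run <- function(a, b, k = 3) {
  res   <- abs(a - b)
  sums  <- rollsum(res, k)   # sum of each run of k consecutive differences
  start <- which.min(sums)   # run with the smallest total difference
  list(start = start,
       a = a[start:(start + k - 1)],
       b = b[start:(start + k - 1)])
}
most_stable_run(a, b, 3)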
I am interested in plotting the range of values of variables so that the names appear on the Y-axis and the range on the X-axis, for a better visualization.
I have used the following code:
primer_matrix1a <- matrix(
c(
"EF1", 65, 217,
"EF6", 165, 197,
"EF14", 96, 138,
"EF15", 103, 159,
"EF20", 86, 118,
"G9", 115, 173,
"G25", 112, 140,
"BE22", 131, 135,
"TT20", 180, 190
)
,nrow=9,ncol=3,byrow = T)
# Format data
Primer_name <- primer_matrix1a[,1]
Primer_name <- matrix(c(Primer_name),nrow = 9,byrow = T)
Primer_values<- matrix(c(as.numeric(primer_matrix1a[ ,2-3])),nrow = 9,ncol = 2,byrow = T)
Primer_Frame <- data.frame(Primer_name,Primer_values)
colnames(Primer_Frame) <- c("Primer","min","max")
Primer_Frame$mean<- mean(c(Primer_Frame$min,Primer_Frame$max))
ggplot(Primer_Frame, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
but the plot is weird: EF15 goes from 103 to 159, while G9 goes from 115 to 173, and they do not overlap, so I am doing something wrong.
It looks like something is getting muddled when you are joining the matrix, but the approach is already more complex than it should be, so you might want to start afresh. It is probably easiest to convert it to a data frame and then format it there, rather than fiddling around with all the matrix functions:
df <- as.data.frame(primer_matrix1a)
names(df)<- c("Primer","min","max")
df$min <- as.numeric(as.character(df$min)) # Converts factor to numeric
df$max <- as.numeric(as.character(df$max))
df$mean <- (df$min + df$max) / 2  # per-primer midpoint rather than a single overall mean
ggplot(df, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
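Since the question asks for the primer names on the Y-axis and the range on the X-axis, one extra layer is enough once df is in shape; a sketch that simply flips the coordinates of the plot above:
library(ggplot2)
ggplot(df, aes(x = Primer)) +
  geom_linerange(aes(ymin = min, ymax = max), linetype = 2, color = "blue") +
  geom_point(aes(y = min), size = 3, color = "red") +
  geom_point(aes(y = max), size = 3, color = "red") +
  coord_flip() +   # primer names on the Y-axis, range on the X-axis
  theme_bw()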
Suppose this is my data set:
ID<- seq(1:50)
mou<-sample(c(2000, 2500, 440, 4990, 23000, 450, 3412, 4958,745,1000), 50, replace= TRUE)
calls<-sample(c(50, 51, 12, 60, 90, 888, 444, 668, 16, 89, 222,33, 243, 239, 333, 645,23, 50,555), 50, replace= TRUE)
rev<- sample(c(100, 345, 758, 44, 58, 334, 50000, 888, 205, 940,298, 754), 50, replace= TRUE)
dt<- data.frame(mou, calls, rev)
I did the boxplot for calls, and while analyzing it I saw the following objects returned for the boxplot.
x<-boxplot(dt$calls)
names(x)
> names(x)
[1] "stats" "n" "conf" "out" "group" "names"
Looking at the output for x$stats, I figured that the stats object gives me the lower whisker, the lower hinge, the median, the upper hinge and the upper whisker for each group. But I am a little bit confused about what the object "out" really means. Does it signify the outlier values or something else?
The out object for my boxplot gives the following results:
> x$out
[1] 555 10000 555 555 555 555 555 10000
It gives you: "The values of any data points which lie beyond the extremes of the whiskers"
Take a look here for more insight.
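You can verify this against the whisker positions stored in x$stats: by default the whiskers extend to the most extreme data points within 1.5 * IQR of the hinges (controlled by boxplot's range argument), and everything beyond them ends up in x$out. A quick check based on the code above:
# x$stats[1] is the lower whisker, x$stats[5] the upper whisker
outside <- dt$calls[dt$calls < x$stats[1] | dt$calls > x$stats[5]]
sort(outside)    # should match sort(x$out)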
In R I have 2 vectors
u <- c(109, 77, 57, 158, 60, 63, 42, 20, 139, 15, 64, 18)
v <- c(734, 645, 1001, 1117, 1071, 687, 162, 84, 626, 64, 218, 79)
I want to test H: u and v are independent, so I run a chi-square test:
chisq.test( as.data.frame( rbind(u,v) ) )
and get a very low p-value, which means that I can reject H, i.e. u and v are not independent.
But when I type
chisq.test(u,v)
I get a p-value of 0.23, which means that I can accept H.
Which one of these two tests should I choose?
Furthermore I want to find the entries in these vectors that causes this low p-value. Any ideas how to do this?
The test statistic uses the sum of squared standardised residuals. You can look at these values to get an idea of the contribution of particular entries.
m = chisq.test(u, v)
residuals(m)   # Pearson residuals
m$stdres       # standardised residuals
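A quick way to see which cells contribute most to the statistic (a sketch building on m from above; the squared Pearson residuals sum to the X-squared value):
contrib <- residuals(m)^2                        # per-cell contributions to X-squared
which(contrib == max(contrib), arr.ind = TRUE)   # cell(s) contributing the most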