I have a time series which represents the amount of a certain product sold throughout the year 2018 (from 2018/01/01 to 2018/12/31). Is it correct to think of a frequency of 7 observations per cycle? And if so, what is my cycle: one week? I am trying to understand this in order to decompose my time series while avoiding the error Error in decompose(tsData) : time series has no or less than 2 periods. This is my R script and my data.
library(forecast)
library(sweep)
library(timetk)
Data <- read.delim("R Project/Dataset/MyData.txt")
DataFrame <- data.frame(Data,
                        Date = seq(as.Date("2018-01-01"), as.Date("2018-12-31"),
                                   by = "day"))
inds <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
tsData <- ts(Data, start = c(2018, as.numeric(format(inds[1], "%j"))),
             frequency = 365)
print(tsData)
plot(tsData)
Axis(inds, side = 1, at = seq(inds[1], tail(inds, 1) + 60,
                              by = "1 months"), format = "%b %Y")
comp = decompose(tsData)
#comp = stl(tsData)
plot(comp)
fit <- auto.arima(tsData)
fore <- forecast(fit, h = 15, level = 99.5)
plot(fore, xaxt = "n")
Axis(inds, side = 1, at = seq(inds[1], tail(inds, 1) + 60, by = "1 months"),
     format = "%b %Y")
This is the MyData.txt file:
Daily Data
0
2621
3407
3644
3569
1212
0
0
4473
3885
3671
3641
1453
0
4182
3812
3650
3444
3557
1612
0
4004
3631
3342
3203
3424
1597
0
4280
3644
3642
3696
3793
1753
0
4416
3935
3522
3544
3569
1649
0
3871
3442
3144
3158
3693
1780
0
4322
3682
3499
3279
3485
1716
0
4255
3713
3470
3673
3983
1931
0
4771
3986
3833
3501
3620
1710
0
4407
3799
3654
3332
3693
1780
0
0
4574
4016
3748
3559
1625
0
4548
3726
2780
0
0
122
0
5005
4300
3772
3929
3917
2021
0
4820
4117
3668
3664
3639
1742
0
4473
4151
3844
3499
3736
1838
0
4346
3693
3297
3327
3639
1773
0
4519
0
4352
4079
4143
1970
0
4693
4018
3679
3838
3606
1601
0
0
4289
4011
3742
3710
1781
0
4186
3707
3600
3484
3702
1747
0
4195
3838
3504
3609
3934
1943
0
0
5243
4754
4164
4121
1854
0
0
5173
4518
3875
3889
1904
0
5105
4056
4186
4079
3953
1846
0
4543
4341
4013
2998
4048
1767
0
0
4317
5260
5185
4969
2046
0
5683
5004
4567
4542
4266
2065
0
4357
5281
4830
4510
0
1567
0
5818
4906
4518
4218
4275
2074
0
5005
4645
4543
4558
4574
2129
0
4755
0
4458
3845
3746
1689
0
4285
3476
3447
2959
3470
1584
0
0
4159
3881
3533
3360
1643
0
4152
3748
3329
3112
3303
1790
0
3852
4190
3482
3313
3400
1582
0
4042
3706
3451
3137
3178
1518
0
4077
3754
3429
3369
3307
1467
0
3918
3620
3442
3302
3168
1630
0
3967
3707
3397
3294
3314
1646
0
4196
3812
3478
3111
3113
1411
0
0
3717
3501
3282
3366
1554
0
3737
3428
3028
2960
2977
1513
0
3608
3306
2941
2918
3238
1543
0
0
3959
3678
3367
3237
1024
0
0
4057
3562
3344
3367
1602
0
3784
3581
3395
2948
3009
1446
0
3676
3276
3112
3125
3133
1502
0
4200
4027
3739
3531
3222
2
0
4446
4342
4066
3811
2932
1643
0
4587
4534
4146
3994
3350
1400
0
1248
0
4248
4629
4346
1844
0
168
The frequency parameter of the ts() function indicates the number of observations before the pattern repeats. If you set a seasonality of 365 (1 year) with only 1 year of data, the series contains just 1 period, and so decompose() tells you: time series has no or less than 2 periods.
Since you said "7 observations per cycle", you may want to set frequency equal to 7 (a weekly cycle). Or, if you want to analyze yearly seasonality, put more data in tsData.
Just change:
# ....
tsData <- ts(Data, start = c(2018, as.numeric(format(inds[1], "%j"))), frequency = 365)
# ...
to:
# ...
### weekly seasonality
tsData <- ts(Data, start = c(2018, as.numeric(format(inds[1], "%j"))), frequency = 7)
#...
and now decompose works:
comp = decompose(tsData) # NO ERROR
### get the plot
plot(comp)
# ... rest of your code ...
Here is the resulting plot:
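If you prefer stl() (commented out in your script), note that it only accepts a univariate series and requires the s.window argument; a minimal sketch, assuming the weekly frequency above:
# tsData[, 1] extracts the single column; "periodic" assumes a stable weekly pattern
comp_stl <- stl(tsData[, 1], s.window = "periodic")
plot(comp_stl)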
EDIT, regarding your comment:
The x-axis of the plot depends on how you declare start; please have a look at the ts documentation.
If you want the axis to show the year 2018, you can simply use autoplot() (see its documentation):
# ... rest of code ...
autoplot(tsData)
# ... rest of code ...
autoplot() is also highly customizable; if you want to know how to customize the plot (it is built on the ggplot2 package), have a look at the documentation and the many related posts on this site.
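For example, a light customization might look like this (the labels are placeholders):
library(ggplot2)
autoplot(tsData) +
  labs(x = "Week", y = "Units sold", title = "Daily sales, 2018") +
  theme_minimal()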
I am working with the R programming language.
In the "datasets" library in R, there is a data set called "eurodist" that contains the distance between each pair of cities:
library(datasets)
This data set can then be converted into a "matrix":
eurodist = as.matrix(eurodist)
Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen Geneva Gibraltar Hamburg Hook of Holland Lisbon Lyons Madrid Marseilles Milan Munich Paris Rome Stockholm Vienna
Athens 0 3313 2963 3175 3339 2762 3276 2610 4485 2977 3030 4532 2753 3949 2865 2282 2179 3000 817 3927 1991
Barcelona 3313 0 1318 1326 1294 1498 2218 803 1172 2018 1490 1305 645 636 521 1014 1365 1033 1460 2868 1802
Brussels 2963 1318 0 204 583 206 966 677 2256 597 172 2084 690 1558 1011 925 747 285 1511 1616 1175
Calais 3175 1326 204 0 460 409 1136 747 2224 714 330 2052 739 1550 1059 1077 977 280 1662 1786 1381
Cherbourg 3339 1294 583 460 0 785 1545 853 2047 1115 731 1827 789 1347 1101 1209 1160 340 1794 2196 1588
Cologne 2762 1498 206 409 785 0 760 1662 2436 460 269 2290 714 1764 1035 911 583 465 1497 1403 937
Copenhagen 3276 2218 966 1136 1545 760 0 1418 3196 460 269 2971 1458 2498 1778 1537 1104 1176 2050 650 1455
Geneva 2610 803 677 747 853 1662 1418 0 1975 1118 895 1936 158 1439 425 328 591 513 995 2068 1019
Gibraltar 4485 1172 2256 2224 2047 2436 3196 1975 0 2897 2428 676 1817 698 1693 2185 2565 1971 2631 3886 2974
Hamburg 2977 2018 597 714 1115 460 460 1118 2897 0 550 2671 1159 2198 1479 1238 805 877 1751 949 1155
Hook of Holland 3030 1490 172 330 731 269 269 895 2428 550 0 2280 863 1730 1183 1098 851 457 1683 1500 1205
Lisbon 4532 1305 2084 2052 1827 2290 2971 1936 676 2671 2280 0 1178 668 1762 2250 2507 1799 2700 3231 2937
Lyons 2753 645 690 739 789 714 1458 158 1817 1159 863 1178 0 1281 320 328 724 471 1048 2108 1157
Madrid 3949 636 1558 1550 1347 1764 2498 1439 698 2198 1730 668 1281 0 1157 1724 2010 1273 2097 3188 2409
Marseilles 2865 521 1011 1059 1101 1035 1778 425 1693 1479 1183 1762 320 1157 0 618 1109 792 1011 2428 1363
Milan 2282 1014 925 1077 1209 911 1537 328 2185 1238 1098 2250 328 1724 618 0 331 856 586 2187 898
Munich 2179 1365 747 977 1160 583 1104 591 2565 805 851 2507 724 2010 1109 331 0 821 946 1754 428
Paris 3000 1033 285 280 340 465 1176 513 1971 877 457 1799 471 1273 792 856 821 0 1476 1827 1249
Rome 817 1460 1511 1662 1794 1497 2050 995 2631 1751 1683 2700 1048 2097 1011 586 946 1476 0 2707 1209
Stockholm 3927 2868 1616 1786 2196 1403 650 2068 3886 949 1500 3231 2108 3188 2428 2187 1754 1827 2707 0 2105
Vienna 1991 1802 1175 1381 1588 937 1455 1019 2974 1155 1205 2937 1157 2409 1363 898 428 1249 1209 2105 0
My Question: Suppose I have 6 cities and the longitude/latitude for each of these cities:
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
final_data$names <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
id long lat names
1 1 -75.28447 40.21079 city_1
2 2 -73.29385 40.09104 city_2
3 3 -75.12737 38.88355 city_3
4 4 -79.42325 42.61917 city_4
5 5 -77.82508 41.11707 city_5
6 6 -77.62831 39.94935 city_6
I can also make a similar matrix for these cities that contains the distance between each pair of cities:
library(geosphere)
N <- nrow(final_data)
dists <- outer(seq_len(N), seq_len(N), function(a, b) {
  geosphere::distHaversine(final_data[a, 2:3], final_data[b, 2:3])
})
D <- as.matrix(dists)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
[2,] 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
[3,] 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
[4,] 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
[5,] 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
[6,] 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
How can I make my matrix look the same way as the "eurodist" matrix?
I had thought of the following way to do this:
colnames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
rownames(dists) <- c("city_1", "city_2", "city_3", "city_4", "city_5", "city_6")
city_1 city_2 city_3 city_4 city_5 city_6
city_1 0.0 169895.7 148361.1 437239.3 237056.7 201742.0
city_2 169895.7 0.0 207068.8 584183.9 399577.9 369814.4
city_3 148361.1 207068.8 0.0 551356.0 338698.3 245620.3
city_4 437239.3 584183.9 551356.0 0.0 213326.6 332955.7
city_5 237056.7 399577.9 338698.3 213326.6 0.0 131051.7
city_6 201742.0 369814.4 245620.3 332955.7 131051.7 0.0
In the end, I would like to use the above matrix as input for a customized Travelling Salesman Problem (R: Customizing the Travelling Salesman Problem) - e.g. try to find the optimal path when you are forced to start at "city_4" and the third city must be "city_5":
D <- dists
transformMatrix <- function(fixed_points, D){
  if(length(fixed_points) == 0) return(D)
  p <- integer(nrow(D))
  pos <- match(names(fixed_points), colnames(D))
  p[fixed_points] <- pos
  p[-fixed_points] <- sample(setdiff(seq_len(nrow(D)), pos))
  D[p, p]
}
fixed_points <- c(
"city_4" = 1, "city_5" = 3
)
D_perm <- transformMatrix(fixed_points, D)
feasiblePopulation <- function(n, size, fixed_points){
  positions <- setdiff(seq_len(n), fixed_points)
  m <- matrix(0, size, n)
  if(length(fixed_points) > 0){
    m[, fixed_points] <- rep(fixed_points, each = size)
    for(i in seq_len(size))
      m[i, -fixed_points] <- sample(positions)
  } else {
    for(i in seq_len(size))
      m[i, ] <- sample(positions)
  }
  m
}
mutation <- function(n, fixed_points){
  positions <- setdiff(seq_len(n), fixed_points)
  function(obj, parent){
    vec <- obj@population[parent, ]
    if(length(positions) < 2) return(vec)
    indices <- sample(positions, 2)
    replace(vec, indices, vec[rev(indices)])
  }
}
fitness <- function(tour, distMatrix) {
  tour <- c(tour, tour[1])
  route <- embed(tour, 2)[, 2:1]
  1/sum(distMatrix[route])
}
library(GA)  # provides ga() and gaperm_pmxCrossover
popSize = 500
res <- ga(
  type = "permutation",
  fitness = fitness,
  distMatrix = D_perm,
  lower = 1,
  upper = nrow(D_perm),
  mutation = mutation(nrow(D_perm), fixed_points),
  crossover = gaperm_pmxCrossover,
  suggestions = feasiblePopulation(nrow(D_perm), popSize, fixed_points),
  popSize = popSize,
  maxiter = 5000,
  run = 500,
  pmutation = 0.2
)
colnames(D_perm)[res@solution[1,]]
This results in the following error:
Error in if (object@run >= run) break :
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(fitness) : no non-missing arguments to max; returning -Inf
2: In max(Fitness, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
3: In max(fitness) : no non-missing arguments to max; returning -Inf
4: In max(x, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Is the above error because I have not made "distance matrix" (i.e. "D") properly? Is there a different way to name the columns and rows of a matrix in R?
Thanks!
Note: If anyone knows another way to solve this constrained Travelling Salesman Problem with custom cities using the genetic algorithm in R (e.g. a different objective function, a different way to specify constraints, etc.), please let me know. I am open to different ways of solving this problem!
That’s not the problem. The error says that it encountered the code:
if (object@run >= run) break
… and either object@run or run had length 0, which if() cannot handle gracefully. It may be an error in the ga function itself or in the arguments passed to it.
To address the direct question about how to make the distance matrix look like the eurodist example: matrices have a dimnames attribute. Assign it a list containing the vector of row names and the vector of column names.
dimnames(D) <- list(rownames = final_data$names,
                    colnames = final_data$names)
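Equivalently, since the rows and columns carry the same labels here, you can set both in one step (the list element names rownames=/colnames= above are arbitrary; only the order of the two vectors matters):
rownames(D) <- colnames(D) <- final_data$names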
Then when you run your code you get an error from the ga(...) call:
Error in gaperm_pmxCrossover_Rcpp(object, parents) : index error
Looking at the problem setup, your population size appears much larger than needed. If you drop it down a bit, say to 100 or 200, then the results begin to be computed.
popSize = 200
# recompute res with this popSize, then:
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize = 100
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
popSize = 20
colnames(D_perm)[res@solution[1,]]
#[1] "city_4" "city_6" "city_5" "city_1" "city_3" "city_2"
It doesn't seem "proper" that a population size larger than needed should cause an obscure error, so you might contact the package maintainer with your example (now that it has been "dressed up" properly).
I've got the following data:
ClusterID AvgGenes nCoreGenes Ratio
20001 1941 1572 0.809892
20005 1599 1374 0.859287
20008 2017 1712 0.848785
20009 1808 1590 0.879425
20013 1823 1469 0.805815
20015 2056 1677 0.815661
20019 2135 1783 0.835129
20020 3152 2625 0.832805
20026 2028 1586 0.782051
20028 1835 1420 0.773842
20030 2885 2189 0.758752
20031 1772 1485 0.838036
20032 1722 1473 0.855401
20034 1801 1459 0.810105
20035 1677 1339 0.798450
20042 2193 1651 0.752850
20047 1747 1345 0.769891
20049 1306 1008 0.771822
20051 1738 1358 0.781358
20052 1552 1188 0.765464
20062 2179 1509 0.692520
20065 2047 1894 0.925256
20074 1948 1568 0.804928
20088 2588 2192 0.846986
20103 1916 1341 0.699896
20109 2511 2190 0.872162
20117 1668 1278 0.766187
20162 1936 1601 0.826963
20167 2068 1856 0.897485
20168 4375 3992 0.912457
20170 3961 3252 0.821005
20190 2327 2013 0.865062
20196 3350 2522 0.752836
20198 3028 2302 0.760238
20207 1522 1241 0.815375
20208 1791 1546 0.863205
20215 3013 1853 0.615002
20219 2803 2043 0.728862
20225 4604 2931 0.636620
20247 1927 1567 0.813181
20248 2510 1732 0.690040
20251 2252 1674 0.743339
20279 2843 1775 0.624340
20293 1611 1245 0.772812
20313 2277 1914 0.840580
20314 2320 1915 0.825431
20318 2201 1762 0.800545
20320 2287 1943 0.849585
20321 2060 1645 0.798544
20323 2242 1524 0.679750
20327 2132 1845 0.865385
20328 1685 1402 0.832047
20329 2393 1727 0.721688
20341 2190 1729 0.789498
20368 3906 2991 0.765745
20370 3245 2325 0.716487
20373 2608 1935 0.741948
20374 3632 2380 0.655286
20388 1787 1435 0.803022
20408 1506 1262 0.837981
20423 1979 1428 0.721577
20433 2452 1646 0.671289
20459 2118 1649 0.778565
20462 1778 1496 0.841395
20478 1653 1447 0.875378
20492 2709 1895 0.699520
20494 2686 1773 0.660089
20498 2676 1909 0.713378
20508 1425 1092 0.766316
20517 2461 1983 0.805770
20548 2752 2059 0.748183
20565 2239 1764 0.787852
20566 2368 1882 0.794764
20569 2285 1877 0.821444
20572 2179 1703 0.781551
20573 1609 1355 0.842138
20577 1753 1379 0.786651
20579 1786 1426 0.798432
20589 1811 1239 0.684152
20600 2293 1822 0.794592
20650 1693 1422 0.839929
20677 1904 1485 0.779937
20729 1680 1362 0.810714
20742 2210 1855 0.839367
20744 1583 1372 0.866709
20746 2087 1743 0.835170
20750 1859 1418 0.762776
20753 1701 1496 0.879483
20758 1480 1169 0.789865
20759 1839 1406 0.764546
20772 2068 1786 0.863636
20773 2321 2024 0.872038
20775 2528 2012 0.795886
20784 1869 1592 0.851792
20788 1843 1516 0.822572
20809 1541 1352 0.877352
20811 1569 1346 0.857871
20824 1594 1323 0.829987
20836 2287 1688 0.738085
20857 2252 1704 0.756661
20890 1884 1340 0.711253
20903 1681 1404 0.835217
20966 1826 1455 0.796824
20967 1877 1605 0.855088
20990 2125 1605 0.755294
21002 1743 1345 0.771658
21027 1866 1504 0.806002
21047 2866 2191 0.764480
21049 2163 1596 0.737864
21059 2298 1847 0.803742
21085 1640 1490 0.908537
21258 3002 1950 0.649567
21325 2945 2117 0.718846
21326 2343 1996 0.851899
21348 2362 1809 0.765876
21370 2313 1553 0.671422
21384 1932 1383 0.715839
21405 1948 1398 0.717659
21477 1852 1538 0.830454
21584 2514 1838 0.731106
21586 1247 910 0.729751
21734 1619 1452 0.896850
21818 1593 1363 0.855618
21826 2688 2009 0.747396
21845 2595 1854 0.714451
21889 1678 1285 0.765793
22085 1718 1314 0.764843
22153 1290 1139 0.882946
22347 2356 1629 0.691426
22359 2170 1552 0.715207
22396 1648 1337 0.811286
I would like to use AvgGenes as my x-axis and nCoreGenes as my primary y-axis. In addition, I would like to add a second y-axis for the ratio, which is nCoreGenes/AvgGenes*100 (pCoreGenes). However, I couldn't find the right transformation (y-axis/x-axis*100) to use with sec_axis() inside scale_y_continuous() in ggplot2.
cluster2core$pCoreGenes <- cluster2core$Ratio*100
g6 <- ggplot(cluster2core, aes(AvgGenes, nCoreGenes))
g6 <- g6 + geom_point(aes(y = nCoreGenes)) + geom_smooth(method = lm)
g6 <- g6 + geom_line(aes(y = pCoreGenes))
g6 <- g6 + labs(y = "Number of core genes", x = "Average number of genes")
#g6 <- g6 + scale_y_continuous(sec.axis = sec_axis())
The mean value of the ratio percentage is 78.7, so I expect to get a roughly horizontal line indicating that, on average, genomes have 78% core genes.
Using secondary axes in ggplot requires a cheat: you pretend that your secondary y-axis data are in the same range as the primary y-axis data, and scale them accordingly. Multiplying by 100 does not suffice, because the primary axis data are in the range of a few thousand; multiplying by 4000 should get you there.
Then you reverse the process for the axis labels by passing a transformation to sec_axis(). Normally you would divide by 4000, but since you want a percentage, divide by 40:
library(ggplot2)
# df is the cluster2core data frame from the question
ggplot(df, aes(x=AvgGenes, y=nCoreGenes)) + geom_point() +
  geom_smooth(method=lm) +
  geom_line(aes(y=Ratio*4000)) +
  scale_y_continuous(sec.axis=sec_axis( ~ . / 40))
Also, there is no need to specify the aesthetics in geom_point(), since they are inherited from the aesthetics in the ggplot() call.
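If you also want the axes labelled, sec_axis() accepts a name argument and scale_y_continuous() takes the primary axis title as its first argument; a sketch under the same 4000/40 scaling (the axis titles are placeholders):
ggplot(df, aes(x = AvgGenes, y = nCoreGenes)) +
  geom_point() +
  geom_smooth(method = lm) +
  geom_line(aes(y = Ratio * 4000)) +
  scale_y_continuous("Number of core genes",
                     sec.axis = sec_axis(~ . / 40, name = "pCoreGenes (%)")) +
  labs(x = "Average number of genes")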
I have a time series which represents the amount of a certain product sold throughout the year 2018. I am trying to decompose the time series, but I get the following error: Error in decompose(myzoo) : time series has no or less than 2 periods. This is my code in R:
## required packages
library(forecast)
library(sweep)
library(timetk)
library(zoo)
## Read the Data
Data <- read.delim("R Project/Dataset/MyData.txt")
## Create a daily Date object
inds <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
## Create a time series object
myzoo <- zoo(Data, inds)
## print myzoo
print(myzoo)
## plot myzoo
plot(myzoo)
plot(myzoo, xaxt = "n")
Axis(inds, side = 1, at = seq(inds[1], tail(inds, 1) + 60, by = "1 months"), format = "%b %Y")
## Decompose myzoo
composition = decompose(myzoo)
stl(myzoo)
## use auto.arima to choose ARIMA terms
fit <- auto.arima(myzoo)
## forecast the next 15 time points
fore <- forecast(fit, h = 15, level=c(99.5))
## plot it with no x-axis
plot(fore, xaxt = "n")
Axis(inds, side = 1, at = seq(inds[1], tail(inds, 1) + 60, by = "1 months"), format = "%b %Y")
And this is my data (MyData.txt): the same 365 daily values listed in full above, this time under the column header X.
The zeros represent sales on holidays and Sundays. The purpose of this script is to be able to make forecasts.
Thanks in advance.
I can't help you with the software; perhaps contact the author. Your data is better suited to deterministic effects rather than ARIMA memory effects. There are strong monthly effects and even stronger daily effects, along with a host of pulses, probably reflecting holiday or promotion effects, that are currently omitted from the model.
The Actual/Fit-and-Forecast plot should give you motivation to pursue this approach, with the statistical summary shown here.
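To illustrate the deterministic-effects idea in plain R (this is only a sketch, not the software referenced above; sales is assumed to hold the numeric vector from MyData.txt):
# day-of-week dummy regression as a simple deterministic model
dates <- seq(as.Date("2018-01-01"), by = "day", length.out = length(sales))
dow   <- factor(format(dates, "%a"))  # weekday labels (locale-dependent)
fit   <- lm(sales ~ dow)
summary(fit)  # the strong daily effects show up as the dummy coefficients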
I have daily sales data with zero values (from holidays and Sundays) and I want to apply the BoxCox.lambda() function, but clearly this is impossible with the zero values. My options currently are:
1 - Replace the zero values with values approaching zero, but I do not know how this could affect my forecast.
I will be grateful for any suggestions.
This is my data: the same 365 daily values listed in full above, under the column header Data.
I'd recommend you just drop all the Sundays from your data. As we know they will always be zero, there is no point in spending time and effort on forecasting them.
The periodicity is very strong even with them removed, and diagnosing the data by looking at ACF plots etc. is much more straightforward.
# Removing every Sunday and creating a ts object of appropriate frequency
# (x is the raw daily vector; day 1 is a Monday, so Sundays sit at positions
# divisible by 7)
x6 <- x[seq_along(x) %% 7 != 0]
x6.ts <- ts(x6, frequency = 6)
# Plenty of periodic structure left
par(mfcol=c(2, 1))
sp <- split(x6.ts, (seq_along(x6.ts)-1) %% 6 + 1)
stripchart(sp, vertical=TRUE, col=rainbow(6, alpha=0.2, start=0.97), pch=16,
           method="jitter", group.names=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
plot.default(x6.ts, type="p", pch=16, col=rainbow(6, alpha=0.6, start=0.97))
Then we could, for example, apply a SARIMA model:
acf(x6.ts, adj=c(0.5))
title("x6.ts", cex.main=0.9)
acf(diff(x6.ts, lag=6))
title("diff(x6.ts, lag=6)", cex.main=0.9)
I see a seasonal random walk there, and once we take the seasonal difference we see that there are at least a couple of seasonal autoregressive components, and maybe a non-seasonal autoregression.
aa6.1 <- arima(x6.ts, order=c(0, 0, 0), seasonal=c(1, 1, 0))
aa6.2 <- arima(x6.ts, order=c(0, 0, 0), seasonal=c(2, 1, 0))
aa6.3 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(2, 1, 0))
aa6.4 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(3, 1, 0))
dummy11 <- model.matrix(~ as.factor(seq_along(x6.ts) %% 11))[,2]
aa6.5 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(3, 1, 0),
               xreg=dummy11)
AIC(aa6.1, aa6.2, aa6.3, aa6.4, aa6.5)
# df AIC
# aa6.1 2 5244.846
# aa6.2 3 5195.019
# aa6.3 4 5192.212
# aa6.4 5 5179.310
# aa6.5 6 5164.567
acfr <- function(x){
  a <- acf(residuals(x), plot=FALSE)
  a$acf[1, 1, 1] <- 0
  plot(a, main="", frame.plot=FALSE, ylim=c(-0.2, 0.2))
  mod <- paste(paste(names(x$call),
                     as.character(x$call), sep="=")[-1], collapse=", ")
  text(-0.1, 0.19, pos=4, xpd=NA,
       paste0("AIC: ", round(x$aic), "\n", "Mod: ", mod))
}
par(mfcol=c(5, 1))
k <- lapply(list(aa6.1, aa6.2, aa6.3, aa6.4, aa6.5), acfr)
It seems like (1 0 0)(3 1 0)[6] does a decent job, but there's a persistent autocorrelation at lag 11. This is an artefact of the removal of Sundays, but we can address it by including an external regressor of dummies.
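To turn the fitted model into actual forecasts, predict() works directly on arima() fits; a minimal sketch using aa6.4 (forecasting with aa6.5 would also require future values of dummy11 via newxreg):
fc <- predict(aa6.4, n.ahead = 12)  # two 6-day weeks ahead (Sundays excluded)
fc$pred                             # point forecasts
fc$pred + 1.96 * fc$se              # approximate upper 95% bounds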
I am trying this example with my data set, but it gives me very strange results: Example of Time Series Prediction using Neural Networks in R.
Do you have any idea why this happens?
This is my source code:
require(quantmod)
require(nnet)
require(caret)
series = read.csv("data.csv")
model <- train(y ~ x1+x2 , series, method='nnet', linout=TRUE, trace = FALSE)
series["o"] <- predict(model, series)
plot.ts(series)
write.csv(series, paste(format(Sys.time(), "%Y%m%d%I%p"), "csv", sep = "."))
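Note that data.csv, as shown below, is a single unnamed column, so the y, x1 and x2 columns used in the train() formula do not exist until you build them. A sketch of the lagged-input setup the linked example uses (the lag choices and column names here are my assumptions):
# build a target plus two lagged predictors from a single-column file
raw <- read.csv("data.csv", header = FALSE)[[1]]
series <- data.frame(y  = raw[3:length(raw)],        # value at time t
                     x1 = raw[2:(length(raw) - 1)],  # lag 1
                     x2 = raw[1:(length(raw) - 2)])  # lag 2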
This is my data set:
3938
1317
4021
10477
9379
7707
9507
4194
2681
3522
5599
5641
6737
7781
2044
1501
6586
4915
5918
6132
9394
2113
935
9729
5236
8815
3169
5888
5722
191
9539
3384
6006
7139
7285
136
1843
5094
3795
5985
5566
3545
965
14
3738
4645
8439
6390
13842
7754
11440
7572
4876
3206
5577
2734
1169
20
5049
6612
2685
7000
6711
4091
26
5383
5516
7185
6118
4484
2178
754
8104
8209
6159
11137
8994
5172
425
8082
5337
5712
7157
6385
3343
4196
5957
8581
3686
0
254
1819
1071
876
3509
2777
1474
4945
3971
21
5466
5509
1316
5653
2775
797
22
5601
6177
5662
5132
6543
1700
4361
6951
7734
3451
5385
6358
6838
19
6460
5813
6839
6335
2105
8
6
9530
1250
5668
5595
6008
2315
1712
8553
5570
5979
4818
6745
5250
43
5727
7416
5888
6270
4931
0
31
6190
11164
5768
7307
5412
2716
35
8391
6054
2796
5081
6646
4597
1978
7570
5909
9581
3571
6740
1702
1080
6719
963
6781
7544
7708
1993
597
2394
5516
12966
723
6528
2476
86
5956
5820
6995
6682
2460
2479
56
7095
7255
6310
9971
3725
5400
452
6018
5803
6673
6098
9476
692
20
7855
11970
10557
5696
7765
3847
47
6020
6037
5684
7089
6372
970
861
3590
7672
3730
10689
9428
1514
2062
6154
5234
6160
5134
879
1079
9164
6338
6687
8195
6351
1123
4216
3759
9372
7782
3143
4773
6993
849
906
6385
7512
8824
8150
12464
7726
8745
13594
6589
6524
2784
0
1785
688
7998
6797
8289
10815
10280
4839
3928
10935
4588
5785
6771
7628
2908
11391
6637
5585
7454
5828
8259
6644
2436
7055
7206
7873
7368
6239
3595
3166
1846
2301
21
1600
2390
1894
1469
9097
8401
2034
3244
8811
2979
20
7808
7698
11031
4556
7149
3745
5563
9673
8149
12158
7043
6273
1855
80
10729
5880
9327
6343
7227
3522
1244
6382
7186
4964
6162
7435
10524
2449
7437
11970
6661
6122
7323
6707
25
2270
5117
6676
5317
7032
7689
4891
8051
5699
4927
11553
6418
2968
11338
7662
9976
5526
14341
4331
10026
1672
5199
4699
7774
7958
7720
2499
10745
19609
15896
5705
6207
7699
2543
32
3642
6307
7491
6236
8644
2121
1448
7838
5434
5945
6074
6962
5441
42
7424
5818
8877
5743
7980
3140
3046
8329
8186
5994
2931
7309
862
145
8141
6252
9536
6213
7150
2718
1687
5000
6068
5918
10652
12257
1505
2421
10518
2368
7341
8137
7997
3437
2009
5468
3947
5836
8567
11039
3726
746
3417
8649
8016
7652
8298
1306
4031
5525
6203
11847
7688
10911
1080
1001
12315
6084
6529
4074
8526
3161
2184
7400
4916
4521
1523
398
1364
925
38
2580
1039
6556
2040
1166
825
7672
7177
6104
7928
6240
1420
1214
10638
10726
2323
6113
8112
2757
3761
6982
5680
7793
8983
8546
1335
817
6136
3778
6639
6548
6120
3648
584
9099
6434
8828
9988
6066
2575
2237
5114
5879
4094
9309
8008
1614
4307
5801
8006
6344
4803
10904
1339
411
8468
6945
5471
8828
4157
1134
1071
5542
2213
5633
9245
2145
4901
39
10430
7941
6189
7985
8296
614
894
6236
1704
4257
7707
8388
1050
855
9352
4801
7088
8466
470
2433
1036
392
2169
84
5316
8339
4272
2617
1840
7254
5999
6178
4563
3370
756
2773
6610
8967
6182
7452
2570
1443
6537
5338
9158
3870
12036
3574
864
10135
5595
8643
2287
9918
2484