My data looks like this:
ID Group timing glucose_level
<chr> <dbl> <int> <dbl>
1 black 7 0 0 136
2 black 1 0 0 116
3 blue 20 0 0 144
4 green 18 0 0 114
5 red 4 0 0 126
6 red 5 0 0 80
7 green 17 0 0 111
8 green 3 0 0 109
9 red 20 0 0 96
10 black 39 0 0 140
There are some missing values in glucose_level. Below is part of the glucose_level data:
[697] 128 157 132 142 141 128 97 120 123 131 132 126 140 103 147 181 217 257 218 234 240 281 273 224 210 227 NA NA 245
[726] 230 252 270 238 134 173 193 151 128 180 218 218 190 225 214 186 140 237 239 279 246 244 146 196 157 178 140 127 187
[755] 206 177 220 179 167 127 219 223 241 162 235 140 187 154 172 116 139 194 173 150 187 131 176 114 154 180 223 150 219
[784] 130 169 104 136 132 121 175 169 128 110 101 100 92 122 196 203 96 143 129 NA 72 141 143 129 149 132 107 94 76
[813] 80 95 63 198 181 86 122
I want to use a loop to replace the missing values. Here is my code:
for(i in 1:length(data)){
  if(is.na(data[i,'glucose_level'])){
    if(data[i,'Group']==0){
      data[i,'glucose_level']=162.7059
    }else if(data[i,'Group']==1){
      data[i,'glucose_level']= 163.1415
    }else{
      data[i,'glucose_level']= 165.9106
    }
  }
}
I print out data$glucose_level and find there are still missing values in it. Why are there no changes in my data?
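The reason nothing changes is that length(data) on a data frame returns the number of columns, not the number of rows, so the loop only visits the first few rows. Iterating over nrow(data) fixes the loop itself (a minimal sketch, assuming Group really is coded 0/1/other as your comparisons imply):
for(i in seq_len(nrow(data))){  # nrow(), not length(): length() counts columns
  if(is.na(data[i, 'glucose_level'])){
    if(data[i, 'Group'] == 0){
      data[i, 'glucose_level'] <- 162.7059
    } else if(data[i, 'Group'] == 1){
      data[i, 'glucose_level'] <- 163.1415
    } else {
      data[i, 'glucose_level'] <- 165.9106
    }
  }
}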
You can also use nested ifelse or case_when to check the conditions and assign values accordingly.
library(dplyr)
data <- data %>%
  mutate(glucose_level = case_when(!is.na(glucose_level) ~ glucose_level,
                                   Group == 0 ~ 162.7059,
                                   Group == 1 ~ 163.1415,
                                   TRUE ~ 165.9106))
We can use fcase from data.table. Note that fcase takes condition/value pairs, so the catch-all value must be supplied via the default argument:
library(data.table)
setDT(data)[, glucose_level := fcase(!is.na(glucose_level), glucose_level,
                                     Group == 0, 162.7059,
                                     Group == 1, 163.1415,
                                     default = 165.9106)]
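Whichever version you use, a quick sanity check is to count the remaining NAs; it should print 0 after the replacement:
sum(is.na(data$glucose_level))  # 0 once all missing values are filled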
I am calling table in a loop, so the variable name has to be in a variable itself. Reproducible code below:
library(FactoMineR)
data(hobbies)
htable <- matrix(as.numeric(NA), nrow=18, ncol=7,
dimnames=list( colnames(hobbies)[1:18],
names(with(hobbies,
table(Profession)))))
### Then filling the tables with values:
for(hob in rownames(htable)) {
htable[hob, ] <- with(hobbies, table(hob,
Profession))[2, ]
}
Error in table(hob, Profession) : all arguments must have the same length
Somehow the length of hob is taken as 1, so it is not interpreted in the context of the data frame hobbies. What is wrong?
We can use [ instead of with
for(hob in rownames(htable))
htable[hob, ] <- table(hobbies[, hob],
hobbies[, "Profession"])[2,]
Output:
> htable
Unskilled worker Manual labourer Technician Foreman Management Employee Other
Reading 361 603 241 570 907 1804 154
Listening music 488 715 285 554 853 1879 150
Cinema 181 307 169 365 640 1016 92
Show 84 189 132 264 533 734 68
Exhibition 92 208 138 326 604 717 74
Computer 147 307 203 351 606 915 80
Sport 163 296 187 361 595 867 75
Walking 320 516 209 443 632 1295 112
Travelling 165 314 189 398 720 932 89
Playing music 69 105 78 152 288 430 52
Collecting 69 129 51 85 120 287 16
Volunteering 54 114 69 168 263 377 40
Mechanic 268 629 249 381 553 867 90
Gardening 266 501 162 347 501 968 92
Knitting 147 98 31 111 116 634 43
Cooking 314 435 156 313 465 1302 91
Fishing 110 223 66 94 103 168 19
TV 78 138 56 127 206 331 34
The value passed in on each iteration is a string, i.e. 'hob' holds each rowname in turn, and with just returns the string itself:
> with(hobbies, "Reading")
[1] "Reading"
It doesn't return the value of the column 'Reading' when we wrap with hobbies although, directly we can do this
> head(with(hobbies, Reading))
[1] 1 1 1 1 1 0
Levels: 0 1
or with [ or [[
> head(hobbies[, "Reading"])
[1] 1 1 1 1 1 0
Levels: 0 1
> head(hobbies[["Reading"]])
[1] 1 1 1 1 1 0
Levels: 0 1
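For what it's worth, the same [ indexing lets you drop the explicit loop entirely; a compact sketch that builds the whole table at once:
library(FactoMineR)
data(hobbies)
# One row per hobby, one column per profession; [2, ] keeps the row where the hobby equals 1
htable <- t(sapply(colnames(hobbies)[1:18], function(hob)
  table(hobbies[, hob], hobbies[, "Profession"])[2, ]))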
I'd like to do several manipulations with datasets that are built into R from the packages that I have. So, first, I made a vector with the datasets' names, but when I tried to filter the datasets which have only one column, I got an error saying that the length of the argument is 0. Here is the code:
for (i in datasets){
  if (ncol(i) == 1){
    dataset <- i
    datasets <- c(dataset, datasets)
  }
}
It treats the names of the datasets as a character vector.
Here is the head of the aforementioned vector: [1] ability.cov airmiles AirPassengers airquality anscombe attenu. It's silly, but how could I treat the entries as data frames?
I don't fully understand your logic, but based on your code, you want to identify which datasets have one column by using ncol(x) == 1. If that's true, then you need to deal with some issues:
the various structures of the datasets. ncol gives the number of columns of a data.frame or matrix but not of a time series. For example: ncol(anscombe) results in 8 but ncol(AirPassengers) results in NULL. If you decide to use ncol, then you need to coerce each dataset to a data.frame by using as.data.frame.
indexing the character vector of the names of the datasets. You need to call a dataset, not its character name, to be able to use as.data.frame. One way of doing this is by using eval(parse(text=the_name)).
the way to store the result. You can use c() to combine the results but the datasets will be converted to vectors, no longer in their initial structures. I recommend using list to preserve the data frame structures of the datasets.
Here is one possible solution based on those considerations:
datasets <- c("ability.cov", "airmiles", "AirPassengers", "airquality", "anscombe", "attenu")
single_col_datasets <- vector('list', 1)
for (i in seq_along(datasets)){
  if (ncol(as.data.frame(eval(parse(text = datasets[i])))) == 1){
    single_col_datasets[[i]] <- as.data.frame(eval(parse(text = datasets[i])))
    names(single_col_datasets[[i]]) <- datasets[i]
  }
}
# Collect the non-empty elements once the loop is done:
not.null.element <- single_col_datasets[lengths(single_col_datasets) != 0]
new.datasets <- list(not.null.element, datasets)
Here is the result:
new.datasets
[[1]]
[[1]][[1]]
airmiles
1 412
2 480
3 683
4 1052
5 1385
6 1418
7 1634
8 2178
9 3362
10 5948
11 6109
12 5981
13 6753
14 8003
15 10566
16 12528
17 14760
18 16769
19 19819
20 22362
21 25340
22 25343
23 29269
24 30514
[[1]][[2]]
AirPassengers
1 112
2 118
3 132
4 129
5 121
6 135
7 148
8 148
9 136
10 119
11 104
12 118
13 115
14 126
15 141
16 135
17 125
18 149
19 170
20 170
21 158
22 133
23 114
24 140
25 145
26 150
27 178
28 163
29 172
30 178
31 199
32 199
33 184
34 162
35 146
36 166
37 171
38 180
39 193
40 181
41 183
42 218
43 230
44 242
45 209
46 191
47 172
48 194
49 196
50 196
51 236
52 235
53 229
54 243
55 264
56 272
57 237
58 211
59 180
60 201
61 204
62 188
63 235
64 227
65 234
66 264
67 302
68 293
69 259
70 229
71 203
72 229
73 242
74 233
75 267
76 269
77 270
78 315
79 364
80 347
81 312
82 274
83 237
84 278
85 284
86 277
87 317
88 313
89 318
90 374
91 413
92 405
93 355
94 306
95 271
96 306
97 315
98 301
99 356
100 348
101 355
102 422
103 465
104 467
105 404
106 347
107 305
108 336
109 340
110 318
111 362
112 348
113 363
114 435
115 491
116 505
117 404
118 359
119 310
120 337
121 360
122 342
123 406
124 396
125 420
126 472
127 548
128 559
129 463
130 407
131 362
132 405
133 417
134 391
135 419
136 461
137 472
138 535
139 622
140 606
141 508
142 461
143 390
144 432
[[2]]
[1] "ability.cov" "airmiles" "AirPassengers" "airquality" "anscombe" "attenu"
You can use the get function:
for (i in datasets){
  # wrap in as.data.frame() so ncol() also works for the time-series datasets
  if (ncol(as.data.frame(get(i))) == 1){
    dataset <- i
    datasets <- c(dataset, datasets)
  }
}
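If you only need the names of the one-column datasets, the same idea also works without an explicit loop; a sketch combining get with Filter (as.data.frame again handles the time-series objects):
one_col_names <- Filter(function(nm) ncol(as.data.frame(get(nm))) == 1, datasets)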
The function lubridate::yday returns the day of the year as an integer:
> lubridate::yday("2020-07-01")
[1] 183
I would like to be able to calculate the day of the year assuming a different yearly start date. For example, I would like to start all years on July 1st (07-01), such that I could call:
> lubridate::yday("2020-07-01", start = "2020-07-01")
[1] 1
I could call:
> lubridate::yday("2020-07-01") - lubridate::yday("2020-06-30")
[1] 1
But not only would this fail to account for leap years, it would also be difficult to account for a date in 2021 (or any date that crosses the January 1st threshold of a given year):
> lubridate::yday("2021-01-01") - lubridate::yday("2020-06-30")
[1] -181
After working a little bit with this on my own, this is what I have created:
valiDATE <- function(date) {
stopifnot(`date must take the form of "MM-DD"` = stringr::str_detect(date, "^\\d{2}-\\d{2}$"))
}
days <- function(x, end = "06-30") {
valiDATE(end)
calcdiff <- function(x) {
endx <- glue::glue("{lubridate::year(x)}-{end}")
if(lubridate::yday(x) > lubridate::yday(endx)) {
diff <- ceiling(difftime(x, endx, units = "days"))
} else {
endx <- glue::glue("{lubridate::year(x)-1}-{end}")
diff <- ceiling(difftime(x, endx, units = "days"))
}
unclass(diff)
}
purrr::map_dbl(x, calcdiff)
}
day_vec <- seq(as.Date("2020-07-01"), as.Date("2021-06-30"), by = "days")
days(day_vec)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
[38] 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[75] 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
[112] 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
[149] 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
[186] 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222
[223] 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259
[260] 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296
[297] 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333
[334] 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365
I would still like to see other solutions. Thanks!
Adding or subtracting months to shift your dates to the desired start of the year can be helpful in this case.
For your example vector of dates day_vec, you can subtract six months from all the dates if you want to start your year on July 1.
library(lubridate)
day_vec <- seq(as.Date("2020-07-01"), as.Date("2021-06-30"), by = "days")
day_vec2 <- day_vec %m-% months(6) # Subtracting because the new year starts 6 months later
yday(day_vec2) # Result is what you desired
The point to keep in mind is whether your new beginning of the year is before or after the conventional beginning. If your year starts early, you should add months and vice-versa.
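A quick spot-check of the two boundary dates of the custom year (with lubridate loaded as above):
yday(as.Date("2020-07-01") %m-% months(6))  # 1   (maps to 2020-01-01)
yday(as.Date("2021-06-30") %m-% months(6))  # 365 (maps to 2020-12-30; 2020 is a leap year)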
I have a data set that lists the percentiles for a set of scores like this:
> percentiles
Score Percentile
1 231 0
2 385 1
3 403 2
4 413 3
5 418 4
6 424 5
7 429 6
8 434 7
9 437 8
10 441 9
11 443 10
I would like the "Score" column to run from 100 to 500. That is, I would like Scores 100 to 231 to be associated with a Percentile of 0, Scores 232 to 385 to be associated with a Percentile of 1, etc. Is there a simple way to fill in the values that do not appear in the sequence of "Score" values so it looks like the below data set?
> percentiles
Score Percentile
1 100 0
2 101 0
3 102 0
4 103 0
5 104 0
6 105 0
7 106 0
8 107 0
9 108 0
10 109 0
--------------------
130 229 0
131 230 0
132 231 0
133 232 1
134 233 1
135 234 1
136 235 1
137 236 1
138 237 1
139 238 1
140 239 1
If you convert percentiles to a data.table, you can do a rolling join with a new table of all scores 100:500. The rolling join with roll = -Inf gives fill-backward behavior by itself, but the 444:500 values are still NA, so a forward nafill is added at the end.
library(data.table)
setDT(percentiles)
percentiles[data.table(Score = 100:500), on = .(Score), roll = -Inf
][, Percentile := nafill(Percentile, 'locf')]
# Score Percentile
# 1: 100 0
# 2: 101 0
# 3: 102 0
# 4: 103 0
# 5: 104 0
# ---
# 397: 496 10
# 398: 497 10
# 399: 498 10
# 400: 499 10
# 401: 500 10
You might think about this differently: instead of a data frame to fill, treat your scores as a set of breaks for binning. Use the scores as the breaks, with -Inf tacked on for the lower bound. If you need something different to happen for the scores above the highest break, add Inf to the end of the breaks, but you'll need to come up with an additional label.
library(dplyr)
dat <- data.frame(Score = 100:500) %>%
mutate(Percentile = cut(Score, breaks = c(-Inf, percentiles$Score),
labels = percentiles$Percentile,
right = T, include.lowest = F))
Taking a look at a few of the breaking points:
slice(dat, c(129:135, 342:346))
#> Score Percentile
#> 1 228 0
#> 2 229 0
#> 3 230 0
#> 4 231 0
#> 5 232 1
#> 6 233 1
#> 7 234 1
#> 8 441 9
#> 9 442 10
#> 10 443 10
#> 11 444 <NA>
#> 12 445 <NA>
We could use complete
library(dplyr)
library(tidyr)
out <- complete(percentiles, Score = 100:500) %>%
fill(Percentile, .direction = "updown")
out %>%
slice(c(1:10, 130:140)) %>%
as.data.frame
# Score Percentile
#1 100 0
#2 101 0
#3 102 0
#4 103 0
#5 104 0
#6 105 0
#7 106 0
#8 107 0
#9 108 0
#10 109 0
#11 229 0
#12 230 0
#13 231 0
#14 232 1
#15 233 1
#16 234 1
#17 235 1
#18 236 1
#19 237 1
#20 238 1
#21 239 1
data
percentiles <- structure(list(Score = c(231L, 385L, 403L, 413L, 418L, 424L,
429L, 434L, 437L, 441L, 443L), Percentile = 0:10), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))
In base R you could use the findInterval function to break up your sequence 100:500 into buckets determined by the Score, then index into the Percentile column:
x <- 100:500
ind <- findInterval(x, percentiles$Score, left.open = TRUE)
output <- data.frame(Score = x, Percentile = percentiles$Percentile[ind + 1])
Values of x above 443 will receive a percentile of NA.
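If you would rather carry the top percentile forward instead of NA (an assumption about the desired behaviour), you can clamp the index:
ind <- pmin(ind + 1, nrow(percentiles))  # scores above 443 keep the last percentile, 10
output <- data.frame(Score = x, Percentile = percentiles$Percentile[ind])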
Here is a base R solution where cut() and match() do the work:
df <- data.frame(Score = (x <- 100:500),
                 percentile = percentiles$Percentile[match(s <- cut(x, c(0, percentiles$Score)), levels(s))])
such that
> df
Score percentile
1 100 0
2 101 0
3 102 0
4 103 0
5 104 0
6 105 0
7 106 0
8 107 0
9 108 0
10 109 0
11 110 0
12 111 0
13 112 0
14 113 0
15 114 0
16 115 0
17 116 0
18 117 0
19 118 0
20 119 0
21 120 0
22 121 0
23 122 0
24 123 0
25 124 0
26 125 0
27 126 0
28 127 0
29 128 0
30 129 0
31 130 0
32 131 0
33 132 0
34 133 0
35 134 0
36 135 0
37 136 0
38 137 0
39 138 0
40 139 0
41 140 0
42 141 0
43 142 0
44 143 0
45 144 0
46 145 0
47 146 0
48 147 0
49 148 0
50 149 0
51 150 0
52 151 0
53 152 0
54 153 0
55 154 0
56 155 0
57 156 0
58 157 0
59 158 0
60 159 0
61 160 0
62 161 0
63 162 0
64 163 0
65 164 0
66 165 0
67 166 0
68 167 0
69 168 0
70 169 0
71 170 0
72 171 0
73 172 0
74 173 0
75 174 0
76 175 0
77 176 0
78 177 0
79 178 0
80 179 0
81 180 0
82 181 0
83 182 0
84 183 0
85 184 0
86 185 0
87 186 0
88 187 0
89 188 0
90 189 0
91 190 0
92 191 0
93 192 0
94 193 0
95 194 0
96 195 0
97 196 0
98 197 0
99 198 0
100 199 0
101 200 0
102 201 0
103 202 0
104 203 0
105 204 0
106 205 0
107 206 0
108 207 0
109 208 0
110 209 0
111 210 0
112 211 0
113 212 0
114 213 0
115 214 0
116 215 0
117 216 0
118 217 0
119 218 0
120 219 0
121 220 0
122 221 0
123 222 0
124 223 0
125 224 0
126 225 0
127 226 0
128 227 0
129 228 0
130 229 0
131 230 0
132 231 0
133 232 1
134 233 1
135 234 1
136 235 1
137 236 1
138 237 1
139 238 1
140 239 1
141 240 1
142 241 1
143 242 1
144 243 1
145 244 1
146 245 1
147 246 1
148 247 1
149 248 1
150 249 1
151 250 1
152 251 1
153 252 1
154 253 1
155 254 1
156 255 1
157 256 1
158 257 1
159 258 1
160 259 1
161 260 1
162 261 1
163 262 1
164 263 1
165 264 1
166 265 1
167 266 1
168 267 1
169 268 1
170 269 1
171 270 1
172 271 1
173 272 1
174 273 1
175 274 1
176 275 1
177 276 1
178 277 1
179 278 1
180 279 1
181 280 1
182 281 1
183 282 1
184 283 1
185 284 1
186 285 1
187 286 1
188 287 1
189 288 1
190 289 1
191 290 1
192 291 1
193 292 1
194 293 1
195 294 1
196 295 1
197 296 1
198 297 1
199 298 1
200 299 1
201 300 1
202 301 1
203 302 1
204 303 1
205 304 1
206 305 1
207 306 1
208 307 1
209 308 1
210 309 1
211 310 1
212 311 1
213 312 1
214 313 1
215 314 1
216 315 1
217 316 1
218 317 1
219 318 1
220 319 1
221 320 1
222 321 1
223 322 1
224 323 1
225 324 1
226 325 1
227 326 1
228 327 1
229 328 1
230 329 1
231 330 1
232 331 1
233 332 1
234 333 1
235 334 1
236 335 1
237 336 1
238 337 1
239 338 1
240 339 1
241 340 1
242 341 1
243 342 1
244 343 1
245 344 1
246 345 1
247 346 1
248 347 1
249 348 1
250 349 1
251 350 1
252 351 1
253 352 1
254 353 1
255 354 1
256 355 1
257 356 1
258 357 1
259 358 1
260 359 1
261 360 1
262 361 1
263 362 1
264 363 1
265 364 1
266 365 1
267 366 1
268 367 1
269 368 1
270 369 1
271 370 1
272 371 1
273 372 1
274 373 1
275 374 1
276 375 1
277 376 1
278 377 1
279 378 1
280 379 1
281 380 1
282 381 1
283 382 1
284 383 1
285 384 1
286 385 1
287 386 2
288 387 2
289 388 2
290 389 2
291 390 2
292 391 2
293 392 2
294 393 2
295 394 2
296 395 2
297 396 2
298 397 2
299 398 2
300 399 2
301 400 2
302 401 2
303 402 2
304 403 2
305 404 3
306 405 3
307 406 3
308 407 3
309 408 3
310 409 3
311 410 3
312 411 3
313 412 3
314 413 3
315 414 4
316 415 4
317 416 4
318 417 4
319 418 4
320 419 5
321 420 5
322 421 5
323 422 5
324 423 5
325 424 5
326 425 6
327 426 6
328 427 6
329 428 6
330 429 6
331 430 7
332 431 7
333 432 7
334 433 7
335 434 7
336 435 8
337 436 8
338 437 8
339 438 9
340 439 9
341 440 9
342 441 9
343 442 10
344 443 10
345 444 NA
346 445 NA
347 446 NA
348 447 NA
349 448 NA
350 449 NA
351 450 NA
352 451 NA
353 452 NA
354 453 NA
355 454 NA
356 455 NA
357 456 NA
358 457 NA
359 458 NA
360 459 NA
361 460 NA
362 461 NA
363 462 NA
364 463 NA
365 464 NA
366 465 NA
367 466 NA
368 467 NA
369 468 NA
370 469 NA
371 470 NA
372 471 NA
373 472 NA
374 473 NA
375 474 NA
376 475 NA
377 476 NA
378 477 NA
379 478 NA
380 479 NA
381 480 NA
382 481 NA
383 482 NA
384 483 NA
385 484 NA
386 485 NA
387 486 NA
388 487 NA
389 488 NA
390 489 NA
391 490 NA
392 491 NA
393 492 NA
394 493 NA
395 494 NA
396 495 NA
397 496 NA
398 497 NA
399 498 NA
400 499 NA
401 500 NA
A bit hacky Base R:
# Create a dataframe with all score values in the range:
score_range <- merge(data.frame(Score = c(100:500)), percentiles, by = "Score", all.x = TRUE)
# Reverse the order of the dataframe:
score_range <- score_range[rev(order(score_range$Score)),]
# Change the first NA (the highest score) to the maximum percentile:
score_range$Percentile[which(is.na(score_range$Percentile))][1] <- max(score_range$Percentile, na.rm = TRUE)
# Replace all NAs with the value before them:
score_range$Percentile <- na.omit(score_range$Percentile)[cumsum(!is.na(score_range$Percentile))]
Data:
percentiles <- structure(list(Score = c(231L, 385L, 403L, 413L, 418L, 424L,
429L, 434L, 437L, 441L, 443L),
Percentile = 0:10), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))
So, I have this time series that tracks the daily number of applications to a graduate program. Each application period is 64 days - so for each period, you start at zero and it goes up until the end of the period. The last period is partial, representing the current application period.
[1] 0 26 32 36 37 38 40 43 43 44 45 45 45 45 49 49 55 61 66 69 73 77 85 94 99 102 104 108 113 117 123 126 128 132 138 143 151 156 158 161 162 172 175 179 182 189 193
[48] 196 206 213 218 225 234 241 243 251 256 264 267 273 277 282 290 302 0 16 23 36 40 44 51 54 58 60 64 66 69 74 82 88 90 91 92 93 96 102 102 104 106 109 111 115 117 124
[95] 124 126 128 128 129 130 132 135 135 136 139 140 146 150 152 155 157 159 160 167 171 173 174 174 176 177 180 182 184 185 186 186 187 187 0 11 16 27 38 40 44 51 54 57 61 71 80
[142] 85 92 95 97 100 107 116 121 125 131 134 134 136 137 143 150 151 156 163 163 165 173 189 200 210 215 233 247 256 275 279 284 291 304 310 315 325 330 332 332 343 345 351 357 359 359 365
[189] 371 372 372 374 0 24 34 41 53 65 74 78 84 90 93 96 104 105 112 118 122 126 134 138 143 151 155 156 158 159 164 171 177 180 184 188 196 201 203 218 223 225 230 233 236 240 245
[236] 250 255 259 265 267 275 281 285 290 293 298 307 316 319 320 322 325 328 338 342 342 0 10 18 23 27 40 51 60 67 71 73 76 82 88 91 94 102 102 104 111 114 118 119 123 123 130
[283] 133 142 146 154 157 160 163 172 177 187 192 195 195 197 201 208 210 214 222 225 227 232 240 243 246 249 251 254 258 261 265 267 269 270 272 274 293 293 0 12 17 19 22 27 28 32 35
[330] 38 44 45 45 46 52 54 55 61 67 73 77 79 82 85 87 90 110 122 128 133 145 157 169 179 198 205 215 229 239 256 264 279 290 298 306 309 317 322 324 327 331 341 357 375 379 382
[377] 385 395 396 398 400 407 409 415 0 57 72 94 104 119 125 129 131 136 149 154 165 173 177 181 186 191 195 204 210 216 224 234 240 245 253 257 263 269 273 276 283 287 304 322 328 332 352
[424] 366 377 380 383 387 388 398 405 408 411 416 420 427 435 437 446 448 455 463 468 476 486 493 501 501 0 17 35 48 61 69 77 87 95 100 105 109 112 117 120 122 125 131 136 141 145 154
[471] 159 161 164 169 172 179 182 190 192 199 203 206 209 218 225 228 231 237 241 243 245 248 249 256 262 277 289 295 303 308 313 321 330 333 334 342 343 344 346 349 353 354 1 17 32 40 48
[518] 50 53 54 55 56 62 65 69 73 75 81 85 87 89 92 96 98 100 103 106 108 111 112 113 121 123 127 130 136 136 141 143 146 146 150 151 152 153 154 164 175 184 187 189 191 192 193
[565] 198 203 217 220 230 234 237 240 244 256 262 268 0 20 31 46
Each day, I run a simple model that happens to predict the number of applications quite well.
myts2 <- ts(df, frequency = 64)
myts2 <- HoltWinters(myts2, seasonal = "additive")
fcast <- predict(myts2, n.ahead=60, prediction.interval = T, level = 0.95)
# Creates data frame with day (0 to 63), predicted fit, and confidence intervals
castout <- data.frame((elapsed):63, as.numeric(fcast[,1]), as.numeric(fcast[,2]), as.numeric(fcast[,3]))
names(castout) <- c("Day", "Total", "High", "Low")
# Simplified; this block ensures the low estimate cannot dip below the current number of applications
castout$Low[castout$Low < 53] <- 53
Here's the output of fcast:
> fcast
Time Series:
Start = c(10, 5)
End = c(10, 64)
Frequency = 64
fit upr lwr
10.06250 51.08407 77.18901 24.979132
10.07812 55.25007 91.76327 18.736879
10.09375 61.69342 106.24630 17.140542
10.10938 65.36204 116.71089 14.013186
10.12500 69.29609 126.64110 11.951078
10.14062 71.76356 134.53454 8.992582
10.15625 76.06790 143.83176 8.304034
10.17188 78.42243 150.83574 6.009127
10.18750 81.85213 158.63385 5.070411
10.20312 86.70147 167.61610 5.786832
10.21875 94.62669 179.47316 9.780222
10.23438 101.18980 189.79380 12.585798
10.25000 104.27303 196.48157 12.064493
10.26562 106.00446 201.68183 10.327081
10.28125 107.74120 206.76598 8.716431
10.29688 109.56690 211.82956 7.304241
10.31250 112.75659 218.15771 7.355464
10.32812 119.17347 227.62227 10.724667
10.34375 120.76563 232.17877 9.352490
10.35938 123.42045 237.72108 9.119822
10.37500 126.19423 243.31117 9.077281
10.39062 130.27639 250.14350 10.409274
10.40625 133.92534 256.48092 11.369764
10.42188 138.90565 264.09197 13.719325
10.43750 142.15385 269.91676 14.390943
10.45312 149.87770 280.16626 19.589151
10.46875 152.03874 284.80490 19.272586
10.48438 155.52991 290.72828 20.331547
10.50000 143.70956 281.29715 6.121980
10.51562 144.86804 284.80405 4.932018
10.53125 150.57027 292.81595 8.324581
10.54688 156.17148 300.68993 11.653042
10.56250 162.91642 309.67243 16.160415
10.57812 167.96348 316.92344 19.003512
10.59375 170.24252 321.37431 19.110738
10.60938 173.24254 326.51538 19.969707
10.62500 173.89835 329.28274 18.513961
10.64062 181.92820 339.39583 24.460577
10.65625 185.62127 345.14493 26.097603
10.67188 188.82313 350.37666 27.269594
10.68750 191.58817 355.14638 28.029951
10.70312 197.56781 363.10643 32.029187
10.71875 201.46633 368.96194 33.970710
10.73438 203.75381 373.18381 34.323802
10.75000 211.86575 383.20831 40.523188
10.76562 218.58229 391.81629 45.348290
10.78125 223.19144 398.29645 48.086433
10.79688 229.36717 406.32341 52.410940
10.81250 237.59928 416.38758 58.810989
10.82812 244.59432 425.19609 63.992543
10.84375 247.02798 429.42520 64.630764
10.85938 253.22807 437.40324 69.052906
10.87500 258.46738 444.40349 72.531266
10.89062 265.76017 453.44071 78.079642
10.90625 268.82203 458.23093 79.413143
10.92188 274.29332 465.41494 83.171700
10.93750 278.46062 471.27976 85.641485
10.95312 283.35496 477.85680 88.853120
10.96875 290.67334 486.84344 94.503231
10.98438 301.22108 499.04539 103.396775
As you can see, the number of applications in a given cycle is either flat or increasing. Yet in the prediction, there's a dip just after day 30. For the life of me, I cannot figure out what is causing it. Any ideas?