ARIMA: "Number of regressors does not match fitted model" error in forecast.forecast_ARIMA(fit, xreg = ) in R

I have a time series object named timeseries2 which is as shown below:
timeseries2
Time Series:
Start = 1
End = 49
Frequency = 1
sum_profit sum_quantity sum_discount sum_Segment sum_Ship_mode
1 2424.1125 269 9.45 145 105
2 866.1925 163 8.05 100 79
3 123.4122 527 23.15 329 223
4 3313.2568 543 17.20 352 207
5 2636.2171 468 18.65 277 208
6 5316.8660 506 21.42 245 212
I fit the time series with y = the sum_profit column and x = the remaining columns (sum_quantity, sum_discount, sum_Segment and sum_Ship_mode), and then try to forecast the next 8 periods. I get the error shown below:
library(forecast)
(fit <- auto.arima(timeseries2[,"sum_profit"],
                   xreg=timeseries2[,c(2:5)]))
fcast <- forecast(fit, xreg=rep(mean(timeseries2[,c(2:5)]),8))
Error in forecast.forecast_ARIMA(fit, xreg = rep(mean(timeseries2[,
c(2:5)]), : Number of regressors does not match fitted model

This error appears because rep(mean(timeseries2[,c(2:5)]),8) produces a plain vector of length 8 (a single overall mean repeated), whereas the fitted model expects future values for all 4 regressors, i.e. a matrix with 4 columns. The following adjustment will run:
fcast <- forecast(fit, xreg=matrix(rep(mean(timeseries2[,c(2:5)]),8),ncol=4))
Of course, this only gives a 2-period forecast, since 8 values arranged in 4 columns make just 2 rows of future regressors, but that is easily solved by supplying 8 rows. You will also get a warning unless you give the matrix columns names that match your original data; it can be ignored as long as the columns are supplied in the same order as the original regressors.
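If you want a genuine 8-period forecast, one way (a sketch, assuming the column means are reasonable stand-ins for the future regressor values) is to repeat each column mean 8 times and keep the column names so they match the fitted regressors:
# build an 8-row matrix of future regressor values, one column per regressor
future_xreg <- matrix(rep(colMeans(timeseries2[, 2:5]), each = 8), nrow = 8)
colnames(future_xreg) <- colnames(timeseries2)[2:5]
fcast <- forecast(fit, xreg = future_xreg)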

Related

Autocorrelation Functions - time series analysis in R

I have the following df:
TS A_f1 A_p B_f1 B_p C_f1 C_p
1 10 100 15 150 17 170
2 20 200 25 250 27 270
3 30 300 35 350 37 370
This is, however, only a simplification of my real df, which has 40k+ observations and 100+ features.
TS is a timestamp column; each row lists stores ("A", "B", "C", n ...) with their features (f1, p, f_n ...).
Before training an LSTM on my df, I want to use the acf function (or pacf) to find patterns in my data and do some feature selection beforehand.
Any idea how I can do this with my data?
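One possible starting point (a sketch, assuming the data frame is called df and that a lag.max of 20 is acceptable) is to run acf column-wise over the feature columns:
feature_cols <- setdiff(names(df), "TS")
acf_list <- lapply(df[feature_cols], acf, lag.max = 20, plot = FALSE)
acf_list[["A_f1"]]   # inspect the autocorrelations of one feature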

Creating and plotting confidence intervals

I have fitted a Poisson GLM to my data, and I now wish to create 95% CIs and plot them with my data. I'm having a couple of issues when plotting: the intervals don't capture my data and just seem to trace the same line as the model, and I'm also unsure whether I've created the CIs for the mean the correct way. My data and code are below if anyone knows how to fix this.
data used
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
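# note: model3 uses a log link, so the back-transform to the mean scale is exp(),
# e.g. upper <- exp(upper); lower <- exp(lower); the logistic transform above
# would only be appropriate for a logit link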
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
The outcome I currently get shows no data points: the model line is correct here, but the CIs aren't capturing the data, so I think the CIs are being constructed incorrectly somewhere.
Edit: response to the OP providing the full data set.
This started out as a question about plotting data and models on the same graph, but it has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a poisson glm is just not a good model. Put differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data strongly suggests that you have at least two, and perhaps more, regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one, actually). The code below uses the segmented package to build successive poisson glms using 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred:=
predict(
segmented(glm(cases~date, .SD, family = poisson), seg.Z = ~date, npsi=1),
type='response', se.fit=TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the number of breakpoints, but not where they are - the algorithm figures that out for you. So here we see a regime prior to 3Q87 which is well modeled by a poisson glm, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))+
facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the poisson glm the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
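For instance, since segmented fits inherit from glm, one simple comparison (a minimal sketch, not the package's dedicated selection tools) is to look at AIC/BIC across fits with different numbers of breakpoints:
# refit with 1 to 4 breakpoints and compare information criteria (lower is better)
fits <- lapply(1:4, \(n) segmented(glm(cases ~ date, aids, family = poisson),
                                   seg.Z = ~date, npsi = n))
data.frame(npsi = 1:4, AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))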
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()

How to estimate the median survival with upper and lower confidence limits for the median at 90% confidence levels?

My goal is to estimate the median survival, with upper and lower confidence limits for the median at the 90% confidence level, using a survfit object.
library(tidyverse)   # read_csv, filter, tibble, ggplot
library(survival)    # Surv, survfit
library(survminer)   # surv_median
churn_dat <- read_csv("https://raw.githubusercontent.com/square/pysurvival/master/pysurvival/datasets/churn.csv")
churn_dat <- churn_dat %>% filter(months_active > 0)
#create a function of the dataframe by sizes
boot <- function(size, n_sims){
  #1. filter data into a particular size
  df <- churn_dat %>% filter(company_size == size)
  n <- nrow(df)
  #2. run the bootstrap
  experiments <- tibble(experiment = rep(1:n_sims, each = n),
                        index = sample(1:n, size = n * n_sims, replace = TRUE),
                        time_star = df$months_active[index],
                        event_star = df$churned[index])
  return(experiments)
}
#create a function for plotting
plot_boot_data <- function(experiments){
  fit <- survfit(Surv(time_star, event_star) ~ experiment, data = experiments)
  #get the median of surv
  med <- surv_median(fit)
  med <- data.frame(med = med$median)
  ggplot(med, aes(x = med, fill = med)) +
    geom_histogram(binwidth = .8) + theme_bw()
}
df_10to50 <- boot("10-50",10)
plot_boot_data(df_10to50)
I have found a similar function, surv_median(), to do this, but its confidence level is 95%.
How can I construct the same thing with the confidence level set to 90%?
The surv_median function in pkg:survminer essentially does what someone screen-scraping the console would do after running the non-exported survmean function in pkg:survival (note the need for the triple-colon ':::' operator to reach it). surv_median uses hard-coded column names, so it cannot handle a fit object constructed with a different value of the conf.int argument to survfit. If you want the output of survmean from such a call, it's not at all difficult. Using your data:
fit <- survfit(Surv(time_star, event_star) ~ experiment, data = df_10to50, conf.int=0.9)
med <- survival:::survmean(fit,rmean=FALSE)
med # result is a named list
#------------
$matrix
records n.max n.start events rmean se(rmean) median 0.9LCL 0.9UCL
experiment=1 673 673 673 298 7.347565 0.2000873 7 5 12
experiment=2 673 673 673 309 7.152863 0.2028425 6 5 10
experiment=3 673 673 673 298 7.345891 0.2068490 9 5 12
experiment=4 673 673 673 323 7.035011 0.1981676 5 4 7
experiment=5 673 673 673 313 7.044400 0.2074104 6 5 9
experiment=6 673 673 673 317 7.061878 0.2021348 6 4 9
experiment=7 673 673 673 311 7.029602 0.2081835 5 4 9
experiment=8 673 673 673 301 7.345766 0.2032876 9 6 10
experiment=9 673 673 673 318 6.912700 0.2050143 7 5 9
experiment=10 673 673 673 327 6.988065 0.1990601 5 4 7
$end.time
[1] 12 12 12 12 12 12 12 12 12 12
If you wanted the median and its bounds at the 0.9 confidence level, they could be obtained with:
med$matrix[ 1 , 7:9] # using numbers instead of column names.
#----------
median 0.9LCL 0.9UCL
7 5 12
I'm afraid there is not a sufficient description of your goal, or of the process of getting there, for me to make sense of the dplyr/magrittr chain of logic, so I'm unable to fill in the proper places in the boot function or the handling of its output by ggplot2. I was initially very confused because you used a function named boot, and I thought you were doing a bootstrapped analysis, but there didn't seem to be any mechanism for getting bootstrapped results, i.e. no randomized selection of rows in an indexable dataset.
If you still wanted to make a purpose-built variant of surv_median, you might try modifying these lines inside its code:
.table <- .table %>% dplyr::select_(
.dots = c("strata", "median", "`0.95LCL`", "`0.95UCL`"))
I wasn't able to figure out what surv_median was doing with the "strata" column, since it didn't match the output of survmean, but that is probably because it uses summary.survfit rather than going directly to the function that summary.survfit calls to do the calculations. So happy hacking.
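As a rough sketch of that idea (the helper name surv_median_90 is made up here; the column names come from the survmean output shown above), you could pull the median and 0.9 bounds for every stratum directly from survmean's matrix:
surv_median_90 <- function(fit) {
  # fit must come from survfit(..., conf.int = 0.9) so the 0.9 bounds exist
  m <- survival:::survmean(fit, rmean = FALSE)$matrix
  as.data.frame(m[, c("median", "0.9LCL", "0.9UCL"), drop = FALSE])
}
med90 <- surv_median_90(fit)   # one row per experiment/stratum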

Stratifying multiple columns for cross-validation

There are many ways I've seen to stratify a sample by a single variable to use for cross-validation. The caret package does this nicely with the createFolds() function. By default it seems that caret will partition such that each fold has roughly the same target event rate.
What I want to do, though, is stratify by the target rate and by time. I've found a function that can partially do this: the stratified() function in the splitstackshape package. The issue with that function, though, is that it returns a single sample; it doesn't split the data into k groups under the given conditions.
Here's some dummy data to reproduce.
set.seed(123)
time = rep(seq(1:10),100)
target = rbinom(n=100, size=1, prob=0.3)
data = as.data.frame(cbind(time,target))
table(data$time,data$target)
0 1
1 60 40
2 80 20
3 80 20
4 60 40
5 80 20
6 80 20
7 60 40
8 60 40
9 70 30
10 80 20
As you can see, the target event rate is not the same across time. It's 40% in time 1 and 20% in time 2, etc. I want to preserve this when creating the folds used for cross-validation. If I understand correctly, caret will partition by the overall event rate.
table(data$target)
0 1
710 290
This rate of ~30% will be preserved overall, but the target event rate over time will not.
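For reference, that default behaviour corresponds to something like the following (a sketch; createFolds stratifies only on the outcome you pass it):
library(caret)
folds <- createFolds(factor(data$target), k = 5, list = TRUE, returnTrain = FALSE)
sapply(folds, function(i) mean(data$target[i]))   # roughly 0.29 in every fold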
We can get one sample like this:
library(splitstackshape)
train.index <- stratified(data,c("target","time"),size=.2)
I need to repeat this, though, 4 more times for a 5-fold cross-validation, and it needs to be done such that once a row is assigned to a fold it cannot be assigned again. I feel like there should be a function designed for this already. Any ideas?
I know this post is old, but I just had the same problem and couldn't find another solution. In case anyone else needs an answer, here's the solution I'm implementing.
library(data.table)
mystratified <- function(indt, group, NUM_FOLDS) {
  indt <- setDT(copy(indt))
  if (is.numeric(group))
    group <- names(indt)[group]
  temp_grp <- temp_ind <- NULL
  indt[, `:=`(temp_ind, .I)]
  indt[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
  samp_sizes <- indt[, .N, by = group]
  samp_sizes[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
  inds <- split(indt$temp_ind, indt$temp_grp)[samp_sizes$temp_grp]
  z <- unlist(inds, use.names = FALSE)
  model_folds <- suppressWarnings(split(z, 1:NUM_FOLDS))
  model_folds
}
It is basically a rewrite of splitstackshape::stratified. It works like the following, giving as output a list of validation indices for each fold.
myfolds = mystratified(indt = data, group = colnames(data), NUM_FOLDS = 5)
str(myfolds)
List of 5
$ 1: int [1:200] 1 91 181 261 351 441 501 591 681 761 ...
$ 2: int [1:200] 41 101 191 281 361 451 541 601 691 781 ...
$ 3: int [1:200] 51 141 201 291 381 461 551 641 701 791 ...
$ 4: int [1:200] 61 151 241 301 391 481 561 651 741 801 ...
$ 5: int [1:200] 81 161 251 341 401 491 581 661 751 841 ...
So, for instance the train and validation data for each fold are:
# first fold
train = data[-myfolds[[1]],]
valid = data[myfolds[[1]],]
# second fold
train = data[-myfolds[[2]],]
valid = data[myfolds[[2]],]
# etc...
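If you want to check that the stratification worked (an illustrative check, not part of the original answer), compare the per-time event rates in a validation fold with those of the full data:
# proportion of each target value within each time value, full data vs. first fold
prop.table(table(data$time, data$target), margin = 1)
prop.table(table(data$time[myfolds[[1]]], data$target[myfolds[[1]]]), margin = 1)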

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary, I would like to plot a figure of estimated probabilities from a binary logistic regression (glm) model I produced. I have already gone through model selection, validation, etc., and am now simply trying to produce figures. I had no problem plotting probability curves for the model I selected, but what I am really interested in is producing a figure that shows the probability of a binary outcome for one predictor variable when the other predictor variable is held constant.
I cannot figure out how to assign this constant value to only one of the predictor variables and plot the probability for the other variable. Ultimately I would like to produce figures similar to the crude example I attached (desired output). I admit I am a novice in R, and I certainly appreciate folks' time, but I have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question, but I found the explanation vague and it failed to provide an example of assigning one predictor a constant value while plotting the probability of the other predictor: https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below I provide a simulated dataset and my progress. Thank you very much for your expertise; I believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
#Dataset name is x
# time to run the glm model
model1<-glm(formula=survival ~ mass + depth, family = "binomial", data=x)
model1
summary(model1)
#OK, now here's how I predict the probability of a lizard "Bob" surviving the winter with a mass of 32.949 grams and a burrow depth of 264 mm
newdata <- data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
#the lizard "Bob" has an 87.3% chance of surviving the winter
#Now let's assume the glm model was robust and the lizard was endangered;
#from all my research I know the average burrow depth is 263.9 mm at a national park.
#Let's say I am also interested in survival probabilities at burrow depths of 200 and 100 mm, respectively.
#How do I use the valuable glm model produced above to generate a plot
#showing the probability of lizards surviving at the burrow depths stated above,
#across a range of mass values from 0.0 to 100.0 grams?
#I know I need to use the plot and predict functions, but I cannot figure out how to tell R that I
#want to use the glm model I produced to predict "survival" based on "mass" when the other predictor "depth" is held at constant values of biological relevance.
#I would also like to add dashed lines for the 95% CI.
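One way to get there (a minimal sketch following the setup above, not a vetted answer; the depth values are the ones mentioned in the comments) is to predict over a grid of mass values with depth held fixed, build the intervals on the link scale, and back-transform with the inverse logit:
mass_grid <- seq(0, 100, by = 0.5)
plot(NULL, xlim = c(0, 100), ylim = c(0, 1),
     xlab = "Mass (g)", ylab = "Predicted probability of survival")
for (d in c(263.9, 200, 100)) {
  nd <- data.frame(mass = mass_grid, depth = d)
  pr <- predict(model1, newdata = nd, type = "link", se.fit = TRUE)
  lines(mass_grid, plogis(pr$fit))                              # estimated probability
  lines(mass_grid, plogis(pr$fit + 1.96 * pr$se.fit), lty = 2)  # upper 95% limit
  lines(mass_grid, plogis(pr$fit - 1.96 * pr$se.fit), lty = 2)  # lower 95% limit
}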
