Creating and plotting confidence intervals - r

I have fitted a GLM to my data (the code below uses a Poisson family with a log link), and I now wish to create 95% CIs and add them to the plot. I'm having a couple of issues with the plotting: I can't get the intervals to capture my data; they just seem to follow the same line as the model without capturing the data points. I'm also unsure whether I've created the CIs for the mean correctly. My data and code are below, if anyone knows how to fix this.
Data used:
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
The outcome I currently get has no data points. The model line looks correct, but the CIs don't capture the data, so I think I've constructed them incorrectly somewhere.

Edit: Response to OP's providing full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a Poisson GLM on date alone is just not a good model. To put it differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data strongly suggests that you have at least two, and perhaps more, regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one, actually). The code below uses the segmented package to fit a Poisson GLM with 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred :=
  predict(
    segmented(glm(cases ~ date, .SD, family = poisson), seg.Z = ~date, npsi = 1),
    type = 'response', se.fit = TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the number of breakpoints, but not where they are; the algorithm figures that out for you. So here, we see a regime prior to 3Q87 which is well modeled using a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) {
  copy(aids)[, c('pred', 'se') := get.pred(x, .SD)][, npsi := x]
}))
ggplot(gg.dt, aes(x = date)) +
  geom_ribbon(aes(ymin = pred - 1.96 * se, ymax = pred + 1.96 * se), fill = 'grey80') +
  geom_line(aes(y = pred)) +
  geom_point(aes(y = cases)) +
  facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the Poisson GLM, the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()
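For readers who prefer base R, closer to the code in the question, here is a minimal sketch (not the approach used above). It builds the interval on the link scale and back-transforms with the model's own inverse link, which for a log link is exp() rather than the logistic transform used in the question. It also draws the points once with plot() and then adds the curves with lines(), since a second plot() call starts a new figure. The column names lower and upper are my own. Keep in mind that this is a confidence band for the fitted mean, so it is expected to be narrow and is not meant to enclose the individual data points.

```r
# aids data as listed in the question (quarterly dates from 83.00 to 94.00)
aids <- data.frame(
  cases = c(2, 6, 10, 8, 12, 9, 28, 28, 36, 32, 46, 47, 50, 61, 99, 95,
            150, 143, 197, 159, 204, 168, 196, 194, 210, 180, 277, 181,
            327, 276, 365, 300, 356, 304, 307, 386, 331, 368, 416, 374,
            412, 358, 416, 414, 496),
  date = seq(83, 94, by = 0.25)
)

fit <- glm(cases ~ date, data = aids, family = poisson(link = "log"))

# Predict on the linear-predictor (link) scale, with standard errors
pr <- predict(fit, type = "link", se.fit = TRUE)

# Use the inverse link stored in the model; for a log link this is exp()
inv <- family(fit)$linkinv
aids$pred  <- inv(pr$fit)
aids$lower <- inv(pr$fit - 1.96 * pr$se.fit)
aids$upper <- inv(pr$fit + 1.96 * pr$se.fit)

# Draw the data first, then add the fitted curve and the band on top
plot(aids$date, aids$cases, pch = 20, xlab = "Date", ylab = "Cases")
lines(aids$date, aids$pred, lwd = 2)
lines(aids$date, aids$lower, lwd = 2, lty = 2)
lines(aids$date, aids$upper, lwd = 2, lty = 2)
```

Taking the inverse link from family(fit) means the same code still works if the family or link is changed later.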

Related

Why am I getting similar CIs with so different sample sizes?

I just learned how to do the bootstrap in R, and I'm excited. I was playing with some data and found that, no matter how many bootstrap samples I take, the CIs always seem to be about the same. I believed that the more samples I take, the narrower the CI should be. Here's the code.
library(boot)
M.<-function(dados,i){
d<-dados[i,]
mean(d$queimadas)
}
bootmu<-boot(dados,statistic=M.,R=10000)
boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates
CALL :
boot.ci(boot.out = bootmu)
Intervals :
Level Normal Basic
95% (18.36, 21.64 ) (18.37, 21.63 )
Level Percentile BCa
95% (18.37, 21.63 ) (18.37, 21.63 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(bootmu) : bootstrap variances needed for studentized intervals
As one can see, I took 10000 samples. Now let's try with just 100.
bootmu<-boot(dados,statistic=M.,R=100)
boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates
CALL :
boot.ci(boot.out = bootmu)
Intervals :
Level Normal Basic
95% (18.33, 21.45 ) (18.19, 21.61 )
Level Percentile BCa
95% (18.39, 21.81 ) (18.10, 21.10 )
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In boot.ci(bootmu) :
bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) :
extreme order statistics used as endpoints
The sample size is many times lower, but the CIs are essentially the same. Why?
If anyone wants to replicate the exact same example, here's the data.
> dados
queimadas plantacoes
1 27 418
2 13 353
3 21 239
4 14 251
5 18 482
6 18 361
7 22 213
8 24 374
9 21 298
10 15 182
11 23 413
12 17 218
13 10 299
14 23 306
15 22 267
16 18 56
17 24 538
18 19 424
19 15 64
20 16 225
21 25 266
22 21 218
23 24 424
24 26 38
25 19 309
26 20 451
27 16 351
28 15 174
29 24 302
30 30 492
The confidence interval for your estimator does not depend on the number of bootstrap replicates; it depends on the size of the original dataset.
Increasing the number of bootstrap replicates will increase the precision with which the sampling distribution (and hence the confidence interval) is calculated, but it cannot make your estimate of the mean more precise.
Try calculating the confidence interval around the mean using an analytic method for comparison.
> confint(lm(dados$queimadas~1))
2.5 % 97.5 %
(Intercept) 18.27624 21.72376
You will see that the bootstraps with 100 and with 10000 replicates both approximate the CI calculated by linear regression fairly well.
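The point can be checked with a small sketch in base R. This is a hand-rolled percentile bootstrap rather than boot.ci(), and the helper names boot_ci and width are my own. The last case repeats the 30 observations four times purely as a hypothetical stand-in for a larger sample; it is not real additional data.

```r
set.seed(1)

# The queimadas column from the question's data
queimadas <- c(27, 13, 21, 14, 18, 18, 22, 24, 21, 15, 23, 17, 10, 23, 22,
               18, 24, 19, 15, 16, 25, 21, 24, 26, 19, 20, 16, 15, 24, 30)

# Percentile bootstrap CI for the mean, using R resamples of x
boot_ci <- function(x, R) {
  means <- replicate(R, mean(sample(x, replace = TRUE)))
  quantile(means, c(0.025, 0.975))
}
width <- function(ci) unname(diff(ci))

ci_100   <- boot_ci(queimadas, R = 100)     # few replicates: noisier endpoints
ci_10000 <- boot_ci(queimadas, R = 10000)   # many replicates: stabler endpoints

# Hypothetical larger sample (4 x 30 observations): the interval narrows
ci_big <- boot_ci(rep(queimadas, 4), R = 10000)

width(ci_100)
width(ci_10000)
width(ci_big)
```

The first two widths should be similar because both are based on the same 30 observations; only the third, which has more data, should be clearly narrower.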

ARIMA Number of regressors does not match fitted model , Error in forecast.forecast_ARIMA(fit, xreg = ) in R

I have a time series object named timeseries2 which is as shown below:
timeseries2
Time Series:
Start = 1
End = 49
Frequency = 1
sum_profit sum_quantity sum_discount sum_Segment sum_Ship_mode
1 2424.1125 269 9.45 145 105
2 866.1925 163 8.05 100 79
3 123.4122 527 23.15 329 223
4 3313.2568 543 17.20 352 207
5 2636.2171 468 18.65 277 208
6 5316.8660 506 21.42 245 212
I fit the model with y = the sum_profit column and x = the remaining columns (sum_quantity, sum_discount, sum_Segment and sum_Ship_mode), and then try to forecast the next 8 periods. I get the error shown below:
(fit <- auto.arima(timeseries2[,"sum_profit"],
xreg=timeseries2[,c(2:5)]))
fcast <- forecast(fit, xreg=rep(mean(timeseries2[,c(2:5)]),8))
Error in forecast.forecast_ARIMA(fit, xreg = rep(mean(timeseries2[,
c(2:5)]), : Number of regressors does not match fitted model
This error appears because rep(mean(timeseries2[,c(2:5)]),8) returns a plain vector (mean() of a matrix is a single number, here repeated 8 times), whereas your ARIMA model needs a matrix with 4 columns, one per regressor. The following adjustment will run:
fcast <- forecast(fit, xreg=matrix(rep(mean(timeseries2[,c(2:5)]),8),ncol=4))
Of course, this will only give you a 2-period forecast, since the 8 values fill just 2 rows of 4 columns, but that is easily solved by supplying one row per forecast period. You will get a warning unless you give the matrix column names that match your original data, but this can safely be ignored if you check your input properly.
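As a sketch of how the future regressors could be built with one row per forecast period, the code below uses colMeans() so that each regressor is held at its own mean, and it carries over the column names. The object x_hist is a hypothetical stand-in containing only the six rows printed in the question, and holding regressors at their means is an assumption carried over from the question rather than a recommendation.

```r
# Hypothetical stand-in for timeseries2[, 2:5], using the six rows shown above
x_hist <- cbind(
  sum_quantity  = c(269, 163, 527, 543, 468, 506),
  sum_discount  = c(9.45, 8.05, 23.15, 17.20, 18.65, 21.42),
  sum_Segment   = c(145, 100, 329, 352, 277, 245),
  sum_Ship_mode = c(105, 79, 223, 207, 208, 212)
)

h <- 8  # number of periods to forecast

# One row per future period, one column per regressor, names matching the fit
future_xreg <- matrix(colMeans(x_hist), nrow = h, ncol = ncol(x_hist),
                      byrow = TRUE,
                      dimnames = list(NULL, colnames(x_hist)))
dim(future_xreg)

# With the forecast package loaded and `fit` from the question available:
# fcast <- forecast(fit, xreg = future_xreg)
```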

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary, I would like to plot estimated probabilities from a binary logistic regression model that I fitted with glm. I have already gone through model selection, validation, etc., and am now simply trying to produce figures. I had no problem plotting probability curves for the model I selected, but what I am really interested in is a figure that shows the probability of a binary outcome across one predictor variable when the other predictor variable is held constant.
I cannot figure out how to assign this constant value to only one of the predictor variables and plot the probability for the other variable. Ultimately I would like to produce figures similar to the crude example of the desired output that I attached. I admit I am a novice in R and I certainly appreciate folks' time, but I have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question, but I found the explanation vague, and it did not provide an example of assigning one predictor a constant value while plotting the probability for the other predictor: https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below I have provided a simulated dataset and my progress. Thank you very much for your expertise; I believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
# Dataset name is x
# Time to run the glm model
model1 <- glm(formula = survival ~ mass + depth, family = "binomial", data = x)
model1
summary(model1)
# Here is how I predict the probability of a lizard "Bob" surviving the winter with a mass of 32.949 grams and a burrow depth of 264 mm
newdata <- data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
# The lizard "Bob" has an 87.3% chance of surviving the winter
# Now let's assume the glm model was robust and the lizard was endangered.
# From all my research I know the average burrow depth is 263.9 mm at a national park.
# Let's say I am also interested in survival probabilities at burrow depths of 200 and 100 mm, respectively.
# How do I use the glm model produced above to generate a plot
# showing the probability of lizards surviving at the burrow depths stated above
# across a range of mass values from 0.0 to 100.0 grams?
# I know I need to use the plot and predict functions, but I cannot figure out how to tell R that I
# want to use the glm model to predict "survival" based on "mass" when the other predictor, "depth", is held at constant values of biological relevance.
# I would also like to add dashed lines for the 95% CI.
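One possible approach, sketched under assumptions: build a newdata frame in which mass varies over a grid while depth is fixed, predict on the link scale with standard errors, and back-transform with plogis(). The data are retyped from the listing above, and mass_grid, depths and curve_at_depth are names I made up for illustration. Depths of 200 and 100 mm, and masses above about 42 g, lie outside the observed data, so those curves are extrapolations.

```r
# Simulated lizard data, retyped from the listing in the question
x <- data.frame(
  survival = rep(c(0, 1), times = c(20, 22)),
  mass = c(4.294456, 8.359857, 10.74058, 10.74058, 6.384678, 6.384678,
           11.59638, 11.59638, 4.294456, 4.294456, 8.359857, 8.359857,
           8.359857, 7.920406, 7.920406, 7.920406, 10.74058, 10.74058,
           38.82496, 9.91684, 6.384678, 6.384678, 11.59638, 11.59638,
           11.59638, 23.70952, 23.70952, 23.70952, 38.56897, 38.56897,
           6.581013, 6.581013, 0.766564, 5.440803, 5.440803, 19.53471,
           19.53471, 8.359857, 10.74058, 38.82496, 38.82496, 41.55697),
  depth = c(262, 261, 257, 257, 257, 257, 270, 270, 262, 262, 261, 261,
            261, 258, 258, 261, 257, 258, 262, 239, 257, 257, 270, 270,
            270, 288, 288, 288, 262, 262, 295, 298, 269, 262, 262, 252,
            259, 263, 257, 264, 264, 239)
)

model1 <- glm(survival ~ mass + depth, family = binomial, data = x)

mass_grid <- seq(0, 100, length.out = 200)  # mass values to predict over
depths <- c(263.9, 200, 100)                # depths held constant, one curve each

# Predicted probability and approximate 95% CI at a fixed depth d
curve_at_depth <- function(d) {
  nd <- data.frame(mass = mass_grid, depth = d)
  pr <- predict(model1, newdata = nd, type = "link", se.fit = TRUE)
  data.frame(mass = mass_grid, depth = d,
             prob  = plogis(pr$fit),
             lower = plogis(pr$fit - 1.96 * pr$se.fit),
             upper = plogis(pr$fit + 1.96 * pr$se.fit))
}
curves <- lapply(depths, curve_at_depth)

# Observed outcomes as points, then one solid curve and dashed CI per depth
plot(x$mass, x$survival, xlim = c(0, 100), ylim = c(0, 1), pch = 20,
     xlab = "Mass (g)", ylab = "Estimated probability of survival")
for (i in seq_along(curves)) {
  cv <- curves[[i]]
  lines(cv$mass, cv$prob, col = i, lwd = 2)
  lines(cv$mass, cv$lower, col = i, lty = 2)
  lines(cv$mass, cv$upper, col = i, lty = 2)
}
legend("bottomright", legend = paste("depth =", depths, "mm"),
       col = seq_along(depths), lwd = 2)
```

Computing the interval on the link scale and then applying plogis() keeps the limits between 0 and 1, which adding and subtracting standard errors on the probability scale would not guarantee.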

glm giving different results in R 3.2.0 and R 3.2.2. Was there any substantial change in usage?

I use glm monthly to calculate a binomial model on the payment behaviour of a credit database, using a call like:
modelx = glm(paid ~ ., data = credit_db, family = binomial())
For the last month, I use R version 3.2.2 (just recently upgraded) and the results were very different than the previous month (done with R version 3.2.0). In order to check the code, I repeated the previous month calculations with version 3.2.2 and got different results from the previous calculation done in R 3.2.0.
The coefficients are also wildly different. I start with an exploratory model that includes a variable holding the average number of delinquency days during the month, which should yield low coefficients for low averages. In version 3.2.0, an extract of summary(modelx) was:
## Coefficients: Estimate Std. Error z value
## delinquency_avg_days1 -0.59329 0.18581 -3.193
## delinquency_avg_days2 -1.32286 0.19830 -6.671
## delinquency_avg_days3 -1.47359 0.21986 -6.702
## delinquency_avg_days4 -1.64158 0.21653 -7.581
## delinquency_avg_days5 -2.56311 0.25234 -10.158
## delinquency_avg_days6 -2.59042 0.25886 -10.007
and for version 3.2.2
## Coefficients Estimate Std. Error z value
## delinquency_avg_days.L -1.320e+01 1.083e+03 -0.012
## delinquency_avg_days.Q -1.140e+00 1.169e+03 -0.001
## delinquency_avg_days.C 3.439e+00 1.118e+03 0.003
## delinquency_avg_days^4 8.454e+00 1.020e+03 0.008
## delinquency_avg_days^5 3.733e+00 9.362e+02 0.004
## delinquency_avg_days^6 -4.988e+00 9.348e+02 -0.005
The summary output is a little different, since the Pr(>|z|) is shown. Notice that the coefficient names changed too.
In the dataset, this delinquency_avg_days variable has the following distribution (0 is "not paid", 1 is "paid"). As you can see, coefficients might be large for average days larger than 20 or so. The number of paid cases was sampled to closely match the number of "not paid" cases.
0 1
0 140 663
1 59 209
2 62 118
3 56 87
4 66 50
5 69 41
6 64 40
7 78 30
8 75 31
9 70 29
10 77 23
11 69 18
12 79 17
13 61 13
14 53 5
15 67 18
16 50 10
17 40 9
18 39 8
19 23 9
20 24 2
21 36 9
22 35 1
23 17 0
24 11 0
25 11 0
26 7 1
27 3 0
28 0 0
29 0 1
30 1 0
In previous months, I used this exploratory model to create a second binomial model using ranges of average delinquency days. But this other model gives similar results with a few levels.
Now, I'd like to know whether there are substantial changes that require specifying other parameters or there is an issue with glm in version 3.2.2.
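One thing that may be worth checking, offered as an assumption rather than a diagnosis: coefficient names ending in .L, .Q, .C and ^4 are what R produces for an ordered factor (polynomial contrasts), whereas names like delinquency_avg_days1 are what an unordered factor (treatment contrasts) produces. The sketch below uses simulated toy data with made-up column names (days_f, days_o) to show the two codings side by side; it does not reproduce the credit data.

```r
set.seed(42)

# Toy data: a 4-level "days" variable and a binary outcome
d <- data.frame(days = sample(0:3, 400, replace = TRUE))
d$paid <- rbinom(nrow(d), 1, plogis(1 - 0.5 * d$days))

d$days_f <- factor(d$days)                  # unordered factor
d$days_o <- factor(d$days, ordered = TRUE)  # ordered factor

m_f <- glm(paid ~ days_f, data = d, family = binomial())
m_o <- glm(paid ~ days_o, data = d, family = binomial())

names(coef(m_f))  # treatment contrasts: one coefficient per non-baseline level
names(coef(m_o))  # polynomial contrasts: .L, .Q, .C terms

# Same model in a different parameterisation: the fitted values agree
all.equal(unname(fitted(m_f)), unname(fitted(m_o)), tolerance = 1e-4)
```

If that is what happened, is.ordered(credit_db$delinquency_avg_days) would return TRUE in the newer session, and the place to look is how the column is created or read in rather than glm itself.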

What values to look at in cross validated linear regression in DAAG package

I performed the following on a data set that contains 151 variables with 161 observations:-
> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)
And got the following results:-
fold 1
Observations in test set: 16
7 11 12 24 33 38 52 67 72
Predicted 49.6 44.1 26.4 39.8 53.3 40.33 47.8 56.7 58.5
cvpred 575.0 -113.2 640.7 -1045.8 876.7 -5.93 2183.0 -129.7 212.6
RT..seconds. 42.0 44.0 44.0 45.0 45.0 46.00 49.0 56.0 58.0
CV residual -533.0 157.2 -596.7 1090.8 -831.7 51.93 -2134.0 185.7 -154.6
What I want to do is compare the predicted results to the actual experimental results, so I can plot a graph of the two against each other to show how similar they are. Am I right in assuming I would do this by using the values in the Predicted row as my predicted results, and not the cvpred row?
I only ask this as when I performed the very same thing in the caret package, the predicted and the observed values came out to be far more different from one another:-
library(caret)
ctrl <- trainControl(method = "cv", savePred=T, classProb=T)
mod <- train(RT..seconds.~., data=cadets, method = "lm", trControl = ctrl)
mod$pred
pred obs rowIndex .parameter Resample
1 141.2 42 6 none Fold01
2 -504.0 42 7 none Fold01
3 1196.1 44 16 none Fold01
4 45.0 45 27 none Fold01
5 262.2 45 35 none Fold01
6 570.9 52 58 none Fold01
7 -166.3 53 61 none Fold01
8 -1579.1 59 77 none Fold01
9 2699.0 60 79 none Fold01
The model shouldn't be this inaccurate, as I originally started from 1664 variables and reduced them using a random forest so that only variables with a variable importance greater than 1 were used, which massively reduced my dataset from 162 * 1664 to 162 * 151.
If someone could explain this to me I'd be grateful, thanks.
I think there are a few areas of confusion here; let me try to clear them up for you.
The "Predicted" section from cv.lm does not correspond to results from cross-validation. If you're interested in cross-validation, then you need to look at your "cvpred" results; "Predicted" corresponds to predictions from the model fit using all of your data.
The reason there is such a large difference between your predictions and your cvpred values is likely that your final model is overfitting, which should illustrate why cross-validation is so important.
I believe that you are calling cv.lm incorrectly. I've never used the package, but I think you want to pass in something like cv.lm(df = cadets, RT..seconds.~., m = 10) rather than your fit object. I'm not sure why you see such a large difference between your cvpred and Predicted values in the example above, but these results tell me that passing in a model will lead to using a model that was fit on all of the data for each CV fold:
library(DAAG)
fit <- lm(Sepal.Length ~ ., data=iris)
mod1 <- cv.lm(df=iris,fit,m=10)
mod2 <- cv.lm(df=iris,Sepal.Length ~ .,m=10)
> sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
[1] 0.318
> sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
[1] 5.94
> sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
[1] 0.0311
> sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
[1] 5.94
The reason your caret results look so different is that you were looking at the "Predicted" section of the cv.lm output. "cvpred" should line up closely with caret (although make sure to match indices on your CV results), and if you want to line up the "Predicted" results with caret, you will need to get your predictions using something like predict(mod, cadets).
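To make the distinction between in-sample predictions and cross-validated predictions concrete, here is a minimal hand-rolled 10-fold sketch in base R on iris. It does not use DAAG or caret, and the names folds, cvpred and rmse are my own. Each observation is predicted by a model that never saw it.

```r
set.seed(1)
k <- 10

# Randomly assign each row of iris to one of k folds
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Out-of-fold predictions: fit on the other k - 1 folds, predict the held-out fold
cvpred <- numeric(nrow(iris))
for (f in seq_len(k)) {
  test <- folds == f
  fit_f <- lm(Sepal.Length ~ ., data = iris[!test, ])
  cvpred[test] <- predict(fit_f, newdata = iris[test, ])
}

# In-sample predictions: the model is fit on, and predicts, all rows
full_fit <- lm(Sepal.Length ~ ., data = iris)

rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse_in <- rmse(fitted(full_fit), iris$Sepal.Length)  # analogue of "Predicted"
rmse_cv <- rmse(cvpred, iris$Sepal.Length)            # analogue of "cvpred"
c(in_sample = rmse_in, cross_validated = rmse_cv)
```

For a well-behaved model like this one the two errors should be close, with the cross-validated error the larger of the two. With 151 predictors and about 160 observations, as in the question, the gap can become very large, which is the overfitting described above.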
