How to perform repeated k-fold cross validation in R with the DAAG package?

I have created a 3-fold cross-validated linear regression model using the houseprices data set from the DAAG package. I have read some threads here and on Cross Validated, and it was mentioned multiple times that the cross-validation must be repeated many times (like 50 or 100) for robustness. I'm not sure what that means. Does it mean simply running the code 50 times and calculating the average of the overall ms?
> cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area+bedrooms),
+ m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
+ main="Small symbols show cross-validation predicted values",
+ legend.pos="topleft", printit = TRUE)
Analysis of Variance Table
Response: sale.price
Df Sum Sq Mean Sq F value Pr(>F)
area 1 18566 18566 17.0 0.0014 **
bedrooms 1 17065 17065 15.6 0.0019 **
Residuals 12 13114 1093
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fold 1
Observations in test set: 5
11 20 21 22 23
Predicted 206 249 259.8 293.3 378
cvpred 204 188 199.3 234.7 262
sale.price 215 255 260.0 293.0 375
CV residual 11 67 60.7 58.3 113
Sum of squares = 24351 Mean square = 4870 n = 5
fold 2
Observations in test set: 5
10 13 14 17 18
Predicted 220.5 193.6 228.8 236.6 218.0
cvpred 226.1 204.9 232.6 238.8 224.1
sale.price 215.0 112.7 185.0 276.0 260.0
CV residual -11.1 -92.2 -47.6 37.2 35.9
Sum of squares = 13563 Mean square = 2713 n = 5
fold 3
Observations in test set: 5
9 12 15 16 19
Predicted 190.5 286.3 208.6 193.3 204
cvpred 174.8 312.5 200.8 178.9 194
sale.price 192.0 274.0 212.0 220.0 222
CV residual 17.2 -38.5 11.2 41.1 27
Sum of squares = 4323 Mean square = 865 n = 5
Overall (Sum over all 5 folds)
ms
2816
Every time I repeat it I get the same ms = 2816. Can someone please explain what exactly it means to repeat the CV 100 times? Repeating this code 100 times doesn't seem to change the ms.

Repeating this code 100 times will not change anything: you have set a seed, so the data are partitioned into the same three folds on every run, and all 100 repetitions will return the same mean square error. Repeated k-fold cross-validation means re-running the k-fold split with a different random partition on each repetition (for example, a different seed per repetition) and averaging the resulting error estimates.
It does not seem like you have enough samples for 50 or 100 folds to be appropriate, and there is no set number of folds that is appropriate across all data sets.
The number of folds should be chosen so that each test fold still contains sufficient data.
Also, you do not want to run multiple CV models with different seeds in order to find the best-performing seed; that form of error hacking is a proxy for overfitting.
You should groom your data well, engineer and transform your variables properly, pick a reasonable number of folds, set a seed so your stakeholders can reproduce your findings, and then build your model.
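If you do want the repeated-CV estimate itself, here is a minimal sketch. It assumes the cv.lm interface from the question accepts plotit = FALSE to suppress plotting and returns the data frame with the cvpred column shown in the output above; the per-repetition ms is recomputed by hand from cvpred rather than read from the printed output.
library(DAAG)
# Re-run the 3-fold split with a different seed each time and average the
# per-repetition mean squares (averaging over seeds, not picking the best one).
ms.per.rep <- sapply(1:100, function(s) {
  out <- cv.lm(data = DAAG::houseprices,
               form.lm = formula(sale.price ~ area + bedrooms),
               m = 3, seed = s, printit = FALSE, plotit = FALSE)
  mean((out$cvpred - out$sale.price)^2)  # overall ms for this repetition
})
mean(ms.per.rep)  # repeated-CV estimate of the mean square error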

Related

Creating and plotting confidence intervals

I have fitted a GLM (Poisson, log link) to my data, and I now wish to create 95% CIs and plot them with my data. I'm having a couple of issues when plotting: I can't get the intervals to capture my data, they just seem to plot the same line as the model without capturing the data points. I'm also unsure that I've created my CIs the correct way here for the mean. I've entered my data and code below if anyone knows how to fix this.
data used
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
The outcome I currently get shows no data points; the model line is correct here, but the CIs aren't being drawn around the data, so I think the CIs are being constructed incorrectly somewhere.
Edit: response to the OP providing the full data set.
This started out as a question about plotting data and models on the same graph, but it has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots it seems clear that poisson glm is just not a good model. To say it differently, the number of cases may vary with date, but is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one actually). The code below uses the segmented package to build successive poisson glm using 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred:=
predict(
segmented(glm(cases~date, .SD, family = poisson), seg.Z = ~date, npsi=1),
type='response', se.fit=TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the count of breakpoints, but not where they are - the algorithm figures that out for you. So here we see a regime prior to 3Q87 which is well modeled by a poisson glm, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in this data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
fit <- glm(cases~date, p.DT, family=poisson)
seg.fit <- segmented(fit, seg.Z = ~date, npsi=p.n)
predict(seg.fit, type='response', se.fit=TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))+
facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the poisson glm the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
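As a hedged illustration (my addition, not from the package documentation), one simple comparison is the AIC of the segmented fits with 1 to 4 breakpoints; segmented objects inherit from glm, so AIC applies directly, and lower values indicate a better trade-off between fit and complexity.
base.fit <- glm(cases ~ date, aids, family = poisson)
# AIC for segmented poisson fits with 1..4 breakpoints (lower is better)
sapply(1:4, function(n) AIC(segmented(base.fit, seg.Z = ~date, npsi = n)))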
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x=date))+
geom_ribbon(aes(ymin=pred-1.96*se, ymax=pred+1.96*se), fill='grey80')+
geom_line(aes(y=pred))+
geom_point(aes(y=cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x=date, y=cases))+
stat_smooth(method = glm, method.args=list(family=poisson))+
geom_point()

Why am I getting similar CIs with so different sample sizes?

I just learned how to do the bootstrap in R, and I'm excited. I was playing with some data and found that, no matter how many bootstrap samples I take, the CIs always come out about the same. I believed that the more samples, the narrower the CI should be. Here's the code.
library(boot)
M.<-function(dados,i){
d<-dados[i,]
mean(d$queimadas)
}
bootmu<-boot(dados,statistic=M.,R=10000)
boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates
CALL :
boot.ci(boot.out = bootmu)
Intervals :
Level Normal Basic
95% (18.36, 21.64 ) (18.37, 21.63 )
Level Percentile BCa
95% (18.37, 21.63 ) (18.37, 21.63 )
Calculations and Intervals on Original Scale
Warning message:
In boot.ci(bootmu) : bootstrap variances needed for studentized intervals
As one can see, I took 10000 samples. Now let's try with just 100.
bootmu<-boot(dados,statistic=M.,R=100)
boot.ci(bootmu)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates
CALL :
boot.ci(boot.out = bootmu)
Intervals :
Level Normal Basic
95% (18.33, 21.45 ) (18.19, 21.61 )
Level Percentile BCa
95% (18.39, 21.81 ) (18.10, 21.10 )
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In boot.ci(bootmu) :
bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) :
extreme order statistics used as endpoints
The sample size is many times lower, but the CIs are essentially the same. Why?
If anyone wants to replicate the exact same example, here's the data.
> dados
queimadas plantacoes
1 27 418
2 13 353
3 21 239
4 14 251
5 18 482
6 18 361
7 22 213
8 24 374
9 21 298
10 15 182
11 23 413
12 17 218
13 10 299
14 23 306
15 22 267
16 18 56
17 24 538
18 19 424
19 15 64
20 16 225
21 25 266
22 21 218
23 24 424
24 26 38
25 19 309
26 20 451
27 16 351
28 15 174
29 24 302
30 30 492
The confidence interval for your estimator does not depend on the number of bootstrap replicates; it depends on the size of the original dataset.
Increasing the number of bootstrap replicates will increase the precision with which the sampling distribution (and hence the confidence intervals) is approximated, but it cannot make your estimate of the mean of your sample more precise.
Try calculating the confidence interval around the mean using an analytic method for comparison.
> confint(lm(dados$queimadas~1))
2.5 % 97.5 %
(Intercept) 18.27624 21.72376
You will see that both bootstraps (with 100 and with 10,000 replicates) estimate the analytically calculated CI fairly well.
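To see the distinction concretely, here is a rough sketch (my own illustration, reusing dados and M. from the question): stacking the data four times mimics an original sample four times larger, and the percentile CI shrinks by roughly half, the usual 1/sqrt(n) behaviour, whereas changing R only stabilises the interval endpoints.
library(boot)
dados_big <- rbind(dados, dados, dados, dados)  # pretend n = 120 instead of 30
bootmu_big <- boot(dados_big, statistic = M., R = 10000)
boot.ci(bootmu_big, type = "perc")              # noticeably narrower interval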

Extract beta, SE and P-value from meta-regression using the meta package in R into a nice output

I am using the code below to do meta-regression in R, repeating it several times for different variables.
My data frame and code are as follows:
data<-read.table(text="Studlab PCI.total.FU CABG.total.FU PCI CABG Mean.Age Females..
A 4515 4485 45 51 65.1 22.35
B 4740 4785 74 49 65.95 23.15
C 3621.4 3598.6 41 31 63.15 28.65
D 2337 2314.2 20 29 60 30.5
E 1835.2 1835.2 20 16 66.2 22
F 2014.8 2033.2 11 6 64.45 28.55
G 1125 1125 4 5 61.95 20.65
H 1500 1500 6 3 62.25 23.5
I 976 1000 11 3 61.5 21
J 202 194 10 0 62.4 1", sep="", header=T)
library(meta);library(metafor)
mr <- metainc( PCI, PCI.total.FU,CABG, CABG.total.FU,
data = data, studlab = Studlab, method = "Inverse")
Then for meta-regression I used the following code
MEG<-metareg (mr, ~Mean.Age);MEG ;
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
# Then I repeat meta-regression with another variable
MEG<-metareg (mr, ~Females..);MEG
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
and so on. So the b, se, pval and paste0 steps are repeated frequently to get the needed output.
The content of MEG is shown in the screenshot below.
My question is: is there any way to repeat this function (those repeated steps) several times with different variables (here I used "Mean.Age" and then "Females..")? In other words, I want to reproduce several MEG objects with different variables. I am wondering whether there is anything like a macro that would let me call those functions repeatedly without continually copying and pasting the code.
Any advice will be greatly appreciated.
I am doing that to finally create a table like this
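One hedged sketch of such a loop (my own illustration, assuming the mr object defined above and that, as in metafor's rma objects, the moderator coefficient sits in the second element of b, se and pval):
vars <- c("Mean.Age", "Females..")
results <- do.call(rbind, lapply(vars, function(v) {
  MEG <- metareg(mr, as.formula(paste("~", v)))   # meta-regression on one moderator
  data.frame(moderator = v,
             summary   = paste0(round(MEG$b[2], 2), "±", round(MEG$se[2], 2),
                                ", P=", round(MEG$pval[2], 2)))
}))
results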

Fitting logistic growth curves to data

I've been attempting to fit logistic growth equations to data sets I have, with mixed results. I typically use a setup like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
The goal here is to model Diversity over time logistically; this is a species diversity curve that asymptotes. However, for particular datasets I cannot get the model to work, and I can't for the life of me figure out why. As an example, in one iteration the Diversity (and therefore new8) values being pulled are
[1] 25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147
[36] 148 151 154 165 215 216 206 205 207 207 220 200 204
# plot via this, and it is a nice species diversity curve beginning to level off
plot(Diversity,type="l")
This data is beginning to reach its limit, yet I cannot fit a logistic curve to it. If I try, I get a "maximum iterations exceeded" error, no matter how high I crank up the iterations. I've played with the starting parameters over and over with no luck. Currently, for example, the code looks like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
Any help is more than appreciated. Spent all day sitting on my couch stuck on this. If someone has a better way to coerce a logistic growth curve out of data, I'd love to hear it! As a side note, I've used SSlogis for these datasets with no luck, either.
Numerical instability is often a problem with models involving exponential terms. Try evaluating your model at your starting parameters:
> 200/(1+exp(25-1.6*df$norm_time))
[1] 2.871735e-09 2.969073e-09 3.069710e-09 3.173759e-09 3.281333e-09 3.392555e-09 3.507546e-09 3.626434e-09 3.749353e-09
[10] 3.876437e-09 4.007830e-09 4.143676e-09 4.284126e-09 4.429337e-09 4.579470e-09 4.734691e-09 4.895174e-09 5.061097e-09
[19] 5.232643e-09 5.410004e-09 5.593377e-09 5.782965e-09 5.978979e-09 6.181637e-09 6.391165e-09 6.607794e-09 6.831766e-09
[28] 7.063329e-09 7.302742e-09 7.550269e-09 7.806186e-09 8.070778e-09 8.344338e-09 8.627170e-09 8.919589e-09 9.221919e-09
[37] 9.534497e-09 9.857670e-09 1.019180e-08 1.053725e-08 1.089441e-08 1.126368e-08 1.164546e-08 1.204019e-08 1.244829e-08
[46] 1.287023e-08 1.330646e-08 1.375749e-08
With predicted values this small, it's likely that any moderate change in the parameters, as required by nls() to estimate gradients, will produce changes in the predictions that are very small, barely above or even below minFactor.
It's better to normalize your data so that its numerical range is within a nice friendly range, like 0 to 1.
require(stringr)
require(ggplot2)
new8 <- '25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147 148 151 154 165 215 216 206 205 207 207 220 200 204'
Diversity = as.numeric(str_split(new8, '[ ]+')[[1]])
time <- 1:48
df = data.frame(time=time, div=Diversity)
# normalize time
df$norm_time <- df$time / max(df$time)
# normalize diversity
df$norm_div <- (df$div - min(df$div)) / max(df$div)
With this way of normalizing diversity, your Po parameter can always be assumed to be 0. That means we can eliminate it from the model. The model now only has two degrees of freedom instead of three, which also makes fitting easier.
That leads us to the following model:
logisticModel <- nls(norm_div~K/(1+exp(r*norm_time)), data=df,
start=list(K=1, r=-1.6),
control=list(maxiter=1000, minFactor=.00000000001))
Your data doesn't look like that great a fit to the model to me, but I'm not an expert in your field:
ggplot(data=df, aes(x=norm_time, y=norm_div)) +
geom_point() +
geom_line(aes(x=norm_time, y=predict(logisticModel)), color='red') +
theme_bw()
quartz.save('~/Desktop/SO_31236153.png', type='png')
summary(logisticModel)
Formula: norm_div ~ K/(1 + exp(r * norm_time))
Parameters:
Estimate Std. Error t value Pr(>|t|)
K 0.6940 0.1454 4.772 1.88e-05 ***
r -2.6742 2.4222 -1.104 0.275
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1693 on 46 degrees of freedom
Number of iterations to convergence: 20
Achieved convergence tolerance: 5.895e-06
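If you need the fitted curve back on the original Diversity scale, a small follow-up sketch (my addition, simply inverting the normalisation used above):
# Undo the (div - min) / max normalisation to compare with the raw counts
df$div_fit <- predict(logisticModel) * max(df$div) + min(df$div)
plot(df$time, df$div, pch = 16, xlab = 'time', ylab = 'Diversity')
lines(df$time, df$div_fit, col = 'red', lwd = 2)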

What values to look at in cross validated linear regression in DAAG package

I performed the following on a data set that contains 151 variables and 161 observations:
> library(DAAG)
> fit <- lm(RT..seconds.~., data=cadets)
> cv.lm(df = cadets, fit, m = 10)
And got the following results:-
fold 1
Observations in test set: 16
7 11 12 24 33 38 52 67 72
Predicted 49.6 44.1 26.4 39.8 53.3 40.33 47.8 56.7 58.5
cvpred 575.0 -113.2 640.7 -1045.8 876.7 -5.93 2183.0 -129.7 212.6
RT..seconds. 42.0 44.0 44.0 45.0 45.0 46.00 49.0 56.0 58.0
CV residual -533.0 157.2 -596.7 1090.8 -831.7 51.93 -2134.0 185.7 -154.6
What I want to do is compare the predicted results to the actual experimental results, so I can plot a graph of the two against each other to show how similar they are. Am I right in assuming I would do this by using the values in the Predicted row as my predicted results, and not the cvpred?
I only ask because when I performed the very same thing with the caret package, the predicted and the observed values came out far more different from one another:
library(caret)
ctrl <- trainControl(method = "cv", savePred=T, classProb=T)
mod <- train(RT..seconds.~., data=cadets, method = "lm", trControl = ctrl)
mod$pred
pred obs rowIndex .parameter Resample
1 141.2 42 6 none Fold01
2 -504.0 42 7 none Fold01
3 1196.1 44 16 none Fold01
4 45.0 45 27 none Fold01
5 262.2 45 35 none Fold01
6 570.9 52 58 none Fold01
7 -166.3 53 61 none Fold01
8 -1579.1 59 77 none Fold01
9 2699.0 60 79 none Fold01
The model shouldn't be this inaccurate, as I originally started from 1664 variables and reduced them through the use of a random forest so that only variables with a variable importance greater than 1 were used, which massively reduced my dataset from 162 * 1664 to 162 * 151.
If someone could explain this to me I'd be grateful, thanks
I think there are a few areas of confusion here; let me try to clear them up for you.
The "Predicted" section from cv.lm does not correspond to results from cross-validation. If you're interested in cross-validation then you need to look at your "cvpred" results -- "Predicted" corresponds to predictions from the model fit using all of your data.
The reason there is such a large difference between your predictions and your cv predictions is likely that your final model is overfitting, which should illustrate why cross-validation is so important.
I believe that you are fitting your cv.lm model incorrectly. I've never used the package but I think you want to pass in something like cv.lm(df = cadets, RT..seconds.~., m = 10) rather than your fit object. I'm not sure why you see such a large difference between your cvpred and Predicted options in the example above, but these results tell me that passing in a model will lead to using a model that was fit on all of the data for each CV fold:
library(DAAG)
fit <- lm(Sepal.Length ~ ., data=iris)
mod1 <- cv.lm(df=iris,fit,m=10)
mod2 <- cv.lm(df=iris,Sepal.Length ~ .,m=10)
> sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
[1] 0.318
> sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
[1] 5.94
> sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
[1] 0.0311
> sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
[1] 5.94
The reason there is such a difference from your caret results is that you were looking at the "Predicted" section. "cvpred" should line up closely with caret (although make sure to match the row indices on your CV results), and if you want to line up the "Predicted" results with caret you will need to get your predictions using something like predict(mod, cadets).
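To make the comparison plot the question asked for, a short hedged sketch using the iris example above (mod1 already contains both columns):
# Observed response vs cross-validated predictions; points near the 1:1 line
# indicate good out-of-sample agreement.
plot(mod1$cvpred, mod1$Sepal.Length,
     xlab = "Cross-validated prediction", ylab = "Observed Sepal.Length")
abline(0, 1, lty = 2)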
