Predicting/imputing the missing values of a Poisson GLM Regression in R?

I'm trying to explore ways of imputing missing values in a data set. My dataset contains counts of an occurrence (Unnatural, Natural, and their sum Total) for Year (2001-2009), Month (1-12), Gender (M/F) and AgeGroup (4 groups).
One of the imputation techniques I'm exploring is (poisson) regression imputation.
Say my data looks like this:
Year Month Gender AgeGroup Unnatural Natural Total
569 2006 5 Male 15up 278 820 1098
570 2006 6 Male 15up 273 851 1124
571 2006 7 Male 15up 304 933 1237
572 2006 8 Male 15up 296 1064 1360
573 2006 9 Male 15up 298 899 1197
574 2006 10 Male 15up 271 819 1090
575 2006 11 Male 15up 251 764 1015
576 2006 12 Male 15up 345 792 1137
577 2007 1 Female 0 NA NA NA
578 2007 2 Female 0 NA NA NA
579 2007 3 Female 0 NA NA NA
580 2007 4 Female 0 NA NA NA
581 2007 5 Female 0 NA NA NA
...
After fitting a basic GLM, 96 observations were deleted due to missingness.
Is there a way/package/function in R that will use the coefficients of this GLM model to 'predict' (i.e. impute) the missing values for Total (even if it just stores them in a separate data frame; I will use Excel to merge them)? I know I could use the coefficients to predict each combination of predictor levels by hand, but that would take forever. Hopefully there is a one-step function/method?
Call:
glm(formula = Total ~ Year + Month + Gender + AgeGroup, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-13.85467 -1.13541 -0.04279 1.07133 10.33728
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.3433865 1.7541626 7.607 2.81e-14 ***
Year -0.0047630 0.0008750 -5.443 5.23e-08 ***
Month 0.0134598 0.0006671 20.178 < 2e-16 ***
GenderMale 0.2265806 0.0046320 48.916 < 2e-16 ***
AgeGroup01-4 -1.4608048 0.0224708 -65.009 < 2e-16 ***
AgeGroup05-14 -1.7247276 0.0250743 -68.785 < 2e-16 ***
AgeGroup15up 2.8062812 0.0100424 279.444 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 403283.7 on 767 degrees of freedom
Residual deviance: 4588.5 on 761 degrees of freedom
(96 observations deleted due to missingness)
AIC: 8986.8
Number of Fisher Scoring iterations: 4

First, be very careful about the assumption of missing at random. In your example, missingness appears to co-occur with Female and a particular age group. You should test whether missingness is related to any of the predictors (or whether any predictors are themselves missing); if so, the imputed responses could be biased.
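One quick way to check this (a minimal sketch, assuming your full data frame is called dat, which is my own placeholder name) is to cross-tabulate an explicit missingness indicator against each predictor:
# minimal sketch: is missingness in Total associated with the predictors?
dat$miss <- is.na(dat$Total)
table(dat$Gender, dat$miss)
table(dat$AgeGroup, dat$miss)
table(dat$Year, dat$miss)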
Second, the function you are seeking is predict(), which accepts a fitted glm object. See ?predict.glm for more guidance. If some predictors are themselves missing, you may need to fit a cascade of (nested) models to address those rows.
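For example, a minimal sketch of the predict() route, with dat and fit as assumed names for your data frame and fitted model (adjust to your own):
fit <- glm(Total ~ Year + Month + Gender + AgeGroup, family = poisson, data = dat)
miss_rows <- is.na(dat$Total)
imputed <- dat[miss_rows, c("Year", "Month", "Gender", "AgeGroup")]
# expected counts on the response scale for the rows where Total is missing
imputed$Total_pred <- predict(fit, newdata = imputed, type = "response")
head(imputed)
# write.csv(imputed, "imputed_totals.csv", row.names = FALSE)  # if you want to merge in Excel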

The mice package provides a function of the same name that allows each missing value to be predicted from a regression on the other variables. It can cope with predictors that are themselves missing because it uses an iterative MCMC-style algorithm.
I don't think Poisson regression is an option, but if all of your counts are as large as in the example, normal regression should offer a reasonable approximation.
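A minimal sketch of the mice workflow, again assuming the data frame is called dat; predictive mean matching (pmm, the package default for numeric columns) plays the role of the normal approximation mentioned above:
library(mice)
imp <- mice(dat, m = 5, method = "pmm", seed = 1)  # 5 imputed data sets via predictive mean matching
completed <- complete(imp, 1)                      # extract the first completed data set
head(completed)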

Related

How to estimate the median survival with upper and lower confidence limits for the median at 90% confidence levels?

My goal is to estimate the median survival, with upper and lower confidence limits for the median at the 90% confidence level, using a survfit object.
library(readr)     # read_csv
library(dplyr)     # %>%, filter, tibble
library(survival)  # Surv, survfit
library(survminer) # surv_median
library(ggplot2)

churn_dat <- read_csv("https://raw.githubusercontent.com/square/pysurvival/master/pysurvival/datasets/churn.csv")
churn_dat <- churn_dat %>% filter(months_active > 0)

# create a function that bootstraps the data frame for a given company size
boot <- function(size, n_sims){
  # 1. filter the data to a particular size
  df <- churn_dat %>% filter(company_size == size)
  n <- nrow(df)
  # 2. run the bootstrap: resample the rows with replacement, n_sims times
  experiments <- tibble(experiment = rep(1:n_sims, each = n),
                        index = sample(1:n, size = n * n_sims, replace = TRUE),
                        time_star = df$months_active[index],
                        event_star = df$churned[index])
  return(experiments)
}

# create a function for plotting the bootstrapped medians
plot_boot_data <- function(experiments){
  fit <- survfit(Surv(time_star, event_star) ~ experiment, data = experiments)
  # get the median survival per experiment
  med <- surv_median(fit)
  med <- data.frame(med = med$median)
  ggplot(med, aes(x = med, fill = med)) +
    geom_histogram(binwidth = .8) + theme_bw()
}

df_10to50 <- boot("10-50", 10)
plot_boot_data(df_10to50)
I have found a similar function, surv_median(), that does this, but its confidence level is fixed at 95%.
How can I construct the same thing with the confidence level set to 90%?
The surv_median function in pkg:survminer essentially does what someone screen-scraping the console would do after running the non-exported survmean function in pkg:survival (note the need for the triple-colon ':::' operator to reach it). surv_median uses hard-coded column names, so it cannot handle a fit object that was constructed with a different value of the conf.int parameter in a call to survfit. If you want the output of survmean for such a call, it's not at all difficult. Using your data:
fit <- survfit(Surv(time_star, event_star) ~ experiment, data = df_10to50, conf.int=0.9)
med <- survival:::survmean(fit,rmean=FALSE)
med # result is a named list
#------------
$matrix
records n.max n.start events rmean se(rmean) median 0.9LCL 0.9UCL
experiment=1 673 673 673 298 7.347565 0.2000873 7 5 12
experiment=2 673 673 673 309 7.152863 0.2028425 6 5 10
experiment=3 673 673 673 298 7.345891 0.2068490 9 5 12
experiment=4 673 673 673 323 7.035011 0.1981676 5 4 7
experiment=5 673 673 673 313 7.044400 0.2074104 6 5 9
experiment=6 673 673 673 317 7.061878 0.2021348 6 4 9
experiment=7 673 673 673 311 7.029602 0.2081835 5 4 9
experiment=8 673 673 673 301 7.345766 0.2032876 9 6 10
experiment=9 673 673 673 318 6.912700 0.2050143 7 5 9
experiment=10 673 673 673 327 6.988065 0.1990601 5 4 7
$end.time
[1] 12 12 12 12 12 12 12 12 12 12
If you want the median and its bounds at the 0.9 confidence level (here for the first experiment), they can be obtained with:
med$matrix[ 1 , 7:9] # using numbers instead of column names.
#----------
median 0.9LCL 0.9UCL
7 5 12
I'm afraid there is not a sufficient description of your goal, or of the process of getting there, for me to make sense of the dplyr/magrittr chain of logic, so I'm unable to fill in the proper places in the boot function or the handling of its output by ggplot2. I was initially very confused because you were using a function named boot and I thought you were doing a bootstrapped analysis, but there didn't seem to be any mechanism for getting bootstrapped results, i.e. no randomized selection of rows in an indexable dataset.
If you still wanted to make a purpose-built variant of surv_median you might try modifying this line inside the code:
.table <- .table %>% dplyr::select_(
.dots = c("strata", "median", "`0.95LCL`", "`0.95UCL`"))
I wasn't able to figure out what surv_median was doing with the "strata" column, since it didn't match the output of survmean, but that's probably because it uses summary.survfit rather than going directly to the function that summary.survfit calls to do the calculations. So happy hacking.
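If all you need are the medians and 0.9 bounds for every experiment (rather than a rewrite of surv_median itself), a small helper along these lines may be enough; this is my own sketch built on the non-exported survmean, not part of survminer:
get_median_ci <- function(fit, conf = 0.9) {
  # pull the median and the conf-level bounds for every stratum from survmean's matrix
  sm <- survival:::survmean(fit, rmean = FALSE)$matrix
  as.data.frame(sm[, c("median", paste0(conf, "LCL"), paste0(conf, "UCL")), drop = FALSE])
}
fit <- survfit(Surv(time_star, event_star) ~ experiment, data = df_10to50, conf.int = 0.9)
get_median_ci(fit)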

How to perform repeated k-fold cross validation in R with DAAG package?

I have created a 3-fold cross-validated linear regression model using the houseprices data set from the DAAG package. I have read some threads here and on Cross Validated where it was mentioned multiple times that the cross-validation must be repeated many times (like 50 or 100) for robustness. I'm not sure what that means. Does it mean simply running the code 50 times and calculating the average of the overall ms?
> cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area+bedrooms),
+ m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
+ main="Small symbols show cross-validation predicted values",
+ legend.pos="topleft", printit = TRUE)
Analysis of Variance Table
Response: sale.price
Df Sum Sq Mean Sq F value Pr(>F)
area 1 18566 18566 17.0 0.0014 **
bedrooms 1 17065 17065 15.6 0.0019 **
Residuals 12 13114 1093
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fold 1
Observations in test set: 5
11 20 21 22 23
Predicted 206 249 259.8 293.3 378
cvpred 204 188 199.3 234.7 262
sale.price 215 255 260.0 293.0 375
CV residual 11 67 60.7 58.3 113
Sum of squares = 24351 Mean square = 4870 n = 5
fold 2
Observations in test set: 5
10 13 14 17 18
Predicted 220.5 193.6 228.8 236.6 218.0
cvpred 226.1 204.9 232.6 238.8 224.1
sale.price 215.0 112.7 185.0 276.0 260.0
CV residual -11.1 -92.2 -47.6 37.2 35.9
Sum of squares = 13563 Mean square = 2713 n = 5
fold 3
Observations in test set: 5
9 12 15 16 19
Predicted 190.5 286.3 208.6 193.3 204
cvpred 174.8 312.5 200.8 178.9 194
sale.price 192.0 274.0 212.0 220.0 222
CV residual 17.2 -38.5 11.2 41.1 27
Sum of squares = 4323 Mean square = 865 n = 5
Overall (Sum over all 5 folds)
ms
2816
Every time I repeat it I get the same ms = 2816. Can someone please explain what exactly it means to repeat the CV 100 times? Repeating this code 100 times doesn't seem to change the ms.
Repeating this code 100 times will not change anything. You have set a seed, so the folds are always the same: with three folds you get the same three folds every run, and therefore the same mean square error all 100 times.
It also does not seem like you have enough samples for 50 or 100 folds to be appropriate, and there is no set number of folds that is appropriate across all data sets. The number of folds should be chosen so that each test set still contains a reasonable amount of data.
Also, you do not want to run multiple CV models with different seeds just to find the best-performing seed; that form of error hacking is a proxy for overfitting.
You should groom your data well, engineer and transform your variables properly, pick a reasonable number of folds, set a seed so your stakeholders can reproduce your findings, and then build your model.
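For what the question actually quotes ("repeat the CV 50 or 100 times"), the usual meaning is to re-run the k-fold split with a different seed each repetition and average the error estimates, not to hunt for the best seed. A minimal sketch with cv.lm, recomputing the overall mean square by hand from the returned cvpred column (assumed to be present in the returned data frame, as in current DAAG versions):
library(DAAG)
set.seed(1)
seeds <- sample.int(10000, 100)
ms_per_rep <- sapply(seeds, function(s) {
  out <- cv.lm(data = DAAG::houseprices,
               form.lm = formula(sale.price ~ area + bedrooms),
               m = 3, seed = s, plotit = FALSE, printit = FALSE)
  mean((out$cvpred - out$sale.price)^2)  # overall CV mean square for this repetition
})
mean(ms_per_rep)  # averaged cross-validation mean square
sd(ms_per_rep)    # spread across the 100 repetitions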

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary, I would like to plot a figure of estimated probabilities from a binary logistic regression (glm) model I produced. I have already gone through model selection, validation, etc., and am now simply trying to produce figures. I had no problem plotting probability curves for the model I selected, but what I am really interested in is producing a figure that shows the probability of a binary outcome for one predictor variable while the other predictor variable is held constant.
I cannot figure out how to assign a constant value to only one of the predictor variables and plot the probability for the other. Ultimately I would like to produce figures similar to the crude example I attached as my desired output. I admit I am a novice in R and I certainly appreciate folks' time, but I have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question, but I found the explanation vague and it failed to provide an example of assigning one predictor a constant value while plotting the probability of the other predictor: https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below I have provided a simulated dataset and my progress. Thank you very much for your expertise; I believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
#Dataset name is x
# time to run the glm model
model1<-glm(formula=survival ~ mass + depth, family = "binomial", data=x)
model1
summary(model1)
# OK, now here's how I predict the probability of a lizard "Bob" surviving
# the winter with a mass of 32.949 grams and a burrow depth of 264 mm
newdata <- data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
# the lizard "Bob" has an 87.3% chance of surviving the winter

# Now let's assume the glm model is robust and the lizard is endangered.
# From my research I know the average burrow depth at a national park is 263.9 mm;
# I am also interested in survival probabilities at burrow depths of 200 and 100 mm.
# How do I use the glm model produced above to generate a plot showing the
# probability of lizards surviving at the burrow depths stated above, across a
# range of mass values from 0.0 to 100.0 grams?
# I know I need to use the plot and predict functions, but I cannot figure out how
# to tell R that I want to use the glm model to predict "survival" based on "mass"
# while the other predictor, "depth", is held at constant values of biological relevance.
# I would also like to add dashed lines for the 95% CI.
# (one way to do this is sketched just below)
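One common way to do this (a sketch of my own, not from the original post): build a newdata grid in which mass varies and depth is fixed, ask predict() for fits and standard errors on the link scale, and back-transform with plogis() to get curves with approximate 95% confidence bands:
# sketch: predicted survival vs. mass with depth held constant
mass_grid <- seq(0, 100, by = 0.5)

plot_curve <- function(depth_value, col) {
  nd <- data.frame(mass = mass_grid, depth = depth_value)
  pr <- predict(model1, newdata = nd, type = "link", se.fit = TRUE)
  lines(mass_grid, plogis(pr$fit), col = col)
  lines(mass_grid, plogis(pr$fit - 1.96 * pr$se.fit), col = col, lty = 2)  # lower 95% CI
  lines(mass_grid, plogis(pr$fit + 1.96 * pr$se.fit), col = col, lty = 2)  # upper 95% CI
}

plot(NA, xlim = c(0, 100), ylim = c(0, 1),
     xlab = "Mass (g)", ylab = "Predicted probability of survival")
plot_curve(263.9, col = "black")  # average burrow depth
plot_curve(200, col = "blue")
plot_curve(100, col = "red")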

Fitting logistic growth curves to data

I've been attempting to fit logistic growth equations to data sets I have, with mixed results. I typically use a setup like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
The goal here is to model Diversity over time logistically; this is a species diversity curve that asymptotes. However, for particular datasets, I cannot get the model to work and can't for the life of me figure out why. As an example, in one iteration the Diversity (and therefore new8) values being pulled are
[1] 25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147
[36] 148 151 154 165 215 216 206 205 207 207 220 200 204
# plot via this, and it is a nice species diversity curve beginning to level off
plot(Diversity,type="l")
This data is beginning to reach its limit, yet I cannot fit a logistic curve to it. If I try, I get an exceeded max iterations error, no matter how high I crank up the iterations. I've played with the starting parameters over and over with no luck. Currently, for example the code looks like this:
# Post PT
time <- 1:48
Diversity <- new8
plot(time, Diversity,log="y",las=1, pch=16, type="l")
logisticModel <- nls(Diversity~K/(1+exp(Po+r*time)), start=list(Po=25, r=-1.6, K=200),control=list(maxiter=1000,minFactor=.00000000001))
Any help is more than appreciated. Spent all day sitting on my couch stuck on this. If someone has a better way to coerce a logistic growth curve out of data, I'd love to hear it! As a side note, I've used SSlogis for these datasets with no luck, either.
Numerical instability is often a problem with models involving exponential terms. Try evaluating your model at your starting parameters:
> 200/(1+exp(25-1.6*df$norm_time))
[1] 2.871735e-09 2.969073e-09 3.069710e-09 3.173759e-09 3.281333e-09 3.392555e-09 3.507546e-09 3.626434e-09 3.749353e-09
[10] 3.876437e-09 4.007830e-09 4.143676e-09 4.284126e-09 4.429337e-09 4.579470e-09 4.734691e-09 4.895174e-09 5.061097e-09
[19] 5.232643e-09 5.410004e-09 5.593377e-09 5.782965e-09 5.978979e-09 6.181637e-09 6.391165e-09 6.607794e-09 6.831766e-09
[28] 7.063329e-09 7.302742e-09 7.550269e-09 7.806186e-09 8.070778e-09 8.344338e-09 8.627170e-09 8.919589e-09 9.221919e-09
[37] 9.534497e-09 9.857670e-09 1.019180e-08 1.053725e-08 1.089441e-08 1.126368e-08 1.164546e-08 1.204019e-08 1.244829e-08
[46] 1.287023e-08 1.330646e-08 1.375749e-08
With predicted values this small, it's likely that any moderate change in the parameters, as required by nls() to estimate gradients, will produce changes in the predictions that are very small, barely above or even below minFactor.
It's better to normalize your data so that its numerical range is within a nice friendly range, like 0 to 1.
require(stringr)
require(ggplot2)
new8 <- '25 22 68 72 126 141 82 61 97 126 101 110 173 164 160 137 122 113 104 104 109 102 107 122 149 127 137 146 185 188 114 91 102 132 147 148 151 154 165 215 216 206 205 207 207 220 200 204'
Diversity = as.numeric(str_split(new8, '[ ]+')[[1]])
time <- 1:48
df = data.frame(time=time, div=Diversity)
# normalize time
df$norm_time <- df$time / max(df$time)
# normalize diversity
df$norm_div <- (df$div - min(df$div)) / max(df$div)
With this way of normalizing diversity, your Po parameter can always be assumed to be 0. That means we can eliminate it from the model. The model now only has two degrees of freedom instead of three, which also makes fitting easier.
That leads us to the following model:
logisticModel <- nls(norm_div~K/(1+exp(r*norm_time)), data=df,
start=list(K=1, r=-1.6),
control=list(maxiter=1000, minFactor=.00000000001))
Your data doesn't look like that great a fit to the model to me, but I'm not an expert in your field:
ggplot(data=df, aes(x=norm_time, y=norm_div)) +
  geom_point() +  # note: log='y' is not a geom_point() argument; add scale_y_log10() if a log axis is wanted
  geom_line(aes(x=norm_time, y=predict(logisticModel)), color='red') +
  theme_bw()
quartz.save('~/Desktop/SO_31236153.png', type='png')  # macOS-only device; ggsave() works elsewhere
summary(logisticModel)
Formula: norm_div ~ K/(1 + exp(r * norm_time))
Parameters:
Estimate Std. Error t value Pr(>|t|)
K 0.6940 0.1454 4.772 1.88e-05 ***
r -2.6742 2.4222 -1.104 0.275
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1693 on 46 degrees of freedom
Number of iterations to convergence: 20
Achieved convergence tolerance: 5.895e-06
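Since the question mentions SSlogis: on the data frame built above it may be worth one more try, because the self-starting form picks its own starting values and so avoids the hand-tuned exponential blow-up shown at the top of this answer. A sketch, with no guarantee that it converges on every one of your series:
# self-starting logistic: Asym / (1 + exp((xmid - time) / scal))
ss_fit <- try(nls(div ~ SSlogis(time, Asym, xmid, scal), data = df))
if (!inherits(ss_fit, "try-error")) {
  print(summary(ss_fit))
  plot(df$time, df$div, pch = 16, xlab = "time", ylab = "Diversity")
  lines(df$time, predict(ss_fit), col = "red")
}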

Using glmer.nb(), the error message:(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned

When using glmer.nb, we just get this error message:
> glm1 <- glmer.nb(Jul ~ scale(I7)+ Maylg+(1|Year), data=bph.df)
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning message:
In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > :
iteration limit reached
Who can help me? Thanks very much!
My data listed below.
Year Jul A7 Maylg L7b
331 1978 1948 6 1.322219 4
343 1979 8140 32 2.678518 2
355 1980 106896 26 2.267172 2
367 1981 36227 25 4.028205 2
379 1982 19085 18 2.752816 2
391 1983 26010 32 2.086360 3
403 1984 1959 1 2.506505 4
415 1985 8025 18 2.656098 0
427 1986 9780 20 1.939519 0
439 1987 48235 29 4.093912 1
451 1988 7473 30 2.974972 2
463 1989 2850 25 2.107210 2
475 1990 10555 18 2.557507 3
487 1991 70217 30 4.843563 0
499 1992 2350 31 1.886491 2
511 1993 3363 32 2.956649 4
523 1994 5140 37 1.934498 4
535 1995 14210 36 2.492760 1
547 1996 3644 27 1.886491 1
559 1997 9828 29 1.653213 1
571 1998 3119 41 2.535294 4
583 1999 5382 10 2.472756 3
595 2000 690 5 1.886491 2
607 2001 871 13 NA 2
619 2002 12394 27 0.845098 5
631 2003 4473 36 1.342423 2
You're going to have a lot of problems with this data set, among other things, because you have an observation-level random effect (you only have one data point per Year) and are trying to fit a negative binomial model. That essentially means you're trying to fit the overdispersion in two different ways at the same time.
If you fit the Poisson model, you can see that the results are strongly underdispersed (for a Poisson model, the residual deviance should be approximately equal to the residual degrees of freedom).
library("lme4")
glm0 <- glmer(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
family="poisson")
print(glm0)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
Data: bph.df
AIC BIC logLik deviance df.resid
526.4904 531.3659 -259.2452 518.4904 21
Random effects:
Groups Name Std.Dev.
Year (Intercept) 0.9555
Number of obs: 25, groups: Year, 25
Fixed Effects:
(Intercept) scale(A7) Maylg
7.3471 0.3363 0.6732
deviance(glm0)/df.residual(glm0)
## [1] 0.0003479596
Or alternatively:
library("aods3")
gof(glm0)
## D = 0.0073, df = 21, P(>D) = 1
## X2 = 0.0073, df = 21, P(>X2) = 1
glmmADMB does manage to fit it, but I don't know how far I would trust the results (the dispersion parameter is very large, indicating that the model has basically converged to a Poisson distribution anyway).
bph.df <- na.omit(transform(bph.df,Year=factor(Year)))
glmmadmb(Jul ~ scale(A7)+ Maylg+(1|Year), data=bph.df,
family="nbinom")
GLMM's in R powered by AD Model Builder:
Family: nbinom
alpha = 403.43
link = log
Fixed effects:
Log-likelihood: -259.25
AIC: 528.5
Formula: Jul ~ scale(A7) + Maylg + (1 | Year)
(Intercept) scale(A7) Maylg
7.3628472 0.3348105 0.6731953
Random effects:
Structure: Diagonal matrix
Group=Year
Variance StdDev
(Intercept) 0.9105 0.9542
Number of observations: total=25, Year=25
The results are essentially identical to the Poisson model from lme4 above.
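Given the point above (one observation per Year, so the random effect and the negative binomial dispersion compete to explain the same variation), a plain negative binomial GLM without the random effect is a natural comparison to try; this is my own sketch, not part of the original answer:
library(MASS)
# drop the observation-level random effect and let the NB dispersion absorb any extra variation
nb_fit <- glm.nb(Jul ~ scale(A7) + Maylg, data = bph.df)
summary(nb_fit)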
