hurdle models using continuous data and covariates - r

I was wondering if I get some advice about fitting hurdle models using continuous data and covariates.
I have some continuous data that are generally well fit using a right-skewed distribution such as a Pareto, Gamma, or Weibull distribution. However, there several zeros in my data which are important to my analysis. In addition, I have some categorical (two-level) covariates and would like to model the parameters of a distribution as a function of these covariates in order to formally evaluate their importance (e.g., using AIC).
I have seen examples of hurdle models fit using continuous data but have not yet found any examples of how to incorporate covariates and a model-selection framework. Does anyone have any suggestions as to how to proceed or know of any R packages that allow this procedure? I have included some code below to reproduce the type of data I am working with. The non-zero data are generated via a generalized Pareto distribution from the package texmex. The parameters were estimated directly from my non-zero data. I have also included the code to plot the data in a histogram to see their distribution.
library("texmex")
set.seed(101)
zeros <- rep(0,8)
non_zeros <- rgpd(17, sigm=exp(-10.4856), xi=0.1030, u = 0)
all.data <- c(zeros,non_zeros)
hist(non_zeros,breaks=50,xlim=c(0,0.00015),ylim=c(0,9),main="",xlab="",
col="gray")
hist(zeros,add=TRUE,col="black",breaks=100,xlim=c(0,0.00015),ylim=c(0,9))
legend("topright",legend=c("zeros"),col="black",lwd=8)

Related

estimating heteroskedastic logit model

Suppose (following Train, Discrete Choice Analysis with Simulation, chapter 3.2) you have a discrete choice model, observations from several cities, and you suspect that the variances of the unobserved factors differ over cities. The suggested appropriate
model is heteroskedastic logit (or probit). Can this setup be estimated in R? As far as I can see, the mlogit package doesn't do this
(it doesn't allow you to specify that the source of the heterogeneity is the cities). Is that correct, or am I missing
something? If it is right, any suggestions for an R package?

How to transform data after fitting a distribution with gamlss?

I have a data set where observations come from highly distinct groups. Each group may have a wildly different distribution, so I am trying to find the best distribution using fitdist from fitdistrplus, then use gamlssML from the gamlss package to find the best parameters.
My issue is with transforming the data after this step. For some of the distributions, like the Box-Cox t, I can find the equation for normalizing the data using the BCT coefficients, but for many of these distributions I cannot.
Does gamlss have a function that normalizes the data after fitting? Their documentation only provides the transformations for a small number of distributions https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf
Thanks a lot
The normalised data values (for any distribution) are exactly equal to the residuals from a gamlss fit,
m1 <- gamlss()
which can be accessed by
residuals(m1) or
m1$residuals

Survival Curves For Cox PH Models. Checking My Understanding About Plotting Them

Im using the book Applied Survival Analysis Using R by Moore to try and model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the cox model. Because of this I'm wondering if my understanding of the model is wrong or not. My data is simple: a time column t, an event indicator column (1 for event 0 for censor) i, and a predictor column with 6 factor levels p.
I believe I can plot estimated surival curves for a cox model as follows below. But I don't understand how to use survfit and baseplot, nor functions from survminer to achieve the same end. Here is some generic code for clarifying my question. I'll use the pharmcoSmoking data set to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data<-as.data.frame(cbind(t,i,p))
model <- coxph(Surv(data$t, data$i) ~ p, data=data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s<-exp(-base$hazard)
t<-base$time
plot(s~t, typ = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the cox model. The code above gives what I think are estimated survival curves for heavy vs light smokers in the pharmcoSmokers dataset.
Since thats a bit of code I was looking to packages for a one-liner solution, I had a hard time with the documentation for Survival ( there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different than my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work from my data (I get a 'can't allocated vector of size yada yada error, and my data is ~1m rows) which seems weird considering I can make plots using what I did before no problem. This is the primary reason I am wondering if I am understanding how to plot survival curves for my model.

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working in a 2- classes random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model.I must be missing something in my code. I get the OOB and the prediction per case in the test set but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")#y.values))
Your problem is that predict on a randomForest object with type='prob' returns two predictions: each column contains the probability to belong to each class (for binary prediction).
You have to decide which of these predictions to use to build the ROC curve. Fortunately for binary classification they are identical (just reversed):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)

How do I plot predictions from new data fit with gee, lme, glmer, and gamm4 in R?

I have fit my discrete count data using a variety of functions for comparison. I fit a GEE model using geepack, a linear mixed effect model on the log(count) using lme (nlme), a GLMM using glmer (lme4), and a GAMM using gamm4 (gamm4) in R.
I am interested in comparing these models and would like to plot the expected (predicted) values for a new set of data (predictor variables). My goal is to compare the predicted effects for each model under particular conditions (x variables). Of particular interest is the comparison between marginal (GEE) and conditional estimates.
I think my main problem might be getting the new data in the correct form with the correct labels and attributes and such. I am still very much an R novice and struggle with this stuff (no course on this at my university unfortunately).
I currently have fitted models
gee1 lme1 lmer1 gamm1
and can extract their fixed effect coefficients and standard errors without a problem. I also don't have a problem converting them from the log scale or estimating confidence intervals accounting for the random effects.
I also have my new dataframe newdat which has 365 observations of 23 variables (average environmental data for each day of the year).
I am stuck on how to predict new count estimates from this. I played around with the model.matrix function but couldn't get it to work. For example, I tried:
mm = model.matrix(terms(glmm1), newdat) # Error in model.frame.default(object,
# data, xlev = xlev) : object is not a matrix
newdat$pcount = mm %*% fixef(glmm1)
Any suggestions or good references would be greatly appreciated. Can anyone help with the error above?
Getting predictions for lme() and lmer() is documented on http://glmm.wikidot.com/faq

Resources