Predict with survreg/tobit goes past bound in R

I am using survreg and expect my predicted values to obey a lower bound of 0, but they are frequently negative. I think it is somehow estimating a linear model rather than the survival model I'm trying to fit. Here's what I've done:
linear.first.stage <- lm(y ~ x, data = clip)
First I estimate a linear model to provide starting values and speed up estimation; the survival fit fails to converge without this first stage. I then create a survival object, following the code in ?survreg that provides an explicit example of a tobit regression, duplicated below for my x and y. In my data set, y can only be observed at non-negative values; when it is positive, it is roughly normally distributed around 200 with an sd of about 20. x may take any value and isn't theoretically bounded by any particular number that immediately comes to mind.
surv_y <- Surv(clip$y, clip$y > 0, type = "left")
# init expects a numeric vector of starting values, not the lm object itself
first.stage <- survreg(surv_y ~ x, init = coef(linear.first.stage),
                       dist = "gaussian", data = clip)
I run the survival regression, which should be equivalent to a Tobit. To confirm that my interpretation was correct, I ran the following:
test <- tobit(y ~ x, left = 0, right = Inf, dist = "gaussian", data = clip)
p_test <- predict(test)
p <- predict(first.stage)
plot(p_test - p)
The plot shows a flat line at zero, so on visual inspection the two fits are identical, as they should be. However, in both cases values below 0 are predicted. This is a problem because I have stated that the left bound of the observable data is 0, so my expectation is that all predicted values should be > 0.
I have tried predicting with type = "link", "response", and "linear", to no avail. I assume predict() is producing the outcomes as if the censoring were not occurring. How do I produce predictions that obey the lower bound of 0?
References:
Running predict() after tobit() in package AER
https://stats.stackexchange.com/questions/11440/standardized-residuals-of-a-tobit-model-in-r

You probably need to scale the prediction, in the sense described here by one of the authors of the package.
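If the goal is a fitted value that respects the left-censoring at 0, a minimal sketch (assuming the first.stage object from the question) applies the standard Tobit expectation E[y] = pnorm(z)*xb + sigma*dnorm(z), with z = xb/sigma:

xb    <- predict(first.stage)   # latent (uncensored) linear predictor, can be < 0
sigma <- first.stage$scale      # estimated scale parameter
z     <- xb / sigma
E_y   <- pnorm(z) * xb + sigma * dnorm(z)   # expectation of max(0, y*), never negative
plot(E_y)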

Answer: Tobit is not the right model for this problem. Tobit predicts what the result would be in the absence of the censoring.
Clarification: I restructured my estimation process around a zero-inflated or hurdle model. Tobit is for censored data: it assumes a non-zero result exists, but we only observe 0 because the information is hidden somehow. For example, women's wages are often fit with Tobit, because married women who choose not to work still have a reservation wage and some (unobserved) return to effort in unpaid labor. Zero-inflated and hurdle models instead say the result is truly zero: no crimes occurred, no widgets were produced. They more accurately reflect my data-generating process, as sketched below.
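A minimal sketch of the two-part (hurdle-style) restructuring for a continuous outcome with a point mass at zero, assuming the clip data from the question; the helper names part_zero and part_pos are hypothetical:

part_zero <- glm(I(y > 0) ~ x, family = binomial, data = clip)   # does anything happen?
part_pos  <- lm(y ~ x, data = subset(clip, y > 0))               # how much, given y > 0
p_pos <- predict(part_zero, newdata = clip, type = "response")
e_pos <- predict(part_pos, newdata = clip)
e_y   <- p_pos * e_pos   # unconditional expectation E[y]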

Related

bam() returns negative deviance explained values

I'm trying to run GAMs to analyze some temperature data. I have remote cameras and external temperature loggers, and I'm modelling the difference in the temperatures they record (camera temperature - logger temperature). Most of the time the cameras record higher temperatures, but sometimes the logger returns the higher temperature, in which case the difference is negative. The direction of the difference matters to me, so the response has to be allowed to take negative values. My explanatory variables are percent canopy cover (quantitative), direct and diffuse radiation (quantitative), and camera direction (ordered factor) as fixed effects, plus the camera/logger pair (factor) as a random effect.
I had mostly been using the gam() function in mgcv to run my models. I'm using a scat distribution since my data is heavy-tailed. My model code is as follows:
gam(f1, family = scat(link = "identity"), data = d)
I wanted to try using bam() since I have 60,000 data points (one temperature observation per hour of the day for several months). The gam() models run fine, though they take a while. But the exact same model formulas, run with bam(), return negative deviance explained values. I also get 50+ warning messages that all say:
In y - mu : longer object length is not a multiple of shorter object length
Running gam.check() on the fitted models returns identical residual plots. The parametric coefficients, smooth terms, and R-squared values are also almost identical. The only things that change noticeably are the deviance explained values, which become completely nonsensical (the bam() models range from -61% to -101% deviance explained).
I'll admit that I'm brand new to using GAMs. I know just enough to know that the residual plots are more important than the deviance explained values, and the residual plots look good (much better than they did with a Gaussian distribution, at least). More than anything, I'm curious about what's going on within the bam() function specifically that causes it to throw that warning and return a negative deviance explained value. Is there some extra argument I can set in bam(), or some further manipulation of my data, to prevent this? Or can I ignore it and move forward, since my residual plots look good and the outputs are mostly the same?
Thanks in advance for any help.
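A minimal diagnostic sketch (not an answer), assuming the placeholder object name m_bam for the bam() fit from the question: the reported percentage can be recomputed directly from the deviance components stored in the fitted object, which helps confirm whether the summary itself is at fault.

library(mgcv)
m_bam <- bam(f1, family = scat(link = "identity"), data = d)   # as in the question
# deviance explained = (null deviance - residual deviance) / null deviance
(m_bam$null.deviance - m_bam$deviance) / m_bam$null.deviance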

Poisson regression with an overload of zeroes in SAS

I am testing different models to find the best and most robust fit to my data. My dataset contains over 50,000 observations; approximately 99.3% of the data are zeroes and only about 0.7% are actual events.
For reference, see: https://imgur.com/a/CUuTlSK
I want to find the best fit among the following models: Logistic, Poisson, NB, ZIP, ZINB, PLH, NBLH (NB: negative binomial, ZI: zero-inflated, P: Poisson, LH: logit hurdle).
The first way I tried doing this was by estimating the binary response with logistic regression.
My questions: Can I use Poisson on the binary variable, or should I instead replace the binary indicator with some integer values, for instance the associated loss (if y=1 then y_val = y*loss)? In my case, the variance of y_val becomes approximately 2.5E9. I stuck with the binary variable because, for this purpose, it does not matter how much the company defaulted with: default is default no matter the amount.
With both logistic regression and Poisson I get terrible statistics: a very high deviance value (and a p-value of 0), poor estimates (many estimated parameters are 0, so the odds ratio is 1), very low confidence intervals; everything seems to be 'wrong'. If I transform the response variable to log(y_val) for y>1 in the Poisson model, the statistics seem to get better; however, this violates the assumption of an integer count response in Poisson.
I have briefly tested the ZINB; it does not change the statistics significantly (it does not help at all in this case).
Is there a proper way of dealing with such a dataset? I am interested in achieving the best fit for my data (on startup businesses and their default status).
The data are cleaned and ready to be fitted. Is there anything I should be aware of that I haven't mentioned?
I use the genmod procedure in SAS with dist=Poisson, zinb, zip etc.
Thanks in advance.
Sorry, my rep is too low to comment, so it has to be an answer.
You should consider an undersampling technique before fitting any regression/model, because your target rate is below 5%, which makes it extremely difficult to predict.
Undersampling is a method of removing non-target observations in order to increase the target ratio. I really recommend considering it; I used it once in my own work and found it helpful. A rough sketch is below.
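A rough sketch of random undersampling, written in R rather than SAS, with hypothetical names df (the data set) and y (the binary target):

set.seed(1)
events     <- df[df$y == 1, ]   # the rare target events (~0.7% here)
non_events <- df[df$y == 0, ]   # the abundant non-events
# keep, say, 4 non-events per event (~20% target rate); the ratio is a modelling choice
keep <- non_events[sample(nrow(non_events), 4 * nrow(events)), ]
df_under <- rbind(events, keep)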

Using an offset in a GAM zero-inflated Poisson (ziP) model

I am trying to model count data of birds in forest fragments of different size. As the plots in which the surveys were conducted also differ in size among fragments, I would like to add survey plot size as an offset term to convert counts to densities.
As I understand from previous questions on this site, this is generally done for Poisson models, as these have a log link. The GAM (mgcv package) I am running with family ziP has link="identity". As far as I understand, in such cases the offset term will be subtracted from the response rather than giving the desired response/offset rate.
However, when I run the model with the offset term and plot the results, it seems to give the result I want (I compared the plot for a Poisson model with the ziP model).
This is the model I used, where Guild reflects different feeding guilds, logArea is the log of fragment size, and Study is my random effect (the data come from several studies).
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re"),
            offset = lnTotalPlotsize, family = ziP(), data = Data_ommited2)
Can someone explain how GAM handles offset terms in this case (ziP model with identity link)? Is it really resulting in the desired response/offset rate or is it doing something else?
Thanks for your help!
Regards,
Robert
Whilst only the identity link is provided, the linear predictor returns the log of the expected count. As such the linear predictor is on a log scale and your use of the offset is OK.
Basically, the model is parameterised in terms of the log of the response rather than the response itself, hence the identity link. The same applies to the ziplss() family.
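A minimal sketch, reusing the model from the question: because the ziP() linear predictor is on the log scale, the offset enters additively as the log of plot size, and it can equivalently be written inside the formula with offset():

library(mgcv)
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re") +
              offset(lnTotalPlotsize),
            family = ziP(), data = Data_ommited2)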

random forest gets worse as number of trees increases

I am running into difficulties when using randomForest (in R) for a classification problem. My R code, an image, and data are here:
http://www.psy.plymouth.ac.uk/research/Wsimpson/data.zip
The observer is presented with either a faint image (contrast=con) buried in noise or just noise on each trial. He rates his confidence (rating) that the face is present. I have categorised rating to be a yes/no judgement (y). The face is either inverted (invert=1) or not in each block of 100 trials (one file). I use the contrast (1st column of predictor matrix x) and the pixels (the rest of the columns) to predict y.
It is critical to my application that I have an "importance image" at the end which shows how much each pixel contributes to the decision y. I have 1000 trials (length of y) and 4248 pixels+contrast=4249 predictors (ncols of x). Using glmnet (logistic ridge regression) on this problem works fine
fit<-cv.glmnet(x,y,family="binomial",alpha=0)
However randomForest does not work at all,
fit <- randomForest(x=x, y=y, ntree=100)
and it gets worse as the number of trees increases. For invert=1, the classification error for randomForest is 34.3%, and for glmnet it is 8.9%.
Please let me know what I am doing wrong with randomForest, and how to fix it.
Ridge regression's only tuning parameter, lambda, is chosen via internal cross-validation in cv.glmnet, as pointed out by Hong Ooi, and the error rate you get out of cv.glmnet relates to that. randomForest gives you the OOB error, which is akin to the error on a dedicated test set (which is what you are interested in).
randomForest requires you to calibrate it manually (i.e. use a dedicated validation set to see which parameters work best), and there are a few parameters to consider: the depth of the trees (via fixing the number of examples in each node or the number of nodes), the number of randomly chosen attributes considered at each split (mtry), and the number of trees. You can use tuneRF to find a good value of mtry; see the sketch below.
When evaluated on the training set, the more trees you add, the better your predictions get. However, predictive ability on a test set may stop improving (or diminish) after a certain number of trees are grown; this is usually attributed to overfitting. randomForest reports the OOB error estimate as trees are added, or the test-set error if you supply a test set. If rf.mod is your fitted RF model, plot(rf.mod) lets you see roughly where the error stops improving, and predict() on a fitted RF uses all of the trees that were grown, so choose ntree with that in mind.
In short, you are not comparing the two models' performance correctly (as pointed out by Hong Ooi), and your parameters might also be off and/or you might be overfitting (although that is unlikely with just 100 trees).
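A minimal sketch of that tuning step, assuming the x matrix and y from the question (y must be a factor for randomForest to treat this as classification):

library(randomForest)
set.seed(1)
res <- tuneRF(x, y, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)  # OOB error by mtry
best_mtry <- res[which.min(res[, 2]), 1]       # mtry with the lowest OOB error
fit <- randomForest(x = x, y = y, ntree = 500, mtry = best_mtry)
plot(fit)   # OOB error as a function of the number of trees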

Comparing nonlinear regression models

I want to compare the curve fits of three models by R-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates R-squared values; they do give the residual standard error and the residual sum of squares.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
R-squared is not really meaningful for non-linear regression, which is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data are binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
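For instance, a minimal sketch with a hypothetical data frame dat and variables y and x:

fit_count <- glm(y ~ x, family = poisson,  data = dat)   # y is a count
fit_bin   <- glm(y ~ x, family = binomial, data = dat)   # y is success/failure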
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
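A minimal sketch of those two checks for an nls fit, where fit is a placeholder for your fitted model:

r <- residuals(fit)
plot(fitted(fit), r / sd(r), xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)    # look for scatter that widens or narrows with the fit
qqnorm(r); qqline(r)      # look for departures from the straight line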
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
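As a rough illustration of the last two points, assuming nested nls fits fit1 and fit2 (hypothetical names):

anova(fit1, fit2)      # F test comparing the nested fits
summary(fit1)$sigma    # residual standard error of each fit
summary(fit2)$sigma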
