R prediction interval for the mean of the new sample

Given a regression model created from one dataset, I have been using WinBUGS to construct prediction intervals (PIs) around the mean of a second dataset. I have just discovered the "predict" function in R, but it delivers PIs around each predicted value in the second dataset. I have searched the R help, here, and on the net, but have only found the intervals for the individual observations.
The average of these intervals is clearly not the same as the PI around the predicted sample mean (and I have tested that against the value I got from WinBUGS).
How do I get R to give me the PI around the mean?

There used to be an R mean.data.frame function, but it was deprecated and then removed. You can get the same result with:
mean.vec <- lapply(na.omit(dfrm), mean)
Then probably just:
predict(fit, newdata = data.frame(mean.vec))
I say 'probably' because you provided no dataset to test this with and provision of such is in my opinion your responsibility. I have no idea whether this replicates the JMP method or the WinBUGS method.
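For illustration only, a minimal sketch of that workflow with a hypothetical lm fit "fit" and second dataset "newdat" (both names are assumptions, as is the choice of interval type, which may or may not correspond to the WinBUGS quantity):
# Per-column means of the new sample, collapsed to a one-row data frame
mean.vec <- lapply(na.omit(newdat), mean)
mean.row <- data.frame(mean.vec)
# Interval for the prediction at the mean covariate values; compare
# interval = "confidence" vs "prediction" against the WinBUGS result
predict(fit, newdata = mean.row, interval = "confidence", level = 0.95)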

Related

Use glm to predict on fresh data

I'm relatively new to glm, so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors from the dataset, the confusion matrices are very good for what I need, and combining seven years of data has also been done. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will not ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this:
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is binary classification and you used logistic regression for it. The output of your model, when you apply it to new data or even to the same data used to fit the model, is a set of probabilities: values between zero and one. In the development phase you need to determine a cutoff threshold for these probabilities, which you can then use when you predict on new data. For example, you may choose 0.5 as the cutoff, so that every probability above it is classed as NOTCONTINUE and every probability below it as CONTINUE. However, the best threshold can also be determined from your data by balancing sensitivity and specificity, which is done by analysing the receiver operating characteristic (ROC) curve and its area under the curve (AUC). There are many packages that can do this for you, such as the pROC and AUC packages in R; the same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret = "threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each student. Use the cutoff you obtained from step (1) to classify your students accordingly, as in the sketch below.
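Putting the two steps together, a minimal sketch (object names follow the question; labelling the classes as 1 = NOTCONTINUE and 0 = CONTINUE is an assumption):
library(pROC)
# Step 1: best cutoff from the ROC curve of the fitted model
roc_object <- roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
best_cut <- coords(roc_object, "best", ret = "threshold", transpose = FALSE)$threshold
# Step 2: predicted probabilities for the new year, then classify
Predict2020 <- predict(CombinedYears, newdata = Data2020, type = "response")
Class2020 <- ifelse(Predict2020 >= best_cut, 1, 0)
table(Class2020)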

R nonlinear regression of cumulative X and Y data

I'm trying to figure out how to fit a nonlinear regression to some cumulative X and Y data. The dataset is based on cumulative items and their respective cumulated demand. I have a plot that looks like this (plot omitted),
based on the following 5299 observations of items, which are available here: abc.csv datafile
and I would like to fit a model that can explain them quite neatly. Given the plot, I reckon that there is a high degree of detail, so I believe it should be possible to find a function that explains the data with very high accuracy.
The problem, however, is that I find myself trying to fit a model with nls() by trial and error. Furthermore, some of the functions that I've tried explain part of the pattern, but not the full detail. For instance,
# Column names can be used directly, since data = abc is supplied
nlm <- nls(Cumfreq ~ c * (1 - exp(-a * noe)) + b, data = abc,
           start = list(a = 4.14, b = 0.21, c = 0.79))
This yields the fit shown in the attached plot (not reproduced here).
My question is: how do I obtain a regression with a better fit? Is there a function in R or another way of achieving this? (fingers crossed for a math genius out there)
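As an aside, a hedged sketch (not from the original thread): nls also ships self-starting models such as SSasymp, which fit the same saturating-exponential shape as c*(1-exp(-a*x))+b without hand-picked starting values, assuming abc has the columns noe and Cumfreq used above:
# SSasymp fits Cumfreq = Asym + (R0 - Asym) * exp(-exp(lrc) * noe)
fit_ss <- nls(Cumfreq ~ SSasymp(noe, Asym, R0, lrc), data = abc)
summary(fit_ss)
# Overlay the fitted curve on an existing plot of noe vs Cumfreq
lines(abc$noe, fitted(fit_ss), col = "red")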

Predicted(?) values from an lmer model

I have a data frame of bird counts. I have each participant's ID number, the number of birds they counted, the year they counted them, their lat and long coordinates, and their effort. I have made this model:
library(lme4)
model = lmer(count ~ year + lat + long + effort + (1 | participant), data = df)
I now want the model to plot predicted values from that same data set. So, that data was for 1997-2017, and I want the model to give me predicted values for each year. I want to plot these, so the final plot will have the predicted count on the y-axis, and the year (categorical) on the x-axis. Each year will have one data point w/ a confidence interval.
I have tried figuring out predict(), but I'm not quite sure how to use it to get what I want. It seems to need a new data frame, but I don't have a new data set to run through the model to predict a future count. I want the model to go back and work on the data I already put into it, based on the beta values in the output of summary(model).
I found this thread, and it seems to be basically what I'm looking to do, but I can't get the sjPlot dependencies to install; sjlabelled throws an error every time: How to plot predicted values with standard errors for lmer model results?
You could try the ggeffects-package, which will be used in the forthcoming sjPlot-update to plot predicted values.
library(ggeffects)
dat <- ggpredict(model, terms = "year")
plot(dat)
If you're missing dependencies, try:
install.packages(
c("sjlabelled", "sjmisc", "sjstats", "ggeffects", "sjPlot"),
dependencies = TRUE
)
You may even want to install ggeffects from GitHub, since the current dev-version has some fixes and improvements for mixed models.
devtools::install_github("strengejacke/ggeffects")
I found the package I was looking for: it's called predictmeans and has a function of the same name where you pass in the model and the model term you want predictions for, i.e. predictmeans(model, "modelterm"). It works perfectly!
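For reference, a minimal sketch of that call with the model above, assuming year enters the model as a factor (output components may differ slightly by package version):
library(predictmeans)
# Predicted mean count for each year, with standard errors reported by the package
pm <- predictmeans(model, "year")
pm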

R package multgee -- initial values

I'm working on estimating a generalized estimating equation in R. I have a multinomial (ordinal) outcome, and so I have been attempting to use the package multgee since, as far as I know, packages like geepack or gee don't allow for the estimation of multinomial outcomes.
However, I'm running into some issues. The documentation seems good, but the function seems to require initial (starting) values for the model. If I try to run the model without them, I get a message requesting starting values. Here's the model:
library(multgee)
formula <- PAINAD_recode ~ Age + Gender + group_2part + Cornell + stim_intensity_scale
fitmod <- ordLORgee(formula, data = data, bstart = c(1, 0, 1, 0, 1),
                    id = data$subject, repeated = data$trial)
I just threw in some ones and zeroes for the starting values there. However, when I enter starting values (even plausible ones), it claims that:
Starting values and parameters vector differ in length
I thought that with five predictors I would need five starting values. I can't find more information about this particular argument. Does anyone have any thoughts on this? The outcome here has five levels (ordinal) and the repeated component has 20 levels. Any suggestions would be appreciated.
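(A hedged note, not from the original thread: in a cumulative-link model a 5-level outcome contributes 4 intercepts/cutpoints in addition to the slope coefficients, and factor predictors expand into dummy columns, so bstart will generally need more than five entries. A rough way to count them, assuming the marginal parameter vector is cutpoints plus slopes:)
# Slope coefficients after factor expansion (drop the intercept column)
p <- ncol(model.matrix(formula, data = data)) - 1
# Cutpoints: number of response categories minus one
K <- nlevels(factor(data$PAINAD_recode))
(K - 1) + p   # approximate length that bstart should have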

Running predict() after tobit() in package AER

I am doing a tobit analysis on a dataset where the dependent variable (let's call it y) is left-censored at 0. So this is what I do:
library(AER)
fit <- tobit(data = mydata, formula = y ~ a + b + c)
This is fine. Now I want to run the "predict" function to get the fitted values. Ideally I am interested in the predicted values of the unobserved latent variable "y*" and the observed censored variable "y" [See Reference 1].
I checked the documentation for predict.survreg [Reference 2] and I don't think I understood which option gives me the predicted censored variables (or the latent variable).
Most examples I found online advise the following:
predict(fit, type = "response")
Again, it's not clear what kind of predictions these are.
My guess is that the "type" option in the predict function is the key here, with type="response" meant for the censored variable predictions and type="linear" meant for latent variable predictions.
Can someone with some experience here shed some light on this for me, please?
Many Thanks!
References:
1. http://en.wikipedia.org/wiki/Tobit_model
2. http://astrostatistics.psu.edu/datasets/2006tutorial/html/survival/html/predict.survreg.html
Generally, predict type = "response" results have been back-transformed to the original scale of the data from whatever link transformation was used in the regression, whereas the "linear" predictions are the linear predictors on the link-transformed scale. In the case of tobit, which has an identity link, the two should be the same.
You can check my meta-prediction easily enough. I just checked it with the example on the ?tobit page:
plot(predict(fm.tobit2, type = "response"), predict(fm.tobit2, type = "linear"))
I posted a similar question on stats.stackexchange and I got an answer that could be useful for you:
https://stats.stackexchange.com/questions/149091/censored-regression-in-r
There, one of the authors of the package shows how to calculate the mean (i.e. the prediction) of $Y$, where $Y = \max(Y^*, 0)$. Using the package AER, this has to be done somewhat "by hand".
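A minimal sketch of that "by hand" calculation for left censoring at 0, assuming a normal latent error (object names follow the fit above; this is an illustration, not the linked author's exact code):
# Linear predictor Xb on the latent (y*) scale, and the estimated scale
mu <- predict(fit, type = "lp")
sigma <- fit$scale
# E[Y] for Y = max(Y*, 0): Phi(mu/sigma) * mu + sigma * phi(mu/sigma)
ey <- pnorm(mu / sigma) * mu + sigma * dnorm(mu / sigma)
head(cbind(latent = mu, censored_mean = ey))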
