I have a basic question concerning the predict function and its R-squared value.
data<- datasets::faithful
# Linear model
model<- lm (eruptions~., data=data)
summary (model)[[9]] # Adjusted R-squared
# Creating test data to predict eruption values using the previous model
test.data<- data.frame(waiting= rnorm (80, mean = 70, sd = 14 ))
pred<- predict (model, test.data) # Predict eruptions based on the previous model
# New dataset with predicted values
newdata<- data.frame (eruptions = pred, waiting = test.data$waiting)
# Linear model to predicted values
new.model<- lm (eruptions~., data = newdata)
summary (new.model)[[9]] ## R-squared from predicted values
The R-squared of the data set with predicted values is 1. It seems obvious that if the predicted values are generated from the same variable that the model uses, the fit measured by R-squared is perfect (= 1). But what I want is to measure how well my model performs on other datasets, for example test.data in the code. Am I using the predict function correctly?
Thanks in advance
Pass your new variables as the "newdata" argument of the predict() function.
Type ?predict in an R console to get its R Documentation:
newdata
An optional data frame in which to look for variables with
which to predict. If omitted, the fitted values are used.
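A fit on the model's own predictions always gives R-squared = 1, because those predictions lie exactly on the fitted line. To judge the model on other data you need observed eruptions for those rows. Here is a minimal sketch of one way to do that, assuming a random 70/30 split of faithful (the split ratio, the seed, and the names train, test, r2_test and rmse_test are my own choices, not from the question):

```r
set.seed(1)
data <- datasets::faithful

# Hold out 30% of the rows so that observed values exist for the test set
idx   <- sample(nrow(data), size = floor(0.7 * nrow(data)))
train <- data[idx, ]
test  <- data[-idx, ]

model <- lm(eruptions ~ waiting, data = train)
pred  <- predict(model, newdata = test)

# Out-of-sample R-squared: 1 - SSE/SST, computed against the observed test values
sse <- sum((test$eruptions - pred)^2)
sst <- sum((test$eruptions - mean(test$eruptions))^2)
r2_test   <- 1 - sse / sst
rmse_test <- sqrt(mean((test$eruptions - pred)^2))

r2_test
rmse_test
```

The point is that the comparison is between pred and the observed test$eruptions, not a second regression on the predictions.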
Suppose I am given data like the following.
set.seed(1005) # set the seed before any random draw so outcome is reproducible too
id1=rep(1:10,2)
trt=c(rep(1,10),rep(0,10))
outcome=rnorm(20)
missing=c()
for(i in outcome){
if(rbinom(1,1,0.8*abs(i/max(abs(outcome))))==1){
missing=c(missing,which(outcome==i))
}
}
missing
trt[missing]=NA
dat=data.frame(id=id1,trt=as.factor(trt),outcome=outcome)
a1=mice::mice(dat,method=c('','logreg',''))
I can use mice to impute the data first and then conduct the analysis on a1, where I assume that outcome and id predict trt by logistic regression. In fact, only outcome predicts trt here. a1$formulas$trt gives the formula used for imputation. I want to modify this formula so that it includes a constant offset.
forms_a1=a1$formulas
forms_a1$trt=as.formula(trt~outcome+offset(2))
mice::mice(dat,method=c('','logreg',''),formulas = forms_a1)
However, mice gives the following output.
iter imp variable
1 1 trtError in model.frame.default(formula, data = data, na.action = na.pass) :
variable lengths differ (found for 'offset(2)')
Q1: How do I offset the intercept here? I think I could add an extra column as the offset variable and modify the formula and predictorMatrix. The $\delta$ shift was implemented here (https://stefvanbuuren.name/fimd/sec-sensitivity.html) for the continuous-variable case. However, it seems that this may change the estimated coefficients.
Q2: If I am interested in offsetting a slope, say by 10*id in the above formula, how would I do so?
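For what it's worth, the error itself can be reproduced outside mice: offset(2) evaluates to a length-1 vector, and model.frame() requires every term to have one value per row. The sketch below uses only base R on made-up data (the data frame d and the column off are hypothetical) to show the failure, a constant offset column that glm() accepts, and an offset of 10*id. It does not check whether mice's logreg imputation actually honors an offset in the formula, so treat that part as an open question.

```r
set.seed(1)
n <- 20
d <- data.frame(id = rep(1:10, 2),
                outcome = rnorm(n),
                trt = rbinom(n, 1, 0.5))

# offset(2) has length 1, so model.frame() rejects it
err <- tryCatch(
  model.frame(trt ~ outcome + offset(2), data = d),
  error = function(e) conditionMessage(e)
)
err

# A constant column with one value per row is accepted as an offset
d$off <- 2
fit <- glm(trt ~ outcome + offset(off), family = binomial, data = d)
coef(fit)

# An expression such as 10 * id already has one value per row
mf <- model.frame(trt ~ outcome + offset(10 * id), data = d)
head(model.offset(mf))
```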
I am interested in running some multivariable linear regression data simulations to try out some new statistical methods before I use them on my real dataset, where I regress an outcome (both continuous and categorical) on a set of predictor variables.
The goal would be to generate data with three fake exposures, and an outcome, with the option of setting the beta estimate for the relationship between each exposure and outcome (continuous), or the relative risk or odds ratio for the outcome (categorical outcome). Is that something that can be easily done in R?
For example, it would be great to set a 4 variable dataset where one variable is related to the categorical outcome with an OR/RR of 1.5 that I set, and then I would get a RR/OR of 1.5 for that relationship if I ran a logistic regression on the dataset.
Thanks!
You can generate random categorical variables and then set B0=1, B1=log(1.5), B2=1, B3=1, and generate the appropriate XB. Then, using a logit link function, you can generate P(Y=1|x) for each observation/row x and use sample to choose Y=1 or 0 with that probability. Fit a logistic regression using the binomial family and finally exponentiate the coefficient of "a" to get the odds ratio for this variable. Since we had set it to log(1.5), exponentiating gives approximately 1.5.
dt=data.frame(a=sample(c(0,1), 10000, replace=TRUE),
b=sample(c(0,1), 10000, replace=TRUE),
c=sample(c(0,1), 10000, replace=TRUE))
library(dplyr)
dt=mutate(dt, xb=1+log(1.5)*a+b+c, linked=1/(1+exp(-xb)))
y=numeric()
for (i in 1:10000) {
y[i]=sample(c(1,0), prob=c(dt$linked[i], 1-dt$linked[i]), size=1)
}
dt$y=y
m=glm(data=dt, y ~ a+b+c, family="binomial")
exp(m$coef["a"])
1.422448
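The same simulation can be written without the loop or dplyr. This is only a sketch of an equivalent approach, assuming the same coefficients as above. The seed and the names or_a and ci_a are mine, and I use rbinom and plogis in place of sample and the hand-written inverse logit. The Wald interval from confint.default gives a rough idea of how far the estimate can drift from 1.5 by chance.

```r
set.seed(42)
n <- 10000

# Three independent binary exposures
dt <- data.frame(a = rbinom(n, 1, 0.5),
                 b = rbinom(n, 1, 0.5),
                 c = rbinom(n, 1, 0.5))

# Linear predictor with a true odds ratio of 1.5 for a
xb <- 1 + log(1.5) * dt$a + dt$b + dt$c

# plogis() is the inverse logit; rbinom() draws all outcomes in one call
dt$y <- rbinom(n, 1, plogis(xb))

m <- glm(y ~ a + b + c, family = binomial, data = dt)
or_a <- exp(coef(m)[["a"]])               # estimated odds ratio for a
ci_a <- exp(confint.default(m)["a", ])    # Wald 95% interval on the OR scale

or_a
ci_a
```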
I am using random-forest for a regression problem to predict the label values of Test-Y for a given set of Test-X (new values of features). The model has been trained over a given Train-X (features) and Train-Y (labels). "randomForest" of R serves me very well in predicting the numerical values of Test-Y. But this is not all I want.
Instead of only a number, I want to use random-forest to produce a probability density function. I searched for a solution for several days and here is what I have found so far:
"randomForest" doesn't produce probabilities for regression, only for classification (via "predict" with type = "prob").
Using "quantregForest" provides a nice way to make and visualize prediction intervals. But still not the probability density function!
Any other thought on this?
Please see the predict.all parameter of the predict.randomForest function.
library("ggplot2")
library("randomForest")
data(mpg)
rf = randomForest(cty ~ displ + cyl + trans, data = mpg)
# Predict the first car in the dataset
pred = predict(rf, newdata = mpg[1, ], predict.all = TRUE)
hist(pred$individual)
The last line draws a histogram of the 500 "elementary" (per-tree) predictions.
You can also use quantregForest with a very fine grid of quantiles, convert them into a cumulative distribution function (CDF) with the R function ecdf, and then convert this CDF into a density estimate with a kernel density estimator.
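As a rough sketch of the kernel density step, here is the per-tree route with base R's density(). I use mtcars rather than mpg only to avoid the ggplot2 dependency, and the formula, seed and the names per_tree and dens are my own assumptions. Note that the spread of per-tree predictions describes how much the trees disagree. It is not necessarily the conditional distribution of the response, so read the curve with that caveat.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(mpg ~ disp + cyl + wt, data = mtcars, ntree = 500)

# predict.all = TRUE returns the aggregate prediction and one prediction per tree
pred <- predict(rf, newdata = mtcars[1, ], predict.all = TRUE)
per_tree <- as.numeric(pred$individual)

# Kernel density estimate over the 500 per-tree predictions
dens <- density(per_tree)
plot(dens, main = "Density of per-tree predictions (first car)")
abline(v = pred$aggregate, lty = 2)  # the forest's point prediction
```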
I split the data set into train and test as follows:
splitdata<-split(sb[1:nrow(sb),], sample(rep(1:2, as.integer(nrow(sb)/2))))
test<-splitdata[[1]]
train<-rbind(splitdata[[2]])
sb is the name of the original data set, so it is a 50/50 split into train and test.
Then I fitted a glm using the training set.
fitglm<- glm(num_claims~year+vt+va+public+pri_bil+persist+penalty_pts+num_veh+num_drivers+married+gender+driver_age+credit+col_ded+car_den, family=poisson, train)
Now I want to predict using this glm, say for the next 10 observations.
I have trouble specifying the newdata in predict().
I tried:
pred<-predict(fitglm,newdata=data.frame(train),type="response", se.fit=T)
This gives a number of predictions equal to the number of samples in the training set.
Finally, how do I plot these predictions with confidence intervals?
Thank you for the help
If you are asking how to construct predictions on the next 10 in the test set then:
pred10<-predict(fitglm,newdata=data.frame(test)[1:10, ], type="response", se.fit=T)
Edit 9 years later:
@carsten's comment is correct regarding how to construct a confidence interval. If a glm object, here fitglm, has a non-linear link function, then this is a reasonably general method to recover the inverse of the link function and construct a two-sided 95% CI on the response scale:
pred.fit <- predict(fitglm, newdata=newdata, se.fit=TRUE) # default type="link"
CI.pred.upper <- family(fitglm)$linkinv( # that information is in the model
pred.fit$fit + 1.96*pred.fit$se.fit )
CI.pred.lower <- family(fitglm)$linkinv( # that information is in the model
pred.fit$fit - 1.96*pred.fit$se.fit )
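For the plotting part of the question, a minimal base-graphics sketch could look like the following. Since sb is not available, it runs on simulated Poisson data, and the data, the single predictor x, and the names sim, new10, est, lower and upper are all hypothetical stand-ins. The interval is built on the link scale and back-transformed, as described above.

```r
set.seed(123)
n <- 200
sim <- data.frame(x = runif(n, 0, 2))
sim$y <- rpois(n, lambda = exp(0.3 + 0.8 * sim$x))

train <- sim[1:100, ]
test  <- sim[101:200, ]

fit   <- glm(y ~ x, family = poisson, data = train)
new10 <- test[1:10, ]

# Predict on the link scale, then back-transform the fit and the interval limits
p   <- predict(fit, newdata = new10, se.fit = TRUE)
inv <- family(fit)$linkinv
est   <- inv(p$fit)
lower <- inv(p$fit - 1.96 * p$se.fit)
upper <- inv(p$fit + 1.96 * p$se.fit)

# Point predictions with vertical 95% interval bars
i <- seq_along(est)
plot(i, est, pch = 19, ylim = range(lower, upper),
     xlab = "Test observation", ylab = "Predicted count")
segments(i, lower, i, upper)
```

These are intervals for the expected count, not prediction intervals for individual observations.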
I need to plot a binned residual plot with fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit from which residuals can be extracted?
This is the code I used:
library(MASS) # for polr
library(arm) # for binnedplot
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method='logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that object 'res' is 'null'.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects) -- or you could compute an n-by-n table of true values and predicted values. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
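To make the alternatives concrete, here is a sketch using the housing example from the MASS documentation. The choice of data set and the names probs, pred_class, tab, expected and res are mine, and neither quantity below is an official polr residual. It computes the table of true versus predicted categories and a crude "observed minus expected score" residual on the integer scale.

```r
library(MASS)

# Proportional-odds model from the ?polr example (grouped data with frequency weights)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

probs      <- fit$fitted.values  # one row per data row, one column per category
pred_class <- predict(fit)       # most likely category for each row

# True versus predicted categories, weighted by the cell frequencies
tab <- xtabs(Freq ~ Sat + pred_class, data = housing)
tab

# Integer-scale alternative: observed category code minus expected code
scores   <- seq_len(ncol(probs))
expected <- as.numeric(probs %*% scores)
res      <- as.integer(housing$Sat) - expected
head(res)
```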
For polr(), there is no function that returns residuals. You have to calculate them manually from whichever definition you choose.
There are actually plenty of ways to get residuals from an ordinal probit/logit. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). These are as far as I know currently not available in R, although I am working on that, but they are in Stata (using predict gr, score)
Residuals in VGLM
Surrogate residuals for polr
You can use the package sure to calculate surrogate residuals with resids. The package is based on a paper in the Journal of the American Statistical Association.
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
df1 <- df1
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data=df1, method='probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim=1000, method="latent"), there is no convergence of the outcome.