R: cannot predict specific value [duplicate]

This question already has answers here:
Predict() - Maybe I'm not understanding it
(4 answers)
Closed 6 years ago.
> age <- c(23,19,25,10,9,12,11,8)
> steroid <- c(27.1,22.1,21.9,10.7,7.4,18.8,14.7,5.7)
> sample <- data.frame(age,steroid)
> fit2 <- lm(sample$steroid~poly(sample$age,2,raw=TRUE))
> fit2
Call:
lm(formula = sample$steroid ~ poly(sample$age, 2, raw = TRUE))
Coefficients:
(Intercept) -27.7225
poly(sample$age, 2, raw = TRUE)1 5.1819
poly(sample$age, 2, raw = TRUE)2 -0.1265
> (newdata=data.frame(age=15))
age
1 15
> predict(fit2,newdata,interval="predict")
fit lwr upr
1 24.558395 17.841337 31.27545
2 25.077825 17.945550 32.21010
3 22.781034 15.235782 30.32628
4 11.449490 5.130638 17.76834
5 8.670526 2.152853 15.18820
6 16.248596 9.708411 22.78878
7 13.975514 7.616779 20.33425
8 5.638620 -1.398279 12.67552
Warning message:
'newdata' had 1 rows but variable(s) found have 8 rows
Why is the predict function unable to predict for age=15?

Instead of lm(data$y ~ data$x), use the form lm(y ~ x, data). That should solve your problem.
EDIT: the problem is not only the call to lm, but also the use of poly(*, raw=TRUE). If you remove the raw=TRUE bit, it should then work; I'm not sure why raw=TRUE breaks here.
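A minimal sketch of the suggested fix, using the same toy data (the data frame is renamed dat here to avoid confusion with base::sample):

age <- c(23, 19, 25, 10, 9, 12, 11, 8)
steroid <- c(27.1, 22.1, 21.9, 10.7, 7.4, 18.8, 14.7, 5.7)
dat <- data.frame(age, steroid)

# Keep the variables out of the formula and pass them via `data`,
# so predict() can match the columns of `newdata`.
fit2 <- lm(steroid ~ poly(age, 2), data = dat)
predict(fit2, newdata = data.frame(age = 15), interval = "prediction")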

Related

Regression in R using poly() function

The poly() function in R produces orthogonal polynomial terms, which can be helpful for interpreting coefficient significance. However, I don't see the point of using it for prediction. In my view, the following two models (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But they don't. Why?
Because they are not the same model: your second one has a single covariate, while the first has two. Inside a formula, ^ is the crossing operator rather than arithmetic exponentiation, so q^2 expands to just q and the quadratic term is silently dropped:
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should wrap the term in the I() function inside your formula so that the regression treats it as a separate covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
This will give the same predictions:
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
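As a quick check (a sketch, using the corrected model_2 above), the fitted values agree even though the two parameterizations have different coefficients, because poly(q, 2) and 1 + q + I(q^2) span the same column space:

all.equal(unname(predict(model_1)), unname(predict(model_2)))
# [1] TRUE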

Order of predictions from merTools predictInterval()

I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with some attribute of my input or prediction data. I've tried reclassifying the factor variables as character and enforcing the row order both before fitting the model and between fitting and predicting, but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace = TRUE),
                id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~ x + (1 | z), family = gaussian,
                newdata = d,
                newparams = list(beta = c(2, -1.1), theta = c(.25),
                                 sigma = c(.23)),
                seed = 463)[[1]]
lmeFit <- lmer(y ~ x + (1 | z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace = TRUE),
                       id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see whether you can create one?
Alternatively, can you test whether the same re-ordering issue appears after first coercing the factors to characters?
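In the meantime, a hedged diagnostic sketch against the question's own objects: since predict() returns fitted values in the row order of newdata, comparing it with the midpoints from predictInterval() flags whether the latter came back re-ordered:

# Differences far from zero suggest the rows of preds_interval
# are not aligned with predData; small simulation noise is expected.
preds_mid <- predict(lmeFit, newdata = predData)
preds_interval <- predictInterval(lmeFit, newdata = predData, n.sims = 1000)
summary(preds_interval$fit - preds_mid)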

Logistic Regression using for loops in R [duplicate]

This question already has answers here:
How to debug "contrasts can be applied only to factors with 2 or more levels" error?
(3 answers)
Closed 5 years ago.
I am trying to run a binary logistic regression using for loops in R.
My code for the same is as follows:
mydata5<-read.table(file.choose(),header=T,sep=",")
colnames(mydata5)
Class <- 1:16
Countries <- 1:5
Months <- 1:7
DayDiff <- 1:28
mydata5$CT <- factor(mydata5$CT)
mydata5$CC <- factor(mydata5$CC)
mydata5$C <- factor(mydata5$C)
mydata5$DD <- factor(mydata5$DD)
mydata5$UM <- factor(mydata5$UM)
for (i in seq(along = Class)) {
  mydata5$C <- mydata5$C[i]
  for (i2 in seq(along = Countries)) {
    mydata5$CC <- mydata5$CC[i2]
    for (i3 in seq(along = Months)) {
      mydata5$UM <- mydata5$UM[i3]
      for (i4 in seq(along = DayDiff)) {
        mydata5$DD <- mydata5$DD[i4]
        lrfit5 <- glm(CT ~ C + CC + UM + DD,
                      family = binomial(link = "logit"), data = mydata5)
        summary(lrfit5)
        library(lattice)
        in_frame <- data.frame(C = "mydata5$C[i]", CC = "mydata5$CC[i2]",
                               UM = "mydata5$UM[i3]", DD = "mydata5$DD[i4]")
        predict(lrfit5, in_frame, type = "response", se.fit = FALSE)
      }
    }
  }
}
However, I'm getting the following error:
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Why is the error occurring? Also, the dataset "mydata5" has around 50,000 rows. Please help.
Thanks in advance.
You have tried to do a regression with a factor having only one level. Since you haven't given us your data, we can't reproduce your analysis, but I can easily reproduce your error message:
> d = data.frame(x=runif(10),y=factor("M",levels=c("M","F")))
> d
x y
1 0.07104688 M
2 0.11948466 M
3 0.20807068 M
4 0.24049508 M
5 0.44251492 M
6 0.69775646 M
7 0.44479983 M
8 0.64814971 M
9 0.75151207 M
10 0.38810621 M
> glm(x~y,data=d)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
After setting one of the factor values to "F", I no longer get the error message:
> d$y[5]="F"
> glm(x~y,data=d)
Call: glm(formula = x ~ y, data = d)
Coefficients:
(Intercept) yF
0.39660 0.04591
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 0.5269
Residual Deviance: 0.525 AIC: 4.91
So somewhere in your loops (which we cannot run because we don't have your data) you are doing this.
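For what it's worth, here is a minimal sketch (invented data, not the asker's) of how the loop's assignment produces exactly that situation: writing a single element back over a whole column recycles that one value, so the factor has only one observed level left once glm() drops unused levels while building the model frame.

demo <- data.frame(CT = factor(rep(c(0, 1), 5)),
                   C = factor(rep(c("A", "B"), 5)))
demo$C <- demo$C[1]   # recycled: every row now holds the same value of C
glm(CT ~ C, family = binomial, data = demo)
# Error in `contrasts<-`(...) :
#   contrasts can be applied only to factors with 2 or more levels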

regressions with xts in R

Is there a utility to run regressions using xts objects of the following type:
lm(y ~ lag(x, 1) + lag(x, 2) + lag(x, 3), data = as.data.frame(coredata(my_xts)))
where my_xts is an xts object that contains an x and a y. The point of the question is: is there a way to avoid doing a bunch of lags and merges just to get a data.frame with all the lags? I think the dyn package works with zoo objects, so I would expect it to work the same way with xts, but there might be something more up to date.
The dyn and dynlm packages can do that with zoo objects. In the case of dyn just write dyn$lm instead of lm and pass it a zoo object instead of a data frame.
Note that lag in xts works the opposite of the usual R convention: if x is of class xts, then lag(x, 1) gives the same result that lag(x, -1) would if x were of class zoo or ts.
> library(xts)
> library(dyn)
> x <- xts(anscombe[c("y1", "x1")], as.Date(1:11)) # test data
> dyn$lm(y1 ~ lag(x1, -(1:3)), as.zoo(x))
Call:
lm(formula = dyn(y1 ~ lag(x1, -(1:3))), data = as.zoo(x))
Coefficients:
(Intercept) lag(x1, -(1:3))1 lag(x1, -(1:3))2 lag(x1, -(1:3))3
3.80530 0.04995 -0.12042 0.46631
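A quick hedged check of that sign convention on a toy series (output shown as expected from the documented behavior):

library(xts)   # also loads zoo

z <- zoo(1:5, as.Date("2020-01-01") + 0:4)
xz <- as.xts(z)

coredata(lag(z, -1, na.pad = TRUE))   # zoo convention: NA 1 2 3 4
coredata(lag(xz, 1))                  # xts convention: NA 1 2 3 4, same shift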
Since you are already moving the data out of the xts environment, I'm not using any xts features here. There is an embed function that will construct a "lagged" matrix to any desired degree. (I never understood the time-series lag function, and the order of the embed-lagged columns is reversed from what I would have expected.)
embed(1:6, 3)
#--------
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 4 3 2
[3,] 5 4 3
[4,] 6 5 4
#Worked example ... need to shorten the y variable
y <- rnorm(20)
x <- rnorm(20)
lm( tail(y, 18) ~ embed(x, 3) )
#-------------------
Call:
lm(formula = tail(y, 18) ~ embed(x, 3))
Coefficients:
(Intercept) embed(x, 3)1 embed(x, 3)2 embed(x, 3)3
-0.12452 -0.34919 0.01571 0.01715
It was a relief to note that, after changing the lags to match those used by @GGrothendieck, we get identical results:
lm( tail(xx[,"y1"], NROW(xx)-3) ~ embed(xx[,"x1"], 4)[,2:4] )
Call:
lm(formula = tail(xx[, "y1"], NROW(xx) - 3) ~ embed(xx[, "x1"],
4)[, 2:4])
Coefficients:
(Intercept) embed(xx[, "x1"], 4)[, 2:4]1 embed(xx[, "x1"], 4)[, 2:4]2
3.80530 0.04995 -0.12042
embed(xx[, "x1"], 4)[, 2:4]3
0.46631

How can I manipulate GLM coefficients in R?

How can I manipulate a GLM object in order to bypass this error? I would like predict to treat the unseen levels as base cases (that is, to give them a coefficient of zero).
> master <- data.frame(x = factor(floor(runif(100,0,3)), labels=c("A","B","C")), y = rnorm(100))
> part.1 <- master[master$x == 'C',]
> part.2 <- master[master$x == 'A' | master$x == 'B',]
> model.2 <- glm(y ~ x, data=part.2)
> predict.1 <- predict(model.2, part.1)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor 'x' has new level(s) C
I tried doing this:
> model.2$xlevels$x <- c(model.2$xlevels, "C")
> predict.1 <- predict(model.2, part.1)
But it's not scoring the model correctly:
> predict.1[1:5]
2 3 6 8 10
0.03701494 0.03701494 0.03701494 0.03701494 0.03701494
> summary(model.2)
Call:
glm(formula = y ~ x, data = part.2)
<snip>
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12743 0.18021 0.707 0.482
xB -0.09042 0.23149 -0.391 0.697
predict.1 should only be 0.12743.
This is obviously just a trimmed-down version; my real model has 25 or so variables in it, so an answer of predict.1 <- rep(0.12743, nrow(part.1)) is not useful to me.
Thanks for any help!
If you know that observations where x=='C' behave exactly like x=='A', then you can just do:
> part.1$x <- factor(rep("A",nrow(part.1)),levels=c("A","B"))
> predict(model.2, part.1)
which will give you your pure intercept model.
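If the same trick has to cover many factor columns (the question mentions 25 or so variables), a hedged generalization of the idea is to remap any level unseen at fit time to the model's base level before predicting. The loop below is illustrative, not package code; it relies only on the xlevels component that lm/glm fits store:

for (v in names(model.2$xlevels)) {
  seen <- model.2$xlevels[[v]]                 # levels used at fit time
  vals <- as.character(part.1[[v]])
  vals[!vals %in% seen] <- seen[1]             # base level: coefficient zero
  part.1[[v]] <- factor(vals, levels = seen)
}
predict(model.2, part.1)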
I disagree that you should expect any prediction. You developed a model with no observations whose x value is "C", so you should not expect a prediction for that level. Your effort to produce predictions for rows 1:5 should likewise fail.
