I have some x and y values which can be fitted nicely with a polynomial
> mysubx
[1] 0.05 0.10 0.20 0.50 1.00 2.00 5.00
[8] 9.00 12.30 18.30
> mysuby
[1] 1.008 1.019 1.039 1.091 1.165 1.258 1.402
[8] 1.447 1.421 1.278
> mymodel <- lm(mysuby ~ poly(mysubx,5))
The fit can be confirmed graphically.
> plot(mysubx, mysuby)
> lines(mysubx, mymodel$fitted.values, col = "red")
My problem happens when I try to use the coefficients returned by lm to determine a y value from a given x. So, for example, using the first value in mysubx should give me mymodel$fitted.values[1]; from the graph I expect a number around 1.01.
> ansx = 0
> for(i in seq_along(mymodel$coefficients)){
+ ansx = ansx + mysubx[1]^(i-1)*mymodel$coefficients[[i]]
+ }
> ansx
[1] 1.229575
>
Where
> mysubx[1]
[1] 0.05
> mymodel$coefficients
(Intercept) poly(mysubx, 5)1 poly(mysubx, 5)2 poly(mysubx, 5)3
1.21280000 0.35310369 -0.35739878 0.10989141
poly(mysubx, 5)4 poly(mysubx, 5)5
-0.04608682 0.02054430
As can be seen from the graph, an x value of 0.05 does not give 1.229575. Obviously I don't understand what is going on. Can someone explain how I can get the correct y value for any given x value using the output of the lm function?
Thank you.
In fact, what you want is not poly(mysubx, 5) but
poly(mysubx, 5, raw = TRUE)
If you leave raw as FALSE (the default), the model is not built on x, x^2, x^3, etc. but on orthogonal polynomials, so the coefficients cannot be plugged into a plain power series.
mymodel <- lm(mysuby ~ poly(mysubx, 5, raw = TRUE))
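As a quick check (a sketch; the name mymodel_raw is just for illustration), the coefficient loop from the question should now land on roughly 1.01 for the first x value, because with raw = TRUE the coefficients really belong to 1, x, x^2, ..., x^5:
mymodel_raw <- lm(mysuby ~ poly(mysubx, 5, raw = TRUE))
ansx <- 0
for (i in seq_along(mymodel_raw$coefficients)) {
  ansx <- ansx + mysubx[1]^(i - 1) * mymodel_raw$coefficients[[i]]
}
ansx  # close to mymodel_raw$fitted.values[1], i.e. about 1.01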
When you fit a model, R first builds a model matrix from your data and your formula. You can get hold of it using the model.matrix function.
> X <- model.matrix(mysuby ~ poly(mysubx,5))
This matrix has a row per input point (in your case your input is one-dimensional and kept in mysubx, but in general, you will get it from a data frame and it can be multi-dimensional). The formula specifies how the input data should be modified before we fit the model. We can take a closer look at the first row:
> X[1,]
(Intercept) poly(mysubx, 5)1 poly(mysubx, 5)2
1.0000000 -0.2517616 0.2038351
poly(mysubx, 5)3 poly(mysubx, 5)4 poly(mysubx, 5)5
-0.2264003 0.2355258 -0.2245773
As you can see, when you fit a polynomial you get values for the intercept (always 1, since the intercept is a constant in the model and doesn't depend on x) and for the transformations applied to your input. We call such a row the "features" you use in your model.
In this case, you have a 1 -> N mapping from input to features. In general, it will be an M -> N mapping. Regardless of how you map input to the model matrix, the model fitting only cares about the model matrix. The model builds a way to map each row in this matrix to a prediction.
For a linear model, the mapping from features to target variable is an inner product. You take the coefficients and compute the inner product with the features. So, for your first data point, you do:
> mymodel$coefficients %*% X[1,]
[,1]
[1,] 1.010704
For the entire data, you simply do this for each row:
> predict(mymodel)
1 2 3 4 5 6 7
1.010704 1.020083 1.038284 1.088659 1.159883 1.263722 1.400163
8 9 10
1.447700 1.420790 1.278011
> apply(X, MARGIN = 1, function(features) mymodel$coefficients %*% features)
1 2 3 4 5 6 7
1.010704 1.020083 1.038284 1.088659 1.159883 1.263722 1.400163
8 9 10
1.447700 1.420790 1.278011
Here, X doesn't have to be the data you trained the model on. You can build it from any other input data using the same formula. I would recommend not using global variables in your formulae, though, as this is likely to cause problems later on.
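If you just want predictions for new x values, a simpler (hedged) route is to let predict() rebuild the features from the stored poly() basis instead of assembling the model matrix yourself; the x values 0.05 and 3 below are purely illustrative:
predict(mymodel, newdata = data.frame(mysubx = c(0.05, 3)))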
I am trying to run the nls function in R using indexed vectors as inputs; however, I am getting an error:
> a=c(1,2,3,4,5,6,7,8,9,10)
> b=c(6,7,9,11,14,18,23,30,38,50) #make some example data
>
> nls(b[1:6]~s+k*2^(a[1:6]/d),start=list(s=2,k=3,d=2.5)) #try running nls on first 6 elements of a and b
Error in parse(text = x, keep.source = FALSE) :
<text>:2:0: unexpected end of input
1: ~
^
I can run it on the full vectors:
> nls(b~s+k*2^(a/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b ~ s + k * 2^(a/d)
data: parent.frame()
s k d
1.710 3.171 2.548
residual sum-of-squares: 0.3766
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.2e-07
I am fairly certain that the indexed vectors have the same variable type as the full vectors:
> a
[1] 1 2 3 4 5 6 7 8 9 10
> typeof(a)
[1] "double"
> class(a)
[1] "numeric"
> a[1:6]
[1] 1 2 3 4 5 6
> typeof(a[1:6])
[1] "double"
> class(a[1:6])
[1] "numeric"
I can run nls if I save the indexed vectors in new variables:
> a_part=a[1:6]
> b_part=b[1:6]
> nls(b_part~s+k*2^(a_part/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b_part ~ s + k * 2^(a_part/d)
data: parent.frame()
s k d
2.297 2.720 2.373
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.274e-07
Furthermore, lm accepts both full and indexed vectors:
> lm(b~a)
Call:
lm(formula = b ~ a)
Coefficients:
(Intercept) a
-4.667 4.594
> lm(b[1:6]~a[1:6])
Call:
lm(formula = b[1:6] ~ a[1:6])
Coefficients:
(Intercept) a[1:6]
2.533 2.371
Is there a way to run nls on indexed vectors without saving them in new variables?
Use subset. (It would also be possible to use the weights argument, giving a weight of 1 to each of the first 6 observations and 0 to the rest.)
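A minimal sketch with the original formula and starting values from the question; subset = 1:6 restricts the fit to the first 6 observations and should reproduce the a_part/b_part fit above:
nls(b ~ s + k * 2^(a/d), start = list(s = 2, k = 3, d = 2.5), subset = 1:6)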
Also you might want to use the plinear algorithm to avoid having to give the starting values for the two parameters that enter linearly. In that case provide a matrix on the RHS with column names s and k such that its first column multiplies s and the second column multiplies k.
nls(b ~ cbind(s = 1, k = 2^(a/d)), subset = 1:6, start = list(d = 2.5),
algorithm = "plinear")
giving:
Nonlinear regression model
model: b ~ cbind(s = 1, k = 2^(a/d))
data: parent.frame()
d .lin.s .lin.k
2.373 2.297 2.720
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 7.186e-08
If I calculate the y value for a specific x value using the predict() function, I obtain a value different from the one I can calculate using the explicit fitting equation.
I fitted the data below using nls(MyEquation) and obtained the m1, m2, ... parameters.
Then, I want to back-calculate the y value for a specific x value using both the predict(m) function and the explicit equation I used for fitting (plugging in the desired x value).
I obtain different y values for the same x value. Which one is the correct one?
> df
pH activity
1 3.0 0.88
2 4.0 1.90
3 5.0 19.30
4 6.0 70.32
5 7.0 100.40
6 7.5 100.00
7 8.0 79.80
8 9.0 7.75
9 10.0 1.21
x <- df$pH
y <- df$activity
m<-nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
> m
Nonlinear regression model
model: y ~ (m1 * (10^(-x)) + m2 * 10^(-m3))/(10^(-m3) + 10^(-x)) - (m5 * (10^(-x)) + 1 * 10^(-i))/(10^(-i) + 10^(-x))
data: parent.frame()
m1 m2 m3 m5
-176.032 13.042 6.282 -180.704
residual sum-of-squares: 1522
Number of iterations to convergence: 14
Achieved convergence tolerance: 5.805e-06
list2env(as.list(coef(m)), .GlobalEnv)
#calculate y based on fitting parameters
# choose the 7th x value (i.e. x[7]) that corresponds to pH = 8
# (using predict)
> x_pH8 <- x[7]
> predict(m)[7]
[1] 52.14299
# (using the explicit fitting equation with the fitted parameters)
> x1 <- x_pH8
> (m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1))
[1] 129.5284
As you can see:
predict(m)[7] gives y = 52.14299 (for x = 8)
while
(m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1)) gives y = 129.5284 (for x = 8)
The value of i you use in the manual calculation is probably not the same as the one you use in the model fitting. I don't get any discrepancy:
x <- df$pH
y <- df$activity
i <- 8.3
m <- nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
x <- 8
with(as.list(coef(m)),
(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)))
# [1] 75.46504
predict(m)[7]
# [1] 75.46504
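As a hedged follow-up, assuming the model formula still refers to the variable x and that i is set as above, you can also evaluate the fitted curve at pH 8 through newdata, so the global x vector never has to be overwritten:
predict(m, newdata = list(x = 8))
# should again give 75.46504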
The function poly() in R is used to produce orthogonal vectors and can be helpful for interpreting coefficient significance. However, I don't see the point of using it for prediction. In my view, the two following models (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
But they don't. Why?
Because they are not the same model. Your second one has only one covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should use the I() function to wrap the q^2 term inside your formula, in order for the regression to treat it as a covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
This will give the same predictions:
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
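A third, equivalent option (a sketch; model_3 is just an illustrative name) is to keep poly() but request raw polynomial terms, as discussed earlier:
model_3 <- lm(v ~ poly(q, 2, raw = TRUE))
predict(model_3)  # same fitted values as model_1 and model_2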
I am trying to use MuMIn's model.avg function with model formulae that are pasted together and accessed by index, rather than typed in directly, for example:
m1<-gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
'combns' is a 2D array created by combn() containing combinations of predictor variables. This is producing model-averaged coefficients and AICc values identical to those produced if the gls functions contain the formulae directly, e.g.:
m1<-gls(median_Ta ~ day_of_season + hour_of_day + pct_grey_cover +
foliage_height_diversity + tree_shannon_diversity + median_patch_size, data=dat)
However, the relative variable importance is not being computed correctly, and I believe this is to do with the use of a for loop, or with using a variable to index the list in which the models are stored, somehow causing the component model terms not to be 'read' properly (see the term codes for the models):
Component models:
df logLik AICc delta weight
1234567b 7 -233.08 481.43 0.00 0.59
1234567f 3 -237.97 482.21 0.78 0.40
1234567e 4 -241.32 491.08 9.65 0.00
1234567a 9 -241.15 502.39 20.96 0.00
1234567c 6 -248.37 509.68 28.25 0.00
1234567d 5 -250.22 511.11 29.68 0.00
Term codes:
day_of_season foliage_height_diversity hour_of_day
1 2 3
median_patch_size pct_grey_cover tree_shannon_diversity
4 5 6
urban_boundary_distance
7
This results in the relative variable importance being given as:
Relative variable importance:
day_of_season foliage_height_diversity hour_of_day
Importance: 1 1 1
N containing models: 6 6 6
median_patch_size pct_grey_cover tree_shannon_diversity
Importance: 1 1 1
N containing models: 6 6 6
urban_boundary_distance
Importance: 1
N containing models: 6
Whereas if I use model.avg over the same models with the formulae typed individually, I get the following, correct output:
Component models:
df logLik AICc delta weight
23456 7 -233.08 481.43 0.00 0.59
1 3 -237.97 482.21 0.78 0.40
57 4 -241.32 491.08 9.65 0.00
1234567 9 -241.15 502.39 20.96 0.00
1467 6 -248.37 509.68 28.25 0.00
147 5 -250.22 511.11 29.68 0.00
Relative variable importance:
pct_grey_cover median_patch_size tree_shannon_diversity
Importance: 0.6 0.59 0.59
N containing models: 3 4 3
foliage_height_diversity hour_of_day day_of_season
Importance: 0.59 0.59 0.4
N containing models: 2 2 4
urban_boundary_distance
Importance: <0.01
N containing models: 4
How can I make model.avg read the predictor variables in the formulae properly? I've only included six models as an example here but I want to compare the full set of 128 models (and I have other response variables with larger numbers of predictor variables), so typing them out individually isn't feasible.
Thanks in advance.
Edit: reproducible example
It took me a while to narrow down the problem. The first example, m.ave, shows the problem in action with a for loop. The second example, m.ave2, shows it working with the indices typed out rather than accessed through a variable. Obviously this is just a small subset of the predictor variables.
require(nlme)
require(MuMIn)
dat<-data.frame(median_Ta=rnorm(100), day_of_season=runif(100), hour_of_day=runif(100), pct_grey_cover=rnorm(100),
foliage_height_diversity=rnorm(100), urban_boundary_distance=runif(100), tree_shannon_diversity=rnorm(100),
median_patch_size=rnorm(100))
f1<-"median_Ta ~ day_of_season + hour_of_day + pct_grey_cover + foliage_height_diversity +
urban_boundary_distance + tree_shannon_diversity + median_patch_size"
f1<-gsub("\\s", "", f1) # remove whitespace
f1split <- strsplit(f1, split="~") # split predictors and response
response <- f1split[[1]][1]
predictors <- strsplit(f1split[[1]][2], split="+", fixed=TRUE)[[1]]
modelslist<-list()
combns <- combn(predictors, 6)
for (j in 1:7) {
modelslist[[j]]<-gls(as.formula(paste(response,"~",paste(combns[,j], collapse="+"))), data=dat)
}
m.ave<-model.avg(modelslist[[1]], modelslist[[2]], modelslist[[3]], modelslist[[4]],
modelslist[[5]], modelslist[[6]], modelslist[[7]])
summary(m.ave)
#compare....
modelslist2<-list()
modelslist2[[1]]<-gls(as.formula(paste(response,"~",paste(combns[,1], collapse="+"))), data=dat)
modelslist2[[2]]<-gls(as.formula(paste(response,"~",paste(combns[,2], collapse="+"))), data=dat)
modelslist2[[3]]<-gls(as.formula(paste(response,"~",paste(combns[,3], collapse="+"))), data=dat)
modelslist2[[4]]<-gls(as.formula(paste(response,"~",paste(combns[,4], collapse="+"))), data=dat)
modelslist2[[5]]<-gls(as.formula(paste(response,"~",paste(combns[,5], collapse="+"))), data=dat)
modelslist2[[6]]<-gls(as.formula(paste(response,"~",paste(combns[,6], collapse="+"))), data=dat)
modelslist2[[7]]<-gls(as.formula(paste(response,"~",paste(combns[,7], collapse="+"))), data=dat)
m.ave2<-model.avg(modelslist2[[1]], modelslist2[[2]], modelslist2[[3]], modelslist2[[4]],
modelslist2[[5]], modelslist2[[6]], modelslist2[[7]])
summary(m.ave2)
This is a bug in the formula method for gls (in package nlme). Since the actual formula is not stored anywhere in the object, it evaluates the "model" argument in the function call. In the case of the elements of modelslist, these calls are all the same; for example:
modelslist[[1]]$call$model
modelslist[[7]]$call$model
both return
> formula(paste(response, "~", paste(combns[, j], collapse = "+")))
which, when evaluated, uses the current (last) value of j, so that formula(modelslist[[N]]) returns the last model formula for every N.
all.equal(formula(modelslist[[1]]), formula(modelslist[[7]]))
returns
> TRUE
Needless to say, all this confuses model.avg, which uses the formulas to build the model selection table (this is a fallback because gls objects lack a terms component as well).
Edit: possible workarounds
A much easier way to get what you want:
model.avg(dredge(..., m.lim = c(6,6)))
or, if you want to make predictions:
modellist <- lapply(dredge(..., m.lim = c(6,6), evaluate = FALSE), eval)
But, if you want to use an arbitrary set of models, replace the $call$model element in each gls model object with a proper formula, e.g.
combns <- combn(1:7, 6)
modellist <- vector("list", 7)
for (j in 1:7) {
f <- reformulate(predictors[combns[, j]], response = response)
fm <- gls(f, data = dat)
fm$call$model <- f # assign the actual formula
modellist[[j]] <- fm
}
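A quick check (a sketch, assuming model.avg accepts a plain list of fitted models): the stored calls now carry distinct formulas, so the term codes should come out correctly:
all.equal(formula(modellist[[1]]), formula(modellist[[7]]))  # no longer TRUE
summary(model.avg(modellist))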
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variables as character and enforcing the order of the rows both before fitting the model and between fitting and predicting, but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But, playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see if you can create one?
Or can you test the issue by first coercing the factors to characters and checking whether the same re-ordering occurs?
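A minimal sketch of that factor-to-character test, assuming the poster's original lmeFit and predData objects (predData_chr is just an illustrative name):
predData_chr <- predData
predData_chr$id <- as.character(predData_chr$id)
predData_chr$z <- as.character(predData_chr$z)
set.seed(1)
head(predictInterval(lmeFit, newdata = predData_chr, n.sims = 1000))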