I am trying to run the nls function in R using indexed vectors as inputs, however I am getting an error:
> a=c(1,2,3,4,5,6,7,8,9,10)
> b=c(6,7,9,11,14,18,23,30,38,50) #make some example data
>
> nls(b[1:6]~s+k*2^(a[1:6]/d),start=list(s=2,k=3,d=2.5)) #try running nls on first 6 elements of a and b
Error in parse(text = x, keep.source = FALSE) :
<text>:2:0: unexpected end of input
1: ~
^
I can run it on the full vectors:
> nls(b~s+k*2^(a/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b ~ s + k * 2^(a/d)
data: parent.frame()
s k d
1.710 3.171 2.548
residual sum-of-squares: 0.3766
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.2e-07
I am fairly certain that the indexed vectors have the same variable type as the full vectors:
> a
[1] 1 2 3 4 5 6 7 8 9 10
> typeof(a)
[1] "double"
> class(a)
[1] "numeric"
> a[1:6]
[1] 1 2 3 4 5 6
> typeof(a[1:6])
[1] "double"
> class(a[1:6])
[1] "numeric"
I can run nls if I save the indexed vectors in new variables:
> a_part=a[1:6]
> b_part=b[1:6]
> nls(b_part~s+k*2^(a_part/d),start=list(s=2,k=3,d=2.5))
Nonlinear regression model
model: b_part ~ s + k * 2^(a_part/d)
data: parent.frame()
s k d
2.297 2.720 2.373
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.274e-07
Furthermore, lm accepts both full and indexed vectors:
> lm(b~a)
Call:
lm(formula = b ~ a)
Coefficients:
(Intercept) a
-4.667 4.594
> lm(b[1:6]~a[1:6])
Call:
lm(formula = b[1:6] ~ a[1:6])
Coefficients:
(Intercept) a[1:6]
2.533 2.371
Is there a way to run nls on indexed vectors without saving them in new variables?
Use the subset argument. (It would also be possible to use the weights argument, giving a weight of 1 to each of the first 6 observations and 0 to the rest.)
Also you might want to use the plinear algorithm to avoid having to give the starting values for the two parameters that enter linearly. In that case provide a matrix on the RHS with column names s and k such that its first column multiplies s and the second column multiplies k.
nls(b ~ cbind(s = 1, k = 2^(a/d)), subset = 1:6, start = list(d = 2.5),
algorithm = "plinear")
giving:
Nonlinear regression model
model: b ~ cbind(s = 1, k = 2^(a/d))
data: parent.frame()
d .lin.s .lin.k
2.373 2.297 2.720
residual sum-of-squares: 0.06569
Number of iterations to convergence: 3
Achieved convergence tolerance: 7.186e-08
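For comparison, subset can also be combined with the original three-parameter formula and the default algorithm, and the weights alternative mentioned above can be written out explicitly. The following is a minimal sketch using the example data from the question; the names st, w, fit_subset and fit_weights are made up for illustration.

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
b <- c(6, 7, 9, 11, 14, 18, 23, 30, 38, 50)
st <- list(s = 2, k = 3, d = 2.5)  # starting values from the question

# default algorithm, restricted to the first 6 observations via subset
fit_subset <- nls(b ~ s + k * 2^(a / d), subset = 1:6, start = st)

# same restriction expressed as 0/1 weights (illustrative name: w)
w <- rep(c(1, 0), times = c(6, 4))
fit_weights <- nls(b ~ s + k * 2^(a / d), weights = w, start = st)

# the two rows should agree with each other
rbind(subset = coef(fit_subset), weights = coef(fit_weights))
```

Both rows should agree with the estimates the question reports for a_part and b_part (s = 2.297, k = 2.720, d = 2.373), since all three approaches minimise the same sum of squares over the first 6 observations.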
I'm learning to do nonlinear least-squares fitting with R, and I followed this tutorial:
p = function(x) x^3+2*x^2+5
x = seq(-0.99, 1, by = .01)
y = p(x) + runif(200)
df = data.frame(x = x, y = y)
head(df)
x y
1 -0.99 6.183018
2 -0.98 6.611669
3 -0.97 6.762615
4 -0.96 6.594278
5 -0.95 5.990637
6 -0.94 6.048369
# Then the author conducted a nonlinear regression fit.
fit = nls(y~a*x^2+b*x, data = df, start(a=0, b=0))
But when I try to run the code, it always says
"Error in hasTsp(x) : argument "x" is missing, with no default"
Does anyone know where the problem is?
Thank you so much!
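The starting values must be passed as a named list through the start argument. In the original call, start(a=0, b=0) is evaluated as a call to the function stats::start(), which is where the hasTsp(x) error comes from.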
nls(y~a*x^2+b*x, data = df, start = list(a=0, b=0))
Output:
Nonlinear regression model
model: y ~ a * x^2 + b * x
data: df
a b
11.1241 0.5711
residual sum-of-squares: 2713
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.246e-10
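The difference between the two calls can be reproduced side by side. This is only a sketch: the seed is arbitrary (the question did not set one, so the numbers will differ from the output above), and msg and fit are illustrative names.

```r
set.seed(1)  # arbitrary seed, added only for reproducibility
p <- function(x) x^3 + 2 * x^2 + 5
x <- seq(-0.99, 1, by = 0.01)
df <- data.frame(x = x, y = p(x) + runif(length(x)))

# start(a = 0, b = 0) is a call to stats::start(), not a list of starting values,
# so the error is raised as soon as nls() evaluates its third argument
msg <- tryCatch(
  nls(y ~ a * x^2 + b * x, data = df, start(a = 0, b = 0)),
  error = function(e) conditionMessage(e)
)
msg

# corrected call: starting values as a named list
fit <- nls(y ~ a * x^2 + b * x, data = df, start = list(a = 0, b = 0))
coef(fit)
```

Note that the fitted formula has no constant or cubic term, while the data are generated from x^3 + 2*x^2 + 5 plus noise, which would explain the large residual sum-of-squares in the output above.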
I have a list of formulas I want to fit to data. Rather than running a loop, I'd like to fit them all at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
y~x[,2],
y~x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.
As I said in my comment, what you really need is a fitting routine that is more efficient than lm() yet still stable. Here I provide a well-tested one that I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage 0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
  }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard error
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistic and p-value
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R.squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol() (based on Cholesky factorization with complete pivoting) and lm() (based on QR factorization with partial pivoting) have set different coefficients to NA. But the two estimations are equivalent, with the same fitted values and residuals.
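The claim that the choice of NA coefficient does not change the fit can be checked with lm() alone, by reordering the terms so that a different column is found to be redundant. This is a small sketch on the same toy data; fit_a and fit_b are illustrative names.

```r
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100, 3, 5))
dat$x3 <- with(dat, (x1 + x2) / 2)  # exact linear combination of x1 and x2

fit_a <- lm(y ~ x1 + x2 + x3, dat)  # x3 is redundant given x1 and x2
fit_b <- lm(y ~ x3 + x2 + x1, dat)  # x1 is redundant given x3 and x2

coef(fit_a)  # NA for x3
coef(fit_b)  # NA for x1

# different parameterisations, same fitted values and residuals
all.equal(fitted(fit_a), fitted(fit_b))
all.equal(residuals(fit_a), residuals(fit_b))
```

With the terms reordered, lm() drops x1, the same coefficient that lm.chol() set to NA above, while the fitted values stay the same.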
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3 to 4 times faster than lm(). If you want to know the reason, read this answer of mine.
Remark
I have focused on improving the performance of the computational kernel. You can go one step further by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold speedup, and parallel computing gives another 3-fold speedup on 4 cores, you end up with a 9-fold speedup!
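To connect this back to the list of formulas in the question, here is a stripped-down sketch of the same idea: solve the normal equations with a Cholesky factor and keep only coefficients and standard errors. It is not lm.chol() itself: it has no pivoting and no preconditioner, so it assumes every model matrix has full column rank. The helper name coef_se and the data frame dat are made up for illustration.

```r
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
dat <- data.frame(y = drop(x %*% c(1, 2, 3)) + rnorm(100),
                  x1 = x[, 1], x2 = x[, 2], x3 = x[, 3])
formulae <- list(y ~ x1, y ~ x2, y ~ x1 + x2)

# hypothetical helper: estimates and standard errors only, full rank assumed
coef_se <- function(formula, data) {
  X <- model.matrix(formula, data)
  y <- data[[all.vars(formula)[1]]]        # response is the first variable
  R <- chol(crossprod(X))                  # X'X = R'R, R upper triangular
  beta <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))
  rss <- sum((y - X %*% beta)^2)
  sigma2 <- rss / (nrow(X) - ncol(X))      # residual variance estimate
  se <- sqrt(diag(chol2inv(R)) * sigma2)   # chol2inv(R) is (X'X)^{-1}
  out <- cbind(Estimate = drop(beta), "Std. Error" = se)
  rownames(out) <- colnames(X)
  out
}

results <- lapply(formulae, coef_se, data = dat)
results[[3]]
```

Each element of results should match the first two columns of summary(lm(...))$coef for the corresponding formula. The loop over formulae is still there; the saving, if any, comes from skipping the bookkeeping that lm() and summary() do for each fit.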
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
Consider the following table:
DB <- data.frame(
Y =rnorm(6),
X1=c(T, T, F, T, F, F),
X2=c(T, F, T, F, T, T)
)
Y X1 X2
1 1.8376852 TRUE TRUE
2 -2.1173739 TRUE FALSE
3 1.3054450 FALSE TRUE
4 -0.3476706 TRUE FALSE
5 1.3219099 FALSE TRUE
6 0.6781750 FALSE TRUE
I'd like to explain my quantitative variable Y by two binary variables (TRUE or FALSE) without intercept.
The rationale for this choice is that, in my study, we can't observe X1=FALSE and X2=FALSE at the same time, so it doesn't make sense to have a mean other than 0 for this level.
With intercept
m1 <- lm(Y~X1+X2, data=DB)
summary(m1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9684 1.0590 -1.859 0.1600
X1TRUE 0.7358 0.9032 0.815 0.4749
X2TRUE 3.0702 0.9579 3.205 0.0491 *
Without intercept
m0 <- lm(Y~0+X1+X2, data=DB)
summary(m0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X1FALSE -1.9684 1.0590 -1.859 0.1600
X1TRUE -1.2325 0.5531 -2.229 0.1122
X2TRUE 3.0702 0.9579 3.205 0.0491 *
I can't explain why two coefficients are estimated for the variable X1. It seems to be equivalent to the intercept coefficient in the model with intercept.
Same results
When we display the estimation for all the combinations of variables, the two models are the same.
DisplayLevel <- function(m){
  R <- outer(
    unique(DB$X1),
    unique(DB$X2),
    function(a, b) predict(m, data.frame(X1=a, X2=b))
  )
  colnames(R) <- paste0('X2:', unique(DB$X2))
  rownames(R) <- paste0('X1:', unique(DB$X1))
  return(R)
}
DisplayLevel(m1)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
DisplayLevel(m0)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
So the two models are equivalent.
Question
My question is: can we estimate just one coefficient for the first effect? Can we force R to assign a value of 0 to the combination X1=FALSE and X2=FALSE?
Yes, we can, by converting the logical columns to numeric:
DB <- as.data.frame(data.matrix(DB))
## or you can do:
## DB$X1 <- as.integer(DB$X1)
## DB$X2 <- as.integer(DB$X2)
# Y X1 X2
# 1 -0.5059575 1 1
# 2 1.3430388 1 0
# 3 -0.2145794 0 1
# 4 -0.1795565 1 0
# 5 -0.1001907 0 1
# 6 0.7126663 0 1
## a linear model without intercept
m0 <- lm(Y ~ 0 + X1 + X2, data = DB)
DisplayLevel(m0)
# X2:1 X2:0
# X1:1 0.15967744 0.2489237
# X1:0 -0.08924625 0.0000000
I have explicitly coerced your TRUE/FALSE binary variables into numeric 1/0, so that lm() applies no contrast coding to them.
The data that appear in my answer are different from yours, because you did not call set.seed(?) before rnorm() for reproducibility. But this is not an issue here.
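The effect of the coercion is easiest to see in the design matrix that lm() builds. The sketch below uses the Y values printed in the question; DBnum is an illustrative name.

```r
DB <- data.frame(
  Y  = c(1.8376852, -2.1173739, 1.3054450, -0.3476706, 1.3219099, 0.6781750),
  X1 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  X2 = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# logical predictors are expanded like factors: without an intercept,
# the first one gets a dummy column for each of its two levels
colnames(model.matrix(Y ~ 0 + X1 + X2, DB))

# numeric 0/1 predictors enter as single columns
DBnum <- transform(DB, X1 = as.integer(X1), X2 = as.integer(X2))
colnames(model.matrix(Y ~ 0 + X1 + X2, DBnum))

# with numeric predictors and no intercept, the (0, 0) cell is predicted as 0
m0 <- lm(Y ~ 0 + X1 + X2, data = DBnum)
predict(m0, data.frame(X1 = 0, X2 = 0))
```

The first call returns the three column names seen in the question's summary (X1FALSE, X1TRUE, X2TRUE), while the second returns only X1 and X2, which is why the numeric version estimates a single coefficient per variable.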
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variable as character and enforcing the order of the rows, both prior to fitting the model and between fitting the model and predicting, but without success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see whether you can create one?
Or can you first coerce the factors to characters and check whether the same re-ordering issue appears?
I am learning R and currently using it for non linear regression (which I am also learning).
I have two sets of data (duration of an operation on different machines) and I am able to find a good non linear regression for each of these sets.
Now, I would like to find the single regression that minimises the sum of the two residual sums of squares.
Here is what I have :
A <- c(1:5)
B <- c(100, 51, 32, 24, 19)
C <- c(150, 80, 58, 39, 29)
df <- data.frame (A,B,C)
f <- B ~ k1/A + k2
g <- C ~ k1/A + k2
n <- nls(f, data = df, start = list(k1=10, k2=10))
p <- nls(g, data = df, start = list(k1=10, k2=10))
n
#Nonlinear regression model
# model: B ~ k1/A + k2
# data: df
# k1 k2
#101.595 -1.195
# residual sum-of-squares: 2.619
#Number of iterations to convergence: 1
#Achieved convergence tolerance: 2.568e-07
p
#Nonlinear regression model
# model: C ~ k1/A + k2
# data: df
# k1 k2
#148.044 3.593
# residual sum-of-squares: 54.19
#Number of iterations to convergence: 1
#Achieved convergence tolerance: 1.803e-07
The constants k1 and k2 are (of course) different for the two sets (B and C). I am wondering how I could find a single k1 and a single k2 that produce the 'best' solution for both data sets.
Hope my explanation will be understandable. Otherwise, what I'm trying to find is sometimes (at least here) called global non linear regression.
EDIT: I would also like to know how I can tell R to avoid negative values for a specific parameter. In this case, I would like k2 to be positive.
If you want identical parameters, you should just pool your data:
df2 <- data.frame(Y=c(df$B,df$C), X=rep(df$A, 2))
p <- nls(Y ~ k1/X + k2,
data = df2,
start = list(k1=10, k2=10),
lower = c(0, 0),
algorithm = "port")
summary(p)
# Formula: Y ~ k1/X + k2
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# k1 124.819 18.078 6.904 0.000124 ***
# k2 1.199 9.781 0.123 0.905439
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.59 on 8 degrees of freedom
#
# Algorithm "port", convergence message: both X-convergence and relative convergence (5)
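An alternative to the box constraints of the "port" algorithm is to reparameterize so that the constrained parameter cannot become negative. The sketch below writes k2 as exp(lk2), where lk2 is a made-up name for the log of k2, and uses the default Gauss-Newton algorithm. It is an illustration, not a replacement for the fit above.

```r
df <- data.frame(A = 1:5, B = c(100, 51, 32, 24, 19), C = c(150, 80, 58, 39, 29))
df2 <- data.frame(Y = c(df$B, df$C), X = rep(df$A, 2))

# k2 = exp(lk2) is strictly positive for any real lk2 (lk2 is a hypothetical name)
fit <- nls(Y ~ k1 / X + exp(lk2), data = df2,
           start = list(k1 = 10, lk2 = log(10)))

# back-transform to the original scale
c(k1 = unname(coef(fit)["k1"]), k2 = unname(exp(coef(fit)["lk2"])))
```

Because the bounded fit above ends with k2 strictly inside the allowed region, this version should reproduce the same estimates (k1 near 124.8, k2 near 1.2). If the unconstrained optimum had a non-positive k2, lk2 would drift toward minus infinity and nls() would probably fail to converge, in which case the bounded "port" fit is the more practical choice.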
Edit:
If you want one parameter to be equal and one to vary, you could use a mixed effects model. However, I don't know how to specify constraints for that (I believe that is not a simple task, but could possibly be achieved by reparameterization).
library(nlme)
library(reshape2)
df3 <- melt(df, id.vars="A")
r <- nlme(value ~ k1/A + k2,
data = df3,
start = c(k1=10, k2=10),
fixed = k1 + k2 ~1,
random = k2 ~ 1|variable)
summary(r)
# Nonlinear mixed-effects model fit by maximum likelihood
# Model: value ~ k1/A + k2
# Data: df3
# AIC BIC logLik
# 83.11052 84.32086 -37.55526
#
# Random effects:
# Formula: k2 ~ 1 | variable
# k2 Residual
# StdDev: 12.49915 7.991013
#
# Fixed effects: k1 + k2 ~ 1
# Value Std.Error DF t-value p-value
# k1 124.81916 9.737738 7 12.818086 0.0000
# k2 1.19925 11.198211 7 0.107093 0.9177
# Correlation:
# k1
# k2 -0.397
#
# Standardized Within-Group Residuals:
# Min Q1 Med Q3 Max
# -1.7520706 -0.5273469 0.2746039 0.5235343 1.4971808
#
# Number of Observations: 10
# Number of Groups: 2
coef(r)
# k1 k2
# B 124.8192 -10.81835
# C 124.8192 13.21684
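If the two data sets are better treated as fixed groups than as draws from a population, a fixed-effects variant of the same idea is to share k1 and give each data set its own k2 through an indexed parameter, in the style of the muscle example in ?nls. This is a sketch under that assumption, not the mixed model fitted above; long, grp and fit_idx are illustrative names.

```r
df <- data.frame(A = 1:5, B = c(100, 51, 32, 24, 19), C = c(150, 80, 58, 39, 29))

# long format without extra packages: one row per (A, data set) pair
long <- data.frame(A = rep(df$A, 2),
                   value = c(df$B, df$C),
                   grp = factor(rep(c("B", "C"), each = 5)))

# k1 is shared; k2[grp] selects a separate k2 for each level of grp
fit_idx <- nls(value ~ k1 / A + k2[grp], data = long,
               start = list(k1 = 10, k2 = c(10, 10)))
coef(fit_idx)
```

Since this model is linear in its parameters, it is equivalent to lm(value ~ 0 + grp + I(1/A), data = long). The per-group k2 values are not shrunk toward a common mean as they are in the nlme fit, so they will differ somewhat from coef(r) above.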