I am attempting to fit a nonlinear least squares model to a dataset that has both measurement error and heteroscedasticity. I was able to fit the model with nls() in R and account for the heteroscedasticity by weighting the observations with a power function of the fitted values (the power-of-the-mean approach). However, when I additionally try to correct for measurement error using the simex package in R, I get the following error:
Error: measurement.error is constant 0 in column(s) 1
This is strange, because I have specified a nonzero measurement error. Example code that reproduces the error is pasted below.
library(simex)
set.seed(123456789)

# True predictor and its error-prone measurement
x <- runif(n = 1000, min = 1, max = 3.6)
x_err <- x + rnorm(n = 1000, mean = 0, sd = 0.1)

# Heteroscedastic response (noise sd shrinks as x grows), truncated at zero
y_mean <- 100 / (1 + 10^(log10(100) - x) * 0.75)
y_het <- y_mean + rnorm(n = 1000, mean = 0, sd = 10 * x^-2)
y_het <- ifelse(y_het > 0, y_het, 0)

# Power-of-the-mean weights built from the error-prone predictor
w <- (100 / (1 + 10^(log10(100) - x_err) * 0.75))^-2

nls_fit <- nls(y_het ~ 100 / (1 + 10^((log10(k) - x_err) * h)),
               start = list(k = 100, h = 0.75),
               weights = 1 / w)

simex(nls_fit, SIMEXvariable = "x_err", measurement.error = 0.1, asymptotic = FALSE)
I tried to build the loop below so that I can test two outcomes at the same time. However, it produced this error message: "Error in model.frame.default(formula = ~outcome + centered.predictor1 + : variable lengths differ (found for 'centered.predictor1')"
But when I tested each outcome separately, the code (without the loop) didn't produce any errors.
Thanks in advance for your help!
library(nlme)  # for lme()

# Simulated example data
n1 <- rnorm(n = 2000, mean = 0, sd = 1)
n2 <- rnorm(n = 2000, mean = 0, sd = 1)
Z_familism <- rnorm(n = 2000, mean = 0, sd = 1)
Z_avoidance <- rnorm(n = 2000, mean = 0, sd = 1)
Country <- rnorm(n = 2000, mean = 0, sd = 1)
Z_anxiety <- rnorm(n = 2000, mean = 0, sd = 1)
data01 <- data.frame(n1, n2, Z_familism, Z_avoidance, Country, Z_anxiety)

outcome <- c('n1', 'n2')

for (n in outcome){
  rsa.data <- data.frame(predictor1 = data01$Z_familism,
                         predictor2 = data01$Z_avoidance,
                         nest = as.factor(data01$Country),
                         control = data01$Z_anxiety,
                         multilevel = data01$Country,
                         outcome = data01[n])

  rsa.data <- within.data.frame(rsa.data, {
    centered.predictor1 <- predictor1 - 0  # Center predictor 1
    centered.predictor2 <- predictor2 - 0  # Center predictor 2
    squared.predictor1 <- centered.predictor1 * centered.predictor1  # Create squared term
    squared.predictor2 <- centered.predictor2 * centered.predictor2  # Create squared term
    interaction <- centered.predictor1 * centered.predictor2         # Create interaction term
  })

  mlm.model <- lme(outcome ~ centered.predictor1 + centered.predictor2 + squared.predictor1 +
                     interaction + squared.predictor2 + control,
                   data = rsa.data,
                   random = ~ 1 | multilevel,  # Replace "multilevel" with the name of your nesting variable
                   na.action = "na.omit")

  summary(mlm.model)                     # View model
  intervals(mlm.model, which = "fixed")
  vcov(mlm.model)                        # View covariance of model
}
The problem is in how you create the rsa.data data frame inside the loop, specifically the outcome column. data01[n] returns a one-column data frame whose column keeps its original name (n1 or n2), so rsa.data never actually gets a column called outcome; the formula then picks up the outcome character vector of length 2 from your workspace, which is why the variable lengths differ. Use data01[, n] instead, which returns a numeric vector, so the column really is named outcome and has the same length as the rest of your data:
rsa.data <- data.frame(predictor1 = data01$Z_familism,
                       predictor2 = data01$Z_avoidance,
                       nest = as.factor(data01$Country),
                       control = data01$Z_anxiety,
                       multilevel = data01$Country,
                       outcome = data01[, n])
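You can see the difference directly (a quick check using the data01 from the question):
n <- "n1"
class(data01[n])      # "data.frame" -- a one-column data frame
class(data01[, n])    # "numeric"    -- a plain vector of length 2000
names(data.frame(outcome = data01[n]))    # column keeps the name "n1", not "outcome"
names(data.frame(outcome = data01[, n]))  # "outcome"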
We simulated a data set and created a model.
set.seed(459)

# seed mass
n <- 1000
seed.mass <- round(rnorm(n, mean = 250, sd = 75), digits = 1)

## Setting up the deterministic function
detFunc <- function(a, b, x){
  return(exp(a + b * x)) / (1 + exp(a + b * x))
}

# logit link function for the binomial
inv.link <- function(z){
  p <- 1 / (1 + exp(-z))
  return(p)
}

# setting a and b values
a <- -2.109
b <- 0.02

# Simulating data
germination <- (rbinom(n = n, size = 10,
                       p = inv.link(detFunc(x = seed.mass, a = a, b = b))
                       )) / 10

## make data frame
mydata <- data.frame("predictor" = seed.mass, "response" = germination)

# plotting the data
tmp.x <- seq(0, 1e3, length.out = 500)
plot(germination ~ seed.mass,
     xlab = "seed mass (mg)",
     ylab = "germination proportion")
lines(tmp.x, inv.link(detFunc(x = tmp.x, a = a, b = b)), col = "red", lwd = 2)
When we fit the model and try to estimate the parameters, we get an error:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) : initial value in 'vmmin' is not finite
library(bbmle)
mod1 <- mle2(response ~ dbinom(size = 10,
                               p = inv.link(detFunc(x = predictor, a = a, b = b))
                               ),
             data = mydata,
             start = list("a" = -2.109, "b" = 0.02))
We're stumped and can't figure out why we're getting this error.
Your problem is that you're trying to model a proportion as a binomial outcome, and a binomial outcome must be an integer count.
You can use round(response*10) as your response (to put the proportion back on the count scale; round() is needed because (a/b)*b is not always exactly equal to a in floating-point math ...). Specifically, with your setup
mod1 <- mle2(round(response*10) ~ dbinom(size = 10,
                                         p = inv.link(detFunc(x = predictor, a = a, b = b))
                                         ),
             data = mydata,
             start = list(a = -2.109, b = 0.02))
works fine. coef(mod1) is {-1.85, 0.018}, plausibly close to the true values you started with (we don't expect to recover the true values exactly, except as the average of many simulations [and even then MLE is only asymptotically unbiased, i.e. for large data sets ...]).
The proximal problem is that dbinom() evaluated at a non-integer value warns and returns a zero probability, so the initial negative log-likelihood is not finite. The full output from your model fit would have been:
Error in optim(par = c(a = -2.109, b = 0.02), fn = function (p) :
initial value in 'vmmin' is not finite
In addition: There were 50 or more warnings (use warnings() to see the first 50)
It's always a good idea to check those additional warnings ... in this case they are all of the form
1: In dbinom(x = c(1, 1, 1, 0.8, 1, 1, 1, 1, 1, 1, 1, 0.8, ... :
non-integer x = 0.800000
which might have given you a clue ...
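You can also see the proximal issue in isolation by calling dbinom() directly on one of the non-integer values (the prob value here is arbitrary):
dbinom(x = 0.8, size = 10, prob = 0.5)
[1] 0
Warning message:
In dbinom(x = 0.8, size = 10, prob = 0.5) : non-integer x = 0.800000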
PS you can use qlogis() and plogis() from base R for your link and inverse-link functions ...
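For example, equivalent to the helper defined above:
inv.link <- function(z) plogis(z)   # plogis(z) is 1/(1 + exp(-z)), the inverse logit
(and qlogis(p), i.e. log(p/(1-p)), is the corresponding link function if you ever need it).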
I need to write a function that performs a simulation to evaluate the coverage of a bootstrap confidence interval for the variance of n samples from a normal distribution. Below is what I've attempted, but it keeps returning 0 or 0.002 for the proportion of simulations in which the true variance lies within the CI...
Var_CI_Coverage <- function(true_mean, true_var, nsim, nboot, alpha, nsamples){
  cover <- NULL
  for(k in 1:nsim){
    Var <- as.numeric()
    y <- rnorm(1, mean = true_mean, sd = sqrt(true_var))
    for(i in 1:nboot){
      resample_y <- sample(y, size = nsamples, replace = TRUE)
      Var[i] <- var(resample_y)
    }
    LB <- quantile(Var, probs = c(alpha/2))
    UB <- quantile(Var, probs = c(1 - (alpha/2)))
    cover[k] <- ifelse(LB <= true_var & UB >= true_var, 1, 0)
  }
  return(mean(cover))
}
Var_CI_Coverage(true_mean= 0, true_var = 4, nsim = 500, nboot = 1000, alpha = 0.05, nsamples = 10)
The main problem is that you generate y using
y <- rnorm(1, mean = true_mean, sd = sqrt(true_var))
which means y is a single value, and all your bootstrap samples are just that single y value repeated nsamples times. You need
y <- rnorm(nsamples, mean = true_mean, sd = sqrt(true_var))
Then you get samples with actual variance, and you get a coverage estimate that looks more in the right ballpark (no comment on whether it's correct, I haven't tried to check).
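For reference, this is the function with just that one line changed (everything else exactly as in the question):
Var_CI_Coverage <- function(true_mean, true_var, nsim, nboot, alpha, nsamples){
  cover <- NULL
  for(k in 1:nsim){
    Var <- as.numeric()
    y <- rnorm(nsamples, mean = true_mean, sd = sqrt(true_var))  # draw nsamples values, not 1
    for(i in 1:nboot){
      resample_y <- sample(y, size = nsamples, replace = TRUE)
      Var[i] <- var(resample_y)
    }
    LB <- quantile(Var, probs = c(alpha/2))
    UB <- quantile(Var, probs = c(1 - (alpha/2)))
    cover[k] <- ifelse(LB <= true_var & UB >= true_var, 1, 0)
  }
  return(mean(cover))
}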
I am working with survival analysis and the smoothHR package. After fitting a model, I'd like to plot the relative risk against a predictor, which is quite easy with
plot(dataset, predictor)
But I'd like to do it using the ggplot2 package. Any idea how?
# the libraries
library(smoothHR)
library(survival)  # for Surv() and coxph()
library(splines)   # for ns()

# the artificial dataset
surv.days <- runif(n = 200, min = 100, max = 500)
censor <- sample(c(0, 1), 200, replace = TRUE)
surv.var <- surv.days/10 + rnorm(200, mean = 0, sd = 3)
surv.var[which(surv.days > 250)] <- surv.days[which(surv.days > 250)]/5 +
  rnorm(length(which(surv.days > 250)), mean = 0, sd = 10)
survdata <- data.frame(surv.days, censor, surv.var)
rm(censor, surv.days, surv.var)

# using the smoothHR package to fit a model
variabledf <- dfmacox(time = "surv.days", status = "censor",
                      nl.predictor = c("surv.var"),
                      smoother = "ns",
                      method = "AIC",
                      data = survdata)
coxmodel <- coxph(Surv(surv.days, censor) ~ ns(surv.var, variabledf$df[1]),
                  data = survdata, x = TRUE)
c.smoothhr <- smoothHR(data = survdata, coxfit = coxmodel)
After that, I can plot the risk as a function of the survival variable
plot(c.smoothhr, predictor = "surv.var", conf.level = 0.95, ref.label = "", main = "", xlab = "surv.var")
I would like to generate this plot with the ggplot2 package, for storage and customization purposes, but I am not sure how to proceed.
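One possible sketch that avoids smoothHR's plotting machinery altogether: extract the fitted spline term and its standard error from the coxph fit via predict(..., type = "terms", se.fit = TRUE) and plot those with ggplot2. Note this centers the curve at the mean of the term rather than at the reference value used by plot.smoothHR, so it only approximates the same figure.
library(ggplot2)

# term-wise prediction from the Cox model: one column per model term
pred <- predict(coxmodel, type = "terms", se.fit = TRUE)

plot.df <- data.frame(surv.var = survdata$surv.var,
                      loghr    = pred$fit[, 1],
                      se       = pred$se.fit[, 1])

ggplot(plot.df, aes(x = surv.var, y = loghr)) +
  geom_ribbon(aes(ymin = loghr - 1.96 * se, ymax = loghr + 1.96 * se), alpha = 0.2) +
  geom_line() +
  labs(x = "surv.var", y = "log hazard ratio")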
I am receiving the following error when running the mle2() function from the bbmle package in R:
some parameters are on the boundary: variance-covariance calculations based on Hessian may be unreliable
I am trying to understand if this is due to a problem with my data or an issue with calling the function properly. Unfortunately, I cannot post my real data, so I am using a similar working example of the same sample size.
The custom dAction function I am using is a softmax function. There have to be upper and lower bounds on the optimization so I am using the L-BFGS-B method.
library(bbmle)
set.seed(3939)

### Reproducible data
dat1 <- rnorm(30, mean = 3, sd = 1)
dat2 <- rnorm(30, mean = 3, sd = 1)
dat1[c(1:3, 5:14, 19)] <- 0
dat2[c(4, 15:18, 20:22, 24:30)] <- 0

### Data variables
x <- sample(1:12, 30, replace = TRUE)
pe <- dat1
ne <- dat2

### Likelihood
dAction <- function(x, a, b, t, pe, ne, log = FALSE) {
  u <- exp(((x - (a * ne) - (b * pe)) / t))
  prob <- u / (1 + u)
  if(log) return(prob) else return(-sum(log(prob)))
}

### Fit
fit <- mle2(dAction,
            start = list(a = 0.1, b = 0.1, t = 0.1),
            data = list(x = x, pe = pe, ne = ne),
            method = "L-BFGS-B",
            lower = c(a = 0.1, b = 0.1, t = 0.1),
            upper = c(a = 10, b = 1, t = 10))
Warning message:
In mle2(dAction, start = list(a = 0.1, b = 0.1, t = 0.1), data = list(x = x, :
some parameters are on the boundary: variance-covariance calculations based on Hessian may be unreliable
Here are the results for summary():
summary(fit)
Maximum likelihood estimation
Call:
mle2(minuslogl = dAction, start = list(a = 0.1, b = 0.1, t = 0.1),
method = "L-BFGS-B", data = list(x = x, pe = pe, ne = ne),
lower = c(a = 0.1, b = 0.1, t = 0.1), upper = c(a = 10, b = 1,
t = 10))
Coefficients:
  Estimate Std. Error z value Pr(z)
a      0.1         NA      NA    NA
b      0.1         NA      NA    NA
t      0.1         NA      NA    NA
-2 log L: 0.002048047
Warning message:
In sqrt(diag(object@vcov)) : NaNs produced
And the results for the confidence intervals
confint(fit)
Profiling...
2.5 % 97.5 %
a NA 1.0465358
b NA 0.5258828
t NA 1.1013322
Warning messages:
1: In sqrt(diag(object@vcov)) : NaNs produced
2: In .local(fitted, ...) :
Non-positive-definite Hessian, attempting initial std err estimate from diagonals
I don't entirely understand the context of your problem, but:
The issue (whether it is a real problem or not depends very much on the aforementioned context that I don't understand) has to do with your constraints. If we do the fit without the constraints:
### Fit
fit <- mle2(dAction,
            start = list(a = 0.1, b = 0.1, t = 0.1),
            data = list(x = x, pe = pe, ne = ne))
            ## method = "L-BFGS-B",
            ## lower = c(a = 0.1, b = 0.1, t = 0.1),
            ## upper = c(a = 10, b = 1, t = 10))
we get coefficients that are below your bounds.
coef(fit)
         a          b          t
0.09629301 0.07724332 0.02405173
If this is correct, at least one of the constraints is going to be active (i.e., when we fit with the lower bounds in place, at least one parameter will end up on its bound - in fact, here it's all of them). When a fit is on the boundary, the simplest machinery for computing confidence intervals (Wald intervals based on the Hessian) doesn't work, which is what the warning is telling you. However, this doesn't affect the profile confidence interval estimates you report above. Those are correct - the lower limits are reported as NA because the lower confidence limit coincides with the boundary (you can replace them with 0.1 if you like).
If you didn't expect the optimal fit to be on the boundary, then I don't know what's going on, maybe a data issue.
Your log-likelihood function is not wrong, but it's a little confusing because you have a log argument that returns the negative log-likelihood when log=FALSE (default) and the likelihood when log=TRUE. Before I realized that, I rewrote the function (I also made it a little more numerically stable by doing computations on the log scale wherever possible).
dAction <- function(x, a, b, t, pe, ne) {
  logu <- (x - (a * ne) - (b * pe)) / t   # log(u)
  lprob <- logu - log1p(exp(logu))        # log(u/(1+u)), kept on the log scale
  return(-sum(lprob))
}
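As a quick sanity check, you can confirm the rewrite returns the same negative log-likelihood as your original version (renamed dAction_orig here so the two can coexist; the parameter values are arbitrary):
dAction_orig <- function(x, a, b, t, pe, ne, log = FALSE) {
  u <- exp(((x - (a * ne) - (b * pe)) / t))
  prob <- u / (1 + u)
  if(log) return(prob) else return(-sum(log(prob)))
}

all.equal(dAction_orig(x, a = 0.5, b = 0.5, t = 1, pe = pe, ne = ne),
          dAction(x, a = 0.5, b = 0.5, t = 1, pe = pe, ne = ne))
# should be TRUE (same value up to floating-point error)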