Non linear regression with nls2 package - r

I work on agglomeration effect and i want to run a non linear regression with package nls2.
I am trying to run this model with R
dens=runif(100)
surf=rnorm(100, 10, 2)
zone=seq(1,100,1)
donnees<-data.frame(dens,surf,zone)
attach(donnees)
donnees$salaire<-rnorm(100, 1000,3)
mp<-rep(0,100)
MP<-rep(0,100)
MPfonc<-function(alpha){
for (i in 1:100){
for (j in 1:100){
if(j!=i){
mp[j]<- dens[j]/(surf[i]-surf[j])^alpha
}
}
MP[i]<-sum(mp)
}
return(MP)
}
fo <- salaire ~ const+ gamma1*dens+gamma2*surf+gamma3*MPfonc(alpha)
gstart <- data.frame(const = c(-100, 100), gamma1 = c(-10, 10),
gamma2 = c(-10, 10),gamma3 = c(-10, 10), alpha=c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
It does not run and I think it is a problem of alpha.
Can nls2 function accept a function (MP(alpha)) as an input?
Here is the specification of my model:

The problems are:
set.seed should be set to make the code reproducible
salaire is not defined -- it is defined in the data frame donnees but donnees is never used after that.
the elements summed in the sum call in MPfonc include NaN or NA elements so the sum becomes similarly undefined
For plinear algorithms the RHS of the formula must evaluate to a matrix of coefficients of the linear parameters.
For the plinear algorithms provide starting values only for the non-linear parameters (i.e. only for alpha).
the nls2 package is never loaded. A library statement is needed.
code posted to SO should be indented 4 spaces to cause it to format properly (this was addressed in an edit to the question)
the mathematical formulas in the question are not clear in how they relate to the problem and are missing significant elements, e.g. alpha. This needs to be cleaned up. We have assumed that MPfonc gives the desired result and just simplified it.
The following corrects all the points and adds some minor improvements.
library(nls2)
set.seed(123) # for reproducibility
dens <- runif(100)
surf <- rnorm(100, 10, 2)
zone <- seq(1, 100, 1)
salaire <- rnorm(100, 1000, 3)
MPfonc <- function(alpha) {
sapply(1:100, function(i) sum( (dens / (surf[i] - surf) ^ alpha)[-i], na.rm = TRUE ))
}
fo <- salaire ~ cbind(1, dens, surf, MPfonc(alpha))
gstart <- data.frame(alpha = c(-10, 10))
fm <- nls2(fo, start = gstart, alg = "plinear-random")
giving:
> fm
Nonlinear regression model
model: salaire ~ cbind(1, dens, surf, MPfonc(alpha))
data: parent.frame()
alpha .lin1 .lin.dens .lin.surf .lin4
0.90477 1001.20905 -0.50642 -0.12269 0.00681
residual sum-of-squares: 757.6
Number of iterations to convergence: 50
Achieved convergence tolerance: NA
Note: Now that we have the starting value we can use it with nls like this:
nls(fo, start = coef(fm)["alpha"], alg = "plinear")
Update: Some code improvements, corrections and clarifications.

Related

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a model using a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this using the function gam() however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option, and then try to make predictions with my model on new data, while ignoring the random effect, I consistent get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.gauranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata and then mgcv looking for random.x and finding it in the global environment. You should really gather that variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.

How to predict gam model with random effect in R?

I am working on predicting gam model with random effect to produce 3D surface plot by plot_ly.
Here is my code;
x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1,times=100) # random effect
r[51:100] <- 2 # replace 1 into 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit
#create matrix data for `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix
However, the last line results in error;
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to the previous answer, I am sure that above codes work without random effect, as in here.
How can I predict gam models with random effect?
Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
newdata.guaranteed=TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
A couple of notes:
Which you use will depend on how complex the rest of you model is. I would generally use the first option as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept as r is not a factor. If your real example uses a factor random effect then you'll need to be a little more careful when creating the newdata as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r

(mis) understanding priors in MCMCglmm

I'm new to this package; I usually use Stan for Bayesian models, but I'm using a lot of data and hoping I can get the models to run faster in MCMCglmm. I have read the course notes as well as the readme on github. They are really helpful and great descriptions, but I still do not understand how to set priors.
I have a simple mixed effects model, and here is some example code. (I know that others have asked this question, but I had trouble finding one with reproducible code and those answers helped users with a specific question, while I'm trying to understand what the values in a prior list mean).
library(MCMCglmm)
# inverse logit function for simulation
inv.logit <- function(x){
exp(x)/(exp(x)+1)
}
# simulate some data
set.seed(123)
n <- 1000
covariates <- replicate(3, rnorm(n, 0, .5))
X <- cbind(rep(1, length(covariates[,1])),covariates)
colnames(X) <- c('int', 'X1', 'X2', 'X3')
coefs <- c(1, -1, -.5, 0.3)
resp <- X %*% coefs
psi <- inv.logit(resp)
ind <- rep(1:10, floor(n/10)) # assigning individuals for random effect, but it's not important
n_inds <- length(unique(ind))
y <- rbinom(n, 1, psi)
df <- data.frame(ind, y, X)
# priors: I really don't know what I'm doing here
prior1 <- list(R = list(V = 2, n = 1, fix=1),
G = list(G1 = list(V = diag(3), n = 3)))
# run the model
glmm <- MCMCglmm(y ~ X1 + X2 + X3,
random= ~ us(1 + ind), # having random slope and random intercept for ind
family = "categorical",
data = df,
prior=prior1
)
If I were running this in Stan, I would set priors for each fixed effect covariate: Beta ~ normal(0, sigma_beta) and the hyperprior: sigma_beta ~ gamma(2,1). (Although, I'd also be happy just setting the prior as Beta ~ normal(0,100) or something like this.) I would use similar priors for the random effects. I understand that MCMCglmm is more limited in distributions, but I really don't understand the notation. To be clear, I'm not really interested in specifying priors in this example model, I'm trying to understand how one does it.
Is there somewhere I can find a definition of what is exactly meant by each of the values that goes into the prior (e.g., V, n, alpha, ...) and how these values correspond to what we would write in a full model description of the priors? Or is someone willing to explain it to more simple minded people like me? The descriptions in the course notes and github were not able to answer my question.
Thank you!

CausalImpact package in R doesn't work for Poisson bsts model

I'd like to use the CausalImpact package in R to estimate the impact of an intervention on infectious disease case counts. We typically characterize the distributions of case counts as either Poisson or negative binomial. The bsts() function allows us to specify the Poisson family. However this encountered an error in CausalImpact()
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- rpois(100, 1.2 * x1)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
bsts.model <- bsts(y ~ x1, ss, family="poisson", niter = 1000)
impact <- CausalImpact(bsts.model = bsts.model,
post.period.response = post.period.response)
Error in rnorm(prod(dim(state.samples)), 0, sigma.obs) : invalid arguments
This is due to the fact that bsts.model has no sigma.obs slot when generated using family="poisson".
Am I doing this correctly or is there another way to use CausalImpact with Poisson data? (I'd also love to be able to use negative binomial data, but I won't get too greedy).
Last, is this the best place for coding issues for CausalImpact? I didn't see an Issues tab on the GitHub page.

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit))==data[,4]) of 89% - logically, there is no way of linearly classifying all of the points, as it is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, a ~90% is reach after just 1 iteration:
Being in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial distribution and logit-link based on a sigmoid cost function. The coefficients are only meaningful in this context. You've also got an intercept term in the logit.
None of this is true for the perceptron model. The interpretation of the coefficients are thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.

Resources