I would like to know how to do maximum likelihood estimation in R when the fitted parameters are given in an array. This is needed when the number of parameters is large. So basically, to fit a normal distribution to the data x, I would like to do something like the following:
LL <- function(param_array) {
R = dnorm(x, param_array[1], param_array[2])
-sum(log(R))
}
mle(LL, start = list(param_array = c(1,1)))
(Instead of the original code in the first section of http://www.r-bloggers.com/fitting-a-model-by-maximum-likelihood/)
If I run the code above, I get an error:
Error in dnorm(x, param_array[1], param_array[2]) : argument
"param_array" is missing, with no default
Could anyone let me know how to achieve what I want in the correct way?
stats4::mle is not a long function; you can inspect it in your R console:
> stats4::mle
Note how start is handled:
start <- sapply(start, eval.parent)
nm <- names(start)
case 1
If you do:
LL <- function(mu, sigma) {
R = dnorm(x, mu, sigma)
-sum(log(R))
}
mle(LL, start = list(mu = 1, sigma = 1))
you get:
nm
#[1] "mu" "sigma"
Also,
formalArgs(LL)
#[1] "mu" "sigma"
case 2
If you do:
LL <- function(param_array) {
R = dnorm(x, param_array[1], param_array[2])
-sum(log(R))
}
mle(LL, start = list(param_array = c(1,1)))
you get
nm
#[1] NULL
but
formalArgs(LL)
#[1] "param_array"
The problem
The evaluation of the function LL inside stats4::mle works by matching nm to the formal arguments of LL. In case 1 the matching succeeds, but in case 2 there is no match, so the evaluation of LL fails.
So what do people do if they have like 50 parameters? Do they type them in by hand?
Isn't this a bogus argument, on careful reflection? If you really have 50 parameters, does using an array really save you any effort?
First, inside your function LL, you still have to write param_array[1], param_array[2], ..., param_array[50], i.e., you still need to put 50 parameters into the right positions by hand. And when specifying start, you still need to type in a length-50 vector element by element, right? Isn't this the same amount of work as using a list instead of an array?
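That said, if you really want an objective function that takes a single parameter vector, here is a minimal sketch (assuming x is your data vector): optim() passes one numeric vector to the objective, so you can keep the array-style LL and call optim directly instead of stats4::mle. You lose mle's summary and confint methods, but the fit itself is the same.
LL <- function(param_array) {
  # negative log-likelihood with mean = param_array[1], sd = param_array[2]
  -sum(dnorm(x, mean = param_array[1], sd = param_array[2], log = TRUE))
}
fit <- optim(par = c(1, 1), fn = LL, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))  # keep the sd strictly positive
fit$par  # estimated mean and sd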
Related
I am trying to run a Bayesian clustering model where the number of clusters is random with binomial distribution.
This is my Jags model:
model{
for(i in 1:n){
y[ i ,1:M] ~ dmnorm( mu[z[i] , 1:M] , I[1:M, 1:M])
z[i] ~ dcat(omega[1:M])
}
for(j in 1:M){
mu[j,1:M] ~ dmnorm( mu_input[j,1:M] , I[1:M, 1:M] )
}
M ~ dbin(p, Mmax)
omega ~ ddirich(rep(1,Mmax))
}
To run it, we need to define the parameters and the initial values for the variables, which is done in this R script:
library(MASS)  # for mvrnorm
Mmax = 10
y = matrix(0,100,Mmax)
I = diag(Mmax)
y[1:50,] = mvrnorm(50, rep(0,Mmax), I)
y[51:100,] = mvrnorm(50, rep(5,Mmax), I)
plot(y[,1:2])
z = 1*((1:100)>50) + 1
n = dim(y)[1]
M=2
mu=matrix(rnorm(Mmax^2),nrow=Mmax)
mu_input=matrix(2.5,Mmax,Mmax) ### prior mean
p=0.5
omega=rep(1,Mmax)/Mmax
data = list(y = y, I = I, n = n, mu_input=mu_input, Mmax = Mmax, p = p)
inits = function() {list(mu=mu,
M=M,
omega = omega) }
require(rjags)
modelRegress=jags.model("cluster_variabile.txt",data=data,inits=inits,n.adapt=1000,n.chains=1)
However, running the last command, one gets
Error in jags.model("cluster_variabile.txt", data = data, inits = inits,
: RUNTIME ERROR: Compilation error on line 6.
Unknown variable M Either supply values
for this variable with the data or define it on the left hand side of a relation.
which makes no sense to me, since the error is at line 6 even though M already appears at line 4 of the model! What is the actual problem in running this script?
JAGS is not like R or other procedural programming languages in that it doesn't actually run line by line; it is a declarative language, meaning the order of statements doesn't matter, at least in terms of where the errors pop up. So just because it didn't throw an error on line 4 doesn't mean something isn't also wrong there. I'm not positive, but I believe the error occurs because JAGS tries to build the arrays before plugging in values, so M is not actually defined at that stage, and there is nothing you can do about that on your end.
With that aside, there should be a fairly easy workaround; it is just less efficient. Instead of looping from 1:M, make the loop iterate from 1:Mmax so that the dimensions never change: mu is always Mmax x Mmax. Then line 7 just assigns values to the first 1:M of those positions. The downside is that it requires some processing after the model is fit: on each iteration you will need to pull the sampled M and filter the matrix mu down to M x M, but that shouldn't be too tough (a rough sketch follows). Let me know if you need more help.
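A rough post-processing sketch of that last step (the monitored node names and coda's column-major ordering of the mu[i,j] columns are assumptions on my part, not something from the question):
samp  <- coda.samples(modelRegress, variable.names = c("M", "mu"), n.iter = 5000)
draws <- as.matrix(samp)                    # one row per saved iteration
M_draws <- draws[, "M"]
mu_cols <- grep("^mu\\[", colnames(draws))  # columns named mu[1,1], mu[2,1], ...
mu_list <- lapply(seq_len(nrow(draws)), function(s) {
  mu_full <- matrix(draws[s, mu_cols], nrow = Mmax, ncol = Mmax)
  mu_full[seq_len(M_draws[s]), seq_len(M_draws[s]), drop = FALSE]  # keep the sampled M x M block
})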
So, I think the main problem is that you can't change the dimensionality of the stochastic node you're updating. This seems like a problem for reversible jump MCMC, though I don't think you can do this in JAGS.
I have a wrapper around two functions, each with its own parameter vector. The main idea is to pass the parameter vectors (either as one vector or as two vectors) to optim and then maximize the sum of the two log-likelihoods.
Since my actual function is complex, I have tried to provide a simple example that is similar to it. Here is my code:
set.seed(123)
x <- rnorm(10,2,0.5)
ff <- function(x, parOpt){
out <- -sum(log(dnorm(x, parOpt[[1]][1], parOpt[[1]][2]))+log(dnorm(x,parOpt[[2]][1],parOpt[[2]][2])))
return(out)
}
# parameters in mu,sd vectors arranged in list
params <- c(set1 = c(2, 0.2), set2 = c(0.5, 0.3))
xy <- optim(par = params, fn=ff ,x=x)
Which return this error:
Error in optim(par = params, fn = ff, x = x) :
function cannot be evaluated at initial parameters
As I understand it, I get this error because optim cannot pass the parameters to each part of my function. So, how can I tell optim that the first vector holds the parameters of the first part of my function and the second vector holds those of the second part?
You should change the method argument so that optim can make use of your initial parameters.
You can read detailed instructions about the optim function with the ?optim command.
For example, you can use the "L-BFGS-B" method to impose upper and lower constraints.
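Here is a minimal sketch of one way to get the example running: it restructures the objective to take one flat numeric vector (optim passes a single vector to fn) and uses an "L-BFGS-B" lower bound to keep the standard deviations positive. The layout of the vector (positions 1:2 for the first set, 3:4 for the second) is an assumed convention, not something fixed by optim.
set.seed(123)
x <- rnorm(10, 2, 0.5)
ff <- function(par, x) {
  # par[1:2] = mu, sd of the first part; par[3:4] = mu, sd of the second part
  -sum(dnorm(x, par[1], par[2], log = TRUE) +
       dnorm(x, par[3], par[4], log = TRUE))
}
params <- c(2, 0.2, 0.5, 0.3)
xy <- optim(par = params, fn = ff, x = x, method = "L-BFGS-B",
            lower = c(-Inf, 1e-6, -Inf, 1e-6))
xy$par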
I have a classification problem with a very skewed class distribution (e.g. a 90%/10% unbalanced binary variable to predict).
In order to deal with that issue, I want to use the SMOTE method to oversample the minority class. However, as I read here (http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation), it is best practice to apply SMOTE inside the k-fold loop to avoid overfitting.
As I'm using the caret package for my analysis, I'm referring to this link (http://topepo.github.io/caret/sampling.html). I understand everything perfectly except the last part, where it explains how to change the SMOTE parameters:
smotest <- list(name = "SMOTE with more neighbors!",
func = function (x, y) {
library(DMwR)
dat <- if (is.data.frame(x)) x else as.data.frame(x)
dat$.y <- y
dat <- SMOTE(.y ~ ., data = dat, k = 10)
list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
y = dat$.y)
},
first = TRUE)
I simply don't understand this. Someone care to explain? Let's say I want to include the SMOTE parameters perc.over, k and perc.under, how would I do that?
Thank you very much.
EDIT:
Actually, I realized I could probably just add these parameters inside the SMOTE call in the above function. That would, for instance, give something like:
smotest <- list(name = "SMOTE with more neighbors!",
func = function (x, y) {
library(DMwR)
dat <- if (is.data.frame(x)) x else as.data.frame(x)
dat$.y <- y
dat <- SMOTE(.y ~ ., data = dat, k = 10, perc.over = 1200, perc.under = 100)
list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
y = dat$.y)
},
first = TRUE)
I am not sure I have understood what you do not understand, but here is an attempt to clarify what this piece of code does.
The smotest object is created as a list because that is how the sampling argument of the trainControl function must be represented. The first element of this list is a name used only for display purposes. The second, func, is the actual sampling function. The third, first, is a logical value indicating whether the sampling must be done before or after the pre-processing step.
The func element is just a wrapper around the SMOTE function. In this wrapper, line 3 is there because only a data.frame can be passed to SMOTE. Line 4 is added because SMOTE works on a formula combined with a data.frame rather than on an x/y pair. Line 6 ensures that the result is returned to trainControl in the appropriate format.
And, to answer your last question: yes, you can set additional parameters for SMOTE in the way you have proposed.
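For completeness, a short usage sketch (the data frame training and the outcome Class are placeholder names): the custom sampler is plugged into trainControl through its sampling argument and then used by train as usual.
library(caret)
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = smotest)
# model_fit <- train(Class ~ ., data = training, method = "rf",
#                    metric = "ROC", trControl = ctrl)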
I'm using the nls.lm function from the minpack.lm package, and something "weird" happens when I change the order of the parameters in the residual function.
This code works:
install.packages('minpack.lm')
library(minpack.lm)
## values over which to simulate data
x <- seq(0,100,length=100)
## model based on a list of parameters
getPrediction <- function(parameters, x)
parameters$A*exp(-parameters$alpha*x) + parameters$B*exp(-parameters$beta*x)
## parameter values used to simulate data
pp <- list(A = 2, B = 0.8, alpha = 0.6, beta = 0.01)
## simulated data, with noise
simDNoisy <- getPrediction(pp,x) + rnorm(length(x),sd=.01)
#simDNoisy[seq(1,10)] = rep(10,11)
simDNoisy[1] = 4
## plot data
plot(x,simDNoisy, main="data")
## residual function
residFun <- function(parameters, observed, xx)
sqrt(abs(observed - getPrediction(parameters, xx)))
## starting values for parameters
parStart <- list(Ar = 3, Br = 2, alphar = 1, betar = 0.05)
## perform fit
rm(nls.out)
nls.out <- nls.lm(par=parStart,
fn = residFun,
observed = simDNoisy,
xx = x,
control = nls.lm.control(nprint=1))
nls.out
It doesn't work if I replace the residual function with this (just changing the parameter order):
residFun <- function(xx, parameters, observed )
sqrt(abs(observed - getPrediction(xx, parameters)))
Error in parameters$A : $ operator is invalid for atomic vectors
Why does it cause this error ?
Arguments are matched to parameters in the order in which the parameters are defined in the function. The only exception is when you explicitly name them out of order. Consider this example of what the function thinks its two parameters are:
theParameters=function(X,Y){
print(paste("I think X is",X))
print(paste("I think Y is",Y))
}
theParameters(X=2,Y=10)
theParameters(Y=10,X=2)
#you can change the parameter order if you identify them with parameter=...
#but if you don't, it assumes it's in the order of how the function is defined.
# which of these is X and which is Y?
theParameters(10,2)
It's preferable to always name the arguments, but it is necessary if they are out of order. (Some other languages don't even let you change the order of arguments when you call a function.)
getPrediction(parameters = parameters, x = xx)
In this case the function treats whatever you pass in as its own local copies of parameters and x. Without the names, this line
getPrediction(xx, parameters)
means this to R:
getPrediction(parameters = xx, x = parameters)
because positional matching follows the original signature of the function.
So the function's version of parameters is whatever you passed in as xx, and so on.
Because you gave the arguments the same names as your variables, it can be confusing. It is easier to follow if you vary the dummy names slightly. Alternatively, if variable scoping allows it, you don't even have to pass the parameters in at all, but be careful with that practice because it can cause debugging headaches.
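A minimal sketch of the corresponding fix to the question's code: keep the reordered formals if you like, but name the arguments in the inner call so positional matching can no longer swap them.
residFun <- function(xx, parameters, observed)
  sqrt(abs(observed - getPrediction(parameters = parameters, x = xx)))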
I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is that if I define the function arguments (linmod, k, direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your model, lm.cars, its formula was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model comes along with it.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
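An equivalent alternative sketch, relying on the same environment mechanics: extract the formula as before and then explicitly point its environment at the function's own evaluation frame, so that lm and step can see datas.train.
form <- formula(linmod$call)
environment(form) <- environment()  # use cv.step's environment when the model is refit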
Another problem that can cause this is passing a character string to lm instead of a formula. Character vectors have no environment, so when lm converts the character to a formula, the formula apparently also has no environment, instead of being automatically assigned the local environment. If one then uses, as weights, an object that is not in the data argument's data.frame but is a local function argument, one gets a "not found" error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it picks up the weights() function from the stats package instead, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.