Trouble Using Function which takes Formula Argument in R - r

I have a function implementing the Fisher algorithm in R for a GLM which takes formula as an argument. However when attempting to run it I get the error: Error in model.frame.default(formula = formula, drop.unused.levels = TRUE) : invalid type (closure) for variable 't'
I tried calling it in several ways (e.g. using as.formula(y ~ t)), but nothing seems to work.
myglm <- function(formula,data,start = 0) {
X = model.matrix(formula,data) #It appears that the issue comes from this line
Y = data[,1]
n = dim(X)[1]
p <- dim(X)[2]
beta_0 = rep(1,p)
M = t(X)%*%X
beta = rep(0,p)#Least Squares Estimate
epsilon = 0.01
#Run Fisher Iterations
while (norm(beta-beta_0,type = "2")/norm(beta_0, type = "2") > epsilon) {
beta_0 = beta
eta = X %*% beta
lambda = exp(eta)
F = t(X) %*% diag(as.vector(lambda)) %*% X #Fisher information matrix
s = t(X) %*% (Y - exp(eta)) #Score function
beta = beta + solve(F) %*% s
}
vcov = solve(F)
coef = matrix(c(0,0,0,0),nrow = 2, ncol = 2)
coef[,1] = beta
coef[,2] = t(sqrt(diag(vcov)))
colnames(coef) = c("Coefficients","Standard error")
rownames(coef) = c("beta1", "beta2")
#Calculate Deviance
mod_sat = glm(formula, family = poisson(link = "log"))
log_likelihood = Y %*% eta - exp(eta)
deviance = 2*(LogLik(mod_sat) - log_likelihood)
return(list(coef,deviance,vcov))
}
f = formula(y ~ t)
load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata")) #This is stored as "data"
myglm(f, data)

Your issue is in this line:
mod_sat = glm(formula, family = poisson(link = "log"))
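You need to pass a data = argument to glm() so that it knows where to find the variables in the formula. Without it, glm() looks up t in the formula's environment, finds the base function t() (a closure) instead of your column, and model.frame() fails with the error you see.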
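A minimal sketch of why the missing data = argument produces this particular message. The data frame d and the vector y are made up for illustration and are not the hoge-veluwe data; only the variable names mirror the question.

```r
# Hypothetical data: counts y observed at values t (names mirror the question)
set.seed(1)
y <- rpois(50, lambda = 3)            # response also exists as a free-standing vector
d <- data.frame(y = y, t = runif(50))
f <- y ~ t

# Without data=, `t` is looked up in the formula's environment, where the
# only match is base::t (the transpose function), so model.frame() fails.
no_data <- tryCatch(glm(f, family = poisson(link = "log")),
                    error = function(e) conditionMessage(e))
print(no_data)

# With data=, the variables are resolved inside the data frame first.
fit <- glm(f, family = poisson(link = "log"), data = d)
print(coef(fit))
```

Inside myglm() the same fix would be to forward the function's own data argument to glm().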

Related

Nonlinear regression in R error in step factor

I need to approximate data points with an exponential function that has a lower limit (y is a price over time and I need a fixed minimum value, so the asymptote of the exponential function can't be at 0). For some y vectors my code works, but for others it returns an error. How can I solve it? Thanks
R code:
y <- c(26973, 24907, 22999, 21236, 19609, 18107, 16720, 15439, 14256, 13163,
12155, 11224, 10364, 9570, 8836)
x <- c(1:15)
train <- data.frame(x, y)
colnames(train) <- c("x", "y")
# Select an approximate $\theta$, since theta must be lower than min(y), and greater than zero
theta.0 <- min(train$y) * 0.5 #min(data.df$y) * 0.5
# Estimate the rest parameters using a linear model
model.0 <- lm(log(y - theta.0) ~ x, data = train)
alpha.0 <- exp(coef(model.0)[1])
beta.0 <- coef(model.0)[2]
# Starting parameters
start <- list(alpha = alpha.0, beta = beta.0, theta = theta.0)
print(start)
model <- nls(y ~ alpha * exp(beta * x) + theta , data = train, start = start)
plot(train$x, train$y)
lines(train$x, predict(model, list(x = train$x)), col = 'skyblue', lwd = 3)
Output:
Error in nls(y ~ alpha * exp(beta * x) + theta, data = train, start = start) :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562

Data is too long Error in R FlexmixNL package

I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(x1, x2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y~x, k =2,
model = FLXMRnlm(formula = y ~ a*x/(b+x),
family = "gaussian",
start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I checked google for similar errors, but I don't quite understand what is wrong with my own code that results in this error.
As you can already tell, I am very new to R, so please explain it in the most layman terms possible. Thank you in advance.
Ironically (in the context of an error message saying data is "too long"), I think the proximate cause of that error is the missing data argument. If you give it the data in the form of a data frame, you still get an error, but it's not the same one you are experiencing. When you plot the data, you get a rather bizarre set of values, at least from a statistical distribution standpoint, and it's not clear why you are trying to model this with this formula. Nonetheless, with those starting values and a data frame passed to data, one sees results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.436 6664 20000 0.333
Comp.2 0.564 13336 16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first look for names matching the formula's variables within the data= argument. Apparently this function fails when it needs to go out to the global environment to match formula tokens.
I tried a formula suggested by the plot of the data and get convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
1 2
9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
prior size post>0 ratio
Comp.1 0.521 9395 18009 0.522
Comp.2 0.479 10605 13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC seems huge compared to the first formula.

Simulating datasets in R for model selection

I wrote code to simulate a dataset in R to see how backward selection works in machine learning. I used poly() to generate the polynomial terms and then wanted to choose the suitable polynomial degree using Cp, BIC and adjusted R^2.
The code is:
###Generating dataset
set.seed(1)
X = rnorm(100)
eps = rnorm(100)
beta0 = 3
beta1 = 2
beta2 = -3
beta3 = 0.3
Y = beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
library(leaps)
data.full = data.frame(y = Y, x = X)
mod.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
mod.summary = summary(mod.full)
### Find the model size for best cp, BIC and adjr2
which.min(mod.summary$cp)
For Cp, BIC and adjusted R^2 I get the model with polynomial degree 3, as it should be.
However, now I want to simulate 100 datasets and see in how many of them I get the right model. I simulated 100 datasets, but now I am not getting polynomial degree 3 for each of the measures, and I don't quite understand what I'm doing wrong. My code for the simulation is:
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
###Using polynomial terms
mod.bwd = regsubsets(data.full$y.1 ~ poly(data.full$x.1, 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
which.min(bwd.summary$cp)
which.min(bwd.summary$bic)
which.max(bwd.summary$adjr2)
For a given dataset, Cp, BIC and adjusted R^2 give me different results. For example, using y.1 and x.1 (the first dataset in the simulation) gives the following results:
which.min(bwd.summary$cp): 7
which.min(bwd.summary$bic): 4
which.max(bwd.summary$adjr2): 9
Can someone help me see what I'm doing wrong in simulating these 100 datasets?
If I've read your code correctly, you fit the model only to the first simulated dataset (y.1 and x.1) rather than to all 100 simulated datasets. This should do the trick:
set.seed(42)
###Generating 100 datasets
data <- replicate(100, rnorm(n=100))
epsilon <- replicate(100,rnorm(n=100))
###Formula (same as before)
Y = beta0 + beta1 * data + beta2 * data^2 + beta3 * data^3 + epsilon
data.full = data.frame(y = Y, x = data)
res <- lapply(1:100, function(i){
###Using polynomial terms
mod.bwd = regsubsets(data.full[[i]] ~ poly(data.full[[100+i]], 10, raw = T), data = data.full, nvmax = 10,
method = "backward")
bwd.summary = summary(mod.bwd)
c(
which.min(bwd.summary$cp),
which.min(bwd.summary$bic),
which.max(bwd.summary$adjr2)
)
})
res <- do.call(rbind, res)
With this RNG seed, this gives some rows where all criteria select the correct model.
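To count how often the true degree is recovered across the simulated datasets, a base-R sketch along the following lines may help. Note that it swaps in a simpler technique: instead of regsubsets() subset selection it compares nested polynomial fits of degree 1 to 6 by BIC. The helper name pick_degree and the degree cap of 6 are assumptions made for illustration.

```r
set.seed(42)
n_sets <- 100                     # number of simulated datasets
n_obs  <- 100                     # observations per dataset
beta   <- c(3, 2, -3, 0.3)        # beta0..beta3 from the question

# Simulate one dataset and return the polynomial degree with the lowest BIC
pick_degree <- function() {
  x <- rnorm(n_obs)
  y <- beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 + rnorm(n_obs)
  bics <- sapply(1:6, function(d) BIC(lm(y ~ poly(x, d, raw = TRUE))))
  which.min(bics)
}

chosen <- replicate(n_sets, pick_degree())
print(table(chosen))              # how often each degree was selected
print(mean(chosen == 3))          # share of datasets where degree 3 wins
```

The same tabulation can be applied to each column of res from the regsubsets() version, bearing in mind that there the numbers are model sizes, not polynomial degrees.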

How to define function arguments based on data.frame columns (R)?

I have a script that runs maximum likelihood estimation for a linear model. The model has several variables and I need to vary them occasionally, maybe add or drop some. The usual way to define the likelihood function is like this:
LL <- function(beta0, beta1, beta2, mu, sigma){
R = y - beta0*X$x0 - beta1*X$x1 - beta2*X$x2
R = dnorm(R, mu, sigma, log = T)
-sum(R)
}
I have the dependent variable in a vector y and the covariates in a data.frame X:
X <- data.frame(x0 = 1, x1 = runif(100), x2 = runif(100)*2)
y <- X$x0 + X$x1 + X$x2 + rnorm(100)
Now the number of variables is subject to change by application, and I need to reformulate the function so that it will take as many covariates as there are columns in the data.frame X. I was already able to reformulate this to a more general form:
cols <- 0:(ncol(X)-1)
betas <- paste0("beta", cols)
eqR <- paste0("y - ", paste0(betas, "*X$x", cols, collapse = " - "))
LL <- function(beta0, beta1, beta2, mu, sigma){
R = as.formula(eqR)
R = dnorm(R, mu, sigma, log = T)
-sum(R)
}
I'm still struggling to find a way to dynamically define the function so that it would take the same number of beta arguments as there are columns in the covariate matrix. Ellipsis is perhaps useful here? I also tried with do.call:
LL <- function(betas, mu, sigma){
R <- do.call(dnorm(as.formula(eqR), mu, sigma, log = T), betas)
-sum(R)
}
That doesn't work when you fit the model, which has another stumbling block in the list of initial values:
require(stats4)
fit <- mle(LL, start = list(beta0 = 0, beta1 = 0, beta2 = 0, mu = 0, sigma = 1))
Any ideas for this?
EDIT:
I made some progress with the bbmle package:
require(bbmle)
dfModel <- cbind(y, X)
cols <- 0:(ncol(X)-1)
betas <-paste0("beta",cols)
betaList <- as.list(rep(0, length(betas)))
names(betaList) <- betas
initList <- c(betaList, mu = 0, sigma = 1)
fitML <- mle2(mu ~ dnorm(mean = y - beta0*x0 - beta1*x1 - beta2*x2, sd = sigma),
start = initList,
data = dfModel)
The above example works. But when I try to define the function beforehand with as.formula, I can't get it working. So the following does not work.
eqR <- paste0("y - ", paste0(betas, "*x", cols, collapse = " - "))
fitML <- mle2(mu ~ dnorm(mean = as.formula(eqR), sd = sigma),
start = initList,
data = dfModel)
The error message is:
Error in eval(expr, envir, enclos) : object 'beta0' not found
I suspect that this might have something to do with scoping, perhaps a conflict between dnorm and as.formula? I just can't find a workaround for that.
Try this:
betas = c(0,0,0)
X <- data.frame(x0 = 1, x1 = runif(100), x2 = runif(100)*2)
y <- apply(X,1,sum) + rnorm(100)
where betas is (b0, b1, b2, ...etc) and its length must be equal to the number of columns of X.
Since X could have a different number of columns y should be defined as above.
Your LL function should change to:
LL <- function(betas, mu, sigma){
R = y - as.matrix(X) %*% as.matrix(betas)
R = dnorm(R, mu, sigma, log = T)
-sum(R)
}
where %*% is the matrix product. This is the same as doing b[1]*X[,1] + b[2]*X[,2] + b[3]*X[,3] + ... + b[n]*X[,n]
With these changes, you could have data frame X with any number of columns, betas an array of the same length as columns of X.
I hope I understood what you needed.
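For completeness, a self-contained sketch of fitting such a vector-parameter likelihood. It uses base optim() rather than stats4::mle(), fixes mu at 0 (with the intercept column x0 present, mu and beta0 cannot be identified separately), and optimises log(sigma) to keep sigma positive. The names negLL, theta and est are assumptions made for illustration.

```r
set.seed(1)
X <- data.frame(x0 = 1, x1 = runif(100), x2 = runif(100) * 2)
y <- rowSums(X) + rnorm(100)

# Negative log-likelihood; theta = c(betas..., log(sigma)).
# The number of betas is taken from ncol(X), so X may have any number of columns.
negLL <- function(theta, X, y) {
  p     <- ncol(X)
  betas <- theta[seq_len(p)]
  sigma <- exp(theta[p + 1])
  resid <- y - as.matrix(X) %*% betas
  -sum(dnorm(resid, mean = 0, sd = sigma, log = TRUE))
}

start <- c(rep(0, ncol(X)), 0)    # all betas at 0, log(sigma) at 0
fit   <- optim(start, negLL, X = X, y = y, method = "BFGS")

est <- setNames(fit$par[seq_len(ncol(X))],
                paste0("beta", seq_len(ncol(X)) - 1))
print(est)
print(coef(lm(y ~ x1 + x2, data = X)))  # least squares fit for comparison
```

Because the starting vector is built from ncol(X), adding or dropping covariates only requires changing the data frame.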

fitting function for a given data set

I'm trying to fit the function y(x) = a*(1 + (x^2)/(b^2))^t to a particular set of data, where a, b and t are constants that I want to determine by fitting.
I tried the following, for example:
len <- 24
x = runif(len)
y = x^3 + runif(len, min = -0.1, max = 0.1)
plot(x, y)
s <- seq(from = 0, to = 1, length = 50)
lines(s, s^3, lty = 2)
df <- data.frame(x, y)
m <- nls(y~a*( 1 + (x^2)/(b^2) )^t, data = df, start = list(a=1,t=0, b=1), trace = T)
> Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Can someone help me fit this function to these points? Even if the fit turns out to be poor, the important thing is that the fit runs on the data.
Thanks, everyone.
Because your data are changing randomly, for some situations the value of a is close to zero and your function becomes zero. The curve fit procedure fails at that point. Randomizing the start parameters might work for some situations.
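One further contributor that can be checked from the formula itself: at the starting value t = 0 the model reduces to the constant a, so its derivative with respect to b is zero everywhere and the gradient matrix loses a column. The sketch below evaluates the analytic partial derivatives at two starting points; the helper name jac and the evenly spaced x grid are assumptions made for illustration.

```r
x <- seq(0.05, 1, length.out = 24)   # hypothetical x grid on (0, 1]

# Analytic partial derivatives of a * (1 + x^2 / b^2)^t w.r.t. a, b and t
jac <- function(a, b, t) {
  u <- 1 + x^2 / b^2
  cbind(a = u^t,
        b = a * t * u^(t - 1) * (-2 * x^2 / b^3),
        t = a * u^t * log(u))
}

J0 <- jac(a = 1, b = 1, t = 0)       # start values from the question
J1 <- jac(a = 1, b = 1, t = 1)       # same start, but with t away from 0

print(range(J0[, "b"]))              # the b column at t = 0
print(qr(J0)$rank)                   # numerical rank at the original start
print(qr(J1)$rank)                   # numerical rank with t = 1
```

Starting nls() from a non-zero t may therefore be worth trying before switching algorithms.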
A slightly more stable output can be computed using the LM algorithm:
require("minpack.lm")
LMCurveFit <- function(df) {
# The function to be fit
FitFunction <- function(params, x) {
with (
as.list(params), {
a*(1 + x^2/b^2)^t
}
)
}
# Residual
Residual <- function(params, x, y) {
FitFunction(params, x) - y
}
# Sum of squares of residuals
ssqfun <- function(params, x, y) {
sum(Residual(params, x, y)^2)
}
# Normalize the data
x_max = max(df$x)
y_max = max(df$y)
df$x = df$x/x_max
df$y = df$y/y_max
# Define start parameters
a_start = 0.1
b_start = 1.0
t_start = 1.0
param_start = c(a = a_start,
b = b_start,
t = t_start)
# Do LM fit
nls.out <- nls.lm(par = param_start,
fn = Residual,
control = nls.lm.control(nprint=0,
ftol=.Machine$double.eps,
ptol=.Machine$double.eps,
maxfev=10000, maxiter=1024),
x = df$x,
y = df$y)
# Revert scaling
nls.out$par[1] = nls.out$par[1]*y_max
nls.out$par[2] = nls.out$par[2]*x_max
# Get the parameters
params_fit = coef(nls.out)
print(params_fit)
# Compute predicted values
predicted = FitFunction(as.list(params_fit), df$x*x_max)
}
# LM fit
pred_y = LMCurveFit(df)
# Sort by x so the fitted curve is drawn from left to right
ord <- order(x)
lines(x[ord], pred_y[ord])
