"Data is too long" error in R flexmixNL package

I tried to search this online, but couldn't exactly figure out what my issue was. Here is my code:
n = 10000
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y1 <- 10*sin(x1/10) + 10 + rnorm(n, sd = 1)
y2 <- x2 * cos(x2) - 2 * rnorm(n, sd = 2)
x <- c(x1, x2)
y <- c(x1, x2)
start1 = list(a = 10, b = 5)
start2 = list(a = 30, b = 5)
library(flexmix)
library(flexmixNL)
modelNL <- flexmix(y ~ x, k = 2,
                   model = FLXMRnlm(formula = y ~ a*x/(b+x),
                                    family = "gaussian",
                                    start = list(start1, start2)))
plot(x, y, col = clusters(modelNL))
and before the plot, it gives me this error:
Error in matrix(1, nrow = sum(groups$groupfirst)) : data is too long
I checked Google for similar errors, but I don't quite understand what is wrong with my own code that produces this error.
As you can already tell, I am very new to R, so please explain it in layman's terms as much as possible. Thank you in advance.

Ironically (in the context of an error message saying the data is "too long"), I think the proximate cause of that error is the absence of a data argument. If you give it the data in the form of a dataframe, you still get an error, but it's not the same one you are experiencing. When you plot the data you get a rather bizarre set of values, at least from a statistical-distribution standpoint, and it's not clear why you are trying to model them with this formula. Nonetheless, with those starting values and a dataframe passed to data, one sees results.
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x/(b+x),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
    1     2
 6664 13336
convergence after 20 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x/(b + x), family = "gaussian", start = list(start1, start2)))
       prior  size post>0 ratio
Comp.1 0.436  6664  20000 0.333
Comp.2 0.564 13336  16306 0.818
'log Lik.' -91417.03 (df=7)
AIC: 182848.1 BIC: 182903.4
Most R regression functions first look for the names in the formula inside the data= argument. Apparently this function fails when it instead has to go out to the global environment to resolve the formula's variables.
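As a small illustration of that lookup order (the variable names here are made up): lm() happily falls back to the calling environment when data= is missing, which is why the habit can go unnoticed until a function that depends on data=, such as FLXMRnlm(), breaks.
# toy data living in the global environment
xx <- 1:10
yy <- 2 * xx + rnorm(10)
lm(yy ~ xx)                                   # works: lm() searches the calling environment
lm(y ~ x, data = data.frame(x = xx, y = yy))  # safer: names are resolved inside data=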
I tried a formula suggested by the plot of the data and got convergent results:
> modelNL <- flexmix(y~x, k =2, data=data.frame(x=x,y=y),
+ model = FLXMRnlm(formula = y ~ a*x*cos(x+b),
+ family = "gaussian",
+ start = list(start1, start2)))
> modelNL
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
Cluster sizes:
    1     2
 9395 10605
convergence after 17 iterations
> summary(modelNL)
Call:
flexmix(formula = y ~ x, data = data.frame(x = x, y = y), k = 2, model = FLXMRnlm(formula = y ~
a * x * cos(x + b), family = "gaussian", start = list(start1, start2)))
       prior  size post>0 ratio
Comp.1 0.521  9395  18009 0.522
Comp.2 0.479 10605  13378 0.793
'log Lik.' -78659.85 (df=7)
AIC: 157333.7 BIC: 157389
The reduction in AIC seems huge compared to the first formula.
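If you keep both fits around (say as modelNL1 and modelNL2; these are hypothetical names, since the transcript above reuses modelNL), flexmix objects have AIC() and BIC() methods, so the comparison is a one-liner:
c(AIC(modelNL1), AIC(modelNL2))
c(BIC(modelNL1), BIC(modelNL2))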

Related

How to perform a nonlinear regression of a complex function that has a summation using R?

I have the following function:
y = A * (1 - (6/pi^2) * sum(n = 1..Inf) of (1/n^2) * exp(-B * n^2 * pi^2 * x / R^2))
In this function the parameter R is a constant with a value of 22.5. I want to estimate the parameters A and B using nonlinear regression (the nls() function). I made a few attempts, but all were unsuccessful. I'm not very familiar with this type of operation in R, so I would like your help.
Additionally, if possible, I would also like to plot this function using ggplot2.
# Initial data
x <- c(0, 60, 90, 120, 180, 240)
y <- c(0, 0.967676, 1.290101, 1.327099, 1.272404, 1.354246)
R <- 22.5
df <- data.frame(x, y)
f <- function(x) (1/(n^2))*exp((-B*(n^2)*(pi^2)*x)/(R^2))
# First try
nls(formula = y ~ A*(1-(6/(pi^2))*sum(f, seq(1, Inf, 1))),
    data = df,
    start = list(A = 1,
                 B = 0.7))
Error in seq.default(1, Inf, 1) : 'to' must be a finite number
# Second try
nls(formula = y ~ A*(1-(6/(pi^2))*integrate(f, 1, Inf)),
    data = df,
    start = list(A = 1,
                 B = 0.7))
Error in f(x, ...) : object 'n' not found
You can use a finite sum approximation. Using 25 terms:
f <- function(x, B, n = 1:25) sum((1/(n^2)) * exp((-B*(n^2)*(pi^2)*x)/(R^2)))
fm <- nls(formula = y ~ cbind(A = (1 - (6/pi^2)) * Vectorize(f)(x, B)),
          data = df,
          start = list(B = 0.7),
          alg = "plinear")
fm
giving:
Nonlinear regression model
model: y ~ cbind(A = (1 - (6/pi^2)) * Vectorize(f)(x, B))
data: df
B .lin.A
-0.00169 1.39214
residual sum-of-squares: 1.054
Number of iterations to convergence: 12
Achieved convergence tolerance: 9.314e-06
The model does not seem to fit the data very well (solid line in graph below); however, a logistic model seems to work well (dashed line).
fm2 <- nls(y ~ SSlogis(x, Asym, xmid, scal), df)
plot(y ~ x, df)
lines(fitted(fm) ~ x, df)
lines(fitted(fm2) ~ x, df, lty = 2)
legend("bottomright", c("fm", "fm2"), lty = 1:2)

Plotting with GLMMadaptive for zero-inflated, semi-continuous data?

I'm trying to utilize effectPlotData() as described here: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/Methods_MixMod.html
But I'm trying to apply it to a two-part mixed model for zero-inflated, semi-continuous data (hurdle log-normal) that includes random and fixed effects for both the linear and the logistic portion. I get the following error:
'Error in Qs[1, ] : incorrect number of dimensions'
I think this comes from having more than one set of random/fixed-effect outcomes, but if anyone else has come across this error or can advise, it would be appreciated! I've tried changing the terms in the new data frame and tried a couple of different options for length.out (first the number of subjects, then the total number of observations across all subjects), but I get the same error each time.
The code below specifies the model (m) and the new data frame (nDF):
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3,
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X), max(X), length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
It seems to work with the following example:
library("GLMMadaptive")
set.seed(1234)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
# we construct a data frame with the design:
# everyone has a baseline measurement, and then measurements at random follow-up times
DF <- data.frame(id = rep(seq_len(n), each = K),
                 time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
                 sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
# design matrices for the fixed and random effects non-zero part
X <- model.matrix(~ sex * time, data = DF)
Z <- model.matrix(~ time, data = DF)
# design matrices for the fixed and random effects zero part
X_zi <- model.matrix(~ sex, data = DF)
Z_zi <- model.matrix(~ 1, data = DF)
betas <- c(-2.13, -0.25, 0.24, -0.05) # fixed effects coefficients non-zero part
sigma <- 0.5 # standard deviation error terms non-zero part
gammas <- c(-1.5, 0.5) # fixed effects coefficients zero part
D11 <- 0.5 # variance of random intercepts non-zero part
D22 <- 0.1 # variance of random slopes non-zero part
D33 <- 0.4 # variance of random intercepts zero part
# we simulate random effects
b <- cbind(rnorm(n, sd = sqrt(D11)), rnorm(n, sd = sqrt(D22)), rnorm(n, sd = sqrt(D33)))
# linear predictor non-zero part
eta_y <- as.vector(X %*% betas + rowSums(Z * b[DF$id, 1:2, drop = FALSE]))
# linear predictor zero part
eta_zi <- as.vector(X_zi %*% gammas + rowSums(Z_zi * b[DF$id, 3, drop = FALSE]))
# we simulate log-normal longitudinal data
DF$y <- exp(rnorm(n * K, mean = eta_y, sd = sigma))
# we set the zeros from the logistic regression
DF$y[as.logical(rbinom(n * K, size = 1, prob = plogis(eta_zi)))] <- 0
###############################################################################
km1 <- mixed_model(y ~ sex * time, random = ~ 1 | id, data = DF,
                   family = hurdle.lognormal(),
                   zi_fixed = ~ sex)
km1
nDF <- with(DF, expand.grid(time = seq(min(time), max(time), length.out = 15),
                            sex = levels(sex)))
plot_data <- effectPlotData(km1, nDF)
library("lattice")
xyplot(pred + low + upp ~ time | sex, data = plot_data,
       type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
       xlab = "Follow-up time", ylab = "")
local({
    km1$Funs$mu_fun <- function (eta) {
        pmax(exp(eta + 0.5 * exp(2 * km1$phis)), .Machine$double.eps)
    }
    km1$family$linkfun <- function (mu) log(mu)
    plot_data <- effectPlotData(km1, nDF)
    xyplot(exp(pred) + exp(low) + exp(upp) ~ time | sex, data = plot_data,
           type = "l", lty = c(1, 2, 2), col = c(2, 1, 1), lwd = 2,
           xlab = "Follow-up time", ylab = "")
})
In case someone comes across the same error: I was filtering data from my data frame within the model call, which caused the dimensions of the model and of the variables from the data frame not to match. Applying the same filtering to the new data frame fixes it. (I've since moved to a completely new data frame that only includes the trials actually used by the model, so that no filtering is needed at any step.)
m = mixed_model(Y ~ X, random = ~ 1 | Subject,
                data = data_combined_temp_Fix_Num3[data_combined_temp_Fix_Num3$Z >= 4 &
                                                   data_combined_temp_Fix_Num3$ZZ >= 4, ],
                family = hurdle.lognormal,
                n_phis = 1, zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                na.action = na.exclude)
nDF <- with(data_combined_temp_Fix_Num3,
            expand.grid(X = seq(min(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                      data_combined_temp_Fix_Num3$ZZ >= 4]),
                                max(X[data_combined_temp_Fix_Num3$Z >= 4 &
                                      data_combined_temp_Fix_Num3$ZZ >= 4]),
                                length.out = 908),
                        Y = levels(Y)))
effectPlotData(m, nDF)
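A cleaner variant of the same idea (a sketch reusing the question's names): filter once into its own data frame and use it everywhere, so the model and the prediction grid cannot drift apart. Note that effectPlotData() only needs the covariates, so the response can be left out of the grid.
dat_sub <- subset(data_combined_temp_Fix_Num3, Z >= 4 & ZZ >= 4)
m <- mixed_model(Y ~ X, random = ~ 1 | Subject, data = dat_sub,
                 family = hurdle.lognormal, n_phis = 1,
                 zi_fixed = ~ X, zi_random = ~ 1 | Subject,
                 na.action = na.exclude)
nDF <- with(dat_sub, expand.grid(X = seq(min(X), max(X), length.out = 908)))
effectPlotData(m, nDF)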

Using effects package inside a function

I have a large data frame with lots of variables that I'm going to analyze in the same way. Specifically, I want to plot effect confidence intervals from a mixed-effects model, so I want to write a function that makes a custom plot for one dependent variable. Calling effect() directly works fine, but the same code inside a function causes an error.
I tried two variants of the function; both cause errors.
Here is my reproducible example:
library(nlme)
library(effects)
df <- data.frame(y = rnorm(90), x = gl(3, 30), b = factor(rep(1:30, 3)))
fit <- lme(fixed = y ~ x, random = ~ 1 | b, data = df, method = "REML")
ef <- effect("x", fit)
bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
              ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
test1 <- function(y, x, b)
{
    fit <- lme(fixed = y ~ x, random = ~ 1 | b, method = "REML")
    ef <- effect("x", fit)
    bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
                  ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
    arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test1(df$y, df$x, df$b)
# Error in eval(predvars, data, env) : object 'y' not found
test2 <- function(y, x, b)
{
    frame <- data.frame(y, x, b)
    fit <- lme(fixed = y ~ x, random = ~ 1 | b, frame, method = "REML")
    ef <- effect("x", fit)
    bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
                  ylim = c(min(ef$lower), max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
    arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test2(df$y, df$x, df$b)
# Error in as.data.frame.default(data, optional = TRUE) :
# cannot coerce class ‘"function"’ to a data.frame
Simpler:
test3 <- function(df) {
    fit <- lme(fixed = y ~ x, random = ~ 1 | b, data = df, method = "REML")
    ef <- effect("x", fit)
    bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
                  ylim = c(min(ef$lower),
                           max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
    arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test3(df)
You need to pass data to lme(); the formula alone doesn't carry any data.
That said, your test2 should work. I can replicate your error, and it is really very weird: somehow the code works in the global environment but not inside the closure. One plausible culprit is the name frame, which is also a base graphics function; if effect() re-evaluates the model's data argument by name in an environment where your local data frame isn't visible, it will find that function instead, which would explain the complaint about coercing class "function" to a data.frame.
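A workaround worth trying (a sketch, not guaranteed across effects versions): build the model call with do.call(), which stores the evaluated data frame in the call itself, so effect() can recover the data regardless of where it re-evaluates it; and avoid frame as a variable name.
test4 <- function(y, x, b) {
    dat <- data.frame(y, x, b)
    # do.call() embeds the actual data frame object in fit$call$data
    fit <- do.call(lme, list(fixed = y ~ x, random = ~ 1 | b,
                             data = dat, method = "REML"))
    ef <- effect("x", fit)
    bp <- barplot(as.vector(ef$fit), col = c("tomato", "skyblue", "limegreen"),
                  ylim = c(min(ef$lower),
                           max(ef$upper) + (max(ef$upper) - min(ef$lower)) * 0.2))
    arrows(x0 = bp, y0 = ef$lower, y1 = ef$upper, code = 3, angle = 90)
}
test4(df$y, df$x, df$b)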

How to fit exponential regression in R? (a.k.a. changing the base of the exponent)

I am fitting exponential regressions in R. Specifically, I want to compare y = e^(ax+b) with y = 5^(ax+b).
# data
set.seed(1)
y <- c(3.5, 2.9, 2.97, 4.58, 6.18, 7.11, 9.50, 9.81, 10.17, 10.53,
       12.33, 14.14, 18, 22, 25, 39, 40, 55, 69, 72) + rnorm(20, 10, 1)
x <- 1:length(y)
df = data.frame(x = x, y = y)
predata = data.frame(x = 1:20)
# plot
plot(df, ylim = c(0,100), xlim = c(0,40))
# simple linear regression
fit_sr = lm(y ~ x, data = df)
pre_sr = predict(fit_sr, newdata = predata,
                 interval = 'confidence',
                 level = 0.90)
lines(pre_sr[,1], col = "red")
# exponential regression 1 (base e)
fit_er1 = lm(log(y, base = exp(1)) ~ x, data = df)
pre_er1 = predict(fit_er1, newdata = predata,
                  interval = 'confidence',
                  level = 0.90)
pre_er1 = exp(1)^pre_er1 # back-transform to the original scale
lines(pre_er1[,1], col = "dark green")
# exponential regression 2 (base 5)
fit_er2 = lm(log(y, base = 5) ~ x, data = df)
pre_er2 = predict(fit_er2, newdata = predata,
                  interval = 'confidence',
                  level = 0.90)
pre_er2 = 5^pre_er2 # back-transform to the original scale
lines(pre_er2[,1], col = "blue")
I expect something like plot1, but exponential regressions 1 and 2 turn out exactly the same (plot2).
[plot1: expected result, two distinct fitted curves]
[plot2: actual result, the two fitted curves overlap]
I thought the two regressions should be different because the transformed y values are different.
Also, I am looking for how to make y = exp(ax+b) + c fitting in R.
Your code is correct; your theory is where the problem is. The models should be the same.
The easiest way is to think on the log scale, as you've done in your code. Starting with y = exp(a*x + b) we get log(y) = a*x + b, a linear model with log(y) as the response. With y = 5^(c*x + d) we get log(y) = (c*x + d) * log(5) = (c*log(5)) * x + (d*log(5)), also a linear model with log(y) as the response. The model fit and predictions will not differ with a different base; only the coefficients rescale, with a = c*log(5) and b = d*log(5). Equivalently, divide the base-e coefficients by log(5) to get the base-5 ones.
It's a bit like wanting to compare the linear models y = ax + b where x is measured in meters vs y = ax + b where x is measured in centimeters. The coefficients will change to accommodate the scale, but the fit isn't any different.
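You can verify the rescaling directly with the two fits from the question: multiplying the base-5 coefficients by log(5) reproduces the base-e coefficients exactly.
coef(fit_er1)            # base-e coefficients
coef(fit_er2) * log(5)   # base-5 coefficients rescaled; identical up to rounding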
The first part is already answered by @Gregor; the second part, "...I am looking for how to make y = exp(ax+b) + c fitting in R", can be done with nls:
fit_er3 <- nls(y ~ exp(a*x + b) + c, data = df, start = list(a = 1, b = 0, c = 0))
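As a quick visual check (a sketch using the df, predata, and fit_er3 objects above):
plot(df$x, df$y, ylim = c(0, 100))
lines(predata$x, predict(fit_er3, newdata = predata), col = "purple")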

Fitting a curve to specific data

I have the following data in my thesis:
28 45
91 14
102 11
393 5
4492 1.77
I need to fit a curve to this data. Plotting it suggests some kind of exponential decay, so I think an exponential curve should fit. I am using gnuplot. Can someone tell me what kind of curve will fit this and what initial parameters I can use?
Just in case R is an option, here's a sketch of two methods you might use.
First method: evaluate the goodness of fit of a set of candidate models
This is probably the best way as it takes advantage of what you might already know or expect about the relationship between the variables.
# read in the data
dat <- read.table(text= "x y
28 45
91 14
102 11
393 5
4492 1.77", header = TRUE)
# quick visual inspection
plot(dat); lines(dat)
# a smattering of possible models... just made up on the spot
# with more effort some better candidates should be added
models <- list(lm(y ~ x, data = dat),
               lm(y ~ I(1 / x), data = dat),
               lm(y ~ log(x), data = dat),
               nls(y ~ I(1 / x * a) + b * x, data = dat, start = list(a = 1, b = 1)),
               nls(y ~ (a + b * log(x)), data = dat, start = setNames(coef(lm(y ~ log(x), data = dat)), c("a", "b"))),
               nls(y ~ I(exp(1) ^ (a + b * x)), data = dat, start = list(a = 0, b = 0)),
               nls(y ~ I(1 / x * a) + b, data = dat, start = list(a = 1, b = 1)))
# have a quick look at the visual fit of these models
library(ggplot2)
ggplot(dat, aes(x, y)) + geom_point(size = 5) +
    stat_smooth(method = lm, formula = as.formula(models[[1]]), size = 1, se = FALSE, color = "black") +
    stat_smooth(method = lm, formula = as.formula(models[[2]]), size = 1, se = FALSE, color = "blue") +
    stat_smooth(method = lm, formula = as.formula(models[[3]]), size = 1, se = FALSE, color = "yellow") +
    stat_smooth(method = nls, formula = as.formula(models[[4]]), data = dat, method.args = list(start = list(a = 0, b = 0)), size = 1, se = FALSE, color = "red", linetype = 2) +
    stat_smooth(method = nls, formula = as.formula(models[[5]]), data = dat, method.args = list(start = setNames(coef(lm(y ~ log(x), data = dat)), c("a", "b"))), size = 1, se = FALSE, color = "green", linetype = 2) +
    stat_smooth(method = nls, formula = as.formula(models[[6]]), data = dat, method.args = list(start = list(a = 0, b = 0)), size = 1, se = FALSE, color = "violet") +
    stat_smooth(method = nls, formula = as.formula(models[[7]]), data = dat, method.args = list(start = list(a = 0, b = 0)), size = 1, se = FALSE, color = "orange", linetype = 2)
The orange curve looks pretty good. Let's see how it ranks when we measure the relative goodness of fit of these models...
# calculate the AIC and AICc (for small samples) for each
# model to see which one is best, ie has the lowest AIC
library(AICcmodavg); library(plyr); library(stringr)
ldply(models, function(mod) {
    data.frame(AICc = AICc(mod), AIC = AIC(mod), model = deparse(formula(mod)))
})
      AICc      AIC                     model
1 70.23024 46.23024                     y ~ x
2 44.37075 20.37075                y ~ I(1/x)
3 67.00075 43.00075                y ~ log(x)
4 43.82083 19.82083    y ~ I(1/x * a) + b * x
5 67.00075 43.00075      y ~ (a + b * log(x))
6 52.75748 28.75748 y ~ I(exp(1)^(a + b * x))
7 44.37075 20.37075        y ~ I(1/x * a) + b
# y ~ I(1/x * a) + b * x is the best model of those tried here for this curve
# it fits nicely on the plot and has the best goodness of fit statistic
# no doubt with a better understanding of nls and the data a better fitting
# function could be found. Perhaps the optimisation method here might be
# useful also: http://stats.stackexchange.com/a/21098/7744
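To reuse the winning model on its own, something like this works (a sketch built from the objects above):
best <- models[[4]]   # lowest AICc in the table above
plot(dat)
xx <- seq(min(dat$x), max(dat$x), length.out = 200)
lines(xx, predict(best, newdata = data.frame(x = xx)), col = "red", lty = 2)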
Second method: use genetic programming to search a vast amount of models
This seems to be a kind of wild shot in the dark approach to curve-fitting. You don't have to specify much at the start, though perhaps I'm doing it wrong...
# symbolic regression using Genetic Programming
# http://rsymbolic.org/projects/rgp/wiki/Symbolic_Regression
library(rgp)
# this will probably take some time and throw
# a lot of warnings...
result1 <- symbolicRegression(y ~ x,
                              data = dat, functionSet = mathFunctionSet,
                              stopCondition = makeStepsStopCondition(2000))
# inspect results, they'll be different every time...
(symbreg <- result1$population[[which.min(sapply(result1$population, result1$fitnessFunction))]])
function (x)
tan((x - x + tan(x)) * x)
# quite bizarre...
# inspect visual fit
ggplot() + geom_point(data = dat, aes(x, y), size = 3) +
    geom_line(data = data.frame(symbx = dat$x, symby = sapply(dat$x, symbreg)),
              aes(symbx, symby), colour = "red")
Actually a very poor visual fit. Perhaps there's a bit more effort required to get quality results from genetic programming...
Credits: Curve fitting answer 1, curve fitting answer 2 by G. Grothendieck.
Do you know some analytical function that the data should adhere to? If so, it could help you choose the form of the function to fit to the data.
Otherwise, since the data looks like exponential decay, try something like this in gnuplot, where a function with two free parameters is fitted to the data:
f(x) = exp(-x*c)*b
fit f(x) "data.dat" u 1:2 via b,c
plot "data.dat" w p, f(x)
Gnuplot will vary the parameters named in the 'via' clause for the best fit. Fit statistics are printed to stdout and also written to a file called 'fit.log' in the current working directory.
The c parameter determines the curvature (decay rate), while the b parameter scales all values linearly to match the magnitude of the data.
For more info, see the Curve fit section in the Gnuplot documentation.
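And since the first answer used R: the same two-parameter decay can be fitted there with nls() (a sketch reusing the dat frame read in above; the b = 1, c = 0 start mirrors the start that converged for model 6 earlier):
fit <- nls(y ~ b * exp(-c * x), data = dat, start = list(b = 1, c = 0))
coef(fit)
plot(dat); lines(dat$x, fitted(fit), col = "red")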
