Fitting Exponential Distribution to Task Duration Counts - r

In my dataset, I have ants that switch between one state (in this case a resting state) and all other states over a period of time. I am attempting to fit an exponential distribution to the number of times an ant rests for a given duration (for instance, the ant may rest for 5 seconds 10 times, or for 6 seconds 5 times, etc.). While this distribution of durations looks exponential, I can't fit a single-parameter exponential distribution (where the one parameter is the rate) to the data. Is this possible with my dataset, or do I need to use a two-parameter exponential distribution?
I am attempting to fit the data to the following equation (where lambda is rate):
lambda * exp(-lambda * x).
This, however, doesn't seem to be mathematically possible to fit to either the counts of my data or the probability density of my data. In R I attempt to fit the data with the following code:
fit = nls(newdata$x.counts ~ (b*exp(b*newdata$x.mids)), start =
list(x.counts = 1, x.mids = 1, b = 1))
When I do this, though, I get the following message:
Error in parse(text= x, keep.source = FALSE):
<text>:2:0: unexpected end of input
1: ~
^
I believe I am getting this because it's mathematically impossible to fit this particular equation to my data. Am I correct, or is there a way to transform the data or alter the equation so I can make it fit? I can also make it fit with the equation lambda * exp(mu * x), where mu is another free parameter, but my goal is to keep the equation as simple as possible, so I would prefer the one-parameter version.
Here is the data, as I can't seem to find a way to attach it as a csv:
https://docs.google.com/spreadsheets/d/1euqdgHfHoDmQKXHrtOLcn5x5o81zY1sr9Kq6NCbisYE/edit?usp=sharing

First, you have a typo in your formula: you forgot the minus sign in
(b*exp(b*newdata$x.mids))
But this is not what is throwing the error. The start argument should be a list that initializes only the parameter values, not x.counts or x.mids.
So the correct version would be:
fit = nls(newdata$x.counts ~ b*exp(-b*newdata$x.mids), start = list(b = 1))
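For what it's worth, here is a minimal, self-contained sketch (with simulated data standing in for your spreadsheet, so object names like durations are hypothetical) showing the corrected one-parameter fit. Note that it fits the empirical density rather than the raw counts: for equal-width bins, counts = N * binwidth * density, so if you fit lambda * exp(-lambda * x) directly to the counts, the single rate parameter also has to absorb that overall scale, which is why the two-parameter form appears to fit better.
set.seed(1)
durations <- rexp(500, rate = 0.2)   # stand-in for the observed resting durations
h <- hist(durations, breaks = 20, plot = FALSE)
newdata <- data.frame(x.mids = h$mids, x.density = h$density)
# One-parameter fit of b * exp(-b * x) to the empirical density
fit <- nls(x.density ~ b * exp(-b * x.mids), data = newdata, start = list(b = 1))
coef(fit)
# For comparison, the maximum-likelihood estimate of the rate from the raw
# durations is simply 1/mean(durations)
1 / mean(durations)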

Related

Would nonidentifiability create an inconsistent response from optim in R?

My objective is to use a kinetic model to describe reaction data. The application is for a fuel, and the model is widely accepted as one of the more accurate ones given the setup of my problem. I may have a nonidentifiability issue, but it bothers me that optim gives such an inconsistent response.
Take the two graphs: in the first, I have picked that point based on its low squared error. The second is what optim selected (I don't have enough rep for picture 2; I will try to post it in a comment; hint: it's not close to lining up). When I ran the numbers that optim gave me, they did not match the expected response. I wish I could paste the exact values, but the optimization takes more than two hours each run, so I have been tuning it as much as I can with the time I can get. I can say that R is settling on the boundaries. The bounds are set to physical limits one can obtain from the pure compound at room temperature (i.e. the molarity at room temperature). I can be flexible, but not too much, as the point of the project was to limit the model parameters to observed physical parameters.
This is all to prep it for an MCMC to add Bayesian elements. If my first guess is junk, so is the whole project.
To sum up, I would like to know why the errors are inconsistent and, if they come from nonidentifiability, whether improving the initial guess can fix that or whether I need to remove a variable.
Code for reference.
Objective function
init = function(param){
  # Solve for displacement of triglycerides
  T.mcmc2 = T.hat.isf
  T.mcmc2 = T.mcmc2 - min(T.mcmc2)
  A.mcmc2 = T.mcmc2
  A.mcmc2[1] <- (6*1.02*.200)/.250
  B.mcmc2 = T.mcmc2
  primes <- Bprime(x.fine1, param, T.mcmc2, A.mcmc2, B.mcmc2)
  B.mcmc2 <- as.numeric(unlist(primes[1]))
  A.mcmc2 <- as.numeric(unlist(primes[2]))
  res = t(B.obs - B.mcmc2[x.points]) %*% (B.obs - B.mcmc2[x.points])
  # print(res)
  return(res)
}
Optimization with parameters
l = c(1e-8,1e-8, 1e-8, 1e-8)
u = c(2,1.2,24,24)
th0=c(.052, 0.19, .50, 8)
op = optim(th0[1:3], init, method="L-BFGS-B", lower=l, upper = u)
Once run, this message often appears: "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH".
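One diagnostic worth trying (my suggestion, not part of the original post; it assumes init, l and u are defined as in the snippets above, along with the data objects init depends on) is to restart optim from several random points inside the bounds. If different starts settle on very different parameters with similar objective values, that points to nonidentifiability rather than a poor single initial guess. Note also that the call above passes a 3-element starting vector (th0[1:3]) with 4-element bounds, which is worth double-checking.
set.seed(42)
n_starts <- 5
starts <- replicate(n_starts, runif(3, min = l[1:3], max = u[1:3]))
runs <- apply(starts, 2, function(p0) {
  op <- optim(p0, init, method = "L-BFGS-B", lower = l[1:3], upper = u[1:3])
  c(op$par, value = op$value, convergence = op$convergence)
})
t(runs)  # similar objective values with very different parameters suggest nonidentifiability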

Fixing a coefficient on variable in MNL [duplicate]

In R, how can I set weights for particular variables, rather than for observations, in the lm() function?
The context is as follows. I'm trying to build a personal ranking system for particular products, say, phones. I can build a linear model with price as the dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict a phone's real cost (as opposed to its declared price), and thus find the best price/goodness ratio. This is what I have already done.
Now I want to "highlight" some features that are important to me only. For example, I may need a phone with a lot of memory, so I want to give that variable a higher weight so that the linear model is optimized for it.
The lm() function in R has a weights parameter, but these are weights for observations, not for variables (correct me if this is wrong). I also tried to play around with the formula, but only got interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() is not the only option. If you know how to do it with other similar tools (e.g. glm()), that is fine too.
UPD. After a few comments I understood that the way I was thinking about the problem was wrong. A linear model obtained by a call to lm() gives the optimal coefficients for the training examples, and there's no way (and no need) to change the weights of variables; sorry for the confusion. What I'm actually looking for is a way to change the coefficients of an existing linear model so as to manually make some parameters more important than others. Continuing the previous example, let's say we've got the following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes best possible linear model for dependence between price and phone parameters. However, now I want to manually change number 30 in front of memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula no longer reflects the optimal relationship between price and phone parameters. Also, the dependent variable no longer shows the actual price, just some value of goodness, taking into account that memory is twice as important to me as to the average person (based on the coefficients from the first formula). But this value of goodness (or, more precisely, the ratio goodness/price) is just what I need: with it I can find the best (in my opinion) phone at the best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set the coefficients of an existing linear model obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work, of course, but you should get the idea. Note: it is obviously possible to just double the values in the memory column of the data frame, but I'm looking for a more elegant solution that affects the model, not the data.
The following code is a bit complicated because lm() minimizes the residual sum of squares, and with a fixed, non-optimal coefficient it is no longer minimal; that would go against what lm() is trying to do, so the only way is to fix all the remaining coefficients too.
To do that, we first have to know the coefficients of the unrestricted model. All the adjustments have to be made by changing the formula of your model; e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Now, neither changing the data directly nor using I(c*memory) is a good idea. I(c*memory) is also just a temporary change of the data, and changing only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we have not modified the intercept, which would now try to minimize the residual sum of squares and could become different from the one in the original model. The final step is to remove the intercept and to add a new, fake variable, i.e. one that has the same number of observations as the other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
  el <- paste0("offset(", weights[-1], "*",
               unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
  el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
  as.formula(paste(as.character(frml)[2], "~",
                   paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, you are probably going to use mod2 only for forecasting (actually, I don't know where else it could be used now), so that can be done in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(5, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
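As a side note (my own observation, not part of the original answer), rowSums(t(t(mat) * weights)) is just the matrix product of the model matrix with the coefficient vector written element-wise, so an equivalent and perhaps clearer form is:
drop(mat %*% weights)  # identical forecasts via a direct matrix product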
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function, or look into linear or quadratic programming (the linprog and quadprog packages).
If you insist on using modeling tools like lm, then use the offset argument in the formula to specify your own multiplier rather than computing one.
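A minimal sketch of that offset approach, on hypothetical toy data (the phones data frame and the value 60 are made up for illustration): the memory coefficient is held fixed while the intercept and the screen_size coefficient are still estimated by lm.
phones <- data.frame(price = c(300, 450, 520, 610),
                     memory = c(2, 4, 8, 16),
                     screen_size = c(4.0, 4.7, 5.1, 5.5))
mod <- lm(price ~ offset(60 * memory) + screen_size, data = phones)
coef(mod)     # only the intercept and screen_size are estimated; memory is fixed at 60
predict(mod)  # fitted "goodness" values that include the forced memory term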

Automatically find the scaling factor of the x-axis using LsqFit (or other method)?

I have the following data: a vector B and a vector R. The vector B is the "independent" variable. For this pair, I have two data sets: one is an experimental measurement, Bex, Rex, and the other is a simulation produced by me, Bsim, Rsim. The simulation does not have any "scale" for the x-axis (the B vector). Therefore, when I try to fit my curve to the experiment, I have to find a scaling parameter B0 "by eye", multiply the entire Bsim vector by this number B0, and simply plot(B0*Bsim, Rsim, Bex, Rex).
I wanted to use the package LsqFit to make the procedure automatic and more accurate. However I am having trouble in understanding how I could use it to find the scaling on the independent variable.
My first thought was to just "invert" the roles of B and R. However, there are two issues that I think make matters worse: 1) the R curve/data is not monotonic, and 2) the experimental data are much more "dense" (they have more data points: my simulation has 120 points in total, the experiments have some thousands).
Below I give an example of what I am trying to accomplish (of course, the answer need not use LsqFit). I also attach two figures that demonstrate everything very clearly.
#= stuff happened before this point =#
Bsim, Rsim = load(simulation)
Bex, Rex = load(experiment)
#this is what I want to do:
some_model(x, p) = ???
fit = curve_fit(some_model, Bex, Rex, [3.5])
B0 = fit.param[1]
#this is what I currently do by trial and error:
B0 = 3.85
plot(B0*Bsim, Rsim, Bex, Rex)
P.S.: The R curves (dependent variables) are both normalized by their maximum value because their scaling is not important.
A simple approach, if you can always expect both your experiment and simulation to feature one high peak, and you're sure that there's only a scaling factor rather than also an offset, is to simply multiply your Bsim vector by mode_rex / mode_rsim (e.g. in your example, mode_rsim = 1 and mode_rex = 4, so multiply Bsim by 4). But I'm sure you've thought of this already.
For a more general approach, one way is as follows:
1. Add and load the Interpolations package.
2. Create a grid to interpolate over, e.g. Grid = 0:0.01:Bex[end]
3. Interpolate Rex over that grid, e.g.
   RexInterp = interpolate( (Bex,), Rex, Gridded(Linear()));
   RexGridVec = RexInterp[Grid];
4. Interpolate Rsim over the same grid, but introduce your multiplier on the Bsim "knots", e.g.
   Multiplier = 0.1;
   RsimInterp = interpolate( (Multiplier * Bsim,), Rsim, Gridded(Linear()));
   RsimGridVec = RsimInterp[Grid]
5. Now you can calculate a squared-error value between RsimGridVec and RexGridVec, e.g.
   SqErr = sum((RsimGridVec - RexGridVec).^2)
If you follow this technique, then by looping over a range of multipliers (say 0:0.01:10) and recording the squared error for each, you can find the multiplier for which the squared error is minimal.
In theory, if you also wanted to find the optimal offset, you could make that the outer loop over a range of offsets. Mind you, this is a brute-force approach, but it should be reasonably efficient judging by the vectors in your graph.

Linear Dependence errors in DiceKriging with linearly independent data

The Problem (with minimal working example)
DiceKriging gives linear dependence errors about half the time when the data are close to being linearly dependent. This can be seen from the example below, which gave an error about half the time when I ran it on both an Ubuntu and a Windows computer. It occurs whether I run it with genetic or BFGS optimisation.
install.packages("DiceKriging")
library(DiceKriging)
x = data.frame(Param1 = c(0,1,2,2,2,2), Param2 = c(2,0,1,1e-7,0,2))
y = 1:6
duplicated(cbind(x,y))
Model = km( design = x, response = y , optim.method = "gen", control = list(maxit = 1e4), coef.cov = 1)
Model = km( design = x, response = y , optim.method = "BFGS", control = list(maxit = 1e4), coef.cov = 1)
When the data is a little more dispersed, no such errors occur.
# No problems occur if the data is more dispersed.
x = data.frame(Param1 = c(0,1,2,2,2,2), Param2 = c(2,0,1,1e-2,0,2))
y = 1:6
duplicated(cbind(x,y))
Model = km( design = x, response = y , optim.method = "gen", control = list(maxit = 1e4), coef.cov = 1)
Why this is a problem
Using Kriging for optimization of expensive models means that points near the optima will be densely sampled, and it is not possible to do this with this error occurring. In addition, when there are multiple parameters that are all close, the spacing at which this error appears can be larger than the 1e-7 above: I got errors (in my actual problem, not the MWE above) when the 4 coordinates of a point were each around 1e-3 away from another point.
Related Questions
There are not many DiceKriging questions on Stack Overflow. The closest question is this one (from the Kriging package), in which the problem is genuine linear dependence. Note that the Kriging package is not a substitute for DiceKriging, as it is restricted to 2 dimensions.
Desired Solution
I would like either:
A way to change my km call to avoid this problem (preferred)
A way to determine when this problem will occur so that I can drop observations that are too close to each other for the kriging call.
Your problem is not a software problem. It's rather a mathematical one.
Your first data set contains the two points (2, 1e-7) and (2, 0), which are very close to each other but correspond to (very) different outputs: 4 and 5. Therefore, you are trying to fit a Kriging model, which is an interpolation model, to a chaotic response. This cannot work. Kriging/Gaussian process modelling is not the right tool if your response varies a lot between points that are close together.
However, when you are optimizing expensive models, things are not like in your example: there is not such a difference in the response for very close input points.
But there could indeed be numerical problems if your points are very close.
In order to soften these numerical errors, you can add a nugget effect. The nugget is a small constant variance added to the diagonal of the covariance matrix, which allows the points not to be interpolated exactly. Your kriging approximation curve is then not forced to pass exactly through the learning points, and the kriging model becomes a regression model.
In DiceKriging, the nugget can be added in two ways. First, you can choose a value a priori and add it "manually" by setting km(..., nugget=your_value), or you can ask km to learn it at the same time as it learns the parameters of the covariance function, by setting km(..., nugget.estim=TRUE). I advise using the second in general.
Your small example then becomes:
Model = km( design = x, response = y , optim.method = "BFGS", control = list(maxit = 1e4),
coef.cov = 1, nugget.estim=TRUE)
Extract from the DiceKriging documentation:
nugget: an optional variance value standing for the homogeneous nugget
effect.
nugget.estim: an optional boolean indicating whether the nugget
effect should be estimated. Note that this option does not concern the
case of heterogeneous noisy observations (see noise.var below). If
nugget is given, it is used as an initial value. Default is FALSE.
PS: The choice of the covariance kernel can be important in some applications. When the function to approximate is quite rough, the exponential kernel (set covtype="exp" in km) can be a good choice. See the book by Rasmussen and Williams for more information (freely and legally available at http://www.gaussianprocess.org/)
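For completeness, a hedged one-liner combining both suggestions (nugget estimation plus the rougher exponential kernel) on the toy data from the question; whether "exp" actually helps depends on your real response surface:
Model = km( design = x, response = y , covtype = "exp", nugget.estim = TRUE)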

wrapnls: Error: singular gradient matrix at initial parameter estimates

I have created a loop to fit a non-linear model to six data points per participant (each participant has 6 data points). The first model is a one-parameter model; here is the code for that model, which works great. The time variable is defined, the participant variable is the id variable, and the data are in long form (one row for each data point of each participant).
Here is the loop code with 1 parameter that works:
1_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {wrapnls(indiff ~ 1/(1+k*time), data = discounting_long, start = c(k=0))})
However, when I try to fit a two-parameter model, I get the error "Error: singular gradient matrix at initial parameter estimates" while still using the wrapnls function. I realize that the model is likely over-parameterized; that is why I am trying to use wrapnls instead of just nls (or nlsList). Some in my field insist on seeing both model fits, and I thought that wrapnls avoids the problem of 0 or near-0 residuals. Here is my code that does not work. The start values and limits are standard in the field for this model.
2_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {nlxb(indiff ~ 1/(1+k*time^s), data = discounting_long, lower = c (s = 0), start = c(k=0, s=.99), upper = c(s=1))})
I realize that I could use nlxb (which does give me the correct parameter values for each participant), but that function does not give predicted values or residuals for each data point (at least I don't think it does), which I would like in order to compute AIC values.
I am also open to other solutions for running a loop through the data by participants.
You mention at the end that 'nlxb doesn't give you residuals', but it does. If the result of your call to nlxb is called fit, then the residuals are in fit$resid, so you can get the fitted values just by adding them to the original data. Honestly, I don't know why nlxb hasn't been made to work with the predict() function, but at least there's a way to get the predicted values.
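Building on that, here is a hedged sketch of how you could recover fitted values and an RSS-based AIC from an nlxb fit for one participant. It assumes nlxb comes from the nlmrt package (the same package as wrapnls), that discounting_long, indiff, time and Participant exist as in your code, and that fit$resid holds the residuals as described above; the sign convention of resid varies, so check the fitted values against your data (the RSS, and hence the AIC, does not depend on the sign). The full-length bounds are my assumption, not your original partial ones.
library(nlmrt)
dat <- subset(discounting_long, Participant == unique(Participant)[1])
fit2 <- nlxb(indiff ~ 1/(1 + k*time^s), data = dat,
             start = c(k = 0, s = 0.99),
             lower = c(k = -Inf, s = 0), upper = c(k = Inf, s = 1))
rss <- sum(fit2$resid^2)                   # sign-independent
n <- length(fit2$resid)
p <- 2                                     # parameters: k and s
aic <- n * log(rss / n) + 2 * (p + 1)      # Gaussian-likelihood AIC (+1 for sigma)
fitted_vals <- dat$indiff + fit2$resid     # as suggested above; flip the sign if resid = data - model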
