How to use robust fitting of nonlinear regression models with nlsLM? - r

My goal is to estimate two parameters of a model (see CE_hat).
I use 7 observations to fit two parameters: (w,a), so overfitting occurs a few times. One idea would be to restrict the influence of each observation so that outliers do not "hijack" the parameter estimates.
The method previously suggested to me was nlrob. The problem is that in extreme cases, such as the example below, it fails with the error "Missing value or an infinity produced when evaluating the model".
To avoid this I used nlsLM, which does converge, but at the cost of returning outlandish estimates.
Any ideas as to how I can use robust fitting with this example?
I include a reproducible example below. The observables here are CE, H and L. These three elements are fed into a function (CE_hat) in order to estimate "a" and "w". Values close to 1 for "a" and close to 0.5 for "w" are generally considered more reasonable. As you can hopefully see, when all observations are included, a is about 91 while w is next to 0. However, if we exclude the 4th (or 7th) observation (for CE, H and L), we get much more sensible estimates. Ideally, I would like to achieve the same result without excluding these observations. One idea would be to restrict their influence. I understand that it may not be clear why these observations constitute some sort of "outliers". It is hard to explain that without saying too much, I am afraid, but I am happy to go into more detail about the model should a question arise.
library("minpack.lm")
options("scipen"=50)
CE<-c(3.34375,6.6875,7.21875,13.375,14.03125,14.6875,12.03125)
H<-c(4,8,12,16,16,16,16)
L<-c(0,0,0,0,4,8,12)
CE_hat<-function(w,H,a,L){(w*(H^a-L^a)+L^a)^(1/a)}
aw<-nlsLM(CE~CE_hat(w,H,a,L),
start=list(w=0.5,a=1),
control = nls.lm.control(nprint=1,maxiter=100))
summary(aw)$parameters
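One way to sketch the "restrict the influence of each observation" idea, without relying on nlrob, is to minimize a Huber loss instead of the sum of squares. The block below is only an illustration under assumptions: the helper names `huber` and `objective`, the threshold `k = 1`, and the choice of base R `optim` with Nelder-Mead are all hypothetical choices, not something taken from the question. Whether the resulting estimates are "sensible" for this model is not claimed here.

```r
# Data and model copied from the question.
CE <- c(3.34375, 6.6875, 7.21875, 13.375, 14.03125, 14.6875, 12.03125)
H  <- c(4, 8, 12, 16, 16, 16, 16)
L  <- c(0, 0, 0, 0, 4, 8, 12)
CE_hat <- function(w, H, a, L) (w * (H^a - L^a) + L^a)^(1 / a)

# Huber loss: quadratic for small residuals, linear for large ones.
# The threshold k is a hypothetical tuning choice, not a recommendation.
huber <- function(r, k = 1) {
  ifelse(abs(r) <= k, 0.5 * r^2, k * (abs(r) - 0.5 * k))
}

# Objective in par = (w, a). Non-finite model values are mapped to a large
# penalty so the search can step away from regions where CE_hat fails.
objective <- function(par) {
  pred <- CE_hat(par[1], H, par[2], L)
  loss <- sum(huber(CE - pred))
  if (is.finite(loss)) loss else 1e10
}

# Same starting values as the nlsLM call in the question.
fit <- optim(c(w = 0.5, a = 1), objective, method = "Nelder-Mead")
fit$par
```

Because large residuals enter the loss linearly rather than quadratically, a single extreme observation pulls on the estimates less than it does under least squares. The threshold `k` controls where that switch happens and would need to be chosen relative to the scale of the residuals.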

Related

Response variable out of range -> using gamlss in r LOGNO

I am new to the function GAMLSS in r, and when I run my code I always get this error: Response Variable out of range
After looking into the data frame, I realized the issue was one of response variables was 0.0000.
I was wondering if someone could explain to me why 0 is out of range, and suggest possible ways around it (e.g., replacing the values)?
LOGNO family corresponds to the log-normal distribution, which is defined for positive values only.
Possible solutions (which depend heavily on the context):
use another distribution that models the response variable better and allows zero values
sometimes zero values are reported when the true values are below the limit of detection (LOD). In this case, one has a censored data set, and you may want to look for a review of methods for handling it. A pragmatic approach is to substitute zeros with values like LOD/2, reviewed, for example, here. However, it may result in a very biased estimation.
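As a small base R illustration of why a zero is out of range for a log-normal response, and of what the LOD/2 substitution looks like in practice. The response vector and the detection limit `lod` are made up for the example; this does not call gamlss itself.

```r
# Made-up response vector with two exact zeros (illustrative data only).
y <- c(0, 0.8, 1.5, 2.3, 0, 4.1)

# The log-normal model works on log(y), and log(0) is -Inf.
log_y <- log(y)

# Equivalently, the log-normal log-density at 0 is -Inf.
ld0 <- dlnorm(0, meanlog = 0, sdlog = 1, log = TRUE)

# Pragmatic LOD/2 substitution; `lod` is a hypothetical detection limit.
lod <- 0.1
y_sub <- ifelse(y == 0, lod / 2, y)
all(y_sub > 0)
```

After the substitution every value is strictly positive, so a log-normal family can be fitted, with the bias caveat mentioned above.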

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. I'm going crazy because it is something I cannot solve with my background alone. I'm a biologist.
Briefly, I ran LASSO using the R library "penalized". In particular, I used the opt1D function with around 500 simulations on a (numerical) data.frame with around 30 columns, which are the biomarkers (gene expression) I want to test, and 3000 rows, which are people, of which around 50 are tumours and all the others are normals.
Unfortunately, with L1 regularization, all (and I really mean all) coefficients of the 500 simulations are 0. If I check the L2 matrix of coefficients, they are close to 0. My point is that I cannot believe that none of my biomarkers is able to distinguish between normals and tumours.
I don't know if what I have done is all I can do to check the discriminatory potential of my molecules. Is there something else I can do to understand why they are all 0, and is there something else I can do to verify that they really are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note I would first run PCA/PCoA and see whether or not your genes separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also if you have relatively little experience with R I would suggest using a linear modeling package such as Limma since it has excellent documentation and many examples that are easy to follow.
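The two checks suggested above (an unpenalized fit first, then PCA) can be sketched in base R as follows. The data are simulated stand-ins, the column names `gene1` ... `gene30` are hypothetical, and the signal is planted in `gene1` purely for illustration; nothing here uses the "penalized" package.

```r
set.seed(42)
n <- 300
p <- 30

# Simulated expression matrix; column names are placeholders.
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Hypothetical signal: only gene1 shifts the tumour probability.
status <- rbinom(n, 1, plogis(-2 + 1.5 * X[, "gene1"]))

# Unpenalized logistic regression as a baseline before any L1/L2 penalty.
fit <- glm(status ~ X, family = binomial)
head(sort(abs(coef(fit)[-1]), decreasing = TRUE), 3)

# PCA on scaled features; compare component scores between the classes
# (plot(pca$x[, 1:2], col = status + 1) gives the visual version).
pca <- prcomp(X, scale. = TRUE)
tapply(pca$x[, 1], status, mean)
```

If the unpenalized coefficients are all indistinguishable from zero as well, the penalty is not what is removing them.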

Understand Regression results

I have a set of numerical features that describe a phenomenon at different time points. In order to evaluate the individual performance of each feature, I perform a linear regression with a leave one out validation, and I compute the correlations and errors to evaluate the results.
So for a single feature, it would be something like:
Input: Feature F = {F_t1, F_t2, ... F_tn}
Input: Phenomenon P = {P_t1, P_t2, ... P_tn}
Linear Regression of P according to F, plus leave one out.
Evaluation: Compute correlations (linear and spearman) and errors (mean absolute and root mean squared)
For some of the variables, both correlations are really good (> 0.9), but when I look at the predictions, I realize that they are all really close to the average (of the values to predict), so the errors are big.
How is that possible?
Is there a way to fix it?
As a technical detail, I use the Weka linear regression with the option "-S 1" in order to disable feature selection.
It seems to be because the problem we want to regress is not linear and we use a linear approach. It is then possible to have good correlations and poor errors. It does not mean that the regression is wrong or really poor, but you have to be really careful and investigate further.
Anyway, a nonlinear approach that minimizes the errors and maximizes the correlation is the way to go.
Moreover, outliers also make this problem occur.
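To see how high correlations and large errors can coexist, note that correlation ignores shifts and rescaling of the predictions. The toy numbers below are made up and are not a diagnosis of the Weka setup: the predictions are the true values shrunk to a tenth of their spread around the mean.

```r
# Made-up target values.
y <- seq(0, 100, by = 10)

# Predictions compressed towards the mean of y.
pred <- mean(y) + 0.1 * (y - mean(y))

pearson  <- cor(y, pred)
spearman <- cor(y, pred, method = "spearman")
mae  <- mean(abs(y - pred))
rmse <- sqrt(mean((y - pred)^2))

c(pearson = pearson, spearman = spearman, mae = mae, rmse = rmse)
range(pred)
```

Both correlations are perfect because the ordering and the linear relationship are preserved, yet every prediction sits between 45 and 55 while the targets span 0 to 100, so the errors are large.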

How can optimization be used as a solver?

In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is the first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, where censoring would be applied to all data past a certain censoring time, and end up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values; it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented showed examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that it has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started almost anywhere).
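The "look around, move if it is better, otherwise stop" description can be sketched as a toy one-dimensional search. This is not how optim is implemented (its default is Nelder-Mead); `local_search`, its `step` and `tol` arguments, and the guard for non-positive scales are all hypothetical choices made for the illustration.

```r
# Squared-difference objective from the question; Inf for non-positive scale
# so the search never evaluates pweibull outside its domain.
f <- function(scl) {
  if (scl <= 0) return(Inf)
  (pweibull(0.88, shape = 0.5, scale = scl, lower.tail = FALSE) - 0.15)^2
}

local_search <- function(fn, x, step = 0.5, tol = 1e-8) {
  while (step > tol) {
    candidates <- c(x - step, x + step)
    values <- vapply(candidates, fn, numeric(1))
    if (min(values) < fn(x)) {
      x <- candidates[which.min(values)]  # move to the better neighbour
    } else {
      step <- step / 2                    # nothing better nearby: look closer
    }
  }
  x
}

# Start at 1, the same starting value passed to optim in the question.
scl_hat <- local_search(f, x = 1)
scl_hat
```

Starting from a different positive value changes the path the search takes, but for this objective it ends up at essentially the same scale.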
The function that I passed in wraps ?pweibull, the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted .15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, and the difference was -.10 (i.e., < 0), the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This would push the optimizer back towards the scale parameter that makes pweibull output (.15-.15)^2 = 0.
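For comparison, here is a sketch of the same problem handled as a root-finding problem on the signed difference, next to minimizing the squared difference. The bracketing interval `c(0.01, 10)` is an assumption chosen so that the difference changes sign inside it; the closed form at the end applies only to this particular Weibull case.

```r
# Signed difference: zero exactly when 15% of the distribution lies past 0.88.
g <- function(scl) {
  pweibull(0.88, shape = 0.5, scale = scl, lower.tail = FALSE) - 0.15
}

# Root finding on the signed difference (needs a sign change in the interval).
root <- uniroot(g, interval = c(0.01, 10), tol = 1e-10)$root

# Minimizing the squared difference lands on the same scale.
opt <- optimize(function(scl) g(scl)^2, interval = c(0.01, 10))$minimum

# Closed form for this case: scale = x / (-log p)^(1 / shape).
exact <- 0.88 / (-log(0.15))^(1 / 0.5)

c(root = root, optimize = opt, exact = exact)
```

The root finder works with the sign of the difference directly, so no squaring is needed; the minimizer needs the square (or an absolute value) so that overshooting is penalized rather than rewarded.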
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.

Fitting a binormal distribution in R

As the title says, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normal distributions with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors such that w1+w2 = 1.
I can do this using the vglm function of the VGAM package, like so:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE,
iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data into a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried nls and mle, which look much faster, but I mostly failed to get a good fit (though I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1).
2) How do I automagically choose good starting parameters (I know this is a $1 million question, but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two means. Do you think that would be a reasonable thing to do?
First things first:
did you try to search for fit mixture model on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
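On the constraint part of question 1, one common trick is to optimize over unconstrained parameters and transform them: a logistic transform keeps w in (0, 1) with w2 = 1 - w, and an exponential keeps the standard deviations positive. The sketch below is a minimal base R illustration under assumptions: the data are simulated stand-ins for `mydata`, `negloglik` is a hypothetical helper, and the starting values use the 1st and 3rd quartiles of the data, which is a guess at a reasonable default rather than the quartiles mentioned in the question.

```r
set.seed(1)
# Simulated stand-in for `mydata`: 30% N(0, 1) and 70% N(5, 1).
n <- 2000
from_first <- rbinom(n, 1, 0.3) == 1
mydata <- ifelse(from_first, rnorm(n, 0, 1), rnorm(n, 5, 1))

# Negative log-likelihood in unconstrained parameters theta:
# w = plogis(theta[1]) stays in (0, 1) and w2 = 1 - w;
# the sds are exp(theta[4]) and exp(theta[5]), so they stay positive.
negloglik <- function(theta, x) {
  w  <- plogis(theta[1])
  s1 <- exp(theta[4])
  s2 <- exp(theta[5])
  dens <- w * dnorm(x, theta[2], s1) + (1 - w) * dnorm(x, theta[3], s2)
  nll <- -sum(log(dens))
  if (is.finite(nll)) nll else 1e10  # large penalty where the density underflows
}

# Crude automatic starts: quartiles for the means, half the overall sd for both sds.
start <- c(qlogis(0.5),
           quantile(mydata, c(0.25, 0.75), names = FALSE),
           rep(log(sd(mydata) / 2), 2))

fit <- optim(start, negloglik, x = mydata, method = "BFGS")

# Back-transform to the original parameters.
est <- c(w  = plogis(fit$par[1]),
         m1 = fit$par[2], m2 = fit$par[3],
         s1 = exp(fit$par[4]), s2 = exp(fit$par[5]))
est
```

When the data are split into blocks over time, the estimates from one block could serve as starting values for the next, which may reduce the number of iterations needed per fit.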
