In a question on Cross Validated (How to simulate censored data), I saw that the optim function was used as a kind of solver instead of as an optimizer. Here is an example:
optim(1, fn=function(scl){(pweibull(.88, shape=.5, scale=scl, lower.tail=F)-.15)^2})
# $par
# [1] 0.2445312
# ...
pweibull(.88, shape=.5, scale=0.2445312, lower.tail=F)
# [1] 0.1500135
I have found a tutorial on optim here, but I am still not able to figure out how to use optim to work as a solver. I have several questions:
What is the first parameter (i.e., the value 1 being passed in)?
What is the function that is passed in?
I can understand that it is taking the Weibull probability distribution and subtracting 0.15, but why are we squaring the result?
I believe you are referring to my answer. Let's walk through a few points:
The OP (of that question) wanted to generate (pseudo-)random data from a Weibull distribution with specified shape and scale parameters, and where the censoring would be applied for all data past a certain censoring time, and end up with a prespecified censoring rate. The problem is that once you have specified any three of those, the fourth is necessarily fixed. You cannot specify all four simultaneously unless you are very lucky and the values you specify happen to fit together perfectly. As it happened, the OP was not so lucky with the four preferred values—it was impossible to have all four as they were inconsistent. At that point, you can decide to specify any three and solve for the last. The code I presented were examples of how to do that.
As noted in the documentation for ?optim, the first argument is par "[i]nitial values for the parameters to be optimized over".
Very loosely, the way the optimization routine works is that it calculates an output value given a function and an input value. Then it 'looks around' to see if moving to a different input value would lead to a better output value. If that appears to be the case, it moves in that direction and starts the process again. (It stops when it does not appear that moving in either direction will yield a better output value.)
The point is that it has to start somewhere, and the user is obliged to specify that value. In each case, I started with the OP's preferred value (although really I could have started almost anywhere).
The function that I passed in is ?pweibull. It is the cumulative distribution function (CDF) of the Weibull distribution. It takes a quantile (X value) as its input and returns the proportion of the distribution that has been passed through up to that point. Because the OP wanted to censor the most extreme 15% of that distribution, I specified that pweibull return the proportion that had not yet been passed through instead (that is the lower.tail=F part). I then subtracted .15 from the result.
Thus, the ideal output (from my point of view) would be 0. However, it is possible to get values below zero by finding a scale parameter that makes the output of pweibull < .15. Since optim (or really most any optimizer) finds the input value that minimizes the output value, that is what it would have done. To keep that from happening, I squared the difference. That means that when the optimizer went 'too far' and found a scale parameter that yielded an output of .05 from pweibull, and the difference was -.10 (i.e., < 0), the squaring makes the ultimate output +.01 (i.e., > 0, or worse). This would push the optimizer back towards the scale parameter that makes pweibull output (.15-.15)^2 = 0.
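To make this concrete, here is the same calculation written out with the squared-difference objective, plus, purely for comparison and not part of my original answer, uniroot used as an explicit root-finder on the unsquared difference (the bracketing interval below is an illustrative guess):
# The squared-difference objective that optim minimizes
obj <- function(scl) (pweibull(.88, shape=.5, scale=scl, lower.tail=FALSE) - .15)^2
optim(1, fn=obj)$par                         # ~0.2445, as above
# Rephrased for a root-finder: find scl where the survival probability equals .15.
# No squaring is needed here, because uniroot looks for a sign change, not a minimum.
f <- function(scl) pweibull(.88, shape=.5, scale=scl, lower.tail=FALSE) - .15
uniroot(f, interval=c(0.001, 10))$root       # illustrative bracketing interval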
In general, the distinction you are making between an "optimizer" and a "solver" is opaque to me. They seem like two different views of the same elephant.
Another possible confusion here involves optimization vs. regression. Optimization is simply about finding an input value[s] that minimizes (maximizes) the output of a function. In regression, we conceptualize data as draws from a data generating process that is a stochastic function. Given a set of realized values and a functional form, we use optimization techniques to estimate the parameters of the function, thus extracting the data generating process from noisy instances. Part of regression analyses partakes of optimization then, but other aspects of regression are less concerned with optimization and optimization itself is much larger than regression. For example, the functions optimized in my answer to the other question are deterministic, and there were no "data" being analyzed.
The autocorrelation command acf is slightly off the correct value. The command I am giving is acf(X, lag.max = 1) and the output is 0.881, while the same autocorrelation calculation done using the command cor(X[1:41], X[2:42]) gives the value 0.9452. I searched a lot to understand whether there is some syntax I am missing or whether the two computations have some fundamental difference, but could not find suitable resources. Please help.
The equation for autocorrelation used by acf() and in all time series textbooks is
$$r_k = \frac{\sum_{t=k+1}^{T}(y_t-\bar{y})(y_{t-k}-\bar{y})}{\sum_{t=1}^{T}(y_t-\bar{y})^2},$$
where $T$ is the length of the series and $\bar{y}$ is the mean of all $T$ observations.
See https://otexts.com/fpp2/autocorrelation.html for example.
The equation for regular correlation, as computed by cor(X[1:41], X[2:42]), is slightly different:
$$r_k^* = \frac{\sum_{t=k+1}^{T}(y_t-\bar{y}_{(2)})(y_{t-k}-\bar{y}_{(1)})}{\sqrt{\sum_{t=k+1}^{T}(y_t-\bar{y}_{(2)})^2}\,\sqrt{\sum_{t=1}^{T-k}(y_t-\bar{y}_{(1)})^2}},$$
where $\bar{y}_{(1)}$ is the mean of the first $T-k$ observations and $\bar{y}_{(2)}$ is the mean of the last $T-k$ observations.
The simpler formula is better because it uses all available data in the denominator.
Note also that you might expect the top formula to include the multiplier $T/(T-k)$, given the different number of terms in the two summations. It is omitted so that the corresponding autocorrelation matrix is guaranteed to be nonnegative definite.
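To see the two computations side by side, here is a small R check (the series below is simulated, since the original data X are not shown):
set.seed(1)
X <- arima.sim(model = list(ar = 0.9), n = 42)   # stand-in for the original series
n <- length(X); k <- 1
ybar <- mean(X)
# acf-style autocorrelation: one common mean, full-series sum of squares in the denominator
r_acf <- sum((X[(k+1):n] - ybar) * (X[1:(n-k)] - ybar)) / sum((X - ybar)^2)
r_acf
acf(X, lag.max = 1, plot = FALSE)$acf[2]         # matches r_acf
# 'regular' correlation of the two overlapping subseries: separate means and variances
cor(X[1:(n-k)], X[(k+1):n])                      # generally differs, as in the question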
I am new to the gamlss function in R, and when I run my code I always get this error: Response Variable out of range
After looking into the data frame, I realized the issue was that one of the response values was 0.0000.
I was wondering if someone could explain to me why 0 is out of range, and suggest possible ways to work around it (e.g., replacing the values)?
The LOGNO family corresponds to the log-normal distribution, which is defined for strictly positive values only.
Possible solutions (which depend heavily on the context) include:
use another distribution that better models the response variable and allows zero values;
note that sometimes zero values are reported when they fall below the limit of detection (LOD).
In that case you have a censored data set, and you may want to look for a review of methods for handling it. A pragmatic approach is to substitute the zeros with values like LOD/2 (reviewed, for example, here), although it may result in very biased estimates.
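A minimal sketch of both options, using simulated data (the ZAGA family and the LOD value here are purely illustrative, not recommendations for your data):
library(gamlss)
set.seed(1)
# simulated positive response with a few exact zeros
y <- rgamma(200, shape = 2, rate = 1)
y[sample(200, 10)] <- 0
# gamlss(y ~ 1, family = LOGNO) would fail here: LOGNO requires y > 0
# Option 1: a zero-adjusted distribution that allows exact zeros
m1 <- gamlss(y ~ 1, family = ZAGA)
# Option 2 (pragmatic, potentially biased): substitute zeros with LOD/2 and keep LOGNO
lod <- min(y[y > 0])                 # pretend the smallest positive value is the LOD
y2  <- ifelse(y == 0, lod/2, y)
m2 <- gamlss(y2 ~ 1, family = LOGNO)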
I am using R (RStudio) to construct an index/synthetic indicator to evaluate, say, commercial efficiency. I am using the PCA() command from the FactoMineR package with 7 distinct variables. I have previously created similar indexes by calculating the weight of each particular variable on the first component (which can be obtained through PCA()$var$coord[,1]), with no problems, since each variable had a positive weight. However, there is one particular variable whose weight has an undesired sign: negative. The variable is ‘delivery speed’, and this sign would imply that the greater the speed, the less efficient the process. So what is going on? How would you amend this issue, preferably still using PCA?
The sign of variable weights shouldn't matter in PCA. Since, taken together, the components represent the original data perfectly (when p < n), it is natural that some components will contain both positive and negative weights. That doesn't mean that that particular variable has an undesired weight, only that for that particular extracted signal (say, the first principal component) the variable's weight is negative.
For a better understanding, let's take the classical 2-dimensional example, which I took from this very useful discussion. Can you see from the graph there that one of the weights will necessarily be negative for the 2nd principal component?
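For a quick numerical version of that 2-dimensional picture (simulated data, not your index variables):
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)     # two positively correlated variables
p  <- prcomp(cbind(x1, x2), scale. = TRUE)
p$rotation
# PC1 typically gets two loadings of the same sign (the shared direction);
# PC2 must then mix signs, because its loading vector is orthogonal to PC1's.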
Finally, if that variable does actually disturb your analysis, one possible solution would be to apply Sparse PCA. Under cross-validated regularization, that method can set some of the weights exactly to zero. If in your case the negative weight is not substantial, it might be reduced to zero under SPCA.
Beginner's question ahead!
(After spending much time, I could not find a straightforward solution.)
After going through all the relevant posts I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (whatever data, it doesn't really matter to me; mine is Rubin's children's TV workshop data from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf). It has to simulate a completely randomized experiment.
My assumption is that RCT in this context means that all units have the same probability of being assigned to treatment (1/N). (Of course, correct me if I'm wrong. Thanks.)
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging into R's fisher.test I see that I can specify x, y and many other parameters, but I'm unsure regarding the following:
What's the meaning of y? (The documentation's description, "a factor object; ignored if x is a matrix", is not informative as to its statistical meaning.)
How do I specify my null hypothesis (i.e., if I don't want to use 0)? I see that there is a class "htest" with null.value, but how can I use it in the function?
Regarding the number of simulations, my plan is to run everything through a loop, which sounds expensive. Any ideas on how to write it better?
Thanks for helping - I believe this is not an easy task; hopefully it will be useful for many people.
Cheers,
NB - The following explanations were found unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either supply a contingency table of counts to the argument x (omitting y altogether), or supply the observations on individuals as two factor vectors, x and y, indicating the row and column categories (it doesn't matter which is which). [I'm not sure why it doesn't also let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert.]
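A small sketch of the two input forms (made-up counts, only to illustrate the calls):
tab <- matrix(c(8, 2, 3, 7), nrow = 2)                       # contingency table of counts
fisher.test(tab)
grp <- factor(rep(c("treatment", "control"), c(10, 10)))     # one value per individual
out <- factor(c(rep(c("yes", "no"), c(8, 2)), rep(c("yes", "no"), c(3, 7))))
fisher.test(grp, out)                                        # same test as fisher.test(tab)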
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation categories become exchangeable is independence, but you can choose to make the test one- or two-tailed (via the alternative argument).
I'm not sure I clearly understand the simulation aspect, but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations; I use it roughly daily, sometimes many times a day.
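As an illustration of replicate (a sketch only, with made-up data; whether this randomization test is the right analysis for your assignment is a separate question):
# Randomization distribution of the difference in means under the sharp null of no effect
treat   <- c(5, 7, 6, 9)                 # made-up treatment observations
control <- c(4, 3, 5, 6)                 # made-up control observations
obs_diff <- mean(treat) - mean(control)
all_y    <- c(treat, control)
n_treat  <- length(treat)
sim_diffs <- replicate(10000, {
  idx <- sample(length(all_y), n_treat)  # re-randomize the treatment assignment
  mean(all_y[idx]) - mean(all_y[-idx])
})
mean(abs(sim_diffs) >= abs(obs_diff))    # two-sided randomization p-value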
My goal is to estimate two parameters of a model (see CE_hat).
I use 7 observations to fit two parameters, (w, a), so overfitting occurs at times. One idea would be to restrict the influence of each observation so that outliers do not "hijack" the parameter estimates.
The method previously suggested to me was nlrob. The problem with that, however, is that in extreme cases such as the example below it returns "Missing value or an infinity produced when evaluating the model".
To avoid this I used nlsLM, which works towards convergence at the cost of returning outlandish estimates.
Any ideas as to how I can use robust fitting with this example?
I include below a reproducible example. The observables here are CE, H and L. These three elements are fed into a function (CE_hat) in order to estimate "a" and "w". Values close to 1 for "a" and close to 0.5 for "w" are generally considered more reasonable. As you can hopefully see, when all observations are included, a = 91 while w is next to 0. However, if we exclude the 4th (or 7th) observation (for CE, H and L), we get much more sensible estimates; a sketch of that exclusion follows the code below. Ideally, I would like to achieve the same result without excluding these observations. One idea would be to restrict their influence. I understand that it might not be clear why these observations constitute some sort of "outliers". It's hard to say anything about that without saying too much, I'm afraid, but I am happy to go into more detail about the model should a question arise.
library("minpack.lm")
options("scipen"=50)
CE<-c(3.34375,6.6875,7.21875,13.375,14.03125,14.6875,12.03125)
H<-c(4,8,12,16,16,16,16)
L<-c(0,0,0,0,4,8,12)
CE_hat<-function(w,H,a,L){(w*(H^a-L^a)+L^a)^(1/a)}
aw<-nlsLM(CE~CE_hat(w,H,a,L),
start=list(w=0.5,a=1),
control = nls.lm.control(nprint=1,maxiter=100))
summary(aw)$parameters
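To illustrate the exclusion mentioned in the text above (this is just the subsetting described there, not a robust-fitting solution):
# Refit with the 4th observation excluded from CE, H and L
keep <- setdiff(seq_along(CE), 4)
aw_sub <- nlsLM(CE[keep] ~ CE_hat(w, H[keep], a, L[keep]),
                start   = list(w = 0.5, a = 1),
                control = nls.lm.control(nprint = 1, maxiter = 100))
summary(aw_sub)$parameters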