Behaviour of dfmax in glmnet - r

(NB: This is a slightly modified version of a post I'd made on a different forum. I received no responses there, hence the post here. If this is not allowed, please let me know and I will take down the question.)
I am new to glmnet, so I do not yet understand fully what the various
parameters do. I am trying to build a multinomial classifier which restricts
the number of features used in the model. From reading the docs and some
answers on this forum, I understand dfmax is the way to do it. I
played around with it a bit; I have a couple of questions and would appreciate some help:
Setup
For a particular dataset, I want to restrict the number of features to 3;
the original data has 126 features. Here's what I run:
library(glmnet)
library(broom)  # provides tidy()
fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial', dfmax = 3)
d <- data.frame(tidy(fit))
This is the value of d:
My questions about the output:
I see multiple values of lambda in there; it looks like
glmnet tries to fit lambdas that get the number of terms close to
dfmax=3. So it's less like the LARS algorithm (where we
move stagewise by adding variables and can stop at an exact number of variables) and more about getting the
right lambdas for regularization that lead to the intended dfmax. Is
this right?
I'm guessing alpha plays a role in how close we can get
to dfmax. At alpha=1 we're doing lasso, so it's easier to
get close to dfmax than at alpha=0, where we're doing ridge.
Is this understanding correct?
A "neighborhood" of dfmax is the
best we can do, it would seem. Or am I missing a parameter that gets me
to the model with the exact dfmax (FYI: alpha=1 doesn't seem to get
me to the precise number of non zero terms either, at least on this
dataset).
In the first solution - step=1, there are no variables used. Does this mean the relative odds equal a constant?
What does pmax do?
Thanks in advance!

Related

Results different after latest JAGS update?

I am running Bayesian hierarchical models in R using R2jags. When I open code I used a month ago and run it on a dataset I used a month ago (verified by "date modified" in Windows Explorer), I get different results than I got a month ago. The only difference I can think of is that I got a new work computer in the last month and we installed JAGS 4.3.0. I was previously using 4.2.0.
Is it remotely possible to get different results just from updating my version of JAGS? I'm not posting code or results here because I don't need help troubleshooting it - everything is exactly the same.
Edit:
Convergence seems fine - Geweke diagnostics, autocorrelation plots, and trace plots look good. That hasn't changed.
I have a seed set both via set.seed() and jags.seed=. Is that enough? I've never had a problem replicating these types of results before.
As for how different the results are, the differences are large enough to cause a meaningful difference in the inference. I am assessing relationships between 30 chemical exposures and a health outcome among 336 humans. Here are two examples. Chemical B troubles me the most because of the credible interval shift. Chemical A is another example.
I also doubled the number of iterations from 50k to 100k which resulted in very minor/inconsequential differences.
Edit 2:
I posted at SourceForge asking about the different default RNGs between versions: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/52bfef7d17/
There are at least 3 possible reasons for you seeing a difference between results from these models:
1) One or both of your attempts to fit this model did not converge, and/or your effective sample size is so small that random sampling error is having a large impact on your inference. If you have already checked to ensure convergence and sufficient effective sample size (for both models) then you can rule this out.
2) You are seeing small differences in the posteriors due to the random sampling inherent to MCMC in otherwise converged results. If these differences are big enough to cause a meaningful difference in inference then your effective sample size is not high enough - so just run the models for longer and the difference should reduce. You can also set the random seed in JAGS using initial values for .RNG.seed and .RNG.name so that successive model runs are numerically identical. If you run the models for longer and this difference does not reduce (or if it is a large difference to begin with) then you can rule this out.
3) Your model contains a node for which the default sampling scheme changed between JAGS 4.2.0 and 4.3.0 - there were some changes to sampling schemes (and the order of precedence for assigning samplers to nodes) that could conceivably have changed your results (from memory I think this affected GLM particularly, but I can't remember exactly). However, although this may affect the probability of convergence, it should not substantially affect the posterior if the model does converge. It may be contributing to a numerical difference as explained for point (2) though.
I'd recommend first ensuring convergence of both models, and then (assuming they did both converge) looking at exactly how much of a difference you are seeing. If it looks like both models converged AND the difference is more than just random sampling variation, then please reply here and/or update your question (as that shouldn't happen ... i.e. we may need to look into the possibility of a bug in JAGS).
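As a rough sketch of the seeding mentioned in point (2), the initial values can be built as one list per chain. The chain count and seed values here are arbitrary choices for illustration, and "model.txt" and my_data in the comment are placeholders, not anything from your setup:

```r
# One inits list per chain. Each chain gets its own seed, so the chains differ
# from one another but every rerun starts from the same RNG state.
n_chains <- 3
inits <- lapply(seq_len(n_chains), function(chain) {
  list(.RNG.name = "base::Mersenne-Twister",  # one of the RNGs shipped with JAGS
       .RNG.seed = 1000 + chain)              # arbitrary illustrative seeds
})

# With rjags this would then be passed along the lines of:
# model <- rjags::jags.model("model.txt", data = my_data,
#                            inits = inits, n.chains = n_chains)
```

Any other initial values for model parameters would go into the same per-chain lists.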
Thanks,
Matt
--------- Edit following additional information added to the question --------
Based on what you have written, it does seem that the difference in inference exceeds what might be expected due to random variation, so there may be some kind of underlying issue here. In order to diagnose this further we would need a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example). This means that you would need to provide not only the model (or preferably a simplified model that still exhibits the problem) but also some data to which we can fit the model. If your data are too sensitive to share then this could be a fictitious dataset for which you also see a difference between JAGS 4.2.0 and JAGS 4.3.0.
The official help forum for JAGS is at https://sourceforge.net/p/mcmc-jags/discussion/610037/ - so you can certainly post there, although we would still need a minimal reproducible example to be able to do anything. If you do so, then please update both posts with a link to the other so that anyone reading either post knows about the cross-posting. You should also note that R2jags is not officially supported on the JAGS forums, so please provide the minimal reproducible example using plain rjags code (or runjags if you prefer) rather than using the R2jags wrapper.
To answer your question in the comments: in order to obtain information on the samplers used you can use rjags::list.samplers(), e.g.:
library(rjags)
# LINE is just a small example model built into rjags:
data(LINE)
LINE$recompile()
list.samplers(LINE)
# $`bugs::ConjugateGamma`
# [1] "tau"
# $`bugs::ConjugateNormal`
# [1] "alpha"
# $`bugs::ConjugateNormal`
# [1] "beta"

Saving Data in Dymos Changes Optimisation and Simulation Results

I had a similar issue as expressed in this question. I followed Rob Flack's answer but had issues. If anyone could help me out, I would appreciate it.
I used the code suggested in the answer but had an issue: It changed the simulation results. I added a line in the script for the min_time_climb example that goes like this:
phase.add_timeseries_output('aero.mach', units=None, shape=(1,), output_name = "recorded_mach")
I used the name "recorded_mach" so as to not override anything else Dymos may or may not have been recording. The issue is that the default Altitude (h) vs. time graph actually changed, both the discrete points and simulation curve. I ended up recording 4 variables with similar commands to what I have just shown and that somehow made the simulation track better with the discrete optimisation points on the graph. When I recorded another 4 variables on top of that, it made it track worse. I find this very strange because I don't see why recording the simulation should change its output.
Have you ever come across this? Any insight you could provide into the issue would be greatly appreciated.
Notes:
I have somewhat modified the example in order to fit a different situation (different thrust and fuel burn data, different lift and drag polars, different height and speed goals) before implementing the code described above. However, it was still working fine.
Without some kind of example to look at, I can only make an educated guess. So please take my answer with a grain of salt.
Some optimization problems have very ill-conditioned Jacobians and/or KKT matrices (which you as a user would not normally see, but which can be problematic nonetheless). There are many potential causes for this ill conditioning, but some common ones are very large derivatives (i.e. approaching infinity) or very large ranges in magnitude between different derivatives. Another common cause is the introduction of a saddle point, where you have infinite numbers of answers that are all equally good. Sometimes you can fix the problem with scaling, other times you need to re-work the problem formulation.
Ill conditioning has two bad effects on the optimizer. First, it makes it very hard for the numerics inside to compute the inverses which are needed to compute step sizes. It will get an answer, but that answer may be highly subject to numerical noise. Second, it may prevent certain approximations (like BFGS) from performing well in the first place.
In these cases, small changes in execution order or extra steps (e.g. case recording) can cause the optimizer to take a different path. If you're finding that the path ultimately leads one case to work and another to fail, then you might have a marginally stable problem where you got lucky one time and not the other.
Look carefully for anything singular-like in your Jacobian, such as rows or columns of zeros. A constraint that happens to be satisfied but still has a row of zeros is a problem that comes up in Dymos cases if you forget to add additional degrees of freedom when you add constraints. Saddle points also arise if you're not careful with your objective.
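To make the kind of check I mean concrete, here is a small sketch in R rather than in Dymos itself. The matrix is entirely made up (it is not taken from any Dymos problem); it only illustrates spotting zero rows and seeing how row scaling changes the condition number:

```r
# Hypothetical 3x3 Jacobian: row 2 is a constraint that no design variable
# influences (all zeros), and the other entries span many orders of magnitude.
J <- matrix(c(1e6, 2,    0,
              0,   0,    0,
              3,   1e-4, 5), nrow = 3, byrow = TRUE)

# Structurally zero rows/columns are a red flag for a singular problem
zero_rows <- which(rowSums(J != 0) == 0)
zero_cols <- which(colSums(J != 0) == 0)

# Condition number (largest / smallest singular value) of the non-zero rows
keep <- rowSums(J != 0) > 0
J_nz <- J[keep, , drop = FALSE]
sv_raw <- svd(J_nz)$d
cond_raw <- max(sv_raw) / min(sv_raw)

# Simple row scaling: divide each row by its largest absolute entry
J_scaled <- J_nz / apply(abs(J_nz), 1, max)
sv_scaled <- svd(J_scaled)$d
cond_scaled <- max(sv_scaled) / min(sv_scaled)

cat("zero rows:", zero_rows, "\n")
cat("condition number before scaling:", cond_raw, "\n")
cat("condition number after scaling: ", cond_scaled, "\n")
```

If scaling alone brings the condition number down by orders of magnitude, as it does for this toy matrix, that points toward a scaling fix rather than a reformulation.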

Customized Fisher exact test in R

Beginner's question ahead!
(after spending much time, could not find straightforward solution..)
After trying all relevant posts I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (whatever data, it doesn't really matter to me - mine is Rubin's children's TV workshop from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf). It has to simulate a completely randomized experiment.
My assumption is that RCT in this context means that all units have the same probability to be assigned to treatment(1/N). (of course, correct me if I'm wrong. thanks).
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging into R's fisher.test I see that I can specify x, y and many other parameters, but I'm unsure regarding the following:
What's the meaning of y? (The documentation's "a factor object; ignored if x is a matrix" is not informative as to the statistical meaning.)
How do I specify my null hypothesis (i.e. if I don't want to use 0)? I see that there is a class "htest" with null.value, but how can I use it in the function?
Regarding the number of simulations, my plan is to run everything through a loop - sounds expensive - any ideas how to write it better?
Thanks for helping - this is not an easy task I believe, hopefully will be useful for many people.
Cheers,
NB - Following explanations were found unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either (1) supply to the argument x a contingency table of counts (omitting anything for y), or (2) supply observations on individuals as two vectors that indicate the row and column categories (it doesn't matter which is which), with x and y each being a factor. [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert]
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation-categories become exchangeable is independence, but you can choose to make it one or two tailed (via the alternative argument)
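A small sketch of the two ways of supplying data and of the alternative argument; the counts are invented purely for illustration:

```r
# (1) A 2x2 contingency table of counts (made-up numbers)
tab <- matrix(c(3, 1,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("control", "treat"),
                              outcome = c("no", "yes")))
by_table <- fisher.test(tab)

# (2) The same data as one observation per individual, as two factors
group   <- factor(rep(c("control", "treat"), each = 4))
outcome <- factor(c("no", "no", "no", "yes",    # control
                    "no", "yes", "yes", "yes")) # treat
by_vectors <- fisher.test(group, outcome)

# One-sided version of the same test
one_sided <- fisher.test(tab, alternative = "greater")

# For 2x2 tables the hypothesised odds ratio can be changed via `or`
# (the default, reported in null.value, is 1, i.e. independence)
shifted_null <- fisher.test(tab, or = 2)

by_table$p.value
by_vectors$p.value
by_table$null.value
```

Both calls describe the same table, so they give the same p-value.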
I'm not sure I clearly understand the simulation aspect but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations. I use it roughly daily, sometimes many times.
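In case it helps, here is a rough sketch of how replicate could drive the simulation. It rests on my own reading of the task, which may not match what you were asked for: a randomization test of the difference in means under a sharp null of a constant additive effect tau0, with the number of treated units held fixed. The function and argument names are made up:

```r
randomization_test <- function(treated, control, tau0 = 0, n_sim = 1000) {
  # Under the sharp null Y_i(1) - Y_i(0) = tau0, every unit's control
  # outcome is known: subtract tau0 from the treated observations.
  y0  <- c(treated - tau0, control)
  n   <- length(y0)
  n_t <- length(treated)

  observed <- mean(treated) - mean(control)

  # Re-randomise the assignment n_sim times and recompute the statistic
  simulated <- replicate(n_sim, {
    idx <- sample.int(n, n_t)               # units assigned to treatment
    mean(y0[idx] + tau0) - mean(y0[-idx])   # imputed difference in means
  })

  # Two-sided p-value: how often the simulated statistic is at least as far
  # from tau0 as the observed one
  p_value <- mean(abs(simulated - tau0) >= abs(observed - tau0))
  list(observed = observed, p_value = p_value)
}

set.seed(1)
treated <- c(10, 11, 12, 13, 14)   # made-up outcomes
control <- c(1, 2, 3, 4, 5)
res <- randomization_test(treated, control, tau0 = 0, n_sim = 2000)
res
```

The four arguments correspond to the four items listed in the question.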

How can I do blind fitting on a list of x, y value pairs if I don't know the form of f(x) = y?

If I have a function f(x) = y that I don't know the form of, and if I have a long list of x and y value pairs (potentially thousands of them), is there a program/package/library that will generate potential forms of f(x)?
Obviously there's a lot of ambiguity to the possible forms of any f(x), so something that produces many non-trivial unique answers (in reduced terms) would be ideal, but something that could produce at least one answer would also be good.
If x and y are derived from observational data (i.e. experimental results), are there programs that can create approximate forms of f(x)? On the other hand, if you know beforehand that there is a completely deterministic relationship between x and y (as in the input and output of a pseudo random number generator), are there programs that can create exact forms of f(x)?
Soooo, I found the answer to my own question. Cornell has released a piece of software for doing exactly this kind of blind fitting called Eureqa. It has to be one of the most polished pieces of software that I've ever seen come out of an academic lab. It's seriously pretty nifty. Check it out:
It's even got turnkey integration with Amazon's ec2 clusters, so you can offload some of the heavy computational lifting from your local computer onto the cloud at the push of a button for a very reasonable fee.
I think that I'm going to have to learn more about GUI programming so that I can steal its interface.
(This is more of a numerical methods question.) If there is some kind of observable pattern (you can kinda see the function), then yes, there are several ways you can approximate the original function, but they'll be just that, approximations.
What you want to do is called interpolation. Two very simple (and not very good) methods are Newton's method and Lagrange's method of interpolation. They both work on the same principle but they are implemented differently (Lagrange's is iterative, Newton's is recursive, for one).
If there's not much going on between any two of your data points (ie, the actual function doesn't have any "bumps" whose "peaks" are not represented by one of your data points), then the spline method of interpolation is one of the best choices you can make. It's a bit harder to implement, but it produces nice results.
Edit: Sometimes, depending on your specific problem, these methods above might be overkill. Sometimes, you'll find that linear interpolation (where you just connect points with straight lines) is a perfectly good solution to your problem.
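As a quick illustration of spline versus linear interpolation, here is a sketch in R using made-up data sampled from sin(x); the point count and the evaluation point are arbitrary:

```r
# Nine samples of a smooth function (illustrative data only)
x <- seq(0, 2 * pi, length.out = 9)
y <- sin(x)

f_spline <- splinefun(x, y, method = "natural")  # cubic spline interpolant
f_linear <- approxfun(x, y)                      # straight lines between points

# Compare both interpolants with the true function between two data points
x_new <- pi / 3
err_spline <- abs(f_spline(x_new) - sin(x_new))
err_linear <- abs(f_linear(x_new) - sin(x_new))

cat("spline error:", err_spline, "\n")
cat("linear error:", err_linear, "\n")
```

Both interpolants pass through every data point; they differ only in what they do in between.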
It depends.
If you're using data acquired from the real world, then statistical regression techniques can provide you with some tools to evaluate the best fit; if you have several hypotheses for the form of the function, you can use statistical regression to discover the "best" fit, though you may need to be careful about over-fitting a curve -- sometimes the best fit (highest correlation) for a specific dataset completely fails to work for future observations.
If, on the other hand, the data was generated synthetically (say, you know it was generated by a polynomial), then you can use polynomial curve fitting methods that will give you the exact answer you need.
Yes, there are such things.
If you plot the values and see that there's some functional relationship that makes sense, you can use least squares fitting to calculate the parameter values that minimize the error.
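For instance, a least squares fit in R might look like the sketch below. The quadratic form and the simulated data are assumptions chosen purely for illustration:

```r
# Simulated noisy data from y = 2 + 0.5 * x^2 (an assumed, made-up relationship)
set.seed(42)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 0.1)

# Hypothesised form y = a + b * x^2; lm() finds a and b by least squares
fit <- lm(y ~ I(x^2))
est <- unname(coef(fit))

cat("estimated a:", est[1], "\n")
cat("estimated b:", est[2], "\n")
```

For forms that are not linear in their parameters, nls() plays the same role but needs starting values.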
If you don't know what the function should look like, you can use simple spline or interpolation schemes.
You can also use software to guess what the function should be. Maybe something like Maxima can help.
Wolfram Alpha can help you guess:
http://blog.wolframalpha.com/2011/05/17/plotting-functions-and-graphs-in-wolframalpha/
Polynomial Interpolation is the way to go if you have a totally random set
http://en.wikipedia.org/wiki/Polynomial_interpolation
If your set is nearly linear, then regression will give you a good approximation.
Creating exact form from the X's and Y's is mostly impossible.
Notice that what you are trying to achieve is at the heart of many machine learning algorithms, and therefore you might find what you are looking for in some specialized libraries.
A list of x/y values N items long can always be generated by a polynomial of degree at most N-1 (assuming no x values are the same). See this article for more details:
http://en.wikipedia.org/wiki/Polynomial_interpolation
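A small sketch of that idea in R, with four made-up points that happen to lie on y = x^3 + x + 1. It solves the Vandermonde system directly, which is fine for a handful of points but numerically fragile for long lists:

```r
# Four points with distinct x values (taken from y = x^3 + x + 1)
x <- c(0, 1, 2, 3)
y <- c(1, 3, 11, 31)

# Vandermonde matrix: column j holds x^(j-1), so V %*% coefs = y
V <- outer(x, 0:(length(x) - 1), `^`)
coefs <- solve(V, y)   # coefficients of 1, x, x^2, x^3

# Evaluate the interpolating polynomial at arbitrary points
poly_eval <- function(x_new) {
  drop(outer(x_new, 0:(length(coefs) - 1), `^`) %*% coefs)
}

coefs
poly_eval(4)
```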
Some lists may also match other function types, such as exponential, sinusoidal, and many others. It is impossible to find the 'simplest' matching function, but the best you can do is go through a list of common ones like exponential, sinusoidal, etc. and if none of them match, interpolate the polynomial.
I'm not aware of any software that can do this for you, though.

Fitting a binormal distribution in R

As from title, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normals with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors such that w1+w2 = 1.
I can do this using the vglm function of the VGAM package, like so:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE,
iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data in a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried to use nls or mle, which look much faster, but mostly failed to get a good fit (though I succeeded in getting all the possible errors these functions could throw at me). It is also not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1).
2) How do I automagically choose some good starting parameters (I know this is a $1 million question but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two means. Do you think that would be a reasonable thing to do?
First things first:
did you try to search for fit mixture model on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
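To give a flavour of what those searches turn up, below is a bare-bones EM sketch in base R for a two-component normal mixture. It is not what VGAM does internally; the function name, the quartile-based starting values and the simulated data are all assumptions made for illustration:

```r
fit_two_normals <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude automatic starting values (an assumption, not a recommendation)
  mu    <- unname(quantile(x, c(0.25, 0.75)))
  sigma <- rep(sd(x) / 2, 2)
  w     <- 0.5                      # weight of component 1; component 2 gets 1 - w
  loglik_old <- -Inf

  for (iter in seq_len(max_iter)) {
    # E-step: responsibility of component 1 for each observation
    d1 <- w * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - w) * dnorm(x, mu[2], sigma[2])
    loglik <- sum(log(d1 + d2))     # log-likelihood at the current parameters
    r <- d1 / (d1 + d2)

    # M-step: weighted updates; w stays in [0, 1] by construction
    w     <- mean(r)
    mu    <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
               sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))

    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(w = c(w, 1 - w), mu = mu, sigma = sigma, iterations = iter)
}

# Simulated data: 60% N(0, 1) and 40% N(5, 1.5)
set.seed(123)
x <- c(rnorm(600, mean = 0, sd = 1), rnorm(400, mean = 5, sd = 1.5))
fit <- fit_two_normals(x)
fit
```

Because each block would be fitted independently, the same function could be applied to the 30-50 blocks with lapply, optionally seeding each fit with the previous block's estimates.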

Resources