I am running Bayesian hierarchical modeling in R using R2jags. When I open code I used a month ago and run it on a dataset I used a month ago (verified by the "date modified" timestamps in Windows Explorer), I get different results than I got a month ago. The only difference I can think of is that I got a new work computer in the last month and we installed JAGS 4.3.0; I was previously using 4.2.0.
Is it remotely possible to get different results just from updating my version of JAGS? I'm not posting code or results here because I don't need help troubleshooting it - everything is exactly the same.
Edit:
Convergence seems fine - Geweke diagnostics, autocorrelation plots, and trace plots look good. That hasn't changed.
I have a seed set both via set.seed() and jags.seed=. Is that enough? I've never had a problem replicating these types of results before.
As for how different the results are, they are large enough to cause a meaningful difference in the inference. I am assessing relationships between 30 chemical exposures and a health outcome among 336 humans. Here are two examples. Chemical B troubles me the most because of the credible interval shift. Chemical A is another example.
I also doubled the number of iterations from 50k to 100k which resulted in very minor/inconsequential differences.
Edit 2:
I posted on SourceForge asking about the different default RNGs between versions: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/52bfef7d17/
There are at least 3 possible reasons for you seeing a difference between results from these models:
1) One or both of your attempts to fit this model did not converge, and/or your effective sample size is so small that random sampling error is having a large impact on your inference. If you have already checked to ensure convergence and sufficient effective sample size (for both models) then you can rule this out.
2) You are seeing small differences in the posteriors due to the random sampling inherent to MCMC in otherwise converged results. If these differences are big enough to cause a meaningful difference in inference then your effective sample size is not high enough - so just run the models for longer and the difference should reduce. You can also set the random seed in JAGS using initial values for .RNG.seed and .RNG.name so that successive model runs are numerically identical (a minimal example is sketched just after this list). If you run the models for longer and this difference does not reduce (or if it is a large difference to begin with) then you can rule this out.
3) Your model contains a node for which the default sampling scheme changed between JAGS 4.2.0 and 4.3.0 - there were some changes to sampling schemes (and the order of precedence for assigning samplers to nodes) that could conceivably have changed your results (from memory I think this affected GLM particularly, but I can't remember exactly). However, although this may affect the probability of convergence, it should not substantially affect the posterior if the model does converge. It may be contributing to a numerical difference as explained for point (2), though.
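To expand on point (2), here is a minimal, self-contained sketch of fixing the RNG via initial values in plain rjags. The toy model, data, and monitored parameters are placeholders; for your own model you would only add the same .RNG.name/.RNG.seed entries to whatever inits you already pass:
library(rjags)
# trivial placeholder model, just to show where the RNG settings go
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 0.001)
  tau ~ dgamma(0.01, 0.01)
}"
set.seed(123)
data_list <- list(y = rnorm(20), N = 20)
# one list per chain: fixing .RNG.name and .RNG.seed makes the chains reproducible
inits <- list(
  list(.RNG.name = "base::Mersenne-Twister", .RNG.seed = 1),
  list(.RNG.name = "base::Mersenne-Twister", .RNG.seed = 2)
)
model <- jags.model(textConnection(model_string), data = data_list, inits = inits, n.chains = 2)
samples <- coda.samples(model, variable.names = c("mu", "tau"), n.iter = 5000)
# re-running this block on the same JAGS version should reproduce the chains exactly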
I'd recommend first ensuring convergence of both models, and then (assuming they did both converge) looking at exactly how much of a difference you are seeing. If it looks like both models converged AND the difference is more than just random sampling variation, then please reply here and/or update your question (as that shouldn't happen ... i.e. we may need to look into the possibility of a bug in JAGS).
Thanks,
Matt
--------- Edit following additional information added to the question --------
Based on what you have written, it does seem that the difference in inference exceeds what might be expected due to random variation, so there may be some kind of underlying issue here. In order to diagnose this further we would need a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example). This means that you would need to provide not only the model (or preferably a simplified model that still exhibits the problem) but also some data to which we can fit the model. If your data are too sensitive to share then this could be a fictitious dataset for which you also see a difference between JAGS 4.2.0 and JAGS 4.3.0.
The official help forum for JAGS is at https://sourceforge.net/p/mcmc-jags/discussion/610037/ - so you can certainly post there, although we would still need a minimal reproducible example to be able to do anything. If you do so, then please update both posts with a link to the other so that anyone reading either post knows about the cross-posting. You should also note that R2jags is not officially supported on the JAGS forums, so please provide the minimal reproducible example using plain rjags code (or runjags if you prefer) rather than using the R2jags wrapper.
To answer your question in the comments: in order to obtain information on the samplers used, you can use rjags::list.samplers(), e.g.:
library(rjags)
# LINE is just a small example model built into rjags:
data(LINE)
LINE$recompile()
list.samplers(LINE)
# $`bugs::ConjugateGamma`
# [1] "tau"
# $`bugs::ConjugateNormal`
# [1] "alpha"
# $`bugs::ConjugateNormal`
# [1] "beta"
I want to train SVMs in R and I know there are functions such as e1071::tune.svm() that can be used to find the optimal parameters for the SVM. However, it seems there are some formulas out there (e.g. used in this report) that can give you a reasonable estimate of these parameters.
Since a grid search for the parameters can take quite a lot of time on larger datasets, and one usually has to provide a range of possible values anyway, I wondered whether there is a package that implements formulas to get a quick estimate of the gamma and cost parameters for the SVM?
So far, I've found out that caret::train() might use such an approach to estimate sigma (which should be the reciprocal of 2*gamma^2), but I haven't tried it yet, since other calculations are still running (and probably will be for the next few days). Is there also an implementation to estimate cost, or at least give a range of reasonable values?
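For what it's worth, one quick data-driven estimate of the RBF kernel width in R is kernlab::sigest(), which (as far as I know) is what caret uses to fix sigma for its svmRadial models. A minimal sketch on simulated data; the feature matrix is a placeholder, and the conversion to e1071's gamma depends on which parameterisation you assume, so treat the final comment as illustrative only:
library(kernlab)
set.seed(1)
x <- matrix(rnorm(200 * 10), ncol = 10)   # simulated numeric feature matrix (placeholder)
# low / median / high suggestions for kernlab's sigma, derived from quantiles
# of the squared distances between (scaled) data points
sig <- sigest(x, scaled = TRUE)
sig
# kernlab's rbfdot kernel is exp(-sigma * |x - y|^2), the same form as e1071's
# exp(-gamma * |u - v|^2), so under that reading the median, sig[2], is a
# reasonable starting value for gamma; cost still has to be tuned separately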
I have found a similar question that asks for alternatives to grid-search in general. However, I would be interested in an R implementation of such alternatives and also, I hope things have developed further since the more general question was posted years ago.
(NB: This is a slightly modified version of a post I made on a different forum. I received no responses there, hence the post here. If this is not allowed, please let me know and I will take down the question.)
I am new to glmnet, so I do not yet understand fully what the various parameters do. I am trying to build a multinomial classifier which restricts the number of features used in the model. From reading the docs and some answers on this forum, I understand dfmax is the way to do it. I played around with it a bit; I have a couple of questions and would appreciate some help:
Setup
For a particular dataset, I want to restrict the number of features to 3; the original data has 126 features. Here's what I run:
library(glmnet)
library(broom)   # for tidy()
fit <- glmnet(data.matrix(X), data.matrix(y), family = 'multinomial', dfmax = 3)
d <- data.frame(tidy(fit))
This is the value of d:
My questions about the output:
1. I see multiple values of lambda in there; it looks like glmnet tries to fit lambdas that get the number of terms close to dfmax=3. So it's less like the LARS algorithm (where we move stagewise by adding variables and can stop at an exact number of variables) and more about finding the right lambdas for regularization that lead to the intended dfmax. Is this right? (See also the sketch after these questions.)
2. I'm guessing alpha plays a role in how close we can get to dfmax. At alpha=1 we're doing lasso, so it's easier to get close to dfmax, compared to alpha=0 where we're doing ridge. Is this understanding correct?
3. A "neighborhood" of dfmax is the best we can do, it'd seem. Or am I missing a parameter that gets me to the model with the exact dfmax? (FYI: alpha=1 doesn't seem to get me to the precise number of non-zero terms either, at least on this dataset.)
4. In the first solution (step=1), there are no variables used. Does this mean the relative odds equal a constant?
5. What does pmax do?
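As a side note on question 1, the fitted object itself shows how close the path gets to the target: the df component of a glmnet fit records the number of non-zero variables at each lambda, so you can pick the largest lambda whose df does not exceed the limit. A minimal sketch on simulated data (the variable names and the target of 3 are just placeholders):
library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 20), nrow = 200)                   # simulated predictors
y <- factor(sample(c("a", "b", "c"), 200, replace = TRUE)) # simulated 3-class outcome
fit <- glmnet(X, y, family = "multinomial", dfmax = 3)
# one row per lambda on the path, with the number of non-zero variables used
data.frame(lambda = fit$lambda, df = fit$df)
# largest (most penalised) lambda that uses at most 3 variables
lambda_sel <- max(fit$lambda[fit$df <= 3])
coef(fit, s = lambda_sel)   # for multinomial fits: one sparse coefficient matrix per class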
Thanks in advance!
I am using the h2o package to train a GBM for a churn prediction problem.
All I want to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are comparable.
Naturally, the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.
Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
Any help is appreciated!
Thank you.
PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data. For GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
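For example, a minimal sketch of pulling that information out in R; the tiny GBM on the built-in iris data is just a stand-in for one of your own models, and the @model$model_summary slot access reflects my understanding of the current h2o R API:
library(h2o)
h2o.init()
# stand-in model; substitute your own fitted GBM (e.g. via h2o.getModel()) here
hf <- as.h2o(iris)
m <- h2o.gbm(x = 1:4, y = "Species", training_frame = hf, ntrees = 50, max_depth = 5)
print(m)                      # number of trees, depths, leaves, etc.
m@model$model_summary         # the same summary as a small table
path <- h2o.saveModel(m, path = tempdir(), force = TRUE)
file.size(path)               # the on-disk size you are comparing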
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split that apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (while the 2nd-best model was 0.01% worse but went to depth 20), but to depth 30 for your third data set (while the 2nd-best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)
I have recently run an ensemble classifier in mlr (R) on a multicenter data set. I noticed that the ensemble over three classifiers (that were trained on different data modalities) was worse than the best single classifier.
This was unexpected to me. I was using logistic regressions (without any parameter optimization) as simple base classifiers and a Partial Least Squares (PLS) Discriminant Analysis as the superlearner, since the base-learner predictions ought to be correlated. I also tested different superlearners such as Naive Bayes and logistic regression. The results did not change.
Here are my specific questions:
1) Do you know whether this can in principle occur?
(I also googled a bit and found this blog that seems to indicate that it can:
https://blogs.sas.com/content/sgf/2017/03/10/are-ensemble-classifiers-always-better-than-single-classifiers/)
2) Especially if you are as surprised as I was, do you know of any checks I could do in mlr to make sure that there isn't a bug? I have tried a different cross-validation scheme (originally I used leave-center-out CV, but since some centers provided very little data, I wasn't sure whether this might lead to weird model fits of the superlearner), but the effect still holds. I have also tried combining different data modalities, and they give me the same phenomenon.
I would be grateful to hear whether you have experienced this and, if not, whether you know what the problem could be.
Thanks in advance!
Yes, this can happen - ensembles do not always guarantee a better result. More details about the cases where this can happen are also discussed in this Cross Validated question.
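One sanity check along these lines is to benchmark the stacked learner against its base learners under the exact same resampling, so you can see whether the gap is systematic or within resampling noise. A minimal sketch with the mlr stacking API, using a built-in task and generic learners rather than your PLS-DA setup:
library(mlr)
base <- list(
  makeLearner("classif.logreg"),
  makeLearner("classif.naiveBayes")
)
stack <- makeStackedLearner(base.learners = base,
                            super.learner = "classif.logreg",
                            method = "stack.cv")
rdesc <- makeResampleDesc("CV", iters = 5)
# sonar.task is a small built-in binary task; substitute your own multicenter task
bmr <- benchmark(c(base, list(stack)), sonar.task, rdesc, measures = mmce)
bmr   # compare the ensemble's error against each base learner's error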
What I currently have:
I have a data frame with one column of factors called "Class" which contains 160 different classes. I have 1200 variables, each one being an integer and no individual cell exceeding the value of 1000 (if that helps). About 1/4 of the cells are the number zero. The total dataset contains 60,000 rows. I have already used the nearZeroVar function, and the findCorrelation function to get it down to this number of variables. In my particular dataset some individual variables may appear unimportant by themselves, but are likely to be predictive when combined with two other variables.
What I have tried:
First I tried just creating a random forest model, planning to use varImp to filter out the useless variables, but I gave up after letting it run for days. Then I tried using fscaret, but that ran overnight on an 8-core machine with 64 GB of RAM (same as the previous attempt) and didn't finish. Then I tried:
Feature selection using genetic algorithms. That ran overnight and didn't finish either. I was trying to make principal component analysis work, but for some reason couldn't. I have never been able to successfully do PCA within caret, which could be my problem and solution here. I can follow all the "toy" demo examples on the web, but I still think I am missing something in my case.
What I need:
I need some way to quickly reduce the dimensionality of my dataset so I can make it usable for creating a model. Maybe a good place to start would be an example of using PCA on a dataset like mine with caret. Of course, I'm happy to hear any other ideas that might get me out of the quicksand I am in right now.
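For concreteness, here is a minimal sketch of the kind of caret PCA preprocessing being asked about, on simulated data shaped loosely like the description above (the correlated predictors, all names, and the 95% variance threshold are placeholders):
library(caret)
set.seed(42)
# simulated stand-in for the real predictors: 2,000 rows, 200 correlated numeric columns
latent <- matrix(rnorm(2000 * 10), nrow = 2000)
X <- as.data.frame(latent %*% matrix(rnorm(10 * 200), nrow = 10) +
                   matrix(rnorm(2000 * 200, sd = 0.1), nrow = 2000))
# estimate centering, scaling and PCA on the predictors only,
# keeping enough components to explain 95% of the variance
pp <- preProcess(X, method = c("center", "scale", "pca"), thresh = 0.95)
X_reduced <- predict(pp, X)
dim(X_reduced)   # how many components are kept depends on how correlated the predictors are
# equivalently, preProcess = c("center", "scale", "pca") can be passed to train() directly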
I have done only some toy examples too.
Still, here are some ideas that do not fit into a comment.
All your attributes seem to be numeric. Maybe running the Naive Bayes algorithm on your dataset will give some reasonable classifications? Naive Bayes assumes all attributes are independent of each other, but experience shows (and many scholars say) that its results are often still useful despite the strong assumptions.
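For illustration, a minimal sketch of that suggestion with the e1071 implementation; the simulated data frame and its Class column are stand-ins for the real data described in the question:
library(e1071)
set.seed(1)
# small simulated stand-in: integer predictors plus a factor Class column
df <- as.data.frame(matrix(sample(0:1000, 2000 * 20, replace = TRUE), nrow = 2000))
df$Class <- factor(sample(paste0("class", 1:5), 2000, replace = TRUE))
train_idx <- sample(nrow(df), 0.8 * nrow(df))
nb_fit <- naiveBayes(Class ~ ., data = df[train_idx, ])
pred <- predict(nb_fit, df[-train_idx, ])
mean(pred == df$Class[-train_idx])   # rough hold-out accuracy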
If you absolutely MUST do attribute selection, e.g. as part of an assignment:
Did you try to process your dataset with the free GUI-based data-mining tool Weka? There is an "attribute selection" tab where you have several algorithms (or algorithm-combinations) for removing irrelevant attributes at your disposal. That is an art, and the results are not so easy to interpret, though.
Read this pdf as an introduction and see this video for a walk-through and an introduction to the theoretical approach.
The videos assume familiarity with Weka, but maybe they still help.
There is an RWeka interface but it's a bit laborious to install, so working with the Weka GUI might be easier.