Is it possible to misuse JAGS as a tool for generating data from a model with known parameters? I need to sample data points from a predefined model in order to do a simulation study and test the power of a model I have developed in R.
Unfortunately, the model is somewhat tricky (a hierarchical structure with AR and VAR components) and I was not able to simulate the data directly in R.
While searching the internet, I found a blog post where the data were generated in JAGS using the data{} block. In that post, the author then estimated the model directly in JAGS. Since I have my model in R, I would like to transfer the data back to R without a model{} block. Is this possible?
Best,
win
There is no particular reason that you need to use the data block for generating data in this way - the model block can just as easily work in 'reverse' to generate data based on fixed parameters. Just specify the parameters as 'data' to JAGS, and monitor the simulated data points (and run for as many iterations as you need datasets - which might only be 1!).
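For example, a minimal sketch of that model-block approach (the Poisson rate lambda here is just an illustrative fixed parameter passed in as data, not anything from your actual model):
library('runjags')
# A model block used purely for simulation: lambda enters as fixed 'data',
# and the simulated points (not parameters) are monitored
txtstring2 <- '
model{
  for(i in 1:N){
    Simulated[i] ~ dpois(lambda)
  }
}
#monitor# Simulated
#data# N, lambda
'
N <- 10
lambda <- 2.5
# sample = 1 gives one simulated dataset; increase sample for more realisations
coda::as.mcmc(run.jags(txtstring2, sample = 1, n.chains = 1, summarise = FALSE))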
Having said that, in principle you can simulate data using either the data or model blocks (or a combination of both), but you need to have a model block (even if it is a simple and unrelated model) for JAGS to run. For example, the following uses the data block to simulate some data:
txtstring <- '
data{
  # simulate N Poisson draws, the i-th with mean i
  for(i in 1:N){
    Simulated[i] ~ dpois(i)
  }
}
model{
  # dummy model block: JAGS requires one, even though it does nothing useful here
  fake <- 0
}
#monitor# Simulated
#data# N
'
library('runjags')
N <- 10
Simulated <- coda::as.mcmc(run.jags(txtstring, sample = 1, n.chains = 1, summarise = FALSE))
Simulated
The only real difference is that the data block is updated only once (at the start of the simulation), whereas the model block is updated at each iteration. In this case we only take 1 sample so it doesn't matter, but if you wanted to generate multiple realisations of your simulated data within the same JAGS run you would have to put the code in the model block. [There might also be other differences between data and model blocks but I can't think of any offhand].
Note that you will get the data back out of JAGS in a different format (a single vector with names giving the indices of any arrays within the monitored data), so some legwork might be required to get that back to a list of vectors / arrays / whatever in R. Edit: unless R2jags provides some utility for this - I'm not sure as I don't use that package.
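As a rough sketch of that legwork for the example above (assuming the usual 'Simulated[1]', 'Simulated[2]', ... column naming):
# Convert the single monitored row back into an ordinary numeric vector
draws <- as.matrix(Simulated)   # 1 row, N columns named "Simulated[1]", ...
sim_vector <- as.numeric(draws[1, paste0("Simulated[", 1:N, "]")])
sim_vector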
Using a model block to run a single MCMC chain that simulates multiple datasets would be problematic because MCMC samples are typically correlated (each subsequent sample is drawn using the previous one). For a simulation study you want independent draws from your distribution, so the way to go would be to run the data or model block repeatedly, e.g. in a for loop, which ensures that your samples are independent.
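A rough sketch of that loop, reusing the txtstring model defined above and calling run.jags once per dataset (depending on your runjags settings you may also want to supply distinct .RNG.seed values via inits so each call starts from a different JAGS seed):
# 100 independent simulated datasets: one short JAGS run per dataset
datasets <- lapply(1:100, function(i) {
  out <- run.jags(txtstring, sample = 1, n.chains = 1, summarise = FALSE)
  draws <- as.matrix(coda::as.mcmc(out))
  as.numeric(draws[1, paste0("Simulated[", 1:N, "]")])
})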
I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive given future climate change scenarios. Against better judgement (like using Maxent), I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out. I'd like to run it roughly 100 times with different pseudo-absence data, see which variables it returns each time, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends with 'print(summary(initial.model))', and I would like to be able to access the output the way step() returns a list, where you can then say 'step$coefficients' and have the coefficients returned as numerics. Can anyone help me with this?
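To make the goal concrete, this is roughly the loop I have in mind, written here with step() purely as a stand-in because it returns a model object I can query (the data frame and variable names occurrence, temp, precip and elev are simulated placeholders, not my real tree data):
# Toy stand-in: in the real workflow 'dat' would be rebuilt each iteration from the
# presence records plus a fresh dismo::randomPoints() draw of pseudo-absences
set.seed(1)
selected <- vector("list", 100)
for (i in 1:100) {
  n <- 200
  dat <- data.frame(occurrence = rbinom(n, 1, 0.5),
                    temp = rnorm(n), precip = rnorm(n), elev = rnorm(n))
  full <- glm(occurrence ~ ., data = dat, family = binomial)
  fit <- step(full, direction = "both", trace = 0)
  selected[[i]] <- setdiff(names(coef(fit)), "(Intercept)")
}
# how often each variable survived selection across the 100 runs
sort(table(unlist(selected)), decreasing = TRUE)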
I'm looking for algorithms to create bins of variables in order to reduce the noise.
I have found several libraries for that; one is the chi2 function in the discretization package:
https://www.rdocumentation.org/packages/discretization/versions/1.0-1/topics/chi2
The documentation has the following example:
data(iris)
#---cut-points
chi2(iris,0.5,0.05)$cutp
#--discretized dataset using Chi2 algorithm
chi2(iris,0.5,0.05)$Disc.data
This works for this data, but if I train a model after transforming the data, then in order to make predictions on new records I will have to use the same cuts that were used here. My question is: is there any method or library that stores the cuts of the bins in a way that can easily be applied to new data, similarly to a predict method, without any custom function?
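To illustrate, this is roughly what I can do by hand with the stored cut points (a sketch using findInterval; I am assuming each element of $cutp is a numeric vector of cut points, and the exact boundary convention may differ slightly from what Disc.data uses), but I am hoping for something packaged like a predict method:
library(discretization)
data(iris)
fit <- chi2(iris, 0.5, 0.05)
cuts <- fit$cutp   # list of cut points, one element per predictor column
# re-apply the stored cuts to "new" records (here just two iris rows for illustration);
# check str(cuts) first, since columns left with no cut points may need special handling
new_data <- iris[1:2, 1:4]
binned <- mapply(function(x, cp) findInterval(x, sort(cp)) + 1, new_data, cuts)
binned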
I have a data set that I would like to take stratified samples from, build statistical models on using the caret package, and then generate predictions.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
What I want to be able to do is:
Generate the stratified data sample
Create the machine learning model
Repeat 1000 times & take the average model output
I hope that by repeating these steps on different variations of the stratified data set, I can average out the changes in the predictions that are due to the smaller data sample.
For example, it may look something like this in R:
Original.Dataset = data.frame(A)
Stratified.Dataset = stratified(Original.Dataset, group = x)
Model = train(Stratified.Dataset.....other model inputs)
Repeat process with new stratified data set based on the original data and average out.
Thank you in advance for any help, or package suggestions that might be useful. Is it possible to stratify the sample in caret or simulate in caret?
First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad.
If you need input on the statistics, I would suggest asking a more clearly defined question on Cross Validated, the Q&A site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. This depends on how large your different groups are. E.g. if you are trying to divide a data set consisting of 1000 samples into groups of 10 samples, your model could very likely be unstable and hence give different results in each iteration. This could also be because your model depends on some randomness, and the smaller your data set is (and the more groups you have), the larger the variation will be. See here or here for more information on cross-validation, stability and bootstrap aggregating.
Generate the stratified data sample
How to generate it: the dplyr package is excellent at grouping data depending on different variables. You might also want to use the split function found in the base package; see here for more information. You could also use the built-in methods in the caret package, found here (a small sketch follows below).
How to decide how to split it: it very much depends on the question you would like to answer; most likely you would like to even out some variables, e.g. gender and age, when creating a model for predicting disease. See here for more info.
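For instance, a minimal sketch of caret's built-in stratified splitting (createDataPartition samples within each level of the outcome, so class proportions are preserved; iris is used here just as a placeholder data set):
library(caret)
data(iris)
set.seed(123)
# 80% of each Species level goes into the training set
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[idx, ]
testing <- iris[-idx, ]
table(training$Species)   # roughly equal counts per class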
In the case of having e.g. duplicated observations, where you want to create unique subsets with different combinations of the replicates and their unique measurements, you would have to use other methods. If the replicates have a common identifier (here sample_names), you could do something like this to select all samples but with different combinations of the replicates:
# toy data: 5 samples, each measured twice (two replicates per sample_name)
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)
# 100 partitions; each picks one of the two replicates per sample at random
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})
# the first partition of your data, e.g. to train a model
tg[partition[[1]], ]
Create the machine learning model
If you want to use caret, you could go to the caret webpage and see all the available models. Depending on your research question and/or data you would like to use different types of models. Therefore, I would recommend you take an online machine learning course, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search through the available models.
Repeat 1000 times & take the average model output
You can repeat your model fitting 1000 times with different seeds (see set.seed) and/or use different training methods, e.g. cross-validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:
The function trainControl generates parameters that further control how models are created, with possible values:
method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
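As a small illustration of those options (the model choice, data set and tuning values here are arbitrary placeholders; caret will pull in rpart for the model and e1071 for the Kappa statistic):
library(caret)
data(iris)
# repeated 10-fold cross-validation: 10 folds, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = ctrl, tuneLength = 5)
fit$results   # accuracy averaged over all folds and repeats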
I use the R package adabag to fit boosted trees to a (large) data set (140 observations with 3,845 predictors).
I executed this method twice with the same parameters and the same data set, and each time a different accuracy value was returned (I defined a simple function which computes the accuracy, given below).
Did I make a mistake, or is it usual that each fit returns a different accuracy value? Is this related to the fact that the data set is large?
This is the function, which returns the absolute and relative accuracy given the predicted values and the true test-set values:
err <- function(pred_d, test_d)
{
  abs.acc <- sum(pred_d == test_d)      # number of correct predictions
  rel.acc <- abs.acc / length(test_d)   # proportion of correct predictions
  v <- c(abs.acc, rel.acc)
  return(v)
}
New edit (9.1.2017):
An important follow-up question in the above context:
As far as I can see, I do not use any "pseudo-randomness objects" (such as generated random numbers) in my code, because I essentially just fit trees (using the R package rpart) and boosted trees (using the R package adabag) to a large data set. Can you explain to me where the pseudo-randomness enters when I execute my code?
Edit 1: A similar phenomenon also happens with single trees (using the R package rpart).
Edit 2: The phenomenon did not occur with trees (using rpart) on the iris data set.
There's no reason you should expect to get the same results if you didn't set your seed (with set.seed()).
It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code.
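For example, a sketch with adabag showing that the same seed reproduces the same fit (the iris split and mfinal value are placeholders, not tuned choices; with the default boos = TRUE, boosting draws a bootstrap sample of the training data at each iteration, which is where the pseudo-randomness enters):
library(adabag)
data(iris)
train_idx <- sample(1:150, 100)   # this draw itself also depends on the RNG state
set.seed(2017)
fit1 <- boosting(Species ~ ., data = iris[train_idx, ], mfinal = 10)
set.seed(2017)
fit2 <- boosting(Species ~ ., data = iris[train_idx, ], mfinal = 10)
pred1 <- predict(fit1, newdata = iris[-train_idx, ])
pred2 <- predict(fit2, newdata = iris[-train_idx, ])
identical(pred1$class, pred2$class)   # TRUE: same seed, same bootstrap samples, same model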
This is ubiquitous in statistics; it affects all probabilistic models and processes across all languages.
Note that in the case of information security it's important to have a (pseudo) random seed which cannot be easily guessed by brute force attacks, because (in a nutshell) knowing a seed value used internally by a security program paves the way for it to be hacked. In science and statistics it's the opposite - you and anyone you share your code/research with should be aware of the seed to ensure reproducibility.
https://en.wikipedia.org/wiki/Random_seed
http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values
I'm trying to use knn in R (I have used several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to have difficulty with a sample of 10k lines. Any suggestions for doing knn on a dataset of more than 50 lines (i.e. bigger than iris)?
EDIT:
To clarify, there are a couple of issues.
1) The examples in the class and knnflex packages are a bit unclear and I was curious if there was some implementation similar to the randomForest package where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I'm not really understanding why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix of roughly nrow(trainingData)^2 entries, which also seems to be an upper limit on the size of the data to be predicted. I created a model using 5,000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5,000 rows. Thus I would need to either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8 GB of RAM. Also, I suspect that knn in class will be faster than knnflex, but I haven't done extensive testing.
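For reference, a minimal sketch of the class::knn interface on simulated data of roughly the size described in the question (8 predictors, 10k training and test rows; the variable names are made up):
library(class)
set.seed(1)
n_train <- 10000; n_test <- 10000; p <- 8
train_x <- matrix(rnorm(n_train * p), ncol = p)
test_x <- matrix(rnorm(n_test * p), ncol = p)
train_y <- factor(sample(c("default", "no_default"), n_train, replace = TRUE))
# knn() takes the training data, test data and training labels in one call:
# the "model" is just the training set, and predictions come straight back
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(pred)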