I have an R function that takes input data containing missing values, imputes those values with Random Forest imputation (the rfImpute function from the randomForest package), and then runs an RF importance calculation to identify the relative importance of the variables (ranger from the ranger package). The function sets the seed to 2018.
When I run the function in R with set.seed(2018), I get one set of results. When I run the exact same function on the exact same input data with the exact same seed in PL/R (using Navicat), the results are different.
I am having a really hard time understanding what could be causing this issue as everything is the exact same between the two (except one is R and the other is PL/R). For some input datasets, the results are equivalent but for others they are not. What could the problem be?
Note: I am not able to provide a simple example since my data is confidential.
I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive under future climate change scenarios. Against better judgement (like using Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (I was only given presence data) using randomPoints from dismo, the resulting GLM has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), which goes through a forward/backward selection process to find the best variables and returns a fitted model.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out for me. I'd like to run it roughly 100 times with different pseudo-absence data and see which variables it returns, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends by 'print(summary(initial.model))' and I would like to be able to access the output similar to how step() returns a list, where you can then say 'step$coefficients' and have the function coefficients return as numerics. Can anyone help me with this?
I am researching how to use multiple imputation results. The following is my understanding; please let me know if there are any mistakes.
Suppose you have a data set with missing values and you want to conduct a regression analysis. You may perform multiple imputation with m = 5, run the regression on each of the 5 imputed data sets, and then "pool" the coefficient estimates from these m = 5 models via Rubin's rules (or use the pool() function in the R package mice).
My question is that, in mice you have a function complete(), and the manual says you can extract completed data set by using complete(object).
But if I use mice with m = 5, does it still make sense to use complete()? Which imputation results will complete() return?
Also, does it make sense if I only use mice with m = 1? Thank you.
You probably overlooked that mice::complete() uses action=1 by default, which "returns the first imputed data set" (see ?mice::complete). A single imputed data set is of little use on its own.
You should definitely use action="long" to account for the "multiplicity" of the multiple imputation!
No, it makes no sense at all to use m=1 (apart from debugging), because every imputation is based on a random process and you have to pool the results (by whatever method) to account for the variation between imputations. Often m>20 is recommended [1].
Basically, multiple imputation works as follows:
1. Create m imputation processes with a random component, to obtain m slightly different imputed data sets.
2. Analyze each imputed data set to get slightly different parameter estimates.
3. Combine results, calculating the variation in parameter estimates.
(Also see multiple-imputation-in-a-nutshell for a brief overview.)
When you use mice, you get an object that is not the imputed data set. You cannot perform operations on it directly without using the special functions in mice. If you want to extract the actual imputed data sets, you use complete, the output of which is a data.frame with one row per individual per imputation (if using the "long" format). If you are doing any analysis with your imputed data that cannot be performed within mice, you need to create this data set first.
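For what it's worth, here is a minimal sketch of that workflow. It assumes the nhanes example data that ships with mice, and the regression formula is only for illustration, so swap in your own data and model:

```r
library(mice)

# nhanes is a small example data set with missing values, shipped with mice
imp <- mice(nhanes, m = 5, seed = 2018, printFlag = FALSE)

# Fit the same regression on each of the 5 imputed data sets
fit <- with(imp, lm(bmi ~ age + hyp))

# Combine the 5 sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled)

# complete() with its default action returns only the first imputed data set
first <- complete(imp)

# action = "long" stacks all m imputed data sets; .imp marks the imputation
long <- complete(imp, action = "long")
table(long$.imp)
```

The pooled summary is what you would report; the long data frame is mainly useful when an analysis cannot be run through with() and pool().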
I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5", which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics is created, including the prediction maps. In R, an html file is also created, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However (and this is the problem), I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that result from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
I found this question a couple of years late, but you could do something like this:
xm <- maxent(predictors, pres_train) # maxent model for the first partition
px <- predict(predictors, xm, ext=ext, progress='') # prediction from the first model
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction from the second model (xm2 is fitted like xm, on the second partition)
models <- stack(px, px2) # stack the predictions from all the models
final_map <- calc(models, fun=mean) # cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
xm, xm2, ... would be the maxent models for each partition in the cross-validation, and px, px2, ... would be the corresponding predicted maps.
Stata includes a command (wntestq) that it calls the "portmanteau Q test for white noise." There seem to be a variety of related tests in different R packages. That said, most of these seem designed specifically for data in various time series formats, and I could not find one that operates on a single variable.
"Portmanteau" refers to a family of statistical tests. In time series analysis, portmanteau tests are used for testing for autocorrelation of residuals in a model. The most commonly used test is the Ljung-Box test. Although it's buried in a citation in the manual, it seems that is the test that the Stata command wntestq has implemented.
R implements the same test in a function called Box.test(), which is in the stats package that comes included with R. As you can see in the documentation for that function, Box.test() actually implements two tests: the Ljung-Box test that Stata uses and the Box-Pierce test. According to some sources, Box-Pierce was found to include a seemingly trivial simplification which can lead to nasty effects.[1][2] For that reason, and because the defaults are different in R and Stata, it is worth noting that the Box-Pierce version is the default in R.
The test considers a certain number of autocorrelation coefficients (i.e., up to lag h) and there is no obvious default to select (see this question on the statistics StackExchange for a much more detailed discussion). Another important difference that will lead to different results is that the default h, or number of lags, is different in Stata and R. By default, R sets h to 1 while Stata sets h to [n/2]-2 or 40, whichever is smaller.
Although there are many reasons you might not want the default, the following R function will reproduce the default behavior of the Stata command:
q.test <- function (x) {
  # Stata's default lag: [n/2] - 2 (with [ ] the integer part) or 40, whichever is smaller
  Box.test(x, type="Ljung-Box", lag=min(floor(length(x)/2) - 2, 40))
}
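To see how much the defaults matter, here is a small sketch on simulated white noise. The data and the variable names (x, stata_lag, bp, lb) are made up for illustration; only Box.test() itself comes from the stats package:

```r
set.seed(2018)
x <- rnorm(100)  # simulated white noise, purely for illustration

# Stata's default number of lags: [n/2] - 2 or 40, whichever is smaller
stata_lag <- min(floor(length(x) / 2) - 2, 40)

# R's defaults: Box-Pierce statistic with a single lag
bp <- Box.test(x)

# Ljung-Box statistic with the Stata-style lag choice
lb <- Box.test(x, type = "Ljung-Box", lag = stata_lag)

bp$parameter  # degrees of freedom used by the default call
lb$parameter  # degrees of freedom used by the Stata-style call
c(box_pierce = bp$p.value, ljung_box = lb$p.value)
```

The two calls test different hypotheses (one lag versus many), so the p-values are not expected to agree even on the same series.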
I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command, rfcv (Random Forest Cross-Validation for feature selection), which I suppose should be the most appropriate for this purpose. My questions about it are: how do I get the list of the most important variables? How do I see the output after running it? Which command should I use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomForest and R, so any help would be appreciated.
Thank you in advance!
After you have built a randomForest model, you use predict.randomForest to apply that model to new data. For example, build a random forest with your training data, then run your validation data through that model with predict.randomForest.
As for rfcv, there is an option recursive which (from the help) controls "whether variable importance is (re-)assessed at each step of variable reduction".
It's all in the help file.
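In case a concrete example helps, here is a rough sketch using the built-in iris data. The train/validation split and the object names are assumptions for illustration only, not anything from your setup:

```r
library(randomForest)

set.seed(2018)
train_idx <- sample(nrow(iris), 100)  # arbitrary split, for illustration
train <- iris[train_idx, ]
valid <- iris[-train_idx, ]

# Fit the forest; importance = TRUE also stores permutation importance
rf <- randomForest(Species ~ ., data = train, importance = TRUE)
print(rf)  # out-of-bag error estimate and confusion matrix

# Rank the variables, here by mean decrease in Gini impurity
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# predict() dispatches to predict.randomForest for new data
pred <- predict(rf, newdata = valid)
mean(pred == valid$Species)  # validation accuracy

# rfcv reports cross-validated error for shrinking numbers of predictors
cv <- rfcv(trainx = train[, -5], trainy = train$Species, cv.fold = 5)
cv$error.cv
```

Note that rfcv tells you how the error changes with the number of variables kept, not which variables they are; the ranking itself still comes from importance().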