R - How to get one "summary" prediction map instead of 5 when using 5-fold cross-validation in a maxent model? - r

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that result from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin

Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.

I may have found this a couple of years late, but you could do something like this:
xm <- maxent(predictors, pres_train) # the maxent model for one partition
px <- predict(predictors, xm, ext=ext, progress='') # prediction from the first model
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction from the second model
models <- stack(px, px2) # create a stack of the predictions from all the models
final_map <- mean(models) # take the cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
Here xm, xm2, ... would be the maxent models for each partition in the cross-validation (xm2 is fitted the same way as xm, on a different partition), and px, px2, ... would be the predicted maps.
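The cell-wise averaging described above can be sketched in base R. This is only an illustration under assumptions: small random matrices stand in for the per-fold prediction rasters, the names fold_maps, n_folds, avg_map and sd_map are hypothetical, and nothing here claims to reproduce how maxent itself builds its summary grids.

```r
# Minimal sketch: average per-fold prediction maps cell by cell.
# ASSUMPTION: plain matrices stand in for the RasterLayers returned by predict().
set.seed(1)
n_folds <- 5

# Hypothetical fold predictions: five 4 x 3 grids of suitability values in [0, 1]
fold_maps <- replicate(n_folds, matrix(runif(12), nrow = 4, ncol = 3),
                       simplify = FALSE)

# Bind the maps into a 4 x 3 x 5 array, then summarise over the third dimension
map_array <- simplify2array(fold_maps)
avg_map <- apply(map_array, c(1, 2), mean)  # cell-wise mean across folds
sd_map  <- apply(map_array, c(1, 2), sd)    # cell-wise spread across folds

print(round(avg_map, 3))
```

With real raster objects, the same idea would presumably be expressed with stack() followed by a cell-wise mean, as in the answer above.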

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves predicting whether some trees will survive given future climate change scenarios. Against better judgement (such as using Maxent), I've decided to pursue this with a GLM, which requires presence and absence data. Every time I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM model has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis), and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out for me. I'd like to run it roughly 100 times with different pseudo-absence data and see which variables it returns, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends with 'print(summary(initial.model))', and I would like to be able to access the output the way step() returns a list, where you can then say 'step$coefficients' and get the model coefficients back as numerics. Can anyone help me with this?
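For illustration, the repeat-and-tally idea can be sketched with base R's step(), which returns the fitted model object. Note the assumptions: this swaps in step() for My.stepwise.glm, the predictors x1, x2 and x3 are simulated, and rnorm() draws stand in for the pseudo-absences that randomPoints would generate. It is a sketch of the workflow, not a recommendation of stepwise selection.

```r
# Minimal sketch: repeat stepwise selection over fresh pseudo-absences
# and count how often each variable is retained.
# ASSUMPTION: step() replaces My.stepwise.glm; all data are simulated.
set.seed(42)
n <- 200
presence <- data.frame(x1 = rnorm(n, mean = 1), x2 = rnorm(n), x3 = rnorm(n))

n_runs <- 20  # the question suggests roughly 100
selected <- vector("list", n_runs)

for (i in seq_len(n_runs)) {
  # New pseudo-absence sample on every run
  pseudo_abs <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  dat <- rbind(cbind(y = 1, presence), cbind(y = 0, pseudo_abs))

  full <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)
  fit <- step(full, direction = "both", trace = 0)

  # Names of the retained predictors, dropping the intercept
  selected[[i]] <- names(coef(fit))[-1]
}

# Frequency with which each variable survived selection
var_counts <- sort(table(unlist(selected)), decreasing = TRUE)
print(var_counts)
```

Because step() returns the model rather than printing it, coef(fit) is available as a numeric vector on every run.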

Quantile Regression with Time-Series Models (ARIMA-ARCH) in R

I am working on quantile forecasting with time-series data. The model I am using is ARIMA(1,1,2)-ARCH(2) and I am trying to get quantile regression estimates of my data.
So far, I have found the "quantreg" package for performing quantile regression, but I have no idea how to put ARIMA-ARCH models in as the model formula in the function rq.
The rq function seems to work for regressions with dependent and independent variables, but not for time series.
Is there some other package into which I can put time-series models and do quantile regression in R? Any advice is welcome. Thanks.
I just put an answer on the Data Science forum.
It basically says that most of the ready-made packages use so-called exact tests, based on an assumption about the distribution (independent, identically distributed normal (Gaussian) errors, or something wider).
You also have a family of resampling methods, in which you simulate a sample with a distribution similar to that of your observed sample, fit your ARIMA(1,1,2)-ARCH(2) model, and repeat the process a great number of times. Then you analyze this large set of forecasts and measure (as opposed to compute) your confidence intervals.
The resampling methods differ in how they generate the simulated samples. The most used are:
The Jackknife: you "forget" one point, that is, you simulate n samples of size n-1 (where n is the size of the observed sample).
The Bootstrap: you simulate a sample by drawing n values from the original sample with replacement: some will be taken once, some twice or more, some never.
It is a (not easy) theorem that the expectation of the confidence intervals, like that of most of the usual statistical estimators, is the same on the simulated samples as on the original sample. The difference is that you can measure them with a great number of simulations.
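The two resampling schemes can be sketched in base R. This is a minimal illustration under assumptions: the data are simulated independent draws (not a time series), the statistic is the 0.9 quantile, and the names boot_q, jack_q and ci are hypothetical. For serially dependent data a plain bootstrap like this would need to be adapted.

```r
# Minimal sketch: bootstrap and jackknife replicates of a sample quantile.
# ASSUMPTION: x is simulated i.i.d. data standing in for the observed sample.
set.seed(1)
x <- rnorm(200)
n_boot <- 2000

# Bootstrap: draw n values with replacement, recompute the statistic each time
boot_q <- replicate(
  n_boot,
  quantile(sample(x, replace = TRUE), probs = 0.9, names = FALSE)
)

# "Measured" 95% interval: percentiles of the bootstrap replicates
ci <- quantile(boot_q, probs = c(0.025, 0.975), names = FALSE)

# Jackknife: n samples of size n - 1, each leaving out one observation
jack_q <- vapply(
  seq_along(x),
  function(i) quantile(x[-i], probs = 0.9, names = FALSE),
  numeric(1)
)

print(ci)
```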
Hello and welcome to StackOverflow. Please take some time to read the help page, especially the sections named "What topics can I ask about here?" and "What types of questions should I avoid asking?". And more importantly, please read the Stack Overflow question checklist. You might also want to learn about Minimal, Complete, and Verifiable Examples.
I can try to address your question, although this is hard since you don't provide any code or data. Also, I guess that by "put ARIMA-ARCH models" you actually mean that you want to make an integrated series stationary using an ARIMA(1,1,2) filter plus an ARCH(2) filter.
For an overview of the R time-series capabilities you can refer to the CRAN task list.
You can easily apply these filters in R with an appropriate function.
For instance, you could use the Arima() function from the forecast package, then compute the residuals with residuals() from the stats package. Next, you can use this filtered series as input for the garch() function from the tseries package. Other approaches are of course possible. Finally, you can apply quantile regression to this filtered series. For instance, you can check out the dynrq() function from the quantreg package, which allows time-series objects in the data argument.
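A reduced version of this filter-then-quantile idea can be sketched with base R only. The assumptions are substantial and worth stating: stats::arima() is used in place of forecast::Arima(), the ARCH(2) step is skipped, and empirical quantiles of the residuals are used as a simple substitute for a quantile regression with dynrq(). The series is simulated and the names tau and q_fc are hypothetical.

```r
# Minimal sketch: ARIMA(1,1,2) filter, then empirical residual quantiles
# added to the one-step-ahead point forecast.
# ASSUMPTION: simulated series; no ARCH step; no quantreg call.
set.seed(123)
y <- arima.sim(model = list(order = c(1, 1, 2), ar = 0.5, ma = c(0.3, 0.2)),
               n = 300)

fit <- arima(y, order = c(1, 1, 2))  # the ARIMA filter
res <- residuals(fit)                # the filtered series

# One-step-ahead point forecast
point_fc <- as.numeric(predict(fit, n.ahead = 1)$pred)

# Shift the point forecast by empirical residual quantiles
tau <- c(0.05, 0.5, 0.95)
q_fc <- point_fc + quantile(res, probs = tau, names = FALSE)

print(data.frame(tau = tau, forecast = q_fc))
```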

t-SNE predictions in R

Goal: I aim to use t-SNE (t-distributed Stochastic Neighbor Embedding) in R for dimensionality reduction of my training data (with N observations and K variables, where K>>N) and subsequently aim to come up with the t-SNE representation for my test data.
Example: Suppose I aim to reduce the K variables to D=2 dimensions (often, D=2 or D=3 for t-SNE). There are two R packages: Rtsne and tsne, while I use the former here.
# load packages
library(Rtsne)
# Generate Training Data: random standard normal matrix with K=400 variables and N=100 observations
x.train <- matrix(rnorm(n=40000, mean=0, sd=1), nrow=100, ncol=400)
# Generate Test Data: random standard normal vector with N=1 observation for K=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)
# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)
where the command fit.tsne$Y will return the (100x2)-dimensional object containing the t-SNE representation of the data; can also be plotted via plot(fit.tsne$Y).
Problem: Now, what I am looking for is a function that returns a prediction pred of dimension (1x2) for my test data based on the trained t-SNE model. Something like,
# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)
(How) Is this possible? Can you help me out with this?
From the author himself (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).
So you can't directly apply new data points. However, you can fit a multivariate regression model between your data and the embedded dimensions. The author recognizes that it's a limitation of the method and suggests this way to get around it.
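The regression workaround can be sketched roughly as follows. Several things here are assumptions rather than anything the t-SNE author prescribes: the inputs are first reduced with PCA (because K > N makes a direct linear regression ill-posed), the regressor is a plain multivariate lm(), and if the Rtsne package is not installed, classical MDS from cmdscale() is used as a stand-in map so that the sketch still runs. The names emb, n_pc and reg are hypothetical.

```r
# Minimal sketch: learn a regressor from inputs to map coordinates,
# then use it to place a new point on the existing map.
set.seed(1)
x.train <- matrix(rnorm(100 * 400), nrow = 100, ncol = 400)
x.test  <- matrix(rnorm(400), nrow = 1)

# 2-D map of the training data (t-SNE if available, else an MDS stand-in)
if (requireNamespace("Rtsne", quietly = TRUE)) {
  emb <- Rtsne::Rtsne(x.train, dims = 2)$Y
} else {
  emb <- cmdscale(dist(x.train), k = 2)  # ASSUMPTION: stand-in, not t-SNE
}

# Reduce the 400 inputs to a few principal components
pca <- prcomp(x.train, center = TRUE)
n_pc <- 10
scores <- as.data.frame(pca$x[, seq_len(n_pc)])

# Multivariate regression: both map coordinates as responses
reg <- lm(emb ~ ., data = scores)

# Project the test point onto the same components and predict its location
test_scores <- as.data.frame(
  predict(pca, newdata = x.test)[, seq_len(n_pc), drop = FALSE]
)
pred <- predict(reg, newdata = test_scores)
print(pred)
```

The result is a 1x2 matrix, matching the shape asked for in the question. How meaningful the predicted location is depends entirely on how well the regressor captures the map.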
t-SNE does not really work this way:
The following is an excerpt from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper.
You may be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
This website in addition to being really cool offers a wealth of info about t-SNE: http://distill.pub/2016/misread-tsne/
On Kaggle I have also seen people do things like this, which may also be of interest:
https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne
This is the mail answer from the author (Jesse Krijthe) of the Rtsne package:
Thank you for the very specific question. I had an earlier request for
this and it is noted as an open issue on GitHub
(https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am
hesitant to implement something like this is that, in a sense, there
is no 'natural' way to explain what a prediction means in terms of tsne.
To me, tsne is a way to visualize a distance matrix. As such, a new
sample would lead to a new distance matrix and hence a new
visualization. So, my current thinking is that the only sensible way
would be to rerun the tsne procedure on the train and test set
combined.
Having said that, other people do think it makes sense to define
predictions, for instance by keeping the train objects fixed in the
map and finding good locations for the test objects (as was suggested
in the issue). An approach I would personally prefer over this would
be something like parametric tsne, which Laurens van der Maaten (the
author of the tsne paper) explored in a paper. However, this would best
be implemented using something else than my package, because the
parametric model is likely most effective if it is selected by the
user.
So my suggestion would be to 1) refit the mapping using all data or 2)
see if you can find an implementation of parametric tsne, the only one
I know of would be Laurens's Matlab implementation.
Sorry I can not be of more help. If you come up with any other/better
solutions, please let me know.
t-SNE fundamentally does not do what you want. t-SNE is designed only for visualizing a dataset in a low (2 or 3) dimension space. You give it all the data you want to visualize all at once. It is not a general purpose dimensionality reduction tool.
If you are trying to apply t-SNE to "new" data, you are probably not thinking about your problem correctly, or perhaps simply did not understand the purpose of t-SNE.

Using a 'gbm' model created in R package 'dismo' with functions in R package 'gbm'

This is a follow-up to a previous question I asked a while back that was recently answered.
I have built several gbm models with dismo::gbm.step, which relies on the gbm fitting functions found in R package gbm, as well as cross validation tools from R package splines.
As part of my analysis, I would like to use some of the graphical tools available in R (e. g. perspective plots) to visualize pairwise interactions in the data. Both the gbm and the dismo packages have functions for detecting and modelling interactions in the data.
The implementation in dismo is explained in Elith et al. (2008) and returns a statistic which indicates departures of the model predictions from a linear combination of the predictors, while holding all other predictors at their means.
The implementation in gbm uses Friedman's H statistic (Friedman & Popescu, 2005), and returns a different metric, and also does NOT set the other variables at their means.
The interactions modelled and plotted with dismo::gbm.interactions are great and have been very informative. However, I would also like to use gbm::interact.gbm, partly for publication strength and also to compare the results from the two methods.
If I try to run gbm::interact.gbm on a gbm.object created with dismo, an error is returned…
"Error in is.factor(data[, x$var.names[j]]) :
argument "data" is missing, with no default"
I understand dismo::gbm.step adds extra data the authors thought would be useful to the gbm model.
I also understand that the answer to my question lies somewhere in the source code.
My question is...
Is it possible to modify a gbm object created in dismo to be used in gbm::interact.gbm? If so, would this be accomplished by...
a. Modifying the gbm object created in dismo::gbm.step?
b. Modifying the source code for gbm::interact.gbm?
c. Doing something else?
I will be going through the source code trying to solve this myself, if I come up with a solution before anyone answers I will answer my own question.
The gbm::interact.gbm function requires data as an argument: interact.gbm <- function(x, data, i.var = 1, n.trees = x$n.trees).
The dismo gbm.object is essentially the same as the gbm gbm.object, but with extra information attached, so I don't imagine changing the gbm.object would help. The error message says that the data argument is missing, so it has to be supplied in the call.
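As an illustration of passing data explicitly, here is a minimal sketch that uses the gbm package directly. The assumptions: the model is fitted with gbm::gbm() on simulated data rather than with dismo::gbm.step, the names train_df and h_stat are hypothetical, and the code is wrapped in a requireNamespace() check so that it does nothing if gbm is not installed. Whether a gbm.step object behaves identically has not been verified here.

```r
# Minimal sketch: supply the training data to interact.gbm() explicitly.
# ASSUMPTION: simulated data and a model fitted with gbm::gbm(), not gbm.step.
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(1)
  n <- 300
  train_df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
  # The response contains an x1:x2 interaction by construction
  train_df$y <- train_df$x1 * train_df$x2 + rnorm(n, sd = 0.1)

  fit <- gbm::gbm(y ~ x1 + x2 + x3, data = train_df,
                  distribution = "gaussian", n.trees = 200,
                  interaction.depth = 2, shrinkage = 0.05)

  # Friedman's H statistic for the x1-x2 pair; note the data argument
  h_stat <- gbm::interact.gbm(fit, data = train_df,
                              i.var = c("x1", "x2"), n.trees = 200)
  print(h_stat)
}
```

For a model built with dismo, the data argument would presumably be the same data frame that was passed to gbm.step.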

random forest package in R

I use the randomForest package in R for regression. It gives me two kinds of information: Mean of squared residuals and % Var explained. But I want to calculate the RMSE and R^2 of the training and test sets. Can anyone help me find this information?
Sorry this is not a specific answer, but I do not have enough cred to leave a comment.
It is tough to say how you may get at what you want without a reproducible example. However, if you used the xtest= and ytest= arguments in the call to randomForest (assuming you are using the "randomForest" package), then what you are looking for should be a part of the resulting randomForest object. What you want to look in is the test part of the resulting random forest list.
An attempted example:
rf.results <- randomForest( whatever arguments )
rf.results$test$mse # test-set MSE, one value per tree (take the square root of the last value to get the RMSE)
rf.results$test$rsq # test-set pseudo-R2 for random forest, also one value per tree
If you have the random forest package loaded you can validate this information as well as do some exploration yourself with ?randomForest. The "Value" section of the documentation details the object that results from a call to randomForest and where you can find various performance metrics.
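Both metrics can also be computed by hand from observed and predicted values, which works the same way for the training and the test set. This is a minimal sketch in base R: the helper names rmse and r_squared are hypothetical, the toy vectors are made up for illustration, and the R^2 used is the usual 1 - SSE/SST definition.

```r
# Minimal sketch: RMSE and R^2 from observed and predicted values.
# ASSUMPTION: obs and pred are toy vectors; with a fitted forest, pred would
# come from predict() on the training or test data.
rmse <- function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}

r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)

print(rmse(obs, pred))       # sqrt(0.125), about 0.354
print(r_squared(obs, pred))  # 1 - 0.5 / 5 = 0.9
```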
