t-SNE predictions in R

Goal: I want to use t-SNE (t-distributed Stochastic Neighbor Embedding) in R for dimensionality reduction of my training data (with N observations and K variables, where K >> N), and then to obtain the t-SNE representation of my test data.
Example: Suppose I want to reduce the K variables to D=2 dimensions (typically, D=2 or D=3 for t-SNE). There are two R packages, Rtsne and tsne; I use the former here.
# load packages
library(Rtsne)
# Generate training data: random standard normal matrix with N=100 observations of K=400 variables
x.train <- matrix(rnorm(n=40000, mean=0, sd=1), nrow=100, ncol=400)
# Generate test data: random standard normal vector with N=1 observation of the K=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)
# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)
The object fit.tsne$Y then contains the (100x2)-dimensional t-SNE representation of the training data, which can be plotted via plot(fit.tsne$Y).
Problem: Now, what I am looking for is a function that returns a prediction pred of dimension (1x2) for my test data based on the trained t-SNE model. Something like,
# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)
(How) Is this possible? Can you help me out with this?

From the author himself (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).
So you can't directly embed new data points into an existing map. However, you can fit a multivariate regression model between your data and the embedded dimensions. The author acknowledges that this is a limitation of the method and suggests this workaround.
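As an illustration, here is a minimal sketch of such a regressor, reusing x.train, x.test, and fit.tsne from the question; the knn.embed helper and the choice of k=5 are hypothetical stand-ins for whatever multivariate regressor you prefer:
# embed a new point at the mean map location of its k nearest training neighbours
knn.embed <- function(x.train, y.embed, x.new, k = 5) {
  d <- sqrt(colSums((t(x.train) - x.new)^2)) # Euclidean distances to all training points
  nn <- order(d)[1:k]                        # indices of the k nearest neighbours
  colMeans(y.embed[nn, , drop = FALSE])      # mean map location of those neighbours
}
pred <- knn.embed(x.train, fit.tsne$Y, x.test) # a (1x2) "prediction" for x.test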

t-SNE does not really work this way:
The following is an excerpt from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper.
You may be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
This website in addition to being really cool offers a wealth of info about t-SNE: http://distill.pub/2016/misread-tsne/
On Kaggle I have also seen people do things like this, which may also be of interest:
https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne

This is the email answer from Jesse Krijthe, the author of the Rtsne package:
Thank you for the very specific question. I had an earlier request for
this and it is noted as an open issue on GitHub
(https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am
hesitant to implement something like this is that, in a sense, there
is no 'natural' way to explain what a prediction means in terms of tsne.
To me, tsne is a way to visualize a distance matrix. As such, a new
sample would lead to a new distance matrix and hence a new
visualization. So, my current thinking is that the only sensible way
would be to rerun the tsne procedure on the train and test set
combined.
Having said that, other people do think it makes sense to define
predictions, for instance by keeping the train objects fixed in the
map and finding good locations for the test objects (as was suggested
in the issue). An approach I would personally prefer over this would
be something like parametric tsne, which Laurens van der Maaten (the
author of the tsne paper) explored in a paper. However, this would best
be implemented using something else than my package, because the
parametric model is likely most effective if it is selected by the
user.
So my suggestion would be to 1) refit the mapping using all data or 2)
see if you can find an implementation of parametric tsne, the only one
I know of would be Laurens's Matlab implementation.
Sorry I can not be of more help. If you come up with any other/better
solutions, please let me know.
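Suggestion 1) is straightforward with the objects from the question; a minimal sketch (note that adding even one point changes the distance matrix, so the whole map may shift):
# rerun t-SNE on the training and test data combined
x.all <- rbind(x.train, x.test)     # (101x400) matrix
set.seed(1)
fit.all <- Rtsne(X=x.all, dims=2)
pred <- fit.all$Y[nrow(x.all), ]    # map location of the test observation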

t-SNE fundamentally does not do what you want. t-SNE is designed only for visualizing a dataset in a low-dimensional (2- or 3-dimensional) space; you give it all the data you want to visualize at once. It is not a general-purpose dimensionality reduction tool.
If you are trying to apply t-SNE to "new" data, you are probably not thinking about your problem correctly, or perhaps simply did not understand the purpose of t-SNE.

Related

How to interpret semivariogram with range but no partial sill?

I have been using the variofit function from R's geoR package to fit semivariogram models to some spatial data, and I am confused by a couple of the models that have been generated. For these few models, I get a model that has a range for autocorrelation but no partial sill. I was told that even without a sill the model should still have some shape reflecting the range, yet plotting the model produces the flat lines shown in the attached screenshot. I do not think it is a matter of bad initial values, as I let variofit choose the best initial values from a matrix of candidates built with expand.grid. I would like to know whether this is being plotted correctly, contrary to what I have been told, and what exactly it means to have a range but no partial sill value. When I used the alternative fitting function fit.variogram from gstat, these models could be fit to a periodic or wave distribution, though poorly and probably overfit; would this be some indication of that, which variofit just cannot plot? I unfortunately can't share the data, but I have included an example of the code I used to build these models if it helps to answer my question:
# load package
library(geoR)
# build a geodata object (coords.col and data.col are arguments to as.geodata, not cbind)
geo.entPC <- as.geodata(cbind(jitteryPC, log.PC[, 5]), coords.col=1:2, data.col=5)
# grid of candidate initial values for the covariance parameters
test.pc.grid2 <- expand.grid(seq(0, 2, 0.2), seq(0, 100, 10))
variog.function.col2 <- function(x) {
  # empirical (binned) variogram
  vario.cloud <- variog(x, estimator.type="classical", option="bin")
  # fit a variogram model, letting variofit pick the best initial values from the grid
  variogram.mod <- variofit(vario.cloud, ini.cov.pars=test.pc.grid2,
                            fix.nug=FALSE, weights="equal")
  plot(vario.cloud)
  lines(variogram.mod, col="red")
  summary(x)
}
variog.function.col2(geo.entPC)
From the attached plot showing the empirical variogram, I would not expect to find any sensible spatial correlation. This is in accordance with the fitted variogram, which is essentially a pure nugget model. The spatial range might be a relic of the numerical optimization, or the partial sill might (numerically) differ from 0 only at a digit that is not shown in the summary of the fitted variogram. However, no matter what the range is, with an irrelevantly small partial sill the spatial correlation is negligible.
Depending on the data, it is sometimes beneficial to limit the maximum distance of pairs used to calculate the empirical variogram, but make sure to have "enough" pairs in each bin.
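As an illustration of that last point, geoR's variog() accepts a max.dist argument; the cutoff of 50 below is an arbitrary stand-in for a distance chosen from your data:
# empirical variogram restricted to pairs separated by less than max.dist
vario.capped <- variog(geo.entPC, estimator.type="classical",
                       option="bin", max.dist=50)
plot(vario.capped)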

Area under ROC in R

Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4]), method="average"), 3)
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4]), method="average"), 3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated to each other. Here comes the problem:
quanteda has the function textstat_simil, but its documentation speaks of similarity. So, are "similarity" and "correlation" the same thing in this case? (Is distance also related?)
Moreover, my dfm looks like a binary matrix. Is the phi coefficient (from chi-squared statistics) more appropriate in this case? Can I calculate it via quanteda?
Do you have any material, other than the source code on GitHub, that explains in more detail the methods for calculating similarity or distance measures? (I couldn't understand it from the code, sorry.)
Thanks for your patience!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = "correlation", margin = "features")
The documentation makes this pretty clear, and the correlation method is the default.
Pearson's correlation would not be the most appropriate for binary data, and we currently do not implement Spearman's or other correlation methods more appropriate for categorical or ordinal data. However, you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman's.
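For example, a minimal sketch of that coercion, where my_dfm stands in for your own document-feature matrix:
# coerce the dfm to a plain matrix and compute Spearman correlations among features
m <- as.matrix(my_dfm)                        # documents in rows, features in columns
spearman.cors <- cor(m, method = "spearman")  # feature-by-feature correlation matrix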
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.

Quantile Regression with Time-Series Models (ARIMA-ARCH) in R

I am working on quantile forecasting with time-series data. The model I am using is ARIMA(1,1,2)-ARCH(2) and I am trying to get quantile regression estimates of my data.
So far, I have found "quantreg" package to perform quantile regression, but I have no idea how to put ARIMA-ARCH models as the model formula in function rq.
The rq function seems to work for regressions with dependent and independent variables, but not for time series.
Is there some other package that I can put time-series models and do quantile regression in R? Any advice is welcome. Thanks.
I just put an answer on the Data Science forum.
It basically says that most of the ready-made packages use so-called exact tests based on distributional assumptions (independent, identically distributed normal-Gaussian, or broader).
You also have a family of resampling methods, in which you simulate samples with a distribution similar to that of your observed sample, fit your ARIMA(1,1,2)-ARCH(2), and repeat the process a great number of times. You then analyze this great number of forecasts and measure (as opposed to compute) your confidence intervals.
The resampling methods differ in how they generate the simulated samples. The most used are:
The jackknife: you "forget" one point, that is, you simulate n samples of size n-1 (where n is the size of the observed sample).
The bootstrap: you simulate a sample by drawing n values from the original sample with replacement: some are drawn once, some twice or more, some never, ...
It is a (not easy) theorem that the expectation of the confidence intervals, like that of most of the usual statistical estimators, is the same on the simulated samples as on the original sample, with the difference that you can measure them over a great number of simulations.
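As a generic illustration of the bootstrap idea (ignoring the serial dependence of time series, which in practice calls for a block bootstrap), where stat() is a placeholder for whatever quantity you forecast:
# measure a 95% confidence interval from bootstrap replicates
set.seed(1)
x <- rnorm(100)              # stand-in for the observed sample
stat <- function(s) mean(s)  # placeholder statistic
boot.stats <- replicate(2000, stat(sample(x, replace = TRUE)))
quantile(boot.stats, c(0.025, 0.975))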
I can try to address your question, although this is hard since you don't provide any code or data. Also, I take "put ARIMA-ARCH models" to mean that you want to make an integrated series stationary using an ARIMA(1,1,2) filter plus an ARCH(2) filter.
For an overview of R's time-series capabilities, you can refer to the CRAN Task View on Time Series.
You can easily apply these filters in R with the appropriate functions.
For instance, you could use the Arima() function from the forecast package, then compute the residuals with residuals() from the stats package. Next, you can use this filtered series as input for the garch() function from the tseries package. Other combinations are of course possible. Finally, you can apply quantile regression to this filtered series; for instance, check out the dynrq() function from the quantreg package, which allows time-series objects in the data argument.
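A hedged sketch of that pipeline, with y standing in for your own series and the model orders taken from the question (note that in tseries::garch the second element of order is the ARCH order):
# load packages
library(forecast)  # Arima()
library(tseries)   # garch()
library(quantreg)  # dynrq()
fit.arima <- Arima(y, order=c(1, 1, 2))   # ARIMA(1,1,2) filter
res <- residuals(fit.arima)               # filtered (stationary) series
fit.arch <- garch(res, order=c(0, 2))     # ARCH(2) on the ARIMA residuals
z <- na.omit(residuals(fit.arch))         # standardized residuals
fit.qr <- dynrq(z ~ L(z, 1), tau=0.95)    # quantile regression on the lagged series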

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1 <- maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map <- predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
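In code, that advice amounts to something like the following sketch, reusing the objects from the question (the final_model path is illustrative):
# refit on all the data, without replicates, and predict a single map
final_model <- maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//final_model")
final_map <- predict(final_model, predvars, filename="//home//...//final_map.tif")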
I may have found this a couple of years late, but you could do something like this:
xm1 <- maxent(predictors, pres_train1) # maxent model for partition 1
xm2 <- maxent(predictors, pres_train2) # maxent model for partition 2
px1 <- predict(predictors, xm1, ext=ext, progress='') # prediction from model 1
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction from model 2
models <- stack(px1, px2) # stack the predictions from all the models
final_map <- mean(models) # take the cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
xm1, xm2, ... would be the maxent models for each partition in the cross-validation, and px1, px2, ... the predicted maps.
