Area under ROC in R - r

Is there a way of calculating or estimating the area under the curve as an external metric, using base R, from confusion matrices alone?
If not, how would I do it, given the clustering object?
e.g. we can start from
cutree(hclust(dist(iris[,1:4])),method="average"),3))
or, from a diagonal-maximized version of
table(iris$Species, cutree(hclust(dist(iris[,1:4])),method="average"),3))
the latter being the confusion matrix. I would much, much prefer a solution that goes from the confusion matrix but if it's impossible we can use the clustering object itself.
I read the comments here: Calculate AUC in R? -- the top solution looks good, but it's unclear to me how to generalise it for multi-class data like iris.
(No packages, obviously, I want to find out how to do it by hand in base R)

Related

How can one calculate ROC's AUCs in complex designs with clustering in R?

The packages that calculate AUCs I've found so far do not contemplate sample clustering, which increases standard errors compared to simple random sampling. I wonder if the ones provided by these packages could be recalculated to allow for clustering.
Thank you.
Your best bet is probably replicate weights, as long as you can get point estimates of AUC that incorporate weights.
If you convert your design into a replicate-weights design object (using survey::as.svrepdesign()), you can then run any R function or expression using the replicate weights using survey::withReplicates() and return a standard error.

PCA in R - Do we need to reassign the elements of "prcomp" by multiplying negative sign?

A trainer did this in a video. He just gave a quick explanation that he does this because of R's default nature. However, I have never seen this application before. Is it correct, and why he does this?
pca <- prcomp(data, scale=TRUE)
pca$rotation <- -pca$rotation
pca$x <- -pca$x
It's a bit odd, there's nothing to do with R's nature etc. It might have something to do with the data, and what the pca is used for. Since the data is scaled, by adding taking the negative, you are just flipping the principal component scores:
data = iris[,1:4]
pca <- prcomp(data, scale=TRUE)
par(mfrow=c(1,2))
plot(pca$x[,1:2],col=factor(iris$Species),main="original")
plot(-pca$x[,1:2],col=factor(iris$Species),main="negative")
If you use it for regression etc.. it doesn't really quite matter.
The PCA is an unsupervised-learning algorithm and it doesn't identify the information about the group.
You should use the rotation matrix to try to undestand better the results.
This why?
Because you assumed that underlying factors are orthonormal. The solution is a set of factor with unique variances. The maximum likelihood solution is an orthogonal transformation of the factor. It makes sense to rotate the matrix to maximises ease of interpretation.
If you want to deep the argument about the factor analysis, I suggest you to visit this.
The link that I wrote above is the official page of the FactoMiner package.
Here, instead, you can find the videos (that the autor of the package did) about explanation of the FactoMiner package and all the functions about the factor analysis and so on.

Which methods can I use to calculate correlation among words in quanteda?

My question is a continuation of this.
After cleaning my text data and visualizing it using a wordcloud, I want to see which words are correlated to each other. Here comes the problem:
quantedahas the function textstat_simil, but it says
similarity. So, are "similarity" and "correlation" in this case the same thing? (Is distance also related?).
Moreover, my dfm looks like a binary matrix. Is in this case phi
correlation (from chi'squared statistics) more indicated? Can I
calculate this via quanteda?
Do you guys have any other content rather than the source code of
github that explain in more detail the methods to calculate
similarity or distance measures? (I couldn't understand from
this
code, sorry).
Thanks for you patient!
To compute Pearson’s product-moment correlations among features, you would use:
textstat_simil(x, method = “correlation”, margin = “features”)
The documentation makes this pretty clear, and the correlation method is the default.
Pearson’s correlation would not be the most appropriate for binary data, and we currently do not implement Spearman’s or other correlation methods more appropriate for categorical or ordinal data. However you can always coerce the dfm to an ordinary matrix (use as.matrix()) and then use the stats::cor() methods, which include Spearman’s.
As for the last question, we use the standard implementation of these measures. If you want more clarity on what they mean, I suggest asking on Cross-Validated.

K-Means Distance Measure - Large Data and mixed Scales

I´ve a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually use Euclidean Distance. Our dataset consists of 3 continous variables, 11 ordinal (Likert 0-5) (i think it would be okay to handle them like continous) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach with regards to the "large" dataset? We stick to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the 11-dimensional
Euclidean Distance. Cast the binary values to 0/1 (Well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method other than k-means in 11 dimensions will definitely be hard to interpret. You could try to use a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know way more about the data set than we ever could, so our ability to help you is limited.
You can certainly encode there binary variables as 0,1 too.
It is a best practise in statistics to not treat likert scale variables as numeric, because of that uneven distribution.
But I don't you will get meaningful k-means clusters. That algorithm is all about computing means. That makes sense on continuous variables. Discrete variables usually lack "resolution" for this to work well. Three mean then degrades to a "frequency" and then the data should be handled very differently.
Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you'd like to make it with kmeans, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So given your data, what is a good cluster? Until you have an equation that measures this, handing the data won't solve anything.
Encoding the variables to binary will not solve the underlying problem. Rather, it will only aid in increasing the data dimensionality, an added burden. It's best practice in statistics to not alter the original data to any other form like continuous to categorical or vice versa. However, if you are doing so, i.e. the data conversion then it must be in sync with the question to solve as well as you must provide valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like, missing values, outliers, zero variance, principal component analysis (continuous variables), correspondence analysis (for categorical variables) etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of analysis.
Regarding the distance measure for mixed data type, you do understand the mean in k will work only for continuous variable. So, I do not understand the logic of using the algorithm k-means for mixed datatypes?
Consider choosing other algorithm like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM which models, within each cluster, the continuous variables by Gaussian distributions and the ordinal/binary variables.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/

t-SNE predictions in R

Goal: I aim to use t-SNE (t-distributed Stochastic Neighbor Embedding) in R for dimensionality reduction of my training data (with N observations and K variables, where K>>N) and subsequently aim to come up with the t-SNE representation for my test data.
Example: Suppose I aim to reduce the K variables to D=2 dimensions (often, D=2 or D=3 for t-SNE). There are two R packages: Rtsne and tsne, while I use the former here.
# load packages
library(Rtsne)
# Generate Training Data: random standard normal matrix with J=400 variables and N=100 observations
x.train <- matrix(nrom(n=40000, mean=0, sd=1), nrow=100, ncol=400)
# Generate Test Data: random standard normal vector with N=1 observation for J=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)
# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)
where the command fit.tsne$Y will return the (100x2)-dimensional object containing the t-SNE representation of the data; can also be plotted via plot(fit.tsne$Y).
Problem: Now, what I am looking for is a function that returns a prediction pred of dimension (1x2) for my test data based on the trained t-SNE model. Something like,
# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)
(How) Is this possible? Can you help me out with this?
From the author himself (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).
So you can't directly apply new data points. However, you can fit a multivariate regression model between your data and the embedded dimensions. The author recognizes that it's a limitation of the method and suggests this way to get around it.
t-SNE does not really work this way:
The following is an expert from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper.
You may be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
This website in addition to being really cool offers a wealth of info about t-SNE: http://distill.pub/2016/misread-tsne/
On Kaggle I have also seen people do things like this which may also be of intrest:
https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne
This the mail answer from the author (Jesse Krijthe) of the Rtsne package:
Thank you for the very specific question. I had an earlier request for
this and it is noted as an open issue on GitHub
(https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am
hesitant to implement something like this is that, in a sense, there
is no 'natural' way explain what a prediction means in terms of tsne.
To me, tsne is a way to visualize a distance matrix. As such, a new
sample would lead to a new distance matrix and hence a new
visualization. So, my current thinking is that the only sensible way
would be to rerun the tsne procedure on the train and test set
combined.
Having said that, other people do think it makes sense to define
predictions, for instance by keeping the train objects fixed in the
map and finding good locations for the test objects (as was suggested
in the issue). An approach I would personally prefer over this would
be something like parametric tsne, which Laurens van der Maaten (the
author of the tsne paper) explored a paper. However, this would best
be implemented using something else than my package, because the
parametric model is likely most effective if it is selected by the
user.
So my suggestion would be to 1) refit the mapping using all data or 2)
see if you can find an implementation of parametric tsne, the only one
I know of would be Laurens's Matlab implementation.
Sorry I can not be of more help. If you come up with any other/better
solutions, please let me know.
t-SNE fundamentally does not do what you want. t-SNE is designed only for visualizing a dataset in a low (2 or 3) dimension space. You give it all the data you want to visualize all at once. It is not a general purpose dimensionality reduction tool.
If you are trying to apply t-SNE to "new" data, you are probably not thinking about your problem correctly, or perhaps simply did not understand the purpose of t-SNE.

Resources