Predict values in Multidimensional Scaling (MDS) in R

I’m trying to use multidimensional scaling (MDS) in R.
Can I predict new values on test set based on the values that I receive from my training set?
I’m looking for something similar to what I’ve done in PCA for example:
prin_comp <- prcomp(pca.train, scale. = FALSE)
test.data <- predict(prin_comp, newdata = pca.test)
Thank you,
Ittai

You can use MDS as the first step of a three-step process (a sketch follows below):
1. Generate the MDS coordinates.
2. Apply a traditional clustering algorithm to the generated coordinates, e.g. k-means via kmeans(x, centers = K), where you supply K, the number of clusters. Note that you will probably want to evaluate the generated clusters, e.g. by cross-validation, to ensure they provide good labels for your existing data.
3. Use the k-means clusters to find the nearest centroid/cluster for each new observation.
Then you have a decision to make (as the modeler): do you apply the mode of the chosen cluster as the label for your new data? That is the simplest solution, but there are other approaches.
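A minimal sketch of these three steps, assuming train and test are numeric matrices with the same columns (analogous to pca.train and pca.test in the question); the use of cmdscale and the nearest-centroid rule in the original feature space are assumptions, not the only way to do it:
# 1. MDS coordinates of the training data
d <- dist(train)                        # or any other dissimilarity
mds <- cmdscale(d, k = 2)               # classical MDS in 2 dimensions

# 2. Cluster the MDS coordinates
K <- 3                                  # number of clusters, chosen by you
km <- kmeans(mds, centers = K)

# 3. Assign each new observation to a cluster; one simple rule is to carry the
#    labels back to the original feature space and use the nearest training
#    centroid there (other assignment rules are possible, as noted above)
centroids <- apply(train, 2, function(col) tapply(col, km$cluster, mean))  # K x p
nearest <- apply(test, 1, function(x) which.min(colSums((t(centroids) - x)^2)))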

In addition to what you wrote, can't I use the predict function based on the coefficients from the training model, and use the test data to predict new MDS values?

Related

How to do a leave-one-out cross validation for CAP/capscale in R vegan?

I would like to perform a "leave-one-out cross validation" (LOO-CV) for a CAP in R. The CAP was calculated by using capscale in R package vegan and is a canonical analysis of principal coordinates, similar to an rda or cca, but based on another similarity matrix, in my case Bray-Curtis. I have found that within predict.cca there is the function calibrate.cca but I cannot make it work.
https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/predict.cca
This is what I have (based on the sample data mite available in vegan)
library(vegan)
data(mite, mite.env)
str(mite.env) #"SubsDens", "WatrCont", "Substrate", "Shrub", "Topo"
miteBC <- vegdist(mite, method="bray") #Bray-Curtis similarity matrix
miteCAP <-capscale(miteBC~Substrate + Shrub + Topo, data=mite.env, #CAP in capscale
distance = "bray", metaMDSdist = F)
summary(miteCAP)
anova(miteCAP)
anova(miteCAP, by = "axis")
anova(miteCAP, by = "margin")
calibrate.cca(miteCAP, type = c("response")) #error cannot find function calibrate.cca
In the program Primer this is done automatically within the CAP function ("Leave-one-out Allocation of Observations to Groups"): it assigns each sample automatically to a group and gets a misclassification error (similar to a classification randomForest, which I have already done). However, I would like to use R, and it should be possible with vegan::capscale.
Any help is very much appreciated!
Function vegan::calibrate does not have an argument type and never returns "response"; check its documentation. It performs environmental calibration and returns the predicted values of the constraints (Substrate, Shrub, Topo) on the scale of the model matrix, and with factors these hardly make sense directly.
There is no direct option for LOO: you have to do it by hand, cycling through the points and using the complete left-out point as newdata. However, I'd suggest k-fold cross-validation as a better alternative for estimating predictive power: LOO changes the data too little and gives an over-optimistic view of predictive power.
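A rough outline of such a k-fold loop with the mite example might look as follows; here the model is refit on the community matrix rather than the precomputed distance matrix so that calibrate can take the held-out rows as newdata (this setup, the fold count, and the direct use of calibrate on a capscale fit are assumptions to adapt, not a drop-in solution):
library(vegan)
data(mite)
data(mite.env)

set.seed(1)
k <- 5                                             # number of folds
folds <- sample(rep(1:k, length.out = nrow(mite)))
cal <- vector("list", k)
for (i in 1:k) {
  train <- folds != i
  # refit the CAP on the training fold only
  fit <- capscale(mite[train, ] ~ Substrate + Shrub + Topo,
                  data = mite.env[train, ], distance = "bray")
  # predicted constraints (model-matrix scale) for the held-out samples
  cal[[i]] <- calibrate(fit, newdata = mite[!train, ])
}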

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (mostly time series, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go that way, as they typically involve (i) installation of more or less complex/heavy solutions (ImageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-seq(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be to use a non-linear prediction method and take the fitted values from the model.
For example, a polynomial regression fitted to a single column recovers a smoothed version of the original data.
Following the same logic, you can do the same thing for all columns of the zz matrix:
# fit a degree-2 polynomial to each column of zz and keep the fitted values
predictions <- matrix(NA_real_, nrow = nrow(zz), ncol = ncol(zz))
for (i in 1:ncol(zz)) {
  fit <- lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))
  predictions[, i] <- fitted(fit)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns, or use other, more powerful non-linear prediction methods (SVM, ANN, etc.) to get a more accurate model.
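Along the same lines, a hedged alternative is to smooth each column with loess (in base R's stats package) instead of a global polynomial; the span value below is just a starting point to tune:
# smooth every column of zz with a local regression and collect the fits
smooth_col <- function(col) predict(loess(col ~ seq_along(col), span = 0.3))
smoothed <- apply(zz, 2, smooth_col)
par(mfrow = c(1, 2))
image(zz, main = "Noisy")
image(smoothed, main = "Loess-smoothed")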

Normalize z-axis values onto [0, 1] when using vis.gam for a mgcv GAM

I have just finished fitting a GAM using the mgcv package (I will call this model gam1.5). I have been playing around with the vis.gam function and I have a question I have not been able to solve.
I would like to normalize the fitted values of my model so when I use vis.gam, the z-axis has limits [0, 1].
My idea was to apply the normalization formula in the $fitted.values of my GAM model as follows:
gam1.5$fitted.values<-(gam1.5$fitted.values-min(gam1.5$fitted.values))/(max(gam1.5$fitted.values)-min(gam1.5$fitted.values))
However, when I run vis.gam, it does not change the scale of the z-axis. I was wondering if I am applying the normalization formula to the incorrect object (something other than $fitted.values) within the GAM object.
Yes: vis.gam is based on predict.gam, so your change to $fitted.values has no effect.
In fact, you can't achieve your goal with vis.gam. This function simply produces a plot and returns nothing for the user to reproduce the plot later (unless vis.gam is called again). This means we will need to work with predict.gam. Here are the basic steps.
1. Set up a 2D grid / mesh. You may want to use exclude.too.far to filter grid points far away from the training data, to avoid ridiculous spline / polynomial extrapolation (as vis.gam does).
2. Construct a new data frame newdat (from the above grid) and call oo <- predict.gam(gam1.5, newdat, type = "terms") to obtain term-wise predictions. This is a matrix; retain only the column associated with the 2D smooth you want to plot. Let's say this column is stored in a vector z.
3. Augment z into a matrix, padding with NA for the grid points that are too far from the data.
4. Normalize z onto [0, 1].
5. Use image or contour to produce the plot yourself.
Ideally we would take an example (maybe from ?vis.gam) and work through the above steps in full. However, you came back saying that you had quickly sorted out the problem using predict.gam, so I will leave it at a rough sketch of the steps.
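For completeness, a rough sketch of the steps above, assuming a hypothetical model gam1.5 with a single 2D smooth s(x, z) fitted to a data frame dat (adapt the names and the grid to your model):
library(mgcv)

# 1. set up a 2D grid over the covariates
n.grid <- 40
x.seq <- seq(min(dat$x), max(dat$x), length.out = n.grid)
z.seq <- seq(min(dat$z), max(dat$z), length.out = n.grid)
newdat <- expand.grid(x = x.seq, z = z.seq)

# 2. term-wise prediction; keep only the column of the 2D smooth
oo <- predict(gam1.5, newdat, type = "terms")
val <- oo[, "s(x,z)"]

# 3. pad with NA the grid points too far from the training data (as vis.gam does)
too.far <- exclude.too.far(newdat$x, newdat$z, dat$x, dat$z, dist = 0.1)
val[too.far] <- NA

# 4. normalize onto [0, 1]
val <- (val - min(val, na.rm = TRUE)) /
       (max(val, na.rm = TRUE) - min(val, na.rm = TRUE))

# 5. plot it yourself
image(x.seq, z.seq, matrix(val, n.grid, n.grid), xlab = "x", ylab = "z")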

Silhouette plot in R

I have a set of data containing:
item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.
I would like to generate a silhouette plot in R. I am having trouble with this because the examples I came across use the built-in kmeans (or related) clustering function and plot the result. I want to bypass this step and produce the plot for my own clustering algorithm, but I'm coming up short on providing the correct arguments to the plot function.
Thank you.
EDIT
Data set example https://pastebin.mozilla.org/8853427
What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html
Function silhouette in package cluster can do the plots for you. It just needs a vector of cluster membership (produced from whatever algorithm you choose) and a dissimilarity matrix (probably best to use the same one used in producing the clusters). For example:
library(cluster)
library(vegan)
data(varespec)
dis <- vegdist(varespec)
res <- pam(dis, 3)  # or whatever your choice of clustering algorithm is
sil <- silhouette(res$clustering, dis)  # or use your own cluster vector
windows()  # RStudio sometimes does not display silhouette plots correctly
plot(sil)
EDIT: For k-means (which uses squared Euclidean distance)
library(vegan)
library(cluster)
data(varespec)
dis <- dist(varespec)^2
res <- kmeans(varespec, 3)
sil <- silhouette(res$cluster, dis)
windows()
plot(sil)

Preprocess data in R

I'm using R to create a logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves the classification model.
What I would like to do is divide the continuous input variables into intervals. Let's say one variable is height in centimeters, stored as a float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into, let's say, 3 different intervals: small, medium, large.
And do this for all variables in my set; there are 32 variables.
What's more, at the end I would like to see the correlation between the values of the variables (these intervals) and the resulting classification class.
Is this clear?
Thank you very much in advance
The classification model creates a decision boundary, and existing algorithms are rather good at estimating it. Let's assume you have one variable (height) and a linear decision boundary. Your algorithm can then decide where to put the decision boundary by estimating the error on the training set. If you perform quantization and create a few intervals, your algorithm has fewer places to put the boundary (data loss). It will likely perform worse on such a cropped dataset than on the original one. Quantization could help if your learning algorithm is suffering from high variance (is overfitting the data), but then you could also try getting more training examples, using a smaller set (subset) of features, or using an algorithm with regularization and increasing the regularization parameter.
There are also many questions about how to choose the number of intervals and how to divide the data into them, e.g. should all intervals be equally frequent, of equal width, or as homogeneous as possible within each interval?
If you just want to experiment, you can use software such as the free version of RapidMiner Studio (it can read CSV and Excel files and has some quick quantization options) to convert your data.
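If you would rather stay in R, a minimal sketch of equal-width binning with cut() could look like this (it assumes Class is the 0/1 label from your code and that the remaining numeric columns are the predictors; the column name height in the last lines is hypothetical):
binned <- DATA_SET
num.vars <- setdiff(names(binned)[sapply(binned, is.numeric)], "Class")
for (v in num.vars) {
  # three equal-width intervals per variable
  binned[[v]] <- cut(binned[[v]], breaks = 3,
                     labels = c("small", "medium", "large"))
}
# cross-tabulate a binned variable against the class to inspect the relationship
table(binned$height, binned$Class)
# refit the classifier on the binned predictors
full.logit.binned <- glm(Class ~ ., data = binned, family = binomial)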
