I have conducted an NMDS analysis and plotted the output, but I am unsure how to report the results from R. Which parts of the following output are most important? The plot also shows two clear groups. How should I describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that the stress is 0, which means the fit is perfect, and that there is still no convergent solution. This happens if you have six or fewer observations for two dimensions, or if your data are degenerate. You should not use NMDS in these cases. Current versions of vegan issue a warning when the stress is near zero, so perhaps you had an outdated version.
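The effect of having very few observations can be sketched with simulated data. This is only an illustration under stated assumptions: it uses MASS::isoMDS as a stand-in for vegan's metaMDS, Euclidean distances instead of Bray-Curtis, and random matrices (few, many) that are hypothetical and unrelated to dgge2.

```r
library(MASS)  # isoMDS(); a stand-in for vegan::metaMDS() in this sketch

set.seed(1)
# Hypothetical data: random "communities" with 10 species each
few  <- matrix(runif(5 * 10),  nrow = 5)   # only 5 observations
many <- matrix(runif(30 * 10), nrow = 30)  # 30 observations

# Non-metric MDS in two dimensions; isoMDS() reports stress in percent
fit_few  <- isoMDS(dist(few),  k = 2, trace = FALSE)
fit_many <- isoMDS(dist(many), k = 2, trace = FALSE)

# Compare the two stress values: with so few points the ordination has
# almost nothing to constrain it, so a low stress says little about the data
c(few = fit_few$stress, many = fit_many$stress)
```

For a real analysis, the number of observations (nrow of the community matrix) is worth checking before trusting a near-zero stress.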
I think the best way to interpret the result is simply to plot the ordination axes. You can use the plot and text methods provided by the vegan package. Here I create a ggplot2 version (to get a legend easily):
library(vegan)
library(ggplot2)
data(dune)
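ord <- metaMDS(comm = dune)
# Extract species and site scores and keep the row names as labels
ord_spec <- scores(ord, display = "species")
ord_spec <- cbind.data.frame(ord_spec, label = rownames(ord_spec))
ord_sites <- scores(ord, display = "sites")
ord_sites <- cbind.data.frame(ord_sites, label = rownames(ord_sites))
# Species and sites share the same axes; colour separates them in the legend
ggplot(data = ord_spec, aes(x = NMDS1, y = NMDS2)) +
  geom_text(aes(label = label, col = 'species')) +
  geom_text(data = ord_sites, aes(label = label, col = 'sites'))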
I am using the vegan package to do RDA and want to plot the data using biplot. My data contain hundreds of values. What I would like to do is restrict the plot to variables whose explained variance exceeds a set limit, 0.1 in the example below. So instead of having 44 arrows I might have only, say, 8.
library(vegan) # load library
library(MASS) # load library
data(varespec) # example data
vare.pca <- rda(varespec, scale = TRUE) # rda() without constraints, i.e. a PCA
biplot(vare.pca, scaling = 3, display = "species") # plots all species
## Extract the percentage contribution of each species to axis 1
x <- sort(round(100 * scores(vare.pca, display = "sp", scaling = 0)[, 1]^2, 3), decreasing = TRUE)
## Plot the percentages
plot(length(x):1, sort(x)) # sorted values plotted against rank
Any help would be appreciated :)
Depending on the size of the data set, you could use either ordistep or ordiR2step to reduce the number of "unimportant" variables in your plot (see https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/ordistep). However, these functions use step-wise selection, which needs to be used cautiously. Step-wise selection chooses the included parameters based on AIC values, R2 values or p-values. It does not select them based on their importance for the purpose of your question, nor does it mean that these variables have any meaning for organisms or biochemical interactions. Nevertheless, step-wise selection can help give an idea of which parameters might strongly influence the overall variation in the data set. A simple example is below.
rda0 <- rda(varespec ~1, varespec)
rda1 <- rda(varespec ~., varespec)
rdaplotp <- ordistep(rda0, scope = formula(rda1))
plot(rdaplotp, display = "species", type = "n")
text(rdaplotp, display="bp")
Thus, by using the ordistep function, the number of species displayed in the plot has been greatly reduced (see Fig 1 below). If you want to remove more variables (which I do not suggest), one option is to look at the biplot scores and drop the variables that are least correlated with the principal components (see below), but I would advise against it.
sumrda <- summary(rdaplotp)
sumrda$biplot
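As a rough sketch of that thresholding idea (which, as noted, should be used with caution), the following uses base R's prcomp on the built-in mtcars data as a stand-in for rda on varespec. The 10% cut-off and the object names contrib, threshold and keep are assumptions made for illustration only.

```r
# Stand-in data (assumption): mtcars instead of varespec
pca <- prcomp(mtcars, scale. = TRUE)

# Each eigenvector has unit length, so the squared loadings on one axis
# sum to 1 and can be read as percentage contributions to that axis
contrib <- 100 * pca$rotation[, 1]^2

threshold <- 10  # hypothetical cut-off in percent
keep <- names(contrib)[contrib >= threshold]

# Draw arrows only for the variables that pass the cut-off
load_keep <- pca$rotation[keep, 1:2, drop = FALSE]
plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "PC1", ylab = "PC2")
arrows(0, 0, load_keep[, 1], load_keep[, 2], length = 0.1)
text(load_keep, labels = keep, pos = 3)
```

The cut-off here looks at the first axis only; a variable dropped this way may still matter on later axes, which is one reason to treat such filtering as a display choice rather than as variable selection.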
It would be wise to first decide which question you want to answer and see whether any of the included variables could be left out beforehand. This would already reduce the number. Minor edit: I am also a bit confused about why you want to remove parameters that contribute strongly to your captured variation.
I have performed PCA using prcomp in R with my databases of 75-76 indicator variables and 7232 companies, including NAs. Before applying the function, I centred my data, but did not rescale them because they are all indicator variables. (Is my reasoning correct?)
After that I varimax-rotated the loadings of the 2 or 3 first principal components following the instructions by amoeba here.
Since I had centred, but not rescaled my data, I changed the code to:
Varimax_results <- varimax(rawLoadings, normalize = FALSE)
VarimaxLoadings <- Varimax_results$loadings # assumed definition; not shown in the original post
invLoadings <- t(pracma::pinv(VarimaxLoadings))
scores <- scale(DatosPCA, scale = FALSE) %*% invLoadings
Now I am trying to figure out why the scores given by "prcomp" and the scores obtained using the code above are not the same.
I am probably missing some theoretical background, so I would be grateful if someone could tell me if the scores are supposed to be the same and, in that case, what I am doing wrong in my code. If they are not supposed to be the same, which ones should I use?
Thank you very much!
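For reference, the procedure described above can be reproduced end to end on a small built-in data set. This is a sketch under assumptions, not the poster's actual pipeline: it uses the numeric columns of iris instead of the indicator data, MASS::ginv in place of pracma::pinv, and assumes rawLoadings means the eigenvectors scaled by the component standard deviations. It shows that the rotated scores differ from the scores stored in the prcomp result (pca$x), and that they equal the standardised prcomp scores multiplied by the varimax rotation matrix.

```r
# Stand-in data (assumption): centred but not rescaled, as in the question
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

k <- 2  # number of components to rotate
# Assumed definition of rawLoadings: eigenvectors times standard deviations
rawLoadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k], k, k)

vm <- varimax(rawLoadings, normalize = FALSE)
rotatedLoadings <- unclass(vm$loadings)

# Pseudo-inverse of the rotated loadings gives the score coefficients
invLoadings <- t(MASS::ginv(rotatedLoadings))
rotatedScores <- X %*% invLoadings

# Standardised (unit-variance) PCA scores, then the same rotation applied
stdScores <- pca$x[, 1:k] %*% diag(1 / pca$sdev[1:k], k, k)

# TRUE: rotated scores are the standardised scores times the rotation matrix
same_after_rotation <- isTRUE(all.equal(unname(rotatedScores),
                                        unname(stdScores %*% vm$rotmat)))
# FALSE: they are not the unrotated, unstandardised scores in pca$x
same_as_prcomp <- isTRUE(all.equal(unname(rotatedScores),
                                   unname(pca$x[, 1:k])))
```

In this sketch the two sets of scores are therefore not expected to match: the rotated ones are both rescaled and rotated relative to pca$x.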
Here is my sample code for SVM classification.
library(e1071) # provides svm(); assumed to be the package in use
train <- read.csv("traindata.csv")
test <- read.csv("testdata.csv")
svm.fit <- svm(as.factor(value) ~ ., data = train, kernel = "linear", method = "class")
svm.pred <- predict(svm.fit, test, type = "class")
The feature value in my example is a factor with two levels (either true or false). I want to plot my SVM classifier and split the points into two groups: one group with "true" and another with "false". How do we produce a 3D or 2D SVM plot? I tried plot(svm.fit, train), but it does not seem to work for me.
There is this answer I found on SO, but I am not clear about what t, x, y, z, w, and cl are in the answer:
Plotting data from an svm fit - hyperplane
I have about 50 features in my dataset, of which the last column is a factor. Is there a simple way of doing this, or could anyone help explain that answer?
The short answer is: you cannot. Your data is 50-dimensional, and you cannot plot 50 dimensions. All you can do is use rough approximations, reductions and projections, but none of these can actually represent what is happening inside. To plot a 2D/3D decision boundary, your data has to be 2D/3D (2 or 3 features, which is exactly what is happening in the link provided: they only have 3 features, so they can plot all of them). With 50 features you are left with statistical analysis and no actual visual inspection.
You can of course look at some slices (select 3 features, or the main components of a PCA projection). If you are not familiar with the underlying linear algebra, you can use the gmum.r package, which does this for you. Train the SVM and plot it, forcing the "pca" visualization, as here: http://r.gmum.net/samples/svm.basic.html.
library(gmum.r)
# We will perform basic classification on breast cancer dataset
# using LIBSVM with linear kernel
data(svm_breast_cancer_dataset)
# We can pass either formula or explicitly X and Y
svm <- SVM(X1 ~ ., svm.breastcancer.dataset, core="libsvm", kernel="linear", C=10)
## optimization finished, #iter = 8980
pred <- predict(svm, svm.breastcancer.dataset[,-1])
plot(svm, mode="pca")
This gives a plot of the data projected onto its principal components. For more examples, refer to the project website: http://r.gmum.net/
However, this only shows the projected points and their classification. You cannot see the hyperplane, because it is a high-dimensional object (in your case 49-dimensional), and in such a projection the hyperplane would cover the whole screen. No pixel would be left "outside". Think about it in these terms: if you have a 3D space with a hyperplane inside, that hyperplane is a 2D plane. If you now try to plot it in 1D, the whole line ends up "filled" by your hyperplane, because wherever you place a line in 3D, the projection of the 2D plane onto that line fills it up. The only other possibility is that the line is perpendicular to the plane, in which case the projection is a single point. The same applies here: if you project a 49-dimensional hyperplane onto 3D, the whole screen ends up "black".
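The projection idea can also be sketched in base R, without gmum.r. The example below is only an illustration under assumptions: the 50-feature data are simulated, and the class labels in cls are placeholders. In practice you would colour the points by the predictions of your fitted SVM instead.

```r
set.seed(1)
n <- 200   # number of observations (hypothetical)
p <- 50    # number of features, as in the question

# Simulated feature matrix and a two-level class that depends on two features
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
cls <- factor(X[, 1] + X[, 2] > 0, labels = c("false", "true"))

# Project the 50-dimensional data onto the first two principal components
pc <- prcomp(X, center = TRUE, scale. = TRUE)
proj <- pc$x[, 1:2]

# Scatter plot of the projection, coloured by class
cols <- c("red", "blue")
plot(proj, col = cols[cls], pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(cls), col = cols, pch = 19)
```

As explained above, such a plot shows where the points fall in the projection, not the separating hyperplane itself.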
I ran a GAMM model with a large dataset (over 20,000 cases) using mgcv. Because of the large number of data points, it is very difficult to see the smoothed lines among the residual points in the plot. Is it possible to specify different colors for the points and the smoothed fit lines?
Here is an example adapted from the mgcv documentation:
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, pages=1, residuals=T, col='#FF8000', shade=T, shade.col='gray90')
plot(absbmidiffLog.GAMM$gam, pages=1, residuals=T, pch=19, cex=0.01, scheme=1,
col='#FF8000', shade=T,shade.col='gray90')
I have looked into the visreg package, but it does not seem to work with gamm objects.
I also found it surprisingly difficult, if not impossible, to choose two different colors.
A workaround for me is to add the points later, i.e. first call plot.gam with residuals=FALSE and then add the points with the base R points() call.
This only works properly, though, if you shift the gam plot by the model intercept. Here is the code for one of the terms. (Use a for loop to get all four on one page.)
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, select=3, shift = coef(b$gam)[1], residuals=FALSE, col='#FF8000', shade=T, shade.col='gray90')
points(y~x3, data=dat,pch=20,cex=0.75,col=rgb(1,0.65,0,0.25))
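The for loop mentioned above might look like the following sketch. It reuses the simulated data from the example; the vector xvars and the blue point colour are assumptions introduced here so that the points and the orange fit lines are clearly distinguishable.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(1, n = 200, scale = 2)
b <- gamm(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

xvars <- c("x0", "x1", "x2", "x3")  # covariates, in the order of the smooths
op <- par(mfrow = c(2, 2))          # four panels on one page

for (i in seq_along(xvars)) {
  # Smooth and confidence band only, shifted by the intercept
  plot(b$gam, select = i, shift = coef(b$gam)[1], residuals = FALSE,
       col = "#FF8000", shade = TRUE, shade.col = "gray90")
  # Raw responses (not partial residuals), in a second, semi-transparent colour
  points(dat[[xvars[i]]], dat$y, pch = 20, cex = 0.75,
         col = rgb(0, 0, 1, 0.25))
}

par(op)  # restore the previous graphics settings
```

Because the points are raw responses, some of them can fall outside the y-range that plot.gam chooses for the smooth.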
Thanks very much, Winchester, for the kind help! I also saw the tutorial, and that worked for me! Over the past two days I explored the output of both MaxEnt and BIOMOD, and I think I am still a little confused by the terms used in the two.
From Phillips' code, it seems that he used the sample points and background points to calculate the ROC, while in BIOMOD there are only predictions for the presence and pseudo-absence points. This means that, for the same dataset, I have the same number of presence/sample data but different absence/background data for the two models. And when I recalculate the ROC, it is usually inconsistent with the values reported by the models themselves.
I think I still have not understood some points of model evaluation: what is being evaluated, how the evaluation dataset (i.e. the confusion matrix) is generated, and which part of the data is selected for evaluation.
Thanks, everybody, for the kind replies! I am very sorry for the inconvenience. I appended a few more sentences to the post for BIOMOD to make it runnable; as for MaxEnt, you can use the tutorial data.
Actually, the intent of my post is to find someone who has experience working with both presence/absence and presence-only datasets. I probably know how to deal with them separately, but not together.
I am using both MaxEnt and a few algorithms under BIOMOD to model the distribution of my species, and I would like to plot the ROC/AUC in the same figure. Has anybody done this before?
As far as I know, for MaxEnt, the ROC can be plotted using the ROCR and vcd libraries, as given in the MaxEnt tutorial by Phillips:
install.packages("ROCR", dependencies=TRUE)
install.packages("vcd", dependencies=TRUE)
library(ROCR)
library(vcd)
library(boot)
setwd("c:/maxent/tutorial/outputs")
presence <- read.csv("bradypus_variegatus_samplePredictions.csv")
background <- read.csv("bradypus_variegatus_backgroundPredictions.csv")
pp <- presence$Logistic.prediction # get the column of predictions
testpp <- pp[presence$Test.or.train=="test"] # select only test points
trainpp <- pp[presence$Test.or.train=="train"] # select only training points
bb <- background$logistic
combined <- c(testpp, bb) # combine into a single vector
label <- c(rep(1,length(testpp)),rep(0,length(bb))) # labels: 1=present, 0=random
pred <- prediction(combined, label) # labeled predictions
perf <- performance(pred, "tpr", "fpr") # True / false positives, for ROC curve
plot(perf, colorize=TRUE) # Show the ROC curve
performance(pred, "auc")@y.values[[1]] # Calculate the AUC
BIOMOD, by contrast, requires presence/absence data, so I used 1000 pseudo-absence points, and there is no background. I found another script given by Thuiller himself:
library(BIOMOD)
library(PresenceAbsence)
data(Sp.Env)
Initial.State(Response=Sp.Env[,12:13], Explanatory=Sp.Env[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL)
Models(GAM = TRUE, NbRunEval = 1, DataSplit = 80,
Yweights=NULL, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=F, TSS=F, KeepPredIndependent = FALSE, VarImport=0,
NbRepPA=0, strategy="circles", coor=CoorXY, distance=2, nb.absences=1000)
load("pred/Pred_Sp277")
data=cbind(Sp.Env[,1], Sp.Env[,13], Pred_Sp277[,3,1,1]/1000)
plotroc <- roc.plot.calculate(data)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
Now, the problem is: how can I plot them together? I have tried both, and each works well separately, but not together. Maybe I need someone to help me understand the underlying philosophy of ROC.
Thanks in advance~
Marco
Ideally, if you are going to compare methods, you should probably generate predictions from MaxEnt and BIOMOD for each location of the testing portion of your data set (observed presences and absences). As Christian mentioned, pROC is a nice package, especially for comparing ROC curves. Although I don't have access to the data, I've generated a dummy data set which should illustrate plotting two roc curves and calculating the difference in AUC.
library(pROC)
#Create dummy data set for test observations
obs<-rep(0:1, each=50)
pred1<-c(runif(50,min=0,max=0.8),runif(50,min=0.3,max=0.6))
pred2<-c(runif(50,min=0,max=0.6),runif(50,min=0.4,max=0.9))
roc1<-roc(obs~pred1) # Calculate ROC for each method
roc2<-roc(obs~pred2)
#Plot roc curves for each method
plot(roc1)
lines(roc2,col="red")
#Compare differences in area under ROC
roc.test(roc1,roc2,method="bootstrap",paired=TRUE)
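To make explicit what such a curve contains, here is a minimal base R sketch. It is not pROC's implementation: the helper functions roc_points and auc_trapezoid are hypothetical names introduced here, and the predictions are simulated in the same way as the dummy data above. Each ROC point is the true positive rate and false positive rate obtained at one threshold, computed on the same observed presences and absences for both methods, which is what allows the two curves to share one plot.

```r
set.seed(42)
# Dummy observations (0 = absence, 1 = presence) and two sets of predictions
obs   <- rep(0:1, each = 50)
pred1 <- c(runif(50, min = 0, max = 0.8), runif(50, min = 0.3, max = 0.6))
pred2 <- c(runif(50, min = 0, max = 0.6), runif(50, min = 0.4, max = 0.9))

# TPR and FPR at every observed threshold, from strictest to most lenient
roc_points <- function(obs, pred) {
  thr <- sort(unique(pred), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(pred[obs == 1] >= t))
  fpr <- sapply(thr, function(t) mean(pred[obs == 0] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))  # start the curve at (0, 0)
}

# Area under the curve by the trapezoid rule
auc_trapezoid <- function(r) {
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
}

r1 <- roc_points(obs, pred1)
r2 <- roc_points(obs, pred2)

# Both curves on the same axes
plot(r1$fpr, r1$tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
lines(r2$fpr, r2$tpr, col = "red")
abline(0, 1, lty = 3)  # reference line for a random classifier

auc1 <- auc_trapezoid(r1)
auc2 <- auc_trapezoid(r2)
```

The key design point is that both models are scored against the same obs vector. Curves built from different absence or background sets, as in the MaxEnt and BIOMOD scripts, are not directly comparable.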
I still couldn't get your code to work, but here is an example with the demonstration data from the package PresenceAbsence. I've plotted your lines, then added a bold line for the ROC. If you were labelling it, the false positive rate (1 - specificity) is on the x-axis, with the true positive rate (sensitivity) on the y-axis, but I think those labels would not be accurate for the other lines that are present. Is this what you wanted to do?
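library(PresenceAbsence) # provides SIM3DATA and roc.plot.calculate()
data(SIM3DATA)
plotroc <- roc.plot.calculate(SIM3DATA,which.model=2, xlab = NULL, ylab = NULL)
# Sensitivity, specificity and their mean against the threshold
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
# The ROC curve itself: sensitivity against 1 - specificity
lines(1 - plotroc$specificity, plotroc$sensitivity, lwd = 2, lty = 5)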
I have been using the pROC package. It has a lot of nice features when it comes to plotting ROC and AUC in the same graph. Furthermore, it is very easy to use.