Calculating ROC/AUC for MaxEnt and BIOMOD - r

Thanks very much Winchester for the kind help! I also saw the tutorial and that work for me! In the past two days I explored the output of both MaxEnt and BIOMOD, and I think I am still a little bit confused by the terms used within the two.
From Philips' code, it seems that he used the Sample points and backaground point to calculate ROC, while in BIOMOD, there is only prediction from the presence and pseudo absence points. which means, for the same dataset, I have the same number of presence/sample data, but different absence/background data for the two models, respectively. And when I recalculate the ROC, it is usually inconsistent with the values reported by the model themselves.
I think I still didnot get some of the point of model evaluation, concerning what is been evaluated and how to generate the evaluation dataset, ie. comfusion matrix, and which part of the data was selected as evaluation.
Thanks everybody for the kind reply! I am very sorry for the inconvenience. I appended a few more sentences to the post for BIOMOD to make it runable, as for MaxEnt, you can use the tutorial data.
Actually, the intend of my post is to find someone who have had the experience to work with both the presence/absence dataset and the presence-only dataset. I probably know how to deal with them separately , but not altogether.
I am using both MaxEnt and a few algorithms under BIOMOD for the distribution of my species, and I would like to plot the ROC/AUC in the same figue, anybody have done this before?
As far as I know, for MaxEnt, the ROC can be plotted using the ROCR and vcd library, which was given in the tutorial of MaxEnt by Philips:
install.packages("ROCR", dependencies=TRUE)
install.packages("vcd", dependencies=TRUE)
library(ROCR)
library(vcd)
library(boot)
setwd("c:/maxent/tutorial/outputs")
presence <- read.csv("bradypus_variegatus_samplePredictions.csv")
background <- read.csv("bradypus_variegatus_backgroundPredictions.csv")
pp <- presence$Logistic.prediction # get the column of predictions
testpp <- pp[presence$Test.or.train=="test"] # select only test points
trainpp <- pp[presence$Test.or.train=="train"] # select only test points
bb <- background$logistic
combined <- c(testpp, bb) # combine into a single vector
label <- c(rep(1,length(testpp)),rep(0,length(bb))) # labels: 1=present, 0=random
pred <- prediction(combined, label) # labeled predictions
perf <- performance(pred, "tpr", "fpr") # True / false positives, for ROC curve
plot(perf, colorize=TRUE) # Show the ROC curve
performance(pred, "auc")#y.values[[1]] # Calculate the AUC
While for BIOMOD, they require presence/absence data, so I used 1000 pseudo.absence points, and there is no background. I found another script given by Thuiller himself:
library(BIOMOD)
library(PresenceAbsence)
data(Sp.Env)
Initial.State(Response=Sp.Env[,12:13], Explanatory=Sp.Env[,4:10],
IndependentResponse=NULL, IndependentExplanatory=NULL)
Models(GAM = TRUE, NbRunEval = 1, DataSplit = 80,
Yweights=NULL, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=F, TSS=F, KeepPredIndependent = FALSE, VarImport=0,
NbRepPA=0, strategy="circles", coor=CoorXY, distance=2, nb.absences=1000)
load("pred/Pred_Sp277")
data=cbind(Sp.Env[,1], Sp.Env[,13], Pred_Sp277[,3,1,1]/1000)
plotroc <- roc.plot.calculate(data)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
Now, the problem is, how could I plot them altogether? I have tried both, they work well for both seperately, but exclusively. Maybe I need some one to help me understand the underling philosiphy of ROC.
Thanks in advance~
Marco

Ideally, if you are going to compare methods, you should probably generate predictions from MaxEnt and BIOMOD for each location of the testing portion of your data set (observed presences and absences). As Christian mentioned, pROC is a nice package, especially for comparing ROC curves. Although I don't have access to the data, I've generated a dummy data set which should illustrate plotting two roc curves and calculating the difference in AUC.
library(pROC)
#Create dummy data set for test observations
obs<-rep(0:1, each=50)
pred1<-c(runif(50,min=0,max=0.8),runif(50,min=0.3,max=0.6))
pred2<-c(runif(50,min=0,max=0.6),runif(50,min=0.4,max=0.9))
roc1<-roc(obs~pred1) # Calculate ROC for each method
roc2<-roc(obs~pred2)
#Plot roc curves for each method
plot(roc1)
lines(roc2,col="red")
#Compare differences in area under ROC
roc.test(roc1,roc2,method="bootstrap",paired=TRUE)

I still couldnt get your code to work, but here is an example with the demonstration data from the package PresenceAbsence. I've plotted your lines, then added a bold line for the ROC. If you were labelling it, the false positive rate is on the x-axis, with the false negative rate on the y-axis, but I think that would not be accurate with the other lines that are present. Is this what you wanted to do?
data(SIM3DATA)
plotroc <- roc.plot.calculate(SIM3DATA,which.model=2, xlab = NULL, ylab = NULL)
plot(plotroc$threshold, plotroc$sensitivity, type="l", col="blue ")
lines(plotroc$threshold, plotroc$specificity)
lines(plotroc$threshold, (plotroc$specificity+plotroc$sensitivity)/2, col="red")
lines(1 - plotroc$specificity, plotroc$sensitivity, lwd = 2, lty = 5)

I have been using the pROC package. It has a lot of nice features when it comes to plotting ROC and AUC in the same graph. Furthermore it is very use.

Related

Issues with ROC curves of logistic regression model in R

I did some analysis with my data but I found that all the ROC plots have the threshold points consolidated at the base of the graphs. Is the issue from the data itself or from the package used?
library(ROCR)
ROCRPred = prediction(res2, test_set$WRF)
ROCRPref <- performance(ROCRPred,"tpr","fpr")
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=seq(0.1,by = 0.1))
Why did you choose cutoffs between 0.1 and 1?
print.cutoffs.at=seq(0.1,by = 0.1)
You need to adapt them to your data. For instance you could use the quantiles:
plot(ROCRPref, colorize=TRUE, print.cutoffs.at=quantile(res2))
I think "the threshold points are consolidated at the base of the graphs" due to the parameter 'print.cutoffs.at' in plot.
According to the documentation
print.cutoffs.at:-This vector specifies the cutoffs which should be printed as text along the curve at the corresponding curve positions.
As mentioned by #Calimo in his answer to adapt the threshold according to the data.

exponential regression with R ( and negative values)

I am trying to fit a curve to a set of data points but did not succeed. So I ask you.
plot(time,val) # look at data
exponential.model <- lm(log(val)~ a) # compute model
fit <- exp(predict(exponential.model,list(Time=time))) # create the fitted curve
plot(time,val)#plot it again
lines(time, fit,lwd=2) # show the fitted line
My only problem is, that my data contains negative values and so log(val) produces a lot of NA making the model computation crash.
I know that my data does not necessarily look like exponential , but I want to see the fit anyway. I also used another program which shows me val=27.1331*exp(-time/2.88031) is a nice fit but I do not know, what I am doing wrong.
I want to compute it with R.
I had the idea to shift data so no negative values remain, but result is poor and quite sure wrong.
plot(time,val+20) # look at data
exponential.model <- lm(log(val+20)~ a) # compute model
fit <- exp(predict(exponential.model,list(Time=time))) # create the fitted curve
plot(time,val)#plot it again
lines(time, fit-20,lwd=2) # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val)~ a) # compute model
The log(val) term is trying to rescale the values, so a linear model can be applied. Since this not possible to my values, you have to use a non-linear model (nls).
exponential.model <- nls(val ~ a*exp(b*time), start=c(b=-0.1,h=30))
This worked fine for me.
satisfying fit

Differing color of points & line in plot.gamm()

I ran a GAMM model with a large dataset (over 20,000 cases) using mgcv. Because of the large number of data points, it is very difficult to see the smoothed lines among the residual points in the plot. Is it possible to specify different colors for the points and the smoothed fit lines?
Here is an example adopted from the mgcv documentation:
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, pages=1, residuals=T, col='#FF8000', shade=T, shade.col='gray90')
plot(absbmidiffLog.GAMM$gam, pages=1, residuals=T, pch=19, cex=0.01, scheme=1,
col='#FF8000', shade=T,shade.col='gray90')
I have looked into the visreg package, but it does not seem to work with gamma objects.
I also found it surprisingly difficult/impossible to choose 2 different colors.
A workaround for me is to add the points later, i.e first call plot.gam with residuals=FALSE and then add the points with the base R points() call.
This only works properly though if you shift the gam plot to its proper mean. Here is the code for one of the terms. (Use a for loop to get all four on one page)
library(mgcv)
## simple examples using gamm as alternative to gam
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
b <- gamm(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
plot(b$gam, select=3, shift = coef(b$gam)[1], residuals=FALSE, col='#FF8000', shade=T, shade.col='gray90')
points(y~x3, data=dat,pch=20,cex=0.75,col=rgb(1,0.65,0,0.25))

NMDS ordination interpretation from R output

I have conducted an NMDS analysis and have plotted the output too. However, I am unsure how to actually report the results from R. Which parts from the following output are of most importance? The graph that is produced also shows two clear groups, how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress=0 which means the fit is complete and there is still no convergence. This happens if you have six or fewer observations for two dimensions, or you have degenerate data. You should not use NMDS in these cases. Current versions of vegan will issue a warning with near zero stress. Perhaps you had an outdated version.
I think the best interpretation is just a plot of principal component. yOu can use plot and text provided by vegan package. Here I am creating a ggplot2 version( to get the legend gracefully):
library(vegan)
library(ggplot2)
data(dune)
ord = metaMDS(comm = dune)
ord_spec <- scores(ord, "spec")
ord_spec <- cbind.data.frame(ord_spec,label=rownames(ord_spec))
ord_sites <- scores(ord, "sites")
ord_sites <- cbind.data.frame(ord_sites,label=rownames(ord_sites))
ggplot(data=ord_spec,aes(x=NMDS1,y=NMDS2)) +
geom_text(aes(label=label,col='species')) +
geom_text(data=ord_sites,aes(label=label,col='sites'))

How to draw the regression line and the scatterplot between observed and predicted in R

How can I draw observed vs predicted scatterplot in R? I also want to show the regression line in it.
library(car)
summary(Prestige)
head(Prestige)
testidx <- which(1:nrow(Prestige)%%4==0)
prestige_train <- Prestige[-testidx,]
prestige_test <- Prestige[testidx,]
model <- glm(prestige~., data=prestige_train)
# Use the model to predict the output of test data
prediction <- predict(model, newdata=prestige_test)
# Check for the correlation with actual result
plot(prediction, prestige_test$prestige)
Edit:
abline(model)
The above function abline does not seem to work. It does draw a line but it seem to be incorrect.
Thanks
I don't think there is much need for me to write out the answer in its entirety, when there exists a much more beautiful (and probably more informed) explanation by Hadley Wickham. It is to be found here (first part) and here (second, more advanced part).
EDIT: My apologies for not giving a more specific answer from the start. Here how the plot can be made:
# define training sample
sample <- sample(nrow(Prestige),40,replace=FALSE)
training.set <- Prestige[sample,]
test.set <- Prestige[-sample,]
model <- glm(prestige ~ .,data=training.set)
# getting predicted values
test.set$predicted <- predict(model,newdata=test.set)
# plotting with base plot
with(test.set,plot(prestige,predicted))
# plotting with ggplot2
require(ggplot2)
qplot(prestige,predicted,data=test.set)
Hope this answers your question.
P.S. The question is indeed more appropriate for SO, it would have been answered there in a matter of minutes :)

Resources