How to plot a ROC curve for cross-validation from Weka output for binary-class and multiclass data?

I have tried different Matlab functions like plotroc and R packages like pROC, ROCR and cvAUC. Each package or function produces a different graph and gives a different AUC than the Weka result.
I would like to compare multiple classifiers using 10-fold cross-validation and would like to plot the ROC curve of each. I have collected the results in Weka, but I don't want to plot them in Weka itself.
My experiments are based on both binary-class and multi-class data.
My Weka output (cross-validated instance predictions) is at https://drive.google.com/folderview?id=0BzRIG3kN-8Z9fnh5OElKTExNT2NuZUVna2tKcmMzU1VBankwdVc2OGxBSXFnaFJqSEhHYVE&usp=sharing
Please suggest how I can plot the graphs for the attached results, for the binary-class as well as the multiclass case.

This is a placeholder answer, but the first thing to note is that one of your observations got cross-validated fewer than 10 times:
library(pROC)
library(dplyr)
filenameROC = "Data/term3_IBk_3_multiclass.txt"
fileROC = readLines(filenameROC)
dfCV = read.csv2(text = fileROC,
nrows = length(fileROC) - 51 - 19,
header = TRUE,
sep = ",",
skip = 19, stringsAsFactors = FALSE)
dfCV %>%
group_by(inst.) %>%
tally() %>%
filter(n < 10)
Which gives:
> dfCV %>%
+ group_by(inst.) %>%
+ tally() %>%
+ filter( n < 10)
Source: local data frame [1 x 2]
inst. n
1 773 4
Can you explain this?
Additionally, you also need to add a cross-validation iteration identifier. Once you do that it is simply a question of running multiclass.roc from the pROC package by CV iteration.
Edit:
OP claims that there are 7724 observations, whereas it is easy to see that there are 773 observations: 772 of them repeated 10 times and observation number 773 repeated 4 times (772 × 10 + 4 = 7724 rows), consistent with 10-fold cross-validation data:
> dfCV %>%
+ group_by(inst.) %>%
+ tally()
Source: local data frame [773 x 2]
inst. n
1 1 10
2 2 10
3 3 10
4 4 10
5 5 10
6 6 10
7 7 10
8 8 10
9 9 10
10 10 10
.. ... ..
Edit 2:
Here is the code to produce the multi-class ROC by CV fold:
dfCVROC = dfCV %>%
dplyr::filter(inst. != 773) %>%
arrange(inst.) %>%
dplyr::mutate(cvfold = rep.int(1:10, 772)) %>%
group_by(cvfold) %>%
do(multiclass_roc = multiclass.roc(as.factor(.$actual), as.numeric(.$prediction)))
# get the AUCs by CV fold
sapply(dfCVROC$multiclass_roc, function(x) x$auc)

I didn't find an exact solution for the issue. However, here are some points that I observed from the Weka output:
When Weka plots the ROC curve, it gets the predictions directly from the classifier evaluation output.
Weka uses predictions with up to 6 decimal places for calculating threshold values (more precision yields more distinct threshold values for the ROC curve).
By default in the Weka Explorer, the classifier outputs predictions with up to 3 decimal places only (as in my attached experiment results).
Apart from this, I didn't understand how Weka calculates the threshold values from the predictions. With the same Weka prediction output, I found different threshold values in Weka and in R (and Matlab).
Finally, I used the Weka API code for plotting ROC (Generate ROC Curve) to extract the TPR and FPR for the experiments (I re-ran all the experiments). After extracting the TPR and FPR, I can plot the graphs in any tool such as Excel, gnuplot, Matlab or R.
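To illustrate why the number of decimal places matters, here is a minimal base R sketch that builds ROC points with one threshold per distinct score and integrates them with the trapezoid rule. It is not Weka's algorithm; the scores and labels are simulated, and the helper names roc_points and auc_trapezoid are made up for this example.

```r
# Hypothetical scores and labels, simulated for illustration only
set.seed(1)
labels <- rbinom(200, 1, 0.4)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, -1)))

# One ROC point per distinct score, from the strictest threshold down
roc_points <- function(scores, labels) {
  thr <- sort(unique(scores), decreasing = TRUE)
  tpr <- sapply(thr, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
  fpr <- sapply(thr, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))
  # Prepend the (0, 0) corner, where nothing is predicted positive
  data.frame(threshold = c(Inf, thr), fpr = c(0, fpr), tpr = c(0, tpr))
}

# Trapezoid rule over consecutive ROC points
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc_full <- roc_points(scores, labels)            # full precision
roc_3dp  <- roc_points(round(scores, 3), labels)  # rounded to 3 decimals

# Rounding merges nearby scores, so there can be fewer thresholds
c(full = nrow(roc_full), rounded = nrow(roc_3dp))
c(full = auc_trapezoid(roc_full), rounded = auc_trapezoid(roc_3dp))

plot(roc_full$fpr, roc_full$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
lines(roc_3dp$fpr, roc_3dp$tpr, lty = 2)
```

For cross-validated predictions, the same two functions could be applied per fold or on the pooled predictions, depending on which of the two the reference tool reports.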

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V" or leaving it at the default does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, this appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted the mean of cluster 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see spike on histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust and other GMM implementations are probabilistic models, which estimate the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which provides a hard assignment. The likelihood of the model is built from the density of each data point under every cluster, weighted by the mixing proportions; you can also check this in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means if you have a data point that is around the means of 1 or 2, it will be consistently assigned to cluster 2, for example let's try to predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
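The same effect can be reproduced by hand. The sketch below is a minimal base R illustration, not mclust's internal code: it plugs the parameters printed by summary(fit) in the question into the posterior formula for an equal-variance mixture and takes the most probable component for each point. The helper name posterior and the range 1000:2000 are chosen only for this example.

```r
# Parameters copied from the summary(fit) output in the question
pro   <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)  # mixing probabilities
means <- c(1381.391, 1381.715, 1574.335, 1851.667)
sdev  <- sqrt(7466.189)                                  # shared variance ("E" model)

# Posterior membership probabilities: rows are points, columns are components
posterior <- function(x) {
  dens <- sapply(seq_along(means),
                 function(k) pro[k] * dnorm(x, means[k], sdev))
  dens / rowSums(dens)
}

x   <- 1000:2000
z   <- posterior(x)
cls <- max.col(z, ties.method = "first")  # hard assignment = most probable component

# Component 1 has almost the same mean as component 2 but a smaller
# mixing proportion, so it is never the most probable one on this range
table(factor(cls, levels = 1:4))
```

With equal variances and nearly equal means, the ratio of the two posteriors is essentially the ratio of the mixing proportions, which is why component 1 stays empty after the hard assignment.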
To answer your question: no, you did not do anything wrong; it's a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal (spherical) variance, consider using a fuzzy k-means:
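library(ClusterR)
# fit_kmeans is a ClusterR k-means fit with 3 clusters; the call that creates it is not shown in the original answer
plot(NULL,xlim=range(data$value),ylim=c(0,4),ylab="cluster",yaxt="n",xlab="values")
points(data$value,fit_kmeans$clusters,pch=19,cex=0.1,col=factor(fit_kmeans$clusters))
axis(2,1:3,as.character(1:3))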
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

different output for PR AUC for different R packages

I find different numeric values for the computation of the Area Under the Precision Recall Curve (PRAUC) with the dataset I am working on when computed via 2 different R packages: yardstick and caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, but only with my dataset (which is strange as well).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: 1 - yes
obs: actual class label
pred: predicted class label (with .5 threshold)
Here follows the code to produce the 2 values of PRAUC:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a little difference, due to the approximation when computing the area (interpolating the curve), but this seems way too high.
I even tried a third package, PRROC, and the result is still different, namely around .26.
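To get a feel for how much the integration rule alone can change the area, here is a minimal base R sketch on a tiny made-up example. It compares a trapezoid rule with a step-wise (average-precision style) sum over the same precision-recall points. It makes no claim about which rule yardstick, caret or PRROC actually use; the data and the helper name pr_points are hypothetical.

```r
# Tiny hypothetical example: six scored cases, three of them positive
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
labels <- c(1, 0, 1, 0, 0, 1)

# One precision-recall point per distinct score threshold
pr_points <- function(scores, labels) {
  thr <- sort(unique(scores), decreasing = TRUE)
  tp  <- sapply(thr, function(t) sum(scores >= t & labels == 1))
  fp  <- sapply(thr, function(t) sum(scores >= t & labels == 0))
  data.frame(recall = tp / sum(labels == 1), precision = tp / (tp + fp))
}

pr <- pr_points(scores, labels)

# Start the curve at recall 0, carrying the first precision value back
# (one convention among several; this choice also affects the area)
rec  <- c(0, pr$recall)
prec <- c(pr$precision[1], pr$precision)

# Trapezoid rule: linear interpolation between consecutive points
auc_trap <- sum(diff(rec) * (head(prec, -1) + tail(prec, -1)) / 2)

# Step-wise sum: precision at each point times the gain in recall
auc_step <- sum(diff(rec) * tail(prec, -1))

c(trapezoid = auc_trap, step = auc_step)
```

Even on six points the two rules disagree noticeably, so comparing packages is only meaningful once the interpolation rule and the handling of the first point are known to be the same.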

Error in generating a poisson point pattern with calculated lambda

I have a two-dimensional point pattern (no marks) and I am trying to test for clustering in the presence of spatial inhomogeneity using envelopes and the inhomogeneous pair correlation function. I am estimating an inhomogeneous intensity function for the data using the density.ppp function. Here is some sample data:
x y
1 533.03 411.58
2 468.39 622.92
3 402.86 530.94
4 427.13 616.81
5 495.20 680.62
6 566.61 598.99
7 799.03 585.16
8 1060.09 544.23
9 144.66 747.40
10 138.14 752.92
11 449.49 839.15
12 756.45 713.72
13 741.01 728.41
14 760.22 740.28
15 802.34 756.21
16 799.04 764.89
17 773.81 771.97
18 768.41 720.07
19 746.14 754.11
20 815.40 765.14
There are ~1700 data points overall
Here is my code:
library("spatstat")
WT <- read.csv("Test.csv")
colnames(WT) <- c("x","y")
#determine bounding window
win <- ripras(WT)
unitname(win) <- c("micrometer")
#convert to ppp data class
WT.ppp <- as.ppp(WT, win)
plot(WT.ppp)
#estimate intensity function using cross validation
I <- density.ppp(WT.ppp,sigma=bw.diggle(WT.ppp),adjust=0.3,kernal="epanechnikov")
plot(I)
#predetermined r values for PCF
radius <- seq(from = 0, to = 50, by = 0.5)
#use envelopes to test the null hypothesis (ie. inhomogenous poisson process)
PCF_envelopes <- envelope(WT.ppp,divisor="d", pcfinhom,r = radius,nsim=10,simulate=expression(rpoispp(I)) )
When I run rpoispp(I), I get the following error:
Error in sample.int(npix, size = ni, replace = TRUE, prob = lpix) :
negative probability
I can't seem to figure out what the issue is....any suggestions?
Thanks for your help!
This is happening because the image I contains some negative values, probably very small values but negative. You can check that by computing range(I) or min(I) or any(I < 0).
The help for density.ppp says that the result may contain negative values (very small ones) due to numerical error. To remove these, you need to set positive=TRUE in the call to density.ppp.
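The mechanism behind the error can be reproduced without spatstat. The base R sketch below uses made-up pixel weights (lpix is a hypothetical vector, named after the argument in the error message) to show that a single tiny negative weight is enough to make sample.int fail, and that clamping the weights at zero avoids it. It illustrates the failure mode only; in spatstat the fix is the positive=TRUE argument described above.

```r
# Hypothetical pixel weights; the third is a tiny negative value
# of the kind produced by floating-point round-off
lpix <- c(0.2, 0.3, -1e-17, 0.5)

# sample.int refuses any negative weight, however small
res <- tryCatch(
  sample.int(length(lpix), size = 5, replace = TRUE, prob = lpix),
  error = function(e) conditionMessage(e)
)
res  # the error message, returned as a character string

# Clamping at zero makes the weights valid again
lpix_fixed <- pmax(lpix, 0)
set.seed(42)
draws <- sample.int(length(lpix_fixed), size = 5, replace = TRUE, prob = lpix_fixed)
draws  # pixel 3 has weight 0, so it is never drawn
```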
By the way, the argument kernel has been misspelt as kernal in the code above. Also, the vector r is too coarsely spaced; you would do better to leave this argument unspecified. Finally, you don't need to type density.ppp, just density.

Generate a crude incidence rate table (stratified by a factor variable) from a Lexis Model

I am using the 'Epi' package in R to model follow-up data from a study.
I am having no issues with declaring the Lexis model or running Poisson and (combined with the survival package) Cox regressions.
As part of the initial data review I want to find a simple way to make a table of crude unadjusted incidence/event rates from data in a lexis model in R (pre-fitting any poisson/cox models).
I have found a coded approach which allows me to do this and to stratify by a variable as part of exploratory data analysis:
#Generic Syntax Example
totals <- cbind(tapply(lexis_model$lex.Xst,lexis_model$stratifying_var,sum),tapply(lexis_model$lex.dur,lexis_model$stratifying_var,sum))
#Adds up the number of events within the stratifying variable
#Adds up the amount of follow-up time within the stratifying variable
rates <- tapply(lexis_model$lex.Xst,lexis_model$stratifying_var,sum)/tapply(lexis_model$lex.dur,lexis_model$stratifying_var,sum)*10^3
#Gives rates per 1,000 person-years
ratetable <- (cbind(totals,rates))
#Specific Example based on the dataset
totals <-cbind(tapply(lexis_model$lex.Xst,lexis_model$grade,sum),tapply(lexis_model$lex.dur,lexis_model$grade,sum))
rates <- tapply(lexis_model$lex.Xst,lexis_model$grade,sum)/tapply(lexis_model$lex.dur,lexis_model$grade,sum)*10^3
ratetable <- (cbind(totals,rates))
ratetable
rates
1 90 20338.234 4.4251630
2 64 7265.065 8.8092811
#Shows number of events, years follow-up, number of events per 1000 years follow-up, stratified by the stratifying variable
Note these are crude unadjusted/absolute rates, not the output of a Poisson model. Whilst I appreciate that the code above does indeed produce the desired output (and is pretty straightforward), I wanted to see if people were aware of a command which can take a Lexis dataset and output this. I've had a look at the available commands in the Epi and epitools packages; I may have missed something but could not see an obvious way to do this.
As this is quite a common thing to want to do, I wondered if anyone was aware of a package/function that could do this by simply specifying the Lexis dataset and the stratification variable (or indeed a single function that performs the steps above in one go).
Ideally the output would look something like the below (which is taken from STATA which I am trying to move away from in favour of R!):
A copy of the first twenty rows or so of the actual data is here (the data has already been put in to a lexis model using Epi package so all relevant lexis variables are there):
https://www.dropbox.com/s/yjyz1kzusysz941/rate_table_data.xlsx?dl=0
I would do this simply using the tidyverse R package as such:
library(tidyverse)
lexis_model %>%
group_by(grade) %>%
summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
mutate(rate = sum_Xst/sum_dur*10^3) -> rateable
rateable
# A tibble: 2 x 4
# grade sum_Xst sum_dur rate
# <dbl> <int> <dbl> <dbl>
# 1 1 2 375.24709 5.329821
# 2 2 0 92.44079 0.000000
And you could wrap this into a function yourself:
rateFunc <- function(data, strat_var)
{
# use the data argument (not the global lexis_model); group_by_() is deprecated
data %>%
group_by(across(all_of(strat_var))) %>%
summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
mutate(rate = sum_Xst/sum_dur*10^3)
}
which you would then call:
rateFunc(lexis_model, "grade")
This is useful because, using the combination of tidyverse summarise and mutate it is very easy to add more summary statistics to the table.
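As one example of an extra summary statistic, here is a base R sketch that adds exact Poisson confidence limits to the crude rates. It assumes, as the code above does, that lex.Xst is coded numerically (0/1) so that summing it counts events. The toy data frame and the function name crude_rates are hypothetical and only serve the illustration.

```r
# Hypothetical follow-up data using the Lexis column names from above
toy <- data.frame(
  grade   = c(1, 1, 1, 2, 2),
  lex.Xst = c(1, 0, 1, 0, 1),            # assumed 0/1 event indicator
  lex.dur = c(10.5, 20.0, 5.5, 8.0, 12.0)  # follow-up time in years
)

# Crude rates per `per` person-years, with exact Poisson 95% limits
crude_rates <- function(data, strat_var, per = 1000) {
  events <- as.vector(tapply(data$lex.Xst, data[[strat_var]], sum))
  pyrs   <- as.vector(tapply(data$lex.dur, data[[strat_var]], sum))
  data.frame(
    stratum = sort(unique(data[[strat_var]])),
    events  = events,
    pyrs    = pyrs,
    rate    = events / pyrs * per,
    # chi-squared form of the exact Poisson limits for a count
    rate_lo = qchisq(0.025, 2 * events) / (2 * pyrs) * per,
    rate_hi = qchisq(0.975, 2 * (events + 1)) / (2 * pyrs) * per
  )
}

crude_rates(toy, "grade")
```

The limits here come from the chi-squared relationship for a Poisson count, so they may differ slightly from intervals that other tools compute on the log scale.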
EDIT:
After clarification on the question, this can be done using the popEpi package using the rate command:
popEpi::rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = grade)
# Crude rates and 95% confidence intervals:
# grade lex.Xst lex.dur rate SE.rate rate.lo rate.hi
# 1: 1 2 375.2472 0.00532982 0.003768752 0.001332942 0.0213115
# 2: 2 0 92.4408 0.00000000 0.000000000 0.000000000 NaN

Why does gstat.predict() function often return NaN values (GSTAT Package)? (R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2, with different means and correlation lengths) with an irregular boundary using the gstat package in R. I have attached a picture of my expected outcome. The code does not give this output consistently, and I frequently get at least one of yy1 and yy2 as NaNs, which results in the undesired output shown in the image.
The key steps I used are:
1) Created two gstat objects with different means and psill (rf1 and rf2)
2) Created two computational grids (one for each random field) in the form of data frame with two variables “x” and “y” coordinates.
3) Predicted two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gstat objects
rf1<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=T treats this as an unconditional simulation
rf2<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=T treats this as an unconditional simulation
# creating two computational grid
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for antijoin
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))
