I am using the ROCR package to generate ROC curves. I already have a loop to generate multiple ROC plots from multiple files. I have 30 files. But I would like to combine all 30 ROC curves in one plot (with different colors if possible). I know there are a few posts about this, but I would like to use my own loop and modify it. Any suggestions?
My script:
library(ROCR)
labels <- read.table(file="c:/data1/input")
files <- list.files(path="c:/data2/", pattern="*.txt")
for(i in files){
predictions <- read.table(paste("c:/data2/",i,sep=""))
pred <- prediction(predictions, labels)
perf <- performance(pred,"tpr","fpr")
pdf(paste0("c:/data2/", i,"ROC.pdf"))
plot(perf)
dev.off()
}
When you call plot() on a performance object, it generates a new plot. One way around this is to use lines() and access the values directly. First I generate 30 models, since you did not provide your data:
library(gbm)
library(mlbench)
library(colorspace)
data(DNA)
Pal = qualitative_hcl(30)
dat = DNA[,-c(1:2)]
dat$Class = as.numeric(dat$Class == "ei")
idx = split(sample(nrow(dat)),1:nrow(dat) %% 30)
mdl = lapply(1:30,function(i){
mod = gbm(Class~.,data=dat[idx[[i]],])
predictions = predict(mod,dat[-idx[[i]],],n.trees=10,type="response")
labels = dat[-idx[[i]],"Class"]
list(labels=labels,predictions=predictions)
})
names(mdl) = paste("model",1:30)
Now we start an empty plot and iterate through the 30 models; the part you essentially need is the lines(...) call:
plot(NULL,xlim=c(0,1),ylim=c(0,1),
xlab="False positive rate",ylab="True positive rate")
for(i in 1:30){
predictions = mdl[[i]]$predictions
labels = mdl[[i]]$labels
pred = prediction(predictions, labels)
perf <- performance(pred,"tpr","fpr")
lines(perf@x.values[[1]],perf@y.values[[1]],col=Pal[i])
}
legend("bottomright",fill=Pal,names(mdl),ncol=5,cex=0.7)
For your example, try something like this:
files <- list.files(path="c:/data2/", pattern="*.txt")
mdl = lapply(files,read.table)
names(mdl) = gsub(".txt","",files)
plot(NULL,xlim=c(0,1),ylim=c(0,1),
xlab="False positive rate",ylab="True positive rate")
for(i in 1:30){
pred = prediction(mdl[[i]], labels)
perf <- performance(pred,"tpr","fpr")
lines(perf@x.values[[1]],perf@y.values[[1]],col=Pal[i])
}
legend("bottomright",fill=Pal,names(mdl),ncol=5,cex=0.7)
I wish to simulate the central limit theorem in order to demonstrate it, and I am not sure how to do it in R. I want to create 10,000 samples with a sample size of n (can be numeric or a parameter), from a distribution I will choose (uniform, exponential, etc.). Then I want to graph in one plot (using the par and mfrow commands) the original distribution (histogram), the distribution of the means of all samples, and a Q-Q plot of the means; I am not sure what to plot in the 4th graph (there are four, 2x2). Can you please assist me in starting to program it in R? I think once I have the simulated data I should be fine. Thank you.
My initial attempt is below; it is too simple and I am not even sure it is correct.
r = 10000;
n = 20;
M = matrix(0,n,r);
Xbar = rep(0,r);
for (i in 1:r)
{
M[,i] = runif(n,0,1);
}
for (i in 1:r)
{
Xbar[i] = mean(M[,i]);
}
hist(Xbar);
The CLT states that, given i.i.d. samples from a distribution with finite mean and variance, the sample mean (as a random variable) has a distribution that converges to a Gaussian as the number of samples n increases. Here, I will assume that you want to generate r sample sets containing n samples each in order to create r samples of the sample mean. Some code to do that is as follows:
set.seed(123) ## set the seed for reproducibility
r <- 10000
n <- 200 ## I use 200 instead of 20 to enhance convergence to Gaussian
## this function computes the r samples of the sample mean from the
## r*n original samples
sample.means <- function(samps, r, n) {
rowMeans(matrix(samps,nrow=r,ncol=n))
}
For generating the plots, we use ggplot2 and Aaron's qqplot.data function from here. We also use gridExtra to plot multiple plots in one frame.
library(ggplot2)
library(gridExtra)
qqplot.data <- function (vec) {
# following four lines from base R's qqline()
y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
d <- data.frame(resids = vec)
ggplot(d, aes(sample = resids)) + stat_qq() + geom_abline(slope = slope, intercept = int, colour="red") + ggtitle("Q-Q plot")
}
generate.plots <- function(samps, samp.means) {
p1 <- qplot(samps, geom="histogram", bins=30, main="Sample Histogram")
p2 <- qplot(samp.means, geom="histogram", bins=30, main="Sample Mean Histogram")
p3 <- qqplot.data(samp.means)
grid.arrange(p1,p2,p3,ncol=2)
}
Then we can use these functions with the uniform distribution:
samps <- runif(r*n) ## uniform distribution [0,1]
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Or, with the Poisson distribution with mean = 3:
samps <- rpois(r*n,lambda=3)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Or, with the exponential distribution with rate = 1 (mean = 1):
samps <- rexp(r*n,rate=1)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get:
Note that the sample mean histograms all look like Gaussians with a mean very close to the mean of the original generating distribution, whether that is uniform, Poisson, or exponential, as predicted by the CLT (and their variance is roughly 1/n = 1/200 times the variance of the original generating distribution).
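As a quick numerical check of that variance claim, here is a small sketch reusing the samps and samp.means vectors from whichever of the examples above was run last:
var(samps) / n    # variance of the original draws divided by the sample size n
var(samp.means)   # empirical variance of the sample means; should be close to the value above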
Maybe this can help you get started. I have hard-coded the normal distribution and only shown two of your suggested plots: the histogram of a randomly selected sample, and a histogram of all sample means.
I guess my main suggestion is using a list to store the samples instead of a matrix.
r <- 10000
my.n <- 20
simulation <- list()
for (i in 1:r) {
simulation[[i]] <- rnorm(my.n)
}
sample.means <- sapply(simulation, mean)
selected.sample <- sample(r, 1)  # pick the index of one sample at random
dev.off()
par(mfrow = c(1, 2))
hist(simulation[[selected.sample]])
hist(sample.means)
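If you want the full 2x2 layout mentioned in the question, one possible extension of the same idea is sketched below (it assumes the simulation list, selected.sample, and sample.means objects created above):
par(mfrow = c(2, 2))
hist(unlist(simulation), main = "All raw samples")        # the original distribution
hist(simulation[[selected.sample]], main = "One sample")  # one randomly selected sample
hist(sample.means, main = "Sample means")                 # distribution of the sample means
qqnorm(sample.means); qqline(sample.means, col = "red")   # Q-Q plot of the means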
I am new to R, and I already have an SVM model in R. Right now I have two raster images: one is the elevation and the other is the slope. The elevation and slope are used as the predictors for the SVM, and I also want to plot the results as a map.
Right now my code is as follows, but the prediction for the two raster inputs returns all 0. It should be 0 or 1. Is anything wrong?
library("e1071")
tornado=read.csv(file="~/Desktop/new.csv",header=TRUE,sep=",")
err<- rep(0,5)
m<-0
for (i in c(1:5)) {
#split the data sets into testing and training
training.indices <- sample(nrow(tornado), 1800)
training <- rep(FALSE, nrow(tornado))
training[training.indices] <- TRUE
tornado.input<- tornado[training,]
tornado.input=data.frame(tornado.input)
tornado=data.frame(tornado)
tornado$Sig <- factor(tornado$Sig)
model <- svm(Sig~slope+elevation, data=tornado.input)
pred<- predict(model, tornado[!training,] )
ConfM1<- table(tornado$Sig[!training], pred=pred)
err[i]<-(sum(ConfM1)-sum(diag(ConfM1)))/sum(ConfM1)
}
library("raster")
library("rgdal")
elevation <- raster("~/Desktop/elevation.tif")
slope<- raster("~/Desktop/slope.tif")
#plot(elevation)
#plot(slope)
logo <- brick(elevation, slope)
r1 <- predict(logo,model)
plot(r1)
Maybe it is a bit late to answer this question, but I have had the same issue. The raster::predict function does not seem to provide the same output as stats::predict.
My alternative solution is simply to extract the values from your predictor rasters (slope and elevation), then use ggplot to project the results spatially.
####Convert raster into dataframe
logo_df <- as.data.frame(values(logo))
logo_df[c("x","y")] <- coordinates(logo)
logo_df <- logo_df[complete.cases(logo_df),] # in case you had holes in your raster
#### predict to this new data
pred <- predict(model, logo_df, probability = T)
logo_df$svm.fit <- attr(pred, "probabilities")[,2]
###map the predictions
library(ggplot2)
ggplot(logo_df, aes(x, y, fill = svm.fit)) +
  geom_tile() +
  scale_fill_gradientn(colours = rev(colorRamps::matlab.like(100))) +
  coord_fixed()
I was having this problem and found that when I renamed the layers of my RasterStack to be their variable names, and added the type option, it worked!
e.g.
names(logo)<-c("elevation","slope")
r1<-predict(logo,model,type="response")
I am looking to plot an FPR vs. TPR point on an AUC graph for different thresholds.
For example, if data$C2 is my data frame with the true response column (either 0 or 1), I want to make a vector with predicted values (0 or 1) when data$C1 (a different measurement column) is above or below the specified threshold. Here is the function I've attempted with the ROCR package.
fun <- function (data, col1, col2){
perfc <- NULL #Create null vectors for prediction and performance
perfs <- NULL
temp <- NULL
d <- seq(0.10,0.30,0.01) ##Various thresholds to be tested
for (i in length(d){
temp <- ifelse(data[,col1] > d, 1 , 0) ##Create predicted responses
pred <- prediction(temp, data[,col2]) #Predict responses over true values
perf <- performance(pred, "tpr","fpr") #Store performance information
predc[i] <- pred #Do this i times for every d in the sequence
perfc[i] <- perf
preds <- prediction.class(predc, col2) #Combine to make prediction class
perfs <- performance.class(preds, "tpr","fpr") #Combine to make performance class
}
plot(perfs) #Plot TPR against FPR
}
Is the problem because temp is a list vector and the true labels are from a matrix? Am I applying this for loop incorrectly?
Thanks in advance!
Edit: Here's my attempt to do this manually without the ROC package.
for(t in seq(0.40,0.60,0.01)) #I want to do this for every t in the sequence
{
t <- t
TP <- 0
FP <- 0
p <- sum(data$C2==1, na.rm=TRUE) #Total number of true positives
n <- sum(data$C2==0, na.rm=TRUE) #Total number of true negatives
list <- data$C1 #Column to vector
test <- ifelse(list > t, 1, 0) #Make prediction vector
for(i in 1:nrow(data))
{if(test==1 & data$C2==1)
{TP <- TP + 1} #Count number of correct predictions
if(test==1 & data$C2==0)
{FP <- FP + 1} #Count number of false positives
}
plot(x=(FP/n),y=(TP/p)) #Plot every FP,TP pair
}
I hope I understand your question right, but I think that by AUC graph you mean the ROC curve. The ROC curve already takes the different thresholds into account when making those classification decisions. See this wikipedia page. I found this picture particularly helpful.
If the above is right, then all you need to do in your code is:
pred <- prediction(data[,col1], data[,col2])
perf <- performance(pred, "tpr","fpr")
plot(perf)
If you would like to 'add' a different curve to that plot, perhaps because you used a different classification technique (e.g. a decision tree instead of logistic regression), then use plot(perf2, add=TRUE), where perf2 is created in the same way as perf. See the documentation.
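If you still want the manual per-threshold points from your Edit (for example, to mark specific thresholds on top of the ROCR curve), the counting can be vectorized without the inner loop. Here is a sketch under the same assumptions about your columns, i.e. that data$C1 holds the scores and data$C2 the 0/1 labels:
p <- sum(data$C2 == 1, na.rm = TRUE)   # total number of positives
n <- sum(data$C2 == 0, na.rm = TRUE)   # total number of negatives
thresholds <- seq(0.40, 0.60, 0.01)
tpr <- sapply(thresholds, function(t) sum(data$C1 > t & data$C2 == 1, na.rm = TRUE) / p)
fpr <- sapply(thresholds, function(t) sum(data$C1 > t & data$C2 == 0, na.rm = TRUE) / n)
points(fpr, tpr, pch = 19)             # add the (FPR, TPR) pairs to the existing ROC plot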
Hello stackoverflow community,
I'm running kmeans (stats package) and Kmeans (amap package) on the Iris dataset. In both cases, I use the same algorithm (Lloyd–Forgy), the same distance (euclidean), the same number of initial random sets (50), the same maximal number of iterations (1000), and I test for the same set of k values (from 2 to 15). I also use the same seed for both cases (4358).
I don't understand why under these conditions I'm getting different wss curves, in particular: the "elbow" using the stats package is much less accentuated than when using the amap package.
Could you please help me to understand why? Thanks much!
Here the code:
# data load and scaling
newiris <- iris
newiris$Species <- NULL
newiris <- scale(newiris)
# using kmeans (stats)
wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50,
algorithm="Lloyd")$withinss)
}
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50,
method="euclidean")$withinss)
}
# plots
plot(1:15, wss1, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="kmeans (stats package)")
plot(1:15, wss2, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Kmeans (amap package)")
EDIT:
I've emailed the author of the amap package and will post the reply when/if I get any.
https://cran.r-project.org/web/packages/amap/index.html
The author of the amap package changed the code, and the withinss value it returns is now the sum computed with the chosen method (e.g. euclidean distance).
One way to solve this, given the return value of the Kmeans function (amap), is to recalculate withinss (the error sum of squares, SSE).
Here is my suggestion:
# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
set.seed(4358)
ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")
wss <- vector(mode = "numeric", length=i)
for (j in 1:i) {
km = as.matrix(newiris[which(ans.Kmeans$cluster %in% j),])
## average = as.matrix( t(apply(km,2,mean) ))
## wss[j] = sum( apply(km, 1, function(x) sum((x-average) ^ 2 )))
## or
wss[j] <- ( nrow(km)-1) * sum(apply(km,2,var))
}
wss2[i] = sum(wss)
}
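As a sanity check, the recomputed SSE should agree with what stats::kmeans reports for the same clustering. A small sketch, assuming the scaled newiris matrix from the question:
set.seed(4358)
fit <- kmeans(newiris, centers=3, iter.max=1000, nstart=50, algorithm="Lloyd")
# recompute the within-cluster sum of squares from the cluster assignments
wss.check <- sapply(sort(unique(fit$cluster)), function(j) {
  km <- newiris[fit$cluster == j, , drop = FALSE]
  (nrow(km) - 1) * sum(apply(km, 2, var))
})
sum(wss.check)      # should match (up to floating-point error) ...
sum(fit$withinss)   # ... the value reported by stats::kmeans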
Note: the method for the Pearson distance in this package is wrong (be careful!) as of version 0.8-14; see line 325 of the code at this link:
https://github.com/cran/amap/blob/master/src/distance_T.inl
I want to generate predicted values of a GLM including stochastic uncertainty. I use two approaches and compare them to make sure the result is correct.
rm(list=ls())
library(MASS)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0),Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2) * 1000
m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))
p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p1 <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))
p2 <- simulate(m)$sim_1
par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p1, p2)), max(c(d[,1], p1, p2)))
plot(x=d[,1], y=p1, xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(x=d[,1], y=simulate(m)$sim_1, col=rgb(0,1,0,alpha=0.1))
abline(a=0, b=1, col="red")
The predictions differ. Looking at the source code of the simulate() function
(this can be done by using:
getS3method(c("predict"), class = "glm")
)
I see that the simulate() function applies a weight-based sd:
if (!is.null(object$weights))
    vars <- vars/object$weights
ftd + rnorm(ntot, sd = sqrt(vars)) # this is the prediction including stochastic uncertainty; ftd is defined as fitted(object)
Looking at the help page, I read that "The methods for linear models fitted by lm or glm(family = "gaussian") assume that any weights which have been supplied are inversely proportional to the error variance." However, I assume this is about the prior weights, which I did not supply and which are NULL (m$prior.weights). The simulate function, however, seems to use m$weights, which appears to be identical to 4*m$fitted.values. I have googled a lot but can't get to the bottom of this. Why does the simulate() function apply these weights in the sd? Is this correct? How are these weights calculated?
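As a quick numerical check of that last observation, here is a small sketch reusing m from the code above (for the sqrt link, mu = eta^2, so the working weights (dmu/deta)^2 come out to 4*mu):
# m$weights holds the IWLS working weights from the final iteration;
# for gaussian(link = "sqrt") they equal (dmu/deta)^2 = (2*sqrt(mu))^2 = 4*mu
all.equal(as.numeric(m$weights), as.numeric(4 * m$fitted.values))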
(it's related to the post: microsimulation GLM including stochastic part; hopefully I'm not wrong in starting a new one)