Given data like this:
SN = Sensitivity;
SP = Specificity
Cutpoint   SN    1-SP
1          0.5   0.1
2          0.7   0.2
3          0.9   0.6
How can I plot the ROC curve and calculate the AUC, and then compare the AUCs of two different ROC curves? In most packages, such as pROC or ROCR, the expected input differs from the data shown above. Can anybody suggest a way to solve this problem in R, or with something else?
ROCsdat <- data.frame(cutpoint = c(5, 7, 9), TPR = c(0.56, 0.78, 0.91), FPR = c(0.01, 0.19, 0.58))
## plot version 1
op <- par(xaxs = "i", yaxs = "i")
plot(TPR ~ FPR, data = ROCsdat, xlim = c(0, 1), ylim = c(0, 1), type = "n")
with(ROCsdat, lines(c(0, FPR, 1), c(0, TPR, 1), type = "o", pch = 25, bg = "black"))
text(TPR ~ FPR, data = ROCsdat, pos = 3, labels = ROCsdat$cutpoint)
abline(0, 1)
par(op)
First off, I would recommend visiting your local library and finding an introductory book on R. It is important to have a solid base before you write your own code; copy-pasting code found on the internet without really understanding what it means is risky at best.
Regarding your question, I believe the (0,0) and (1,1) coordinates are part of the ROC curve, so I included them in the data:
ROCsdat <- data.frame(cutpoint = c(-Inf, 5, 7, 9, Inf), TPR = c(0, 0.56, 0.78, 0.91, 1), FPR = c(0, 0.01, 0.19, 0.58, 1))
AUC
I strongly recommend against writing your own trapezoid integration function at this stage of your training in R: it is too error-prone, and easy to get wrong with a small (syntax) mistake.
Instead, use a well-established integration routine such as the trapz function from the pracma package:
library(pracma)
trapz(ROCsdat$FPR, ROCsdat$TPR)
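# [1] 0.85405  (the AUC for the ROCsdat above, with the (0,0) and (1,1) endpoints included)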
Plotting
I think you mostly got the plotting right, although I would write it slightly differently:
plot(TPR ~ FPR, data = ROCsdat, xlim = c(0,1), ylim = c(0,1), type="b", pch = 25, bg = "black")
text(TPR ~ FPR, data = ROCsdat, pos = 3, labels = ROCsdat$cutpoint)
abline(0, 1, col="lightgrey")
Comparison
For the comparison, let's say you have two AUCs in auc1 and auc2. The if/else syntax looks like this:
if (auc1 < auc2) {
  cat("auc1 < auc2!\n")
} else if (auc1 == auc2) {
  cat("aucs are identical!\n")
} else {
  cat("auc1 > auc2!\n")
}
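Note that comparing the two point estimates with if/else only tells you which number is larger, not whether the difference is statistically significant. If you have access to the underlying per-subject data (rather than only the summary table above), pROC can run DeLong's test for two ROC curves. A minimal sketch, where outcome (0/1 status), score1 and score2 are hypothetical placeholders for that data:
library(pROC)
roc1 <- roc(outcome, score1)  # outcome, score1, score2 are placeholder vectors
roc2 <- roc(outcome, score2)
roc.test(roc1, roc2, method = "delong")  # tests H0: the two AUCs are equal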
I suppose you could just compute it manually:
dat <- data.frame(tpr=c(0, .5, .7, .9, 1), fpr=c(0, .1, .2, .6, 1))
sum(diff(dat$fpr) * (dat$tpr[-1] + dat$tpr[-length(dat$tpr)]) / 2)
# [1] 0.785
You need to have the tpr and fpr vectors begin with 0 and end with 1 to compute the AUC properly.
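If you compute this more than once, it may help to wrap the endpoint padding and the trapezoidal rule in a small helper; this is just a sketch of the same formula used above:
# pad with the (0,0) and (1,1) endpoints, sort by FPR,
# then apply the trapezoidal rule
auc_trapezoid <- function(tpr, fpr) {
  tpr <- c(0, tpr, 1)
  fpr <- c(0, fpr, 1)
  ord <- order(fpr)
  tpr <- tpr[ord]
  fpr <- fpr[ord]
  sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2)
}
auc_trapezoid(tpr = c(.5, .7, .9), fpr = c(.1, .2, .6))
# [1] 0.785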
Related
I am plotting the density of F(1,49) in R. It seems that the simulated plot does not match the theoretical plot when values approach zero.
set.seed(123)
val <- rf(1000, df1=1, df2=49)
plot(density(val), yaxt = "n", ylab = "", xlab = "Observation",
     main = expression(paste("Density plot (", italic(n), "=1000, ",
                             italic(df)[1], "=1, ", italic(df)[2], "=49)")),
     lwd = 2)
curve(df(x, df1 = 1, df2 = 49), from = 0, to = 10, add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright", c("Theoretical", "Simulated"),
       col = c("red", "black"), lty = c(2, 1), bty = "n")
Using density(val, from = 0) gets you much closer, although still not perfect. Densities near boundaries are notoriously difficult to calculate in a satisfactory way.
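That is (assuming val from the question above):
plot(density(val, from = 0))
curve(df(x, df1 = 1, df2 = 49), add = TRUE, col = "red", lwd = 2, lty = 2)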
By default, density uses a Gaussian kernel to estimate the probability density at a given point. Effectively, this means that a normal density curve is centred on each observation; all these normal densities are added up, and the result is normalized so that the area under the curve is 1.
This works well if observations have a central tendency, but gives unrealistic results when there are sharp boundaries (Try plot(density(runif(1000))) for a prime example).
When you have a very high density of points close to zero, but none below zero, the left tails of all the normal kernels "spill over" into the negative values, giving a Gaussian-type decay near zero that doesn't match the theoretical density.
This means that if you have a sharp boundary at 0, you should remove values of your simulated density that are between zero and about two standard deviations of your smoothing kernel - anything below this will be misleading.
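As a rough illustration of that rule of thumb (the two-bandwidth cutoff is a heuristic, not an exact bound), you can read the fitted kernel standard deviation off the density object and trim accordingly:
d <- density(val)
d$bw                    # standard deviation of the smoothing kernel
keep <- d$x > 2 * d$bw  # drop estimates within ~2 bandwidths of the boundary at 0
plot(d$x[keep], d$y[keep], type = "l")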
Since we can control the standard deviation of our smoothing kernel with the bw parameter of density, and easily control which x values are plotted using ggplot, we will get a more sensible result by doing something like this:
library(ggplot2)
ggplot(as.data.frame(density(val, bw = 0.1)), aes(x, y)) +
geom_line(aes(col = "Simulated"), na.rm = TRUE) +
geom_function(fun = ~ df(.x, df1 = 1, df2 = 49),
aes(col = "Theoretical"), lty = 2) +
lims(x = c(0.2, 12)) +
theme_classic(base_size = 16) +
labs(title = expression(paste("Density plot (",italic(n),"=1000, ",
italic(df)[1],"=1, ",italic(df)[2],"=49)")),
x = "Observation", y = "") +
scale_color_manual(values = c("black", "red"), name = "")
The kde1d and logspline packages also work well for densities like this one, since both support an explicit lower bound.
sims <- rf(1500, 1, 49)
library(kde1d)
kd <- kde1d(sims, bw = 1, xmin = 0)
plot(kd, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
library(logspline)
fit <- logspline(sims, lbound = 0, knots = c(0, 0.5, 1, 1.5, 2))
plot(fit, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
I am plotting a count histogram of my data and then overlaying the shape of the gamma distribution that I think underlies the data. The points on the gamma distribution are generated using dgamma and plotted using curve. No matter how many points I use to generate the curve, the output still looks pixellated. Does anyone know why and is it possible to obtain a smooth curve?
dt <- rgamma(100, shape = 2.59, scale = 0.16)
n <- 1000
x <- seq(min(dt), max(dt), length.out = n)
hist(dt, freq = FALSE, ylim = c(0, 2.5), xlab = "D")
curve(dgamma(x, shape = 2.59, scale = 0.16), add = TRUE, lwd = 2)
curve has an n argument that defaults to 101; you need to increase it. Note that it is not the number of data points you pass in that determines the resolution of the curve:
curve(dgamma(x, shape = 2.59, scale = 0.16), add = TRUE, lwd = 2, n = n) # then it takes your defined n = 1000
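For illustration, drawing the default and a finer resolution on the same axes makes the difference visible; the plotting range below is chosen arbitrarily for the sketch:
curve(dgamma(x, shape = 2.59, scale = 0.16), from = 0, to = 1.5, lwd = 2)        # default n = 101
curve(dgamma(x, shape = 2.59, scale = 0.16), add = TRUE, n = 1000, col = "red")  # n = 1000, smoother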
I have values on the x-axis ranging from 300 mm down to 0.075 mm, and on the y-axis from 0 to 100. I need to predict the value at x = 0.002, and the data need to be plotted on a semilog plot. I tried to use the lm function in the following way:
f2 <- data.frame(sievesize = c(0.075, 1.18, 2.36, 4.75), weight = c(55, 66.9, 67.69, 75))
f3 <- data.frame(sievesize = 0.002)
model1 <- lm(weight ~ log10(sievesize), data = f2)
pred3 <- predict(model1, f3)
Is there any better way to predict the values for 0.002?
You cannot do much with this data except calculate the prediction interval, to understand the margin of error of your prediction (it will turn out to be roughly 38.5 +/- 21), because:
- there are just four points covering the range of your experimental data;
- a sieve size of 0.002 mm lies outside your data range [0.075, 4.75], and this kind of extrapolation of any model leads to quite a large prediction error;
- the non-linear relation you are fitting on a lin-log plot diverges as the sieve size approaches zero;
- the data are distributed over a very narrow range for an exponential dependence.
Please see the code below:
f2 <- data.frame(sievesize = c(0.075, 1.18, 2.36, 4.75), weight = c(55, 66.9, 67.69, 75))
f3 <- data.frame(sievesize = c(0.002))
m_lm <- lm(weight ~ log10(sievesize), data = f2)
fit_lm <- predict(m_lm, f3, interval = "prediction")
fit_lm
#        fit      lwr      upr
# 1 38.46763 17.73941 59.19586
pred_x <- data.frame(sievesize = seq(0.001, 5, .01))
fit_conf <- predict(m_lm, pred_x, interval = "prediction")
plot(log10(f2$sievesize), f2$weight, ylim = c(0, 85), pch = 16, xlim = c(-3, 1))
points(log10(f3$sievesize), fit_lm[, 1], col = "red", pch = 16)
lines(log10(pred_x$sievesize), fit_conf[, 1])
lines(log10(pred_x$sievesize), fit_conf[, 2], col = "blue")
lines(log10(pred_x$sievesize), fit_conf[, 3], col = "blue")
legend("bottomright",
legend = c("experiment", "fitted line", "prediction interval", "forecasted"),
lty = c(NA, 1, 1, NA),
lwd = c(NA, 1, 1, NA),
pch = c(16, NA, NA, 16),
col = c("black", "black", "blue", "red"))
and the graph which illustrates the points mentioned above:
So using more advanced techniques such as a nonlinear fit, a GLM, or Bayesian regression will not bring additional insight, as the data set is extremely small and distributed over a very narrow range.
I am trying to compare the power functions of the chi-squared test and the t-test for one particular value, and my overall goal is to show that the t-test is more powerful (because it makes an assumption about the distribution). I used the pwr package for R to calculate the power of each test, then wrote two functions and plotted the results.
However, I do not find that the t-test is better than the chi-squared test, and I am puzzled by the result. I have spent hours on it, so any help is much appreciated.
Is the code wrong, do I have a wrong understanding of the power functions, or is there something wrong in the package?
library(pwr)
# mu is the value for which the power is calculated
# no is the number of observations
# power function of the t-test with an h0 of .2
g <- function(mu, alpha, no) { # power of the t-test at a particular value, with h0 = .2
  p <- mu - .20
  sigma <- sqrt(.5 * (1 - .5))
  pwr.t.test(n = no, d = p / sigma, sig.level = alpha, type = "one.sample",
             alternative = "greater")$power # d is the effect size p/sigma
}
#chi squared test
h <- function(mu, alpha, no, degree) { # power of the chi-squared test at a particular value
  p01 <- .2 # these values construct the effect size (computed a bit differently for the chi-squared test)
  p02 <- .8
  p11 <- mu
  p12 <- 1 - p11
  effect.size <- sqrt(((p01 - p11)^2 / p01) + ((p02 - p12)^2 / p02)) # effect size w
  pwr.chisq.test(N = no, df = degree, sig.level = alpha, w = effect.size)$power
}
# create a diagram
plot(1, 1, type = "n",
     xlab = expression(mu),
     xlim = c(.00, .75),
     ylim = c(0, 1.1),
     ylab = expression(1 - beta),
     axes = TRUE, main = "Power functions of the t-test and the chi-squared test")
axis(side = 2, at = c(0.05), labels = c(expression(alpha)), las = 3)
axis(side = 1, at = .2, labels = expression(mu[0])) # h0 is at mu = .2
abline(h = c(0.05, 1), lty = 2)
legend(.5, .5, # places a legend at the appropriate place
       c("t-Test", "Chi-square-Test"), # puts text in the legend
       lwd = c(2.5, 2.5), col = c("black", "red"))
curve(h(x, alpha = 0.05, no = 100, degree = 1), from = .00, to = .75, add = TRUE, col = "red", lwd = 2.5)
curve(g(x, alpha = 0.05, no = 100), from = .00, to = .75, add = TRUE, lwd = 2.5)
Thanks a lot in advance!
If I understand the problem right, you are testing for a binomial distribution with the mean under the null equal to 0.2, and the alternative being a mean greater than 0.2? If so, then on line 2 of your function g, shouldn't it be sigma <- sqrt(.2*(1-.2)) instead of sigma <- sqrt(.5*(1-.5))? That way, your standard deviation will be smaller, resulting in a larger test statistic and hence a smaller p-value, leading to higher power.
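As a quick check of this suggestion (a minimal sketch, not part of the original code), you can redefine g with the standard deviation taken under the null and overlay it on the existing plot:
# same as g(), but with sigma computed under H0: p = 0.2
g2 <- function(mu, alpha, no) {
  p <- mu - .20
  sigma <- sqrt(.2 * (1 - .2))
  pwr.t.test(n = no, d = p / sigma, sig.level = alpha,
             type = "one.sample", alternative = "greater")$power
}
curve(g2(x, alpha = 0.05, no = 100), from = .00, to = .75, add = TRUE, lty = 3, lwd = 2.5)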
I am trying to figure out how to count values above a certain centile in a centile regression.
I am using reference values to create a centile regression using the package lmsqreg, then I am plotting the experimental values on top.
e.g.:
#reference data
male.weight <- lmsqreg.fit(log(males$heart),
log(males$weight),
pvec = (centiles <- c(0.03, 0.1, 0.5, 0.9, 0.97)),
lam.fixed=1)
plot(log(males$weight),log(males$heart),
axes=FALSE,
frame.plot=TRUE,
xlab="Weight (kg)",
ylab="Heart weight (g)",
main="Men",
col = "white",
xlim = c(3.2,5.2),
ylim = c(4.7, 7))
axis(1, at = log(xat <- c(3, 5, 7, 9, 11, 13, 15, 17, 19) * 10),
     labels = xat)
axis(2, at = log(yat <- (1.5:10) * 100), labels = yat)
lpars <- list(lty = rep(c(2, 1, 2), c(2, 1, 2)),
              col = c("#444444", "#999999", "#000000")[c(1, 2, 3, 2, 1)])
for(l in 1:nrow(male.weight$qsys$outmat) ){
lines(male.weight$qsys$targetx,
male.weight$qsys$outmat[l,],
lty=lpars$lty[l],col=lpars$col[l])
}
#experimental data
points(log(m.heart$weight[m.heart$cod.g == "nCA"]),
log(m.heart$heart.w[m.heart$cod.g == "nCA"]),
pch = 21, col = "black")
points(log(m.heart$weight[m.heart$cod.g == "CA"]),
log(m.heart$heart.w[m.heart$cod.g == "CA"]),
pch = "♥", col = "red")
What I would like to find out is what percentage of m.heart$weight[m.heart$cod.g == "nCA"] (the experimental data) lies above the 0.9 and the 0.97 centiles. Something similar might have been answered elsewhere, but as I am not familiar with the nomenclature of this area, I am not sure what I am supposed to search for.
Thanks in advance,
Lola
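One possible approach, sketched here under two assumptions: that you mean the plotted points (log heart weight against log body weight), and that the rows of male.weight$qsys$outmat are the centile curves in the order of pvec, as the plotting loop above suggests:
# interpolate the 0.9 centile curve at the experimental x values,
# then count how many experimental points lie above it
x.exp <- log(m.heart$weight[m.heart$cod.g == "nCA"])
y.exp <- log(m.heart$heart.w[m.heart$cod.g == "nCA"])
q90 <- approx(male.weight$qsys$targetx,
              male.weight$qsys$outmat[4, ],  # 4th row = 0.9 centile, by assumption
              xout = x.exp)$y
100 * mean(y.exp > q90, na.rm = TRUE)        # percentage above the 0.9 centile
The same recipe with outmat[5, ] would give the percentage above the 0.97 centile.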