Counting values above a centile in a centile regression

I am trying to figure out how to count values above a certain centile in a centile regression.
I am using reference values to create a centile regression using the package lmsqreg, then I am plotting the experimental values on top.
For example:
#reference data
male.weight <- lmsqreg.fit(log(males$heart),
                           log(males$weight),
                           pvec = (centiles <- c(0.03, 0.1, 0.5, 0.9, 0.97)),
                           lam.fixed = 1)
plot(log(males$weight), log(males$heart),
     axes = FALSE,
     frame.plot = TRUE,
     xlab = "Weight (kg)",
     ylab = "Heart weight (g)",
     main = "Men",
     col = "white",
     xlim = c(3.2, 5.2),
     ylim = c(4.7, 7))
axis(1, at = log(xat <- c(3, 5, 7, 9, 11, 13, 15, 17, 19) * 10),
     labels = xat)
axis(2, at = log(yat <- (1.5:10) * 100), labels = yat)
lpars <- list(lty = rep(c(2, 1, 2), c(2, 1, 2)),
              col = c("#444444", "#999999", "#000000")[c(1, 2, 3, 2, 1)])
for (l in 1:nrow(male.weight$qsys$outmat)) {
  lines(male.weight$qsys$targetx,
        male.weight$qsys$outmat[l, ],
        lty = lpars$lty[l], col = lpars$col[l])
}
#experimental data
points(log(m.heart$weight[m.heart$cod.g == "nCA"]),
       log(m.heart$heart.w[m.heart$cod.g == "nCA"]),
       pch = 21, col = "black")
points(log(m.heart$weight[m.heart$cod.g == "CA"]),
       log(m.heart$heart.w[m.heart$cod.g == "CA"]),
       pch = "♥", col = "red")
What I would like to find out is what percentage of the experimental points (m.heart with cod.g == "nCA") lies above the 0.9 and the 0.97 centile curves. Something similar may have been answered elsewhere, but as I am not familiar with the nomenclature of this area, I am not sure what I am supposed to search for.
Thanks in advance,
Lola
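
One way to do this is to interpolate the fitted centile curves at the experimental x values and count how many points fall above them. A minimal sketch, assuming the rows of male.weight$qsys$outmat hold the fitted curves in the order of pvec (rows 4 and 5 being the 0.9 and 0.97 centiles), as the plotting loop above suggests:

x.exp <- log(m.heart$weight[m.heart$cod.g == "nCA"])
y.exp <- log(m.heart$heart.w[m.heart$cod.g == "nCA"])
## interpolate the fitted 0.9 and 0.97 centile curves at the experimental weights
q90 <- approx(male.weight$qsys$targetx, male.weight$qsys$outmat[4, ], xout = x.exp)$y
q97 <- approx(male.weight$qsys$targetx, male.weight$qsys$outmat[5, ], xout = x.exp)$y
## percentage of experimental points above each centile curve
## (na.rm drops points outside the fitted x range, where approx returns NA)
100 * mean(y.exp > q90, na.rm = TRUE)
100 * mean(y.exp > q97, na.rm = TRUE)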

Related

Density plot of the F-distribution (df1=1). Theoretical or simulated?

I am plotting the density of F(1,49) in R. It seems that the simulated plot does not match the theoretical plot where values approach zero.
set.seed(123)
val <- rf(1000, df1=1, df2=49)
plot(density(val), yaxt = "n", ylab = "", xlab = "Observation",
     main = expression(paste("Density plot (", italic(n), "=1000, ",
                             italic(df)[1], "=1, ", italic(df)[2], "=49)")),
     lwd = 2)
curve(df(x, df1 = 1, df2 = 49), from = 0, to = 10, add = TRUE,
      col = "red", lwd = 2, lty = 2)
legend("topright", c("Theoretical", "Simulated"),
       col = c("red", "black"), lty = c(2, 1), bty = "n")
Using density(val, from = 0) gets you much closer, although still not perfect. Densities near boundaries are notoriously difficult to calculate in a satisfactory way.
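For reference, that variant is a one-line change to the original plot:

plot(density(val, from = 0), lwd = 2,
     main = "Density estimated only for non-negative values")
curve(df(x, df1 = 1, df2 = 49), from = 0, to = 10, add = TRUE,
      col = "red", lwd = 2, lty = 2)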
By default, density uses a Gaussian kernel to estimate the probability density at a given point. Effectively, this means that a normal density curve is centred on each observation; all these normal densities are added up, and the result is normalized so that the area under the curve is 1.
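You can reproduce this by hand at any single point; a minimal sketch of the computation density() performs with a Gaussian kernel:

## kernel density estimate at one point x0, done manually:
## average the normal densities centred at each observation
x0 <- 1
bw <- bw.nrd0(val)                    # density()'s default bandwidth rule
mean(dnorm(x0, mean = val, sd = bw))  # manual estimate at x0
approx(density(val), xout = x0)$y     # density()'s own estimate, for comparison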
This works well if observations have a central tendency, but gives unrealistic results when there are sharp boundaries (Try plot(density(runif(1000))) for a prime example).
When you have a very high density of points close to zero, but none below zero, the left tails of all the normal kernels will "spill over" into the negative values, giving a Gaussian-type left tail which doesn't match the theoretical density.
This means that if you have a sharp boundary at 0, you should discard the simulated density at values between zero and about two standard deviations of your smoothing kernel; anything below this will be misleading.
Since we can control the standard deviation of the smoothing kernel with the bw parameter of density, and can easily control which x values are plotted using ggplot, we get a more sensible result by doing something like this:
library(ggplot2)
## keep only the x and y components so the density object coerces cleanly,
## and pass bw to density() itself
ggplot(as.data.frame(density(val, bw = 0.1)[c("x", "y")]), aes(x, y)) +
  geom_line(aes(col = "Simulated"), na.rm = TRUE) +
  geom_function(fun = ~ df(.x, df1 = 1, df2 = 49),
                aes(col = "Theoretical"), lty = 2) +
  lims(x = c(0.2, 12)) +
  theme_classic(base_size = 16) +
  labs(title = expression(paste("Density plot (", italic(n), "=1000, ",
                                italic(df)[1], "=1, ", italic(df)[2], "=49)")),
       x = "Observation", y = "") +
  scale_color_manual(values = c("black", "red"), name = "")
The kde1d and logspline packages are not bad for such densities.
sims <- rf(1500, 1, 49)
library(kde1d)
kd <- kde1d(sims, bw = 1, xmin = 0)
plot(kd, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
library(logspline)
fit <- logspline(sims, lbound = 0, knots = c(0, 0.5, 1, 1.5, 2))
plot(fit, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)

Which is the more appropriate predictive model to use in R for the following scenario?

I have x-axis values ranging from 300 mm down to 0.075 mm, and y-axis values from 0 to 100. I need to predict the value at x = 0.002, and the data need to be shown on a semi-log plot. I tried to use the lm function in the following way:
f2 <- data.frame(sievesize = c(0.075, 1.18, 2.36, 4.75), weight = c(55, 66.9, 67.69, 75))
f3 <- data.frame(sievesize = 0.002)
model1 <- lm(weight ~ log10(sievesize), data = f2)
pred3 <- predict(model1, f3)
Is there any better way to predict the value at x = 0.002?
You cannot do much with these data except calculate the prediction interval, to understand the margin of error of your prediction (the code below shows it is about 38.5 +/- 21):
- there are just four points of experimental data;
- a sieve size of 0.002 mm is outside your data range [0.075, 4.75]; unfortunately, this kind of extrapolation of any model leads to quite a large prediction error;
- the relation you are fitting is linear in log10(sievesize), so it diverges as the sieve size approaches zero;
- the data are distributed over a very narrow range for such a dependence.
Please see the code below:
f2 <- data.frame(sievesize = c(0.075, 1.18, 2.36, 4.75), weight = c(55, 66.9, 67.69, 75))
f3 <- data.frame(sievesize = c(0.002))
m_lm <- lm(weight ~ log10(sievesize), data = f2)
fit_lm <- predict(m_lm, f3, interval = "prediction")
fit_lm
#        fit      lwr      upr
# 1 38.46763 17.73941 59.19586
pred_x <- data.frame(sievesize = seq(0.001, 5, .01))
fit_conf <- predict(m_lm, pred_x, interval = "prediction")
plot(log10(f2$sievesize), f2$weight, ylim = c(0, 85), pch = 16, xlim = c(-3, 1))
points(log10(f3$sievesize), fit_lm[, 1], col = "red", pch = 16)
lines(log10(pred_x$sievesize), fit_conf[, 1])
lines(log10(pred_x$sievesize), fit_conf[, 2], col = "blue")
lines(log10(pred_x$sievesize), fit_conf[, 3], col = "blue")
legend("bottomright",
       legend = c("experiment", "fitted line", "prediction interval", "forecasted"),
       lty = c(NA, 1, 1, NA),
       lwd = c(NA, 1, 1, NA),
       pch = c(16, NA, NA, 16),
       col = c("black", "black", "blue", "red"))
The graph produced by this code illustrates the points mentioned above.
So using more advanced techniques such as a nonlinear fit, a GLM, or Bayesian regression will not bring additional insight, as the data set is extremely small and covers a very narrow range.

How to fit a curve to a histogram

I've explored similar questions on this topic, but I am having trouble producing a nice curve on my histogram. I understand that some people may see this as a duplicate, but I haven't found anything so far that solves my problem.
Although the data isn't visible here, here are the variables I am using, so you can see what they represent in the code below.
Differences <- subset(Score_Differences, select = Difference, drop = T)
m = mean(Differences)
std = sqrt(var(Differences))
Here is the very first curve I produced (this code seems to be the most common and the easiest, but the curve itself doesn't fit that well).
hist(Differences, density = 15, breaks = 15, probability = TRUE, xlab = "Score Differences", ylim = c(0,.1), main = "Normal Curve for Score Differences")
curve(dnorm(x,m,std),col = "Red", lwd = 2, add = TRUE)
I really like this but don't like the curve going into the negative region.
hist(Differences, probability = TRUE)
lines(density(Differences), col = "Red", lwd = 2)
lines(density(Differences, adjust = 2), lwd = 2, col = "Blue")
This is the same histogram as the first, but with frequencies. It still doesn't look that nice.
h = hist(Differences, density = 15, breaks = 15, xlab = "Score Differences", main = "Normal Curve for Score Differences")
xfit = seq(min(Differences),max(Differences))
yfit = dnorm(xfit,m,std)
yfit = yfit*diff(h$mids[1:2])*length(Differences)
lines(xfit, yfit, col = "Red", lwd = 2)
Another attempt, but no luck. Maybe it's because I am using qnorm when the data obviously isn't normal. The curve goes in the negative direction again.
l = length(Differences)  # sample size, used to scale the curve below
sample_x = seq(qnorm(.001, m, std), qnorm(.999, m, std), length.out = l)
binwidth = 3
breaks = seq(floor(min(Differences)), ceiling(max(Differences)), binwidth)
hist(Differences, breaks)
lines(sample_x, l*dnorm(sample_x, m, std)*binwidth, col = "Red")
The only curve that looks visually nice is the second, but it extends into the negative region.
My question is: is there a "standard" way to place a curve on a histogram? This data certainly isn't normal. Three of the procedures I presented here are from similar posts, but I am obviously having some trouble. I feel like any method of fitting a curve will depend on the data you're working with.
Update with solution
Thanks to Zheyuan Li and others! I will leave this up for my own reference and hopefully others as well.
hist(Differences, probability = TRUE)
lines(density(Differences, cut = 0), col = "Red", lwd = 2)
lines(density(Differences, adjust = 2, cut = 0), lwd = 2, col = "Blue")
OK, so you are just struggling with the fact that density goes beyond the "natural range" of the data. Well, just set cut = 0. You may want to read plot.density extends "xlim" beyond the range of my data. Why and how to fix it? to see why. In that answer I was using from and to, but here I am using cut.
## consider a mixture, that does not follow any parametric distribution family
## note, by construction, this is a strictly positive random variable
set.seed(0)
x <- rbeta(1000, 3, 5) + rexp(1000, 0.5)
## (kernel) density estimation offers a flexible nonparametric approach
d <- density(x, cut = 0)
## you can plot histogram and density on the density scale
hist(x, prob = TRUE, breaks = 50)
lines(d, col = 2)
Note that with cut = 0, density estimation is done strictly within range(x); outside this range, the density is 0.
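A quick check that the estimate does not spill past the data:

## with cut = 0, the estimated support is exactly the data range
all.equal(range(d$x), range(x))  # TRUE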

How To Avoid Density Curve Getting Cut Off In Plot

I am working on an assignment using R, and the fitted density curve overlaid on the histogram is cut off at its peak.
Example:
x <- rexp(1000, 0.2)
hist(x, prob = TRUE)
lines(density(x), col = "blue", lty = 3, lwd = 2)
I have searched the internet but didn't find anything addressing this problem. I have tried playing with the margins, but that doesn't work. Am I missing something in my code?
Thank you for your help!
Here's the simple literal answer to the question. Make an object to hold the result of your density call and use that to set the ylim of the histogram.
x <- rexp(1000, 0.2)
tmp <- density(x)
hist(x, prob = TRUE, ylim = c(0, max(tmp$y)))
lines(tmp, col = "blue", lty = 3, lwd = 2)
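If the histogram bars themselves can be taller than the density peak, the same idea extends to taking the maximum of both:

h <- hist(x, plot = FALSE)  # bar heights, computed without plotting
hist(x, prob = TRUE, ylim = c(0, max(tmp$y, h$density)))
lines(tmp, col = "blue", lty = 3, lwd = 2)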

Plot ROC curve and calculate AUC in R at specific cutoff info

Given such data (SN = sensitivity, SP = specificity):

Cutpoint    SN    1-SP
       1    0.5   0.1
       2    0.7   0.2
       3    0.9   0.6
How can I plot the ROC curve and calculate the AUC, and then compare the AUC between two different ROC curves? In most packages, such as pROC or ROCR, the required input differs from the data shown above. Can anybody suggest a way to solve this problem in R or with something else?
ROCsdat <- data.frame(cutpoint = c(5, 7, 9), TPR = c(0.56, 0.78, 0.91), FPR = c(0.01, 0.19, 0.58))
## plot version 1
op <- par(xaxs = "i", yaxs = "i")
plot(TPR ~ FPR, data = ROCsdat, xlim = c(0, 1), ylim = c(0, 1), type = "n")
with(ROCsdat, lines(c(0, FPR, 1), c(0, TPR, 1), type = "o", pch = 25, bg = "black"))
text(TPR ~ FPR, data = ROCsdat, pos = 3, labels = ROCsdat$cutpoint)
abline(0, 1)
par(op)
First off, I would recommend visiting your local library and finding an introductory book on R. It is important to have a solid base before you write your own code, and copy-pasting code found on the internet without really understanding what it means is risky at best.
Regarding your question, I believe the (0,0) and (1,1) coordinates are part of the ROC curve, so I included them in the data:
ROCsdat <- data.frame(cutpoint = c(-Inf, 5, 7, 9, Inf), TPR = c(0, 0.56, 0.78, 0.91, 1), FPR = c(0, 0.01, 0.19, 0.58, 1))
AUC
I strongly recommend against setting up your own trapezoid integration function at this stage of your training in R. It's too error-prone and easy to screw up with a small (syntax) mistake.
Instead, use well-established integration code like the trapz function in the pracma package:
library(pracma)
trapz(ROCsdat$FPR, ROCsdat$TPR)
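With the endpoints included as above, this gives an AUC of about 0.854.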
Plotting
I think you mostly got the plotting right, although I would write it slightly differently:
plot(TPR ~ FPR, data = ROCsdat, xlim = c(0,1), ylim = c(0,1), type="b", pch = 25, bg = "black")
text(TPR ~ FPR, data = ROCsdat, pos = 3, labels = ROCsdat$cutpoint)
abline(0, 1, col="lightgrey")
Comparison
For the comparison, let's say you have two AUCs in auc1 and auc2. The if/else syntax looks like this:
if (auc1 < auc2) {
cat("auc1 < auc2!\n")
} else if (auc1 == auc2) {
cat("aucs are identical!\n")
} else {
cat("auc1 > auc2!\n")
}
I suppose you could just compute it manually:
dat <- data.frame(tpr=c(0, .5, .7, .9, 1), fpr=c(0, .1, .2, .6, 1))
sum(diff(dat$fpr) * (dat$tpr[-1] + dat$tpr[-length(dat$tpr)]) / 2)
# [1] 0.785
You need to have the tpr and fpr vectors begin with 0 and end with 1 to compute the AUC properly.
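If you need this more than once, a small helper makes the padding explicit (auc_trapezoid is a hypothetical function written here, not from any package):

## trapezoidal AUC; pads with (0,0) and (1,1), which is harmless when
## those endpoints are already present (they add zero-width segments)
auc_trapezoid <- function(fpr, tpr) {
  fpr <- c(0, fpr, 1)
  tpr <- c(0, tpr, 1)
  o <- order(fpr)
  fpr <- fpr[o]
  tpr <- tpr[o]
  sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2)
}
auc_trapezoid(fpr = c(.1, .2, .6), tpr = c(.5, .7, .9))
# [1] 0.785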
