R: How to read nomograms to predict the desired variable
I am using RStudio. I have created a nomogram using the nomogram function from the rms package with the following code (copied from the example code in the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram, but there is no line connecting the scales (called an isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have an isopleth for prediction (see the example from Wikipedia). But here, how do I predict the value of the variable?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that prediction is supposed to be done without the isopleth, using the points scale instead. But how? Can someone please use this example to elaborate on how I can read the nomogram to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that the nomogram function does not provide an isopleth and uses a points scale instead.
From the documentation, the nomogram is used to obtain a prediction manually:
In the top part of the plot (above the Total Points scale):
you draw a vertical line up to the Points scale for each of your patient's variables (for example age = 40, cholesterol = 220 (and sex = male), blood.pressure = 172),
then you sum up the three values you read on the Points scale (40 + 60 + 3 = 103) to obtain the Total Points.
Finally, you draw a vertical line down from the Total Points scale (103) to read the Risk of Death (0.55).
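If you want to double-check a manual reading like this against the fitted model itself, here is a minimal sketch (assuming the fit f and the datadist settings from the question are still in the workspace; the patient values are the hypothetical ones used in the steps above):
# hypothetical patient from the example above
patient <- data.frame(age = 40, sex = "male", cholesterol = 220, blood.pressure = 172)
# predicted probability straight from the lrm fit (the "Risk of Death" axis value)
predict(f, newdata = patient, type = "fitted")
# equivalently, transform the linear predictor with the logistic function
plogis(predict(f, newdata = patient, type = "lp"))
The value read off the nomogram should agree with this to within the reading precision of the points scale.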
These are regression nomograms, and they work in a different way from classic nomograms. A classic nomogram performs the full calculation graphically. For these nomograms you drop a line from each predictor to the points scale and add up your results.
The only way to have a classic 'isopleth' nomogram working for a regression model would be to (1) have just two predictors or (2) build a complex multi-step nomogram.
Related
r qqp function - why is the 'perfect fit' a flat line on 0?
This may be more of a statistical question than a programming one; I just wanted to make sure I was getting the programming right first. I have a large count dataset (108 sites with 31 species = 3348 observations), but a lot of these are 0 counts because not all species were present at every site. Log transformation has been suggested to me, but others have also said that you shouldn't log-transform count data. Here is my data for the first 8 species (it also contains the very abundant species with the highest counts):

example.abund <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,
0,0,1,0,8,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,2,0,3,1,0,0,0,0,0,0,0,0,0,
2,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,
0,1,0,0,0,28,1,0,1,0,0,1,0,2,0,0,2,0,0,0,1,0,0,0,1,0,0,0,2,0,0,1,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,8,7,7,1,1,13,0,8,0,3,0,1,1,
1,4,4,0,1,0,1,0,0,0,0,6,5,2,0,2,58,4,2,47,4,0,0,0,2,59,2,0,0,6,1,36,28,2,
1,1,0,6,0,0,2,5,0,0,0,0,87,7,0,1,1,1,0,0,1,1,0,6,11,0,0,0,3,0,4,0,7,2,
0,5,0,4,1,0,1,12,0,2,0,9,0,1,0,0,0,24,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,3,1,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,15,0,2,
81,0,1,32,26,13,2,61,0,66,2,2,0,17,43,43,0,25,19,2,25,26,91,61,0,13,0,62,186,1,4,22,1,50,3,67,86,11,56,26,74,0,6,8,7,0152,8,14,1,97,1,0,12,11,3,1,1,112,2,35,36,5,61,26,211,15,8,173,17,97,22,18,88,11,1,66,15,3,3,3,2,0,1,0,41,9,14,1,0,38,0,0,51,27,11,38,31,1,0,221,68,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,29,0,0,0,0,
0,82,12,0,0,3,0,9,0,0,164,0,0,0,0,1,0,15,0,0,0,6,56,0,0,0,6,0,0,1,0,5,5,8,
0,4,0,0,6,0,0,2,0,0,3,0,0,0,0,683,0,0,0,0,3,149,252,11,13,195,19,0,59,0,0,1,28,0,
0,0,0,0,0,0,0,0,0,0,31,55,85,0,142,0,44,52,0,0192,0,45,0,0,0,0,0,0,11,2,0,0,6,
0,0,0,0,0,0,0,0,0,0,0,0,0,19,3,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0)

I need to fit a mixed model to the data, but first I am trying to figure out the most appropriate distribution to use. I was following the steps in this blog, but all of the red lines (meant to represent the 'perfect fit' for that distribution) are coming up as 0 along the entire plot. My question is: have I coded this correctly and there are simply so many 0s in my data that the perfect fit is 0? Or is there something wrong with the way I have coded it? Code example:

library(MASS) # for fitdistr
library(car)  # for qqp
# add 1 so that the families that cannot handle 0s can work with the data
example.abund.1 <- example.abund + 1
plot(hist(example.abund))
qqp(example.abund, "norm")
qqp(example.abund.1, "lnorm") # lognormal
# have to generate estimates of parameters:
nbinom <- fitdistr(example.abund.1, "Negative Binomial")
qqp(example.abund.1, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])
poisson <- fitdistr(example.abund.1, "Poisson")
qqp(example.abund.1, "pois", poisson$estimate)
gamma <- fitdistr(example.abund.1, "gamma")
qqp(example.abund.1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]])
Obtaining confidence interval for npreg as values, not as plot
I am using the well-known "np" package of Hayfield & Racine for non-parametric regressions. It allows plotting confidence bands for the estimates based on bootstrap procedures. See the code below for an example.

Question: I am wondering how to obtain these confidence intervals in numerical form? One, but not the only, reason for this question is that I really don't like the presentation of the CIs. More generally, I would like to use and further process the confidence band within my analysis.

library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
plot(mynp, plot.errors.method="bootstrap")
When executing plot, the function calls the plot method of the np package, which is the function npplot. npplot accepts an argument plot.behavior, which equals "plot" by default: it plots the results and returns NULL. You should set plot.behavior = "plot-data", and the function will plot and return the data of the object.

dat <- plot(mynp, plot.errors.method="bootstrap", plot.behavior = "plot-data")

Then the values on the line can be accessed through dat$r1$mean, and the values to be added to the mean to get the upper and lower CI can be accessed through dat$r1$merr. Notice that not all values are plotted, only half of them (every other value and then the last). Read the help on npplot for more options.

Here is an example of the use of the code and the results:

library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
dat <- plot(mynp, plot.errors.method="bootstrap", plot.behavior = "plot-data")

Then recreating the results:

z <- unlist(dat$r1$eval, use.names = FALSE)
CI.up <- as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,2])
CI.dn <- as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,1])
plot(dat$r1$mean ~ z, cex=1.5, xaxt='n', ylim=c(1.0,3.5), xlab='', ylab='lalala!',
     main='blahblahblah', col='blue', pch=16)
arrows(z, CI.dn, z, CI.up, code=3, length=0.2, angle=90, col='red')

we will get:

As you can see, the results are the same (only I have calculated the intervals for each point and not only for half of them). Note the plot.errors.type argument of npplot, which accepts "standard" and "quantiles" and is "standard" by default. When you specify "standard", dat$r1$merr will hold the standard errors and the plot will show mean plus/minus standard error as the intervals. Alternatively, the plot will show the quantiles as the intervals and the quantiles will be saved in dat$r1$merr. Which quantiles to use is specified by plot.errors.quantiles, and it is only relevant if plot.errors.type = "quantiles".
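If the goal is a numeric table of the band rather than another plot, here is a minimal sketch building on the dat object above (assuming, as described, that merr holds the lower and upper offsets in its two columns):

ci.table <- data.frame(
  x     = unlist(dat$r1$eval, use.names = FALSE),
  mean  = as.numeric(dat$r1$mean),
  lower = as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,1]),
  upper = as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,2])
)
head(ci.table)   # the confidence band as plain numbers, ready for further processing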
Fit distribution to given frequency values in R
I have frequency values changing with time (x-axis units), as presented in the picture below. After some normalization these values may be seen as data points of a density function for some distribution.

Q: Assuming that these frequency points are from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)

Update. To prevent being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units) I mean I have data which says that I have:

7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3
... etc.

One way towards my goal (an incorrect one, I think) would be to create the set of these realizations:

# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
  set.values <<- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)

and use fitdistr on set.values:

library(MASS) # for fitdistr
f2 <- fitdistr(set.values, 'weibull')
f2

Why do I think this is an incorrect way, and why am I looking for a better solution in R?

- in the distribution-fitting approach presented above it is assumed that set.values is a complete set of my realizations from the distribution T
- in my original question I know the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function)
Here is a better attempt. Like before, it uses optim to find the best values constrained to a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.

Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that maybe a Weibull is not a good fit for all of the data. The problem is the first two points - they are just too large. You can see the attempted fit to all the data in the first plot.

If I drop those first two points and just fit the rest, we get a much better fit. You can see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.

library(optimx) # loaded here, although base optim is what is actually used below

sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22

s.fit <- sample[3:23]
t.fit <- t.sample[3:23]

# scaled Weibull density: param = c(shape, y-scale, x-scale)
wx <- function(param) {
  res <- param[2]*dweibull(t.fit*param[3], shape=param[1])
  return(res)
}

# objective: Euclidean distance between the data and the scaled density
minwx <- function(param){
  v <- s.fit - wx(param)
  sqrt(sum(v*v))
}

p0 <- c(1, 200, 1/20)
paramopt <- optim(p0, minwx, gr=NULL, method="L-BFGS-B",
                  lower=c(0.1, 100, 0.01), upper=c(1.1, 5000, 1))
popt <- paramopt$par
popt
rms <- paramopt$value

tit <- sprintf("Weibull - Shape:%.3f xscale:%.1f yscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)
plot(t.sample[2:23], sample[2:23], type = "p", col="darkred")
lines(t.fit, wx(popt), col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.

# Defining the error of the implicit equation for the shape parameter k
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) - weighted.mean(log(x2), w = x2^k*vec))
}

# Driving the error to (nearly) zero, i.e. solving the implicit equation for k
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min

# Calculate lambda, given k: lambda = (weighted mean of x^k)^(1/k)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)

# Plot the empirical density of the expanded realizations and the fitted Weibull density
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale=l))
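As a quick sanity check on these closed-form estimates, you could compare them with fitdistr applied to the expanded realizations. This is only a check, and it assumes the sample vector and the k and l values computed above:

library(MASS)
# expand the frequencies into individual realizations (value 1 occurs 7787 times, etc.)
set.values <- rep(seq_along(sample), times = sample)
fitdistr(set.values, 'weibull')   # shape and scale should be close to k and l
c(k, l)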
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameters like this:

library(MASS) # for fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f

If you are not sure whether the data are Weibull distributed, I would recommend using ks.test. This tests whether your data are from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best. For your example this would look like this:

ks <- ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks

The p-value is not significant, hence you do not reject the hypothesis that the data are from a Weibull distribution.

Update: The histograms of either the Weibull or the exponential look like a good match to your data. I think the exponential distribution gives you a better fit. A Pareto distribution is another option.

f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape=f$estimate[1], scale=f$estimate[2])
hist(z)

f <- fitdistr(sample, 'exponential')
z <- rexp(10000, f$estimate[1])
hist(z)
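To go beyond eyeballing histograms, here is a small sketch comparing the candidate fits numerically. It assumes the sample vector above and relies on the fact that fitdistr objects carry a log-likelihood, so AIC can rank them:

library(MASS)
f.weibull <- fitdistr(sample, 'weibull')
f.exp     <- fitdistr(sample, 'exponential')
AIC(f.weibull, f.exp)   # lower AIC indicates the better-fitting candidate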
How to draw my function to plot with data in R
I have data about the response time of a web site according to the number of users that hit it at the same time. For example:

10 users hitting at the same time have an (average) response time of 300ms
20 users -> 450ms
etc.

I import the data into R and make a plot from the 2 columns of data (users, response time). I also use the function loess to draw a line through those points on the plot. Here's the code that I have written:

users <- seq(5, 250, by=5)
responseTime <- c(179.5,234.0,258.5,382.5,486.0,679.0,594.0,703.5,998.0,758.0,797.0,812.0,804.5,890.5,1148.5,1182.5,1298.0,1422.0,1413.5,1209.5,1488.0,1632.0,1715.0,1632.5,2046.5,1860.5,2910.0,2836.0,2851.5,3781.0,2725.0,3036.0,2862.0,3266.0,3175.0,3599.0,3563.0,3375.0,3110.0,2958.0,3407.0,3035.5,3040.0,3378.0,3493.0,3455.5,3268.0,3635.0,3453.0,3851.5)
data1 <- data.frame(users, responseTime)
data1
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
loess_fit <- loess(responseTime ~ users, data1)
lines(data1$users, predict(loess_fit), col = "green")

Here's my plot's image:

My questions are:

How do I draw my nonlinear function on the same plot to compare it with the other lines? Example: response_time (f(x)) = 30*users^2.
Also, how do I make predictions from the loess line and from my function and show them on the plot? Example: if I have data up to 250 users, make predictions up to 500 users.
If you know the equation of the line that you want to draw, then just define a variable for your prediction:

predictedResponseTime <- 30 * users ^ 2
lines(users, predictedResponseTime)

If the problem is that you want to fit a line, then you need to call a modelling function. Since loess is a non-parametric model, it isn't appropriate to use it to make predictions outside of the range of your data. In this case, a simple (ordinary least squares) linear regression using lm provides a reasonable fit.

model <- lm(responseTime ~ users)
prediction <- data.frame(users = 1:500)
prediction$responseTime <- predict(model, prediction)
with(prediction, lines(users, responseTime))
Another solution to plot your curve, knowing the underlying function, is the function curve. In your example of f(x) = 30*x^2:

plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
lines(data1$users, predict(loess_fit), col = "green")
curve(30*x^2, col="red", add=TRUE) # Don't forget the add parameter.
Programming a QQ plot
I have a sample of math test scores for male and female students. I want to draw a QQ plot for each gender to see if each of them is normally distributed. I know how to draw the QQ plot for the overall sample, but how can I draw the plots separately?
Here is a simple solution using base graphics:

scores <- rnorm(200, mean=12, sd=2)
gender <- gl(2, 100, labels=c("M","F")) # 100 "M" then 100 "F", matching the 200 scores
opar <- par(mfrow=c(1,2))
for (g in levels(gender))
  qqnorm(scores[gender==g], main=paste("Gender =", g))
par(opar)

A more elegant lattice solution:

library(lattice)
qqmath(~ scores | gender, data=data.frame(scores, gender), type=c("p", "g"))

See the on-line help for qqmath for more discussion and examples of possible customization.
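If you prefer ggplot2, here is a roughly equivalent faceted sketch (assuming the same scores and gender vectors as above; stat_qq_line needs a reasonably recent ggplot2, version 3.0.0 or later):

library(ggplot2)
df <- data.frame(scores, gender)
# one normal QQ panel per gender, with a reference line
ggplot(df, aes(sample = scores)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ gender)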
In Python, you have a QQ-plot method offered by the OpenTURNS library (see the doc here). Here is an example.

In the first step, we generate a random sample of size 300 from a Uniform distribution. In the second step, we pretend that we do not know where this sample comes from and try to fit a Normal distribution and a Uniform distribution to it. In the third step, we draw the QQ plot of the sample against each of the fitted distributions in order to "see" which one is the best.

1st step:

import openturns as ot
from openturns.viewer import View

distribution = ot.Uniform(-1, 1)
sample = distribution.getSample(300)

2nd step:

fitted_normal = ot.NormalFactory().build(sample)
fitted_uniform = ot.UniformFactory().build(sample)

3rd step:

QQ_plot1 = ot.VisualTest.DrawQQplot(sample, fitted_normal)
QQ_plot2 = ot.VisualTest.DrawQQplot(sample, fitted_uniform)
View(QQ_plot1)
View(QQ_plot2)

As expected, the fitted Uniform is better adapted to the sample than the Normal, which has bigger errors at both ends of the domain.