pth Quantile of Standard Normal Distribution - R

I'm learning statistics and R from a book called "Discovering Statistics using R"... Although it's very informative, it seems to skip over areas even though it suggests no prior knowledge of statistics or R is needed. So to the problem:
How can you calculate in R the pth quantile of the Standard Normal Distribution using the Dichotomy (or division in halves) method, assuming no use of qnorm()? That is, solve:
pnorm(x) = p
pnorm(x)-p = 0
f (x) = 0
Update:
Dichotomy is a method where you take an interval [a,b] on which the function f takes values of different signs at the end points and has a single root x within [a,b]. You then halve it to find f(x1), and if f(x1) != 0 it gives you [a,x1] and [x1,b]; you keep the half whose end points still have different signs and repeat, so that the sequence x1, x2, ... converges to the root x.

Clumsy, but this works:
tolerance <- 1e-6
interval <- c(-1000,1000)
quantile <- 0.2
while(interval[2]-interval[1] > tolerance) {
cat('current interval: ',interval,'\n')
interval.left <- c(interval[1],mean(interval))
interval.right <- c(mean(interval),interval[2])
if(sum(sign(pnorm(interval.left)-quantile))==0) {
interval <- interval.left
} else {
interval <- interval.right
}
}
mean(interval)
qnorm(quantile)
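The same idea can be wrapped in a small function. This is only a sketch: the name bisect_quantile, the starting bracket [-10, 10] and the tolerance are my own choices, and it assumes p lies strictly between pnorm(-10) and 1 so that f has different signs at the two end points.

```r
# Bisection (dichotomy) search for the root of f(x) = pnorm(x) - p
bisect_quantile <- function(p, lower = -10, upper = 10, tol = 1e-8) {
  f <- function(x) pnorm(x) - p
  # The method needs a sign change over the starting interval
  stopifnot(f(lower) < 0, f(upper) > 0)
  while (upper - lower > tol) {
    mid <- (lower + upper) / 2
    # pnorm is increasing, so a negative f(mid) means the root lies to the right
    if (f(mid) < 0) lower <- mid else upper <- mid
  }
  (lower + upper) / 2
}

bisect_quantile(0.2)
qnorm(0.2)  # only used here to check the result
```

Because pnorm is monotone, checking the sign at the midpoint alone is enough to decide which half to keep.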


Generate Random Survival Times From A Hazard Function Applying A Hazard Rate with flexsurv

Please consider the following:
My aim is to draw random survival times from a flexible parametric multi-state model fitted with flexsurvreg (more specifically msfit.flexsurvreg) and applying some hazard ratio (HR, in this example set to 0.2).
I found an example to generate random survival times using any hazard function here. This is also where I apply the HR.
Problem
With the actual data, I receive an error once the HR is below the value of 0.2: Error in uniroot((function(x) { : no sign change found in 1000 iterations.
This does not happen in the reproducible example below.
Questions
Is there another, better way than the one below to draw random survival times while applying a HR?
Can someone indicate why the "no sign change" error may occur and how this can be fixed?
Any help is greatly appreciated!
Minimal reproducible example
# Load package
library(flexsurv)
#> Loading required package: survival
# Load data
data("bosms3")
# Define hazard ratio
hr <- 0.2
# Fit model (weibull)
crwei <- flexsurvreg(formula = Surv(years, status) ~ trans + shape(trans),
data = bosms3, dist = "weibull")
# Create transition matrix
Q <- rbind(c(NA,1,2),c(NA,NA,3), c(NA,NA,NA))
# Capture parameters
pars <- pars.fmsm(crwei, trans=Q, newdata=data.frame(trans=1:3))
# Code from https://eurekastatistics.com/generating-random-survival-times-from-any-hazard-function/ ----
inverse = function(fn, min_x, max_x){
# Returns the inverse of a function for a given range.
# E.g. inverse(sin, 0, pi/2)(sin(pi/4)) equals pi/4 because 0 <= pi/4 <= pi/2
fn_inv = function(y){
uniroot((function(x){fn(x) - y}), lower=min_x, upper=max_x)[1]$root
}
return(Vectorize(fn_inv))
}
integrate_from_0 = function(fn, t){
int_fn = function(t) integrate(fn, 0, t)
result = sapply(t, int_fn)
value = unlist(result["value",])
msg = unlist(result["message",])
value[which(msg != "OK")] = NA
return(value)
}
random_survival_times = function(hazard_fn, n, max_time=10000){
# Given a hazard function, returns n random time-to-event observations.
cumulative_density_fn = function(t) 1 - exp(-integrate_from_0(hazard_fn, t))
inverse_cumulative_density_fn = inverse(cumulative_density_fn, 0, max_time)
return(inverse_cumulative_density_fn(runif(n)))
}
# Run with data ----
random_survival_times(hazard_fn = function(t){crwei$dfns$h(t, pars[[1]][1], pars[[1]][2]) * hr}, n = 100)
#> Error in integrate(fn, 0, t): non-finite function value
# Adapt random_survival time function replacing 0 with 0.1 ----
random_survival_times <- function(hazard_fn, n, max_time=10000){
# Given a hazard function, returns n random time-to-event observations.
cumulative_density_fn = function(t) 1 - exp(-integrate_from_0(hazard_fn, t))
inverse_cumulative_density_fn = inverse(cumulative_density_fn, 0.1, max_time)
return(inverse_cumulative_density_fn(runif(n)))
}
# Run again with data ----
random_survival_times(hazard_fn = function(t){crwei$dfns$h(t, pars[[1]][1], pars[[1]][2]) * hr}, n = 100)
#> Error in uniroot((function(x) {: f() values at end points not of opposite sign
# Adapt inverse adding extendInt = "yes" ----
inverse <- function(fn, min_x, max_x){
# Returns the inverse of a function for a given range.
# E.g. inverse(sin, 0, pi/2)(sin(pi/4)) equals pi/4 because 0 <= pi/4 <= pi/2
fn_inv <- function(y){
uniroot((function(x){fn(x) - y}), lower=min_x, upper=max_x,
extendInt = "yes" # extendInt added because of error on some distributions: "Error in uniroot((function(x) { : f() values at end points not of opposite sign. Solution found here: https://stackoverflow.com/questions/38961221/uniroot-solution-in-r
)[1]$root
}
return(Vectorize(fn_inv))
}
# Run again with data ----
res <- random_survival_times(hazard_fn = function(t){crwei$dfns$h(t, pars[[1]][1], pars[[1]][2]) * hr}, n = 100)
res[1:5]
#> [1] 1.074281 13.688663 30.896637 159.643827 15.805103
Created on 2022-10-18 with reprex v2.0.2
This method of sampling survival times basically works by sampling a random uniform(0,1) number p and finding x for which the survival probability is p. The uniroot step is used to solve S(x) = p by a numerical search. In your case, it is having difficulty finding a solution after 1000 steps.
I've seen this happen, and fixed it by adding, e.g., uniroot(..., maxiter=10000) to tell it to try a bit harder to find the solution. That's always been enough in my tests, though those may be limited. If that doesn't work, I'd advise digging in by hand and examining the survival curve that you are trying to invert - it may be invalid due to some parameter value being extreme.
(This kind of thing is done in the function qgeneric in the flexsurv package, though it borrows a vectorised version of uniroot from the rstpm2 package which is faster.)
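When the hazard ratio simply multiplies a Weibull hazard, the numerical search can be avoided, because S(t) = u has a closed-form solution. The sketch below assumes the shape/scale parameterisation of R's dweibull(), where S(t) = exp(-(t / scale)^shape); the function name rweibull_ph and the parameter values are hypothetical and are not taken from the fitted model. It also hints at one plausible cause of sign-change errors: with a small HR, the cumulative distribution at max_time may still lie below the sampled uniform value, so no root exists inside the search interval.

```r
# Draw n survival times from a Weibull baseline whose hazard is multiplied by hr.
# With proportional hazards, S(t) = exp(-hr * (t / scale)^shape).
rweibull_ph <- function(n, shape, scale, hr = 1) {
  u <- runif(n)
  # Solve exp(-hr * (t / scale)^shape) = u for t
  scale * (-log(u) / hr)^(1 / shape)
}

# Hypothetical parameter values for illustration only
set.seed(1)
times <- rweibull_ph(1000, shape = 1.2, scale = 5, hr = 0.2)
summary(times)
```

This is equivalent to sampling from a Weibull with the same shape and the scale multiplied by hr^(-1 / shape), so rweibull() could be used directly as well.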

Exponential distribution in R

I want to simulate some data from an exp(1) distribution, but the values have to be > 0.5, so I used a while loop, but it does not seem to work as I would like. Thanks in advance for your responses!
x1<-c()
w<-rexp(1)
while (length(x1) < 100) {
if (w > 0.5) {
x1<- w }
else {
w<-rexp(1)
}
}
1) The code in the question has these problems:
we need a new random variable on each iteration but it only generates new random variables if the if condition is FALSE
x1 is repeatedly overwritten rather than extended
although while could be used, repeat seems better since having the test at the end is a better fit than the test at the beginning
We can fix this up like this:
x1 <- c()
repeat {
w <- rexp(1)
if (w > 0.5) {
x1 <- c(x1, w)
if (length(x1) == 100) break
}
}
1a) A variation would be the following. Note that an if whose condition is FALSE evaluates to NULL if there is no else leg so if the condition is FALSE on the line marked ## then nothing is concatenated to x1.
x1 <- c()
repeat {
w <- rexp(1)
x1 <- c(x1, if (w > 0.5) w) ##
if (length(x1) == 100) break
}
2) Alternately, this generates 200 exponential random variables, keeping only those greater than 0.5. If fewer than 100 remain then repeat. At the end it takes the first 100 from the last batch generated. We have chosen 200 to be sufficiently large that on most runs only one iteration of the loop will be needed (a draw exceeds 0.5 with probability exp(-0.5), about 0.61, so roughly 121 of the 200 are expected to be kept).
repeat {
r <- rexp(200)
r <- r[r > 0.5]
if (length(r) >= 100) break
}
r <- head(r, 100)
Alternative (2) is actually faster than (1) or (1a) because it is more highly vectorized. This is despite it throwing away more exponential random variables than the other solutions.
I would advise against a while (or any other accept/reject) loop; instead use the methods from truncdist:
# Sample 1000 observations from a truncated exponential
library(truncdist);
x <- rtrunc(1000, spec = "exp", a = 0.5);
# Plot
library(ggplot2);
ggplot(data.frame(x = x), aes(x)) + geom_histogram(bins = 50) + xlim(0, 10);
It's also fairly straightforward to implement a sampler using inverse transform sampling to draw samples from a truncated exponential distribution that avoids rejecting samples in a loop. This will be a more efficient method than any accept/reject-based sampling method, and works particularly well in your case, since there exists a closed form of the truncated exponential cdf.
See for example this post for more details.
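A minimal sketch of the inverse transform approach, assuming the only truncation is a lower bound a. The function name rexp_lower_trunc is made up for illustration. For x >= a the truncated cdf is F(x) = 1 - exp(-rate * (x - a)), so solving F(x) = u gives x = a - log(1 - u) / rate.

```r
# Sample from an exponential distribution truncated to (a, Inf)
# by inverting the truncated cdf; no draws are rejected.
rexp_lower_trunc <- function(n, rate = 1, a = 0.5) {
  u <- runif(n)
  a - log(1 - u) / rate
}

set.seed(1)
x1 <- rexp_lower_trunc(100)
min(x1)  # every value lies above 0.5
```

This is the memoryless property in action: an exponential variable conditioned on exceeding a is a plus a fresh exponential variable with the same rate.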

Calculating pmf and cdf for 20 sided dice in R

I would like to create two functions that calculate the probability mass function (pmf) and cumulative distribution function (cdf) for a 20-sided die.
In each function I would use one argument, y, for the side (a number from 1 to 20). I should be able to pass a vector and get back the value for each of its elements.
If a value entered is non-discrete, the function should return zero for it in the result and give a warning message.
This is what I have so far for the PMF:
PMF= function(side) {
a = NULL
for (i in side)
{
a= dbinom(1, size=1, prob=1/20)
print(a)
}
}
And this is what I got for CDF:
CDF= function(side) {
a = NULL
for (i in side)
{
a= pnorm(side)
print(a)
}
}
I am currently stuck with the warning message and the zero in the result. How can I add that to the function?
Next, how can I plot these two functions on the same plot on a specific interval (for example 1,12)?
Did I use the right function for calculating cdf and pmf?
I would propose the following simplifications:
PMF <- function(side) {
x <- rep(0.05, length(side))
bad_sides <- ! side %in% 1:20 # sides that aren't in 1:20 are bad
x[bad_sides] <- 0 # set bad sides to 0
# warnings use the warning() function. See ?warning for details
if (any(bad_sides)) warning("Sides not integers between 1 and 20 have 0 probability!")
# print results is probably not what you want, we'll return them instead.
return(x)
}
For the CDF, I assume you mean the probability of rolling a number less than or equal to the side given, which is side / 20. (pnorm is the wrong function... it gives the CDF of the normal distribution.)
CDF <- function(side) {
return(pmin(1, pmax(0, floor(side) / 20)))
}
Technically, the CDF is defined for non-integer values. The CDF of 1.2 is just the same as the CDF of 1, so I use floor here. The pmin(1, ...) makes sure the result doesn't exceed 1 for values over 20, and the pmax(0, ...) makes sure it's not negative for values below 0.
Plotting:
my_interval <- 1:12
plot(range(my_interval), c(0, 1), type = "n")
points(my_interval, PMF(my_interval))
lines(my_interval, CDF(my_interval), type = "s")

cost function in cv.glm of boot library in R

I am trying to use the crossvalidation cv.glm function from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two denoting the data and model and K specifies the k-fold.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has cost me hours, and I was wondering if someone could point me in the correct direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost(), given above, is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
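A quick way to convince yourself that the two definitions agree is to run both on the same vectors. The names cost_abs and cost_explicit are introduced here only to keep the two versions apart, and the vectors are the simulated ones from above, typed out by hand.

```r
# The version from the ?cv.glm examples
cost_abs <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# The more explicit version
cost_explicit <- function(r, pi = 0) {
  mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
}

predicted <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
response  <- c(0, 0, 0, 1, 0, 0, 0, 1, 1)

cost_abs(response, predicted)
cost_explicit(response, predicted)
```

Both calls count elements 4, 6 and 7 as misclassified. A prediction of exactly 0.5 is counted as correct by both versions.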
I will try to explain the cost function in simple words. Let's take
cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consists of many observations. Think of it like series of numbers or even.
glmfit
It is a generalized linear model, which is fitted to the above series. But there is a catch: cv.glm splits the data into K parts. It holds out each part in turn as the test set, refits glmfit on the rest of the data (the training set), and predicts the held-out part. The predictions form a series with the same number of elements as the held-out part.
cost
Cost Function. It takes two arguments: first the observed responses for the test set, and second the predictions of glmfit for the test input. The default is the mean squared error function.
It averages the squared differences between each observed data point and the corresponding predicted data point (the two vectors must have the same number of elements).
K
The number of parts into which the input should be split. The default gives leave-one-out cross-validation.
Judging from your cost function description, your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no and 0.5-1 = yes) and output (y) is 'yes' or 'no'. So the error (e) between the predicted probability (x) and the observed label (y) would be:
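cost <- function(x, y) {
  # x: predicted probabilities, y: observed labels ('yes' or 'no').
  # Note: cv.glm() passes the observed responses first and the predictions
  # second, so swap the arguments before passing this function to cv.glm().
  total <- 0
  for (i in seq_along(x)) {
    if (x[i] > 0.5) {
      e <- if (y[i] == 'yes') 0 else x[i] - 0.5
    } else {
      e <- if (y[i] == 'no') 0 else 0.5 - x[i]
    }
    total <- total + e * e  # accumulate the squared error
  }
  total / length(x)  # mean squared error
}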
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you wanted to do so, then you would write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. So for the cost function that (I think) you described in your post, you are looking for a function that will return the proportion of accurate classifications (an accuracy, so higher is better; use 1 minus this value if you want an error rate), which would look something like this:
cost <- function(labels,pred){
mean(labels==ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call. Although I wouldn't recommend using your own cost function over the default one unless you have reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.
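To see why the scale of the predictions matters for a 0.5 cutoff, the sketch below compares the two prediction types on a small logistic model. The mtcars data and the am ~ wt model are chosen only because they ship with R; they are not part of the question. My reading of the boot source is that cv.glm calls predict() with type = "response", but that is worth confirming for your installed version.

```r
# A small logistic regression on a built-in data set
fit <- glm(am ~ wt, family = binomial, data = mtcars)

link_scale <- predict(fit)                     # default type = "link": log-odds
prob_scale <- predict(fit, type = "response")  # probabilities between 0 and 1

range(link_scale)  # not restricted to [0, 1]
range(prob_scale)

# The inverse logit maps one scale onto the other
all.equal(plogis(link_scale), prob_scale)
```

A cost function that thresholds at 0.5 only makes sense on the probability scale; on the link scale the matching cutoff would be 0.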

How do I best simulate an arbitrary univariate random variate using its probability function?

In R, what's the best way to simulate an arbitrary univariate random variate if only its probability density function is available?
Here is a (slow) implementation of the inverse cdf method when you are only given a density.
den<-dnorm #replace with your own density
#calculates the cdf by numerical integration
cdf<-function(x) integrate(den,-Inf,x)[[1]]
#inverts the cdf
inverse.cdf<-function(x,cdf,starting.value=0){
lower.found<-FALSE
lower<-starting.value
while(!lower.found){
if(cdf(lower)>=(x-.000001))
lower<-lower-(lower-starting.value)^2-1
else
lower.found<-TRUE
}
upper.found<-FALSE
upper<-starting.value
while(!upper.found){
if(cdf(upper)<=(x+.000001))
upper<-upper+(upper-starting.value)^2+1
else
upper.found<-TRUE
}
uniroot(function(y) cdf(y)-x,c(lower,upper))$root
}
#generates 1000 random variables of distribution 'den'
vars<-apply(matrix(runif(1000)),1,function(x) inverse.cdf(x,cdf))
hist(vars)
To clarify the "use Metropolis-Hastings" answer above:
suppose ddist() is your probability density function
something like:
n <- 10000
cand.sd <- 0.1
init <- 0
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
newval <- rnorm(1,mean=vals[i-1],sd=cand.sd)
newprob <- ddist(newval)
if (runif(1)<newprob/oldprob) {
vals[i] <- newval
} else vals[i] <- vals[i-1]
oldprob <- newprob
}
Notes:
completely untested
efficiency depends on candidate distribution (i.e. value of cand.sd).
For maximum efficiency, tune cand.sd to an acceptance rate of 25-40%
results will be autocorrelated ... (although I guess you could always
sample() the results to scramble them, or thin)
may need to discard a "burn-in", if your starting value is weird
The classical approach to this problem is rejection sampling (see e.g. Press et al Numerical Recipes)
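A minimal sketch of rejection sampling with a uniform proposal, assuming the density is zero outside [lower, upper] and never exceeds den_max on that interval. The function name rejection_sample and the Beta(2, 5) example density are my own choices for illustration.

```r
# Rejection sampling: draw points uniformly in the box
# [lower, upper] x [0, den_max] and keep those under the density curve.
rejection_sample <- function(den, n, lower, upper, den_max) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, lower, upper)  # candidate values
    u <- runif(n, 0, den_max)    # heights within the bounding box
    out <- c(out, x[u < den(x)]) # accept candidates that fall under the curve
  }
  out[seq_len(n)]
}

# Example density: Beta(2, 5), whose maximum is below 2.5
set.seed(1)
vars <- rejection_sample(function(x) dbeta(x, 2, 5), 1000, 0, 1, 2.5)
hist(vars)
```

The tighter den_max is to the true maximum of the density, the fewer candidates are thrown away.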
Use the cumulative distribution function: http://en.wikipedia.org/wiki/Cumulative_distribution_function
Then just use its inverse.
See http://en.wikipedia.org/wiki/Normal_distribution for a better picture.
That means: pick a random number from [0,1], treat it as a value of the CDF, and then look up the value of the variable at which the CDF reaches it.
The inverse of the CDF is also called the quantile function.
This is a comment but I don't have enough reputation to drop a comment to Ben Bolker's answer.
I am new to Metropolis, but IMHO this code is wrong because:
a) the newval is drawn from a normal distribution whereas in other codes it is drawn from a uniform distribution; this value must be drawn from the range covered by the random number. For example, for a gaussian distribution this should be something like runif(1, -5, +5).
b) the prob value must be updated only on acceptance.
Hope this helps, and I hope that someone with enough reputation can correct this answer (or mine, if I am wrong).
# the distribution
ddist <- dnorm
# number of random number
n <- 100000
# the center of the range is taken as init
init <- 0
# the following should go into a function
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
newval <- runif(1, -5, +5)
newprob <- ddist(newval)
if (runif(1) < newprob/oldprob) {
vals[i] <- newval
oldprob <- newprob
} else vals[i] <- vals[i-1]
}
# Final view
hist(vals, breaks = 100)
# and comparison
hist(rnorm(length(vals)), breaks = 100)
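For comparison, here is a sketch of a random-walk Metropolis sampler that keeps the normal proposal from the earlier answer but stores the density of the current state and updates it only on acceptance. A symmetric proposal such as rnorm centred on the current value is a standard choice, so the proposal does not have to be uniform. The function name metropolis_sample and the default cand.sd = 1 are my own choices, and the code assumes ddist(init) > 0.

```r
# Random-walk Metropolis with a symmetric normal proposal
metropolis_sample <- function(ddist, n, init = 0, cand.sd = 1) {
  vals <- numeric(n)
  vals[1] <- init
  oldprob <- ddist(init)  # density at the current state
  for (i in 2:n) {
    newval <- rnorm(1, mean = vals[i - 1], sd = cand.sd)
    newprob <- ddist(newval)
    if (runif(1) < newprob / oldprob) {
      vals[i] <- newval
      oldprob <- newprob  # update only when the candidate is accepted
    } else {
      vals[i] <- vals[i - 1]
    }
  }
  vals
}

set.seed(42)
vals <- metropolis_sample(dnorm, 20000)
hist(vals, breaks = 100)
```

The draws are autocorrelated, so thinning or discarding a burn-in may still be needed, as the notes in the earlier answer point out.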
