Issue with quantile type 2 - r

I don't understand the following behavior of quantile. With type=2 it should average at discontinuities, but this doesn't always seem to happen. If I create a vector of 100 numbers and look at the percentiles, shouldn't I get the average at every percentile? This happens for some percentiles but not for all (e.g. the 7th percentile).
quantile(seq(1, 100, 1), 0.05, type=2)
# 5%
# 5.5
quantile(seq(1, 100, 1), 0.06, type=2)
# 6%
# 6.5
quantile(seq(1, 100, 1), 0.07, type=2)
# 7%
# 8
quantile(seq(1, 100, 1), 0.08, type=2)
# 8%
# 8.5
Is this related to floating point issues?
100*0.06 == 6
#TRUE
100*0.07 == 7
#FALSE
sprintf("%.20f", 100*0.07)
#"7.00000000000000088818"

As far as I can tell, it is indeed a floating-point issue: 0.07 is not exactly representable in binary floating point.
p <- seq(0, 0.1, by = 0.001)
q <- quantile(seq(1, 100, 1), p, type=2)
plot(p, q, type = "b")
abline(v = 0.07, col = "grey")
If you think of the type-2 quantile as a function of p, you never evaluate that function at exactly 0.07, hence your results. Try e.g. decreasing by in the code above. In that sense, the function returns exactly what is expected. In practice, with continuous data, I cannot imagine this being of any consequence (though that is a poor argument, I know).
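For completeness, here is a minimal sketch of a hand-rolled helper (not part of base R) that recovers the averaging when p is meant to be k/n for an integer k, by rounding n*p and thereby sidestepping the representation of p:
quantile2_exact <- function(x, p) {
  # Type-2 quantile that treats n*p as an integer when it is within
  # rounding error of one, so 0.07 behaves like 7/100
  x <- sort(x)
  n <- length(x)
  k <- round(n * p)
  if (abs(n * p - k) < 1e-8 && k >= 1 && k < n) {
    (x[k] + x[k + 1]) / 2        # average at the discontinuity
  } else {
    x[min(floor(n * p) + 1, n)]  # usual type-2 value away from it
  }
}
quantile2_exact(seq(1, 100, 1), 0.07)
# 7.5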


How do I rescale a range of numbers with these constraints? - r

I need to rescale a series of numbers with certain constraints.
Let's say I have a vector like this:
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
The constraints are:
1. The sum of x must be 6 (right now sum(x) is 3.9).
2. The numbers cannot be lower than 0.
3. The numbers cannot be higher than 1.
I know how to do 1 and 2+3 separately, but not together.
How do I rescale this?
EDIT: As r2evans tried, the relative relationships of the numbers should preferably be preserved.
I don't know that this can be done with a simple expression, but we can optimize our way through it:
opt <- optimize(function(z) abs(6 - sum(z + (1 - z) * (x - min(x)) / diff(range(x)))),
                lower = 0, upper = 1)
opt
# $minimum
# [1] 0.2380955
# $objective
# [1] 1.257898e-06
out <- ( opt$minimum + (1-opt$minimum) * (x - min(x)) / diff(range(x)) )
out
# [1] 0.6190477 0.4285716 0.7142858 0.5238097 1.0000000 0.2380955 0.3333335 0.4285716 0.7142858 1.0000000
sum(out)
# [1] 6.000001
Because that is not perfectly 6, we can do one more step to safeguard it:
out <- out * 6/sum(out)
out
# [1] 0.6190476 0.4285715 0.7142857 0.5238096 0.9999998 0.2380954 0.3333335 0.4285715 0.7142857 0.9999998
sum(out)
# [1] 6
This process preserves the relative relationships of the numbers. If there are more "low" numbers than "high" numbers, scaling so that the sum is 6 will bring the higher numbers above 1. To compensate for that, we shift the lower-end (z in my code), so that all numbers are nudged up a little (but the lower numbers will be nudged up proportionately more).
The results should always be that the numbers are in [opt$minimum,1], and the sum will be 6.
This should be possible with a while loop that increases the values of x (capping them at an upper limit of 1):
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
current_sum <- sum(x)
target_sum <- 6
while (current_sum != target_sum) {
  print(current_sum)
  perc_diff <- (target_sum - current_sum) / target_sum
  x <- x * (1 + perc_diff)
  x[which(x > 1)] <- 1
  current_sum <- sum(x)
}
# Resulting x:
x <- c(0.833333333333333, 0.5, 1, 0.666666666666667, 1, 0.166666666666667,
       0.333333333333333, 0.5, 1)
There is likely a more mathematical way.
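One such route, as a minimal sketch using the same shift-and-scale form z + (1 - z) * s as the optimize() answer above: the sum is linear in z, so z can be solved for directly (this assumes the target lies between sum(s) and length(x), so that z stays in [0, 1]).
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
target <- 6
s <- (x - min(x)) / diff(range(x))              # x rescaled onto [0, 1]
z <- (target - sum(s)) / (length(x) - sum(s))   # solve sum(z + (1 - z) * s) = target
out <- z + (1 - z) * s
sum(out)
# [1] 6
range(out)                                      # all values lie in [z, 1]
# [1] 0.4285714 1.0000000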

Is there a way in Base R to replicate what VLOOKUP TRUE in Excel does?

I have a consumption pattern that looks like this:
x <-0:10
y<-c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)
x is in years, and y is not always monotonically increasing, although it should be most of the time.
If I needed to estimate how many years would elapse before 80% is consumed, in Excel I would use VLOOKUP with approximate match (TRUE), which would return 78%; then I would look up the next value in the series (86%) and linearly interpolate to get 4.25 years. It's laborious, but it gets the job done.
Is there an easy way to compute this in R, in a user-defined function that I can apply to many cases?
Thanks!
x <- 0:10
y <- c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)
estimate_years <- function(x, y, percent) {
  idx <- max(which(y < percent))
  (percent - y[idx]) / (y[idx + 1] - y[idx]) * (x[idx + 1] - x[idx]) + x[idx]
}
estimate_years(x, y, 0.80) ## 4.25
Although the approx calculation is cool, exact linear interpolation is easy here.
idx is the position of the last y (and corresponding x) value below percent; idx + 1 is therefore the first position at or above it.
By similar triangles, the fraction of the interval covered is
k = part / total = (percent - y[idx]) / (y[idx + 1] - y[idx])
Applying k to the x interval, k * (x[idx + 1] - x[idx]), gives the linear interpolation, and adding the last smaller year, x[idx], yields the result.
You could try with approx
resolution <- 1000
fn <- approx(x, y, n=resolution)
min(fn$x[fn$y > 0.8])
[1] 4.254254
The better you need your estimate to be, the higher you should set resolution.
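As an alternative minimal sketch, assuming y is increasing around the target, you can swap the roles of x and y so that approx() interpolates years as a function of the consumed fraction and evaluates it exactly at 80%:
x <- 0:10
y <- c(0, 0.05, 0.28, 0.45, 0.78, 0.86, 0.90, 0.92, 0.95, 0.98, 1.00)
approx(y, x, xout = 0.80)$y
# [1] 4.25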

punif and qunif Functions in R

I have a question. Suppose X is a random variable whose density is uniform between -2 and 3.
I want to answer these two questions:
What is the upper quartile of X?
What is the 44% quantile of X?
Now the things I have tried are below:
z <- 1 - punif(0.75, min = -2, max = 3, lower.tail = TRUE)
# answer: 0.45
y <- qunif(0.44, min = -2, max = 3, lower.tail = TRUE)
# answer: 0.2
First, is this even the right way to go about it?
Second, I understand that punif finds the accumulated probability of X. What does qunif find, and what does the result tell me about X and the distribution?
If you have a random variable X with a uniform distribution from a to b,
X ~ U(a, b)
then punif(q, a, b) is the probability that X <= q,
and qunif(p, a, b) finds the value q such that Pr(X <= q) = p.
You can visualize these plots with
curve(punif(x, -2, 3), from=-2, to=3, main="punif")
curve(qunif(x, -2, 3), from=0, to=1, main="qunif")
Note how punif expects a value anywhere between a and b, while qunif expects a probability, so its argument must be between 0 and 1.
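Applied to the original questions, as a minimal sketch: both quantities are quantiles of U(-2, 3), so qunif() gives them directly, and punif() only checks the other direction.
qunif(0.75, min = -2, max = 3)  # upper quartile: 1.75
qunif(0.44, min = -2, max = 3)  # 44% quantile: 0.2
punif(1.75, min = -2, max = 3)  # 0.75, confirming the upper quartile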

Optimisation of matrix in R

I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here but neither have quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
                0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1,1]/2, x[2,1]/2, x[4,1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step because it would be embarrassing and I haven't been able to get anything to work!
Ideally, I will end up with one set of propensities that generalises well across lots of different regions. Any advice on how I should go about achieving that? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
                            0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
             pop = c(2600, 16200, 13400),
             target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
                     0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1,1]/2, z[2,1]/2, z[4,1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose a random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from that column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb the second propensity so the column still adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
                    neighbour = neighbour,
                    nS = 50000,
                    printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is sufficient, except that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or the sum of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you would need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take a matrix, select two non-zero elements from one column, and increase one while decreasing the other. Thus, the column sum would remain unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .
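For illustration, here is a minimal sketch of that neighbourhood idea (the function name and step size are my own, and it assumes each column has at least two non-zero entries, as in the example): pick a column, pick two non-zero entries, and move a small random amount from one to the other, so the column sum is unchanged and zero propensities stay zero.
neighbour_shift <- function(x, data, step = 0.05) {
  # data is unused here; it is kept only so the signature matches the
  # neighbour(x, data) form used with LSopt in the update above
  i <- sample(ncol(x), 1)                      # random column
  j <- sample(which(x[, i] != 0), 2)           # two non-zero rows in that column
  delta <- runif(1, 0, min(step, x[j[2], i]))  # never push the donor below zero
  x[j[1], i] <- x[j[1], i] + delta             # increase one entry...
  x[j[2], i] <- x[j[2], i] - delta             # ...decrease the other by the same amount
  x
}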

R Generic solution to create 2*2 confusion matrix

My question is related to this one on producing a confusion matrix in R with the table() function. I am looking for a solution without using a package (e.g. caret).
Let's say these are our predictions and labels in a binary classification problem:
predictions <- c(0.61, 0.36, 0.43, 0.14, 0.38, 0.24, 0.97, 0.89, 0.78, 0.86, 0.15, 0.52, 0.74, 0.24)
labels <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0)
For these values, the solution below works well to create a 2*2 confusion matrix for, let's say, threshold = 0.5:
# Confusion matrix for threshold = 0.5
conf_matrix <- as.matrix(table(predictions > 0.5, labels))
conf_matrix
       labels
        0 1
  FALSE 4 3
  TRUE  2 5
However, I do not get a 2*2 matrix if I select any value that is smaller than min(predictions) or larger than max(predictions), since the data won't have either a FALSE or TRUE occurrence e.g.:
conf_matrix <- as.matrix(table(predictions > 0.05, labels))
conf_matrix
      labels
       0 1
  TRUE 6 8
I need a method that consistently produces a 2*2 confusion matrix for all possible thresholds (decision boundaries) between 0 and 1, as I use this as an input in an optimisation. Is there a way I can tweak the table function so it always returns a 2*2 matrix here?
You can make your thresholded prediction a factor variable to achieve this:
(conf_matrix <- as.matrix(table(factor(predictions > 0.05, levels = c(F, T)), labels)))
#        labels
#         0 1
#   FALSE 0 0
#   TRUE  6 8
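A minimal sketch wrapping that idea in a helper (the function name is my own), so any threshold between 0 and 1 yields a full 2*2 table, with the labels fixed as a factor too in case one class were ever absent:
confusion_2x2 <- function(predictions, labels, threshold) {
  table(predicted = factor(predictions > threshold, levels = c(FALSE, TRUE)),
        actual    = factor(labels, levels = c(0, 1)))
}
confusion_2x2(predictions, labels, 0.5)   # same counts as the 2*2 table above
confusion_2x2(predictions, labels, 0.05)  # still 2*2, with an all-zero FALSE row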
