Finding nth percentile in a matrix with conditions - r

I have a matrix measuring 100 rows x 10 columns:
mat1 = matrix(1:1000, nrow = 100, ncol = 10)
I wish to find the nth percentile of each column using colQuantiles, where n for each column is given by the corresponding probability in Probs, except when a value in Probs exceeds 0.99, in which case I want 0.99 applied instead.
Probs = c(0.99, 0.95, 1, 1, 0.96, 0.92, 1, 0.98, 0.99, 1)
I have tried the following:
Res = ifelse(Probs > 0.99, colQuantiles(mat1, Probs = c(0.99)), colQuantiles(mat1, probs = Probs))
But this simply returns the "if true" part of the statement for all ten columns of mat1, presumably because at least one of the values in Probs is > 0.99. How can I adapt the above so that it treats each column of mat1 individually according to the probabilities in Probs?

You can use mapply as follows:
Probs[Probs > 0.99] <- 0.99
unname(mapply(function(x, p) quantile(x, p),
              split(mat1, rep(1:ncol(mat1), each = nrow(mat1))),
              Probs))
output:
[1] 99.01 195.05 299.01 399.01 496.04 592.08 699.01 798.02 899.01 999.01
It splits the matrix into a set of column vectors (see How to convert a matrix to a list of column-vectors in R?) and then finds the nth percentile for each column.
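Since R 3.6.0, base asplit() splits a matrix into a list of column vectors more directly; a minimal sketch of the same idea (the asplit() variant is an assumption, not part of the original answer):
# asplit(mat1, 2) returns a list with one element per column of mat1
unname(mapply(function(x, p) quantile(x, p),
              asplit(mat1, 2),
              Probs))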

We cannot pass a different probability for each column to colQuantiles, but we can compute all of the probabilities for every column with a single colQuantiles call
temp <- matrixStats::colQuantiles(mat1, probs = pmin(Probs, 0.99))
and then extract the diagonal of the resulting matrix to get the required quantile for each column.
diag(temp)
#[1] 99.01 195.05 299.01 399.01 496.04 592.08 699.01 798.02 899.01 999.01
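Note that this computes every requested quantile for every column (a 10-by-10 matrix here) only to keep its diagonal. If mat1 had many columns, a per-column loop avoids that extra work; a base-R sketch (the vapply() variant is an assumption, not part of the original answer):
capped <- pmin(Probs, 0.99)
unname(vapply(seq_len(ncol(mat1)),
              function(j) quantile(mat1[, j], probs = capped[j]),
              numeric(1)))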

Related

r survey svyquantile: Rounding the number of decimal places

library(haven)
library(survey)
library(dplyr)
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Rename variables into something more readable
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
# Select the necessary columns
nhanesAnalysis <- nhanesDemo %>%
  select(fpl, age, gender, persWeight, psu, strata)
# Set up the design
nhanesDesign <- svydesign(id = ~psu,
                          strata = ~strata,
                          weights = ~persWeight,
                          nest = TRUE,
                          data = nhanesAnalysis)
# Select those between the ages of 18 and 79
ageDesign <- subset(nhanesDesign, age > 17 & age < 80 & !is.na(fpl))
quantile_results <- svyquantile(~fpl, ageDesign, quantiles=c(0.1, 0.5, 0.9))
print(quantile_results)
The default rounding of svyquantile appears to be two digits past the decimal place. How can I change this? I couldn't find anything in the documentation.
svyquantile does no rounding.
In this example, the two-digit precision is the precision of the data: fpl is given to only two decimal places, and by default svyquantile returns the left quantile, which is always one of the observed values. In fact, most of the distinct values of fpl occur multiple times: there are 20 observations equal to the 10th percentile, 29 equal to the median, and 1220 equal to the 90th percentile, so the quantile will be equal to one of the observed values in this example no matter what you specify for the qrule argument.
If you make fpl noisier, you'll get more digits:
> ageDesign<-update(ageDesign, fpl_noisy=fpl+runif(nrow(ageDesign),0,0.005))
> svyquantile(~fpl_noisy, ageDesign, quantiles=c(0.1, 0.5, 0.9))
$fpl_noisy
     quantile    ci.2.5   ci.97.5         se
0.1 0.8027744 0.7128426 0.8841695 0.04019022
0.5 2.9711470 2.5921659 3.3747105 0.18357099
0.9 5.0031355 5.0027002 5.0035307 0.00019482
attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

How to write an R function that sets values close to zero to zero and leaves others as is

I have a data frame with 3 columns and 40 rows. The first two columns contain values ranging from -1 to 1, and the 3rd column contains the sum of the two. I would like to change values close to zero, such as 0.3, 0.2, 0.1, -0.1, -0.2, -0.3, in the 3rd column to zero and leave the rest as they are.
library(dplyr)
set.seed(2)
D = data.frame(from = runif(40, -1, 1), to = runif(40, -1, 1)) %>% dplyr::mutate(weight = from + to)
I appreciate your help.
With replace, check if the absolute value is below a threshold:
thrs = 0.5
transform(D, weight = replace(weight, abs(weight) < thrs, 0))
Or in the dplyr framework:
D %>%
  mutate(weight = replace(weight, abs(weight) < 0.5, 0))
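An equivalent dplyr sketch with if_else (the if_else() variant is an assumption, not part of the original answer; same 0.5 threshold):
D %>%
  mutate(weight = if_else(abs(weight) < 0.5, 0, weight))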

Declining a value w/o a for loop

I would like to decline (i.e. multiply) a value (first.value) against a vector of percentage decline values (decline.vector), where the first value is declined by the first decline value, that result is then declined by the second decline value, and so on. I assume there is a more elegant way to do this in R than writing a for loop that reassigns the new value and uses cbind to build the new vector, but I remain a novice.
The decline vectors are not sequences like the one below; this is just an example.
Also, is it possible to create a sequence where 'by=' is a vector? I did not find anything in ?seq suggesting it is.
Whereby:
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Desired output:
[100] 85, 76.5, 70.763, 67.224, 65.544
You can do this with the Reduce function in base R
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Reduce(`*`, decline.vector, first.value, accumulate = TRUE)
# [1] 100.00000 85.00000 76.50000 70.76250 67.22437 65.54377
You could also use cumprod
first.value * cumprod(c(1, decline.vector))
# [1] 100.00000 85.00000 76.50000 70.76250 67.22438 65.54377
If you don't want first.value to be the first element of the output, then do
first.value * cumprod(decline.vector)
# [1] 85.00000 76.50000 70.76250 67.22438 65.54377
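A tidyverse-flavoured equivalent, if you prefer purrr (an assumption, not part of the original answer):
library(purrr)
# accumulate() carries the running product along, starting from first.value
accumulate(decline.vector, `*`, .init = first.value)
accumulate(decline.vector, `*`, .init = first.value)[-1]  # without the leading first.value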

Optimisation of matrix in R

I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here, but neither has quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
                0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1, 1] / 2, x[2, 1] / 2, x[4, 1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing, and I haven't been able to get anything to work!
Ideally I will have one set of propensities that generalises well for lots of different regions at the end of this process. Any advice on how I should go about achieving this? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
                            0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
             pop = c(2600, 16200, 13400),
             target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
                     0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1, 1] / 2, z[2, 1] / 2, z[4, 1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from mtx column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb second propensity to ensure mtx column adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
                    neighbour = neighbour,
                    nS = 50000,
                    printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#>           [,1]   [,2] [,3]
#> [1,] 0.0000000 0.3925  0.6
#> [2,] 0.0000000 0.4250  0.2
#> [3,] 0.2937976 0.0000  0.0
#> [4,] 0.0000000 0.1825  0.2
#> [5,] 0.7062024 0.0000  0.0
print(mtx.true)
#>      [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is mostly sufficient; the only issue is that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or of absolute values, as in the sketch below.
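For example, a minimal scalar wrapper around the hh() function from the question could look like this (the wrapper name hh_scalar is mine):
hh_scalar <- function(mtx, pop, target) {
  sum(hh(mtx, pop, target)^2)  # collapse the error vector into one number (sum of squares)
}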
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take a matrix, select two non-zero elements from one column, and increase one and decrease the other; the column sum would then remain unchanged (a minimal sketch follows).
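A minimal sketch of such a neighbourhood function (the function name, the fixed step size of 0.02, and the assumption that each column has at least two non-zero entries are mine):
neighbour_pair <- function(x, data, step = 0.02) {
  i <- sample(ncol(x), 1)             # pick a random column
  j <- sample(which(x[, i] != 0), 2)  # pick two non-zero entries in that column
  delta <- min(step, x[j[2], i])      # never push the second entry below zero
  x[j[1], i] <- x[j[1], i] + delta    # increase one entry ...
  x[j[2], i] <- x[j[2], i] - delta    # ... decrease the other: column sum is unchanged
  x                                   # data is unused here, but LSopt passes it to neighbour too
}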
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .

Sampling using conditional probability table

I am trying to simulate a certain discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("real_R", "real_G", "real_B")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have a conditional probability table that stipulates the value of the indicator given each of the "realities":
ri_matrix <- matrix(c(0.7, 0.3, 0,
                      0.2, 0.6, 0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)
dimnames(ri_matrix) <- list(paste("real", r_names, sep = "_"),
                            paste("ind", r_names, sep = "_"))
ri_matrix
#>        ind_R ind_G ind_B
#> real_R  0.70  0.30   0.0
#> real_G  0.20  0.60   0.2
#> real_B  0.05  0.15   0.8
Since base::sample() is not vectorized for the prob argument, I have to:
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x)
    sample(seq_len(ncol(rim)), 1, prob = rim[x, ], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#>        indicator
#> reality          1          2          3
#>       1 0.70043610 0.29956390 0.00000000
#>       2 0.19976124 0.59331476 0.20692400
#>       3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, the approach below is at least 50x faster, because it reuses the probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
  il <- mapply(function(x, y) {
          sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE)
        },
        x = split(r, r),
        y = split(rim, seq(nrow(rim))))
  unsplit(il, r)
}
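A quick way to check the speed-up claim (a sketch; the microbenchmark package is an assumption, not part of the original post):
microbenchmark::microbenchmark(
  per_element = sample_cond(reality, ri_matrix),
  per_group   = sample_cond_group(reality, ri_matrix),
  times = 10
)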
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this
simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace = TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}
Then you can call it with:
simFun(10000, r_probs, ri_matrix)
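And a quick sanity check that the simulated conditional distribution matches ri_matrix (a sketch; assumes simFun returns a data frame with columns ind and draw, as built above):
set.seed(300)
res <- simFun(10000, r_probs, ri_matrix)
# each row should be close to the corresponding row of ri_matrix
prop.table(table(res$ind, res$draw), margin = 1)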
