Using the optimisation function (optimise) together with dbinom in R (optimisation issue)

When p = 0.5, n = 5 and x = 3
dbinom(3,5,0.5) = 0.3125
Let's say I don't know p (n and x are known) and want to find it.
binp <- function(bp) dbinom(3,5,bp) - 0.3125
optimise(binp, c(0,1))
It does not return 0.5. Also, why is
dbinom(3,5,0.5) == 0.3125 #FALSE
But,
x <- dbinom(3,5,0.5)
x == dbinom(3,5,0.5) #TRUE

optimize() searches for the parameter that minimizes the output of the function. Your function can return a negative value (e.g., binp(0.1) is -0.3044), so optimize() heads for that minimum rather than for zero. If you want the parameter that minimizes the difference from zero, it is a good idea to use sqrt((...)^2) (or abs()). If you want the parameter that makes the output zero, uniroot() will help you. Also, the parameter you want is not uniquely determined. (Note: x <- dbinom(3, 5, 0.5); x == dbinom(3, 5, 0.5) is equivalent to dbinom(3, 5, 0.5) == dbinom(3, 5, 0.5), i.e., comparing the value with itself, which is why it is TRUE; the literal 0.3125 is not exactly the value dbinom() returns.)
## check output of dbinom(3, 5, prob)
input <- seq(0, 1, 0.001)
output <- Vectorize(dbinom, "prob")(3, 5, input)
plot(input, output, type="l")
abline(h = dbinom(3, 5, 0.5), col = 2) # there are two answers
max <- optimize(function(x) dbinom(3, 5, x), c(0, 1), maximum = T)$maximum # [1] 0.6000006
binp <- function(bp) dbinom(3,5,bp) - 0.3125 # your function
uniroot(binp, c(0, max))$root # [1] 0.5000036
uniroot(binp, c(max, 1))$root # [1] 0.6946854
binp2 <- function(bp) sqrt((dbinom(3,5,bp) - 0.3125)^2)
optimize(binp2, c(0, max))$minimum # [1] 0.499986
optimize(binp2, c(max, 1))$minimum # [1] 0.6947186
dbinom(3, 5, 0.5) == 0.3125 # [1] FALSE
round(dbinom(3, 5, 0.5), 4) == 0.3125 # [1] TRUE
format(dbinom(3, 5, 0.5), digits = 16) # [1] "0.3124999999999999"
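If all you need is to compare such floating-point values, the usual fix is to test with a tolerance instead of ==; a minimal sketch in base R:
## compare with a tolerance instead of exact equality
isTRUE(all.equal(dbinom(3, 5, 0.5), 0.3125))  # [1] TRUE
abs(dbinom(3, 5, 0.5) - 0.3125) < 1e-8        # [1] TRUE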


Solutions to a system of inequalities in R

Suppose I have the following system of inequalities:
-2x + y <= -3
1.25x + y <= 2.5
y >= -3
I want to find multiple tuples of (x, y) that satisfy the above inequalities.
library(Rglpk)
obj <- numeric(2)
mat <- matrix(c(-2, 1, 1.25, 1, 0, 1), nrow = 3)
dir <- c("<=", "<=", ">=")
rhs <- c(-3, 2.5, -3)
Rglpk_solve_LP(obj = obj, mat = mat, dir = dir, rhs = rhs)
The above code only seems to return one possible solution tuple, (1.5, 0). Is it possible to return other solution tuples?
Edit: Based on the comments, I would be interested to learn if there are any functions that could help me find the corner points.
To understand the possible answers to this question, we can try to solve the system of inequalities graphically.
There is a nice answer about plotting inequalities in R on Stack Overflow. Using that approach we can plot the following graph:
library(ggplot2)
fun1 <- function(x) 2*x - 3 # this is the same as -2x + y <= -3
fun2 <- function(x) -1.25*x + 2.5 # 1.25x + y <= 2.5
fun3 <- function(x) -3 # y >= -3
x1 = seq(-1,5, by = 1/16)
mydf = data.frame(x1, y1=fun1(x1), y2=fun2(x1),y3= fun3(x1))
mydf <- transform(mydf, z = pmax(y3,pmin(y1,y2)))
ggplot(mydf, aes(x = x1)) +
geom_line(aes(y = y1), colour = 'blue') +
geom_line(aes(y = y2), colour = 'green') +
geom_line(aes(y = y3), colour = 'red') +
geom_ribbon(aes(ymin=y3,ymax = z), fill = 'gray60')
All the possible (infinite by number) tuples lie inside the gray triangle.
The vertices can be found using the following code.
obj <- numeric(2)
mat <- matrix(c(-2, 1.25, 1, 1), nrow = 2)
rhs <- matrix(c(-3, 2.5), nrow = 2)
aPoint <- solve(mat, rhs)
mat <- matrix(c(-2, 0, 1, 1), nrow = 2)
rhs <- matrix(c(-3, -3), nrow = 2)
bPoint <- solve(mat, rhs)
mat <- matrix(c(1.25, 0, 1, 1), nrow = 2)
rhs <- matrix(c(2.5, -3), nrow = 2)
cPoint <- solve(mat, rhs)
Note the order of the matrix elements (R fills matrices column by column).
And you get the coordinates:
> aPoint
[,1]
[1,] 1.6923077
[2,] 0.3846154
> bPoint
[,1]
[1,] 0
[2,] -3
> cPoint
[,1]
[1,] 4.4
[2,] -3.0
All the code below uses base R only (no need for library(Rglpk)).
1. Corner Points
If you want to get all the corner points, here is one option
A <- matrix(c(-2, 1.25, 0, 1, 1, -1), nrow = 3)
b <- c(-3, 2.5, 3)
# we use `det` to check if the coefficient matrix is singular. If so, we return `NA`.
xh <-
combn(nrow(A), 2, function(k) {
if (det(A[k, ]) == 0) {
rep(NA, length(k))
} else {
solve(A[k, ], b[k])
}
})
# We filter out the points that satisfy the constraint
corner_points <- t(xh[, colSums(A %*% xh <= b, na.rm = TRUE) == length(b)])
which gives:
> corner_points
[,1] [,2]
[1,] 1.692308 0.3846154
[2,] 0.000000 -3.0000000
[3,] 4.400000 -3.0000000
2. Possible Tuples
If you want to have multiple tuples, e.g., n=10, we can use Monte Carlo simulation (based on the obtained corner_points in the previous step) to select the tuples under the constraints:
xrange <- range(corner_points[, 1])
yrange <- range(corner_points[, 2])
n <- 10
res <- list()
while (length(res) < n) {
px <- runif(1, xrange[1], xrange[2])
py <- runif(1, yrange[1], yrange[2])
if (all(A %*% c(px, py) <= b)) {
res[length(res) + 1] <- list(c(px, py))
}
}
and you will see n possible tuples in a list like below
> res
[[1]]
[1] 3.643167 -2.425809
[[2]]
[1] 2.039007 -2.174171
[[3]]
[1] 0.4990635 -2.3363637
[[4]]
[1] 0.6168402 -2.6736421
[[5]]
[1] 3.687389 -2.661733
[[6]]
[1] 3.852258 -2.704395
[[7]]
[1] 1.7571062 0.1067597
[[8]]
[1] 3.668024 -2.771307
[[9]]
[1] 2.108187 -1.365349
[[10]]
[1] 2.106528 -2.134310
First of all, the matrix representing the three constraints needs a small correction, because R fills matrices column by column:
-2x + y <= -3
1.25x + y <= 2.5
y >= -3
mat <- matrix(c(-2, 1.25, 0, 1, 1, 1), nrow = 3)
# and not : mat <- matrix(c(-2, 1, 1.25, 1, 0, 1), nrow = 3)
To get different tuples, you could modify the objective function :
obj <- numeric(2) results in an objective function 0 * x + 0 * y, which is always equal to 0 and can't be maximized: the first valid (x, y) will be selected.
Optimization on x is achieved by using obj <- c(1,0), resulting in maximization / minimization of 1 * x + 0 * y.
Optimization on y is achieved by using obj <- c(0,1).
#setting the bounds is necessary, otherwise optimization occurs only for x>=0 and y>=0
bounds <- list(lower = list(ind = c(1L, 2L), val = c(-Inf, -Inf)),
upper = list(ind = c(1L, 2L), val = c(Inf, Inf)))
# finding maximum x: obj = c(1,0), max = T
Rglpk_solve_LP(obj = c(1, 0), mat = mat, dir = dir, rhs = rhs, bounds = bounds, max = T)$solution
# [1] 4.4 -3.0
# finding minimum x: obj = c(1,0), max = F
Rglpk_solve_LP(obj = c(1, 0), mat = mat, dir = dir, rhs = rhs, bounds = bounds, max = F)$solution
#[1] 0 -3
# finding maximum y: obj = c(0,1), max = T
Rglpk_solve_LP(obj = c(0, 1), mat = mat, dir = dir, rhs = rhs, bounds = bounds, max = T)$solution
#[1] 1.6923077 0.3846154
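For completeness, the remaining direction (minimum y) can be probed the same way; a sketch assuming the corrected mat, dir, rhs and bounds defined above (the minimum y = -3 is attained along a whole edge, so the solver returns one of that edge's endpoints):
# finding minimum y: obj = c(0,1), max = F
Rglpk_solve_LP(obj = c(0, 1), mat = mat, dir = dir, rhs = rhs, bounds = bounds, max = F)$solution
# returns a vertex on the edge y = -3, e.g. 0 -3 or 4.4 -3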

rnorm is generating non-random looking realizations

I was debugging my simulation and I found that when I run rnorm(), my random normal values don't look random to me at all. ccc is the mean/sd vector that is given parametrically. How can I get really random normal realizations? Since my original simulation is quite long, I don't want to go into Gibbs sampling... Do you know why I get non-random-looking realizations of normal random variables?
> ccc
# [1] 144.66667 52.52671
> rnorm(20, ccc)
# [1] 144.72325 52.31605 144.44628 53.07380 144.64438 53.87741 144.91300 54.06928 144.76440
# [10] 52.09181 144.61817 52.17339 145.01374 53.38597 145.51335 52.37353 143.02516 52.49332
# [19] 144.27616 54.22477
> rnorm(20, ccc)
# [1] 143.88539 52.42435 145.24666 50.94785 146.10255 51.59644 144.04244 51.78682 144.70936
# [10] 53.51048 143.63903 51.25484 143.83508 52.94973 145.53776 51.93892 144.14925 52.35716
# [19] 144.08803 53.34002
Setting a function's parameters correctly is a basic but important point. Take rnorm() for example:
Its signature is rnorm(n, mean = 0, sd = 1). mean and sd are two different parameters, so you need to pass the respective value to each of them. Here is a confusing situation where you may get stuck:
arg <- c(5, 10)
rnorm(1000, arg)
This actually means rnorm(n = 1000, mean = c(5, 10), sd = 1). The standard deviation is set to 1 because the position of arg corresponds to the parameter mean and you don't set sd separately, so rnorm() takes the default value 1 for sd. However, what does mean = c(5, 10) mean? Let's check:
x <- rnorm(1000, arg)
hist(x, breaks = 50, prob = TRUE)
# lines(density(x), col = 2, lwd = 2)
mean = c(5, 10) and sd = 1 will be recycled to length 1000, i.e.
rnorm(n = 1000, mean = c(5, 10, 5, 10, ...), sd = c(1, 1, 1, 1, ...))
and hence the final sample x is actually a blend of 500 N(5, 1) samples and 500 N(10, 1) samples which are drawn alternately, i.e.
c(rnorm(1, 5, 1), rnorm(1, 10, 1), rnorm(1, 5, 1), rnorm(1, 10, 1), ...)
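A quick way to see this blend (a sketch; the exact means depend on the seed) is to split the draws by position, since the odd-indexed draws come from N(5, 1) and the even-indexed ones from N(10, 1):
set.seed(1)
x <- rnorm(1000, c(5, 10))   # mean is recycled to c(5, 10, 5, 10, ...)
mean(x[c(TRUE, FALSE)])      # odd positions, close to 5
mean(x[c(FALSE, TRUE)])      # even positions, close to 10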
As for your question, it should be:
arg <- c(5, 10)
rnorm(1000, arg[1], arg[2])
and this means rnorm(n = 1000, mean = 5, sd = 10). Check it again, and you will get a normal distribution with mean = 5 and sd = 10.
x <- rnorm(1000, arg[1], arg[2])
hist(x, breaks = 50, prob = T)
# curve(dnorm(x, arg[1], arg[2]), col = 2, lwd = 2, add = T)

Calculate row specific based on min

My data looks like this
df <- data.frame(x = c(3, 5, 4, 4, 3, 2),
y = c(.9, .8, 1, 1.2, .5, .1))
I am trying to multiply each x value by either y or 1, whichever is smaller.
df$z <- df$x * min(df$y, 1)
The problem is it is taking the min of the whole column, so it is multiplying every x by 0.1.
Instead, I need x multiplied by .9, .8, 1, 1, .5, .1...
We need pmin, which goes through each value of 'y' and takes the minimum when it is compared with the second argument (which is recycled):
pmin(df$y, 1)
#[1] 0.9 0.8 1.0 1.0 0.5 0.1
Likewise, we can pass any number of arguments (since the parameter is ...):
pmin(df$y, 1, 0)
#[1] 0 0 0 0 0 0
To get the output, just multiply 'x' with the pmin output
df$x * pmin(df$y, 1)
which can also be written as
with(df, x * pmin(y, 1))
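Putting it together (each z value is just x multiplied by the corresponding pmin result):
df$z <- with(df, x * pmin(y, 1))
df$z
#[1] 2.7 4.0 4.0 4.0 1.5 0.2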
Maybe you could use an ifelse function:
df <- data.frame(x = c(3, 5, 4, 4, 3, 2),
y = c(.9, .8, 1, 1.2, .5, .1))
df$z = ifelse(df$y<1, df$x*df$y, df$x*1)
This will compare the values of each row.
Hope it helps! :)

Test if vector is contained in another vector, including repetitions

I've been struggling with this one for a while: given two vectors, each containing possible repetitions of elements, how do I test if one is perfectly contained in the other? %in% does not account for repetitions. I can't think of an elegant solution that doesn't rely on something from the apply family.
x <- c(1, 2, 2, 2)
values <- c(1, 1, 1, 2, 2, 3, 4, 5, 6)
# returns TRUE, even though x contains more 2s than values does
all(x %in% values)
# inelegant solution
"%contains%" <-
function(values, x){
n <- intersect(x, values)
all( sapply(n, function(i) sum(values == i) >= sum(x == i)) )
}
# which yields the following:
> values %contains% x
[1] FALSE
> values <- c(values, 2)
> values %contains% x
[1] TRUE
Benchmarking update
I may have found another solution in addition to the answer provided by Marat below
# values and x must all be non-negative - can change the -1 below accordingly
"%contains%" <-
function(values, x){
t <- Reduce(function(.values, .x) .values[-which.max(.values == .x)]
, x = x
, init = c(-1, values))
t[1] == -1
}
Benchmarking all the answers so far, including thelatemail's modification of marat, using both large and small x
library(microbenchmark)
set.seed(31415)
values <- sample(c(0:100), size = 100000, replace = TRUE)
set.seed(11235)
x_lrg <- sample(c(0:100), size = 1000, replace = TRUE)
x_sml <- c(1, 2, 2, 2)
lapply(list(x_sml, x_lrg), function(x){
microbenchmark( hoho_sapply(values, x)
, marat_table(values, x)
, marat_tlm(values, x)
, hoho_reduce(values, x)
, unit = "relative")
})
# Small x
# [[1]]
# Unit: relative
# expr min lq mean median uq max neval
# hoho_sapply(values, x) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
# marat_table(values, x) 12.718392 10.966770 7.487895 9.260099 8.648351 1.819833 100
# marat_tlm(values, x) 1.354452 1.181094 1.026373 1.088879 1.266939 1.029560 100
# hoho_reduce(values, x) 2.951577 2.748087 2.069830 2.487790 2.216625 1.097648 100
#
# Large x
# [[2]]
# Unit: relative
# expr min lq mean median uq max neval
# hoho_sapply(values, x) 1.158303 1.172352 1.101410 1.177746 1.096661 0.6940260 100
# marat_table(values, x) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
# marat_tlm(values, x) 1.099669 1.059256 1.102543 1.071960 1.072881 0.9857229 100
# hoho_reduce(values, x) 85.666549 81.391495 69.089366 74.173366 66.943621 27.9766047 100
Try using table, e.g.:
"%contain%" <- function(values,x) {
tx <- table(x)
tv <- table(values)
z <- tv[names(tx)] - tx
all(z >= 0 & !is.na(z))
}
Some examples:
> c(1, 1, 1, 2, 2, 3, 4, 5, 6) %contain% c(1,2,2,2)
[1] FALSE
> c(1, 1, 1, 2, 2, 3, 4, 5, 6, 2) %contain% c(1,2,2,2)
[1] TRUE
> c(1, 1, 1, 2, 2, 3, 4, 5, 6) %contain% c(1,2,2)
[1] TRUE
> c(1, 1, 1, 2, 2, 3, 4, 5, 6) %contain% c(1,2,2,7)
[1] FALSE

Is there a weighted.median() function?

I'm looking for something similar in form to weighted.mean(). I've found some solutions via search that write out the entire function but would appreciate something a bit more user friendly.
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on GitHub).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
as ??? is a shortcut in the same way that ?some.function is for help(some.function).
Some experience using the answers from @wkmor1 and @Jaitropmange.
I've checked 3 functions from 3 packages: isotone, laeken, and matrixStats. Only matrixStats works properly. The other two (just like the median(rep(x, times=w)) solution) give integer output. Since I was calculating the median age of populations, decimal places matter.
Reproducible example. Calculation of the median age of a population
df <- data.frame(age = 0:100,
pop = spline(c(4,7,9,8,7,6,4,3,2,1),n = 101)$y)
library(isotone)
library(laeken)
library(matrixStats)
isotone::weighted.median(df$age,df$pop)
# [1] 36
laeken::weightedMedian(df$age,df$pop)
# [1] 36
matrixStats::weightedMedian(df$age,df$pop)
# [1] 36.164
median(rep(df$age, times=df$pop))
# [1] 35
Summary
matrixStats::weightedMedian() is the reliable solution
To calculate the weighted median of a vector x using a same length vector of (integer) weights w:
median(rep(x, times=w))
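For example, with the data from the weighted.mean() man page (integer weights):
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)
median(rep(x, times = w))
# [1] 3.5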
This is just a simple solution, ready to use almost anywhere.
weighted.median <- function(x, w) {
w <- w[order(x)]
x <- x[order(x)]
prob <- cumsum(w)/sum(w)
ps <- which(abs(prob - .5) == min(abs(prob - .5)))
return(x[ps])
}
Really old post, but I just came across it and did some testing of the different methods. spatstat::weighted.median() seemed to be about 14 times faster than median(rep(x, times=w)), and it's actually noticeable if you want to run the function more than a couple of times. Testing was with a relatively large survey, about 15,000 people.
One can also use stats::density to create a weighted PDF, then convert this to a CDF, as elaborated here:
my_wtd_q = function(x, w, prob, n = 4096)
with(density(x, weights = w/sum(w), n = n),
x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
Then my_wtd_q(x, w, .5) will be the weighted median.
One could also be more careful to ensure that the total area under the density is one by re-normalizing.
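As a sketch of that re-normalised variant (same idea, with the CDF forced to end at exactly 1; the name my_wtd_q2 is illustrative, not from the original answer, and prob is assumed to be a single probability):
my_wtd_q2 <- function(x, w, prob, n = 4096) {
  d <- density(x, weights = w/sum(w), n = n)
  cdf <- cumsum(d$y * (d$x[2L] - d$x[1L]))
  cdf <- cdf / cdf[length(cdf)]  # re-normalise so the CDF reaches exactly 1
  d$x[which.max(cdf >= prob)]
}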
A way in base R to get a weighted median is to order by the values, build the cumulative sum of the weights, and take the value(s) at half the total weight.
medianWeighted <- function(x, w, q=.5) {
n <- length(x)
i <- order(x)
w <- cumsum(w[i])
p <- w[n] * q
j <- findInterval(p, w)
Vectorize(function(p,j) if(w[n] <= 0) NA else
if(j < 1) x[i[1]] else
if(j == n) x[i[n]] else
if(w[j] == p) (x[i[j]] + x[i[j+1]]) / 2 else
x[i[j+1]])(p,j)
}
This gives the following results with simple input data.
medianWeighted(c(10, 40), c(1, 2))
#[1] 40
median(rep(c(10, 40), c(1, 2)))
#[1] 40
medianWeighted(c(10, 40), c(2, 1))
#[1] 10
median(rep(c(10, 40), c(2, 1)))
#[1] 10
medianWeighted(c(10, 40), c(1.5, 2))
#[1] 40
medianWeighted(c(10, 40), c(3, 4))
#[1] 40
median(rep(c(10, 40), c(3, 4)))
#[1] 40
medianWeighted(c(10, 40), c(1.5, 1.5))
#[1] 25
medianWeighted(c(10, 40), c(3, 3))
#[1] 25
median(rep(c(10, 40), c(3, 3)))
#[1] 25
medianWeighted(c(10, 40), c(0, 1))
#[1] 40
medianWeighted(c(10, 40), c(1, 0))
#[1] 10
medianWeighted(c(10, 40), c(0, 0))
#[1] NA
It can also be used for other quantiles:
medianWeighted(1:10, 10:1, seq(0, 1, 0.25))
[1] 1 2 4 6 10
Compare with other methods.
#Functions from other Answers
weighted.median <- function(x, w) {
w <- w[order(x)]
x <- x[order(x)]
prob <- cumsum(w)/sum(w)
ps <- which(abs(prob - .5) == min(abs(prob - .5)))
return(x[ps])
}
my_wtd_q = function(x, w, prob, n = 4096)
with(density(x, weights = w/sum(w), n = n),
x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
na.rm = FALSE, names = TRUE) {
if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
if (length(w) == 1) w <- rep(w, length(x))
if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
if (isTRUE(na.rm)) {
w <- w[!is.na(x)]
x <- x[!is.na(x)]
}
w <- w[order(x)] / sum(w)
x <- x[order(x)]
cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
res <- approx(x = cum_w, y = x, xout = probs)$y
if (isTRUE(names)) {
res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
}
res
}
Methods
M <- alist(
medRep = median(rep(DF$x, DF$w)),
isotone = isotone::weighted.median(DF$x, DF$w),
laeken = laeken::weightedMedian(DF$x, DF$w),
spatstat1 = spatstat.geom::weighted.median(DF$x, DF$w, type=1),
spatstat2 = spatstat.geom::weighted.median(DF$x, DF$w, type=2),
spatstat4 = spatstat.geom::weighted.median(DF$x, DF$w, type=4),
survey = survey::svyquantile(~x, survey::svydesign(id=~1, weights=~w, data=DF), 0.5)$x[1],
RAndres = weighted.median(DF$x, DF$w),
matrixStats = matrixStats::weightedMedian(DF$x, DF$w),
MichaelChirico = my_wtd_q(DF$x, DF$w, .5),
Leonardo = weighted.quantile(DF$x, DF$w, .5),
GKi = medianWeighted(DF$x, DF$w)
)
Results
DF <- data.frame(x=c(10, 40), w=c(1, 2))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 40.00000 40.00000 40.00000 40.00000 25.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 17.50000 40.00000 10.00000 30.00000 34.15005
# Leonardo.50% GKi
# 25.00000 40.00000
DF <- data.frame(x=c(10, 40), w=c(1, 1))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 25.00000 25.00000 40.00000 10.00000 10.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 10.00000 10.00000 10.00000 25.00000 25.05044
# Leonardo.50% GKi
# 25.00000 25.00000
In those two cases only isotone and GKi give identical results compared to what median(rep(x, w)) returns.
If you're working with the survey package, assuming you've defined your survey design and x is your variable of interest:
svyquantile(~x, mydesign, c(0.5))
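A minimal end-to-end sketch, assuming a data frame DF with a variable x and weights w (the design specification mirrors the one used in the benchmark above):
library(survey)
DF <- data.frame(x = c(10, 40), w = c(1, 2))
mydesign <- svydesign(id = ~1, weights = ~w, data = DF)
svyquantile(~x, mydesign, 0.5)  # the weighted median of x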
I got here looking for weighted quantiles, so I thought I might as well leave for future readers what I ended up with. Naturally, using probs = 0.5 will return the weighted median.
I started with MichaelChirico's answer, which unfortunately was off at the edges. Then I decided to switch from density() to approx(). Finally, I believe I nailed the correction factor to ensure consistency with the default algorithm of the unweighted quantile().
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
na.rm = FALSE, names = TRUE) {
if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
if (length(w) == 1) w <- rep(w, length(x))
if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
if (isTRUE(na.rm)) {
w <- w[!is.na(x)]
x <- x[!is.na(x)]
}
w <- w[order(x)] / sum(w)
x <- x[order(x)]
cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
res <- approx(x = cum_w, y = x, xout = probs)$y
if (isTRUE(names)) {
res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
}
res
}
When weights are uniform, the weighted quantiles are identical to regular unweighted quantiles:
x <- rnorm(100)
stopifnot(identical(weighted.quantile(x, w = 1), quantile(x)))
Example using the same data as in the weighted.mean() man page.
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
stopifnot(isTRUE(all.equal(
weighted.quantile(x, w, 0:4/4, names = FALSE),
c(2.8, 3.33611111111111, 3.46111111111111, 3.58157894736842,
3.7)
)))
And this is for whoever solely wants the weighted median value:
weighted.median <- function(x, w, ...) {
weighted.quantile(x, w, probs = 0.5, names = FALSE, ...)
}
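Using the same data as above, this returns the 50% value from the quantile vector:
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
weighted.median(x, w)
# [1] 3.461111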
