How to write an if-else function in R to generate a rule?

I want to print the rule like the following:
rule(0,0,0) = 0
rule(0,0,1) = 1
rule(0,1,0) = 1
rule(0,1,1) = 1
rule(1,0,0) = 1
rule(1,0,1) = 0
rule(1,1,0) = 0
rule(1,1,1) = 0
I tried to write a set of "if-else" conditions that test for zeroes and ones in the input values, returning the corresponding value from the table as output, but it didn't work.
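For reference, a literal if/else version could look something like this (a sketch of the approach described above, not the original attempt; rule_if is a hypothetical name and it only handles single 0/1 values, one triple at a time):
rule_if <- function(a, b, c) {
  # the four combinations that map to 0; everything else maps to 1
  if (a == 0 && b == 0 && c == 0) 0
  else if (a == 1 && b == 0 && c == 1) 0
  else if (a == 1 && b == 1 && c == 0) 0
  else if (a == 1 && b == 1 && c == 1) 0
  else 1
}
rule_if(0, 1, 1)
#> [1] 1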

A single comparison will do: 4*a + 2*b + c encodes the three bits as an integer from 0 to 7, and the table returns 1 exactly for the values 1 through 4, i.e. when abs(4*a + 2*b + c - 2.5) < 2. This would be the most performant implementation I can think of.
rule <- function(a, b, c) +(abs(4*a + 2*b + c - 2.5) < 2)
A parsimonious solution:
rule <- function(a, b, c) +xor(a, b | c)
Note both solutions are vectorized.
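Since both are vectorized, the whole truth table can be checked in one call (a quick sketch, not part of the original answer):
g <- expand.grid(a = 0:1, b = 0:1, c = 0:1)  # all eight input combinations
cbind(g, rule = rule(g$a, g$b, g$c))
The eight rows should reproduce the table from the question (in a different row order).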
Benchmarking:
rule1 <- function(a, b, c) +(abs(4*a + 2*b + c - 2.5) < 2)
rule2 <- function(a, b, c) +xor(a, b | c)
rule3 <- function(x, y, z) +((4 * x + 2 * y + z) %in% 1:4)
abc <- matrix(sample(0:1, 3e6, 1), 1e6, 3)
microbenchmark::microbenchmark(rule1 = rule1(abc[,1], abc[,2], abc[,3]),
rule2 = rule2(abc[,1], abc[,2], abc[,3]),
Thomas = rule3(abc[,1], abc[,2], abc[,3]),
check = "identical")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rule1 13.5161 16.58425 20.74505 17.69030 20.22745 53.9513 100
#> rule2 32.7552 35.05735 39.41473 36.27760 39.38165 74.8564 100
#> Thomas 24.6562 28.39065 33.78937 29.70875 33.19045 65.9709 100

A possibility is to concatenate the three digits and then use switch:
rule <- function(a, b, c) {
  x <- paste0(a, b, c)
  switch(
    x,
    "000" = 0,
    "001" = 1,
    ......
  )
}
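For completeness, the fully written-out switch, filled in from the truth table in the question (a sketch; note switch() is not vectorized, so this handles one triple at a time):
rule <- function(a, b, c) {
  switch(paste0(a, b, c),
    "000" = 0, "001" = 1, "010" = 1, "011" = 1,
    "100" = 1, "101" = 0, "110" = 0, "111" = 0
  )
}
rule(1, 0, 1)
#> [1] 0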

You can try
rule <- function(x, y, z) {
+((4 * x + 2 * y + z) %in% 1:4)
}
or
rule <- function(x, y, z) {
bitwXor(x, bitwOr(y, z))
}
Benchmark
rule1 <- function(a, b, c) +(abs(4 * a + 2 * b + c - 2.5) < 2)
rule2 <- function(a, b, c) +xor(a, b | c)
rule3 <- function(x, y, z) +((4 * x + 2 * y + z) %in% 1:4)
rule4 <- function(x, y, z) bitwXor(x, bitwOr(y, z))
abc <- matrix(sample(0:1, 3e6, 1), 1e6, 3)
microbenchmark::microbenchmark(
rule1 = rule1(abc[, 1], abc[, 2], abc[, 3]),
rule2 = rule2(abc[, 1], abc[, 2], abc[, 3]),
Thomas1 = rule3(abc[, 1], abc[, 2], abc[, 3]),
Thomas2 = rule4(abc[, 1], abc[, 2], abc[, 3]),
check = "identical"
)
gives
Unit: milliseconds
expr min lq mean median uq max neval
rule1 16.1315 22.82880 32.91071 24.48080 28.29635 113.1915 100
rule2 33.6093 40.93665 50.12914 44.77415 48.90045 128.0033 100
Thomas1 26.6938 34.78615 43.34770 37.63255 42.49940 114.3973 100
Thomas2 9.1119 12.25080 18.46705 16.26445 18.46835 105.1263 100

In the case that your rule can't be expressed as a mathematical expression (as jblood94 and ThomasIsCoding did), you can create a list with the options, check which option was passed, and get the result associated with that option.
rule = function(n1, n2, n3){
combin = list(c(0,0,0), c(1,0,0), c(0,1,0), c(0,0,1), c(1,1,0), c(1,0,1), c(0,1,1), c(1,1,1))
result = c(0, 1, 1, 1, 0, 0, 1, 0)
index = which(sapply(combin, function(x){identical(x, c(n1, n2, n3))}))
result[index]
}
rule(0, 1, 0)
[1] 1

For the fun of brevity:
rule <- function(first, second, third){
  bits_as_decimal = paste(first, second, third, sep = '') |>
    strtoi(base = 2)
  (bits_as_decimal %in% 1:4) |> as.integer()
}
What happens:
the three arguments are concatenated and treated as a three-bit binary number (e.g. 001)
the binary string is converted to its decimal equivalent (strtoi)
check whether the decimal equivalent falls into the range 1-4 (the combinations that should map to 1)
re-cast the resulting boolean to an integer (0/1)
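A quick check (outputs assumed from running the function above; it is vectorized because paste(), strtoi() and %in% all are):
rule(0, 0, 1)
#> [1] 1
rule(c(0, 1), c(0, 0), c(1, 1))
#> [1] 1 0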

Related

R: Keep function vectorized using all()

I have a function fun checking multiple conditions a, b. If all conditions are fulfilled, the function should return TRUE, else it should return FALSE.
a = 1
b = 0
fun <- function(a, b){
all(a < 1,
b < 1,
na.rm = TRUE)
}
fun(a, b)
This function does the trick. However, if I use vectors now, all() does of course not keep the vector form but rather returns a single TRUE or FALSE.
I would like to have a function that works the same as the following one:
a = 1:2
b = 0:1
funV <- function(a, b){
a < 1 & b < 1
}
funV(a, b)
but without chaining & and it should also work with missing values.
pmin + as.logical = vectorized all(): pmin() takes elementwise minima, so any FALSE (0) in a condition gives 0, and na.rm = TRUE skips NAs just like all(..., na.rm = TRUE) does.
fun <- function(a, b){
as.logical(pmin(a < 1, b < 1, na.rm = TRUE))
}
fun(1:2, 0:1)
# [1] FALSE FALSE
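Missing values are handled the same way as in all(..., na.rm = TRUE): pmin() drops the NA and keeps the remaining condition (a quick sketch with assumed inputs):
fun(c(NA, 0.5), c(0, 2))
# [1] TRUE FALSE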
Benchmark
# Unit: milliseconds
# expr min lq mean median uq max neval
# pmin_all(a, b) 1.816587 1.843934 2.223257 1.868905 3.004286 5.936595 100
# mapply_all(a, b) 181.204836 183.868243 188.579629 185.331190 188.332364 347.997674 100
# vec_all(a, b) 186.911905 190.187575 194.159146 192.135094 194.848294 218.416740 100
pmin_all <- function(a, b){
as.logical(pmin(a < 1, b < 1, na.rm = TRUE))
}
mapply_all <- function(a, b){
mapply(\(x, y) all(x < 1, y < 1, na.rm = TRUE), a, b)
}
vec_all <- Vectorize(function(a, b){
all(a < 1, b < 1, na.rm = TRUE)
})
a <- rnorm(1e5, mean = 1)
b <- rnorm(1e5, mean = 1)
library(microbenchmark)
bm <- microbenchmark(
pmin_all(a, b),
mapply_all(a, b),
vec_all(a, b),
check = 'identical'
)
We can use Vectorize() for this to create a vectorized function. Vectorize() uses mapply() under the hood.
fun <- function(a,b){
all(a < 1,
b < 1,
na.rm = TRUE)
}
a = 1:2
b = 0:1
funV <- Vectorize(fun)
funV(a,b)
#> [1] FALSE FALSE
Created on 2023-02-14 by the reprex package (v2.0.1)
With mapply:
fun <- function(a,b){
mapply(\(x, y) all(x < 1, y < 1, na.rm = TRUE), a, b)
}
fun(1:2, 0:1)
#[1] FALSE FALSE

Plotting CDF and PDF in R with custom function

I was wondering if there was any way to plot this PDF and CDF in R. I found these on a different question a user asked, and was curious.
I know that I have to create a function and then plot this, but I'm struggling with the different parameters and am unsure how to translate this to R. I have only ever plotted PDF/CDF using a normal distribution, or from datasets.
You can write the pdf and cdf as functions, using ifelse to specify the values within the ranges, then simply plot the functions over the desired range:
pdf <- function(x) {
ifelse(x >= 0 & x < 1, 0.5, ifelse(x < 1.5 & x >= 1, 1, 0))
}
cdf <- function(x) {
ifelse(x < 0, 0,
ifelse(x >= 0 & x < 1, 0.5 * x,
ifelse(x < 1.5 & x >= 1, x - 0.5, 1)))
}
plot(pdf, xlim = c(-1, 2), type = "s")
plot(cdf, xlim = c(-1, 2))
Created on 2022-10-27 with reprex v2.0.2
ifelse can be very slow; we can fill an output vector instead. numeric() creates a vector of zeroes of a specified length, and we then simply change everything that should not yield zero.
pdf_vec <- function(x) {
out <- numeric(length(x))
out[x >= 0 & x < 1] <- .5
out[x >= 1 & x < 1.5] <- 1
out
}
cdf_vec <- function(x) {
out <- numeric(length(x))
tmp <- x >= 0 & x < 1
out[tmp] <- .5*x[tmp]
tmp <- x >= 1 & x < 1.5
out[tmp] <- x[tmp] - .5
tmp <- x >= 1.5
out[tmp] <- 1
out
}
set.seed(42)
x <- rnorm(1e6)
stopifnot(all.equal(cdf(x), cdf_vec(x)))
stopifnot(all.equal(pdf(x), pdf_vec(x)))
@Allan Cameron already showed nicely how to plot it.
Microbenchmark
It's about three times faster than the ifelse solution.
microbenchmark::microbenchmark(
cdf_ifelse=cdf(x), cdf_vec=cdf_vec(x), check='equal'
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# cdf_ifelse 110.66285 121.25428 133.0789 133.86041 142.53296 167.64401 100 b
# cdf_vec 43.56277 45.08759 48.8924 46.83869 49.46047 74.95487 100 a
microbenchmark::microbenchmark(
pdf_ifelse=pdf(x), pdf_vec=pdf_vec(x), check='equal'
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# pdf_ifelse 71.39501 76.71747 101.17738 81.43037 87.82162 192.3919 100 b
# pdf_vec 27.82642 30.02056 34.55301 30.38457 31.29751 133.2798 100 a
We can try the code below
f <- function(x) 0.5 * (x >= 0) + 0.5 * (x >= 1) - (x >= 1.5)
F <- Vectorize(function(x) integrate(f, -Inf, x)$value)
plot(f, -2, 2, col = "red")
curve(F, -2, 2, add = TRUE, col = "blue")
legend(-1, 0.8,
legend = c("pdf", "cdf"),
col = c("red", "blue"),
lty = 1:2, cex = 2,
box.col = "white"
)

Split a vector into chunks such that sum of each chunk is approximately constant

I have a large data frame with more than 100 000 records where the values are sorted
For example, consider the following dummy data set
df <- data.frame(values = c(1,1,2,2,3,4,5,6,6,7))
I want to create 3 groups of above values (in sequence only) such that the sum of each group is more or less the same
So for the above group, if I decide to divide the sorted df in 3 groups as follows, their sums will be
1. 1 + 1 + 2 +2 + 3 + 4 = 13
2. 5 + 6 = 11
3. 6 + 7 = 13
How can I create this optimization in R? Any logic?
So, let's use pruning. I think the other solutions give a good split, but not the best one.
First, we want to minimize
D(p, q) = (S_p - S*)^2 + (S_q - S_p - S*)^2 + (S_n - S_q - S*)^2, with S* = S_n / 3,
over the two cut points p < q, where S_n is the cumulative sum of the first n elements.
computeD <- function(p, q, S) {
n <- length(S)
S.star <- S[n] / 3
if (all(p < q)) {
(S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
} else {
stop("You shouldn't be here!")
}
}
I think the other solutions optimize over p and q independently, which won't give a global minimum (except in some particular cases).
optiCut <- function(v) {
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / 3
# good starting values
p_star <- which.min((S - S_star)^2)
q_star <- which.min((S - 2*S_star)^2)
print(min <- computeD(p_star, q_star, S))
count <- 0
for (q in 2:(n-1)) {
S3 <- S[n] - S[q] - S_star
if (S3*S3 < min) {
count <- count + 1
D <- computeD(seq_len(q - 1), q, S)
ind = which.min(D);
if (D[ind] < min) {
# Update optimal values
p_star = ind;
q_star = q;
min = D[ind];
}
}
}
c(p_star, q_star, computeD(p_star, q_star, S), count)
}
This is as fast as the other solutions because it prunes a lot of iterations based on the condition S3*S3 < min. But it gives the optimal solution; see optiCut(c(1, 2, 3, 3, 5, 10)).
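For that vector, the independent starting cuts give D = 14, while the pruned search finds p = 3, q = 5 with D = 8, i.e. groups summing to 6, 8 and 10 (output assumed from running the code above; the first line comes from the print() inside the function):
optiCut(c(1, 2, 3, 3, 5, 10))
#> [1] 14
#> [1] 3 5 8 1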
For the solution with K >= 3, I basically reimplemented trees with nested tibbles, that was fun!
optiCut_K <- function(v, K) {
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / K
# good starting values
p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i*S_star)^2))
min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)
compute_children <- function(level, ind, val) {
# leaf
if (level == 1) {
val <- val + (S[ind] - S_star)^2
if (val > min_first) {
return(NULL)
} else {
return(val)
}
}
P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
inds <- which(P_all < min_first)
if (length(inds) == 0) return(NULL)
node <- tibble::tibble(
level = level - 1,
ind = inds,
val = P_all[inds]
)
node$children <- purrr::pmap(node, compute_children)
node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
`if`(nrow(node) == 0, NULL, node)
}
compute_children(K, n, 0)
}
This gives you all the solutions that are at least as good as the greedy one:
v <- sort(sample(1:1000, 1e5, replace = TRUE))
test <- optiCut_K(v, 9)
You need to unnest this:
full_unnest <- function(tbl) {
tmp <- try(tidyr::unnest(tbl), silent = TRUE)
`if`(identical(class(tmp), "try-error"), tbl, full_unnest(tmp))
}
print(test <- full_unnest(test))
And finally, to get the best solution:
test[which.min(test$children), ]
Here is one approach:
splitter <- function(values, N){
inds = c(0, sapply(1:N, function(i) which.min(abs(cumsum(as.numeric(values)) - sum(as.numeric(values))/N*i))))
dif = diff(inds)
re = rep(1:length(dif), times = dif)
return(split(values, re))
}
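Applied to the dummy data from the question, it reproduces the grouping shown there (a quick check, not part of the original answer):
df <- data.frame(values = c(1, 1, 2, 2, 3, 4, 5, 6, 6, 7))
splitter(df$values, 3)
#$`1`
#[1] 1 1 2 2 3 4
#$`2`
#[1] 5 6
#$`3`
#[1] 6 7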
How good is it?
# I calculate the mean and sd of the maximal difference of the sums in the
#splits of 100 runs:
#split on 15 parts
set.seed(5)
z1 = as.data.frame(matrix(1:15, nrow=1))
repeat{
values = sort(sample(1:1000, 1000000, replace = T))
z = splitter(values, 15)
z = lapply(z, sum)
z = unlist(z)
z1 = rbind(z1, z)
if (nrow(z1)>101){
break
}
}
z1 = z1[-1,]
mean(apply(z1, 1, function(x) max(x) - min(x)))
[1] 1004.158
sd(apply(z1, 1, function(x) max(x) - min(x)))
[1] 210.6653
#with less splits (4)
set.seed(5)
z1 = as.data.frame(matrix(1:4, nrow=1))
repeat{
values = sort(sample(1:1000, 1000000, replace = T))
z = splitter(values, 4)
z = lapply(z, sum)
z = unlist(z)
z1 = rbind(z1, z)
if (nrow(z1)>101){
break
}
}
z1 = z1[-1,]
mean(apply(z1, 1, function(x) max(x) - min(x)))
#632.7723
sd(apply(z1, 1, function(x) max(x) - min(x)))
#260.9864
library(microbenchmark)
With a 1M-element vector:
values = sort(sample(1:1000, 1000000, replace = T))
microbenchmark(
sp_27 = splitter(values, 27),
sp_3 = splitter(values, 3),
)
Unit: milliseconds
expr min lq mean median uq max neval cld
sp_27 897.7346 934.2360 1052.0972 1078.6713 1118.6203 1329.3044 100 b
sp_3 108.3283 116.2223 209.4777 173.0522 291.8669 409.7050 100 a
By the way, F. Privé is correct: this function does not give the globally optimal split. It is greedy, which is not a good characteristic for such a problem. It gives splits with sums closer to sum/N in the initial part of the vector, but doing so compromises the splits in the later part of the vector.
Here is a test comparison of the three functions posted so far:
db = function(values, N){
temp = floor(sum(values)/N)
inds = c(0, which(c(0, diff(cumsum(values) %% temp)) < 0)[1:(N-1)], length(values))
dif = diff(inds)
re = rep(1:length(dif), times = dif)
return(split(values, re))
} #had to change it a bit since the posted one would not work - the core
#which calculates the splitting positions is the same
missuse <- function(values, N){
inds = c(0, sapply(1:N, function(i) which.min(abs(cumsum(as.numeric(values)) - sum(as.numeric(values))/N*i))))
dif = diff(inds)
re = rep(1:length(dif), times = dif)
return(split(values, re))
}
prive = function(v, N){ #added dummy N argument because of the tester function
dummy = N
computeD <- function(p, q, S) {
n <- length(S)
S.star <- S[n] / 3
if (all(p < q)) {
(S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
} else {
stop("You shouldn't be here!")
}
}
optiCut <- function(v, N) {
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / 3
# good starting values
p_star <- which.min((S - S_star)^2)
q_star <- which.min((S - 2*S_star)^2)
print(min <- computeD(p_star, q_star, S))
count <- 0
for (q in 2:(n-1)) {
S3 <- S[n] - S[q] - S_star
if (S3*S3 < min) {
count <- count + 1
D <- computeD(seq_len(q - 1), q, S)
ind = which.min(D);
if (D[ind] < min) {
# Update optimal values
p_star = ind;
q_star = q;
min = D[ind];
}
}
}
c(p_star, q_star, computeD(p_star, q_star, S), count)
}
z3 = optiCut(v)
inds = c(0, z3[1:2], length(v))
dif = diff(inds)
re = rep(1:length(dif), times = dif)
return(split(v, re))
} #added output to be more in line with the other two
Function for testing:
tester = function(split, seed){
set.seed(seed)
z1 = as.data.frame(matrix(1:3, nrow=1))
repeat{
values = sort(sample(1:1000, 1000000, replace = T))
z = split(values, 3)
z = lapply(z, sum)
z = unlist(z)
z1 = rbind(z1, z)
if (nrow(z1)>101){
break
}
}
m = mean(apply(z1, 1, function(x) max(x) - min(x)))
s = sd(apply(z1, 1, function(x) max(x) - min(x)))
return(c("mean" = m, "sd" = s))
} #tests 100 random 1M length vectors with elements drawn from 1:1000
tester(db, 5)
#mean sd
#779.5686 349.5717
tester(missuse, 5)
#mean sd
#481.4804 216.9158
tester(prive, 5)
#mean sd
#451.6765 174.6303
prive is the clear winner; however, it takes quite a bit longer than the other two, and it can only split into 3 groups.
microbenchmark(
missuse(values, 3),
prive(values, 3),
db(values, 3)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
missuse(values, 3) 100.85978 111.1552 185.8199 120.1707 304.0303 393.4031 100 a
prive(values, 3) 1932.58682 1980.0515 2096.7516 2043.7133 2211.6294 2671.9357 100 b
db(values, 3) 96.86879 104.5141 194.0085 117.6270 306.7143 500.6455 100 a
N = 3
temp = floor(sum(df$values)/N)
inds = c(0, which(c(0, diff(cumsum(df$values) %% temp)) < 0)[1:(N-1)], NROW(df))
split(df$values, rep(1:N, ifelse(N == 1, NROW(df), diff(inds))))
#$`1`
#[1] 1 1 2 2 3 4
#$`2`
#[1] 5 6
#$`3`
#[1] 6 7

R - Vectorized implementation of ternary function

I have three vectors X, Y and Z of equal length n. I need to create an n x n x n array of a function f(X[i], Y[j], Z[k]). The straightforward way to do this is to sequentially loop through each element of each of the 3 vectors. However, the time required to compute the array grows very quickly (as n^3) with n. Is there a way to implement this using vectorized operations?
EDIT: As mentioned in the comments, I have added a simple example of what's needed.
set.seed(1)
X = rnorm(10)
Y = seq(11,20)
Z = seq(21,30)
F = array(0, dim=c( length(X),length(Y),length(Z) ) )
for (i in 1:length(X))
for (j in 1:length(Y))
for (k in 1:length(Z))
F[i,j,k] = X[i] * (Y[j] + Z[k])
Thanks.
You can use nested outer():
set.seed(1)
X = rnorm(10)
Y = seq(11,20)
Z = seq(21,30)
F = array(0, dim = c( length(X),length(Y),length(Z) ) )
for (i in 1:length(X))
for (j in 1:length(Y))
for (k in 1:length(Z))
F[i,j,k] = X[i] * (Y[j] + Z[k])
F2 <- outer(X, outer(Y, Z, "+"), "*")
> identical(F, F2)
[1] TRUE
A microbenchmark including the expand.grid solution proposed by Nick K:
X = rnorm(100)
Y = 1:100
Z = 101:200
forLoop <- function(X, Y, Z) {
F = array(0, dim = c( length(X),length(Y),length(Z) ) )
for (i in 1:length(X))
for (j in 1:length(Y))
for (k in 1:length(Z))
F[i,j,k] = X[i] * (Y[j] + Z[k])
return(F)
}
nestedOuter <- function(X, Y, Z) {
outer(X, outer(Y, Z, "+"), "*")
}
expandGrid <- function(X, Y, Z) {
df <- expand.grid(X = X, Y = Y, Z = Z)
G <- df$X * (df$Y + df$Z)
dim(G) <- c(length(X), length(Y), length(Z))
return(G)
}
library(microbenchmark)
mbm <- microbenchmark(
forLoop = F1 <- forLoop(X, Y, Z),
nestedOuter = F2 <- nestedOuter(X, Y, Z),
expandGrid = F3 <- expandGrid(X, Y, Z),
times = 50L)
> mbm
Unit: milliseconds
expr min lq mean median uq max neval
forLoop 3261.872552 3339.37383 3458.812265 3388.721159 3524.651971 4074.40422 50
nestedOuter 3.293461 3.36810 9.874336 3.541637 5.126789 54.24087 50
expandGrid 53.907789 57.15647 85.612048 88.286431 103.516819 235.45443 50
Here's an additional option, a possible Rcpp implementation (in case you like your loops). I wasn't able to outperform @Julien's solution though (maybe someone can), but they have more or less the same timing.
library(Rcpp)
cppFunction('NumericVector RCPP(NumericVector X, NumericVector Y, NumericVector Z){
int nrow = X.size(), ncol = 3, indx = 0;
double temp(1) ;
NumericVector out(pow(nrow, ncol)) ;
IntegerVector dim(ncol) ;
for (int l = 0; l < ncol; l++){
dim[l] = nrow;
}
for (int j = 0; j < nrow; j++) {
for (int k = 0; k < nrow; k++) {
temp = Y[j] + Z[k] ;
for (int i = 0; i < nrow; i++) {
out[indx] = X[i] * temp ;
indx += 1 ;
}
}
}
out.attr("dim") = dim;
return out;
}')
Validating
identical(RCPP(X, Y, Z), F)
## [1] TRUE
A quick benchmark
set.seed(123)
X = rnorm(100)
Y = 1:100
Z = 101:200
nestedOuter <- function(X, Y, Z) outer(X, outer(Y, Z, "+"), "*")
library(microbenchmark)
microbenchmark(
nestedOuter = nestedOuter(X, Y, Z),
RCPP = RCPP(X, Y, Z),
unit = "relative",
times = 1e4)
# Unit: relative
# expr min lq mean median uq max neval
# nestedOuter 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 10000
# RCPP 1.164254 1.141713 1.081235 1.100596 1.080133 0.7092394 10000
You could use expand.grid as follows:
df <- expand.grid(X = X, Y = Y, Z = Z)
G <- df$X * (df$Y + df$Z)
dim(G) <- c(length(X), length(Y), length(Z))
all.equal(F, G)
If you had a vectorised function, this would work just as well. If not, you could use plyr::daply.

Is there a weighted.median() function?

I'm looking for something similar in form to weighted.mean(). I've found some solutions via search that write out the entire function but would appreciate something a bit more user friendly.
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on GitHub).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
as ??? is a shortcut in the same way ?some.function is for help(some.function)
Some experience using the answers from @wkmor1 and @Jaitropmange.
I've checked 3 functions from 3 packages: isotone, laeken, and matrixStats. Only matrixStats works properly. The other two (just like the median(rep(x, times=w)) solution) give integer output. Since I was calculating the median age of populations, decimal places matter.
Reproducible example. Calculation of the median age of a population
df <- data.frame(age = 0:100,
pop = spline(c(4,7,9,8,7,6,4,3,2,1),n = 101)$y)
library(isotone)
library(laeken)
library(matrixStats)
isotone::weighted.median(df$age,df$pop)
# [1] 36
laeken::weightedMedian(df$age,df$pop)
# [1] 36
matrixStats::weightedMedian(df$age,df$pop)
# [1] 36.164
median(rep(df$age, times=df$pop))
# [1] 35
Summary
matrixStats::weightedMedian() is the reliable solution
To calculate the weighted median of a vector x using a same length vector of (integer) weights w:
median(rep(x, times=w))
This is just a simple solution, ready to use almost anywhere.
weighted.median <- function(x, w) {
w <- w[order(x)]
x <- x[order(x)]
prob <- cumsum(w)/sum(w)
ps <- which(abs(prob - .5) == min(abs(prob - .5)))
return(x[ps])
}
Really old post, but I just came across it and did some testing of the different methods. spatstat::weighted.median() seemed to be about 14 times faster than median(rep(x, times=w)), and it's actually noticeable if you want to run the function more than a couple of times. Testing was with a relatively large survey, about 15,000 people.
One can also use stats::density to create a weighted PDF, then convert this to a CDF, as elaborated here:
my_wtd_q = function(x, w, prob, n = 4096)
with(density(x, weights = w/sum(w), n = n),
x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
Then my_wtd_q(x, w, .5) will be the weighted median.
One could also be more careful to ensure that the total area under the density is one by re-normalizing.
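A minimal sketch of that re-normalization (an assumption, not part of the original answer; my_wtd_q2 is a hypothetical name):
my_wtd_q2 <- function(x, w, prob, n = 4096) {
  with(density(x, weights = w/sum(w), n = n), {
    cdf <- cumsum(y * (x[2L] - x[1L]))  # un-normalized CDF on the density grid
    cdf <- cdf / cdf[length(cdf)]       # re-normalize so the CDF ends at exactly 1
    x[which.max(cdf >= prob)]
  })
}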
A base R way to get a weighted median is to order by the values, build the cumulative sum of the weights, and pick the value(s) where the cumulative weight reaches half the total weight.
medianWeighted <- function(x, w, q=.5) {
  n <- length(x)
  i <- order(x)
  w <- cumsum(w[i])
  p <- w[n] * q
  j <- findInterval(p, w)
  Vectorize(function(p,j) if(w[n] <= 0) NA else
    if(j < 1) x[i[1]] else
    if(j == n) x[i[n]] else
    if(w[j] == p) (x[i[j]] + x[i[j+1]]) / 2 else
    x[i[j+1]])(p,j)
}
This gives the following results with simple input data.
medianWeighted(c(10, 40), c(1, 2))
#[1] 40
median(rep(c(10, 40), c(1, 2)))
#[1] 40
medianWeighted(c(10, 40), c(2, 1))
#[1] 10
median(rep(c(10, 40), c(2, 1)))
#[1] 10
medianWeighted(c(10, 40), c(1.5, 2))
#[1] 40
medianWeighted(c(10, 40), c(3, 4))
#[1] 40
median(rep(c(10, 40), c(3, 4)))
#[1] 40
medianWeighted(c(10, 40), c(1.5, 1.5))
#[1] 25
medianWeighted(c(10, 40), c(3, 3))
#[1] 25
median(rep(c(10, 40), c(3, 3)))
#[1] 25
medianWeighted(c(10, 40), c(0, 1))
#[1] 40
medianWeighted(c(10, 40), c(1, 0))
#[1] 10
medianWeighted(c(10, 40), c(0, 0))
#[1] NA
It can also be used for other quantiles:
medianWeighted(1:10, 10:1, seq(0, 1, 0.25))
[1] 1 2 4 6 10
Compare with other methods.
#Functions from other Answers
weighted.median <- function(x, w) {
w <- w[order(x)]
x <- x[order(x)]
prob <- cumsum(w)/sum(w)
ps <- which(abs(prob - .5) == min(abs(prob - .5)))
return(x[ps])
}
my_wtd_q = function(x, w, prob, n = 4096)
with(density(x, weights = w/sum(w), n = n),
x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
na.rm = FALSE, names = TRUE) {
if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
if (length(w) == 1) w <- rep(w, length(x))
if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
if (isTRUE(na.rm)) {
w <- x[!is.na(x)]
x <- x[!is.na(x)]
}
w <- w[order(x)] / sum(w)
x <- x[order(x)]
cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
res <- approx(x = cum_w, y = x, xout = probs)$y
if (isTRUE(names)) {
res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
}
res
}
Methods
M <- alist(
medRep = median(rep(DF$x, DF$w)),
isotone = isotone::weighted.median(DF$x, DF$w),
laeken = laeken::weightedMedian(DF$x, DF$w),
spatstat1 = spatstat.geom::weighted.median(DF$x, DF$w, type=1),
spatstat2 = spatstat.geom::weighted.median(DF$x, DF$w, type=2),
spatstat4 = spatstat.geom::weighted.median(DF$x, DF$w, type=4),
survey = survey::svyquantile(~x, survey::svydesign(id=~1, weights=~w, data=DF), 0.5)$x[1],
RAndres = weighted.median(DF$x, DF$w),
matrixStats = matrixStats::weightedMedian(DF$x, DF$w),
MichaelChirico = my_wtd_q(DF$x, DF$w, .5),
Leonardo = weighted.quantile(DF$x, DF$w, .5),
GKi = medianWeighted(DF$x, DF$w)
)
Results
DF <- data.frame(x=c(10, 40), w=c(1, 2))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 40.00000 40.00000 40.00000 40.00000 25.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 17.50000 40.00000 10.00000 30.00000 34.15005
# Leonardo.50% GKi
# 25.00000 40.00000
DF <- data.frame(x=c(10, 40), w=c(1, 1))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 25.00000 25.00000 40.00000 10.00000 10.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 10.00000 10.00000 10.00000 25.00000 25.05044
# Leonardo.50% GKi
# 25.00000 25.00000
In those two cases only isotone and GKi give identical results compared to what median(rep(x, w)) returns.
If you're working with the survey package, assuming you've defined your survey design and x is your variable of interest:
svyquantile(~x, mydesign, c(0.5))
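A self-contained sketch of that setup, borrowing the data frame layout used in the comparison below (mydesign is a hypothetical name; columns x and w are assumptions):
library(survey)
DF <- data.frame(x = c(10, 40), w = c(1, 2))
mydesign <- svydesign(id = ~1, weights = ~w, data = DF)  # weights taken from column w
svyquantile(~x, mydesign, 0.5)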
I got here looking for weighted quantiles, so I thought I might as well leave for future readers what I ended up with. Naturally, using probs = 0.5 will return the weighted median.
I started with MichaelChirico's answer, which unfortunately was off at the edges. Then I decided to switch from density() to approx(). Finally, I believe I nailed the correction factor to ensure consistency with the default algorithm of the unweighted quantile().
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
na.rm = FALSE, names = TRUE) {
if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
if (length(w) == 1) w <- rep(w, length(x))
if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
if (isTRUE(na.rm)) {
w <- x[!is.na(x)]
x <- x[!is.na(x)]
}
w <- w[order(x)] / sum(w)
x <- x[order(x)]
cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
res <- approx(x = cum_w, y = x, xout = probs)$y
if (isTRUE(names)) {
res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
}
res
}
When weights are uniform, the weighted quantiles are identical to regular unweighted quantiles:
x <- rnorm(100)
stopifnot(identical(weighted.quantile(x, w = 1), quantile(x)))
Example using the same data as in the weighted.mean() man page.
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
stopifnot(isTRUE(all.equal(
weighted.quantile(x, w, 0:4/4, names = FALSE),
c(2.8, 3.33611111111111, 3.46111111111111, 3.58157894736842,
3.7)
)))
And this is for whoever solely wants the weighted median value:
weighted.median <- function(x, w, ...) {
weighted.quantile(x, w, probs = 0.5, names = FALSE, ...)
}
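For example, with the same data as above (output assumed from running the code in this answer):
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
weighted.median(x, w)
#> [1] 3.461111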
