Approximate match (analogue of all.equal for identical)?

Consider:
(tmp1 <- seq(0, 0.2, 0.01)[16])
# [1] 0.15
(tmp2 <- seq(0, 0.2, 0.05)[4])
# [1] 0.15
and
identical(tmp1, tmp2)
# [1] FALSE
all.equal(tmp1, tmp2) # test for 'near' equality
# [1] TRUE
The underlying reason has to do with floating-point precision. However, this leads to a problem when trying to identify subsequences within sequences using match, for example:
match(seq(0, 0.2, 0.05), seq(0, 0.2, 0.01))
# [1] 1 6 11 NA 21
Is there an alternative to match that is the analogue of all.equal for identical?

We can write a custom match called near.match, inspired by dplyr::near:
near.match <- function(x, y, tol = .Machine$double.eps^0.5){
  sapply(x, function(i){
    res <- which(abs(y - i) < tol, arr.ind = TRUE)[1]
    if(length(res)) res else NA_integer_
  })
}
near.match(seq(0, 0.2, 0.05), seq(0, 0.2, 0.01))
# [1] 1 6 11 16 21
near.match(c(seq(0, 0.2, 0.05), 0.3), seq(0, 0.2, 0.01))
# [1] 1 6 11 16 21 NA
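A vectorized variant (an added sketch, not part of the original answer) builds the full distance matrix with outer() and takes, for each element of x, the first element of y within tolerance:
near.match2 <- function(x, y, tol = .Machine$double.eps^0.5){
  d <- abs(outer(x, y, "-"))                               # |x_i - y_j| for every pair
  as.integer(apply(d < tol, 1, function(r) which(r)[1]))   # first match per row, NA if none
}
near.match2(c(seq(0, 0.2, 0.05), 0.3), seq(0, 0.2, 0.01))
# same result as near.match() above: 1 6 11 16 21 NA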


How to fill a matrix by proportion?

I'm trying to create a 20x20 matrix filled with numbers from -1:2. However, I don't want it to be random but filled according to proportions that I decide.
For example, I would want 0.10 of the cells to be -1, 0.60 to be 0, 0.20 to be 1, 0.10 to be 2.
This code was able to get me a matrix with all of the values I want, but I don't know how to edit it to specify the proportion of each value I want.
r <- 20
c <- 20
mat <- matrix(sample(-1:2,r*c, replace=TRUE),r,c)
We can use the prob argument from sample (note that prob sets sampling probabilities, so the realized proportions will only approximately match):
matrix(sample(-1:2, r*c, replace = TRUE, prob = c(0.1, 0.6, 0.2, 0.1)), r, c)
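As a quick check (not part of the original answer), prop.table() shows that the realized proportions only approximately match the requested ones:
set.seed(1)
m1 <- matrix(sample(-1:2, r * c, replace = TRUE, prob = c(0.1, 0.6, 0.2, 0.1)), r, c)
round(prop.table(table(m1)), 2)   # close to 0.10, 0.60, 0.20, 0.10, but not exact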
If the proportions should be exact rather than only expected, fill a vector with the right number of each value and shuffle it:
r <- 20
c <- 20
ncell <- r * c
val <- c(-1, 0, 1, 2)
p <- c(0.1, 0.6, 0.2, 0.1)
fill <- rep(val, ceiling(p * ncell))[1:ncell]
mat <- matrix(data = sample(fill), nrow = r, ncol = c)
prop.table(table(mat))
#> mat
#>  -1   0   1   2
#> 0.1 0.6 0.2 0.1

Create samples with different range and weights

I want to create a total sample of 3000 entries with some rules:
Category-1(low) 0.1 - 0.3
Category-2(Medium) 0.4 - 0.7
Category-3(High) 0.7 - 0.9
I want to create the sample in such a way that each category has weights, for example:
Category-1(low) 20% of the dataset
Category-2(Medium) 30% of the dataset
Category-3(High) 50% of the dataset
I am unable to find pointers on how to do that. Can anyone help me out with this? Thanks a lot in advance.
We can use Map to create a sequence of values within each range shown in the OP's post and sample from it, with the number of draws per category (proportion * 3000) also passed as an argument to Map:
lst1 <- Map(function(x, y, z) sample(seq(x, y, by = 0.1), z, replace = TRUE),
            c(0.1, 0.4, 0.7), c(0.3, 0.7, 0.9), c(0.2, 0.3, 0.5) * 3000)
names(lst1) <- c("low", "medium", "high")
lengths(lst1)
# low medium high
# 600 900 1500
out <- unlist(lst1)
length(out)
#[1] 3000
If we need it as a two-column data.frame:
dat <- stack(lst1)[2:1]
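A quick sanity check (added, not part of the original answer) on the stacked data:
head(dat)
table(dat$ind)   # low 600, medium 900, high 1500, matching lengths(lst1) above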
I like to use the simstudy package for data generation. In this case I back-filled your values so that they conform to the category rules. simstudy returns a data.table object, but I'm more familiar with tidyverse syntax:
library(simstudy)
library(dplyr)
set.seed(1724)
# define data
def <- defData(varname = "category", formula = "0.2;0.3;0.5", dist = "categorical", id = "id")
def <- defData(def, varname = "value", dist = "nonrandom", formula = NA)
# generate data
df <- genData(3000, def) %>% as_tibble()
# add in values that conform to category rules
df[df$category == 1,]$value <- runif(nrow(df[df$category == 1,]), min = 0.1, max = 0.3)
df[df$category == 2,]$value <- runif(nrow(df[df$category == 2,]), min = 0.4, max = 0.7)
df[df$category == 3,]$value <- runif(nrow(df[df$category == 3,]), min = 0.7, max = 0.9)
# A tibble: 3,000 x 3
id category value
<int> <int> <dbl>
1 1 3 0.769
2 2 2 0.691
3 3 3 0.827
4 4 3 0.729
5 5 2 0.474
6 6 3 0.818
7 7 2 0.635
8 8 2 0.552
9 9 3 0.794
10 10 3 0.792
# ... with 2,990 more rows
A rather simple approach:
1. Draw a fixed number of values from each category. This is not that random, but depending on the application it may suffice:
out <- c(runif(600, 0.1, 0.3), runif(900, 0.4, 0.7), runif(1500, 0.7, 0.9))
2. Here, the number of values coming from each category is drawn as well, so the result is more random:
sam <- sample(1:3, size = 3000, prob = c(0.2, 0.3, 0.5), replace = TRUE)
x1 <- sum(sam == 1)
x2 <- sum(sam == 2)
x3 <- sum(sam == 3)
out <- c(runif(x1, 0.1, 0.3), runif(x2, 0.4, 0.7), runif(x3, 0.7, 0.9))
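In both variants the resulting vector out is grouped by category; if a mixed order is needed, a final shuffle (an added step, not in the original answer) takes care of that:
out <- sample(out)   # randomly permute the 3000 values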

What is the difference between matrixpower() and markov() when it comes to computing P^n?

Consider a Markov chain with state space S = {1, 2, 3, 4} and transition matrix
P = 0.1 0.2 0.4 0.3
    0.4 0.0 0.4 0.2
    0.3 0.3 0.0 0.4
    0.2 0.1 0.4 0.3
And, take a look at the following source code:
# markov function
markov <- function(init, mat, n, labels)
{
  if (missing(labels))
  {
    labels <- 1:length(init)
  }
  simlist <- numeric(n+1)
  states <- 1:length(init)
  simlist[1] <- sample(states, 1, prob = init)
  for (i in 2:(n+1))
  {
    simlist[i] <- sample(states, 1, prob = mat[simlist[i-1], ])
  }
  labels[simlist]
}
# matrixpower function
matrixpower <- function(mat, k)
{
  if (k == 0) return(diag(dim(mat)[1]))
  if (k == 1) return(mat)
  if (k > 1) return(mat %*% matrixpower(mat, k-1))
}
tmat = matrix(c(0.1, 0.2, 0.4, 0.3,
                0.4, 0.0, 0.4, 0.2,
                0.3, 0.3, 0.0, 0.4,
                0.2, 0.1, 0.4, 0.3), nrow = 4, ncol = 4, byrow = TRUE)
p10 = matrixpower(mat = tmat, k=10)
rowMeans(p10)
nn <- 10
alpha <- c(0.25, 0.25, 0.25, 0.25)
set.seed(1)
steps <- markov(init=alpha, mat=tmat, n=nn)
table(steps)/(nn + 1)
Output
> rowMeans(p10)
[1] 0.25 0.25 0.25 0.25
>
.
.
.
> table(steps)/(nn + 1)
steps
1 2 3 4
0.09090909 0.18181818 0.18181818 0.54545455
Why are results so different?
What is the difference between using matrixpower() and markov() when it comes to computing P^n?
Currently you are comparing completely different things. First, I'll focus not on computing P^n, but rather A*P^n, where A is the initial distribution. In that case matrixpower does the job:
p10 <- matrixpower(mat = tmat, k = 10)
alpha <- c(0.25, 0.25, 0.25, 0.25)
alpha %*% p10
# [,1] [,2] [,3] [,4]
# [1,] 0.2376945 0.1644685 0.2857105 0.3121265
those are the true probabilities of being in states 1, 2, 3, 4, respectively, after 10 steps (after the initial draw made using A).
Meanwhile, markov(init = alpha, mat = tmat, n = nn) is only a single realization of length nn + 1, and only the last number of this realization is relevant for A*P^n. So, to get numbers close to the theoretical ones, we need many realizations with nn <- 10, as in
table(replicate(markov(init = alpha, mat = tmat, n = nn)[nn + 1], n = 10000)) / 10000
#
# 1 2 3 4
# 0.2346 0.1663 0.2814 0.3177
where I simulate 10000 realizations and take only the last state of each realization.
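As a side note (not part of the original answer), the recursive matrixpower() above can be cross-checked against the %^% matrix-power operator from the expm package, assuming that package is installed:
library(expm)
p10_alt <- tmat %^% 10                      # P^10 via repeated squaring
all.equal(p10_alt, matrixpower(tmat, 10))   # should agree up to floating-point error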

Using optimisation function (optimise) together with dbinom in R (optimisation issue)

When p = 0.5, n = 5 and x = 3
dbinom(3,5,0.5) = 0.3125
Let's say I don't know p (n and x are known) and want to find it.
binp <- function(bp) dbinom(3,5,bp) - 0.3125
optimise(binp, c(0,1))
It does not return 0.5. Also, why is
dbinom(3,5,0.5) == 0.3125 #FALSE
But,
x <- dbinom(3,5,0.5)
x == dbinom(3,5,0.5) #TRUE
optimize() searches for the parameter that minimizes the output of the function. Your function can return negative values (e.g., binp(0.1) is -0.3044), so the minimum is not at zero. If you are searching for the parameter that minimizes the distance from zero, it would be a good idea to use sqrt((...)^2). If you want the parameter that makes the output exactly zero, uniroot would help you. Also, the parameter you want isn't uniquely determined. (Note: x <- dbinom(3, 5, 0.5); x == dbinom(3, 5, 0.5) is equivalent to dbinom(3, 5, 0.5) == dbinom(3, 5, 0.5).)
## check output of dbinom(3, 5, prob)
input <- seq(0, 1, 0.001)
output <- Vectorize(dbinom, "prob")(3, 5, input)
plot(input, output, type="l")
abline(h = dbinom(3, 5, 0.5), col = 2) # there are two answers
max <- optimize(function(x) dbinom(3, 5, x), c(0, 1), maximum = T)$maximum # [1] 0.6000006
binp <- function(bp) dbinom(3,5,bp) - 0.3125 # your function
uniroot(binp, c(0, max))$root # [1] 0.5000036
uniroot(binp, c(max, 1))$root # [1] 0.6946854
binp2 <- function(bp) sqrt((dbinom(3,5,bp) - 0.3125)^2)
optimize(binp2, c(0, max))$minimum # [1] 0.499986
optimize(binp2, c(max, 1))$minimum # [1] 0.6947186
dbinom(3, 5, 0.5) == 0.3125 # [1] FALSE
round(dbinom(3, 5, 0.5), 4) == 0.3125 # [1] TRUE
format(dbinom(3, 5, 0.5), digits = 16) # [1] "0.3124999999999999"
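A small variant (added, not from the original answer): abs() measures the distance from the target just as well as sqrt((...)^2).
binp3 <- function(bp) abs(dbinom(3, 5, bp) - 0.3125)
optimize(binp3, c(0, max))$minimum   # should be close to 0.5, like binp2
optimize(binp3, c(max, 1))$minimum   # should be close to 0.695, like binp2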

Is there a weighted.median() function?

I'm looking for something similar in form to weighted.mean(). I've found some solutions via search that write out the entire function but would appreciate something a bit more user friendly.
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on github).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
as ??? is a shortcut in the same way ?some.function is for help(some.function).
Some experience using the answers from @wkmor1 and @Jaitropmange.
I've checked 3 functions from 3 packages: isotone, laeken, and matrixStats. Only matrixStats works properly. The other two (just like the median(rep(x, times=w)) solution) give integer output. Since I was calculating the median age of populations, decimal places matter.
Reproducible example: calculation of the median age of a population.
df <- data.frame(age = 0:100,
                 pop = spline(c(4,7,9,8,7,6,4,3,2,1), n = 101)$y)
library(isotone)
library(laeken)
library(matrixStats)
isotone::weighted.median(df$age,df$pop)
# [1] 36
laeken::weightedMedian(df$age,df$pop)
# [1] 36
matrixStats::weightedMedian(df$age,df$pop)
# [1] 36.164
median(rep(df$age, times=df$pop))
# [1] 35
Summary
matrixStats::weightedMedian() is the reliable solution
To calculate the weighted median of a vector x using a same-length vector of (integer) weights w:
median(rep(x, times=w))
This is just a simple solution, ready to use almost anywhere.
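One caveat (an added note, not part of the original answer): rep() silently truncates non-integer times, so for non-integer weights one can scale them to integers first, at the cost of a rounding approximation:
w_int <- round(w * 100)   # hypothetical scale factor; a larger factor gives a finer approximation
median(rep(x, times = w_int))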
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
Really old post, but I just came across it and did some testing of the different methods. spatstat::weighted.median() seemed to be about 14 times faster than median(rep(x, times=w)), and it's actually noticeable if you want to run the function more than a couple of times. Testing was with a relatively large survey, about 15,000 people.
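A rough way to reproduce that comparison (an added sketch, assuming the microbenchmark package and some vector of values x with integer weights w):
library(microbenchmark)
microbenchmark(
  spatstat = spatstat.geom::weighted.median(x, w),
  medRep   = median(rep(x, times = w)),
  times = 100
)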
One can also use stats::density to create a weighted PDF, then convert this to a CDF, as elaborated here:
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
Then my_wtd_q(x, w, .5) will be the weighted median.
One could also be more careful to ensure that the total area under the density is one by re-normalizing.
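A sketch of that re-normalization (added here, not part of the original answer): divide the accumulated area by its final value so the CDF ends exactly at 1.
my_wtd_q2 = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n), {
    cdf <- cumsum(y * (x[2L] - x[1L]))           # approximate CDF on the density grid
    x[which.max(cdf / cdf[length(cdf)] >= prob)] # first grid point where the normalized CDF reaches prob
  })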
A way in base R to get a weighted median is to order by the values, build the cumulative sum of the weights, and take the value(s) at half the total weight.
medianWeighted <- function(x, w, q = .5) {
  n <- length(x)
  i <- order(x)
  w <- cumsum(w[i])
  p <- w[n] * q
  j <- findInterval(p, w)
  Vectorize(function(p, j) if(w[n] <= 0) NA else
            if(j < 1) x[i[1]] else
            if(j == n) x[i[n]] else
            if(w[j] == p) (x[i[j]] + x[i[j+1]]) / 2 else
            x[i[j+1]])(p, j)
}
This gives the following results with simple input data.
medianWeighted(c(10, 40), c(1, 2))
#[1] 40
median(rep(c(10, 40), c(1, 2)))
#[1] 40
medianWeighted(c(10, 40), c(2, 1))
#[1] 10
median(rep(c(10, 40), c(2, 1)))
#[1] 10
medianWeighted(c(10, 40), c(1.5, 2))
#[1] 40
medianWeighted(c(10, 40), c(3, 4))
#[1] 40
median(rep(c(10, 40), c(3, 4)))
#[1] 40
medianWeighted(c(10, 40), c(1.5, 1.5))
#[1] 25
medianWeighted(c(10, 40), c(3, 3))
#[1] 25
median(rep(c(10, 40), c(3, 3)))
#[1] 25
medianWeighted(c(10, 40), c(0, 1))
#[1] 40
medianWeighted(c(10, 40), c(1, 0))
#[1] 10
medianWeighted(c(10, 40), c(0, 0))
#[1] NA
It can also be used for other quantiles:
medianWeighted(1:10, 10:1, seq(0, 1, 0.25))
#[1]  1  2  4  6 10
Compare with other methods.
#Functions from other Answers
weighted.median <- function(x, w) {
  w <- w[order(x)]
  x <- x[order(x)]
  prob <- cumsum(w)/sum(w)
  ps <- which(abs(prob - .5) == min(abs(prob - .5)))
  return(x[ps])
}
my_wtd_q = function(x, w, prob, n = 4096)
  with(density(x, weights = w/sum(w), n = n),
       x[which.max(cumsum(y*(x[2L] - x[1L])) >= prob)])
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
Methods
M <- alist(
  medRep = median(rep(DF$x, DF$w)),
  isotone = isotone::weighted.median(DF$x, DF$w),
  laeken = laeken::weightedMedian(DF$x, DF$w),
  spatstat1 = spatstat.geom::weighted.median(DF$x, DF$w, type = 1),
  spatstat2 = spatstat.geom::weighted.median(DF$x, DF$w, type = 2),
  spatstat4 = spatstat.geom::weighted.median(DF$x, DF$w, type = 4),
  survey = survey::svyquantile(~x, survey::svydesign(id = ~1, weights = ~w, data = DF), 0.5)$x[1],
  RAndres = weighted.median(DF$x, DF$w),
  matrixStats = matrixStats::weightedMedian(DF$x, DF$w),
  MichaelChirico = my_wtd_q(DF$x, DF$w, .5),
  Leonardo = weighted.quantile(DF$x, DF$w, .5),
  GKi = medianWeighted(DF$x, DF$w)
)
Results
DF <- data.frame(x=c(10, 40), w=c(1, 2))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 40.00000 40.00000 40.00000 40.00000 25.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 17.50000 40.00000 10.00000 30.00000 34.15005
# Leonardo.50% GKi
# 25.00000 40.00000
DF <- data.frame(x=c(10, 40), w=c(1, 1))
sapply(M, eval)
# medRep isotone laeken spatstat1 spatstat2
# 25.00000 25.00000 40.00000 10.00000 10.00000
# spatstat4 survey RAndres matrixStats MichaelChirico
# 10.00000 10.00000 10.00000 25.00000 25.05044
# Leonardo.50% GKi
# 25.00000 25.00000
In those two cases only isotone and GKi give identical results compared to what median(rep(x, w)) returns.
If you're working with the survey package, assuming you've defined your survey design and x is your variable of interest:
svyquantile(~x, mydesign, c(0.5))
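A minimal sketch of that (added here, not part of the original answer), assuming a data.frame DF with a variable x and a weight column w:
library(survey)
mydesign <- svydesign(id = ~1, weights = ~w, data = DF)   # simple design: no clusters or strata
svyquantile(~x, mydesign, 0.5)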
I got here looking for weighted quantiles, so I thought I might as well leave for future readers what I ended up with. Naturally, using probs = 0.5 will return the weighted median.
I started with MichaelChirico's answer, which unfortunately was off at the edges. Then I decided to switch from density() to approx(). Finally, I believe I nailed the correction factor to ensure consistency with the default algorithm of the unweighted quantile().
weighted.quantile <- function(x, w, probs = seq(0, 1, 0.25),
                              na.rm = FALSE, names = TRUE) {
  if (any(probs > 1) | any(probs < 0)) stop("'probs' outside [0,1]")
  if (length(w) == 1) w <- rep(w, length(x))
  if (length(w) != length(x)) stop("w must have length 1 or be as long as x")
  if (isTRUE(na.rm)) {
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]
  }
  w <- w[order(x)] / sum(w)
  x <- x[order(x)]
  cum_w <- cumsum(w) - w * (1 - (seq_along(w) - 1) / (length(w) - 1))
  res <- approx(x = cum_w, y = x, xout = probs)$y
  if (isTRUE(names)) {
    res <- setNames(res, paste0(format(100 * probs, digits = 7), "%"))
  }
  res
}
When weights are uniform, the weighted quantiles are identical to regular unweighted quantiles:
x <- rnorm(100)
stopifnot(identical(weighted.quantile(x, w = 1), quantile(x)))
Example using the same data as in the weighted.mean() man page.
x <- c(3.7, 3.3, 3.5, 2.8)
w <- c(5, 5, 4, 1)/15
stopifnot(isTRUE(all.equal(
  weighted.quantile(x, w, 0:4/4, names = FALSE),
  c(2.8, 3.33611111111111, 3.46111111111111, 3.58157894736842, 3.7)
)))
And this is for whoever solely wants the weighted median value:
weighted.median <- function(x, w, ...) {
  weighted.quantile(x, w, probs = 0.5, names = FALSE, ...)
}
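For example (an added check), with the x and w from the weighted.mean() example above, this wrapper returns the same value as the 50% entry of the quantile check, since it simply calls weighted.quantile() with probs = 0.5:
weighted.median(x, w)
# identical to weighted.quantile(x, w, 0.5, names = FALSE)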

Resources