Why am I getting similar output for quantile in R?

I have two different columns:
a <- c(65.96924, 7.084438, 81.65288)
b <- c(2197.62974660844, 1934.40212085843, 1939.64438773692)
They are different, but I wonder why I get similar output when applying:
quants <- seq(0, 1, length.out = 51)
ecdf(a)(quantile(a, quants))
ecdf(b)(quantile(b, quants))

This can be better understood if you calculate the ECDF by hand instead of using the ecdf() function: the ECDF is just a step function that increases by 1/n at each of the n data points. Since both vectors have the same length, you get the same result.
quants <- seq(0, 1, length.out = 51)
A <- sort(a)
B <- sort(b)
A_ecdf <- 1:length(A)/length(A)
B_ecdf <- 1:length(B)/length(B)
plot(A, A_ecdf, type = "s", col = 1)
par(new = TRUE)
plot(B, B_ecdf, type = "s", col = 2)
Though I am not sure exactly what you are trying to do, a possible solution from How to find quantiles of an empirical cumulative density function (ECDF) may be:
my_quantile <- function(x, prob) {
  n <- length(x)
  approx(seq(0, 1, length = n), x, prob)$y
}
my_quantile(A, quants)
my_quantile(B, quants)

You are calculating the empirical cumulative distribution function of the quantiles themselves, not of values from the distribution such as ecdf(b)(2000). By definition, you end up with a straight line.
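A quick way to see this (a small illustration with a made-up vector): plugging any vector's own sample quantiles back into its ECDF essentially returns the probabilities themselves, whatever the data are.
set.seed(1)
quants <- seq(0, 1, length.out = 51)
x <- rnorm(100)                                # any numeric vector
head(cbind(quants, ecdf(x)(quantile(x, quants))))
# the two columns agree up to small step-function effects,
# so the output depends on the probabilities, not on the data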

Related

R - fast interpolation between CDF quantiles

Say we have a data.frame where the columns represent the quantiles for a given set of probabilities. Each row represents a different subject and the quantiles vary by subject. The goal is to take n_draws for each subject.
n <- 1e5
alphas <- c(.05, .25, .5, .75, .95)
n_draws <- 100
dt <- data.frame(quantile_05 = runif(n),
                 quantile_25 = runif(n, min = 10, max = 20),
                 quantile_5  = runif(n, min = 30, max = 40),
                 quantile_75 = runif(n, min = 50, max = 60),
                 quantile_95 = runif(n, min = 70, max = 80))
R has stats::approx. The issue is that it can only be applied to one row at a time.
draws <- apply(X = dt, MARGIN = 1, function(q){
  stats::approx(x = alphas, y = q, yleft = 0, rule = 2, method = "linear",
                xout = seq(.0001, .99999, length.out = n_draws))$y
})
Naturally, one way to speed this up is parallelization:
library(parallel)
cl <- makeCluster(8)
clusterExport(cl, c('alphas', 'n_draws'))
draws <- parApply(cl = cl, X = dt, MARGIN = 1, function(q){
  stats::approx(x = alphas, y = q, yleft = 0, rule = 2, method = "linear",
                xout = seq(.0001, .99999, length.out = n_draws))$y
})
stopCluster(cl)  # release the workers when done
This parallel code is several times faster on my machine. I am curious if anyone has suggestions for further speed ups.
To be a little more clear, this is just a simplification of what I'm using in real life. I have more than 5 quantiles. I want to use this concept to estimate not only other quantiles, but also the mean. Further, I'd like to use the draws to model other quantities like max(y - 10, 0) (or y - any arbitrary value).
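For illustration, those extra quantities could be read off the draws matrix produced above roughly like this (a sketch; apply returns one column per subject here):
# Each column of draws holds n_draws interpolated values for one subject.
subject_means  <- colMeans(draws)                # rough estimate of the mean per subject
excess_over_10 <- colMeans(pmax(draws - 10, 0))  # rough estimate of E[max(y - 10, 0)] per subject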
Your proposed method is rather inefficient for the desired goal. You end up storing 100 points for each desired ECDF, and you will still need to write a function to extract a desired draw. Instead, I suggest you consider using approxfun. It returns a more compact result: individual functions with associated environments that contain the knots for later calculation. The calculation is done with a C call accessed through the internal helper function .approxfun.
Demonstrating the internals (up to a point) that I'm suggesting:
out <- approxfun(x = alphas, y = dt[1, ], yleft = 0, rule = 2, method = "linear")
# So out is now a single instance using the knots in the first row
out
#function (v)
#.approxfun(x, y, v, method, yleft, yright, f, na.rm)
#<bytecode: 0x558366535968>
#<environment: 0x5583690a04f8>
ls(environment(out))
#[1] "f" "method" "na.rm" "x" "y" "yleft" "yright"
environment(out)$x
#[1] 0.05 0.25 0.50 0.75 0.95
environment(out)$y
#[1] 0.4038727 17.7069735 33.4438595 57.2753257 77.2024894
If you wanted the estimated 55th percentile for the first case, you could get it with:
out(55/100)
#[1] 38.21015
And now that I've suggested a way to speed up your creation of this list of functions, I'm not even sure it's worth it. I think you could just leave the dt data frame in place and call approxfun when needed. But that's your call.
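If you did want to pre-build the interpolators, a minimal sketch (assuming the dt and alphas objects from the question; interp_fns is just an illustrative name) could be:
# One interpolating function per subject; each stores only the five knots
# in its environment rather than 100 pre-computed draws.
m <- as.matrix(dt)
interp_fns <- lapply(seq_len(nrow(m)), function(i)
  approxfun(x = alphas, y = m[i, ], yleft = 0, rule = 2, method = "linear"))
interp_fns[[1]](0.55)   # estimated 55th percentile for the first subject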
Note: This is essentially the method used by the ecdf function:
ecdf
function (x)
{
    x <- sort(x)
    n <- length(x)
    if (n < 1)
        stop("'x' must have 1 or more non-missing values")
    vals <- unique(x)
    rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n,
        method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
    class(rval) <- c("ecdf", "stepfun", class(rval))
    assign("nobs", n, envir = environment(rval))
    attr(rval, "call") <- sys.call()
    rval
}
<bytecode: 0x558364a0f360>
<environment: namespace:stats>
And it's possible that you might want to use the ecdf function because it has some class-associated functions.
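For instance (a small illustration), objects of class "ecdf" come with their own plot, quantile and summary methods:
Fn <- ecdf(rnorm(100))
plot(Fn)             # step-function plot via plot.ecdf
quantile(Fn, 0.25)   # inverse ECDF via quantile.ecdf
summary(Fn)          # summary.ecdf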

How do I make clusters based on a fixed capacity of each cluster in R?

So I have a dataset with 600 points: their latitude, longitude, and demand.
I have to make clusters such that the points in each cluster are near each other and the total demand of each cluster does not exceed a certain limit.
A sample dataset for the problem:
set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))
df<- data.frame(id, lon, lat, demand)
What I want to have (approximately) and what I am actually getting (with approximate cluster boundaries) were shown in two plots [images omitted].
The code I've written:
library(tidyverse)
constrained_cluster <- function(df, capacity = 170){
  p <- pi/180   # degrees-to-radians factor (was missing in the original code)
  lon_max <- max(df$lon)
  lat_max <- max(df$lat)
  # Calculate the great-circle distance between an extreme point and all other points
  df$distance <- 6377.83*acos(sin(lat_max*p)*sin(df$lat*p) +
                              cos(lat_max*p)*cos(df$lat*p)*cos((lon_max - df$lon)*p))
  df <- df[order(df$distance, decreasing = FALSE), ]
  d <- 0
  cluster_number <- 1
  cluster_list <- c()
  i <- 1
  # Loop over the points (sorted by distance) and fill up cluster_list
  while (i <= length(df$distance)){
    d <- d + df$demand[i]
    if(d <= capacity){
      cluster_list[i] <- cluster_number
      i <- i + 1
    } else {
      # capacity exceeded: start a new cluster and retry the same point
      cluster_number <- cluster_number + 1
      d <- 0
    }
  }
  # Return a data frame with the cluster assignments
  return(cbind(df, as.data.frame(cluster_list)))
}
df_with_cluster <- constrained_cluster(df, capacity = 1000)
Here is one possible approach, in which I treat the problem directly as an optimisation problem.
Suppose you have a feasible partition of the rows into groups: not necessarily a good one, but one that does not violate the constraints. For every group (cluster), you compute the centre. Then you compute the distances of all points in a group to the group's centre and sum them. In this way, you have a measure of the quality of your initial partition.
Now, randomly pick one row and move it into another group. You get a new solution. Compute its quality as before and compare it with the previous solution's quality. If it is better, keep it; if it is worse, stay with the old solution. Repeat this whole procedure for a fixed number of iterations.
This process is called a Local Search. Of course, it is not guaranteed to take you to an optimal solution, but it will likely give you a good one. (Note that k-means implementations are typically stochastic as well, and there is no guarantee of an "optimal" partition.)
The good thing about a Local Search is that it gives you much flexibility. For instance, I assumed you started with a feasible solution. Suppose you make a random move (i.e. move one row into another cluster), but the new cluster is now too big. You could simply discard this new, infeasible solution and draw a new one.
Here is a code example; it is really just an outline, but with luck it will be useful for you.
set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))
df<- data.frame(id, lon, lat, demand)
Fix a number of clusters, k.
k <- 5
Start with kmeans and plot the solution.
par(mfrow = c(1, 3))
km <- kmeans(cbind(df$lat, df$lon), centers = k)
cols <- hcl.colors(n = k, "Cold")
plot(df$lon, df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "kmeans")
for (i in seq_len(k)) {
    lines(df$lon[km$cluster == i],
          df$lat[km$cluster == i],
          type = "p", pch = 19,
          col = cols[i])
}
Now a Local Search. I use an implementation in package NMOF (which I maintain).
library("NMOF")
## a random initial solution
x0 <- sample(1:k, length(id), replace = TRUE)
X <- as.matrix(df[, 2:3])
The objective function: it takes a partition x and computes the sum of distances, for all clusters.
sum_diff <- function(x, X, k, ...) {
    groups <- seq_len(k)
    d_centre <- numeric(k)
    for (g in groups) {
        centre <- colMeans(X[x == g, ])
        d <- t(X[x == g, ]) - centre
        d_centre[g] <- sum(sqrt(colSums(d * d)))
    }
    sum(d_centre)
}
The neighbourhood function: it takes a partition and moves one row into another cluster.
nb <- function(x, k, ...) {
    groups <- seq_len(k)
    x_new <- x
    p <- sample.int(length(x), 1)
    g_ <- groups[-x_new[p]]
    x_new[p] <- g_[sample.int(length(g_), 1)]
    x_new
}
Run the Local Search. I actually use a method called Threshold Accepting, which is based on Local Search, but can move away from local minima. See ?NMOF::TAopt for references on that method.
sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb),
             X = as.matrix(df[, 2:3]),
             k = k)
Plot the solution.
plot(df$lon, df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search")
for (i in seq_len(k)) {
    lines(df$lon[sol$xbest == i],
          df$lat[sol$xbest == i],
          type = "p", pch = 19,
          col = cols[i])
}
Now, here is one way to add a capacity constraint. We start with a feasible solution.
## CAPACITY-CONSTRAINED
max.demand <- 6600
all(tapply(df$demand, x0, sum) < max.demand)
## TRUE
The constraint is handled in the neighbourhood. If the new solution exceeds the capacity, it is discarded.
nb_constr <- function(x, k, demand, max.demand, ...) {
    groups <- seq_len(k)
    x_new <- x
    p <- sample.int(length(x), 1)
    g_ <- groups[-x_new[p]]
    x_new[p] <- g_[sample.int(length(g_), 1)]
    ## if capacity is exceeded, return the
    ## original solution
    if (sum(demand[x_new == x_new[p]]) > max.demand)
        x
    else
        x_new
}
Run the method and compare the results.
sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb_constr),
             X = as.matrix(df[, 2:3]),
             k = k,
             demand = df$demand,
             max.demand = max.demand)
plot(df$lon, df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search w/ constraint")
for (i in seq_len(k)) {
    lines(df$lon[sol$xbest == i],
          df$lat[sol$xbest == i],
          type = "p", pch = 19,
          col = cols[i])
}
all(tapply(df$demand, sol$xbest, sum) < max.demand)
## TRUE
This is really just an example and could be improved. For instance, the objective function here recomputes the distance of all groups, when it would only need to look at the changed groups.
Something like this might get you started?
nmax <- 100
num.centers <- 1
km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
# increase the number of centers until no cluster has nmax points or more
while (any(km$size >= nmax)) {
    num.centers <- num.centers + 1
    km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
}
plot(df$lon, df$lat, col = km$cluster, pch = 20)
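If the limit is meant to apply to the summed demand rather than to the number of points, the same idea might be adapted like this (a sketch; capacity is a hypothetical limit on total demand per cluster):
capacity <- 1000
num.centers <- 1
km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
# increase the number of centers until no cluster's summed demand exceeds the capacity
while (any(tapply(df$demand, km$cluster, sum) > capacity)) {
    num.centers <- num.centers + 1
    km <- kmeans(cbind(df$lat, df$lon), centers = num.centers)
}
plot(df$lon, df$lat, col = km$cluster, pch = 20)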

Plotting a sum in R Studio

I am trying to plot the following function in R Studio using the curve function as follows:
loglikelihood.func = function(x, mu){
  n = length(x)
  n*mu - sum(x) - sum(exp(mu)/(exp(x)))
}
curve(expr = loglikelihood.func(x = data, mu), xname = "mu",
      from = 0, to = 15)
Now, I have a vector of data that contains 50 data points, and in the function it is currently summing over both mu and x (my data); i.e. sum(exp(mu)/(exp(x))) is being treated like sum(exp(mu))/sum(exp(x)).
I want my function to compute the sum of the fractions over the data points in x while keeping mu constant, i.e. exp(mu)/exp(x1) + exp(mu)/exp(x2) + exp(mu)/exp(x3) + ..., and to repeat this for each separate mu when it plots with the curve function.
If I change my function to exp(mu)/sum(exp(x)), it computes exp(mu)/[exp(x1)+exp(x2)+...], which is not what I want. Can someone offer some advice here?
Edit:
This is a subset of my data,
data = c(8.5,8.9,9.1,8.9,8.4,9.7,9.1,9.6,8.7,9.3,9.6,9.3,8.7,9.0,8.8,8.9,8.9,12.2)
I am not sure I get this right... but curve() can't do that much by itself. You can define your function:
loglikelihood.func = function(x, mu) {
  length(x) * mu - sum(x) - sum(exp(mu)/(exp(x)))
}
Then define a range (for x) over which you'd like to plot it, and specify the limits (xlim, ylim) on a first plot. If you don't want to have a sequence, you can use your data instead.
xrange <- seq(from = 1, to = 10, by = 0.1)
plot(x = xrange,
     y = sapply(xrange, function(x) loglikelihood.func(x, mu = 0)),
     xlim = c(1, 10),
     ylim = c(-10, 0),
     type = "l")
Then add other curves, specifying different mu's:
lines(x=xrange, y = sapply(xrange, function(x) loglikelihood.func(x, mu=1)))
lines(x=xrange, y = sapply(xrange, function(x) loglikelihood.func(x, mu=2)))
(This is more practical to do with a loop if you don't need extra graph parameters.)
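If the goal is instead to have mu on the x-axis, as in the question, one sketch is to wrap the scalar log-likelihood so that it accepts a vector of mu values; curve() can then be used directly. (This assumes the data vector from the question; ll_vec is just an illustrative name.)
ll_vec <- function(mu, x) {
  # evaluate the log-likelihood separately for each value of mu
  sapply(mu, function(m) length(x) * m - sum(x) - sum(exp(m) / exp(x)))
}
curve(ll_vec(mu, x = data), xname = "mu", from = 0, to = 15)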

Adding two random variables via convolution in R

I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside, the real-world example I am working with requires adding two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convolving the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf and try to do the convolution integral, but R complains at the integration step. My pdfs are log-Pearson type III and are as follows:
dlp3 <- function(x, a, b, g) {
  p1 <- 1/(x*abs(b) * gamma(a))
  p2 <- ((log(x)-g)/b)^(a-1)
  p3 <- exp(-1* (log(x)-g) / b)
  d <- p1 * p2 * p3
  return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step, presumably because the support of the densities is bounded and my integration limits are not correct. Any ideas on how to solve this?
Here is one way.
f.X <- function(x) dnorm(x, 1, 0.5)      # normal (mu=1, sigma=0.5)
f.Y <- function(y) dlnorm(y, 1.5, 0.75)  # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
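As a rough sanity check (a sketch; the upper limit of 50 is arbitrary but covers essentially all of the mass here), the convolved density should integrate to about 1:
# numerical check that f.Z is (approximately) a proper density
integrate(f.Z, 0, 50)$value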
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # density function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
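The conv object also exposes the other standard distr accessors, which may be handy (a small illustration; as far as I recall, p() gives the CDF, q() the quantile function and r() a random-number generator):
p(conv)(5)     # P(Z <= 5)
q(conv)(0.5)   # median of Z
r(conv)(3)     # three random draws from Z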
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to #jlhoward's using Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
# Riemann approximation of the convolution.
# a to b needs to cover the range of the densities above;
# d needs to be small for an accurate approximation.
conv <- function(t, a, b, d) {
  z <- numeric(length(t))
  x <- seq(a, b, d)
  for (i in 1:length(t)){
    z[i] <- sum(d1(x)*d2(t[i]-x)*d)
  }
  return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")

Bootstrap variables correlation in R

My intention was to write several functions aimed at finding the overall similarity between two covariance matrices, either by multiplying them with random vectors and correlating the response vectors, or by bootstrapping one of the matrices to obtain a distribution of correlation coefficients that can serve for comparison. But in both cases I got erroneous results: the observed between-matrix correlation was as high as 0.93, but the bootstrap distribution only ranged up to about 0.2 at most. This is the function's code:
resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution",
         xlim = c(0, statObs + 0.1), ylim = c(0, 14))
    points(density(statSim), type = "l", lwd = 2)
    abline(v = statObs)
    text(10, 10, "observed correlation = ")
  }
  list(obs = statObs, sumFit = sum(statSim > statObs)/numR)
}
In fact, it is hard for me to believe that the correlation coefficient between the two original matrices is high, while the one between the first original matrix and the resampled second one reaches at most 0.2 after 10,000 bootstrap repetitions.
Any comments on the validity of the code?
Sorry, I am not educated enough to fully follow your goal of checking the correlation coefficient between two covariance matrices, but I tried to understand your code per se.
If I am right, you are making up 10,000 different matrices from the same matrix (mat2) by reordering its cells, and recomputing the correlation between the covariance matrix of mat1 and the covariance matrix of each resampled array. Those correlations are stored in the statSim variable.
You said the original correlation coefficient was high (statObs), but the maximum of statSim was low, which is strange. I think the problem is with your result list:
list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
Here you return the original correlation coefficient (obs), but sumFit is not the maximum you describe; for that you might use e.g. max(statSim). I see the point in returning sumFit to check whether the resampling improved the correlation coefficient at all, but based on your code, I see no problem with the theory.
Updated function with max of simulated correlation coefficients:
resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution",
         xlim = c(0, statObs + 0.1), ylim = c(0, 14))
    points(density(statSim), type = "l", lwd = 2)
    abline(v = statObs)
    text(10, 10, "observed correlation = ")
  }
  list(obs = statObs, sumFit = sum(statSim > statObs)/numR, max = max(statSim))
}
What I ran:
> mat1 <- matrix(runif(25),5,5)
> mat2 <- mat1+0.2
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 1
$sumFit
[1] 0
$max
[1] 0.94463
And with random mat2:
> mat2 <- matrix(runif(25),5,5)
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 0.31144
$sumFit
[1] 0.9124
$max
[1] 0.9231
My answer might not be a real answer; if that is the case, please give more details about the problem :)
