I need to rescale a series of numbers with certain constraints.
Let's say I have a vector like this:
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
The constraints are:
1. The sum of x must be 6 (right now, sum(x) is 3.9).
2. The numbers cannot be lower than 0.
3. The numbers cannot be higher than 1.
I know how to satisfy 1 on its own and 2+3 on their own, but not all three together.
How do I rescale this?
EDIT: As attempted by r2evans, the relative relationships of the numbers should preferably be preserved.
I don't know that this can be done with a simple expression, but we can optimize our way through it:
opt <- optimize(function(z) abs(6 - sum( z + (1-z) * (x - min(x)) / diff(range(x)) )),
lower=0, upper=1)
opt
# $minimum
# [1] 0.2380955
# $objective
# [1] 1.257898e-06
out <- ( opt$minimum + (1-opt$minimum) * (x - min(x)) / diff(range(x)) )
out
# [1] 0.6190477 0.4285716 0.7142858 0.5238097 1.0000000 0.2380955 0.3333335 0.4285716 0.7142858 1.0000000
sum(out)
# [1] 6.000001
Because that is not perfectly 6, we can do one more step to safeguard it:
out <- out * 6/sum(out)
out
# [1] 0.6190476 0.4285715 0.7142857 0.5238096 0.9999998 0.2380954 0.3333335 0.4285715 0.7142857 0.9999998
sum(out)
# [1] 6
This process preserves the relative relationships of the numbers. If there are more "low" numbers than "high" numbers, scaling so that the sum is 6 will bring the higher numbers above 1. To compensate for that, we shift the lower-end (z in my code), so that all numbers are nudged up a little (but the lower numbers will be nudged up proportionately more).
The results should always be that the numbers are in [opt$minimum,1], and the sum will be 6.
It should be possible with a while loop that repeatedly increases the values of x (up to an upper limit of 1):
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
current_sum <- sum(x)
target_sum <- 6
while (current_sum != target_sum) {  # note: exact equality on doubles is fragile; a tolerance is safer in general
  print(current_sum)
  # scale everything up by the remaining relative shortfall, then cap at 1
  perc_diff <- (target_sum - current_sum) / target_sum
  x <- x * (1 + perc_diff)
  x[which(x > 1)] <- 1
  current_sum <- sum(x)
}
The loop terminates with x equal to:
x <- c(0.833333333333333, 0.5, 1, 0.666666666666667, 1, 0.166666666666667,
       0.333333333333333, 0.5, 1)
There is likely a more mathematical way.
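In fact, for this particular parameterization (the shift-and-scale of the min-max-normalized values used in the optimize() answer above), the shift can be solved for in closed form rather than optimized numerically. Writing s for the min-max-normalized vector and out = z + (1 - z) * s, requiring sum(out) = 6 gives z = (6 - sum(s)) / (length(x) - sum(s)). A minimal sketch (my addition, not from either answer):
x <- c(0.5, 0.3, 0.6, 0.4, 0.9, 0.1, 0.2, 0.3, 0.6)
s <- (x - min(x)) / diff(range(x))
# solve sum(z + (1 - z) * s) == 6 for z; valid as long as sum(s) <= 6 <= length(x),
# which keeps z (and hence all rescaled values) inside [0, 1]
z <- (6 - sum(s)) / (length(x) - sum(s))
out <- z + (1 - z) * s
sum(out)   # 6, up to floating-point error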
I have two sampled vectors. The first vector's maximum value is 0.8 and its minimum value is -0.8. The second vector's minimum value is 0.2 and its maximum value is 0.3. For both vectors, I want to remap all the values to the range -1 to 1. How do I do that? I'm looking for a method that works for both of the vectors outlined below. Thank you in advance.
Sampled vector 1.
[ -0.8, 0.7 , -0.23, 0.56, 0.456, -0.344, -0.75, 0.8]
Sampled vector 2.
[ 0.2, 0.23, 0.21, 0.29, 0.26, 0.25, 0.3]
General formula to map xmin..xmax range onto new_min..new_max one:
X' = new_min + (new_max - new_min)*(X - xmin)/(xmax-xmin)
for destination range -1..1:
X' = -1 + 2 * (X - xmin) / (xmax-xmin)
for source range 0.2..0.3:
X' = -1 + 20 * (X - 0.2) = -5 + 20 * X
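A minimal R sketch of that formula applied to the two sample vectors (the helper name rescale_range is just illustrative, not an existing function):
# map x from its own observed range onto [new_min, new_max]
rescale_range <- function(x, new_min = -1, new_max = 1) {
  new_min + (new_max - new_min) * (x - min(x)) / (max(x) - min(x))
}
v1 <- c(-0.8, 0.7, -0.23, 0.56, 0.456, -0.344, -0.75, 0.8)
v2 <- c(0.2, 0.23, 0.21, 0.29, 0.26, 0.25, 0.3)
rescale_range(v1)   # spans -1 to 1
rescale_range(v2)   # spans -1 to 1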
You are trying to normalize to the range [-1, 1].
Obtain the ratio:
norm_ratio = 1/.8
and multiply every element in your vector by it to get the desired result.
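Applied to the first vector this looks like the snippet below (a sketch; note that multiplying by a single ratio only keeps the values inside [-1, 1] here because that vector's range is already symmetric about zero):
v1 <- c(-0.8, 0.7, -0.23, 0.56, 0.456, -0.344, -0.75, 0.8)
norm_ratio <- 1 / 0.8
v1 * norm_ratio   # now spans -1 to 1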
I'm interested in finding the mean and covariance of a truncated normal random vector. Suppose Y is a vector containing [Y1 Y2 Y3]. Y follows a multivariate normal distribution with the following mean and covariance:
mu <- c(0.5, 0.5, 0.5)
sigma <- matrix(c( 1, 0.6, 0.3,
0.6, 1, 0.2,
0.3, 0.2, 2), 3, 3)
The truncation region is the set of Ys such that AY >= 0. For instance,
A <- matrix(c(1, -2, -0.5, 1.5, -2, 0, 3, -1, -1, 4, 0, -2), byrow = TRUE, nrow = 4)
> A
[,1] [,2] [,3]
[1,] 1.0 -2 -0.5
[2,] 1.5 -2 0.0
[3,] 3.0 -1 -1.0
[4,] 4.0 0 -2.0
For the following draw of Y, it does not satisfy AY >= 0:
library(mvtnorm)   # provides rmvnorm()
set.seed(3)
Y <- rmvnorm(n = 1, mean = mu, sigma = sigma)
> all(A %*% as.matrix(t(Y)) >= 0)
[1] FALSE
But for other draws of Y, they will satisfy AY >= 0, and I want to find the mean and covariance of those Ys that satisfy AY >= 0.
There are existing packages in R that compute the mean and covariance of a truncated normal distribution. For example, mtmvnorm from the tmvtnorm package:
library(tmvtnorm)
mtmvnorm(mu, sigma, lower = ???, upper = ???)
However, the truncation set that I have, i.e., the set of Ys that satisfy AY >= 0, cannot be described by just lower and upper bounds. Is there another way in R to compute the mean and covariance of a truncated normal?
You correctly understood (or perhaps noticed) that this is NOT a standard truncated multivariate normal distribution: you have AY >= 0 as a linear constraint on Y, rather than simple element-wise lower/upper bounds.
If you are not after explicit (closed-form) expressions for the mean and covariance, a straightforward and efficient approach is Monte Carlo simulation.
More specifically, you can choose a sufficiently large N, generate N samples of Y, and keep only those samples that satisfy the constraint AY >= 0. You can then compute the mean and covariance over the selected samples. An attempt is given below:
N <- 1e7
Y <- rmvnorm(n = N, mean = mu, sigma = sigma)
# keep only the rows of Y for which all four constraints in A %*% y >= 0 hold
Y_h <- subset(Y, colSums(tcrossprod(A, Y) >= 0) == nrow(A))
mu_h <- colMeans(Y_h)
sigma_h <- cov(Y_h)
and you will see
> mu_h
[1] 0.8614791 -0.1365222 -0.3456582
> sigma_h
[,1] [,2] [,3]
[1,] 0.5669915 0.29392671 0.37487421
[2,] 0.2939267 0.36318397 0.07193513
[3,] 0.3748742 0.07193513 1.37194669
Another way follows the same idea, but fixes the number of selected samples in advance, i.e., we keep drawing until we have N samples of Y that all satisfy AY >= 0. We can use a while loop to do this:
N <- 1e6
Y_h <- list()
nl <- 0
while (nl < N) {
  Y <- rmvnorm(n = N, mean = mu, sigma = sigma)
  v <- subset(Y, colSums(tcrossprod(A, Y) >= 0) == nrow(A))  # accepted draws in this batch
  nl <- nl + nrow(v)
  Y_h[[length(Y_h) + 1]] <- v
}
Y_h <- head(do.call(rbind, Y_h), N)  # stack the batches and keep exactly N accepted samples
mu_h <- colMeans(Y_h)
sigma_h <- cov(Y_h)
and you will see
> mu_h
[1] 0.8604944 -0.1364895 -0.3463887
> sigma_h
[,1] [,2] [,3]
[1,] 0.5683498 0.29492573 0.37524248
[2,] 0.2949257 0.36352022 0.07252898
[3,] 0.3752425 0.07252898 1.37427521
Note: the advantage of the second option is that it gives you exactly as many selected samples Y_h as you want.
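As a quick side note (my addition, not part of the original answer), the acceptance rate of the constraint can be estimated from a single batch, which helps judge how large N needs to be in either approach:
Y <- rmvnorm(n = 1e5, mean = mu, sigma = sigma)
mean(colSums(tcrossprod(A, Y) >= 0) == nrow(A))   # estimated P(AY >= 0)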
I don't understand the following behavior of quantile. With type=2 it should average at discontinuities, but this doesn't always seem to happen. If I create a vector of 100 numbers and look at the percentiles, shouldn't I get the average at every percentile? This happens for some percentiles, but not for all (e.g., the 7th percentile).
quantile(seq(1, 100, 1), 0.05, type=2)
# 5%
# 5.5
quantile(seq(1, 100, 1), 0.06, type=2)
# 6%
# 6.5
quantile(seq(1, 100, 1), 0.07, type=2)
# 7%
# 8
quantile(seq(1, 100, 1), 0.08, type=2)
# 8%
# 8.5
Is this related to floating point issues?
100*0.06 == 6
#TRUE
100*0.07 == 7
#FALSE
sprintf("%.20f", 100*0.07)
#"7.00000000000000088818"
As far as I can tell, it is related to floating point: 0.07 is not exactly representable as a double.
p <- seq(0, 0.1, by = 0.001)
q <- quantile(seq(1, 100, 1), p, type=2)
plot(p, q, type = "b")
abline(v = 0.07, col = "grey")
If you think of the type 2 quantile as a function of p, you never evaluate the function at exactly 0.07, hence your results. Try, e.g., decreasing the by argument in seq() above. In that sense, the function returns exactly what is expected. In practice, with continuous data, I cannot imagine it would be of any consequence (but I know that is a poor argument).
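To see exactly where the floating-point issue enters, here is a minimal sketch of the type 2 rule (my own paraphrase of the Hyndman-Fan definition, not R's internal code): average the two adjacent order statistics only when n*p lands exactly on an integer, otherwise take the next order statistic.
type2_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- n * p                  # 100 * 0.07 is slightly above 7 in double precision
  if (h == floor(h)) {
    (x[h] + x[h + 1]) / 2     # exact integer: average at the discontinuity
  } else {
    x[ceiling(h)]             # otherwise: take the next order statistic
  }
}
type2_quantile(seq(1, 100, 1), 0.06)  # 6.5 (100 * 0.06 is exactly 6)
type2_quantile(seq(1, 100, 1), 0.07)  # 8   (100 * 0.07 is not exactly 7)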
I am generating random variables with a specified range and dimension. I have made the following code for this:
generateRandom <- function(size, scale) {
  result <- round(runif(size, 1, scale), 1)
  return(result)
}
flag <- TRUE
x <- generateRandom(300, 6)
y <- generateRandom(300, 6)
while (flag) {
  corrXY <- cor(x, y)
  if (corrXY >= 0.2) {
    flag <- FALSE
  } else {
    x <- generateRandom(300, 6)
    y <- generateRandom(300, 6)
  }
}
I want 6 such variables of size 300, each with scale between 1 and 6 except for one variable whose scale is 1 to 7, with the following correlation structure among them:
1 0.45 -0.35 0.46 0.25 0.3
1 0.25 0.29 0.5 -0.3
1 -0.3 0.1 0.4
1 0.4 0.6
1 -0.4
1
But when I try to increase the threshold value, my program gets very slow. Moreover, I want more than 7 variables of size 300, and between each pair of those variables I want some specific correlation threshold. How would I do this efficiently?
This answer is directly inspired from here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
0.25, 0.29, 0.5, -0.3, -0.3,
0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
From this target correlation structure, we first derive the correlation matrix of the Gaussian copula:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, so we use the nearest positive definite matrix instead:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
library(MASS)
G <- mvrnorm(n=300, mu=rep(0,6), Sigma=Copucov, empirical=TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and despite having taken the nearest positive definite matrix to the copula's correlation matrix, etc.), the correlation structure does not change much:
Ur <- round(U, 1)
cor(Ur)
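A quick way to quantify "not changed much" (my addition, not in the original answer) is to look at the largest absolute deviation from the target correlations:
max(abs(cor(Ur) - Rhos))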
I am estimating the probability of a species dispersing across a gridded landscape, given a dispersal kernel (a function of distance) with a maximum dispersal distance. I'm attempting to calculate area-to-area dispersal probabilities as described in eqn. 8 of this (open access) paper. This involves quadruple integration, evaluating the value of the dispersal function for every possible combination of source and target point in the source and target cells, respectively.
I've implemented this with adaptIntegrate from the cubature package, as follows, for source cell A, target cell B, and a simplified dispersal kernel where dispersal is 1 when the inter-point distance is at most 1.25 and 0 otherwise. This was shown graphically in the original question (figure not reproduced here), where the red region of cell B is unreachable since no point in cell A is within a distance of 1.25 of it.
library(cubature)
f <- function(xmin, xmax, ymin, ymax) {
adaptIntegrate(function(x) {
r <- sqrt((x[3] - x[1])^2 + (x[4] - x[2])^2)
ifelse(r > 1.25, 0, 1)
},
lowerLimit=c(-0.5, -0.5, xmin, ymin),
upperLimit=c(0.5, 0.5, xmax, ymax),
maxEval=1e5)
}
f(xmin=1.5, xmax=2.5, ymin=-0.5, ymax=0.5)
# $integral
# [1] 0.01949567
#
# $error
# [1] 0.001225998
#
# $functionEvaluations
# [1] 100035
#
# $returnCode
# [1] 0
I get a different integral when considering a target cell, C, that is placed the same distance away, but above rather than to the right of cell A.
f(xmin=-0.5, xmax=0.5, ymin=1.5, ymax=2.5)
# $integral
# [1] 0.01016105
#
# $error
# [1] 0.0241325
#
# $functionEvaluations
# [1] 100035
#
# $returnCode
# [1] 0
Why are these integrals so different (0.01949567 vs 0.01016105)? Have I coded it incorrectly? Changing the tolerance and maximum number of evaluations appears to make no great difference. Alternatively, is there a better approach to coding a solution to this type of problem?
I realise that questions about the general approach are probably better suited to stats.stackexchange.com, but I've posted here since I suspect there may be something that I'm overlooking with the coding itself.
EDIT:
For the A -> B case, nested integrate returns a solution similar to the first adaptIntegrate solution. For the A -> C case, it returns Error in integrate(function(ky) { : the integral is probably divergent.
g <- function(Bx, By, Ax, Ay) {
r <- sqrt((Ax - Bx)^2 + (Ay - By)^2)
ifelse(r > 1.25, 0, 1)
}
integrate(function(Ay) {
sapply(Ay, function(Ay) {
integrate(function(Ax) {
sapply(Ax, function(Ax) {
integrate(function(By) {
sapply(By, function(By) {
integrate(function(Bx) g(Bx, By, Ax, Ay), 1.5, 2.5)$value # Bx
})
}, -0.5, 0.5)$value # By
})
}, -0.5, 0.5)$value # Ax
})
}, -0.5, 0.5)$value # Ay
# [1] 0.019593
The reason for this seems to lie in the way adaptIntegrate works, since the only thing that changes between the two calls is the order of integration. Small discrepancies would be expected from approximate integration alone (see the first response here), but a difference this large looks more like a bug.
Comparing the values of r that the integrand sees when computing f(xmin=1.5, xmax=2.5, ymin=-0.5, ymax=0.5) versus f(xmin=-0.5, xmax=0.5, ymin=1.5, ymax=2.5) (plots not reproduced here), the range of sampled values differs dramatically between the two calls, so something must be going on inside the function.
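A reconstruction (my own, not the original code) of how those r values can be collected: let the integrand append every evaluated distance to a vector in the enclosing environment, reusing the question's setup.
r_values <- numeric(0)
f_trace <- function(xmin, xmax, ymin, ymax) {
  adaptIntegrate(function(x) {
    r <- sqrt((x[3] - x[1])^2 + (x[4] - x[2])^2)
    r_values <<- c(r_values, r)   # record every distance the routine evaluates
    ifelse(r > 1.25, 0, 1)
  },
  lowerLimit = c(-0.5, -0.5, xmin, ymin),
  upperLimit = c(0.5, 0.5, xmax, ymax),
  maxEval = 1e5)
}
f_trace(xmin = 1.5, xmax = 2.5, ymin = -0.5, ymax = 0.5)
range(r_values)   # spread of distances sampled for the A -> B cell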
One alternative here is Monte Carlo integration, which works well in this case since the points are distributed uniformly.
MCI <- function(Ax, Ay, Bx, By, N, r) {
  # draw N uniform points in each of the four coordinate ranges
  d <- sapply(list(Ax, Ay, Bx, By), function(l) runif(N, l[1], l[2]))
  # fraction of source/target point pairs within distance r
  sum(sqrt((d[, 1] - d[, 3])^2 + (d[, 2] - d[, 4])^2) <= r) / N
}
set.seed(123)
MCI(c(-0.5, 0.5), c(-0.5, 0.5), c(1.5, 2.5), c(-0.5, 0.5), 100000, 1.25)
# [1] 0.0194
MCI(c(-0.5, 0.5), c(-0.5, 0.5), c(-0.5, 0.5), c(1.5, 2.5), 100000, 1.25)
# [1] 0.01929
Generally, distance measures are sqrt((x1-x2)^2 + (y1-y2)^2). Can you explain why you are subtracting the x's from the y's when constructing r? Consider the alternate version:
f <- function(xmin, xmax, ymin, ymax) {
adaptIntegrate(function(x) {
r <- sqrt((x[4] - x[3])^2 + (x[2] - x[1])^2)
ifelse(r > 1.25, 0, 1)
},
lowerLimit=c(-0.5, -0.5, xmin, ymin),
upperLimit=c(0.5, 0.5, xmax, ymax),
maxEval=1e5)
}
f(xmin=1.5, xmax=2.5, ymin=-0.5, ymax=0.5)
#-------------
$integral
[1] 0.01016105
$error
[1] 0.0241325
$functionEvaluations
[1] 100035
$returnCode
[1] 0
#---------
f(xmin=-0.5, xmax=0.5, ymin=1.5, ymax=2.5)
#---------
$integral
[1] 0.01016105
$error
[1] 0.0241325
$functionEvaluations
[1] 100035
$returnCode
[1] 0
The maintainer of the R cubature package (Naras) has informed me that the Cubature C library gives the same results as I report in the question above, and that this is unlikely to be a bug; rather, the h-adaptive cubature routine (to which the R package is an interface) is in some cases less accurate than Cubature's p-adaptive routine, which doubles the number of sampling points in appropriate regions.
Naras also provided the following Julia code that demonstrates consistent pcubature solutions for the two cases presented in my question (the elements of the returned value are the estimated integral followed by the estimated absolute error).
using Cubature
# integrand
f = x -> ifelse(sqrt((x[3] - x[1])^2 + (x[4] - x[2])^2) > 1.25, 0, 1)
# A to B case
pcubature(f, [-0.5, -0.5, 1.5, -0.5], [0.5, 0.5, 2.5, 0.5], abstol=1e-5)
# (0.019593408732917292,3.5592555263398717e-6)
# A to C case
pcubature(f, [-0.5, -0.5, -0.5, 1.5], [0.5, 0.5, 0.5, 2.5], abstol=1e-5)
# (0.019593408732918302,3.559255527241928e-6)