I have a question:
Suppose X is a random variable whose density function is uniform between -2 and 3.
I want to answer these two questions:
What is the upper quartile of X?
What is the 44% quantile of X?
Now the things I have tried are below:
z <- 1 - punif(0.75, min = -2, max = 3, lower.tail = TRUE)
answer: 0.45
y <- qunif(0.44, min = -2, max = 3, lower.tail = TRUE)
answer: 0.2
First, is this even the right way to go about it?
Second, I understand that punif finds the cumulative probability of X. What does qunif find, and what does the result tell me about X and the distribution?
If you have a random variable X with a uniform distribution from a to b,
X ~ U(a, b)
then punif(x, a, b) is the probability that X <= x,
and qunif(p, a, b) finds the value y such that Pr(X <= y) = p.
You can visualize these plots with
curve(punif(x, -2, 3), from=-2, to=3, main="punif")
curve(qunif(x, -2, 3), from=0, to=1, main="qunif")
Note how punif expects a value anywhere between a and b, but qunif expects a probability, so its first argument must be between 0 and 1.
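Applied to your two questions, qunif answers both directly (a quick sketch):
# upper quartile: the value below which 75% of the probability lies
qunif(0.75, min = -2, max = 3)
# [1] 1.75
# 44% quantile: your second attempt was already correct
qunif(0.44, min = -2, max = 3)
# [1] 0.2
Your first attempt, 1 - punif(0.75, ...), instead gives P(X > 0.75) = 0.45, which is a probability, not the upper quartile.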
I am trying to change the value range of a variable (an array of values) while keeping its properties. I don't know the exact name for this in math, but I mean the kind of transformation where the array keeps exactly the same properties and the spacing between the values stays the same, but the range is different. Maybe the code below will explain what I mean.
I just want to linearly map the values to some other range while the distribution remains the same. In other words, I change the range of the variable using the linear equation y = a * x + b. Since the transformation is completely linear, the correlation between the variables is exactly 1, and I calculate the new variable (array) from a system of equations into which I substitute the extremes of both ranges:
min_y = min_x * a + b
max_y = max_x * a + b
from which I can work out the following code to obtain the a and b coefficients:
# this is my input variable
x <- c(-1, -0.5, 0, 0.5, 1)
# this is the range I want to obtain
y.pred <- c(1,2,3,4,5)
max_y = 5
min_y = 1
min_x = min(x)
max_x = max(x)
c1 = max_x-min_x
c2 = max_y-min_y
a.coeff = c2/c1
b.coeff = a.coeff-min_x
y = x * a.coeff + b.coeff
y
# hey, it works! :)
[1] 1 2 3 4 5
The correlation between the variable before and after the transformation is exactly 1, so we have a basis for further action. Let's turn it into a function:
linscale.to.int <- function(max.lengt, vector) {
  max_y = max.lengt
  min_y = 1
  min_x = min(vector)
  max_x = max(vector)
  c1 = max_x - min_x
  c2 = max_y - min_y
  a.coeff = c2/c1
  b.coeff = a.coeff - min_x
  return(vector * a.coeff + b.coeff)
}
x <- c(-1, -0.5, 0, 0.5, 1)
linscale.to.int(5,x)
[1] 1 2 3 4 5
and it works again. But here's the thing: when I apply this function to a random distribution, like this:
x.rand <- rnorm(50)
y.rand <- linscale.to.int(5,x.rand)
plot(x.rand, y.rand)
or, more visibly, this:
x.rand <- rnorm(500)
y.rand <- linscale.to.int(20,x.rand)
plot(x.rand, y.rand)
I get values of the second variable completely out of range; they should be between 1 and 20, but I get values from about -1 to 15.
And now the question arises: what am I doing wrong here? Where does this transformation go wrong?
What you are trying to do is very straightforward using rescale from the scales package (which you will already have if ggplot2 / tidyverse is installed). Simply give it the new minimum / maximum values:
x <- c(-1, -0.5, 0, 0.5, 1)
scales::rescale(x, c(1, 5))
#> [1] 1 2 3 4 5
If you want to have your own function written in base R, the following one-liner should do what you want:
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1
(Note that it is good practice in R to avoid periods in function names because this can cause confusion with S3 method dispatch)
Testing, we have:
x <- c(-1, -0.5, 0, 0.5, 1)
linscale_to_int(5, x)
#> [1] 1 2 3 4 5
x.rand <- rnorm(50)
y.rand <- linscale_to_int(5, x.rand)
plot(x.rand, y.rand)
y.rand <- linscale_to_int(20, x.rand)
plot(x.rand, y.rand)
Created on 2022-08-31 with reprex v2.0.2
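As for why the original function misbehaves (my diagnosis, for completeness): the intercept line b.coeff = a.coeff - min_x should be b.coeff = min_y - a.coeff * min_x, i.e., the intercept that makes the line pass through (min_x, min_y). The two expressions happen to coincide whenever min(x) is exactly -1, as in the test vector c(-1, -0.5, 0, 0.5, 1), which is why the bug went unnoticed; for rnorm draws the minimum is almost never -1, so the output drifts out of range. A corrected version of the original function (the name linscale_fixed is mine):
linscale_fixed <- function(max.length, vector) {
  min_y = 1
  max_y = max.length
  min_x = min(vector)
  max_x = max(vector)
  a.coeff = (max_y - min_y) / (max_x - min_x)
  # intercept: solve min_y = a.coeff * min_x + b.coeff for b.coeff
  b.coeff = min_y - a.coeff * min_x
  vector * a.coeff + b.coeff
}
range(linscale_fixed(20, rnorm(500)))
# [1]  1 20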
I'm interested in finding the mean and covariance of a truncated normal random vector. Suppose Y is a vector containing [Y1 Y2 Y3]. Y follows a multivariate normal distribution with the following mean and covariance:
mu <- c(0.5, 0.5, 0.5)
sigma <- matrix(c( 1, 0.6, 0.3,
0.6, 1, 0.2,
0.3, 0.2, 2), 3, 3)
The truncation region is the set of Ys such that AY >= 0. For instance,
A <- matrix(c(1, -2, -0.5, 1.5, -2, 0, 3, -1, -1, 4, 0, -2), byrow = TRUE, nrow = 4)
> A
[,1] [,2] [,3]
[1,] 1.0 -2 -0.5
[2,] 1.5 -2 0.0
[3,] 3.0 -1 -1.0
[4,] 4.0 0 -2.0
For the following draw of Y, it does not satisfy AY >= 0:
library(mvtnorm)  # for rmvnorm
set.seed(3)
Y <- rmvnorm(n = 1, mean = mu, sigma = sigma)
> all(A %*% as.matrix(t(Y)) >= 0)
[1] FALSE
But for other draws of Y, they will satisfy AY >= 0, and I want to find the mean and covariance of those Ys that satisfy AY >= 0.
There are existing packages in R that compute the mean and covariance of a truncated normal distribution. For example, mtmvnorm from the tmvtnorm package:
library(tmvtnorm)
mtmvnorm(mu, sigma, lower = ???, upper = ???)
However, the truncation set that I have, i.e., the set of Ys that satisfy AY >= 0, cannot be described by just lower and upper bounds. Is there another way in R to compute the mean and covariance of a truncated normal?
You correctly noticed that this is NOT a standard truncated multivariate normal distribution: AY >= 0 is a linear constraint over Y, rather than simple element-wise lower/upper bounds.
If you are not pursuing explicit closed-form solutions for the mean and covariance, a straightforward and efficient approach is Monte Carlo simulation.
More specifically, you can choose a sufficiently large N, generate a big set of samples Y, and then keep only the samples that satisfy the constraint AY >= 0. You can then compute the mean and covariance over the selected samples. An attempt is given below:
library(mvtnorm)
N <- 1e7
Y <- rmvnorm(n = N, mean = mu, sigma = sigma)
# keep the rows of Y for which all rows of A %*% t(Y) are non-negative
Y_h <- subset(Y, colSums(tcrossprod(A, Y) >= 0) == nrow(A))
mu_h <- colMeans(Y_h)
sigma_h <- cov(Y_h)
and you will see
> mu_h
[1] 0.8614791 -0.1365222 -0.3456582
> sigma_h
[,1] [,2] [,3]
[1,] 0.5669915 0.29392671 0.37487421
[2,] 0.2939267 0.36318397 0.07193513
[3,] 0.3748742 0.07193513 1.37194669
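As a side note, the acceptance rate of the filter is itself a Monte Carlo estimate of P(AY >= 0), and tells you how many of the N draws survive:
# estimated probability mass of the constrained region
nrow(Y_h) / N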
Another way follows the same idea, but fixes the number of selected samples in advance: all N retained samples Y should satisfy AY >= 0. We can use a while loop to do this:
N <- 1e6
Y_h <- list()
nl <- 0
while (nl < N) {
  Y <- rmvnorm(n = N, mean = mu, sigma = sigma)
  # keep the draws that satisfy all constraints and count them
  v <- subset(Y, colSums(tcrossprod(A, Y) >= 0) == nrow(A))
  nl <- nl + nrow(v)
  Y_h[[length(Y_h) + 1]] <- v
}
Y_h <- head(do.call(rbind, Y_h), N)
mu_h <- colMeans(Y_h)
sigma_h <- cov(Y_h)
and you will see
> mu_h
[1] 0.8604944 -0.1364895 -0.3463887
> sigma_h
[,1] [,2] [,3]
[1,] 0.5683498 0.29492573 0.37524248
[2,] 0.2949257 0.36352022 0.07252898
[3,] 0.3752425 0.07252898 1.37427521
Note: the advantage of the second option is that it guarantees you end up with exactly as many selected samples Y_h as you want.
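If you need this repeatedly, the rejection step can be wrapped in a small helper (a sketch; the name rmvnorm_constrained is mine, and it assumes the mvtnorm package is loaded):
# draw n samples from N(mu, sigma) conditional on A %*% y >= 0
rmvnorm_constrained <- function(n, mu, sigma, A) {
  out <- list()
  got <- 0
  while (got < n) {
    Y <- rmvnorm(n, mean = mu, sigma = sigma)
    keep <- colSums(tcrossprod(A, Y) >= 0) == nrow(A)
    v <- Y[keep, , drop = FALSE]
    got <- got + nrow(v)
    out[[length(out) + 1]] <- v
  }
  head(do.call(rbind, out), n)
}
Y_h <- rmvnorm_constrained(1e6, mu, sigma, A)
mu_h <- colMeans(Y_h)
sigma_h <- cov(Y_h)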
I would like to find the t-value for a 90% confidence interval with 17 observations.
In Excel, I can do this calculation with t = T.INV.2T(0.10, 16) = 1.75; however, in R I cannot find the correct way to get the same result.
qt(p = 1-.9, df = 17-1) = -1.34
qt(p = (1-.9)/2, df = 17-1) = -1.75 # trying with two-tailed?
What is the R function that does the same computation as T.INV.2T in Excel?
Similarly, Excel also has T.DIST.2T; what is the equivalent function in R?
You need the 1 - .1 / 2 = 0.95 quantile from the t-distribution with 17 - 1 = 16 degrees of freedom:
qt(0.95, 16)
# [1] 1.745884
Explanation
Excel describes T.INV.2T as
Returns the two-tailed inverse of the Student's t-distribution
which is the quantile in math talk (though I would never use the term "two-tailed quantile"). The p quantile q is defined as the point which satisfies P(X <= q) >= p.
In R we get that with the function qt (q for quantile, t for t-distribution). Now we just have to sort out what is meant by a two-tailed inverse. It turns out we are looking for the point q which satisfies P(X <= -|q| or X >= |q|) = 0.1. Since the t-distribution is symmetric, this simplifies to P(X >= |q|) = 0.1 / 2.
You can easily verify that in R with the use of the probability function pt:
pt(qt(0.05, 16), 16, lower.tail = TRUE) +
pt(qt(0.95, 16), 16, lower.tail = FALSE)
# [1] 0.1
As you correctly guessed, you do it by computing the two-sided interval (alpha/2 = 0.1/2 = 0.05):
> qt(p = 0.95, df = 16)
[1] 1.745884
So 5% is cut off from each tail. I don't know Excel, but I am guessing that is what that function is doing.
As for T.DIST.2T, that is, I assume, the two-sided tail probability:
pt(-1.745884, df=16, lower.tail=T) +
pt(1.745884, df=16, lower.tail=F)
which is equal to 0.09999994.
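More compactly, by symmetry the two Excel functions correspond to (a sketch using base R only):
# T.INV.2T(0.10, 16): two-tailed inverse
qt(1 - 0.10 / 2, df = 16)
# [1] 1.745884
# T.DIST.2T(t, 16): two-tailed probability of |T| exceeding t
2 * pt(abs(1.745884), df = 16, lower.tail = FALSE)
# [1] 0.09999994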
I looked at other questions like this, this and this, but all of these calculate the shortest distance to a line segment defined by two endpoints, whereas I have not been able to do the same for a line defined by an intercept and a slope.
This is my data, which I plot and add a line that will always have an intercept of 0 and slope defined by the means of the two variables.
df <- data.frame(x = seq(1, 10, 1),
y = seq(1, 10, 2),
id = head(letters, 10))
plot(df$x, df$y,
abline(a = 0, b = (mean(df$x) / mean(df$y))))
I am trying to calculate the shortest distance from each point to the line.
Try this (modified from here):
# Perpendicular distance from point 'a' to the line y = slope * x + intercept
# (assumes slope != 0, since it uses the line's x-intercept)
dist_point_line <- function(a, slope, intercept) {
  p1 <- c(1, intercept + slope)   # a point on the line, at x = 1
  p2 <- c(-intercept / slope, 0)  # another point on the line: its x-intercept
  v1 <- p1 - p2                   # direction vector of the line
  v2 <- a - p1                    # vector from the line to the point
  # parallelogram area |det| divided by base length gives the height
  return(abs(det(cbind(v1, v2))) / sqrt(sum(v1 * v1)))
}
dist_point_line(c(2,1), 1, 0)
#[1] 0.7071068
In your case you could do something like this
apply(df, 1, function(x) dist_point_line(as.numeric(x[1:2]), slope = 1, intercept = 0) )
#[1] 0.0000000 0.7071068 1.4142136 2.1213203 2.8284271 3.5355339 2.8284271 2.1213203 1.4142136 0.7071068
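Alternatively, the standard point-to-line distance formula |slope * x - y + intercept| / sqrt(slope^2 + 1) gives the same result in one vectorized call, with no division by the slope (a sketch; the name dist_to_line is mine):
dist_to_line <- function(x, y, slope, intercept) {
  abs(slope * x - y + intercept) / sqrt(slope^2 + 1)
}
dist_to_line(df$x, df$y, slope = 1, intercept = 0)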
It's my understanding that when calculating quantiles in R, the entire dataset is scanned and the value for each quantile is determined.
If you ask for .8, for example, it will give you the value that would occur at that quantile. Even if no such value exists in the data, R will nonetheless give you the value that would have occurred at that quantile, through linear interpolation.
However, what if one wishes to calculate a quantile and then round up or down to the nearest actual value?
For example, if the quantile at .80 gives a value of 53 when the real dataset only has a 50 and a 54, how could one get R to return either of these values?
Try this:
#dummy data
x <- c(1,1,1,1,10,20,30,30,40,50,55,70,80)
#get quantile at 0.8
q <- quantile(x, 0.8)
q
# 80%
# 53
#closest match - "round up"
min(x[ x >= q ])
#[1] 55
#closest match - "round down"
max(x[ x <= q ])
#[1] 50
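If instead you want whichever actual value is closest in either direction, a one-line sketch with the same dummy data:
#closest match in either direction
x[which.min(abs(x - q))]
#[1] 55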
There are many estimation methods implemented in R's quantile function. You can choose which type to use with the type argument as documented in https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html.
x <- c(1, 1, 1, 1, 10, 20, 30, 30, 40, 50, 55, 70, 80)
quantile(x, c(.8)) # default, type = 7
# 80%
# 53
quantile(x, c(.8), FALSE, TRUE, 7) # equivalent to the previous invocation
# 80%
# 53
quantile(x, c(.8), FALSE, TRUE, 3) # type = 3, nearest sample
# 80%
# 50