Related
For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations to an already existing binary variable, denoted here as x.
For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below)
set.seed(1245)
cor <- 0.8 #Correlation
y <- rnorm(1000, cor*x, sqrt(1-cor^2))
But I can't find a way to generate a binary variable that is correlated to the already existing variable x. I found several R packages, such as copula which can generate random variables with a given dependency structure. However, they do not provide a possibility to generate variables with a set dependency on an already existing variable.
Does anyone know how to do this in an efficient way?
Thanks!
If we look at the formula for correlation:
For the new vector y, if we preserve the mean, the problem is easier to solve. That means we copy the vector x and try to flip a equal number of 1s and 0s to achieve the intended correlation value.
If we let E(X) = E(Y) = x_bar , and E(XY) = xy_bar, then for a given rho, we simplify the above to:
(xy_bar - x_bar^2) / (x_bar - x_bar^2) = rho
Solve and we get:
xy_bar = rho * x_bar + (1-rho)*x_bar^2
And we can derive a function to flip a number of 1s and 0s to get the result:
create_vector = function(x,rho){
n = length(x)
x_bar = mean(x)
xy_bar = rho * x_bar + (1-rho)*x_bar^2
toflip = sum(x == 1) - round(n * xy_bar)
y = x
y[sample(which(x==0),toflip)] = 1
y[sample(which(x==1),toflip)] = 0
return(y)
}
For your example it works:
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037
There are some extreme combinations of intended rho and p where you might run into problems, for example:
set.seed(111)
res = lapply(1:1000,function(i){
this_rho = runif(1)
this_p = runif(1)
x = rbinom(1000,1,this_p)
data.frame(
intended_rho = this_rho,
p = this_p,
resulting_cor = cor(x,create_vector(x,this_rho))
)
})
res = do.call(rbind,res)
ggplot(res,aes(x=intended_rho,y=resulting_cor,col=p)) + geom_point()
Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781
This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.
The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distibution?
Here's another simple version where it flips the variable both ways:
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908
Find the MLE of the non-linear distribution (in R, using a Gauss-Newton method):
y = sin(x*theta) + epsilon
where epsilon ~ N(0 , 0.01^2)
To do this, I've been asked to generate some data that is uniformly (and randomly) distributed from 0 <= x <= 10 , with n = 200 and theta = 2 (just for generation).
For instance, values that are close to the maximum of the sin function (1, 4 etc.) will converge but others won't.
EDITED
I now understand what theta.iter means but I cannot seem to understand why it converges only sometimes and even then, which values to input to get a useful output of. Can someone explain?
theta <- 2
x <- runif(200, 0, 10)
x <- sort(x) #this is just to sort the generated data so that plotting it
#actually looks like a sine funciton
y <- sin(x*theta) + rnorm(200, mean = 0, sd = 0.1^2)
GN_sin <- function(theta.iter, x , y, epsilon){
index <- TRUE
while (index){
y.iter <- matrix(y - sin(x*theta.iter), 200, 1)
x.iter <- matrix(theta.iter*cos(x*theta.iter), 200, 1)
theta.new <- theta.iter +
solve(t(x.iter)%*%x.iter)%*%t(x.iter)%*%y.iter
if (abs(theta.new-theta.iter) < epsilon) {index <- FALSE}
theta.iter <- as.vector(theta.new)
cat(theta.iter, '\n')
}
}
Let a data set with 3 variables (here Normal and independent but they can be correlated)
data = data.frame(x1 = rnorm(10000),
x2 = rnorm(10000),
x3 = rnorm(10000))
I would like to get the narrowest range for x1, x2 and x3 such that 95% of the observation fall within all three ranges.
So far I have the code below.
is.between <- function(x, a, b){
x <= max(c(a, b)) & x >= min(c(a, b))
}
getlims <- function(lims, x1, x2, x3){
abs(mean(
is.between(x1, lims[1], lims[2]) &
is.between(x2, lims[3], lims[4]) &
is.between(x3, lims[5], lims[6])
) - 0.95)
}
optim(initial_values, getlims, x1=x1,x2=x2,x3=x3)
Where lims[1,2] is the range for x1, lims[3,4] is the range for x2 and lims[5,6] is the range for x3.
It provides limits that contain 95% of my observation, but doesn't guarantee that it will be the smaller volume lims[1,2]*lims[3,4]*lims[5,6].
I think this is actually a problem in discrete optimization. It was given in three dimensions, I have reformulated it in two dimensions to have a better visualization, it can be immediately extended to more dimensions.
Let us try to solve it as a nonlinear optimization problem with constraints.
set.seed(1009)
N <- 1000
x <- rnorm(N); y <- rnorm(N)
The 0.05 and 0.95 quantiles of these coordinates will also be needed.
q1 <- quantile(x, 0.05); q2 <- quantile(x, 0.95)
q3 <- quantile(y, 0.05); q4 <- quantile(y, 0.95)
We define two functions, fmin the function that is to be minimized, and fbnd the function that defines the constraints. That is, we require fbnd(x) >= 0, in this way expressing that at least 95% of the points lie inside the rectangle.
fmin <- function(p) (p[2]-p[1]) * (p[4]-p[3])
fbnd <- function(p) {
c(0.05 - sum(x < p[1] | x > p[2] | y < p[3] | y > p[4]) / N,
q1 - p[1], p[2] - q2,
q3 - p[3], p[4] - q4 )
}
As starting point we can take the ranges of x- and y-coordinates.
start <- c(range(x), range(y))
The optimization solver has to minimize a function with nonlinear constraints. The auglag routine in package nloptr is a candidate solver.
S <- nloptr::auglag(start, fn=fmin, hin=fbnd)
S$par; S$value
# [1] -2.301263 2.308038 -2.079166 2.130744
# [1] 19.40474
We can improve the solution by moving the rectangle boundaries to the next upper or lower x- resp. y-coordinate. This is a necessary step as the objective function is locally constant.
r <- S$par
r[1] <- min(x[x >= r[1]]); r[2] <- max(x[x <= r[2]])
r[3] <- min(y[y >= r[3]]); r[4] <- max(y[y <= r[4]])
r
# [1] -2.299467 2.281395 -2.079166 2.127260
We can see, 50 points lie outside the rectangle and the area is 19.26905.
(r[2]-r[1]) * (r[4]-r[3]) # 19.26905
sum(x < r[1] | x > r[2] | y < r[3] | y > r[4]) # 50
The solution could still be a local minimum. Fortunately, the objective function is also locally monotone, so this will normally not happen. Of course, one could verify the solution by applying a global solver.
I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside the real world example I am working with requires addition to two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convoluting the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1.5, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # distribution function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to #jlhoward's using Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { #a to b needs to cover the range of densities above. d needs to be small for accurate approx.
z <- NA
x <- seq(a, b, d)
for (i in 1:length(t)){
print(i)
z[i] <- sum(d1(x)*d2(t[i]-x)*d)
}
return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")
I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma
x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x [j+1], based on some mysterious constant gamma
where 1 <= i <= 9, (j-m)/n <= p <
(j-m+1)/ n, x[j] is the jth order
statistic, n is the sample size, and m
is a constant determined by the sample
quantile type. Here gamma depends on
the fractional part of g = np+m-j.
So, how calculate j? m?
For the continuous sample quantile
types (4 through 9), the sample
quantiles can be obtained by linear
interpolation between the kth order
statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta
+ 1),
where α and β are constants determined
by the type. Further, m = alpha + p(1
- alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular I'm confused by the notation of p being a function and a constant, what the heck m is, and now to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper its based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.
As for the last bit, R is just stating what S uses.
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8 which satisfies all by 1. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in wikipedia.
BTW, in response to dreeves answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's Mathworld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)" etc. Each of these methods have different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <-function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE
, type = 7, ...){
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic amd the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs)
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is stil 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))