I have correlated one set number with .9, .5, .0
A derives from rnorm(30,-0.5,1)
B derives from rnorm(30,.5,2)
and want to make A & B correlated with .9, .5, .0.
You are describing a multivariate normal distribution, which can be computed with the mvrnorm function:
library(MASS)
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
set.seed(144)
vals <- mvrnorm(10000, c(meanA, meanB), matrix(c(sdA^2, correlation*sdA*sdB,
correlation*sdA*sdB, sdB^2), nrow=2))
mean(vals[,1])
# [1] -0.4883265
mean(vals[,2])
# [1] 0.5201586
sd(vals[,1])
# [1] 0.9994628
sd(vals[,2])
# [1] 1.992816
cor(vals[,1], vals[,2])
# [1] 0.8999285
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) independently. Then the random variables X and rho X + sqrt(1 - rho^2) Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
I created the correlate package to be able to create a correlation between any type of variable (regardless of distribution) given a certain amount of toleration. It does so by permutations.
install.packages('correlate')
library('correlate')
A <- rnorm(30, -0.5, 1)
B <- rnorm(30, .5, 2)
C <- correlate(cbind(A,B), 0.9)
# 0.9012749
D <- correlate(cbind(A,B), 0.5)
# 0.5018054
E <- correlate(cbind(A,B), 0.0)
# -0.00407327
You can pretty much decide the whole matrix if you want (for multiple variables), by giving a matrix as second argument.
Ironically, you can also use it to create a multivariate normal.....
Related
Let's assume we want to generate n samples from a multinomial distribution from given probabilities p. This works well with sample or rmultinorm. The totals can then be counted with table. Now I wonder if there is a direct way (or another distribution) available to get the result of table directly without generating complete sample vectors.
Here an example:
set.seed(123)
n <- 10000 # sample size
p <- c(0.1, 0.2, 0.7) # probabilities, sum up to 1.0
## 1) approach with sample
x <- sample(1:3, size = n, prob = p, replace = TRUE)
table(x)
# x
# 1 2 3
# 945 2007 7048
## 2) approach with rmultinorm
x <- rmultinom(n, size = 1, prob = p) * 1:3
table(x[x != 0])
# 1 2 3
# 987 1967 7046
Create a simulated dataset of 100 observations, where x is a random normal variable with mean 0 and standard deviation 1, and y = 0.1 + 2 * X + e, where epsilon is also a random normal error with mean 0 and sd 1.
set.seed(1)
# simulate a data set of 100 observations
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Now extract the first 5 observations.
y1.FirstFive <- (y.1[1:5]) # extract first 5 observations from y
x.FirstFive <- (x[1:5]) # extract first 5 observations from x
y1.FirstFive # extracted 5 observations from y1
[1] -1.7732743 0.5094025 -2.4821789 3.4485904 0.1044309
x.FirstFive # extracted 5 observations from x
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Assuming the mean and sd of the sample that you calculated from the first five observations would not change, what is the minimum total number of additional observations you would need to be able to conclude that the true mean of the population is different from 0 at the p = 0.01 confidence level?
alpha <- 0.01
mu <- 0
for (i in 5:2000) {
# Recalculate the standard error and CI
stand_err <- Sd_y1 / sqrt(i)
ci <- sample_mean_y1 + c(qt(alpha/2, i-1), qt(1-alpha/2, i-1))*stand_err
if (ci[2] < mu)
break # condition met, exit loop
}
i
[1] 2000
Here, I wrote a loop that iteratively increases n from the initial n=5 to n=2000, uses pt to find the p value (given a fixed y-bar and sd), and stops when p < 0.01. However I keep getting the wrong output. Such that, the output is always the number of the maximum range that I give (here, it is 2000) instead of giving me the specific minimum n sample in order to reject the null that mu_y = 0 at the p=0.01 level. Any suggestions as to how to fix the code?
additional info: the sd of y1.FirstFive = 2.3 and mean of y1.FirstFive = -0.04
Assuming:
Sd_y1 = sd(y1.FirstFive)
sample_mean_y1 = mean(y1.FirstFive)
sample_mean_y1
[1] -0.03860587
As pointed out by #jblood94, you need to go for larger sample size.
You don't need a for loop for this, most of your functions are vectorized, so something like this:
n = 5:30000
stand_err = Sd_y1 / sqrt(n)
ub = sample_mean_y1 + qt(1-alpha/2, n-1)*stand_err
n[min(which(ub<0))]
[1] 23889
It's because n > 2000.
set.seed(1)
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Sd_y1 <- sd(y.1[1:5])
sample_mean_y1 <- mean(y.1[1:5])
alpha <- 0.01
sgn <- 2*(sample_mean_y1 > 0) - 1
f <- function(n) qt(alpha/2, n - 1)*Sd_y1 + sgn*sample_mean_y1*sqrt(n)
upper <- 2
while (f(upper) < 0) upper <- upper*2
(n <- ceiling(uniroot(f, lower = upper/2, upper = upper, tol = 0.5)$root))
#> [1] 23889
When using the matchit-function for full matching, the results differ by the order of the input dataframe. That is, if the order of the data is changed, results change, too. This is surprising, because in my understanding, the optimal full algorithm should yield only one single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below you find a reproducible example. Subclasses should be identical for the two data sets, which they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
> library(optmatch)
> nr <- c(1:100) x1 <- rnorm(100, mean=50, sd=20)
> outcome <- rnorm(100, mean=500, sd=20) group <- c(rep(0, 50),rep(1, 50))
> df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr) > df_neworder <- df_oldorder[order(outcome),] # > re-order data.frame
> glm_oldorder <- match_on(glm(group~x1, > data=df_oldorder), data = df_oldorder)
> glm_neworder <- > match_on(glm(group~x1, data=df_neworder), data = df_neworder)
> fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
> fm_new <- fullmatch(glm_neworder, data=df_neworder)
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
> ## 0.06216174
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.062058 mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may
require increased run time, in order to get closer to the true
floating put minimum. We found 0.001 seemed to work well in practice,
but there is nothing special about this value.
I'm looking to simulate an age variable (constrained range 18-35) that is correlated 0.1 with an existing binary variable called use. Most of the examples I've come across demonstrate how to simulate both variables simultaneously.
# setup
set.seed(493)
n <- 134
dat <- data.frame(partID=seq(1, n, 1),
trt=c(rep(0, n/2),
rep(1, n/2)))
# set proportion
a <- .8
b <- .2
dat$use <- c(rbinom(n/2, 1, b),
rbinom(n/2, 1, a))
Not sure if this is the best way to approach this, but you might get close using the answer from here: https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable
For example (using the code from the link):
x1 <- dat$use # fixed given data
rho <- 0.1 # desired correlation = cos(angle)
theta <- acos(rho) # corresponding angle
x2 <- rnorm(n, 2, 0.5) # new random data
X <- cbind(x1, x2) # matrix
Xctr <- scale(X, center=TRUE, scale=FALSE) # centered columns (mean 0)
Id <- diag(n) # identity matrix
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE])) # QR-decomposition, just matrix Q
P <- tcrossprod(Q) # = Q Q' # projection onto space defined by x1
x2o <- (Id-P) %*% Xctr[ , 2] # x2ctr made orthogonal to x1ctr
Xc2 <- cbind(Xctr[ , 1], x2o) # bind to matrix
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2))) # scale columns to length 1
x <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1] # final new vector
dat$age <- (1 + x) * 25
cor(dat$use, dat$age)
# 0.1
summary(dat$age)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 20.17 23.53 25.00 25.00 26.59 30.50
I have:
x = rnorm(100)
# Partie b
z = rbinom(100,1,0.60)
# Partie c
y = 1.4 + 0.7*x - 0.5*z
# Partie d
x1 = abs(x)
y1 = abs(y)
Don<-cbind(y1,x1,z)
Don1 <- data.frame(Don)
Reg <- glm(y1~x1+z,family=poisson(link="log"),Don1)
# Partie e
#Biais de beta
Reg.cf <- coef(Reg)
biais0 = Reg.cf[1] - 1.4
biais1 = Reg.cf[2] - 0.7
biais2 = Reg.cf[3] + 0.5
And I need to repeat all this 100 times in order to have different coefficient and calculate the bias and then put the mean of each biais in a text file.
I don't know how to implement I taught about repeat{if()break;} But how do I do that? I tried the loop for but it didn't work out.
I'd be inclined to do it this way.
get.bias <- function(i) { # the argument i is not used
x <- rnorm(100)
z <- rbinom(100,1,0.60)
y <- 1.4 + 0.7*x - 0.5*z
df <- data.frame(y1=abs(y), x1=abs(x), z)
coef(glm(y1~x1+z,family=poisson(link="log"),df)) - c(1.4,0.7,-0.5)
}
set.seed(1) # for reproducible example; you may want to comment out this line
result <- t(sapply(1:100,get.bias))
head(result)
# (Intercept) x1 z
# [1,] -1.129329 -0.4992925 0.076027012
# [2,] -1.205608 -0.5642966 0.215998775
# [3,] -1.089448 -0.5834090 0.081211412
# [4,] -1.206076 -0.4629789 0.004513795
# [5,] -1.203938 -0.6980701 0.201001466
# [6,] -1.366077 -0.5640367 0.452784690
colMeans(result)
# (Intercept) x1 z
# -1.1686845 -0.5787492 0.1242588
sapply(list,fun) "applies" the list element-wise to the function; e.g. it calls the function once for each element in the list, and assembles the results into a matrix. So here get.bias(...) will be called 100 times and the results returned each time will be assembled into a matrix. This matrix has one column for each result, but we want the results in rows with one column for each parameter, so we transpose with t(...).