how to simulate correlated binary data with R? [duplicate] - r

This question already has answers here:
Generate correlated random numbers from binomial distributions
(3 answers)
Closed 9 years ago.
Suppose I want two vectors of binary data with a specified phi coefficient. How can I simulate them in R?
For example, how can I create two vectors like x and y below, of a specified length, with a correlation coefficient of 0.79?
> x = c(1, 1, 0, 0, 1, 0, 1, 1, 1)
> y = c(1, 1, 0, 0, 0, 0, 1, 1, 1)
> cor(x,y)
[1] 0.7905694

The bindata package is nice for generating binary data with this and more complicated correlation structures. (Here's a link to a working paper (warning, pdf) that lays out the theory underlying the approach taken by the package authors.)
In your case, assuming that the marginal probabilities of x and y are both 0.5:
library(bindata)
## Construct a binary correlation matrix
rho <- 0.7905694
m <- matrix(c(1,rho,rho,1), ncol=2)
## Simulate 100000 x-y pairs, and check that they have the specified
## correlation structure
x <- rmvbin(1e5, margprob = c(0.5, 0.5), bincorr = m)
cor(x)
#           [,1]      [,2]
# [1,] 1.0000000 0.7889613
# [2,] 0.7889613 1.0000000
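For two variables only, the same idea can be sketched in base R without bindata. This is a minimal sketch, assuming the target is the phi coefficient: the marginals and phi fix the joint probability P(x = 1, y = 1) = p1*p2 + rho*sqrt(p1*(1-p1)*p2*(1-p2)), and the four cells of the 2 x 2 table can then be sampled directly. The helper name r_binary_pair is hypothetical and not part of any package.

```r
# Hypothetical helper (not from bindata): sample n binary pairs from the
# 2 x 2 joint distribution implied by marginals p1, p2 and phi coefficient rho.
r_binary_pair <- function(n, p1, p2, rho) {
  # P(x = 1, y = 1) implied by the marginals and the phi coefficient
  p11 <- p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
  # Cell probabilities in the order (1,1), (1,0), (0,1), (0,0)
  cells <- c(p11, p1 - p11, p2 - p11, 1 - p1 - p2 + p11)
  if (any(cells < -1e-12)) stop("rho is not attainable with these marginals")
  cells <- pmax(cells, 0)  # clear tiny negative rounding errors
  idx <- sample(4, n, replace = TRUE, prob = cells)
  cbind(x = as.integer(idx <= 2), y = as.integer(idx %in% c(1, 3)))
}

set.seed(1)
xy <- r_binary_pair(1e5, p1 = 0.5, p2 = 0.5, rho = 0.79)
cor(xy[, "x"], xy[, "y"])  # should land close to 0.79
```

The sample correlation only approximates the target; for short vectors such as the length-9 example, expect noticeable deviation from run to run.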

Related

Extracting the functional form of the likelihood function that gets formed by the msm function in R

Is there a way of extracting the functional form of the likelihood function that gets formed by the msm function in R?
How can I extract the likelihood function that gets formed in the example below? I want to try and implement my own version of the quasi-Newton maximisation algorithm to improve my understanding.
library(msm)
# look at transition counts
statetable.msm(state, PTNUM, data = cav)
# define transition intensity matrix
# 1's mean a transition can occur
# 0's mean a transition should not occur
# any number can be placed on the diagonal, as msm overwrites the diagonal
# entries prior to maximising
q <- rbind(
c(0, 1, 0, 1),
c(1, 0, 1, 1),
c(0, 1, 0, 1),
c(0, 0, 0, 0)
)
# fit msm to the data
# the fnscale rescales the likelihood to prevent overflow
msm.fit <- msm(state ~ years, PTNUM, data = cav, qmatrix = q, control=list(fnscale=4000))

Generate viable sampling distributions of discrete data in R

I'm trying to simulate 2 x 2 data that would yield a relatively strong negative phi coefficient.
I'm using the library GenOrd as follows:
library(GenOrd)
# Specify sample size N
N <- 40
# Marginal distribution
marginal <- list(c(.5), c(.5))
# Matrix
Sigma <- matrix(c(1.0, -.71, -.71, 1.0), 2, 2, byrow=TRUE)
# Generate a sample of the categorical variables with specified parameters
m <- ordsample(N, marginal, Sigma)
However, I'm getting the following error whenever I input a correlation stronger (more negative) than -.70.
Error in contord(list(marginal[[q]], marginal[[r]]), matrix(c(1, Sigma[q, :
Correlation matrix not valid!
I'm clearly specifying something untenable somewhere - but I don't know what it is.
Help appreciated.
I'll have a go at answering this as a coding question. The error points to where the package first detects the problem: your Sigma entry. Given your marginal distribution, the package treats -.71 in your correlation matrix as out of bounds and is warning you of this. You can see this by flipping the signs in your Sigma:
Sigma <- matrix(c(1.0, .71, .71, 1.0), 2, 2, byrow=TRUE)
m <- ordsample(N, marginal, Sigma)
> m
[,1] [,2]
[1,] 1 1
[2,] 1 2
....
As to WHY -.71 is not valid, you may want to direct that statistical question to Cross Validated for a succinct answer.
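As a rough cross-check on the phi coefficient itself, here is a base-R sketch of the range of phi that two binary variables can attain for given marginals (the Fréchet bounds on the joint cell P(x = 1, y = 1)). The helper phi_bounds is hypothetical, and it does not reproduce GenOrd's own validity check.

```r
# Hypothetical helper: attainable range of the phi coefficient for two
# binary variables with success probabilities p1 and p2.
phi_bounds <- function(p1, p2) {
  s <- sqrt(p1 * (1 - p1) * p2 * (1 - p2))
  p11_min <- max(0, p1 + p2 - 1)  # smallest feasible P(x = 1, y = 1)
  p11_max <- min(p1, p2)          # largest feasible P(x = 1, y = 1)
  c(lower = (p11_min - p1 * p2) / s, upper = (p11_max - p1 * p2) / s)
}

phi_bounds(0.5, 0.5)  # lower = -1, upper = 1
phi_bounds(0.2, 0.7)  # a narrower, asymmetric range
```

With both marginals at 0.5 the bounds are -1 and 1, so the rejection of -.71 presumably stems from how the package handles the target internally rather than from these bounds. That is an assumption and has not been checked against the GenOrd source.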
I'm not exactly sure why. However, I found no problems simulating 2 x 2 data that would yield a relatively strong negative correlation using the generate.binary() function from the MultiOrd package.
For example, the following code will work for the complete range of correlation inputs. The documentation for the generate.binary() function indicates that the matrix specified is interpreted as a tetrachoric correlation matrix.
library(MultiOrd)
# Specify sample size N
N <- 40
# Marginal distribution for two variables as a vector for MultiOrd rather than a list
marginal <- c(.5, .5)
# Correlation (tetrachoric) matrix as target for simulated relationship between variables
Sigma <- matrix(c(1.0, -.71, -.71, 1.0), 2, 2, byrow=TRUE)
# Generate a sample of the categorical variables with specified parameters
m <- generate.binary(N, marginal, Sigma)
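Because the matrix is read as a tetrachoric correlation, the phi coefficient of the resulting binary data will generally be weaker than the number supplied. A minimal base-R sketch, assuming the binary variables come from thresholding a latent bivariate normal at its median (this imitates the idea and does not call MultiOrd): for marginals of 0.5, the expected phi is (2/pi) * asin(r).

```r
# Sketch under the latent-normal assumption; not MultiOrd's implementation.
set.seed(42)
n <- 1e5
r <- -0.71                                # latent (tetrachoric) correlation
z1 <- rnorm(n)
z2 <- r * z1 + sqrt(1 - r^2) * rnorm(n)   # cor(z1, z2) is r in expectation
x <- as.integer(z1 > 0)                   # dichotomise at the median
y <- as.integer(z2 > 0)
phi_hat <- cor(x, y)                      # observed phi coefficient
phi_theory <- (2 / pi) * asin(r)          # roughly -0.50
c(simulated = phi_hat, theory = phi_theory)
```

So a tetrachoric target of -.71 corresponds to a phi of only about -.50, which is worth keeping in mind when comparing output across packages.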

R: what is the vector of quantiles in density function dmvnorm

library(mvtnorm)
dmvnorm(x, mean = rep(0, p), sigma = diag(p), log = FALSE)
The dmvnorm function provides the density function for a multivariate normal distribution. What exactly does the first parameter, x, represent? The documentation says "vector or matrix of quantiles. If x is a matrix, each row is taken to be a quantile."
> dmvnorm(x=c(0,0), mean=c(1,1))
[1] 0.0585
Here is the sample code on the help page. In that case, are you generating the probability of having quantile 0 at a normal distribution with mean 1 and sd 1 (assuming that's the default)? Since this is a multivariate normal density function, and a vector of quantiles (0, 0) was passed in, why isn't the output a vector of probabilities?
Taking a bivariate normal (X1, X2) as an example, passing in x = (0, 0) gives the joint density evaluated at the point (X1 = 0, X2 = 0), which is a single value (a density, not a probability). Why do you expect to get a vector?
If you want a vector, you need to pass in a matrix. For example, x = cbind(c(0,1), c(0,1)) gives the densities at
f(X1 = 0, X2 = 0)
f(X1 = 1, X2 = 1)
In this situation, each row of the matrix is treated as a separate point and gets its own density value.
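To make the "one point per row" behaviour concrete, the density can be written out in base R. This is a sketch of the textbook multivariate normal density, not mvtnorm's own code; mvn_density is a hypothetical name.

```r
# Hypothetical helper: multivariate normal density, one value per row of x.
mvn_density <- function(x, mean, sigma) {
  x <- if (is.matrix(x)) x else matrix(x, nrow = 1)  # a vector is one point
  p <- ncol(x)
  # Squared Mahalanobis distance of each row from the mean
  d2 <- mahalanobis(x, center = mean, cov = sigma)
  exp(-0.5 * d2) / sqrt((2 * pi)^p * det(sigma))
}

# One point in, one density out (about 0.0585, as on the help page)
mvn_density(c(0, 0), mean = c(1, 1), sigma = diag(2))

# Two rows in, two densities out (about 0.0585 and 0.1592)
pts <- cbind(c(0, 1), c(0, 1))
mvn_density(pts, mean = c(1, 1), sigma = diag(2))
```

With an identity covariance the joint density is just the product of the univariate densities, so the first value equals dnorm(0, 1)^2.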

Predict the next probable hidden state via RHmm package for discrete distribution

I have a training sequence and a model with a finite set of values (discrete distribution). I'm training this model and getting the hidden states for a sequence x with the Viterbi algorithm, and I want to predict the next hidden state. How can I calculate it?
library(RHmm)
seq.train <- rbinom(1000, 1, 0.5)
hmm <- HMMFit(seq.train, dis = 'DISCRETE', nStates = 3)
x <- c(1, 1, 1, 0, 0, 1)
v <- viterbi(hmm, x)
You don't need the Viterbi algorithm to compute the next hidden state. All you need is the estimated transition matrix and the posterior state distribution of the last training observation.
> Gamma <- RHmm::forwardbackward(hmm, seq.train)$Gamma
> Gamma[nrow(Gamma), ]
[1] 0.008210024 0.035381361 0.956408615
> Gamma[nrow(Gamma), ] %*% hmm$HMM$transMat
[,1] [,2] [,3]
[1,] 0.2222393 0.293037 0.4847237
See this CrossValidated answer.
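The calculation is a single vector-matrix product, so it can be sketched in base R without RHmm. The transition matrix below is made up for illustration (it is not the fitted hmm$HMM$transMat); the posterior is the last row of Gamma shown above.

```r
# Hypothetical 3-state transition matrix; each row sums to 1.
trans_mat <- matrix(c(0.7, 0.2, 0.1,
                      0.3, 0.4, 0.3,
                      0.2, 0.3, 0.5), nrow = 3, byrow = TRUE)

# Posterior state distribution at the last observation (from Gamma above)
last_posterior <- c(0.008210024, 0.035381361, 0.956408615)

# Distribution of the next hidden state: posterior times transition matrix
next_state_dist <- as.vector(last_posterior %*% trans_mat)
next_state_dist

# Most probable next state
which.max(next_state_dist)
```

The result is a full probability distribution over the next state; which.max only picks its mode, which may hide a close runner-up.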

Calculation of mutual information in R

I am having problems interpreting the results of the mi.plugin() (or mi.empirical()) function from the entropy package. As far as I understand, an MI=0 tells you that the two variables that you are comparing are completely independent; and as MI increases, the association between the two variables is increasingly non-random.
Why, then, do I get a value of 0 when running the following in R (using the {entropy} package):
mi.plugin( rbind( c(1, 2, 3), c(1, 2, 3) ) )
when I'm comparing two vectors that are exactly the same?
I assume my confusion is based on a theoretical misunderstanding on my part, can someone tell me where I've gone wrong?
Thanks in advance.
Use mutinformation(x,y) from package infotheo.
> mutinformation(c(1, 2, 3), c(1, 2, 3) )
[1] 1.098612
> mutinformation(seq(1:5),seq(1:5))
[1] 1.609438
The normalized mutual information will be 1.
The mi.plugin function works on the joint frequency matrix of the two random variables. The joint frequency matrix gives the number of times X and Y take each specific pair of outcomes x and y.
In your example, you would like X to have 3 possible outcomes - x=1, x=2, x=3, and Y should also have 3 possible outcomes, y=1, y=2, y=3.
Let's go through your example and calculate the joint frequency matrix:
> X=c(1, 2, 3)
> Y=c(1, 2, 3)
> freqs=matrix(sapply(seq(max(X)*max(Y)), function(x) length(which(((X-1)*max(Y)+Y)==x))),ncol=max(X))
> freqs
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
This matrix shows the number of occurrences of X=x and Y=y. For example there was one observation for which X=1 and Y=1. There were 0 observations for which X=2 and Y=1.
You can now use the mi.plugin function:
> mi.plugin(freqs)
[1] 1.098612
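The same quantity can be sketched in base R, which also shows why the original call returned 0. This is a minimal plug-in estimate written from the definition of mutual information, not the entropy package's code; mi_from_counts is a hypothetical name, and table() is used here as a shorter way to build the joint frequency matrix.

```r
# Hypothetical helper: plug-in mutual information (in nats) from a matrix
# of joint counts, MI = sum p(x,y) * log(p(x,y) / (p(x) * p(y))).
mi_from_counts <- function(freqs) {
  p <- matrix(as.numeric(freqs), nrow = nrow(freqs))
  p <- p / sum(p)                           # joint probabilities
  indep <- outer(rowSums(p), colSums(p))    # product of the marginals
  nz <- p > 0                               # empty cells contribute 0
  sum(p[nz] * log(p[nz] / indep[nz]))
}

X <- c(1, 2, 3)
Y <- c(1, 2, 3)
mi_from_counts(table(X, Y))                       # log(3), about 1.098612

# The original call: the two vectors are read as rows of a count table.
# The rows are proportional, so the table looks independent.
mi_from_counts(rbind(c(1, 2, 3), c(1, 2, 3)))     # 0 up to rounding error
```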
