I am having problems interpreting the results of the mi.plugin() (or mi.empirical()) function from the entropy package. As far as I understand, an MI=0 tells you that the two variables that you are comparing are completely independent; and as MI increases, the association between the two variables is increasingly non-random.
Why, then, do I get a value of 0 when running the following in R (using the {entropy} package):
mi.plugin( rbind( c(1, 2, 3), c(1, 2, 3) ) )
when I'm comparing two vectors that are exactly the same?
I assume my confusion comes from a theoretical misunderstanding on my part. Can someone tell me where I've gone wrong?
Thanks in advance.
Use mutinformation(x,y) from package infotheo.
> mutinformation(c(1, 2, 3), c(1, 2, 3) )
[1] 1.098612
> mutinformation(seq(1:5),seq(1:5))
[1] 1.609438
The normalized mutual information is 1 in both cases.
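For reference, both numbers can be reproduced in base R from the joint table of the two vectors. This is only a sketch: `mi_nmi` is a hypothetical helper (not part of infotheo or entropy), and dividing by sqrt(H(X) * H(Y)) is just one of several normalizations in use, assumed here for illustration.

```r
# Hypothetical helper (not part of infotheo or entropy): plug-in mutual
# information and a normalized version, in nats, from two discrete vectors.
mi_nmi <- function(x, y) {
  p  <- unclass(table(x, y)) / length(x)     # joint relative frequencies
  px <- rowSums(p)                           # marginal distribution of x
  py <- colSums(p)                           # marginal distribution of y
  nz <- p > 0                                # skip empty cells (0 * log 0 = 0)
  mi <- sum(p[nz] * log(p[nz] / outer(px, py)[nz]))
  hx <- -sum(px[px > 0] * log(px[px > 0]))   # entropy of x
  hy <- -sum(py[py > 0] * log(py[py > 0]))   # entropy of y
  # Assumption: normalize by the geometric mean of the two entropies.
  c(mi = mi, nmi = mi / sqrt(hx * hy))
}

res <- mi_nmi(c(1, 2, 3), c(1, 2, 3))
print(res)   # mi equals log(3); nmi equals 1 for identical vectors
```

For two identical vectors the mutual information equals the entropy of either one, which is why the normalized value comes out as 1.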
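The mi.plugin function works on the joint frequency matrix of the two random variables, not on the raw observation vectors. The joint frequency matrix gives the number of times X and Y take each specific pair of outcomes x and y. Your call passed rbind(c(1, 2, 3), c(1, 2, 3)), which mi.plugin reads as a table of counts with 2 rows and 3 columns. Its two rows are identical, which is exactly what a table for two independent variables looks like, hence the 0.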
In your example, you would like X to have 3 possible outcomes - x=1, x=2, x=3, and Y should also have 3 possible outcomes, y=1, y=2, y=3.
Let's go through your example and calculate the joint frequency matrix:
> X=c(1, 2, 3)
> Y=c(1, 2, 3)
> freqs=matrix(sapply(seq(max(X)*max(Y)), function(x) length(which(((X-1)*max(Y)+Y)==x))),ncol=max(X))
> freqs
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
This matrix shows the number of occurrences of X=x and Y=y. For example there was one observation for which X=1 and Y=1. There were 0 observations for which X=2 and Y=1.
You can now use the mi.plugin function:
> mi.plugin(freqs)
[1] 1.098612
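A shorter route to the same joint frequency matrix, sketched here with base R's table(), may also make it clearer why stacking the raw vectors with rbind() gives 0. The variable names `joint`, `stacked`, and `independent` are only illustrative.

```r
X <- c(1, 2, 3)
Y <- c(1, 2, 3)

# table() cross-tabulates the paired observations: rows are the values of X,
# columns the values of Y. For these data only the diagonal cells are filled.
joint <- table(X, Y)
print(joint)

# Stacking the raw vectors instead yields a 2 x 3 table of "counts".
stacked <- rbind(c(1, 2, 3), c(1, 2, 3))
p <- stacked / sum(stacked)                    # entries read as joint frequencies
independent <- outer(rowSums(p), colSums(p))   # product of the marginals

# Every cell equals the product of its marginals, i.e. the table describes
# two independent variables, so its mutual information is 0.
print(all.equal(p, independent))
```

Note that table(X, Y) puts X in the rows, whereas the sapply() construction above puts Y in the rows. Mutual information is symmetric, so either orientation gives the same value.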
I want to generate a second vector in R that is correlated with my first vector.
The first vector is simply created as follows:
x <- rbinom(n=10,1,p=0.8)
x
[1] 0 0 1 0 1 1 1 1 0 0
My second vector should be generated with a defined correlation, e.g. 0.8.
I know that you can use mvrnorm() for the normal distribution, but I don't know how to do it for the binomial distribution. I tried to find a solution, but the suggestions were a bit too complicated for me or I could not apply them to my code.
I second the recommendation that you visit Cross Validated. It is not clear how you are planning to use the correlated binomial distribution. Given your stipulation that you are starting with one vector and you want to create a second based on the first, all you need to do is adjust the probabilities of the second vector:
set.seed(42)
x <- rbinom(n=1000, size=1, p=0.8) # Your first vector
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .95, .05))
cor(x, y)
# [1] 0.8505885
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .94, .06))
cor(x, y)
# [1] 0.821918
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .93, .07))
cor(x, y)
# [1] 0.7679597
In generating the second vector, the probability of a 1 is set above .8 (the first vector's probability of a 1) wherever the first vector is 1, and below .2 (the first vector's probability of a 0) wherever the first vector is 0. The closer these two conditional probabilities are to 1 and 0, the higher the correlation.
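The correlation this construction aims at can also be worked out before simulating. The sketch below assumes x ~ Bernoulli(p), with y drawn as Bernoulli(a) where x is 1 and Bernoulli(b) where x is 0. `bern_cor` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: theoretical Pearson correlation between x and y when
#   x ~ Bernoulli(p), y | x = 1 ~ Bernoulli(a), y | x = 0 ~ Bernoulli(b).
bern_cor <- function(p, a, b) {
  q <- p * a + (1 - p) * b            # marginal P(y = 1)
  cov_xy <- p * (1 - p) * (a - b)     # E[xy] - E[x] * E[y]
  cov_xy / sqrt(p * (1 - p) * q * (1 - q))
}

# The three settings used above
print(bern_cor(0.8, 0.95, 0.05))
print(bern_cor(0.8, 0.94, 0.06))
print(bern_cor(0.8, 0.93, 0.07))

# Compare the first setting with a large simulated sample
set.seed(1)
x <- rbinom(1e5, size = 1, prob = 0.8)
y <- rbinom(1e5, size = 1, prob = ifelse(x == 1, 0.95, 0.05))
print(cor(x, y))
```

With only 1000 draws, as in the runs above, the sample correlation can sit a few hundredths away from the theoretical value, so it is easier to tune a and b with the formula than by trial and error.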
An apparently simple problem: I want to generate 2 (simulated) variables (x, y) from a bivariate distribution with a given correlation matrix. In other words, I want two variables/vectors with values of either 0 or 1 and a defined correlation between them.
The case of normal distribution is easy with the MASS package.
library(MASS) # mvrnorm()
library(dplyr) # %>% and mutate()
df_norm = mvrnorm(
100, mu = c(x=0,y=0),
Sigma = matrix(c(1,0.5,0.5,1), nrow = 2),
empirical = TRUE) %>%
as.data.frame()
cor(df_norm)
x y
x 1.0 0.5
y 0.5 1.0
Yet, how could I generate binary data from a given correlation matrix?
This is not working:
df_bin = df_norm %>%
mutate(
x = ifelse(x<0,0,1),
y = ifelse(y<0,0,1))
x y
1 0 1
2 0 1
3 1 1
4 0 1
5 1 0
6 0 0
7 1 1
8 1 1
9 0 0
10 1 0
This creates binary variables, but the correlation is not (even close to) 0.5.
cor(df_bin)
x y
x 1.0000000 0.2994996
y 0.2994996 1.0000000
Ideally I would like to be able to specify the type of distribution as an argument in the function (as in the lm() function).
Any idea?
I guessed that you weren't looking for binary, as in values of either zero or one. If that is what you're looking for, this isn't going to help.
I think what you want to look at is the construction of binary pair-copula.
You said you wanted to specify the distribution. The package VineCopula would be a good start.
You can use the correlation matrix to simulate the data after selecting the distribution. You mentioned lm(); Gaussian (the normal distribution) is one of the options.
You can read about this approach in Lin and Chaganty (2021). The package isn't based on their work, but that's where I started when I looked for your answer.
I used the correlation of .5 as an example and the Gaussian copula to create 100 sets of points in this example:
# vine-copula
library(VineCopula)
set.seed(246543)
df <- BiCopSim(100, 1, .5) # N = 100 draws, family = 1 (Gaussian copula), par = .5
head(df)
# [,1] [,2]
# [1,] 0.07585682 0.38413426
# [2,] 0.44705686 0.76155029
# [3,] 0.91419758 0.56181837
# [4,] 0.65891869 0.41187594
# [5,] 0.49187672 0.20168128
# [6,] 0.05422541 0.05756005
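If 0/1 values are what is needed after all, one common route is to threshold a latent bivariate normal pair, which is what the attempt in the question does. The sketch below does not use VineCopula. It assumes both variables are cut at 0 (so each is 1 with probability 0.5) and relies on the standard result that the binary correlation is then (2 / pi) * asin(rho) for latent correlation rho. The function name `sim_binary_pair` is hypothetical.

```r
# Hypothetical helper: two 0/1 vectors with P(1) = 0.5 each and a target
# Pearson correlation, obtained by thresholding a latent bivariate normal at 0.
sim_binary_pair <- function(n, target_cor) {
  # Invert phi = (2 / pi) * asin(rho) to get the latent correlation needed.
  rho <- sin(pi * target_cor / 2)
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # cor(z1, z2) = rho
  data.frame(x = as.integer(z1 > 0), y = as.integer(z2 > 0))
}

# A latent correlation of 0.5 only gives a binary correlation of 1/3
print((2 / pi) * asin(0.5))

set.seed(123)
df_bin <- sim_binary_pair(1e5, target_cor = 0.5)
print(cor(df_bin))   # off-diagonal entries should be close to 0.5
```

This also accounts for the value of roughly 0.3 in the question: cutting normals with correlation 0.5 at 0 yields a binary correlation of 1/3, so the latent correlation has to be raised to sin(pi / 4), about 0.71, to end up at 0.5.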
I have a matrix of probabilities. Each row gives the probabilities that observation i falls in level 1, 2, or 3. For example, row 1 says that the first observation falls in level 1 with probability 0.2, in level 2 with probability 0.3, and in level 3 with probability 0.5. In the end I want to use the probability matrix to assign each observation to level 1, 2, or 3 and get a column such as 1,2,3,3,2,...
I tried to use rmultinom by drawing one sample from each row with the corresponding probabilities, but I'm not sure whether this is the correct way or whether there is a better method.
px1=c(0.2, 0.3,0.5)
px2=c(0.1, 0.2,0.7)
px3=c(0.5, 0.1,0.4)
px4=c(0.3, 0.3,0.4)
px5=c(0.4, 0.3,0.3)
px6=c(0.5, 0.1,0.4)
px7=c(0.2, 0.3,0.5)
px8=c(0.5,0.4,0.1)
px9=c(0.2,0.5,0.3)
px10=c(0.6,0.3,0.1)
prob1=matrix(c(px1,px2,px3,px4,px5,px6,px7,px8,px9,px10), ncol=3, nrow=10, byrow=TRUE) # byrow=TRUE so that row i holds the probabilities of observation i
x1=rmultinom(1,1,prob=prob1[1,])
> x1
[,1]
[1,] 0
[2,] 1
[3,] 0
Does that mean observation 1 is in level 2?
Yes, in your example, that output means you sampled the first observation as falling into level 2. Using rmultinom is okay, but it would probably be more convenient to use the sample function:
lvls <- sapply(1:nrow(prob1),function(x) sample(1:3,1,prob=prob1[x,]))
If you wanted to use rmultinom, you could do so as:
lvls <- sapply(1:nrow(prob1),function(x) which(rmultinom(1,1,prob=prob1[x,])==1))
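For larger matrices, the per-row loop can be avoided with inverse-CDF sampling: draw one uniform number per row and count how many cumulative probabilities it exceeds. This is a minimal sketch, assuming every row of the matrix sums to 1. `sample_levels` and `prob_demo` are hypothetical names.

```r
# Hypothetical helper: draw one level per row of a probability matrix.
sample_levels <- function(prob) {
  cum <- t(apply(prob, 1, cumsum))   # row-wise cumulative probabilities
  u <- runif(nrow(prob))             # one uniform draw per observation
  # u is recycled down the columns, so u[i] is compared with row i of cum.
  lvl <- rowSums(u > cum) + 1
  pmin(lvl, ncol(prob))              # guard against rounding in the last column
}

prob_demo <- rbind(c(0.2, 0.3, 0.5),
                   c(0.1, 0.2, 0.7),
                   c(0.5, 0.1, 0.4))
set.seed(1)
lvls <- sample_levels(prob_demo)
print(lvls)
```

The result is a plain numeric vector with one level per row, the same shape as the output of the sapply() versions above.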
I have a problem with calculating the spectral decomposition; I guess it has to do with the sorting of the eigenvalues.
I would like to do the same calculation in R as on this website: http://www.deltaquants.com/cleaning-correlation-matrices.html
Input <- data.frame(read.csv2(file="testmatrix.csv", header=FALSE, sep=";"))
# same matrix as the example on the website
Eigen <- eigen(Input, only.values=FALSE, symmetric = TRUE)
#Get the eigenvalues/eigenvectors
Eigen$values
Eigen$vectors
The result on the website (Excel): [image not preserved]
The result from eigen (R): [image not preserved]
As a result, the new correlation matrix C is not correct.
Thanks for the help. I can provide further information, e.g. code or more details, if it helps.
If you want to order the eigenvalues of a matrix in increasing order, just index the eigenvectors and eigenvalues with the output of the order function:
mat <- matrix(c(1, 2, 3, 2, 7, 4, 3, 4, 0), nrow=3)
e <- eigen(mat)
o <- order(e$values, decreasing=FALSE)
e$values[o]
# [1] -2.961797 1.056689 9.905108
e$vectors[,o]
# [,1] [,2] [,3]
# [1,] 0.5110650 0.7915817 -0.3349790
# [2,] 0.2299503 -0.5014262 -0.8340831
# [3,] -0.8282122 0.3492421 -0.4382859
The eigenvalues are in a different order. Both results are correct.
Note that it is accepted practice to list the eigenvalues in nonincreasing order, which is what R's eigen() returns (by value for symmetric matrices, by modulus otherwise), but not what Excel shows. So if one answer is "wrong," it is Excel's.
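One way to check that the ordering is harmless: the spectral decomposition V diag(lambda) V' rebuilds the same matrix whichever order the eigenpairs are listed in, as long as values and vectors are permuted together. A small sketch using the example matrix from the answer above (`rebuild` is a hypothetical helper):

```r
mat <- matrix(c(1, 2, 3, 2, 7, 4, 3, 4, 0), nrow = 3)
e <- eigen(mat, symmetric = TRUE)

# Hypothetical helper: rebuild a matrix from its eigenvalues and eigenvectors
rebuild <- function(values, vectors) vectors %*% diag(values) %*% t(vectors)

o <- order(e$values, decreasing = FALSE)         # increasing order
as_returned <- rebuild(e$values, e$vectors)      # R's default (decreasing) order
reordered   <- rebuild(e$values[o], e$vectors[, o])

print(all.equal(as_returned, mat))   # TRUE
print(all.equal(reordered, mat))     # TRUE
```

If a cleaned correlation matrix comes out wrong, a likely culprit is that the eigenvalues were reordered without applying the same permutation to the eigenvector columns.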
I'm trying to compute PCA scores, and part of the algorithm says: subtract the mean of the matrix and divide by the standard deviation.
I have the following 2x2 matrix: A = [1 3; 2 4]. In Matlab, I do the following:
mean(A) gives me back a vector of 2 values (one per column): 1.5 and 3.5. To me, this looks correct in this instance.
In R, however, mean(A) returns just one value. The same is true for the standard deviation.
So my question is, which is right? For the purposes of this function (in the algorithm):
function(x) {(x - mean(x))/sd(x)} (http://strata.uga.edu/software/pdf/pcaTutorial.pdf)
Should I be subtracting the per-column means that Matlab gives (two values) or the single overall mean that R gives?
Thanks
The R command that will do this in one swoop for matrices or data frames is scale():
> A = matrix(c(1, 3, 2, 4), 2)
> scale(A)
[,1] [,2]
[1,] -0.7071068 -0.7071068
[2,] 0.7071068 0.7071068
attr(,"scaled:center")
[1] 2 3
attr(,"scaled:scale")
[1] 1.414214 1.414214
It's done by column. When you used 'mean' you got the mean for all four numbers rather than by column. That is not what you would want if you are doing PCA calculations.
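To tie this back to the question's matrix: R fills matrices column by column, so Matlab's A = [1 3; 2 4] is matrix(c(1, 2, 3, 4), 2) in R. The sketch below computes the column-wise statistics that Matlab's mean(A) reports and checks that scale() standardizes with exactly those values. The names `col_means`, `col_sds`, and `by_hand` are only illustrative.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # same as Matlab's [1 3; 2 4]

col_means <- colMeans(A)               # 1.5 3.5, like Matlab's mean(A)
col_sds   <- apply(A, 2, sd)           # per-column standard deviations

# Standardize each column by hand: subtract its mean, divide by its sd
by_hand <- sweep(sweep(A, 2, col_means, "-"), 2, col_sds, "/")

print(col_means)
print(mean(A))                         # 2.5, the mean of all four numbers
print(all.equal(by_hand, scale(A), check.attributes = FALSE))   # TRUE
```

The function from the tutorial, function(x) {(x - mean(x))/sd(x)}, gives the same result when it is applied to each column separately, e.g. with apply(A, 2, ...).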