Locate each observation to a level using probabilities - r

I have a matrix of probabilities. Each row gives the probabilities that observation i falls in level 1, 2, or 3. For example, row 1 means the first observation falls in level 1 with probability 0.2, level 2 with probability 0.3, and level 3 with probability 0.5. In the end I want a column that uses the probability matrix to assign each observation to level 1, 2, or 3, something like 1, 2, 3, 3, 2, ...
I tried rmultinom, drawing one sample from each row with the corresponding probabilities, but I'm not sure whether that is the correct way or whether there is a better method.
px1=c(0.2, 0.3,0.5)
px2=c(0.1, 0.2,0.7)
px3=c(0.5, 0.1,0.4)
px4=c(0.3, 0.3,0.4)
px5=c(0.4, 0.3,0.3)
px6=c(0.5, 0.1,0.4)
px7=c(0.2, 0.3,0.5)
px8=c(0.5,0.4,0.1)
px9=c(0.2,0.5,0.3)
px10=c(0.6,0.3,0.1)
prob1=matrix(c(px1,px2,px3,px4,px5,px6,px7,px8,px9,px10), ncol=3, nrow=10, byrow=TRUE) # byrow=TRUE so each px vector becomes a row
x1=rmultinom(1,1,prob=prob1[1,])
> x1
[,1]
[1,] 0
[2,] 1
[3,] 0
Does that mean observation 1 is in level 2?

Yes, in your example that output means the first observation was sampled as falling into level 2. Using rmultinom works, but it is usually more convenient to use the sample function:
lvls <- sapply(1:nrow(prob1),function(x) sample(1:3,1,prob=prob1[x,]))
If you wanted to use rmultinom, you could do so as:
lvls <- sapply(1:nrow(prob1),function(x) which(rmultinom(1,1,prob=prob1[x,])==1))
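If the matrix is large, a loop-free alternative is to compare one uniform draw per row against that row's cumulative probabilities. A minimal sketch (cum, u, and lvls are illustrative names, not from the answers above):

```r
set.seed(42)
prob1 <- matrix(c(0.2, 0.3, 0.5,
                  0.1, 0.2, 0.7,
                  0.5, 0.1, 0.4), ncol = 3, byrow = TRUE)
cum  <- t(apply(prob1, 1, cumsum))  # row-wise cumulative probabilities
u    <- runif(nrow(prob1))          # one uniform draw per row
lvls <- rowSums(u > cum) + 1        # index of the first cumsum exceeding u
lvls
```

Each row's draw lands in level k with probability prob1[i, k], exactly as in the sample-based version, but without a per-row function call.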

Related

Generating different percentages of MAR data in R

The following two R functions are from the book "Flexible Imputation of Missing Data" (pages 59 and 63). The first generates missing completely at random (MCAR) data and the second generates missing at random (MAR) data. Both functions give approximately 50% missing values.
In the MCAR function, we can generate different percentages of missing data by changing the value of p. But in the MAR function, I don't understand which parameter we should change to generate different percentages of missing data, such as 10% or 30%.
MCAR
makemissing <- function(data, p=0.5){
  rx <- rbinom(nrow(data), 1, p)
  data[rx==0,"y"] <- NA
  return(data)
}
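Since rx == 0 marks the missing cases, p in makemissing is the probability of being observed; a usage sketch (d is an illustrative data frame, not from the book):

```r
makemissing <- function(data, p=0.5){
  rx <- rbinom(nrow(data), 1, p)
  data[rx==0,"y"] <- NA
  return(data)
}
set.seed(1)
d <- data.frame(y = rnorm(1000))
miss_rate <- mean(is.na(makemissing(d, p = 0.9)$y))
miss_rate   # roughly 0.10, i.e. about 10% missing
```
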
MAR
library(MASS)  # mvrnorm comes from the MASS package
logistic <- function(x) exp(x)/(1+exp(x))
set.seed(32881)
n <- 10000
y <- mvrnorm(n=n,mu=c(5,5),Sigma=matrix(c(1,0.6,0.6,1),nrow=2))
p2.marright <- 1 - logistic(-5 + y[,1])
r2.marright <- rbinom(n, 1, p2.marright)
yobs <- y
yobs[r2.marright==0, 2] <- NA
The probability of an observation being missing is the same 50% for every case in the MCAR function because p = 0.5 and the missingness does not depend on the data. For the MAR version, the probability of an observation being missing differs across observations, since it depends on the values of y[,1]. In your code, the probability that y[,2] is observed is saved in the variable p2.marright (so the missingness probability is 1 - p2.marright). You can see this more easily by lining up all of the values in a dataframe:
df <- data.frame(y1 = y[,1], y2_ori = y[,2], y2_mis = yobs[,2], p2.marright = p2.marright, r2.marright)
head(df)
y1 y2_ori y2_mis p2.marright r2.marright
1 2.086475 3.432803 3.432803 0.9485110 1
2 3.784675 5.005584 5.005584 0.7712399 1
3 4.818409 5.356688 NA 0.5452733 0
4 2.937422 3.898014 3.898014 0.8872124 1
5 6.422158 5.032659 5.032659 0.1943236 1
6 4.115106 5.083162 5.083162 0.7078354 1
You can see that whether or not an observation will be NA on y2 is encoded in r2.marright, which is a probabilistic binary version of p2.marright: for higher values of p2.marright, r2.marright is more likely to be 1. To change the overall rate of missingness, you can change the calculation of p2.marright to bias it higher or lower.
You can manipulate p2.marright by changing the constant in the logistic transformation (-5 in the example). If you increase it (make it less negative, e.g. -4) then p2.marright will decrease, resulting in more missing values on y2. If you decrease it (make it more negative, e.g. -6) then you'll end up with fewer missing values on y2. (The reason -5 is resulting in 50% missingness is because 5 is the mean of the variable being transformed, y1.) This works, but the mechanism is rather opaque, and it might be difficult for you to control it easily. For example, it's not obvious what you should set the constant to be if you want 20% missingness on y2.
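One way around that opacity, sketched here rather than taken from the book (f, const, and target are illustrative names), is to solve numerically for the constant: a case is missing with probability logistic(const + y[,1]), so the expected missingness rate is mean(plogis(const + y[,1])), and uniroot can find the constant that hits a target rate:

```r
library(MASS)                        # for mvrnorm
set.seed(32881)
n <- 10000
y <- mvrnorm(n = n, mu = c(5, 5), Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
target <- 0.20                       # desired missingness on y[,2]
# plogis() is base R's logistic; find const so the mean missingness hits target
f <- function(const) mean(plogis(const + y[,1])) - target
const <- uniroot(f, c(-20, 10))$root
p2 <- 1 - plogis(const + y[,1])      # probability of being observed
r2 <- rbinom(n, 1, p2)
mean(r2 == 0)                        # close to 0.20
```

The same idea works for any target rate, and it stays correct if you change the mean or variance of y[,1].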

How to create a matrix with probability distribution in R

I want to create a matrix in R whose elements are drawn from [-1, 0, 1] with probabilities [1/6, 2/3, 1/6] respectively. The probabilities may change during runtime. For static probabilities I have got the correct output, but the problem is handling a dynamic change in the probabilities.
For example, I might instead need to create a matrix for the values [sqrt(3), 0, -sqrt(3)] under the same scheme.
Note: the probabilities should not be static as mentioned; they may vary during runtime.
Kindly help me solve this.
Supposing you want a 2x3 matrix:
matrix(sample(c(-1,0,1), size=6, replace=TRUE, prob=c(1/6,2/3,1/6)), nrow=2)
So you sample from the values you want, with probabilities defined in prob. This produces a vector, which you can reshape into a matrix of the desired dimensions with matrix afterwards. To avoid making it static, pass the probabilities in as a variable rather than as literal values.
If the numbers should be distributed according to a certain scheme rather than randomly drawn according to a probability, replicate the vector elements and shuffle them:
matrix(sample(rep(c(-1,0,1), times=c(1,4,1))), nrow=2)
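To make the probabilities dynamic with this approach, keep them in a variable that can be reassigned between calls; a short sketch (vals, p, m1, m2 are illustrative names):

```r
vals <- c(-1, 0, 1)
p <- c(1/6, 2/3, 1/6)
m1 <- matrix(sample(vals, 6, replace = TRUE, prob = p), nrow = 2)
p <- c(0.1, 0.8, 0.1)    # probabilities changed at runtime
m2 <- matrix(sample(c(sqrt(3), 0, -sqrt(3)), 6, replace = TRUE, prob = p), nrow = 2)
```
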
You can try this to generate an m-by-n matrix:
sample.dynamic.matrix <- function(pop.symbols, probs, m, n) {
  samples <- sample(pop.symbols, m*n, prob = probs, replace=TRUE)
  return(matrix(samples, nrow=m))
}
set.seed(123)
sample.dynamic.matrix(-1:1, c(1/6,2/3,1/6), 2, 3)
# [,1] [,2] [,3]
#[1,] 0 0 -1
#[2,] 1 -1 0

Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.
The first area of difference is where the length of the output vector is specified in the parameter list: the name size has different meanings in these two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (functions starting with r and carrying a distribution suffix) take the output length as their first parameter, whereas sample takes it as its second. In sample, the draw is from the values in the first argument, and size is the length of the vector to create; here size = 10. In rbinom, n is the length of the vector to create (here length(x) = 10), while size is the number of items to hypothetically draw from a theoretical urn whose distribution is determined by prob; the result returned is the number of "ones". Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
The difference between the two functions is quite simple.
Think of a pack of shuffled cards, and choose a number of cards from it. That is exactly the situation that sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extracts five numbers from the vector 1:40.
In your example, you set size = 1, which means you choose only one element from the pool of possible values. If you set size = 10, you will get ten values, as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
Instead, the goal of the rbinom function is to simulate events with "discrete" outcomes, such as the flip of a coin. Its parameters include the probability of success on each trial; for a fair coin that probability is 0.5. Here we simulate 100 flips. If the coin were weighted to favor one outcome, we could simulate that by setting the probability to 0.8, as in the example below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81
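To see concretely why the asker's two statements differ, note that the sample call draws without replacement from a finite pool of ten 0s and ten 1s, so later draws depend on earlier ones, whereas rbinom makes one independent Bernoulli draw per probability. A sketch (pool, w, s, r are illustrative names):

```r
set.seed(1)
x <- runif(10, 0, .5)
pool <- rep(c(0, 1), length(x))   # 20 values: 0,1,0,1,...
w <- c(rbind(1 - x, x))           # interleaved weights lining up with the pool
s <- sample(pool, size = 10, prob = w, replace = FALSE)  # dependent draws
r <- rbinom(length(x), size = 1, prob = x)               # independent draws
```

With replace = FALSE, removing a value from the pool shifts the remaining weights, so s does not follow the per-element probabilities in x, while each r[i] is exactly Bernoulli(x[i]).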

Calculation of mutual information in R

I am having problems interpreting the results of the mi.plugin() (or mi.empirical()) function from the entropy package. As far as I understand, an MI=0 tells you that the two variables that you are comparing are completely independent; and as MI increases, the association between the two variables is increasingly non-random.
Why, then, do I get a value of 0 when running the following in R (using the {entropy} package):
mi.plugin( rbind( c(1, 2, 3), c(1, 2, 3) ) )
when I'm comparing two vectors that are exactly the same?
I assume my confusion is based on a theoretical misunderstanding on my part, can someone tell me where I've gone wrong?
Thanks in advance.
Use mutinformation(x,y) from package infotheo.
> mutinformation(c(1, 2, 3), c(1, 2, 3) )
[1] 1.098612
> mutinformation(seq(1:5),seq(1:5))
[1] 1.609438
and normalized mutual information will be 1.
The mi.plugin function works on the joint frequency matrix of the two random variables. The joint frequency matrix counts the number of times the pair (X, Y) takes each specific combination of outcomes (x, y).
In your example, you would like X to have 3 possible outcomes - x=1, x=2, x=3, and Y should also have 3 possible outcomes, y=1, y=2, y=3.
Let's go through your example and calculate the joint frequency matrix:
> X=c(1, 2, 3)
> Y=c(1, 2, 3)
> freqs=matrix(sapply(seq(max(X)*max(Y)), function(x) length(which(((X-1)*max(Y)+Y)==x))),ncol=max(X))
> freqs
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
This matrix shows the number of occurrences of X=x and Y=y. For example there was one observation for which X=1 and Y=1. There were 0 observations for which X=2 and Y=1.
You can now use the mi.plugin function:
> mi.plugin(freqs)
[1] 1.098612
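As a cross-check using only base R (p, px, py, mi are illustrative names), table() builds the same joint frequency matrix in one call, and the plugin estimate can be computed directly from the definition of MI:

```r
X <- c(1, 2, 3)
Y <- c(1, 2, 3)
p  <- table(X, Y) / length(X)              # joint probability matrix
px <- rowSums(p)                           # marginal of X
py <- colSums(p)                           # marginal of Y
# plugin MI in nats; zero cells yield NaN terms that are dropped
mi <- sum(p * log(p / outer(px, py)), na.rm = TRUE)
mi                                         # log(3), about 1.098612
```

Since X determines Y here, the MI equals the entropy of X, log(3) nats, which matches the mi.plugin(freqs) result above.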

Random sampling of two vectors, finding mean of sample, then making a matrix in R?

Thanks for your time!
My data frame is simple. Two columns: the first has genotype (1-39) and second has trait values (numerical, continuous). I would like to choose 8 genotypes and calculate the mean and stdev of the associated trait values.
In the end I would like to sample 8 genotypes 10,000 times and for each sample I would like to have the stdev and mean of the associated trait values. Ideally this would be in a matrix where each row represented a sample, 8 columns for each genotype, and 2 final columns for stdev and mean of the trait values associated with those genotypes. This could be oriented the other way too.
How do you sample from two different columns in a data frame so that both values show up in your new sample? i.e genotypes and trait values with mean and stdev calculated
How do you get this sample into a matrix as I've described above?
How do you repeat the process 10,000 times?
Thanks again!
This would return a single sample: all rows whose genotype falls in a random sample of 8 of the 39 genotypes:
dat[ dat$genotype %in% sample(1:39, 8), ]
The replicate function is designed to repeat a random process. For example, repeating 3 times the sd of "trait" from such a sample of 2 genotypes:
dat <- data.frame(genotype=sample(1:5, 25,replace=TRUE), trait=rnorm(25) )
replicate ( 3, sd(dat[ dat$genotype %in% sample(1:5, 2), "trait" ]) )
[1] 0.7231686 0.9225318 0.9225318
This records the sampled ids along with the sd and mean values. Note that samps must be assigned inside the braces so that each replication uses its own sample; if it were only named inside c(), R would look up a stale samps in the global environment, and every column would report the same sd and mean:
replicate ( 3, { samps <- sample(1:5, 2)
                 c( samps = samps,
                    sds   = sd(dat[ dat$genotype %in% samps, "trait" ]),
                    means = mean(dat[ dat$genotype %in% samps, "trait" ]) ) } )
Each column of the result holds one replication: the two sampled genotype ids, then the sd and mean of the associated trait values.
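Putting the pieces together in the layout the asker described, on simulated data (dat, g, v, res are illustrative names): 10,000 replications, each row holding the 8 sampled genotype ids followed by the mean and sd of the associated trait values:

```r
set.seed(1)
dat <- data.frame(genotype = sample(1:39, 500, replace = TRUE),
                  trait    = rnorm(500))
n_rep <- 10000
res <- t(replicate(n_rep, {
  g <- sample(1:39, 8)                  # 8 genotypes for this replication
  v <- dat$trait[dat$genotype %in% g]   # their associated trait values
  c(g, mean(v), sd(v))
}))
colnames(res) <- c(paste0("geno", 1:8), "mean", "sd")
dim(res)                                # 10000 rows, 10 columns
```
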
