Generating random vectors in R

I am concerned with the following programming exercise in R:
Generate 10.000 4 dimensional vectors.
The components of the vector are generated from Bernoulli distribution with probability 0.5.
Detect all vectors with at least 3 '1'.
In order to generate one such sample I employ
sample(0:1, 4, replace = TRUE)
In order to generate vectors I use
x <- c(sample(0:1, 4, replace = TRUE))
Since I need 10.000 vectors, I use a for loop:
for(i in 1:10000){c(sample(0:1, 4, replace = TRUE))}
So, now I have 10.000 vectors.
In order to continue with the task, I should put all of them into a list.
Then, using a suitable if condition, I think it should be possible to conclude the task.
Can anyone help me?

Here is a solution for your problem:
set.seed(135)
n <- 10000
X <- matrix(rbinom(4*n, size=1, prob=0.5), nrow=n)
apply(X, 1, function(x) sum(x)>2)
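As a possible follow-up, here is a minimal sketch of how the logical result could be used to count and extract the matching vectors. It swaps apply for rowSums, which is my substitution and not part of the answer above, and the names keep and hits are made up for illustration.

```r
set.seed(135)
n <- 10000

# One row per vector, one column per Bernoulli(0.5) component
X <- matrix(rbinom(4 * n, size = 1, prob = 0.5), nrow = n)

# TRUE for every row that contains at least three 1s
keep <- rowSums(X) >= 3

# The matching vectors themselves; drop = FALSE keeps a matrix even for one row
hits <- X[keep, , drop = FALSE]

sum(keep)    # how many vectors qualify
mean(keep)   # observed share; the theoretical value is 5/16
```

The theoretical share is 5/16 because 4 of the 16 equally likely outcomes have exactly three 1s and 1 has four.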

#MarcoSandri's solution will be faster, but you could modify your solution this way to make it work
num = 0
for(i in 1:10000){
x = c(sample(0:1, 4, replace = TRUE))
if(sum(x) >= 3){
num = num + 1
}
}
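If you really want the vectors in a list, as the question suggests, something along these lines might work. This is only a sketch: vecs and hits are hypothetical names, and Filter is just one of several ways to pick out the matching elements.

```r
set.seed(1)
n <- 10000

# Pre-allocate the list, then fill it inside the loop so nothing is discarded
vecs <- vector("list", n)
for (i in seq_len(n)) {
  vecs[[i]] <- sample(0:1, 4, replace = TRUE)
}

# Keep only the vectors with at least three 1s
hits <- Filter(function(v) sum(v) >= 3, vecs)

length(hits)  # number of matching vectors
```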

Related

how to use apply (or sapply) with columns of matrix or dataframe as function args

I know this is a bonehead newbie question, but I've been trying to figure it out for quite a while and need some input. Basically, I'm trying to learn how to use the apply family to replace for loops, specifically how to set up the call so that columns of a matrix serve as arguments to the function. I'll use a simple call to the rbinom function as an example.
Example: this for loop works fine. The data are a set of integers and a set of probabilities
success <- rep(-1, times=10) # initialize result var
num <- sample.int(20, 10) # get 10 random integers
p <- runif(10) # get 10 random probabilities
for (i in 1:10) {
success[i]= rbinom(n=1, size=num[i],prob=p[i]) # number successes in 1 trial
}
But how to do the same thing with the apply family? I first put the data into 2 columns of a matrix, thinking that was the right start. However, the following does NOT work, obviously due to my poor understanding of how to set up a call to apply.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
success <- apply(myData, rbinom, n=1, size=myData[,1], prob=myData[,2])
Any tips are greatly appreciated! I'm coming to R from Fortran, and trying to port over a lot of code that is loaded with DO loops, so I really need to get my head around this.
lapply, sapply, apply only deal with one vector/list at a time. That is, apply will only call its function for one column at a time. What you need is mapply or Map.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
mapply(rbinom, n = 1, myData[,1], myData[,2])
# [1] 5 4 11 8 3 3 17 8 0 11
Just like lapply returns a list, so does Map; similarly, just like sapply, mapply will return a vector or array if all return values are compatible, otherwise it returns a list as well.
These calls are equivalent:
sapply(1:3, function(z) z + 1)
mapply(function(z) z + 1, 1:3)
but mapply and Map allow arbitrary number of lists/vectors, so for instance
func <- function(X,Y,Z) X^2+2*Y-Z
Map(func, 1:9, 11:19, 21:29)
## effectively the same as
list(
func(1, 11, 21),
func(2, 12, 22),
func(3, 13, 23),
...,
func(9, 19, 29)
)
The equivalent call of that with sapply for your data would be
sapply(seq_len(nrow(myData)), function(ind) {
rbinom(n = 1, size = myData[ind,1], prob = myData[ind,2])
})
though I personally feel that mapply is easier to read.
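For this particular example it may also be worth noting that rbinom is itself vectorized over size and prob, so no apply-family call is strictly needed. A minimal sketch, reusing the num and p names from the question (the seed is arbitrary):

```r
set.seed(42)
num <- sample.int(20, 10)  # 10 random trial counts
p   <- runif(10)           # 10 random success probabilities

# One draw per (size, prob) pair; both arguments are matched element by element
success <- rbinom(n = length(num), size = num, prob = p)
success
```

mapply remains the more general tool when the function being called is not vectorized.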

Sample draw in sapply without replacement

How does one draw a sample within a sapply function without replacement? Consider the following MWE below. What I am trying to achieve is for a number in idDRAW to receive a letter from chrSMPL (given the sample size of chrSMPL). Whether a number from idDRAW receives a letter is determined by the respective probabilities, risk factors and categories. This is calculated in the sapply function and stored in tmp.
The issue is sample replacement, leading to a number being named with a letter more than once. How can one avoid replacement whilst still using the sapply function? I have tried to adjust the code from this question (Alternative for sample) to suit my needs, but no luck. Thanks in advance.
set.seed(3)
chr<- LETTERS[1:8]
chrSMPL<- sample(chr, size = 30, replace = TRUE)
idDRAW<- sort(sample(1:100, size = 70, replace = FALSE))
p_mat<- matrix(runif(16, min = 0, max = 0.15), ncol = 2); rownames(p_mat) <- chr ## probability matrix
r_mat <- matrix(rep(c(0.8, 1.2), each = length(chr)), ncol = 2); rownames(r_mat) <- chr ## risk factor matrix
r_cat<- sample(1:2, 70, replace = TRUE) ## risk categories
# find number from `idDRAW` to be named a letter:
Out<- sapply(chrSMPL, function(x){
tmp<- p_mat[x, 1] * r_mat[x, r_cat]
sample(idDRAW, 1, prob = tmp)
})
> sort(Out)[1:3]
G B B
5 5 5
I managed with an alternative solution using a for loop as seen below. If anyone can offer suggestions on how the desired result can be achieved without using a for loop it would be greatly appreciated.
set.seed(3)
Out <- c()
for(i in 1:length(chrSMPL)){
tmp <- p_mat[chrSMPL[i], 1] * r_mat[chrSMPL[i], r_cat]
Out <- c(Out, sample(idDRAW, 1, prob = tmp))
rm <- which(idDRAW == Out[i])
idDRAW <- idDRAW[-rm]
r_cat <- r_cat[-rm]
}
names(Out) <- chrSMPL
sort(Out)[1:3]
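Because each draw changes what is available for the next one, some form of sequential iteration seems hard to avoid here. A variant of the loop above that might be a little tidier keeps idDRAW and r_cat intact and tracks used IDs with a logical mask instead. This is a sketch only; available, w and pick are hypothetical names.

```r
set.seed(3)
chr     <- LETTERS[1:8]
chrSMPL <- sample(chr, size = 30, replace = TRUE)
idDRAW  <- sort(sample(1:100, size = 70, replace = FALSE))
p_mat   <- matrix(runif(16, min = 0, max = 0.15), ncol = 2,
                  dimnames = list(chr, NULL))       # probability matrix
r_mat   <- matrix(rep(c(0.8, 1.2), each = length(chr)), ncol = 2,
                  dimnames = list(chr, NULL))       # risk factor matrix
r_cat   <- sample(1:2, 70, replace = TRUE)          # risk categories

available <- rep(TRUE, length(idDRAW))  # which IDs can still be drawn
Out <- integer(length(chrSMPL))

for (i in seq_along(chrSMPL)) {
  # Weight of every ID for the current letter
  w <- p_mat[chrSMPL[i], 1] * r_mat[chrSMPL[i], r_cat]
  # IDs that were already drawn get weight zero, so they cannot be picked again
  w[!available] <- 0
  pick <- sample.int(length(idDRAW), 1, prob = w)
  Out[i] <- idDRAW[pick]
  available[pick] <- FALSE
}
names(Out) <- chrSMPL

anyDuplicated(Out)  # 0 means no ID was used twice
```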

Sampling from a subset of data

I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., []). That is, I want a random subset of labels to be selected for each subarray, instead of all labels assigned to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like that
num.specs <- 3
haps[sample(seq(haps),size = num.specs,replace = F)]
[1] 3 5 4
That is a random subset of your vector haps ?
Not quite what you want (returns list of matrices instead of 3D array) but this might help
lapply(split(1:5, cut(1:5, breaks=c(0, 2, 5))), function(i) matrix(sample(i, 25, replace=TRUE), ncol=5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.
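If the goal is for each subarray to get its own random, non-overlapping subset of labels, one way to read the question could be sketched as below. I am assuming that interpretation; groups and labs are hypothetical names, and the number of columns is fixed at 5 for simplicity.

```r
set.seed(1)
haps      <- 1:5   # label IDs
K         <- 2     # number of subarrays
perms     <- 5
num.specs <- 5

# Shuffle the labels, then deal them out to the K subarrays in turn
groups <- split(sample(haps), rep(seq_len(K), length.out = length(haps)))

pop <- array(dim = c(perms, num.specs, K))
for (k in seq_len(K)) {
  labs <- groups[[k]]
  # Index via sample.int so a single-label group is never expanded to 1:label
  draws <- labs[sample.int(length(labs), perms * num.specs, replace = TRUE)]
  pop[, , k] <- matrix(draws, nrow = perms)
}

groups
pop
```

Indexing with sample.int avoids the known behaviour of sample(x, ...) treating a single number x as 1:x.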

Simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3

I am trying to simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3. I want to then compute the mean of each sample and make a histogram of the sample means
My current code is not giving me an error but I don't think it's right:
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
Any idea on how to tackle this? I am very very new to R!
You don't need a list in this case. It is a common mistake of new R users to use lists excessively.
observations <- matrix(rnorm(25000, mean=5, sd=3), 5000, 5)
means <- rowMeans(observations)
Now means is a vector of 5000 elements.
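To round this off with the histogram the question asks for, and a quick sanity check, here is a sketch. The seed and the number of histogram breaks are arbitrary choices of mine. For samples of size 5 from N(5, 3^2), the sample means should be centred near 5 with a standard deviation near 3 / sqrt(5).

```r
set.seed(1)

# 5000 samples of size 5, one sample per row
observations <- matrix(rnorm(5000 * 5, mean = 5, sd = 3), nrow = 5000, ncol = 5)
means <- rowMeans(observations)

hist(means, breaks = 40, main = "Sample means (n = 5)", xlab = "mean")

# Compare the empirical values with the theoretical ones
c(mean = mean(means), sd = sd(means), theoretical_sd = 3 / sqrt(5))
```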
You can actually do this without for loops. replicate can be used to create the 5000 samples. Then use sapply to return the mean of each sample. Wrap the sapply call in hist() to get the histogram of means.
dat = replicate(5000, rnorm(5,5,3), simplify=FALSE)
hist(sapply(dat, mean))
Or, if you want to save the means:
sample.means = sapply(dat,mean)
hist(sample.means)
I think your code is giving valid results. list(mode="vector",length=nrSamples) isn't doing what I think you intended (run it in the console and see what happens), but it works out because the first two list elements get overwritten in the loop.
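To make that point concrete, here is a small sketch comparing the two calls. The names e_wrong and e_right are only for illustration.

```r
nrSamples <- 5000

# list() treats mode= and length= as ordinary named elements
e_wrong <- list(mode = "vector", length = nrSamples)
str(e_wrong)        # a list of 2: a character string and a number

# vector("list", n) is what pre-allocates an empty list of length n
e_right <- vector("list", nrSamples)
length(e_right)
```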
Although there's no need to use loops here, just for illustration here are two modified versions of your code using loops:
# 1. Store random samples in a list
e <- vector("list", nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
sample_means[i] <- mean(e[[i]])
}
# 2. Store random samples in a matrix
e <- matrix(rep(NA, 5000*5), nrow=5)
for (i in 1:nrSamples) {
e[,i] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
sample_means[i] <- mean(e[, i])
}
Your code is fine (see below), but I would suggest you try the following:
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
Here, for each element of the sequence 1, 2, 3, ... nrSamples that I supply as the first argument, lapply executes a function with the given element of the sequence as argument (i.e. x). The function that I have supplied does not depend on x, however, so it is just replicated 5000 times, and the output is stored in a list (this is what lapply does). It is an easy way to avoid loops in situations like these. Needless to say, you could also just run
yourmeans <- sapply(1:nrSamples, function(x) mean(rnorm(n=5, mean = 5, sd = 3)))
Apart from the means, the latter does not store your results though, which may not be what you want. Also note that I call sapply to return a vector, which you can then use to plot your histogram, using e.g. hist(yourmeans).
To show that your code is fine, consider the following:
set.seed(42)
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
set.seed(42)
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
all.equal(as.vector(sample_means), yourmeans)
[1] TRUE
Here, I set the seed to the random number generator to make sure that the random numbers are the same. As you see, your code works fine, though as others have pointed out, loops can easily be avoided.

How to skip an error in a loop

I want to skip an error (if there is any) in a loop and continue the next iteration. I want to compute 100 inverse matrices of a 2 by 2 matrix with elements randomly sampled from {0, 1, 2}. It is possible to have a singular matrix, for example:
1 0
2 0
Here is my code
set.seed(1)
count <- 1
inverses <- vector(mode = "list", 100)
repeat {
x <- matrix(sample(0:2, 4, replace = T), 2, 2)
inverses[[count]] <- solve(x)
count <- count + 1
if (count > 100) break
}
At the third iteration, the matrix is singular and the code stops running with an error message. In practice, I would like to bypass this error and continue to the next loop. I know I need to use a try or tryCatch function but I don't know how to use them. Similar questions have been asked here, but they are all really complicated and the answers are far beyond my understanding. If someone can give me a complete code specifically for this question, I really appreciate it.
This would put NULLs into inverses for the singular matrices:
inverses[[count]] <- tryCatch(solve(x), error=function(e) NULL)
If the first expression in a call to tryCatch raises an error, it executes and returns the value of the function supplied to its error argument. The function supplied to the error arg has to take the error itself as an argument (here I call it e), but you don't have to do anything with it.
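You could then drop the NULL entries with inverses[!sapply(inverses, is.null)]. (Calling is.null(inverses) on its own tests the list as a whole and returns a single FALSE, so it cannot be used as an element-wise filter.)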
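Since the question asks for complete code, here is one way the tryCatch idea might be put together end to end. It is a sketch, not the only arrangement: safe_inverse, ok and inverses_ok are hypothetical names, and lapply is used so that the NULL results stay in the list at their positions.

```r
set.seed(1)

# Returns the inverse, or NULL when solve() signals an error (singular matrix)
safe_inverse <- function(x) tryCatch(solve(x), error = function(e) NULL)

# 100 random 2-by-2 matrices with entries from {0, 1, 2}
matrices <- replicate(100, matrix(sample(0:2, 4, replace = TRUE), 2, 2),
                      simplify = FALSE)

inverses <- lapply(matrices, safe_inverse)

# Element-wise NULL test, then keep only the successful inversions
ok <- !vapply(inverses, is.null, logical(1))
inverses_ok <- inverses[ok]

sum(ok)  # number of matrices that could be inverted
```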
Alternatively, you could use the lower level try. The choice is really a matter of taste.
count <- 0
repeat {
if (count == 100) break
count <- count + 1
x <- matrix(sample(0:2, 4, replace = T), 2, 2)
x.inv <- try(solve(x), silent=TRUE)
if ('try-error' %in% class(x.inv)) next
else inverses[[count]] <- x.inv
}
If your expression generates an error, try returns an object with class try-error. It will print the message to screen if silent=FALSE. In this case, if x.inv has class try-error, we call next to stop the execution of the current iteration and move to the next one, otherwise we add x.inv to inverses.
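Note that the loop above advances count even when a matrix is singular, so some slots stay empty. If you want exactly 100 inverses, a variation like the following might do; it only counts successful inversions and uses inherits() to test for the try-error class. This is a sketch under that assumption about what is wanted.

```r
set.seed(1)
inverses <- vector("list", 100)
count <- 0

while (count < 100) {
  x <- matrix(sample(0:2, 4, replace = TRUE), 2, 2)
  x.inv <- try(solve(x), silent = TRUE)
  # Singular matrix: draw a new one without using up a slot
  if (inherits(x.inv, "try-error")) next
  count <- count + 1
  inverses[[count]] <- x.inv
}

length(inverses)  # 100, with no empty entries
```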
Edit:
You could avoid using the repeat loop with replicate and lapply.
matrices <- replicate(100, matrix(sample(0:2, 4, replace=T), 2, 2), simplify=FALSE)
inverses <- lapply(matrices, function(mat) if (det(mat) != 0) solve(mat))
It's interesting to note that the second argument to replicate is treated as an expression, meaning it gets executed afresh for each replicate. This means you can use replicate to make a list of any number of random objects that are generated from the same expression.
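A tiny sketch of that difference, contrasting replicate with rep (a and b are just illustrative names):

```r
set.seed(1)

# replicate() re-evaluates the expression each time: three separate draws
a <- replicate(3, runif(1))

# rep() evaluates its argument once and repeats the value: one draw, copied
b <- rep(runif(1), 3)

a
b
```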
Instead of using tryCatch you could simply calculate the determinant of the matrix with the function det. A matrix is singular if and only if the determinant is zero.
Hence, you could test whether the determinant is different from zero and calculate the inverse only if the test is positive:
set.seed(1)
count <- 1
inverses <- vector(mode = "list", 100)
repeat {
x <- matrix(sample(0:2, 4, replace = T), 2, 2)
# if (det(x)) inverses[[count]] <- solve(x)
# a more robust replacement for the above line (see comment):
if (is.finite(determinant(x)$modulus)) inverses[[count]] <- solve(x)
count <- count + 1
if (count > 100) break
}
Update:
It is, however, possible to avoid generating singular matrices. The determinant of a 2-by-2 matrix mat is defined as mat[1] * mat[4] - mat[3] * mat[2]. You could use this knowledge for sampling random numbers. Just do not sample numbers which will produce a singular matrix. This, of course, depends on the numbers sampled before.
set.seed(1)
count <- 1
inverses <- vector(mode = "list", 100)
set <- 0:2 # the set of numbers to sample from
repeat {
# sample the first value
x <- sample(set, 1)
# if the first value is zero, the second and third one are not allowed to be zero.
# (plain if/else is needed here: ifelse() would return only a single element)
new_set <- if (x == 0) setdiff(set, 0) else set
# sample the second and third value
x <- c(x, sample(new_set, 2, replace = T))
# calculate which 4th number would result in a singular matrix
not_allowed <- abs(-x[3] * x[2] / x[1])
# remove this number from the set
new_set <- setdiff(0:2, not_allowed)
# sample the fourth value and build the matrix
x <- matrix(c(x, sample(new_set, 1)), 2, 2)
inverses[[count]] <- solve(x)
count <- count + 1
if (count > 100) break
}
This procedure guarantees that all generated matrices have an inverse.
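If you want to convince yourself that the 2-by-2 determinant formula used above agrees with det(), a brute-force check over all 81 possible matrices could look like this. det2, grid and agree are hypothetical names introduced only for this sketch.

```r
# Determinant of a 2-by-2 matrix via column-major indexing, as in the text above
det2 <- function(mat) mat[1] * mat[4] - mat[3] * mat[2]

# Every combination of four entries from {0, 1, 2}
grid <- expand.grid(a = 0:2, b = 0:2, c = 0:2, d = 0:2)

agree <- apply(grid, 1, function(v) {
  m <- matrix(v, 2, 2)
  isTRUE(all.equal(det2(m), det(m)))
})
all(agree)

# How many of the 81 matrices are singular
n_singular <- sum(apply(grid, 1, function(v) det2(matrix(v, 2, 2)) == 0))
n_singular
```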
try is just a way of telling R: "If you commit an error inside the following parentheses, then skip it and move on."
So if you're worried that x <- matrix(sample(0:2, 4, replace = T), 2, 2) might give you an error, then all you have to do is:
try(x <- matrix(sample(0:2, 4, replace = T), 2, 2))
However, keep in mind then that x will be undefined if you do this and it ends up not being able to compute the answer. That could cause a problem when you get to solve(x) - so you can either define x before try or just "try" the whole thing:
try(
{
x <- matrix(sample(0:2, 4, replace = T), 2, 2)
inverses[[count]] <- solve(x)
}
)
The documentation for try explains your problem pretty well. I suggest you go through it completely.
Edit: The documentation example looked pretty straightforward and very similar to the op's question. Thanks for the suggestion though. Here goes the answer following the example in the documentation page:
# `idx` is used as a dummy variable here just to illustrate that
# all 100 entries are indeed calculated. You can remove it.
set.seed(1)
mat_inv <- function(idx) {
print(idx)
x <- matrix(sample(0:2, 4, replace = T), nrow = 2)
solve(x)
}
inverses <- lapply(1:100, function(idx) try(mat_inv(idx), TRUE))
