R - Dealing with zeros in randomized subsamples

I've run into a little problem simulating the throw of dice. Basically I'm doing this to get familiar with loops and their output.
My intention is to simulate the throw of two dice as follows:
R = 100
d6 = c(1:6)
d = 60
DICE = NULL
for (i in 1:R)
{
i <- as.factor((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)))
j <- summary(i)
DICE = rbind(DICE, j)
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
relHIS = (HIS / sum(HIS))*100
relHIS
Problems occur if the count in one category is 0 (that result did not occur in the sample). If this happens in the first subsample, one or more of the categories (numbers 2-12) are missing from the summary, which breaks the rbind in the following subsamples with "number of columns of result is not a multiple of vector length (arg 2)".
I'm sure there is a really simple solution for this, by defining everything beforehand...
Thanks for your help!

Here are some fixes:
R = 100
d6 = c(1:6)
d = 60
DICE = matrix(nrow = R, ncol = 11) #pre-allocate
colnames(DICE) <- 2:12
for (i in 1:R)
{
sim <- ordered((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)),
levels = 2:12) #define the factor levels
sumsim <- table(sim)
DICE[i,] <- sumsim #sub-assign
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
prop.table(HIS) * 100
Always pre-allocate your result data structure: growing it inside a loop is terribly slow, and here you know in advance how big it needs to be. Also, don't use the same symbol (i) for the iteration variable and something else.
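To see the cost of growing, here is a small mini-benchmark of my own (exact timings are machine-dependent):
grow <- function(R) {
  out <- NULL
  for (i in 1:R) out <- rbind(out, rnorm(11))   # copies the whole matrix every iteration
  out
}
prealloc <- function(R) {
  out <- matrix(NA_real_, nrow = R, ncol = 11)  # one allocation up front
  for (i in 1:R) out[i, ] <- rnorm(11)          # cheap sub-assignments
  out
}
system.time(grow(5000))     # quadratic in R
system.time(prealloc(5000)) # linear in R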

Omit as.factor() in the seventh line of your code.
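Either way, the crucial detail (as in the fixed code above) is pinning the factor levels so empty categories still get a zero count; a minimal illustration:
x <- c(2, 3, 3, 5)              # a draw where most categories never occur
table(x)                        # only observed values appear
table(factor(x, levels = 2:12)) # the missing categories show up as zeros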

Related

R code for replacing the values of Matrix

Hey everyone, I have a large matrix X with dimensions 654 x 7095. I want to subset this matrix and replace the values of the subset with another matrix which I have created. The R code is as follows:
install.packages("Matrix")
install.packages("base")
library(Matrix)
library(base)
T = 215
n = 3
k = 33
X = matrix(0,T*n,T*k)
IN = diag(n)
K1 = Matrix(0, n*n, n*(n-1)/2, sparse = TRUE)
for(i in 1:(n-1)){
K1[(2+(i-1)*(n+1)):(i*n), (1+(i-1)*(n-i/2)):(i*(n-i)*(i+1)/2)] <- diag(n-i)
}
yin = matrix(rnorm(645), ncol = 3)
Xu = matrix(rnorm(2150), ncol = 10)
# So far I have defined the variables and matrices which will be used in the subsetting.
The code above runs fine; however, the code below throws an error:
#Loop for X subsetting
for(i in 1:T){
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- cbind( (t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN))))
}
# kronecker() computes the Kronecker tensor product of two matrices A and B; it comes with the "base" package.
When I run the code above, I get this error:
Error in X[(((i - 1) * n) + 1):(i * n), ] <- cbind((t(kronecker(yin[i, :
number of items to replace is not a multiple of replacement length
However, when I run the same computation in MATLAB it works perfectly fine. MATLAB code:
X = zeros(T*n,T*k);
for i = 1:T
X((i-1)*n+1:i*n,(i-1)*k+1:i*k) = [kron(yin(i,:),IN)*K1, kron(Xu(i,:),IN)];
end
MATLAB fills in the values for the block of rows and columns defined in the subsetting loop; I have attached a snapshot of that desired output. In R, however, the same assignment throws the error above.
Can someone enlighten me as to where I am going wrong with the R code?
I appreciate the help, many thanks.
I think the problem is how the class 'dgeMatrix' is handled. Try
for (i in 1:T) {
X[(((i-1)*n)+1):(i*n), (((i-1)*k)+1):(i*k)] <- as.matrix(cbind((t(kronecker(yin[i,],IN))%*%K1) , (t(kronecker(Xu[i,],IN)))))
}
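For background, a minimal sketch of my own showing what goes wrong: K1 was built with Matrix(), so matrix products involving it are S4 'dgeMatrix' objects rather than base matrices, and as.matrix() converts them back before the sub-assignment.
library(Matrix)
m <- Matrix(0, 2, 2, sparse = TRUE) # an S4 sparse matrix, like K1
p <- matrix(1, 2, 2) %*% m          # the product is a 'dgeMatrix'
class(p)                            # "dgeMatrix" -- not a base "matrix"
class(as.matrix(p))                 # a plain matrix, safe to sub-assign into X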

Store values from a loop

I am simulating dice throws and would like to save the output in a single object, but cannot find a way to do so. I looked at several related questions, but they do not seem to answer mine.
Here is my attempt to assign the result of a 20 x 3 trial to an object:
set.seed(1)
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
print(trials)
i = i+1
}
print(Twenty)
What I do not understand is why I cannot recall the result after the loop is run.
I also tried using return instead of print:
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
return(trials)
i = i+1
}
print(Twenty)
or creating an empty matrix first:
mat = matrix(0, nrow = 20, ncol = 3)
mat
for(i in 1:20){
mat[i] = sample.int(6, 3, replace = TRUE)
print(mat)
i = i+1
}
but these seem worse (I do not even get to see the trials).
Thanks for any hints.
There are several things wrong with your attempts:
1) A for loop is neither a function nor an object in R, so it doesn't make sense to assign a loop to a variable.
2) The loop for(i in 1:20) increments i itself, so it doesn't make sense to add i = i + 1 (the assignment is overwritten at the next iteration anyway).
Your last attempt implemented correctly would look like this:
mat <- matrix(0, nrow = 20, ncol = 3)
for(i in 1:20){
mat[i, ] = sample.int(6, 3, replace = TRUE)
}
print(mat)
I personally would simply do
matrix(sample.int(6, 20 * 3, replace = TRUE), nrow = 20)
(since all draws are independent and with replacement, it doesn't matter if you make 3 draws 20 times or simply 60 draws)
Usually, in most programming languages, one does not assign objects to for loops, as loops are not function objects that return values; one uses loops to operate iteratively on existing objects. R, however, provides the apply family of functions, which save each iteration's output into an object of the same length as the input.
Consider lapply (list apply) for list output or sapply (simplified apply) for matrix output:
# LIST OUTPUT
Twenty <- lapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
# MATRIX OUTPUT
Twenty <- sapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
And to see your trials, simply print out the object
print(Twenty)
But since you never use the iterator variable, x, consider replicate(), a wrapper around sapply() that takes a count and an expression (no sequence input or function) and, depending on the simplify argument, outputs a matrix or a list:
# MATRIX OUTPUT (DEFAULT)
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE))
# LIST OUTPUT
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE), simplify = FALSE)
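One caveat of my own to add: sapply() and replicate() here return one iteration per column (a 3 x 20 matrix), so transpose if you want one row per trial:
Twenty <- t(replicate(20, sample.int(6, 3, replace = TRUE))) # 20 rows, one per trial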
You can use a list:
Twenty=list()
for(i in 1:20){
Twenty[[i]] = sample.int(6, 3, replace = TRUE)
}
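To combine the list into a 20 x 3 matrix afterwards, one option (my own note) is:
do.call(rbind, Twenty) # one row per trial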

Avoiding a loop when populating data frames in R

I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
col_tmp <- colnames(T_modelled)[i]
for (j in 1:nrow(T_modelled)) {
z_tmp <- z[j]-0.1
T_tmp <- MANSRT+As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
T_modelled[j ,col_tmp] <- T_tmp
}
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results, but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating; I just cannot see how to incorporate the formula more simply when filling T_modelled.
Does anyone have ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code, then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
Tm[ , i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop; that's the only difference.
(I also used exp directly, but that is of secondary importance.)
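Going a step further (a sketch of my own, reusing the question's constants): the formula is elementwise in both z and t_sec_ERT, so the whole matrix can come from a single outer() call:
a <- (omega/(2*K))^0.5 # the repeated damping/phase coefficient
Tm2 <- outer(z - 0.1, t_sec_ERT,
             function(zz, tt) MANSRT + As*exp(-zz*a)*sin(omega*tt - a*zz))
all.equal(unname(as.matrix(T_modelled)), Tm2) # should be TRUE, up to dimnames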
Much like the accepted solution to your previous question, consider simply using sapply(), iterating through the vector t_sec_ERT, which has the same length as your desired data frame's number of columns. But first adjust every element of z by 0.1. Also, there's no need to create an empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
MANSRT+As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert-((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct; I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions on numerics in R are usually vectorized, by which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, there is often no need for explicit looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: the looping takes place "under the hood", as it were. (The output of Vectorize() thus isn't strictly vectorized by this definition.)
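A small illustration of that last point (my own example): Vectorize() merely wraps mapply(), so the looping still happens in R:
f <- Vectorize(function(x, y) x + y) # loops via mapply(), i.e. in R, not in C
f(1:3, 4:6)                          # same result as the truly vectorized 1:3 + 4:6
#[1] 5 7 9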
When you have two numeric vectors you want to loop over, first check whether the constituent functions are vectorized, usually by reading the docs.
If they are, construct the central vectorized compound expression and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT):
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow=length(z), ncol=length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
T_tmp[, i] <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))

Optimising a calculation on every cumulative subset of a vector in R

I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
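To make that definition concrete, here is a minimal helper of my own (not from the original post), mirroring the formula used in the loop below:
n50 <- function(lengths) { # lengths must be sorted in decreasing order
  cs <- cumsum(lengths)
  lengths[min(which(cs > sum(lengths)/2))]
}
n50(c(80, 70, 50, 10)) # cumsum is 80,150,200,210; half of 210 is 105, so N50 = 70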
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
N50 = df$l[min(which(df$cs>df$cs[i]/2))]
if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5 million reads ranging from ~200,000 down to 1 in length (longer reads are rarer), and I'm interested in an N50 of 100,000; at that scale it gets pretty slow.
The example below is closer to something realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max)))) # show the midpoint being tested
  if (mid == min) { # the interval can no longer be halved
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n <- df$l[min(which(df$cs > df$cs[mid]/2))] # N50 of the first mid reads
  if (n >= t) {
    return(binSearch(mid, max)) # threshold still met: search the upper half
  } else {
    return(binSearch(min, mid)) # threshold broken: search the lower half
  }
}
Then, just call
binSearch(1, nrow(df))
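Since each recursive step halves the interval, this performs only about log2(nrow(df)) N50 evaluations (roughly 17 for 100,000 rows) instead of the ~21,000 iterations of the linear scan.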
Since your data are ordered by read length, you can avoid testing every single row. Instead, iterate and test a limited number of reasonably spaced rows at each step (using while(), for example), getting progressively closer to the solution. This should make things much faster; just make sure to stop iterating once you get close to the solution.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Find the closest one and "zoom in" by redefining the range. Stop when you cannot increase granularity any further.
while(sum(duplicated(i.range))==0){
for(i in 1:length(i.range)){
N50 = df$l[min(which(df$cs>df$cs[i.range[i]]/2))]
if(N50 < t){ break }
}
#update i1 and i2
i1 <- i.range[(i-1)]
i2 <- i.range[i]
i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by=1)
for(i in i.range){
N50 = df$l[min(which(df$cs>df$cs[i]/2))]
if(N50 < t){ break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.

Optimise code for a simple Monte Carlo-like simulation

I run the following code; it works but takes ages, and I'm sure there is a way to get the same results much faster.
runs <- 1000
prediction <- runif(77,0,1)
n <- length(prediction)
df.all <- data.frame(Preds = rep(prediction, runs),
simno=rep(1:runs,each=n))
for (x in 1:runs) {
for (i in 1:length(df.all$Preds)){
df.all$rand[i] <- sample(1:100,1)
df.all$Win[i] <- ifelse(df.all$rand[i]<df.all$Preds[i]*100,1,0)
}
}
library(dplyr)
df.all %>% group_by(simno) %>% summarise(Wins = sum(Win)) -> output
This can easily be vectorised by:
Performing a single sample() operation (note the additional replace = TRUE argument).
Performing a single vectorised comparison (<).
You can remove the inner for loop to get
for (x in 1:runs) {
  rows <- df.all$simno == x # just this simulation's rows
  df.all$rand[rows] = sample(1:100, size = length(prediction), replace = TRUE)
  df.all$Win[rows] = df.all$rand[rows] < df.all$Preds[rows]*100
}
You can then take it one step further and remove the outer loop as well:
df.all$rand = sample(1:100, size = nrow(df.all), replace=TRUE)
df.all$Win = df.all$rand < df.all$Preds*100
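As a further sketch of my own: each Win is a Bernoulli draw with success probability Preds, so you can also compare a uniform draw directly; unlike the 1:100 grid, this realises the probability exactly rather than restricting it to multiples of 1/100:
df.all$Win <- runif(nrow(df.all)) < df.all$Preds # TRUE with probability exactly Preds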
