Suppose I have a sample of 12 draws from a Bernoulli distribution with p = 0.2. I split this sample into 4 chunks (groups), each of size 3, then remove any chunk whose sum equals 0. What I'm trying to do is combine the remaining chunks into a single new vector. Here's my code:
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2 = function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
chunk = chuck2(sample1, 4)
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    print(a.no0)
  }
}
And here's the output:
[1] 1 1 0
[1] 0 1 0
[1] 0 1 0
I want to combine these three outputs into a new vector like:
[1] 1 1 0 0 1 0 0 1 0
but I have no idea how to do this. Any hints, please?
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2=function(x,n)split(x,cut(seq_along(x),n,labels=FALSE))
chunk=chuck2(sample1,4)
int_vector <- c()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    int_vector <- c(int_vector, a.no0)
  }
}
int_vector
# [1] 1 1 0 0 1 0 0 1 0
Create a list() and assign it to a variable before the loop. Then, inside the loop, append each qualifying chunk's values to that list.
new_vector <- list()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    new_vector <- append(new_vector, a.no0)
  }
}
new_vector
This will return:
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 0
[[5]]
[1] 1
[[6]]
[1] 0
[[7]]
[1] 0
[[8]]
[1] 1
[[9]]
[1] 0
But I think you want a flattened vector:
as.vector(unlist(new_vector))
[1] 1 1 0 0 1 0 0 1 0
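Alternatively, the loop can be skipped entirely: base R's Filter() keeps only the chunks that pass a predicate, and unlist() flattens the survivors in one step (a minimal sketch using the chunk list defined above):

# Drop zero-sum chunks, then flatten the rest into one vector
unlist(Filter(function(v) sum(v) != 0, chunk), use.names = FALSE)
# [1] 1 1 0 0 1 0 0 1 0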
This doesn't directly address your issue, but it can be accomplished without a for loop:
library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr

set.seed(123)
sample1 <- rbinom(12, 1, 0.2)

as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = 1:n()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  gather(key, value, -id) %>%
  arrange(id, key) %>%
  .$value
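A side note: gather() has since been superseded by pivot_longer() in tidyr. A roughly equivalent pipeline with the newer verb might look like this (a sketch, assuming a current tidyr):

# dplyr and tidyr already loaded above
as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = 1:n()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  arrange(id, key) %>%
  pull(value)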
Two versions without a for loop.
data:
set.seed(123)
sample1 <- rbinom(12, 1, 0.2)
base-R functional version:
split.sample1 <- split(sample1, cut(seq_along(sample1), 4, labels = FALSE))
sumf <- function(x) if (sum(x) == 0) NULL else x
result <- unlist(lapply(split.sample1, sumf), use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
Version using the magrittr pipe (%>%) operator:
library(magrittr) # for the %>% operator
grp.indx <- cut(seq_along(sample1), 4, labels = FALSE)
split.sample1 <- sample1 %>% split(grp.indx)
result <- split.sample1 %>% lapply(sumf) %>% unlist(use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
It seems like your function builds a pseudo-matrix as a list. This approach instead makes a matrix directly from sample1 and then returns, as a vector, the rows whose row sums are greater than 0.
set.seed(123)
sample1 = rbinom(12, 1, 0.2)
chunk_mat = matrix(sample1, ncol = 3, byrow = TRUE)
as.vector(t(chunk_mat[which(rowSums(chunk_mat) != 0), ]))
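One caveat to add: if only a single row survives the filter, [ drops the matrix to a plain vector and t() would then produce the wrong shape. A drop = FALSE guard (my addition) avoids that edge case:

# drop = FALSE keeps a single-row result as a matrix
as.vector(t(chunk_mat[rowSums(chunk_mat) != 0, , drop = FALSE]))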
Here are benchmarks. chuck2 is in the global environment, but each function still has to generate its own chunk data frame / matrix / list, so the comparisons are apples to apples.
Unit: microseconds
expr min lq mean median uq max neval
cole_matrix 19.902 26.2515 38.60094 43.3505 47.4505 56.801 100
heds_int_vector 4965.201 5101.9010 5616.53893 5251.8510 5490.9010 23417.401 100
bwilliams_dplyr 5278.602 5506.4010 5847.55298 5665.7010 5821.5515 9413.801 100
Simon_base 128.501 138.0010 196.46697 185.6005 203.1515 2481.101 100
Simon_magrittr 366.601 392.5005 453.74806 455.1510 492.0010 739.501 100
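For reference, a benchmark like this could be reproduced along the following lines (a sketch: only two of the five wrappers are shown, and their bodies are assumed to match the answers above):

library(microbenchmark)

# Each wrapper rebuilds its own chunk structure so the timings are comparable
cole_matrix <- function() {
  chunk_mat <- matrix(sample1, ncol = 3, byrow = TRUE)
  as.vector(t(chunk_mat[rowSums(chunk_mat) != 0, , drop = FALSE]))
}
Simon_base <- function() {
  s <- split(sample1, cut(seq_along(sample1), 4, labels = FALSE))
  unlist(lapply(s, function(x) if (sum(x) == 0) NULL else x), use.names = FALSE)
}

microbenchmark(cole_matrix(), Simon_base(), times = 100)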
I want to write an R function, say f, with inputs x and n, where x is some kind of "list of distributions" and f is supposed to draw n samples from each distribution in x.
What is a good way to implement this in R?
My current idea is
f = function(x, n) {
  out = list()
  for (i in 1:length(x)) {
    name = sub("\\(.*", "", x[i])
    size = ifelse(name == "sample", paste("size=", n), paste0("n=", n))
    args = paste(size, gsub("[\\(\\)]", "", regmatches(x[i], gregexpr("\\(.*?\\)", x[i]))[[1]]), sep = ",")
    out[[i]] = eval(parse(text = paste0(name, "(", args, ")")))
  }
  return(out)
}
f(x = c("rnorm(mean=1,sd=2)","sample(0:1,replace=TRUE)","rbinom(size=10,prob=0.1)"), n = 10)
I don't like this implementation, because
n is not always the input name for the sample size (e.g. in sample it is size),
the code will crash if not all inputs for the distributions are properly defined.
Can I improve the implementation, for example with x of class alist?
You could change your input format and create a named list of functions and their arguments. For each distribution we set the n/size value to 1.
ls_func <- list("rnorm"  = list(mean = 0, sd = 1, n = 1),
                "sample" = list(x = 0:1, replace = TRUE, size = 1),
                "rbinom" = list(size = 10, prob = 0.1, n = 1))
Your function takes those distributions and replicates them n-times:
g <- function(ls_func, n) {
  out = list()
  for (i in seq_along(ls_func)) {
    out[[i]] <- replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n)
  }
  return(out)
}
so
set.seed(4096)
g(ls_func, 10)
returns
[[1]]
[1] 0.1894398 -0.1622468 0.5327100 -1.5747229 -0.6884024 -0.3092226 -0.0879258 -0.4320240 -0.7799596 0.4525895
[[2]]
[1] 0 1 0 0 0 1 1 1 0 0
[[3]]
[1] 0 0 1 1 0 1 1 1 1 0
Basically, it's not a good approach to use eval(parse(text = ...)) to execute functions. Use do.call instead.
You can remove the for-loop:
g <- function(ls_func, n) {
  lapply(seq_along(ls_func), function(i) replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n))
}
Note: this code also crashes if your distributions aren't defined properly. To avoid that, you need some error handling; look at the try and stop functions.
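For example, each call could be wrapped in tryCatch so that one misspecified distribution doesn't abort the whole run (a sketch; the warning text is illustrative):

g_safe <- function(ls_func, n) {
  lapply(seq_along(ls_func), function(i) {
    tryCatch(
      replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n),
      error = function(e) {
        warning("skipping '", names(ls_func)[i], "': ", conditionMessage(e))
        NULL # a misspecified distribution yields NULL instead of crashing
      }
    )
  })
}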
I've been putting together an R package -- distionary -- that can help with this.
First make a list of input distributions:
library(distionary)
x <- list(
dst_norm(1, 2^2),
dst_empirical(0:1),
dst_binom(10, 0.1)
)
The function for drawing from a distribution is realize(), which fits nicely with lapply() (or purrr's map()):
set.seed(123)
lapply(x, realize, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0
Putting this code in a function is then straightforward:
f <- function(x, n) {
lapply(x, realize, n = n)
}
set.seed(123)
f(x, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0
Problem Statement
Let's say you have the following data:
df <- data.frame(x = rep(0, 10),
                 batch = rep(1:3, c(4, 2, 4)))
x batch
1 0 1
2 0 1
3 0 1
4 0 1
5 0 2
6 0 2
7 0 3
8 0 3
9 0 3
10 0 3
You want to loop over the number of unique batches in your dataset and within each batch, apply an algorithm to generate a vector of 1's and 0's. The algorithm is quite long, so for example's sake, let's say it's a random sample:
set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
}
You then want to save out_x into the correct indices in df$x. My current rudimentary approach is to explicitly specify indices:
idxb <- 1
idxe <- length(df[which(df$batch == 1), ]$batch)

set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
  print(out_x)
  # save output
  df$x[idxb:idxe] <- out_x
  # update indices
  idxb <- idxb + length(out_x)
  if (i < length(unique(df$batch))) {
    idxe <- idxe + length(df[which(df$batch == i + 1), ]$batch)
  }
}
Output
The result should look like this:
x batch
1 0 1
2 1 1
3 1 1
4 0 1
5 1 2
6 1 2
7 1 3
8 0 3
9 1 3
10 1 3
where each iteration of out_x looks like this:
[1] 0 1 1 0
[1] 1 1
[1] 1 0 1 1
Question
What is a faster way to implement this while still using base R?
What about using tapply?
out_x <- tapply(df$batch, df$batch, function(x) sample(c(0, 1), length(x), replace = TRUE))
#------
$`1`
[1] 0 1 1 1
$`2`
[1] 0 1
$`3`
[1] 1 1 1 1
And then to reassign to df
df$x <- unlist(out_x)
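One caveat worth noting: df$x <- unlist(out_x) relies on the rows of df already being sorted by batch. If they might not be, the split<- replacement form in base R assigns each piece back to the matching rows regardless of row order (a minimal sketch):

# Order-independent assignment: each batch's output goes to its own rows
split(df$x, df$batch) <- out_x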
A timing test:
microbenchmark::microbenchmark(f_loop(), f_apply())
#---------
Unit: microseconds
expr min lq mean median uq max neval
f_loop() 399.895 425.1975 442.7077 437.754 450.690 612.969 100
f_apply() 100.449 106.9185 160.5557 110.913 114.909 4867.603 100
Where the functions are defined as
f_loop <- function() {
  idxb <- 1
  idxe <- length(df[which(df$batch == 1), ]$batch)
  for (i in seq_len(length(unique(df$batch)))) {
    batch_val <- df[which(df$batch == i), ]$batch
    # some algorithm to generate 1's and 0's, but using sample() here
    out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
    # print(out_x)
    # save output
    df$x[idxb:idxe] <- out_x
    # update indices
    idxb <- idxb + length(out_x)
    if (i < length(unique(df$batch))) {
      idxe <- idxe + length(df[which(df$batch == i + 1), ]$batch)
    }
  }
  return(df$x)
}
f_apply <- function() {
  unlist(tapply(df$batch, df$batch, function(x) sample(c(0, 1), length(x), replace = TRUE)))
}
One solution is to remind myself that I can index a vector with a vector!
set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
  print(out_x)
  # save output
  idx <- which(df$batch == i)
  df$x[idx] <- out_x
}
Now I'm learning how to use the purrr package in R, and thinking about how to generate, for each of x = 1, 2, ..., 99, 100, x samples of 5 coin flips each.
The list I have in mind should look like:
[[1]]
[[1]]
[1] 1 0 1 0 0
[[2]]
[[1]]
[1] 1 0 0 0 1
[[2]]
[1] 0 1 0 1 1
[[3]]
[[1]]
[1] 0 1 1 1 0
[[2]]
[1] 1 0 0 0 1
[[3]]
[1] 0 1 1 1 1
...
Can anyone help me construct this?
You want the function rerun applied to each element of the vector 1:100 via the map function, as follows:
library(purrr)
1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
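Note that rerun() has been deprecated in recent purrr releases; on a newer version the same structure can be built with map() alone (a sketch):

library(purrr)
# For each x, build a list of x draws of 5 coin flips
1:100 %>% map(function(x) map(1:x, function(i) rbinom(5, 1, .5)))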
However, it is just as easy to use replicate; note that replicate's default is to simplify each result into a column-wise array.
lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
Note that the base R expression is much faster in this case.
a <- function() 1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
b <- function() lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
library(microbenchmark)
microbenchmark(a(),b())
Unit: milliseconds
 expr      min        lq      mean    median        uq      max neval cld
  a() 96.89941 104.83822 117.10245 111.48309 120.28554 391.9411   100   b
  b() 16.88232  18.47104  23.22976  22.20549  26.31445  49.0042   100  a
Edit: Regarding your question in the comments, if you are just interested in a law-of-large-numbers illustration, you could do the following.
plot(1:100, do.call("c", lapply(b(), mean)),
     type = "l", xlab = "replications",
     ylab = "proportion of heads")
abline(h = .5)
If I understand you correctly, this is what you're after:
lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5),simplify = FALSE))
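A quick sanity check (my addition): with simplify = FALSE, element x of the result is a list of x vectors, so the element lengths should run from 1 to 100:

out <- lapply(1:100, function(x) replicate(x, rbinom(5, 1, 0.5), simplify = FALSE))
lengths(out)      # 1 2 3 ... 100
lengths(out[[3]]) # 5 5 5 -- three samples of 5 flips each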
There is probably an obvious and elegant way to do this, most likely using lapply, but I am still mastering the apply commands and am struggling to find it.
I have a dataframe that looks like the following except that instead of 5 factor variables there are dozens and instead of 10 rows there are hundreds.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)))
I want to create a new variable which is 1 if any of 13 columns contains a particular level of the factor. The equivalent in the example dataframe would be creating a new variable called "b" which is 1 if there's a "1" in any of the columns a1:a4, which would look like the following.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)),
"b" = c(0,0,1,1,0,1,0,1,1,1))
There has GOT to be a way to do this using the 13 column positions instead of writing a conditional if-then statement for each of the 13 variables.
Just use rowSums, something like this:
> as.numeric(rowSums(a[paste0("a", 1:5)] == 1) >= 1)
[1] 0 0 1 1 0 1 0 1 1 1
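It's worth spelling out why this works on factor columns: == on a factor compares against the level labels, so == 1 effectively tests each cell against the level "1". An explicit version (a sketch) makes the conversion visible:

# Convert the factor columns to numeric first, then test the row sums
num <- sapply(a[paste0("a", 1:4)], function(col) as.numeric(as.character(col)))
as.numeric(rowSums(num) >= 1)
# [1] 0 0 1 1 0 1 0 1 1 1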
In case you wanted to try lapply
Reduce(`|`,lapply(a[,-1], function(x) as.numeric(as.character(x))))+0
#[1] 0 0 1 1 0 1 0 1 1 1
Or just
Reduce(`|`, lapply(a[,-1], `==`, 1)) +0
#[1] 0 0 1 1 0 1 0 1 1 1
Benchmarks
set.seed(155)
df <- as.data.frame(matrix(sample(0:1, 5000*1e4, replace=TRUE), ncol=5000))
library(microbenchmark)
f1 <- function() {as.numeric(rowSums(df == 1) >= 1) }
f2 <- function() {Reduce(`|`, lapply(df, `==`, 1)) +0}
f3 <- function() {apply(df == 1, 1, function(x) any(x %in% TRUE))+0}
microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
# expr min lq median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 100
# f2() 1.040561 1.043713 1.053773 1.032932 1.045067 100
# f3() 2.538287 2.517184 2.825253 2.477225 2.454511 100
You could also use any after converting the matrix to logical.
> apply(a[grep("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1
Or
> apply(a[grepl("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1
I have a pair of binary variables (1's and 0's), and my professor wants me to create a new binary variable that takes the value 1 if both of the previous variables are 1 (i.e., x = y = 1) and takes the value 0 otherwise.
How would I do this in R?
Thanks!
JMC
Here's one example with some sample data to play with:
set.seed(1)
A <- sample(0:1, 10, replace = TRUE)
B <- sample(0:1, 10, replace = TRUE)
A
# [1] 0 0 1 1 0 1 1 1 1 0
B
# [1] 0 0 1 0 1 0 1 1 0 1
as.numeric(A + B == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(rowSums(cbind(A, B)) == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(A == 1 & B == 1)
# [1] 0 0 1 0 0 0 1 1 0 0
Update (to introduce some more alternatives and share a link and a benchmark)
set.seed(1)
A <- sample(0:1, 1e7, replace = TRUE)
B <- sample(0:1, 1e7, replace = TRUE)
fun1 <- function() ifelse(A == 1 & B == 1, 1, 0)
fun2 <- function() as.numeric(A + B == 2)
fun3 <- function() as.numeric(A & B)
fun4 <- function() as.numeric(A == 1 & B == 1)
fun5 <- function() as.numeric(rowSums(cbind(A, B)) == 2)
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), fun4(), fun5(), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 4842.8559 4871.7072 5022.3525 5093.5932 10424.6589 5
# fun2() 220.8336 220.9867 226.1167 229.1225 472.4408 5
# fun3() 440.7427 445.9342 461.0114 462.6184 488.6627 5
# fun4() 604.1791 613.9284 630.4838 645.2146 682.4689 5
# fun5() 373.8088 373.8532 373.9460 435.0385 1084.6227 5
As can be seen, ifelse is indeed much slower than the other approaches mentioned here. See this SO question and answer for some more details about the efficiency of ifelse.
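A further alternative, not benchmarked above (my addition): since both inputs are 0/1 vectors, plain multiplication gives the same result, and pmin() generalizes the idea to more than two vectors:

# Elementwise product of two 0/1 vectors is 1 exactly when both are 1
A * B
# pmin() also works and extends naturally, e.g. pmin(A, B, C)
pmin(A, B)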