Suppose I have a sample of 12 draws from a Bernoulli distribution with p = 0.2. I split this sample into 4 chunks (groups), each of size 3, then remove any chunk whose sum equals 0. What I'm trying to do is combine the remaining chunks into a single new vector. Here's my code:
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2 = function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
chunk = chuck2(sample1, 4)
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    print(a.no0)
  }
}
And here's the output:
[1] 1 1 0
[1] 0 1 0
[1] 0 1 0
I want to combine these three outputs into a new vector like:
[1] 1 1 0 0 1 0 0 1 0
but I have no idea how to do this. Any hints, please?
set.seed(123)
sample1=rbinom(12,1,0.2)
chuck2=function(x,n)split(x,cut(seq_along(x),n,labels=FALSE))
chunk=chuck2(sample1,4)
int_vector <- c()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    int_vector <- c(int_vector, a.no0)
  }
}
int_vector
# [1] 1 1 0 0 1 0 0 1 0
Create a list() and assign it to a variable before the loop. Then, inside the loop, append each qualifying chunk's values to that list.
new_vector <- list()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    new_vector <- append(new_vector, a.no0)
  }
}
new_vector
This will return:
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 0
[[5]]
[1] 1
[[6]]
[1] 0
[[7]]
[1] 0
[[8]]
[1] 1
[[9]]
[1] 0
But I think you want a flattened vector:
as.vector(unlist(new_vector))
[1] 1 1 0 0 1 0 0 1 0
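Alternatively, the loop can be skipped entirely: base R's Filter() keeps only the chunks that pass a predicate, and unlist() flattens the survivors in one step (a minimal sketch using the chunk list defined above):

# Drop zero-sum chunks, then flatten the rest into one vector
unlist(Filter(function(v) sum(v) != 0, chunk), use.names = FALSE)
# [1] 1 1 0 0 1 0 0 1 0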
This doesn't directly address your issue, but it can be accomplished without a for loop:
library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr

set.seed(123)
sample1 <- rbinom(12, 1, 0.2)

as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = 1:n()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  gather(key, value, -id) %>%
  arrange(id, key) %>%
  .$value
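A side note: gather() has since been superseded by pivot_longer() in tidyr. A roughly equivalent pipeline with the newer verb might look like this (a sketch, assuming a current tidyr):

# dplyr and tidyr already loaded above
as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = 1:n()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  arrange(id, key) %>%
  pull(value)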
Two versions without a for loop.
data:
set.seed(123)
sample1 <- rbinom(12, 1, 0.2)
base-R functional version:
split.sample1 <- split(sample1, cut(seq_along(sample1), 4, labels = FALSE))
sumf <- function(x) if (sum(x) == 0) NULL else x
result <- unlist(lapply(split.sample1, sumf), use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
Version using the magrittr pipe (%>%) operator:
library(magrittr) # for the %>% operator
grp.indx <- cut(seq_along(sample1), 4, labels = FALSE)
split.sample1 <- sample1 %>% split(grp.indx)
result <- split.sample1 %>% lapply(sumf) %>% unlist(use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
It seems like your function builds a pseudo-matrix as a list. This approach instead makes a matrix directly from sample1 and then returns, as a vector, the rows whose row sums are greater than 0.
set.seed(123)
sample1 = rbinom(12, 1, 0.2)
chunk_mat = matrix(sample1, ncol = 3, byrow = TRUE)
as.vector(t(chunk_mat[which(rowSums(chunk_mat) != 0), ]))
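One caveat to add: if only a single row survives the filter, [ drops the matrix to a plain vector and t() would then produce the wrong shape. A drop = FALSE guard (my addition) avoids that edge case:

# drop = FALSE keeps a single-row result as a matrix
as.vector(t(chunk_mat[rowSums(chunk_mat) != 0, , drop = FALSE]))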
Here are benchmarks. chuck2 is in the global environment, but each function still has to generate its own chunk data frame / matrix / list, so the comparisons are apples to apples.
Unit: microseconds
expr min lq mean median uq max neval
cole_matrix 19.902 26.2515 38.60094 43.3505 47.4505 56.801 100
heds_int_vector 4965.201 5101.9010 5616.53893 5251.8510 5490.9010 23417.401 100
bwilliams_dplyr 5278.602 5506.4010 5847.55298 5665.7010 5821.5515 9413.801 100
Simon_base 128.501 138.0010 196.46697 185.6005 203.1515 2481.101 100
Simon_magrittr 366.601 392.5005 453.74806 455.1510 492.0010 739.501 100
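For reference, a benchmark like this could be reproduced along the following lines (a sketch: only two of the five wrappers are shown, and their bodies are assumed to match the answers above):

library(microbenchmark)

# Each wrapper rebuilds its own chunk structure so the timings are comparable
cole_matrix <- function() {
  chunk_mat <- matrix(sample1, ncol = 3, byrow = TRUE)
  as.vector(t(chunk_mat[rowSums(chunk_mat) != 0, , drop = FALSE]))
}
Simon_base <- function() {
  s <- split(sample1, cut(seq_along(sample1), 4, labels = FALSE))
  unlist(lapply(s, function(x) if (sum(x) == 0) NULL else x), use.names = FALSE)
}

microbenchmark(cole_matrix(), Simon_base(), times = 100)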
I want to write an R function, say f, with inputs x and n, where x is some kind of "list of distributions" and f is supposed to draw n samples from each distribution in x.
What is a good way to implement this in R?
My current idea is
f = function(x, n) {
  out = list()
  for (i in 1:length(x)) {
    name = sub("\\(.*", "", x[i])
    size = ifelse(name == "sample", paste("size=", n), paste0("n=", n))
    args = paste(size, gsub("[\\(\\)]", "", regmatches(x[i], gregexpr("\\(.*?\\)", x[i]))[[1]]), sep = ",")
    out[[i]] = eval(parse(text = paste0(name, "(", args, ")")))
  }
  return(out)
}
f(x = c("rnorm(mean=1,sd=2)","sample(0:1,replace=TRUE)","rbinom(size=10,prob=0.1)"), n = 10)
I don't like this implementation, because
n is not always the input name for the sample size (e.g. in sample it is size),
the code will crash if not all inputs for the distributions are properly defined.
Can I improve the implementation, for example with x of class alist?
You could change your input format and create a named list of functions and their arguments. For each distribution we set the n/size value to 1.
ls_func <- list("rnorm"  = list(mean = 0, sd = 1, n = 1),
                "sample" = list(x = 0:1, replace = TRUE, size = 1),
                "rbinom" = list(size = 10, prob = 0.1, n = 1))
Your function takes those distributions and replicates them n-times:
g <- function(ls_func, n) {
  out = list()
  for (i in seq_along(ls_func)) {
    out[[i]] <- replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n)
  }
  return(out)
}
so
set.seed(4096)
g(ls_func, 10)
returns
[[1]]
[1] 0.1894398 -0.1622468 0.5327100 -1.5747229 -0.6884024 -0.3092226 -0.0879258 -0.4320240 -0.7799596 0.4525895
[[2]]
[1] 0 1 0 0 0 1 1 1 0 0
[[3]]
[1] 0 0 1 1 0 1 1 1 1 0
Basically, it's not a good approach to use eval(parse(text = ...)) to execute functions. Use do.call instead.
You can remove the for-loop:
g <- function(ls_func, n) {
  lapply(seq_along(ls_func), function(i) replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n))
}
Note: this code also crashes if your distributions aren't defined properly. To avoid that, you need some error handling; look at the try and stop functions.
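For example, each call could be wrapped in tryCatch so that one misspecified distribution doesn't abort the whole run (a sketch; the warning text is illustrative):

g_safe <- function(ls_func, n) {
  lapply(seq_along(ls_func), function(i) {
    tryCatch(
      replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n),
      error = function(e) {
        warning("skipping '", names(ls_func)[i], "': ", conditionMessage(e))
        NULL # a misspecified distribution yields NULL instead of crashing
      }
    )
  })
}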
I've been putting together an R package -- distionary -- that can help with this.
First make a list of input distributions:
library(distionary)
x <- list(
dst_norm(1, 2^2),
dst_empirical(0:1),
dst_binom(10, 0.1)
)
The function for drawing from a distribution is realize(), which fits nicely with lapply() (or purrr's map()):
set.seed(123)
lapply(x, realize, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0
Putting this code in a function is then straightforward:
f <- function(x, n) {
lapply(x, realize, n = n)
}
set.seed(123)
f(x, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0
Problem Statement
Let's say you have the following data:
df <- data.frame(x = rep(0, 10),
                 batch = rep(1:3, c(4, 2, 4)))
x batch
1 0 1
2 0 1
3 0 1
4 0 1
5 0 2
6 0 2
7 0 3
8 0 3
9 0 3
10 0 3
You want to loop over the number of unique batches in your dataset and within each batch, apply an algorithm to generate a vector of 1's and 0's. The algorithm is quite long, so for example's sake, let's say it's a random sample:
set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
}
You then want to save out_x into the correct indices in df$x. My current rudimentary approach is to explicitly specify indices:
idxb <- 1
idxe <- length(df[which(df$batch == 1), ]$batch)

set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
  print(out_x)
  # save output
  df$x[idxb:idxe] <- out_x
  # update indices
  idxb <- idxb + length(out_x)
  if (i < length(unique(df$batch))) {
    idxe <- idxe + length(df[which(df$batch == i + 1), ]$batch)
  }
}
Output
The result should look like this:
x batch
1 0 1
2 1 1
3 1 1
4 0 1
5 1 2
6 1 2
7 1 3
8 0 3
9 1 3
10 1 3
where each iteration of out_x looks like this:
[1] 0 1 1 0
[1] 1 1
[1] 1 0 1 1
Question
What is a faster way to implement this while still using base R?
What about using tapply?
out_x <- tapply(df$batch, df$batch, function(x) sample(c(0, 1), length(x), replace = TRUE))
#------
$`1`
[1] 0 1 1 1
$`2`
[1] 0 1
$`3`
[1] 1 1 1 1
And then to reassign to df
df$x <- unlist(out_x)
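One caveat worth noting: df$x <- unlist(out_x) relies on the rows of df already being sorted by batch. If they might not be, the split<- replacement form in base R assigns each piece back to the matching rows regardless of row order (a minimal sketch):

# Order-independent assignment: each batch's output goes to its own rows
split(df$x, df$batch) <- out_x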
A timing test:
microbenchmark::microbenchmark(f_loop(), f_apply())
#---------
Unit: microseconds
expr min lq mean median uq max neval
f_loop() 399.895 425.1975 442.7077 437.754 450.690 612.969 100
f_apply() 100.449 106.9185 160.5557 110.913 114.909 4867.603 100
Where the functions are defined as
f_loop <- function() {
  idxb <- 1
  idxe <- length(df[which(df$batch == 1), ]$batch)
  for (i in seq_len(length(unique(df$batch)))) {
    batch_val <- df[which(df$batch == i), ]$batch
    # some algorithm to generate 1's and 0's, but using sample() here
    out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
    # print(out_x)
    # save output
    df$x[idxb:idxe] <- out_x
    # update indices
    idxb <- idxb + length(out_x)
    if (i < length(unique(df$batch))) {
      idxe <- idxe + length(df[which(df$batch == i + 1), ]$batch)
    }
  }
  return(df$x)
}
f_apply <- function() {
  unlist(tapply(df$batch, df$batch, function(x) sample(c(0, 1), length(x), replace = TRUE)))
}
One solution is to remind myself that I can index a vector with a vector!
set.seed(2021)
for (i in seq_len(length(unique(df$batch)))) {
  batch_val <- df[which(df$batch == i), ]$batch
  # some algorithm to generate 1's and 0's, but using sample() here
  out_x <- sample(c(0, 1), length(batch_val), replace = TRUE)
  print(out_x)
  # save output
  idx <- which(df$batch == i)
  df$x[idx] <- out_x
}
Now I'm learning how to use the purrr package in R, and thinking about how to generate, for each of x = 1, 2, ..., 99, 100, x samples of 5 coin flips each.
The list I have in mind should look like:
[[1]]
[[1]]
[1] 1 0 1 0 0
[[2]]
[[1]]
[1] 1 0 0 0 1
[[2]]
[1] 0 1 0 1 1
[[3]]
[[1]]
[1] 0 1 1 1 0
[[2]]
[1] 1 0 0 0 1
[[3]]
[1] 0 1 1 1 1
...
Can anyone help me construct this?
You want the function rerun applied to each element of the vector 1:100 via the map function, as follows:
library(purrr)
1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
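Note that rerun() has been deprecated in recent purrr releases; on a newer version the same structure can be built with map() alone (a sketch):

library(purrr)
# For each x, build a list of x draws of 5 coin flips
1:100 %>% map(function(x) map(1:x, function(i) rbinom(5, 1, .5)))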
However, it is just as easy to use replicate; note that replicate's default is to simplify each result into a column-wise array.
lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
Note that the base R expression is much faster in this case.
a <- function() 1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
b <- function() lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
library(microbenchmark)
microbenchmark(a(),b())
Unit: milliseconds
 expr      min        lq      mean    median        uq      max neval cld
  a() 96.89941 104.83822 117.10245 111.48309 120.28554 391.9411   100   b
  b() 16.88232  18.47104  23.22976  22.20549  26.31445  49.0042   100  a
Edit: Regarding your question in the comments, if you are just interested in a law-of-large-numbers illustration, you could do the following.
plot(1:100, do.call("c", lapply(b(), mean)),
     type = "l", xlab = "replications",
     ylab = "proportion of heads")
abline(h = .5)
If I understand you correctly, this is what you're after:
lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5),simplify = FALSE))
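A quick sanity check (my addition): with simplify = FALSE, element x of the result is a list of x vectors, so the element lengths should run from 1 to 100:

out <- lapply(1:100, function(x) replicate(x, rbinom(5, 1, 0.5), simplify = FALSE))
lengths(out)      # 1 2 3 ... 100
lengths(out[[3]]) # 5 5 5 -- three samples of 5 flips each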
There is probably an obvious and elegant way to do this, most likely using lapply, but I am still mastering the apply commands and am struggling to find it.
I have a dataframe that looks like the following except that instead of 5 factor variables there are dozens and instead of 10 rows there are hundreds.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)))
I want to create a new variable which is 1 if any of 13 columns contains a particular level of the factor. The equivalent in the example dataframe would be creating a new variable called "b" which is 1 if there's a "1" in any of the columns a1:a4, which would look like the following.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)),
"b" = c(0,0,1,1,0,1,0,1,1,1))
There has GOT to be a way to do this using the 13 column positions instead of writing a conditional if-then statement for each of the 13 variables.
Just use rowSums, something like this:
> as.numeric(rowSums(a[paste0("a", 1:5)] == 1) >= 1)
[1] 0 0 1 1 0 1 0 1 1 1
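It's worth spelling out why this works on factor columns: == on a factor compares against the level labels, so == 1 effectively tests each cell against the level "1". An explicit version (a sketch) makes the conversion visible:

# Convert the factor columns to numeric first, then test the row sums
num <- sapply(a[paste0("a", 1:4)], function(col) as.numeric(as.character(col)))
as.numeric(rowSums(num) >= 1)
# [1] 0 0 1 1 0 1 0 1 1 1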
In case you wanted to try lapply
Reduce(`|`,lapply(a[,-1], function(x) as.numeric(as.character(x))))+0
#[1] 0 0 1 1 0 1 0 1 1 1
Or just
Reduce(`|`, lapply(a[,-1], `==`, 1)) +0
#[1] 0 0 1 1 0 1 0 1 1 1
Benchmarks
set.seed(155)
df <- as.data.frame(matrix(sample(0:1, 5000*1e4, replace=TRUE), ncol=5000))
library(microbenchmark)
f1 <- function() {as.numeric(rowSums(df == 1) >= 1) }
f2 <- function() {Reduce(`|`, lapply(df, `==`, 1)) +0}
f3 <- function() {apply(df == 1, 1, function(x) any(x %in% TRUE))+0}
microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
# expr min lq median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 100
# f2() 1.040561 1.043713 1.053773 1.032932 1.045067 100
# f3() 2.538287 2.517184 2.825253 2.477225 2.454511 100
You could also use any after converting the matrix to logical.
> apply(a[grep("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1
Or
> apply(a[grepl("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1
I have a pair of binary variables (1's and 0's), and my professor wants me to create a new binary variable that takes the value 1 if both of the previous variables are 1 (i.e., x = y = 1) and takes the value 0 otherwise.
How would I do this in R?
Thanks!
JMC
Here's one example with some sample data to play with:
set.seed(1)
A <- sample(0:1, 10, replace = TRUE)
B <- sample(0:1, 10, replace = TRUE)
A
# [1] 0 0 1 1 0 1 1 1 1 0
B
# [1] 0 0 1 0 1 0 1 1 0 1
as.numeric(A + B == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(rowSums(cbind(A, B)) == 2)
# [1] 0 0 1 0 0 0 1 1 0 0
as.numeric(A == 1 & B == 1)
# [1] 0 0 1 0 0 0 1 1 0 0
Update (to introduce some more alternatives and share a link and a benchmark)
set.seed(1)
A <- sample(0:1, 1e7, replace = TRUE)
B <- sample(0:1, 1e7, replace = TRUE)
fun1 <- function() ifelse(A == 1 & B == 1, 1, 0)
fun2 <- function() as.numeric(A + B == 2)
fun3 <- function() as.numeric(A & B)
fun4 <- function() as.numeric(A == 1 & B == 1)
fun5 <- function() as.numeric(rowSums(cbind(A, B)) == 2)
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), fun4(), fun5(), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 4842.8559 4871.7072 5022.3525 5093.5932 10424.6589 5
# fun2() 220.8336 220.9867 226.1167 229.1225 472.4408 5
# fun3() 440.7427 445.9342 461.0114 462.6184 488.6627 5
# fun4() 604.1791 613.9284 630.4838 645.2146 682.4689 5
# fun5() 373.8088 373.8532 373.9460 435.0385 1084.6227 5
As can be seen, ifelse is indeed much slower than the other approaches mentioned here. See this SO question and answer for some more details about the efficiency of ifelse.
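A further alternative, not benchmarked above (my addition): since both inputs are 0/1 vectors, plain multiplication gives the same result, and pmin() generalizes the idea to more than two vectors:

# Elementwise product of two 0/1 vectors is 1 exactly when both are 1
A * B
# pmin() also works and extends naturally, e.g. pmin(A, B, C)
pmin(A, B)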