I have a simulation that produces a data structure like this (just a toy example):
foo = function(mu=0,lambda=1){
x1 = rnorm(1,mu) #X~N(μ,1)
y1 = rexp(1,lambda) #Y~Exp(λ)
list(x=x1,y=y1)
}
mu = 1; lambda = 2 #true values: E(X)=μ=1; E(Y)=1/λ=0.5
set.seed(0); out = replicate(1000, foo(mu,lambda), simplify=FALSE)
# str(out)
This gives a list out with length(out) = 1000, where each element is itself a list with components x and y.
I want to compute the mean of the 1000 x values and the mean of the 1000 y values.
Of course, I can get there in a clumsy way:
m = c() #for storing simulated values
for(i in 1:2){
s = sapply( 1:1000, function(j)out[[j]][i] )
m[i] = mean( as.numeric(s) )
}
m
# [1] 0.9736922 0.4999028
Is there a simpler and more efficient way to compute the means? I also tried lapply(out, mean)
and Reduce("+",out)/1000, but both failed.
This is another option if the sublists are always the same length:
> rowMeans(matrix(unlist(out),2))
[1] 0.9736922 0.4999028
Or:
> rowMeans(replicate(1000,unlist(foo(mu,lambda))))
x y
0.9736922 0.4999028
An option is to use purrr::transpose
library(purrr)
out %>% transpose() %>% map(~ mean(unlist(.x)[1:1000]))
# Or: out[1:1000] %>% transpose() %>% map(~ mean(unlist(.x)))
#$x
#[1] 0.9736922
#
#$y
#[1] 0.4999028
Or a base R solution using lapply (which is essentially the same as your explicit for loop):
lapply(c("x", "y"), function(var) mean(sapply(out[1:1000], "[[", var)))
#[[1]]
#[1] 0.9736922
#
#[[2]]
#[1] 0.4999028
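For context on why the two attempts in the question fail: each element of out is itself a list, so mean() receives a list rather than a number (it returns NA with a warning), and "+" is not defined for two lists. Below is a minimal base R sketch of two name-preserving variants. foo() is redefined here only so the block runs on its own, and the names m_named and m_stacked are illustrative.

```r
# Rebuild the toy data from the question so the block is self-contained.
foo <- function(mu = 0, lambda = 1) {
  list(x = rnorm(1, mu), y = rexp(1, lambda))
}
set.seed(0)
out <- replicate(1000, foo(1, 2), simplify = FALSE)

# Variant 1: pull one component out of every element, then average it.
# vapply() checks that each extracted value is a single number.
m_named <- vapply(
  c("x", "y"),
  function(v) mean(vapply(out, `[[`, numeric(1), v)),
  numeric(1)
)

# Variant 2: stack the elements into a 1000 x 2 matrix, then take column means.
m_stacked <- colMeans(do.call(rbind, lapply(out, unlist)))

m_named
m_stacked
```

Both results carry the names x and y, which the for loop in the question loses.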
I'd like to get better at writing elegant code in R, and am trying to avoid writing nested loops, but cannot figure out an (l)apply solution to my problem.
I have a set of paired files, each of which has two variables associated with them - a name and a number. The filenames are long, so I'd like to generate a vector of filenames that can then be accessed by my own custom downstream function for reading them into a dataframe, plotting, etc.
For example, the files look like:
5_simulationA.k 5_simulationA.b
10_simulationA.k 10_simulationA.b
5_simulationB.k 5_simulationB.b
10_simulationB.k 10_simulationB.b
The ".k" and ".b" files are mates of a pair and must stay together for downstream processing.
I could read in these files by writing a nested loop that would look something like,
K_files = c()
B_files = c()
for (i in c("A","B","C")){ # iterate over letter variable
for (n in c(5,10,15)){ #iterate over numbers of the files
k_filename = paste0(n, "_simulation", i, ".k")
b_filename = paste0(n, "_simulation", i, ".b")
K_files = c(K_files, k_filename)
B_files = c(B_files, b_filename)
}
}
This is of course very ugly and un-R-like. I would love to find a way to do this with the very powerful apply or lapply statements, or any other elegant solutions anyone might have. Thanks!
Base R function outer is meant for this kind of problem.
L <- c("A", "B", "C")
N <- c(5, 10, 15)
f <- function(i, n, e) paste0(n, "_simulation", i, e)
sapply(c(".k", ".b"), function(.e) outer(L, N, f, e = .e))
# .k .b
# [1,] "5_simulationA.k" "5_simulationA.b"
# [2,] "5_simulationB.k" "5_simulationB.b"
# [3,] "5_simulationC.k" "5_simulationC.b"
# [4,] "10_simulationA.k" "10_simulationA.b"
# [5,] "10_simulationB.k" "10_simulationB.b"
# [6,] "10_simulationC.k" "10_simulationC.b"
# [7,] "15_simulationA.k" "15_simulationA.b"
# [8,] "15_simulationB.k" "15_simulationB.b"
# [9,] "15_simulationC.k" "15_simulationC.b"
From OP's example output filenames, it looks like we want all combinations of n and i. expand.grid returns a data frame of all combinations of n and i. We could then use apply to loop through its rows to generate the filenames:
i <- c("A", "B", "C")
n <- c(5, 10, 15)
combi <- expand.grid(n = n, i = i)
invisible(apply(combi, 1, function(x){
k_filename = paste0(x[1], "_simulation", x[2], ".k")
b_filename = paste0(x[1], "_simulation", x[2], ".b")
print(k_filename)
print(b_filename)
}))
Notice that I used invisible to suppress the output of apply, since we are only interested in the side effects (reading/writing files). Alternatively, we can use pwalk from purrr, which takes each column of the same expand.grid data frame as input and creates the filenames silently:
library(dplyr)
library(purrr)
combi %>%
pwalk(~ {
k_filename = paste0(.x, "_simulation", .y, ".k")
b_filename = paste0(.x, "_simulation", .y, ".b")
print(k_filename)
print(b_filename)
})
Output:
[1] "5_simulationA.k"
[1] "5_simulationA.b"
[1] "10_simulationA.k"
[1] "10_simulationA.b"
[1] "15_simulationA.k"
[1] "15_simulationA.b"
[1] "5_simulationB.k"
[1] "5_simulationB.b"
[1] "10_simulationB.k"
[1] "10_simulationB.b"
[1] "15_simulationB.k"
[1] "15_simulationB.b"
[1] "5_simulationC.k"
[1] "5_simulationC.b"
[1] "10_simulationC.k"
[1] "10_simulationC.b"
[1] "15_simulationC.k"
[1] "15_simulationC.b"
library(tidyverse)
Type = c("A", "B", "C")
Index = c(5, 10, 15)
crossing(Type, Index) %>%
mutate(k_filename = map2_chr(Index, Type, ~paste(.x, "_simulation", .y, ".k", sep="")),
b_filename = map2_chr(Index, Type, ~paste(.x, "_simulation", .y, ".b", sep=""))) -> names
After that, you can access the k_filename or b_filename using pull
K_files <- names %>% pull(k_filename)
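Since paste0() is already vectorised, the two filename vectors the question asks for can also be built without any explicit iteration. The following is a minimal sketch, assuming the same letters and numbers as above; the name grid is illustrative.

```r
L <- c("A", "B", "C")
N <- c(5, 10, 15)

# All combinations; the first argument varies fastest, which matches the
# order of the nested loop in the question (letters outer, numbers inner).
grid <- expand.grid(n = N, i = L, stringsAsFactors = FALSE)

# paste0() works element-wise over whole columns, so no loop is needed.
K_files <- paste0(grid$n, "_simulation", grid$i, ".k")
B_files <- paste0(grid$n, "_simulation", grid$i, ".b")

# Mates of a pair sit at the same position in both vectors.
head(data.frame(K_files, B_files), 3)
```

Working on the columns directly also avoids apply() converting the data frame to a character matrix first, a step that may pad numbers with spaces.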
I have the following loop, which finds the number of distinct cases belonging to each of 4 different factors. test is a list containing 4 data frames.
for (i in test){
i<-i%>%distinct(FileNumber)%>%nrow()
print(i)
}
When I run this, I get the following output:
[1] 38
[1] 129
[1] 1868
[1] 277
However I want this output to be saved into another vector called my_vector. So that my_vector is
38 129 1868 277
So I tried the following, based on an answer I found ("Saving results from for loop as a vector in r"):
library(dplyr)
my_vector<-vector("numeric",4L)
for (i in test){
my_vector[i]<-i%>%distinct(FileNumber)%>%nrow()
}
However when I run this I get the following message
Error in my_vector[i] <- i %>% distinct(FileNumber) %>% nrow() :
invalid subscript type 'list'
How do I get the earlier output I listed saved into a vector?
You are trying to index my_vector with a list-like object.
For instance:
mylist <- list(mtcars, mtcars)
myvec <- numeric(length(mylist))
for (i in mylist) {
myvec[i] <- nrow(distinct(i, cyl))
}
On the first (and second in this example) iteration, i is a frame, so myvec[i] is equivalent to myvec[mtcars], which does not make sense.
Instead, loop over the indices of the list of frames, like so:
library(dplyr)
mylist <- list(mtcars, mtcars)
myvec <- numeric(length(mylist))
for (i in seq_along(mylist)) {
myvec[i] <- mylist[[i]] %>% distinct(cyl) %>% nrow()
}
myvec
# [1] 3 3
or just do something like:
sapply(mylist, function(l) l %>% distinct(cyl) %>% nrow())
# [1] 3 3
BTW: this is just as easy in base-R with:
sapply(mylist, function(l) length(unique(l[["cyl"]])))
# [1] 3 3
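To tie this back to the question's my_vector, here is a minimal sketch that puts the index-based loop and a vapply() equivalent side by side. It assumes a stand-in for test built from mtcars, with cyl playing the role of FileNumber; neither comes from the original data.

```r
# Stand-in for the question's `test`: a list of data frames.
test <- split(mtcars, mtcars$gear)

# Loop version: preallocate, then index by position rather than by element.
my_vector <- integer(length(test))
for (k in seq_along(test)) {
  my_vector[k] <- length(unique(test[[k]][["cyl"]]))
}

# vapply version: same result, with the return type checked on every call.
my_vector2 <- vapply(
  test,
  function(df) length(unique(df[["cyl"]])),
  integer(1),
  USE.NAMES = FALSE
)

my_vector
my_vector2
```

vapply() is chosen over sapply() here because it fails loudly if any element returns something other than a single integer.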
This should work with a list of data frames or matrices
d <- list(a = matrix(rnorm(100), nrow = 20),
b = matrix(rnorm(100), nrow = 10),
c = matrix(rnorm(100), nrow = 50))
my_vect <- c()
for (i in seq_along(d)){
n <- nrow(d[[i]])
my_vect[i] <- n
}
my_vect
[1] 20 10 50
Use unlist() and if that doesn't work, then add as.vector() in your pipe:
for (i in test){
i<-i %>% distinct(FileNumber) %>% nrow() %>% unlist()
print(i)
}
If that does not come out as a vector then:
for (i in test){
i<-i %>% distinct(FileNumber) %>% nrow() %>% unlist() %>% as.vector()
print(i)
}
For example, I have a vector of functions: fun_vec <- c(step1,step2,step3).
Now I want to compose them like this: step1(step2(step3(x))). How do I do this using fun_vec? (Suppose that fun_vec isn't fixed and can have more or fewer functions.)
Similar to Frank's use of freduce, you can use Reduce:
step1 <- function(a) a^2
step2 <- function(a) sum(a)
step3 <- function(a) sqrt(a)
steps <- list(step1, step2, step3)
Reduce(function(a,f) f(a), steps, 1:3)
# [1] 3.741657
step3(step2(step1(1:3)))
# [1] 3.741657
You can see it "in action" with:
Reduce(function(a,f) f(a), steps, 1:3, accumulate=TRUE)
# [[1]]
# [1] 1 2 3
# [[2]]
# [1] 1 4 9
# [[3]]
# [1] 14
# [[4]]
# [1] 3.741657
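Note that the Reduce call above applies the functions left to right, that is step3(step2(step1(x))), whereas the question asks for step1(step2(step3(x))). A minimal sketch of the other order simply reverses the list first; compose_all is a hypothetical helper name, not part of base R.

```r
step1 <- function(a) a^2
step2 <- function(a) sum(a)
step3 <- function(a) sqrt(a)
steps <- list(step1, step2, step3)

# Hypothetical helper: returns a function that applies the LAST element of
# `funs` first, so compose_all(list(f, g, h))(x) is f(g(h(x))).
compose_all <- function(funs) {
  function(x) Reduce(function(acc, f) f(acc), rev(funs), x)
}

g <- compose_all(steps)
g(1:3)
step1(step2(step3(1:3)))  # same value, written out by hand
```

Because Reduce returns its initial value when given an empty list, compose_all(list()) behaves as the identity function, so the list may hold any number of functions.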
You can use freduce from the magrittr package:
fun_vec = c(function(x) x^2, function(x) sum(x), function(x) sqrt(x))
library(magrittr)
freduce(1:10, fun_vec)
Alternately, define a function sequence with pipes like...
library(magrittr)
f = . %>% raise_to_power(2) %>% sum %>% sqrt
f(1:10)
A similar example: Is there a way to `pipe through a list'?
Here's a base R recursive approach:
compose <- function(funs) {
n <- length(funs)
fcomp <- function(x) funs[[n - 1]](funs[[n]](x))
if (n > 2) compose(c(funs[1:(n - 2)], fcomp)) else fcomp
}
x <- c(sqrt, log, exp)
compose(x)(2)
# [1] 1.414214
sqrt(log(exp(2)))
# [1] 1.414214
If the number of functions in funs is greater than two, we shorten the list by one by replacing the last two functions by their composition. Otherwise, we return the composition of the last remaining two. It's assumed that initially there are at least two functions in funs.
Take a look at purrr::compose. If your functions are stored inside a list, use purrr::invoke to pass that list to compose (invoke is deprecated as of purrr 1.0.0; do.call(purrr::compose, fun_vec) is a base R equivalent):
fun_vec <- c( exp, log10, sqrt )
f <- purrr::invoke( purrr::compose, fun_vec )
f(4) # 1.35125
exp( log10( sqrt(4) ) ) # 1.35125
I would like to apply the same function a certain number of times to a vector, feeding the output of each call into the next.
A simplified example with a simple function just to demonstrate:
# sample vector
a <- c(1,2,3)
# function to be applied n times
f1 <- function(x) {
x^2 + x^3
}
I would like to apply f1 to a, n times; for example, let's say 3 times.
I heard purrr::reduce or purrr::map() might be a good idea for this but couldn't make it work.
The desired output if n = 3 would be equal to f1(f1(f1(a))).
Let's use Reduce (no external library requirements, generally good performance). I'll modify the function slightly to accept a second (ignored) argument:
f1 <- function(x, ign) x^2 + x^3
Reduce(f1, 1:3, init = a)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
Here's what's happening. Reduce:
uses a binary function to successively combine the elements of a given vector and a possibly given initial value.
The first argument is the function to use, and it should accept two arguments. The first is the value from the previous execution of the function in this reduction. On the first call of the function, it uses the init= value provided.
First call:
f1(c(1,2,3), 1)
# [1] 2 12 36
Second call:
f1(c(2,12,36), 2)
# [1] 12 1872 47952
Third call:
f1(c(12,1872,47952), 3)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
The second argument 1:3 is used just for its length. Anything of the proper length will work.
If you don't want to redefine f1 just for this reduction, you can always do
Reduce(function(a,ign) f1(a), ...)
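The pattern described above can be wrapped in a small helper so that the number of repetitions becomes a parameter. This is a minimal sketch; iterate is a hypothetical name, and f1 is the original one-argument version from the question.

```r
f1 <- function(x) x^2 + x^3
a <- c(1, 2, 3)

# Hypothetical helper: apply `f` to `x` exactly `n` times.
# seq_len(n) only sets the number of iterations; its values are ignored.
iterate <- function(f, x, n) {
  Reduce(function(acc, ign) f(acc), seq_len(n), x)
}

res <- iterate(f1, a, 3)
res

# accumulate = TRUE keeps every intermediate result, starting with `a` itself.
steps <- Reduce(function(acc, ign) f1(acc), seq_len(3), a, accumulate = TRUE)
length(steps)
```

Using seq_len(n) rather than 1:n means n = 0 returns x unchanged instead of iterating over c(1, 0).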
Benchmark:
library(microbenchmark)
library(purrr) # accumulate(), reduce()
library(magrittr) # %>%, extract2()
r <- Reduce(function(a,b) call("f1", a), 1:3, init=quote(a))
triple_f1 <- function(a) f1(f1(f1(a)))
microbenchmark::microbenchmark(
base = Reduce(function(a,ign) f1(a), 1:3, a),
accum = a %>% accumulate(~ .x %>% f1, .init = f1(a)) %>% extract2(3),
reduc = purrr::reduce(1:3, function(a,ign) f1(a), .init=a),
whil = {
i <- 1
a <- c(1,2,3)
while (i < 10) {
i <- i + 1
a <- f1(a)
}
},
forloop = {
out <- a
for(i in seq_len(3)) out <- f1(out)
},
evaluated = {
r <- Reduce(function(a,b) call("f1", a), 1:3, init=quote(a))
eval(r)
},
precompiled = eval(r),
anotherfun = triple_f1(a)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# base 5.101 7.3015 18.28691 9.3010 10.8510 848.302 100
# accum 294.201 328.4015 381.21204 356.1520 402.6510 823.602 100
# reduc 27.000 38.1005 57.55694 45.2510 54.2005 747.401 100
# whil 1717.300 1814.3510 1949.03100 1861.8510 1948.9510 2931.001 100
# forloop 1110.001 1167.1010 1369.87696 1205.5010 1292.6500 9935.501 100
# evaluated 6.702 10.2505 22.18598 13.3015 15.5510 715.301 100
# precompiled 2.300 3.2005 4.69090 4.0005 4.5010 26.800 100
# anotherfun 1.400 2.0515 12.85201 2.5010 3.3505 1017.801 100
i <- 1
while (i < 10) {
i <- i + 1
x <- f(x)
}
Here is an option with accumulate
library(tidyverse)
library(magrittr) # extract2() is not attached by library(tidyverse)
n <- 3
a %>%
accumulate(~ .x %>%
f1, .init = f1(a)) %>%
extract2(n)
#[1] 1.872000e+03 6.563711e+09 1.102629e+14
NOTE: accumulate is similar to the base R option Reduce with accumulate = TRUE
checking with the OP's output
f1(f1(f1(a)))
#[1] 1.872000e+03 6.563711e+09 1.102629e+14
Or use a for loop (no external libraries used)
out <- a
for(i in seq_len(n)) out <- f1(out)
out
#[1] 1.872000e+03 6.563711e+09 1.102629e+14
Here's another way to do it with Reduce:
setting the stage
a <- 1:3
f1 <- function(x) x^2 + x^3
constructing a call and evaluating it
N <- 3 # how many times?
r <- Reduce(function(a,b) call("f1", a), rep(NA, N), init=a)
# f1(f1(f1(1:3)))
eval(r)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
alternative 2
# N defined as above
Reduce(function(x,y) y(x), replicate(N,f1), init=a)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
alternative 3 (recursive with a global-like variable)
doit <- function(N) {
i <- 0
function(fun, x){
i <<- i +1
if(i < N) Recall(fun, fun(x)) else fun(x)
}
}
doit(3)(f1, a)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
... or even
doit <- function(N, fun, x) (function(fun, x)
if((N <<- N - 1) > 0)
Recall(fun, fun(x)) else
fun(x))(fun, x)
doit(3, f1, a)
# [1] 1.872000e+03 6.563711e+09 1.102629e+14
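The recursive variants above keep their counter in an enclosing environment and modify it with <<-. For comparison, a minimal sketch that passes the counter as an ordinary argument instead; apply_n is a hypothetical name.

```r
f1 <- function(x) x^2 + x^3
a <- 1:3

# Hypothetical helper: apply `fun` to `x` n times, with no shared state.
apply_n <- function(fun, x, n) {
  if (n <= 0) return(x)        # nothing left to apply
  apply_n(fun, fun(x), n - 1)  # apply once, then recurse on the result
}

apply_n(f1, a, 3)
```

Because no state survives between calls, the function can be called repeatedly with different n, at the cost of one stack frame per application.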
How can I use setdiff() in R to get the elements that are in one vector but not in the others? My example is as follows:
dat1 <- c("osa", "bli", "usd", "mnl")
dat2 <- c("mnu", "erd", "usd", "mnl")
dat3 <- c("ssu", "erd", "usd", "mnl")
The following code only returns what is different in dat1 compared to dat2 and dat3:
diffs <- Reduce(setdiff,
list(A = dat1,
B = dat2,
C = dat3
))
How can I modify this code to get all the elements that are uniquely present in one vector compared to the others? Thanks
another solution using setdiff :
myl <- list(A = dat1,
B = dat2,
C = dat3)
lapply(1:length(myl), function(n) setdiff(myl[[n]], unlist(myl[-n])))
[[1]]
[1] "osa" "bli"
[[2]]
[1] "mnu"
[[3]]
[1] "ssu"
a second possibility :
f <- function (...)
{
aux <- list(...)
ind <- rep(1:length(aux), sapply(aux, length))
x <- unlist(aux)
boo <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
split(x[boo], ind[boo])
}
f(dat1, dat2, dat3)
$`1`
[1] "osa" "bli"
$`2`
[1] "mnu"
$`3`
[1] "ssu"
Try this:
all.dat <- list(dat1, dat2, dat3)
from.dat <- rep(seq_along(all.dat), sapply(all.dat, length))
in.dat <- split(from.dat, unlist(all.dat))
in.one.dat <- in.dat[sapply(in.dat, length) == 1]
in.one.dat
# $bli
# [1] 1
# $mnu
# [1] 2
# $osa
# [1] 1
# $ssu
# [1] 3
which tells you what items are found in only one of the dat objects, and which one. If you only care for the names, then finish with: names(in.one.dat).
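As background on the original attempt: Reduce(setdiff, list(dat1, dat2, dat3)) evaluates setdiff(setdiff(dat1, dat2), dat3), which is why it only reports what is unique to dat1. A minimal sketch of a counting approach that keeps the result grouped by source vector follows; the names dat, counts, only_once and uniq are illustrative.

```r
dat1 <- c("osa", "bli", "usd", "mnl")
dat2 <- c("mnu", "erd", "usd", "mnl")
dat3 <- c("ssu", "erd", "usd", "mnl")
dat <- list(A = dat1, B = dat2, C = dat3)

# Count in how many vectors each element occurs; unique() guards against
# an element being repeated within a single vector.
counts <- table(unlist(lapply(dat, unique)))
only_once <- names(counts)[counts == 1]

# Keep, per vector, just the elements that occur in no other vector.
uniq <- lapply(dat, intersect, only_once)
uniq
```

The list names A, B and C are carried through, so each set of unique elements stays labelled with the vector it came from.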