Sampling iteratively without a for loop in R

Even though I think the issue I have may be simple, I nevertheless can't figure it out. Here's the thing:
I have the following list and vector. The list is used to fill up the vector:
probabilities = list(c(0.2, 0.3, 0.5), c(0.1, 0.1, 0.8), c(0.3, 0.4, 0.3))
nextState = c()
for (counter in 1:3) {
  nextState[counter] = sample(1:3, size = 1, prob = probabilities[[counter]])
}
The code works fine. However, when expanding to larger lists (>10,000 elements), the loop becomes aggravatingly slow. Since the loop above is used multiple times in the larger code, the time consumed is way too much. Would there be a way to achieve the same result without looping?
Additional question:
Thanks guys, you've been a big help. One additional question: how would you approach the same issue if the probabilities and nextState were interdependent? Meaning, how could I avoid the for loop in that case? Perhaps some code to clarify:
M <- list(matrix(c(0.1, 0.2, 0.7, 0.2, 0.2, 0.6, 0.3, 0.3, 0.4), nrow = 3, ncol = 3),
          matrix(c(0.3, 0.3, 0.4, 0.5, 0.5, 0, 0.1, 0.1, 0.8), nrow = 3, ncol = 3))
probabilities <- list()
nextState <- c(2, NA, NA)
for (i in 1:2) {
  probabilities[[i]] <- M[[i]][nextState[i], ]
  nextState[i + 1] <- sample(1:3, size = 1, prob = probabilities[[i]])
}
If you've got any idea, then you truly are miracle workers!!

try sapply
nextState <- sapply(probabilities, function(x) {sample(1:3, size = 1, prob = x)})
benchmarks
# Unit: microseconds
#   expr      min       lq      mean    median       uq      max neval
#    for 2115.170 2223.475 2436.0797 2283.2755 2371.546 10048.64   100
# sapply   24.704   29.524  164.0261   37.3565   41.123 12763.03   100
microbenchmark::microbenchmark(
  `for` = {
    nextState = c()
    for (counter in 1:3) {
      nextState[counter] = sample(1:3, size = 1, prob = probabilities[[counter]])
    }
  },
  sapply = sapply(probabilities, function(x) {sample(1:3, size = 1, prob = x)}),
  times = 100)

Another possibility, with the purrr package:
library(purrr)
nextState <- map_int(probabilities, function(x) {sample(1:3, size = 1, prob = x)})
Data:
probabilities = list(c(0.2, 0.3, 0.5), c(0.1, 0.1, 0.8), c(0.3,0.4,0.3))
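As for the follow-up question: when each draw depends on the previous state, the sampling is inherently sequential and cannot be fully vectorised. A Reduce()-based sketch (my own suggestion, not taken from the answers above) at least hides the loop, using the list of transition matrices M from the question:
# Walk the chain: each step draws the next state from the row of the
# current transition matrix indexed by the previous state.
nextState <- Reduce(
  function(state, m) sample(1:3, size = 1, prob = m[state, ]),
  M,
  init = 2,          # starting state, as in nextState <- c(2, NA, NA)
  accumulate = TRUE  # keep the whole path, not just the final state
)
nextState # length 3: the initial state plus one draw per matrix in M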

Related

How do I assign values to each other in triplets using R?

The situation is as follows:
I need to create a dataset of triplets. There is a discrete distribution of stock prices S <- c(80, 100, 120, 140, 160) with probabilities P <- c(0.2, 0.3, 0.2, 0.2, 0.1), a call option C <- pmax(S - 120, 0) = c(0, 0, 0, 20, 40), and the liability of an option which pays 30 if the price lands in a certain region and zero otherwise, namely L = 30 * I{110 <= S <= 150} = c(0, 0, 30, 30, 0). It is important to mention that the three values are linked: if S[1] = 80 is drawn, then C[1] and L[1] must come with it, and this holds for i = 1, 2, 3, 4, 5. How do you create a dataset of N = 10000 simulations where each draw for i carries the other two values for the same i?
This is the code I had for now. Note that X_1 = S, X_2 = C and Y = L.
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = T)
}
X_2 <- function(n) {
  sample(X_1 - 120, size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = T)
}
Y <- function(n) {
  sample(L, size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = T)
}
## Creating triplets ##
df <- data.frame(S_T = X_1(10000), C_T = X_2(10000), L_T = Y(10000))
df
I'm not sure if you want C_T to be dependent on the S_T values. If you do, I think you just want to call X_1, assign the results to an object, then use that as the argument to X_2 (or just subtract 120, which is what X_2 does).
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = T)
}
# Call that function
S_T <- X_1(10)   # for practice
C_T <- S_T - 120 # that's all you're doing in function X_2, if you want to use S_T
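One small caveat (my note, based on the question's definition of C): the call payoff has a floor at zero, so pmax() keeps the 80 and 100 draws from going negative:
C_T <- pmax(S_T - 120, 0) # floors the payoff at zero, matching C = c(0, 0, 0, 20, 40)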
If you want C_T to contain values independent of S_T, you can call one function from within another:
X_1 <- function(n) {
  sample(c(80, 100, 120, 140, 160), size = n,
         prob = c(0.2, 0.3, 0.2, 0.2, 0.1), replace = T)
}
X_2 <- function(n) {
  X_1(n) - 120
}
S_T <- X_1(10) # Same as above
C_T <- X_2(10) # Gives values not dependent on S_T
EDIT to address comment below:
It's hard to read the comment, but it looks like you want to create a function that takes the results of function X_1 and returns a result based on a condition. Use ifelse to check each element one at a time. You can create another function and then feed it the results of function X_1:
Y <- function(X_1_func){
  ifelse(X_1_func == 80,
         return(0),
         ifelse(X_1_func == 100,
                return(0),
                ifelse(X_1_func == 120,
                       return(30),
                       return(60) # Add a default value here, or the last possible value if the others are FALSE
                )
         )
  )
}
sapply(X_1(10), Y) # Use sapply to feed one element of X_1's output at a time; assign the result to L or whatever you wish to call it.
If this all works for you, you can accept the answer.
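For what it's worth, a compact alternative (my own sketch, not part of the answer above) is to sample the row index once and look everything up from it, which keeps S, C and L aligned by construction:
S <- c(80, 100, 120, 140, 160)
P <- c(0.2, 0.3, 0.2, 0.2, 0.1)
C <- pmax(S - 120, 0)   # c(0, 0, 0, 20, 40)
L <- c(0, 0, 30, 30, 0) # pays 30 when 110 <= S <= 150
idx <- sample(seq_along(S), size = 10000, prob = P, replace = TRUE)
df <- data.frame(S_T = S[idx], C_T = C[idx], L_T = L[idx])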

Applying function to data.frame

I have a function which looks like the following:
function(nsim = 10, maxN = 10000, mu = 0, sigma = 0.1, S0 = 100, endT = 1, K = 100){
  nsim + maxN + mu + sigma + S0 + endT + K
}
(The function here is just given for simplicity; the actual function is a simple Black-Scholes pricing model.)
Now, I have a data.frame:
df <- expand.grid(nsim = 10,
                  maxN = 10000,
                  mu = c(0.05, 0.10, 0.15),
                  sigma = c(0.2, 0.4, 0.6),
                  S0 = seq(80, 120, by = 1),
                  endT = c(0.25, 0.50, 0.75),
                  K = 100,
                  sim = sprintf("Sim.%s", 1:10))
This is just a grid of parameter combinations. Now the question is: how do I apply the previous function to the data set to compute a new column of values, using the column values from each row as input?
You can add a column with mutate:
library(dplyr)
my_function <- function(nsim = 10, maxN = 10000, mu = 0, sigma = 0.1, S0 = 100,
                        endT = 1, K = 100){
  nsim + maxN + mu + sigma + S0 + endT + K
}
df %>%
  mutate(new_c = my_function(nsim, maxN, mu, sigma, S0, endT, K))
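If the real Black-Scholes function is not vectorised over its arguments, mutate() alone would call it once on whole columns rather than once per row; a rowwise() variant (my addition, assuming a recent dplyr) evaluates it row by row instead:
df %>%
  rowwise() %>%                                                         # treat each row as its own group
  mutate(new_c = my_function(nsim, maxN, mu, sigma, S0, endT, K)) %>%   # one function call per row
  ungroup()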
You can use mapply:
apply_fun <- function(nsim = 10, maxN = 10000, mu = 0, sigma = 0.1, S0 = 100, endT = 1, K = 100){
  nsim + maxN + mu + sigma + S0 + endT + K
}
df$price <- mapply(apply_fun, df$nsim, df$maxN, df$mu, df$sigma, df$S0, df$endT, df$K)
If you don't want to write each argument separately you can also use apply with do.call.
df$price <- apply(df[-ncol(df)], 1, function(x) do.call(apply_fun, as.list(x)))
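A further option along the same lines (not from the answers above; arg_cols is just a helper name I made up) is purrr::pmap, which matches the data frame's columns to the function's arguments by name:
library(purrr)
arg_cols <- c("nsim", "maxN", "mu", "sigma", "S0", "endT", "K")
df$price <- pmap_dbl(df[arg_cols], apply_fun) # one call per row, arguments matched by name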

How to generate n MarkovChain sequences of 25 transitions each

I have a transition matrix "T" and would like to produce 20 different sequences of 25 states each.
I have the markovchain package and have tried the following:
lapply(1:20,markovchainSequence(n = 25, markovchain = T, t0 = "In"))
but it says that markovchainSequence is not a function. Is there a way around this, please?
A reproducible example would really help here, but I think this gets the job done! You may just need a bigger transition matrix.
library(markovchain)
set.seed(123)
statesNames <- c("a", "b", "c") # easier with three states
t <- new("markovchain", states = statesNames,
         transitionMatrix = matrix(c(0.2, 0.5, 0.3, 0, 0.2, 0.8, 0.1, 0.8, 0.1),
                                   nrow = 3, byrow = TRUE,
                                   dimnames = list(statesNames, statesNames)))
mchain <- function(n) {
  markovchainSequence(n = n, markovchain = t, t0 = "a")
}
lapply(rep(25, each = 20), mchain) # 20 sequences of 25 states each; change 25 to the desired length
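For completeness, the error in the question comes from lapply() being handed the result of markovchainSequence() rather than a function; wrapping the call in an anonymous function (sketched here with the answer's chain t) also works:
# lapply() needs a function as its second argument; the original code called
# markovchainSequence() immediately, so lapply() received its return value.
seqs <- lapply(1:20, function(i) {
  markovchainSequence(n = 25, markovchain = t, t0 = "a")
})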

R. lapply multinomial test to list of dataframes

I have a data frame A, which I split into a list of 100 data frames, each having 3 rows (In my real data each data frame has 500 rows). Here I show A with 2 elements of the list (row1-row3; row4-row6):
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
                prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
                count = c(24878, 33605, 12100, 25899, 34777, 13765))
# This is the list:
nest <- split(A, rep(1:2, each = 3))
I want to apply the multinomial test to each of these data frames and extract the p-value of each test. So far I have done this:
library(EMT)
fun <- function(x){
  multinomial.test(x$count,
                   prob = x$prob,
                   useChisq = FALSE, MonteCarlo = TRUE,
                   ntrial = 100, # n of withdrawals accomplished
                   atOnce = 100)
}
lapply(nest, fun)
However, I get:
"Error in multinomial.test(x$counts_set, prob = x$norm_genome, useChisq = F, :
Observations have to be stored in a vector, e.g. 'observed <- c(5,2,1)'"
Does anyone have a smarter way of doing this?
The results of split are created with the names 1, 2 and so on; that's why x$count in fun cannot access them. To make it simpler, you can combine the split elements into a list explicitly and then use lapply:
n <- c(0, 1, 2, 0, 1, 2)
prob <- c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1)
count <- c(24878, 33605, 12100, 25899, 34777, 13765)
A <- cbind.data.frame(n, prob, count)
nest <- split(A, rep(1:2, each = 3))
fun <- function(x){
  multinomial.test(x$count,
                   prob = x$prob,
                   useChisq = FALSE, MonteCarlo = TRUE,
                   ntrial = 100, # n of withdrawals accomplished
                   atOnce = 100)
}
# Create a list of the split elements
new_list <- list(nest$`1`, nest$`2`)
lapply(new_list, fun)
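Since the question also asks for the p-value of each test, they can be pulled out of the returned list afterwards (a sketch, relying on the p.value component that the dplyr answer below also uses):
res <- lapply(new_list, fun)
p_vals <- sapply(res, function(x) x$p.value) # one p-value per data frame
p_vals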
A solution with dplyr.
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
                prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
                count = c(43, 42, 9, 74, 82, 9))
library(dplyr)
nest <- A %>%
  mutate(pattern = rep(1:2, each = 3)) %>%
  group_by(pattern) %>%
  dplyr::summarize(mn_pvals = multinomial.test(count, prob)$p.value)
nest

Fastest way to select a valid range for raster data

Using R, I need to select the valid range for a given raster (from package raster) in the fastest possible way. I tried this:
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
r <- raster(ncol=100, nrow=100)
r[] <- runif(ncell(r))
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
#Benchmark!
compare <- microbenchmark(
  calc(r, fun = function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x) }),
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  times = 100)
autoplot(compare) #Reclassify is much faster, precompiling doesn't help much.
#Check they are the same...
identical(
  calc(r, fun = function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x) }),
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA))
) # TRUE
identical(
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8)
) # TRUE
The reclassify method is much faster, but I'm sure that it can be sped up more. How can I do so?
While the accepted answer to this question is true for the example raster, it is important to note that the fastest safe function is highly dependent on raster size: the functions h and i presented by @rengis are only faster with relatively small rasters (and relatively simple reclassifications). Just increasing the size of the raster r in the OP's example by an order of magnitude makes reclassify quicker:
# Code from OP @AF7
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)
#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)
# Functions from @rengis
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}
i <- cmpfun(h)
# Benchmark with larger raster (100k cells, vs 10k originally)
r <- raster(ncol = 1000, nrow = 100)
r[] <- runif(ncell(r))
compare <- microbenchmark(
  calc(r, fun = function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x) }),
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times = 100)
autoplot(compare)
The exact point when reclassify becomes quicker is dependent both on the number of the cells in the raster and on the complexity of the reclassification, but in this case the cross-over point is at about 50,000 cells (see below).
As the raster becomes even larger (or the calculation more complex), another way to speed up reclassification is using multi-threading, e.g. with the snow package:
# Reclassify, using clusterR to split the work across two threads
library(snow)
tryCatch({
  beginCluster(n = 2)
  clusterR(r, reclassify, args = list(rcl = c(-Inf, 0.2, NA, 0.8, Inf, NA)))
}, finally = endCluster())
Multi-threading involves even more overhead to set up, and so only makes sense with very large rasters and/or more complex calculations (in fact, I was surprised to note that it didn't come out as the best option under any of the conditions I tested below--perhaps with a more complex reclassification?).
To illustrate, I've plotted results from microbenchmark using the OP's setup at intervals up to 10 million cells (10 runs of each) below:
As a final note, compiling didn't make a difference at any of the tested sizes.
Here is one more way:
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}
i <- cmpfun(h)
identical(
  i(r, 0.2, 0.8),
  g(r, 0.2, 0.8)
)
#Benchmark!
compare <- microbenchmark(
  calc(r, fun = function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x) }),
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times = 100)
autoplot(compare)
Compiling doesn't help much in this instance.
You could gain some further speed-up by accessing the slots of the raster object directly with @ (although this is usually discouraged).
j <- function(r, min, max) {
  v <- r@data@values
  v[v < min | v > max] <- NA
  r@data@values <- v
  r
}
k <- cmpfun(j)
identical(
  j(r, 0.2, 0.8)[],
  g(r, 0.2, 0.8)[]
)
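A middle ground (my own sketch, not part of this answer) is to go through the package's values()/setValues() accessors, which perform the same vector replacement as h without reaching into the object's slots:
j2 <- function(r, min, max) {
  v <- values(r)               # read the cell values as a plain vector
  v[v < min | v > max] <- NA   # blank out everything outside the valid range
  setValues(r, v)              # write the vector back into a raster object
}
identical(j2(r, 0.2, 0.8)[], g(r, 0.2, 0.8)[])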
The raster package has a function for that: clamp. It is faster than g but slower than h and i because it has some overhead (safety) built in.
compare <- microbenchmark(
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  clamp(r, 0.2, 0.8),
  g(r, 0.2, 0.8),
  times = 100)
autoplot(compare)
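One caveat, as far as I recall the raster API: by default clamp() truncates out-of-range cells to the limits rather than setting them to NA; to reproduce the NA behaviour of reclassify above there is a useValues argument (check ?clamp in your installed version):
clamp(r, lower = 0.2, upper = 0.8, useValues = FALSE) # out-of-range cells become NA instead of being truncated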
