Let me talk you through my workflow:
General idea
Based on data in a data frame, select the appropriate distribution functions, combine them in all possible ways, and compute the mean of each combined distribution.
Starting position
I have a large data frame df. In this example it contains the variables var1, var2 and var3, which hold the data used to select the appropriate distribution function.
I have several distribution functions per variable:
var1_distr1 <- pdqr::as_d(function(x) dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x) dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x) dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x) dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x) dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x) dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x) dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x) dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x) dnorm(x, mean = 7, sd = 2))
Select the right distribution
Using dplyr's if_else() on each of the vars, I record the name of the appropriate distribution per case in a new column. The if_else() looks like this for var1 and is analogous for the other vars:
df$distr_var1 <- dplyr::if_else(df$info < 0, "var1_distr1",
                 dplyr::if_else(df$info > 0 & df$info < 100, "var1_distr2", "var1_distr3"))
This results in the following df:
df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
distr_var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
Combine distribution functions
To combine distribution functions into a new proportional distribution function, I have created this function, based on this question:
foo <- function(...){
  # set x values
  x <- seq(1, 10, by = 1)
  # create y values
  y <- 1L
  for (fun in list(...)) y <- y * fun(x)
  # create new PDF
  p <- data.frame(x, y)
  pdqr::new_d(p, type = "continuous")
}
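For instance, combining two of the distributions defined above and taking the mean of the result (new_d rescales the supplied y values into a proper density):
comb <- foo(var1_distr1, var2_distr2)
pdqr::summ_mean(comb)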
And I have stored the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
I would like to use the function foo in the df to generate proportional distributions for all combinations of distributions given in the df. So, for each case, the following combinations: var1_var2, var1_var3, var2_var3, var1_var2_var3.
Calculate mean over distributions
If I want to calculate a mean over the distributions individually, I can do this:
means <- sapply(PDFS, pdqr::summ_mean)
df$mean_var1 <- means[df$distr_var1]
Or:
df$mean_var2 <- sapply(mget(df$distr_var2), pdqr::summ_mean)
Both approaches work fine. But for the combinations var1_var2, var1_var3, var2_var3 and var1_var2_var3 I have not found a suitable approach, though I tried this:
df$var1_var2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$distr_var1, df$distr_var2))), pdqr::summ_mean)
I tried to avoid calling the functions by name by using a list, but things seem to get too complicated / nested to work nicely...
Question
How can I select the appropriate distributions given in distr_var1, distr_var2 and distr_var3, combine them using foo, and calculate the mean using pdqr::summ_mean?
I'm happy with all comments, also on the workflow in general.
A foreach loop works for me:
library(foreach)
df$var1_var2_mean <- foreach(i = 1:nrow(df), .combine = c) %do% {
  A <- get(df$distr_var1[i])
  B <- get(df$distr_var2[i])
  pdqr::summ_mean(foo(A, B))
}
And I need to do this for each combination. At least I got it working...
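A more compact alternative (a sketch, using the PDFS list and foo defined in the question) avoids get() by looking the distributions up by name:
df$var1_var2_mean <- mapply(
  function(a, b) pdqr::summ_mean(foo(PDFS[[a]], PDFS[[b]])),
  df$distr_var1, df$distr_var2
)
The same pattern extends to the three-way combination:
df$var1_var2_var3_mean <- mapply(
  function(a, b, c) pdqr::summ_mean(foo(PDFS[[a]], PDFS[[b]], PDFS[[c]])),
  df$distr_var1, df$distr_var2, df$distr_var3
)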
I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I expect the function to return a scalar, which is the value of the statistic I am calculating. Below is what I have done so far:
library(sandwich)
# Creating example data #
nrows=10000
ncols=40
n1=20
n2=20
df=data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# Function to evaluate on each row of df #
get_est = function(x){
  mod = lm(x ~ cov$group)
  vcov = vcovHC(mod)
  coef = as.numeric(mod$coefficients[2])
  se = sqrt(as.numeric(diag(vcov)[2]))
  stats = coef/se
  return(stats)
}
# Applying above function to full data #
t1=Sys.time()
estimates=apply(df, 1, function(x) get_est(x))
t2=Sys.time()-t1
# Time taken by apply function
Time difference of 32.10623 secs
Is there a way to significantly decrease the time taken to run get_est() on the full data? The main reason I need to speed up the computation on a single df is that I have 1000 more data frames with the same dimensions, and I have to apply this function to each row of each of these data frames. To illustrate, below is the broader situation I am dealing with:
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for(i in 1:1000){
  df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row and to each of data frame in df.list #
library(foreach)
library(doParallel)
registerDoParallel(cores = parallel::detectCores() - 1)
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = 'sandwich') %dopar% {
  cov = data.frame(group = as.factor(rep(c(1, 2), c(n1, n2))))
  est = apply(df.list[[j]], 1, function(x) get_est(x))
  return(est)
}
Even after parallelizing, it is taking hours to finish. My ultimate objective is to significantly cut down the time to obtain "all.est", which will contain 10000 rows and 1000 columns, where each column holds the estimates for the respective data set. Any help is much appreciated! Thanks in advance!
Your function get_est uses some "expensive" functions, such as lm, vcovHC, and so on. If you think of the OLS equation,
$$
\hat{\beta} = (X^TX)^{-1}X^Ty,
$$
then you can see that the design matrix is constant across your simulation, so the first part $(X^TX)^{-1}X^T$ doesn't change. To make use of this, I compute $(X^TX)^{-1}X^T$ once before starting the simulation. This approach then also requires computing the HC3 standard errors manually, using the formula (see e.g. here)
$$
\widehat{\text{Cov}}_{\text{HC3}}(\hat{\beta}) = (X^TX)^{-1}X^T \text{diag} \left[ \frac{e_i^2}{(1-h_{ii})^2} \right] X(X^TX)^{-1}.
$$
Everything except for the residuals is constant across your simulation iterations, so it can be precomputed. Once I implement these tricks, I achieve a speed up of roughly factor 50.
(Note: lm uses a QR decomposition, which could also be implemented similarly here. Maybe you can get an even bigger speed-up by parallelizing the code.)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov = data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# old function
get_est_old = function(x){
  mod = lm(x ~ cov$group)
  vcov = sandwich::vcovHC(mod)
  coef = as.numeric(mod$coefficients[2])
  se = sqrt(as.numeric(diag(vcov)[2]))
  stats = coef/se
  return(stats)
}
# new function
# first construct design matrix
X = matrix(c(rep(1, ncols), rep(0, ncols / 2), rep(1, ncols / 2)), ncol = 2)
# these quantities will be used below
inv = solve(crossprod(X)) %*% t(X)
h = diag(X %*% inv)
get_est_new = function(x){
  coef = inv %*% x
  resid = x - (X %*% coef)
  bread = (resid^2 / (1 - h)^2)[, 1]
  hc3 = inv %*% diag(bread) %*% t(inv)
  se = sqrt(hc3[2, 2])
  stats = coef[2, 1] / se
  return(stats)
}
# Applying above function to full data #
system.time({
estimates_old = apply(df, 1, function(x) get_est_old(x))
})
#> user system elapsed
#> 7.876 0.042 7.929
system.time({
estimates_new = apply(df, 1, function(x) get_est_new(x))
})
#> user system elapsed
#> 0.141 0.016 0.158
# check
all.equal(estimates_old, estimates_new)
#> [1] TRUE
Created on 2021-09-04 by the reprex package (v2.0.1)
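Going one step further (my own sketch, not part of the timings above): since inv, X, and h are fixed, the per-row apply() can itself be replaced by a few matrix products over all rows at once.
Y = t(as.matrix(df))                     # ncols x nrows: one column per row of df
coefs = inv %*% Y                        # 2 x nrows matrix of OLS coefficients
resids = Y - X %*% coefs                 # residuals for all rows at once
bread = resids^2 / (1 - h)^2             # HC3 weights, recycled column-wise
var_slope = colSums(inv[2, ]^2 * bread)  # (2,2) element of every HC3 matrix
estimates_vec = coefs[2, ] / sqrt(var_slope)
all.equal(unname(estimates_new), unname(estimates_vec))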
These posts could be of interest:
https://grantmcdermott.com/efficient-simulations-in-r/
https://grantmcdermott.com/simulations-remix-turn-up-the-base/
Please excuse the title for lack of a better phrase describing my question.
I'm running a cluster stability analysis function from the 'flexclust' package, which runs bootstrap sampling on your dataset and calculates the Rand index for each value of k (over a range which I get to specify).
The function lets you try multiple distance metrics and clustering methods, and I want to run the function for every distance & method combination and find the best k based on each k's mean + median.
I've basically written nested for loops, initializing a vector for each of the columns (name, distance metric, method, and best k), and calling data.frame() to stitch all of them together.
df = data.frame(matrix(rbinom(10*100, 1, .5), ncol = 4)) # random df for testing purposes
library(flexclust)
cl_stability <- function(df, df.name, k_low, k_high)
{
  cluster.distance = c("euclidean", "manhattan")
  cluster.method = c("kmeans", "hardcl", "neuralgas")
  n.combos = length(cluster.distance) * length(cluster.method)
  df.names = rep(df.name, n.combos)
  distances = character(n.combos)
  methods = character(n.combos)
  best.k.s = character(n.combos)
  j = 1
  for (dist in cluster.distance)
  {
    for (method in cluster.method)
    {
      ip = as.data.frame(bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
                                       FUN = "cclust", dist = dist, method = method)@rand)
      # this part runs fine when I run it outside of the function
      best.k.s[j] = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median)))
      distances[j] = dist
      methods[j] = method
      j = j + 1
    }
  }
  final = data.frame(df.names, distances, methods, best.k.s)
  return(final)
}
The expected result would be a data frame with 7 columns: name, distance metric, method, best k, 2nd best, 3rd best, and the worst k based on the mean + median criterion.
https://imgur.com/a/KpFM04m
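For the 2nd best, 3rd best, and worst k, a minimal sketch of the ranking step (assuming ip as computed inside the function above) orders the criterion instead of taking only its maximum:
crit <- apply(ip, 2, mean) + apply(ip, 2, median)
ranked <- names(sort(crit, decreasing = TRUE))
# ranked[1] is the best k, ranked[2] the 2nd best,
# ranked[3] the 3rd best, and ranked[length(ranked)] the worst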
I am trying to simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3. I want to then compute the mean of each sample and make a histogram of the sample means.
My current code is not giving me an error but I don't think it's right:
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
Any idea on how to tackle this? I am very very new to R!
You don't need a list in this case. It is a common mistake of new R users to use lists excessively.
observations <- matrix(rnorm(25000, mean=5, sd=3), 5000, 5)
means <- rowMeans(observations)
Now means is a vector of 5000 elements.
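To get the requested histogram of the sample means:
hist(means)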
You can actually do this without for loops. replicate can be used to create the 5000 samples. Then use sapply to return the mean of each sample. Wrap the sapply call in hist() to get the histogram of means.
dat = replicate(5000, rnorm(5,5,3), simplify=FALSE)
hist(sapply(dat, mean))
Or, if you want to save the means:
sample.means = sapply(dat,mean)
hist(sample.means)
I think your code is giving valid results. list(mode="vector",length=nrSamples) isn't doing what I think you intended (run it in the console and see what happens), but it works out because the first two list elements get overwritten in the loop.
Although there's no need to use loops here, just for illustration here are two modified versions of your code using loops:
# 1. Store random samples in a list
e <- vector("list", nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[[i]])
}

# 2. Store random samples in a matrix
e <- matrix(rep(NA, 5000*5), nrow = 5)
for (i in 1:nrSamples) {
  e[, i] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[, i])
}
Your code is fine (see below), but I would suggest you try the following:
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
Here, for each element of the sequence 1, 2, 3, ... nrSamples that I supply as the first argument, lapply executes a function with the given element of the sequence as its argument (i.e. x). The function that I have supplied does not depend on x, however, so it is just replicated 5000 times, and the output is stored in a list (this is what lapply does). It is an easy way to avoid loops in situations like these. Needless to say, you could also just run
yourmeans <- sapply(1:nrSamples, function(x) mean(rnorm(n=5, mean = 5, sd = 3)))
Apart from the means, the latter does not store your results though, which may not be what you want. Also note that I call sapply to return a vector, which you can then use to plot your histogram, using e.g. hist(yourmeans).
To show that your code is fine, consider the following:
set.seed(42)
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
sample_means[i] <- mean(e[[i]])
}
set.seed(42)
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
all.equal(as.vector(sample_means), yourmeans)
[1] TRUE
Here, I set the seed to the random number generator to make sure that the random numbers are the same. As you see, your code works fine, though as others have pointed out, loops can easily be avoided.
I have a list of data.frames, and would like to operate on their columns, using various weights.
For example, subtracting the first column from the second column (solved, see below); or subtracting the first and third from twice the second (unsolved).
Thanks to the generous help obtained in response to this question, I have a solution to the problem in two dimensions, without weights, using Reduce.
I would like to have the flexibility to operate with weights, and in higher dimensions.
What I have so far is:
priceList <- data.frame(aaa = rnorm(100, 100, 10), bbb = rnorm(100, 100, 10),
ccc = rnorm(100, 100, 10), ddd = rnorm(100, 100, 10),
eee = rnorm(100, 100, 10), fff = rnorm(100, 100, 10),
ggg = rnorm(100, 100, 10)
)
colDiff <- function(x)
{
  Reduce('-', rev(x))
}
tradeLegsList <- combn(names(priceList), 3, function(x) priceList[x], simplify = FALSE)
tradeList <- lapply(tradeLegsList, colDiff)
From what I can tell, Reduce is not designed to take multiple arguments.
I can do this the long way with 2 * tradeLegsList[[1]]$bbb - tradeLegsList[[1]]$aaa - tradeLegsList[[1]]$ccc and some loops, but it doesn't seem like the R way.
Is there a way to pass in a weight vector?
Ideally, I would like to pass an argument such as w = c(-1, 2, -1) to the colDiff (or Reduce) function ... or something similar.
True, Reduce is not geared to allow multiple arguments, just two for each reduction. Therefore it is easiest to premultiply the elements in the list you are Reduce-ing.
Below is a solution that does this using mapply within your colDiff function definition.
Change your definition of colDiff to allow a weight vector, and apply this using mapply with SIMPLIFY = F.
EDIT
In light of the comments, the weighting depends on the number of columns, and there is no need for the rev.
The weighting by length:
length(x) == 1 -> w = 1
length(x) == 2 -> w = c(-1, 1)
length(x) == 3 -> w = c(-1, 2, -1)
length(x) == 4 -> w = c(-1, 1, -1, 1)
weighting <- function(i){
  switch(i, 1, c(-1, 1), c(-1, 2, -1), c(-1, 1, -1, 1))
}
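For instance:
weighting(3)
#> [1] -1  2 -1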
colDiff <- function(x)
{
  w = weighting(length(x))
  Reduce('+', mapply('*', x, e2 = w, SIMPLIFY = F))
}
Then something like this would work
tradeList <- lapply(tradeLegsList, colDiff)
You could also keep with the functional programming theme and use Map, which is a simple wrapper for mapply with SIMPLIFY = F:
colDiff <- function(x)
{
  w = weighting(length(x))
  Reduce('+', Map('*', x, e2 = w))
}
You could also predefine the weighting within the function colDiff (which may be easier): weighting[[2]] is the weighting for when there are 2 columns, weighting[[3]] when there are 3.
colDiff <- function(x)
{
  weighting <- list(1, c(-1, 1), c(-1, 2, -1), c(-1, 1, -1, 1))
  w = weighting[[length(x)]]
  Reduce('+', Map('*', x, e2 = w))
}
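A quick check (using priceList and tradeLegsList from the question) that the weighted colDiff reproduces the long-hand calculation for three legs:
tradeLegsList <- combn(names(priceList), 3, function(x) priceList[x], simplify = FALSE)
manual <- 2 * tradeLegsList[[1]]$bbb - tradeLegsList[[1]]$aaa - tradeLegsList[[1]]$ccc
all.equal(colDiff(tradeLegsList[[1]]), manual)
#> [1] TRUE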