Add multiple RNG columns to a tibble - r

db %>%
add_column(X1 = rnorm(nrow(db),0,sample(SD,1)),
X2 = rnorm(nrow(db),0,sample(SD,1))
)
I would like to generate 100 such random columns automatically, with the sd of each column being a separate draw from SD.

1 - You can use the dplyover package.
An example with mtcars as db and 10 different SD values:
library(tidyverse)
db <- as_tibble(mtcars)
SD <- 1:10
n_col <- 100
sds <- sample(SD, n_col, replace = TRUE)
names(sds) <- paste0("X", seq_len(n_col))
mutate(db, dplyover::over(sds, ~ rnorm(n(), sd = .x)))
2 - Alternatively, you can create the randomly generated matrix separately and bind it to db afterwards:
mat_rng <- as_tibble(lapply(sds, \(x) rnorm(nrow(db), sd = x)))
# or, as helpfully suggested by @Adam
# mat_rng <- map_dfc(sds, rnorm, n = nrow(db), mean = 0)
bind_cols(db, mat_rng)
(The lapply() version works only with R >= 4.1.0 because it uses the new \(x) anonymous function syntax.)
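For readers who have neither dplyover nor purrr available, the same idea can be sketched in base R with an ordinary function(s) closure instead of the \(x) shorthand. This is only a minimal sketch: mtcars stands in for db, SD is assumed to be 1:10 as in the answer above, and the seed and the name db_out are illustrative choices.

```r
set.seed(1)                       # illustrative seed, only for reproducibility

db    <- mtcars                   # stand-in for the real data
SD    <- 1:10                     # pool of candidate standard deviations
n_col <- 100                      # number of random columns to add

# one independent SD draw per new column
sds <- sample(SD, n_col, replace = TRUE)

# one vector of nrow(db) normal draws per sampled SD
rng_cols <- lapply(sds, function(s) rnorm(nrow(db), mean = 0, sd = s))
names(rng_cols) <- paste0("X", seq_len(n_col))

# bind the new columns to the original data
db_out <- cbind(db, as.data.frame(rng_cols))
dim(db_out)
```

Because sds is drawn once up front, the SD used for each column can be inspected afterwards, which is harder when sample(SD, 1) is called inline for every column.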

Related

Error: Unable to apply package function to each row in R

I am trying to apply the fm.Choquet function (Rfmtool package) to my R data frame, but without success. The function works like this (ref. here):
# let x <- 0.6 (N = 1)
# and y <- c(0.3, 0.5); y always has 2^N elements (here, 2)
# env <- fm.Init(1); env is proportional to N
# fm.Choquet(0.6, c(0.3, 0.5), env) gives a single value as output
I have this sample data frame:
set.seed(123456)
a <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
b <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
c <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
df <- data.frame(a=a, b=b, c=c)
df$id <- seq_len(nrow(df))
I would like to apply fm.Choquet function to each row of my df such that, for each row (or ID), a is read as x, while b and c are read as y vector (N = 2), and add the function output as a new column for each row. However, I am getting the dimension error "The environment mismatches the dimension to the fuzzy measure.".
Here is my attempt.
df2 <- df %>% as_tibble() %>%
rowwise() %>%
mutate(ci = fm.Choquet(df$a,c(df[,2],df[,3]), env)) %>%
mutate(sum = rowSums(across(where(is.numeric)))) %>% # Also tried adding sum which works
as.matrix()
I am using dplyr::rowwise(), but I am open to looping or other suggestions. Can someone help me?
EDIT 1:
A related question was suggested as a possible solution for the above, but one of its suggestions, by(), still throws the same error:
by(df, seq_len(nrow(df)), function(row) fm.Choquet(df$a,c(df$b,df$c), env))
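Both attempts pass the whole columns (df$a, df$b, df$c) to fm.Choquet on every iteration rather than the values of the current row, which would explain the dimension mismatch with env. Take one row at a time and pass only that row's values:
set.seed(123456)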
a <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
b <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
c <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
df <- data.frame(a = a, b = b, c = c)
df$id <- seq_len(nrow(df))
library(Rfmtool)
library(tidyverse)
env <- fm.Init(1)
map_dbl(
seq_len(nrow(df)),
~ {
row <- slice(df,.x)
fm.Choquet(
x = row$a,
v = c(row$b, row$c), env
)
}
)

Vectorised argument for a function in R. The function gives out multiple data frames, whereas I'd like it to output only one

I'd like to compute the trimmed mean for each trimming proportion alpha and then see which trimming proportion gives the minimal variance of the trimmed means when bootstrap simulations of size N=200 are applied. The problem is that when I try to create a data frame with column1 = mean and column2 = variance, my code returns each mean/variance pair as a separate data frame, so I cannot look up the trimming proportion and trimmed mean that have the minimal variance.
The function gives out "data.frame" 9 times. I guess it's because the alpha argument is vectorized. The code:
tmean_var <- function(n,N,alpha){
set.seed(1)
data <- rnorm(n)
data_aug1 <- c(data, -data)
data_aug2 <- c(data, 2 * median(data) - data)
est <- data.frame()
tmean <- replicate(N, {
sample <- base::mean(sample(x = data[(round(alpha*n)+1):(n-round(alpha*n))],
size = n-2*round(alpha*n), replace = TRUE))
})
mean <- base::mean(tmean)
var <- var(tmean) * (n-2*round(alpha * n))
df <- data.frame(mean = mean, var = var)
class(df)
}
f <- Vectorize(tmean_var, vectorize.args = "alpha")
f(n,N,alpha)
How could I make the output one data frame rather than nine?
This should do it. Rather than try to use Vectorize() on a function that doesn't inherently take vector arguments, you could just use sapply() and lapply() across the values of alpha you provide as below:
tmean_var <- function(n,N,alpha){
set.seed(1)
data <- rnorm(n)
data_aug1 <- c(data, -data)
data_aug2 <- c(data, 2 * median(data) - data)
est <- data.frame()
tmean <- lapply(alpha, function(a){replicate(N, {
sample <- base::mean(sample(x = data[(round(a*n)+1):(n-round(a*n))],
size = n-2*round(a*n), replace = TRUE))
})
})
mean <- sapply(tmean, mean)
var <- sapply(seq_along(tmean), function(i)var(tmean[[i]]) * (n-2*round(alpha[i] * n)))
df <- data.frame(mean = mean, var = var, alpha=alpha)
# class(df)
}
out <- tmean_var(100, 200, c(.1, .2, .3))
out
#> mean var alpha
#> 1 0.10555709 0.8066377 0.1
#> 2 0.06868891 0.8331401 0.2
#> 3 0.21791984 0.9024612 0.3
Created on 2022-05-13 by the reprex package (v2.0.1)
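With everything in a single data frame, the original goal of picking the trimming proportion with the smallest variance reduces to a which.min() lookup. A minimal sketch, which re-enters the three rows printed above by hand rather than rerunning the bootstrap (the name best is illustrative):

```r
# values copied from the reprex output above
out <- data.frame(
  mean  = c(0.10555709, 0.06868891, 0.21791984),
  var   = c(0.8066377, 0.8331401, 0.9024612),
  alpha = c(0.1, 0.2, 0.3)
)

# row with the minimal variance across the trimming proportions
best <- out[which.min(out$var), ]
best
```

For these three values the selected row is the one with alpha = 0.1. In practice the lookup would be applied directly to the data frame returned by tmean_var().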

How to bootstrap weighted mean in a loop in r

I would like to run a bootstrap of a weighted mean in a for loop (I don’t think I can use ‘apply’ because it concerns a weighted mean). I only need to store the resulting standard errors in a dataframe. Another post provided the code for calculating the weighted mean in a bootstrap (bootstrap weighted mean in R), and it works perfectly:
library(boot)
mtcarsdata = mtcars #dataframe for data
mtcarsweights = rev(mtcars) #dataframe for weights
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
results_qsec <- sd(boot(data= mtcarsdata[, 6, drop = FALSE],
statistic = samplewmean,
R=10000,
j = mtcarsweights[, 6 , drop = FALSE])[[2]], na.rm=T)
results_qsec
To then run it in a loop, I tried:
outputboot = matrix(NA, nrow=11, ncol=1)
for (k in 1:11){
outputboot[1,k] = sd(boot(data= mtcarsdata[, k, drop = FALSE],
statistic = samplewmean,
R=10000,
j = mtcarsweights[, k, drop = FALSE])[[2]], na.rm=T)
}
outputboot
But this doesn't work. The first output isn’t even correct. I suspect the code can’t work with two iterators: one for looping over the columns and the other for the sampling with replacement.
I hope someone can offer some help.
This calculates the standard deviation of the bootstrap replicates for each column of mtcarsdata, weighted by the matching column of mtcarsweights.
Since the result for each column can be calculated in one step, we can use apply and friends (here: purrr::map_dbl).
library(boot)
library(purrr)
set.seed(1337)
mtcarsdata <- mtcars # dataframe for data
mtcarsweights <- rev(mtcars) # dataframe for weights
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
mtcarsdata %>%
ncol() %>%
seq() %>%
map_dbl(~ {
# .x is the number of the current column
sd(boot(
data = mtcarsdata[, .x, drop = FALSE],
statistic = samplewmean,
R = 10000,
j = mtcarsweights[, .x, drop = FALSE]
)[[2]], na.rm = T)
})
#> [1] 0.90394218 0.31495232 23.93790468 6.34068205 0.09460257 0.19103196
#> [7] 0.33131814 0.07487754 0.07745781 0.13477355 0.27240347
Created on 2021-12-10 by the reprex package (v2.0.1)
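If a plain for loop is preferred, one likely culprit in the question's version is the index: outputboot has 11 rows and 1 column, so outputboot[1, k] points outside the matrix for k >= 2, whereas outputboot[k, 1] matches its shape. The sketch below makes only that change. It assumes a smaller replicate count (n_boot, an illustrative name) just to keep it quick, and the row and column labels are added for readability.

```r
library(boot)

mtcarsdata    <- mtcars        # dataframe for data
mtcarsweights <- rev(mtcars)   # dataframe for weights

samplewmean <- function(d, i, j) {
  d <- d[i, ]                  # resampled values (single column drops to a vector)
  w <- j[i, ]                  # weights for the same resampled rows
  weighted.mean(d, w)
}

set.seed(1337)
n_boot <- 1000                 # assumption: fewer replicates than the 10000 used above

# one row per column of mtcarsdata, one column for the standard error
outputboot <- matrix(NA_real_, nrow = ncol(mtcarsdata), ncol = 1,
                     dimnames = list(names(mtcarsdata), "se"))

for (k in seq_len(ncol(mtcarsdata))) {
  b <- boot(data = mtcarsdata[, k, drop = FALSE],
            statistic = samplewmean,
            R = n_boot,
            j = mtcarsweights[, k, drop = FALSE])
  outputboot[k, 1] <- sd(b$t, na.rm = TRUE)   # row k, not column k
}
outputboot
```

The two "iterators" do not interfere with each other: k selects the column in the outer loop, while i is supplied by boot() inside samplewmean for each resample.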

Calculations across dataframes

I have two lists of dataframes: the dfs in the first list hold values that extend down the column, and the dfs in the second list hold single values, like this:
dynamic_df_1 <- data.frame(x = 1:10)
dynamic_df_2 <- data.frame(y = 1:10)
df_list <- list(dynamic_df_1, dynamic_df_2)
df_list
static_df_1 <- data.frame(mu = 10,
stdev = 5)
static_df_2 <- data.frame(mu = 12,
stdev = 6)
static_df_list <- list(stat_df1 = static_df_1,
stat_df2 = static_df_2)
static_df_list
I would like to add a column to each dataframe (dynamic_df_1 and dynamic_df_2) using values from static_df_1 and static_df_2, so that the calculation for dynamic_df_1 uses static_df_1 and the calculation for dynamic_df_2 uses static_df_2.
The result I'm aiming for is this:
df_list[[1]] <- df_list[[1]] %>%
mutate(z = dnorm(x = df_list[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list
df_list[[2]] <- df_list[[2]] %>%
mutate(z = dnorm(x = df_list[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list
I can take a loop approach which gets messy with more complex functions in my real code:
for (i in 1:length(df_list)) {
df_list[[i]]$z <- dnorm(x = df_list[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
df_list
I'm trying to find an lapply / map / mutate type solution that calculates across dataframes - imagine a grid of dataframes where the objective is to calculate across rows. Also open to other solutions such as single df with nested values but haven't figured out how to do that yet.
Hope that is clear - I did my best!
Thanks!
This Map solution seems simpler, and its results are identical() to those of the dplyr pipe and the for loop. The code that creates df_list2 and df_list3 follows below.
df_list4 <- df_list
fun <- function(DF, Static_DF){
DF[["z"]] = dnorm(DF[[1]], mean = Static_DF[["mu"]], sd = Static_DF[["stdev"]])
DF
}
df_list4 <- Map(fun, df_list4, static_df_list)
identical(df_list2, df_list3)
#[1] TRUE
identical(df_list2, df_list4)
#[1] TRUE
Data.
After running the question's code that creates the initial df_list, run the dplyr pipe and for loop code:
df_list2 <- df_list
df_list2[[1]] <- df_list2[[1]] %>%
mutate(z = dnorm(x = df_list2[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list2[[2]] <- df_list2[[2]] %>%
mutate(z = dnorm(x = df_list2[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list3 <- df_list
for (i in 1:length(df_list3)) {
df_list3[[i]]$z <- dnorm(x = df_list3[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}

Bootstrap CI for several variables of column in dataframe

I would like to bootstrap confidence intervals for a proportion from a data.frame. I would like to get the results for the variables in one of my columns.
I have managed to perform the bootstrap for a vector but do not know how to scale it up to a data.frame from here.
A simplified example, setting a threshold value of 10 and looking at the proportion of values below 10 in the data, follows.
Vector solution:
library(boot)
vec <- abs(rnorm(1000)*10) #generate example vector
data_to_tb <- vec
tb <- function(data) {
sum(data < 10, na.rm = FALSE)/length(data) #function for generating the proportion
}
tb(data_to_tb)
boot.out <- boot(data = data_to_tb, function(u,i) tb(u[i]), R = 999)
quantile(boot.out$t, c(.025,.975))
And from here I would like to do the same for a data.frame containing two columns.
I would like to return the result in a "summarized" data.frame if possible, with columns (x, n, proportion, CI):
x n proportion CI
A xx xx xx
B xx xx xx
C xx xx xx
It would be a bonus if the dplyr package could be used.
Here is a simplified example of my data:
Example:
dataframe <- data.frame(x = sample(c("A","B","C"),100,replace = TRUE), vec =abs(rnorm(100)*10))
head(dataframe)
## x vec
## 1 B 0.06735163
## 2 C 0.48612358
## 3 B 2.34190635
## 4 C 0.36393262
## 5 A 7.99762969
## 6 B 1.43293330
You can use group_by and summarise from dplyr to achieve the desired result. See below for the code.
# load required packages (tb() is the function defined in the question)
library(boot)  # provides boot(), used inside CIfun
library(dplyr)
# function to calculate the confidence interval
CIfun <- function(v, probs = c(.025, .975)) {
quantile(boot(data = v, function(u,i) tb(u[i]), R = 999)$t, probs)
}
# using summarise from dplyr
dataframe %>% group_by(x) %>%
summarise(n = n(),
proportion = tb(vec),
`2.5%` = CIfun(vec, .025),
`97.5%`= CIfun(vec, .975))
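Note that calling CIfun twice runs two separate bootstraps per group, so the lower and upper limits come from different resamples. A minimal base R sketch of an alternative that runs one bootstrap per group and reads both quantiles from it is shown below. The helper name boot_ci_one_group, the column names lower and upper, and the seed are illustrative assumptions, and the example data are regenerated rather than taken from the question.

```r
library(boot)

set.seed(42)  # illustrative seed
dataframe <- data.frame(
  x   = sample(c("A", "B", "C"), 100, replace = TRUE),
  vec = abs(rnorm(100) * 10)
)

# proportion of values below the threshold of 10 (as in the question)
tb <- function(data) sum(data < 10, na.rm = FALSE) / length(data)

# hypothetical helper: one bootstrap per group, both limits from the same run
boot_ci_one_group <- function(v, R = 999) {
  b  <- boot(data = v, statistic = function(u, i) tb(u[i]), R = R)
  ci <- quantile(b$t, c(.025, .975), names = FALSE)
  data.frame(n = length(v), proportion = tb(v), lower = ci[1], upper = ci[2])
}

# split the values by group, summarise each group, and stack the rows
pieces <- lapply(split(dataframe$vec, dataframe$x), boot_ci_one_group)
result <- cbind(x = names(pieces), do.call(rbind, pieces))
rownames(result) <- NULL
result
```

The result has one row per level of x with the columns x, n, proportion, lower and upper, which matches the summarized layout asked for in the question.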
