I would like to bootstrap confidence intervals for a proportion from a data.frame, with one result for each level of a grouping column.
I have managed to perform the bootstrap for a vector, but I do not know how to scale it up to a data.frame.
Here is a simplified example that sets a threshold of 10 and looks at the proportion of values below 10 in the data.
Vector solution:
library(boot)
vec <- abs(rnorm(1000)*10) #generate example vector
data_to_tb <- vec
tb <- function(data) {
sum(data < 10, na.rm = FALSE)/length(data) #function for generating the proportion
}
tb(data_to_tb)
boot.out <- boot(data = data_to_tb, function(u,i) tb(u[i]), R = 999)
quantile(boot.out$t, c(.025,.975))
From here I would like to do the same for a data.frame containing two columns.
I would like to return the result in a summarized data.frame if possible, with columns (x, n, proportion, CI):
x n proportion CI
A xx xx xx
B xx xx xx
C xx xx xx
It would be a bonus if the dplyr package could be used.
Here is a simplified example of my data:
dataframe <- data.frame(x = sample(c("A","B","C"),100,replace = TRUE), vec =abs(rnorm(100)*10))
head(dataframe)
## x vec
## 1 B 0.06735163
## 2 C 0.48612358
## 3 B 2.34190635
## 4 C 0.36393262
## 5 A 7.99762969
## 6 B 1.43293330
You can use group_by() and summarise() from dplyr to achieve the desired result. Note that each CIfun() call below runs its own bootstrap, so the two limits come from different sets of resamples.
# load required packages
library(boot)
library(dplyr)
# function to calculate the confidence interval (reuses tb() from the question)
CIfun <- function(v, probs = c(.025, .975)) {
quantile(boot(data = v, function(u,i) tb(u[i]), R = 999)$t, probs)
}
# using summarise from dplyr
dataframe %>% group_by(x) %>%
summarise(n = n(),
proportion = tb(vec),
`2.5%` = CIfun(vec, .025),
`97.5%`= CIfun(vec, .975))
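As a minimal sketch of the same idea in base R only, the following runs one percentile bootstrap per group so that both limits come from the same resamples. It does not use boot or dplyr. The helper names prop_below and boot_ci and the output columns lower and upper are illustrative assumptions, not part of the original answer.

```r
set.seed(42)
dataframe <- data.frame(
  x   = sample(c("A", "B", "C"), 100, replace = TRUE),
  vec = abs(rnorm(100) * 10)
)

# Proportion of values below a threshold (same statistic as tb() in the question).
prop_below <- function(v, threshold = 10) mean(v < threshold)

# Percentile bootstrap (hypothetical helper): resample R times, then take both
# quantiles from the same set of bootstrap statistics.
boot_ci <- function(v, R = 999, probs = c(0.025, 0.975)) {
  stats <- replicate(R, prop_below(v[sample.int(length(v), replace = TRUE)]))
  quantile(stats, probs, names = FALSE)
}

# One summary row per level of x.
per_group <- lapply(split(dataframe$vec, dataframe$x), function(v) {
  ci <- boot_ci(v)
  data.frame(n = length(v), proportion = prop_below(v),
             lower = ci[1], upper = ci[2])
})
result <- cbind(x = names(per_group), do.call(rbind, per_group))
rownames(result) <- NULL
result
```

The result has one row per group with the sample size, the observed proportion, and the two interval limits.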
I am trying to apply the fm.Choquet function (Rfmtool package) to each row of my R data frame, without success. The function works like this:
# let x <- 0.6 (N = 1)
# and y <- c(0.3, 0.5). y always has 2^N elements (here, 2)
# env <- fm.Init(1). env is proportional to N
# fm.Choquet(0.6, c(0.3, 0.5), env) gives a single value as output
I have this sample data frame:
set.seed(123456)
a <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
b <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
c <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
df <- data.frame(a=a, b=b, c=c)
df$id <- seq_len(nrow(df))
I would like to apply fm.Choquet function to each row of my df such that, for each row (or ID), a is read as x, while b and c are read as y vector (N = 2), and add the function output as a new column for each row. However, I am getting the dimension error "The environment mismatches the dimension to the fuzzy measure.".
Here is my attempt.
df2 <- df %>% as_tibble() %>%
rowwise() %>%
mutate(ci = fm.Choquet(df$a,c(df[,2],df[,3]), env)) %>%
mutate(sum = rowSums(across(where(is.numeric)))) %>% # Also tried adding sum which works
as.matrix()
I am using dplyr::rowwise(), but I am open to looping or other suggestions. Can someone help me?
EDIT 1:
A relevant question is identified as a possible solution for the above question, but using one of the suggestions, by(), still throws the same error:
by(df, seq_len(nrow(df)), function(row) fm.Choquet(df$a,c(df$b,df$c), env))
set.seed(123456)
a <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
b <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
c <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
df <- data.frame(a = a, b = b, c = c)
df$id <- seq_len(nrow(df))
library(Rfmtool)
library(tidyverse)
env <- fm.Init(1)
map_dbl(
seq_len(nrow(df)),
~ {
row <- slice(df,.x)
fm.Choquet(
x = row$a,
v = c(row$b, row$c), env
)
}
)
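The error in the question most likely arises because df$a inside mutate() or by() refers to the whole column of 30 values rather than to the current row. A minimal base R sketch of the row-wise pattern with mapply() follows. Because Rfmtool may not be installed, toy_fun is a hypothetical stand-in for fm.Choquet that only checks the argument lengths. With the real package, the inner call would be fm.Choquet(a, c(b, c), env) as in the answer above.

```r
# Hypothetical stand-in for fm.Choquet(x, v, env): it expects one x value and
# a vector v of length 2, and returns a single number.
toy_fun <- function(x, v) {
  stopifnot(length(x) == 1, length(v) == 2)
  x * (v[2] - v[1])
}

set.seed(123456)
df <- data.frame(a = runif(5), b = runif(5), c = runif(5))

# mapply() walks the three columns in parallel, so each call sees one row's
# scalars instead of whole columns.
df$ci <- mapply(function(a, b, c) toy_fun(a, c(b, c)), df$a, df$b, df$c)
df
```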
I have a data frame describing a population of particles by size. The first column holds the size (x value) and the other columns hold the density (y values) at that size. I need to calculate the median size for each density column.
Since median() expects raw observations rather than histogram-style data, I transform my dataset by repeating each value of the first column N times, where N is the count in the given column for that row. This works, but it is really slow with my 1200-row data frames, so I wonder if there is a more efficient solution.
df <- data.frame(Size = c(1:100),
val1 = sample(0:9,100,replace = TRUE),
val2 = sample(0:9,100,replace = TRUE))
get.median <- function(dataset){
results <- list()
for(col in colnames(dataset)[2:ncol(dataset)]){
col.results <- c()
for(i in 1:nrow(dataset)){
size <- dataset[i,"Size"]
count <- dataset[i,col]
out <- rep(size,count)
col.results <- c(col.results,out)
}
med <- median(col.results)
results <- append(results,med)
}
return(results)
}
get.median(df)
Without transforming:
lapply(df[,2:3], function(y) median(rep(df$Size, times = y)))
$val1
[1] 49
$val2
[1] 47
data:
set.seed(99)
df <- data.frame(Size = c(1:100),
val1 = sample(0:9,100,replace = TRUE),
val2 = sample(0:9,100,replace = TRUE))
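You can use sapply and median to calculate the median of each column like this. Note that this returns the median of the raw values in every column (including Size), not the median of Size weighted by the counts: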
sapply(df, median)
Output:
Size val1 val2
50.5 6.0 3.5
Using weighted.median() from the "spatstat" library with dplyr::across. Note that the result can differ from median(rep(...)):
> df %>% summarize(across(-Size, ~weighted.median(Size,.x,na.rm = TRUE)))
val1 val2
1 42.5 47.5
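The rep()-based approach still materializes every repeated value. As a sketch of an alternative, the median of frequency data can be read off the cumulative counts without expanding anything. The helper name freq_median is an assumption for illustration. It is written to give the same value as median(rep(values, times = counts)), including averaging the two middle values when the total count is even.

```r
# Median of `values` where each value occurs `counts` times, computed from
# cumulative counts instead of an expanded vector (hypothetical helper).
freq_median <- function(values, counts) {
  ord <- order(values)
  values <- values[ord]
  counts <- counts[ord]
  cum <- cumsum(counts)
  total <- cum[length(cum)]
  # Positions of the middle observation(s) in the (virtual) expanded vector.
  k_lo <- floor((total + 1) / 2)
  k_hi <- ceiling((total + 1) / 2)
  lo <- values[which(cum >= k_lo)[1]]
  hi <- values[which(cum >= k_hi)[1]]
  (lo + hi) / 2
}

set.seed(99)
df <- data.frame(Size = 1:100,
                 val1 = sample(0:9, 100, replace = TRUE),
                 val2 = sample(0:9, 100, replace = TRUE))

# One median per count column.
fast <- sapply(df[-1], function(y) freq_median(df$Size, y))
fast
```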
db %>%
add_column(X1 = rnorm(nrow(db),0,sample(SD,1)),
X2 = rnorm(nrow(db),0,sample(SD,1))
)
I would like to generate 100 such random columns automatically, with the sd of each column sampled as a separate draw from SD.
1 - You can use the dplyover package.
An example with mtcars as db and 10 different SD values:
library(tidyverse)
db <- as_tibble(mtcars)
SD <- 1:10
n_col <- 100
sds <- sample(SD, n_col, replace = TRUE)
names(sds) <- paste0("X", seq_len(n_col))
mutate(db, dplyover::over(sds, ~ rnorm(n(), sd = .x)))
2 - Alternatively, you can create the randomly generated columns separately and bind them to db afterwards:
mat_rng <- as_tibble(lapply(sds, \(x) rnorm(nrow(db), sd = x)))
# or, as suggested by @Adam
# mat_rng <- map_dfc(sds, rnorm, n = nrow(db), mean = 0)
bind_cols(db, mat_rng)
(works only with R >= 4.1.0 because I used the new anonymous function syntax)
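For older R versions, or when no extra packages are available, the same result can be sketched in base R with an ordinary function(s) in place of the shorthand lambda. The object names new_cols and out are illustrative assumptions.

```r
set.seed(1)
db <- mtcars
SD <- 1:10
n_col <- 100

# One independently sampled sd per new column, named X1 ... X100.
sds <- sample(SD, n_col, replace = TRUE)
names(sds) <- paste0("X", seq_len(n_col))

# lapply() keeps the names, so the new columns are already labelled.
new_cols <- as.data.frame(
  lapply(sds, function(s) rnorm(nrow(db), mean = 0, sd = s))
)
out <- cbind(db, new_cols)
dim(out)
```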
I'd like to compute the trimmed mean for each trimming proportion alpha and then see which trimming proportion gives the minimal variance of the trimmed means when bootstrap simulations of size N=200 are applied. My problem is that when I try to create a data frame with column1 = mean and column2 = variance, my code produces a separate data frame for each value of alpha, so I cannot look up the trimming proportion and trimmed mean with the minimal variance.
The function prints "data.frame" 9 times. I guess that is because the alpha argument is vectorized. The code:
tmean_var <- function(n,N,alpha){
set.seed(1)
data <- rnorm(n)
data_aug1 <- c(data, -data)
data_aug2 <- c(data, 2 * median(data) - data)
est <- data.frame()
tmean <- replicate(N, {
sample <- base::mean(sample(x = data[(round(alpha*n)+1):(n-round(alpha*n))],
size = n-2*round(alpha*n), replace = TRUE))
})
mean <- base::mean(tmean)
var <- var(tmean) * (n-2*round(alpha * n))
df <- data.frame(mean = mean, var = var)
class(df)
}
f <- Vectorize(tmean_var, vectorize.args = "alpha")
f(n,N,alpha)
How could I make the output one data frame instead of nine?
This should do it. The original function returns "data.frame" for each alpha because its last expression is class(df). Rather than using Vectorize() on a function that doesn't inherently take vector arguments, you can use lapply() and sapply() across the values of alpha inside the function, as below:
tmean_var <- function(n,N,alpha){
set.seed(1)
data <- rnorm(n)
data_aug1 <- c(data, -data)
data_aug2 <- c(data, 2 * median(data) - data)
est <- data.frame()
tmean <- lapply(alpha, function(a){replicate(N, {
sample <- base::mean(sample(x = data[(round(a*n)+1):(n-round(a*n))],
size = n-2*round(a*n), replace = TRUE))
})
})
mean <- sapply(tmean, mean)
var <- sapply(seq_along(tmean), function(i)var(tmean[[i]]) * (n-2*round(alpha[i] * n)))
df <- data.frame(mean = mean, var = var, alpha=alpha)
# class(df)
}
out <- tmean_var(100, 200, c(.1, .2, .3))
out
#> mean var alpha
#> 1 0.10555709 0.8066377 0.1
#> 2 0.06868891 0.8331401 0.2
#> 3 0.21791984 0.9024612 0.3
Created on 2022-05-13 by the reprex package (v2.0.1)
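With everything in one data frame, the lookup the question asks for is a single which.min() call. The sketch below rebuilds the data frame from the printed output above, so that it runs on its own. The name best is an illustrative assumption.

```r
# Values copied from the printed output above.
out <- data.frame(
  mean  = c(0.10555709, 0.06868891, 0.21791984),
  var   = c(0.8066377, 0.8331401, 0.9024612),
  alpha = c(0.1, 0.2, 0.3)
)

# Row with the smallest variance: its alpha and trimmed mean.
best <- out[which.min(out$var), ]
best
```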
I would like to run a custom function that uses specific columns of a dataframe split by groups. Here is my sample data & function code:
my_data = data.frame(N = c(12, 12, 24, 24, 12, 12),
p = rep(c(.125,.125,.025),2),
group = rep(c("dogs","cats"),each=3))
power.sequential <- function(d, nseq, pseq){
decvec <- NULL
nvec <- NULL
for (i in 1:100){
decvec[i] <- 0
nvec[i] <- 0
j <- 1
x <- NULL
while(decvec[i] == 0 & nvec[i] < sum(nseq)){
x <- c(x, rnorm(nseq[j], mean = d))
p <- t.test(x)$p.value
nvec[i] <- nvec[i] + nseq[j]
if (p < pseq[j]) decvec[i] <- 1
j <- j + 1
}
}
power <- mean(decvec == 1)
meanN <- mean(nvec)
return(list("power" = power, "mean_N" = meanN))
}
Now I want to run this function on each group in my dataframe. This is how the function is called normally:
power.sequential(d = .5,
nseq = c(12,12,24),
pseq = c(.125,.125,.025))
The function returns two values, and ideally they would each be saved in a separate column of my dataframe.
And this is my best try, but it gives an error message:
my_data %>% group_by(group) %>%
mutate(result = power.sequential(d=.5,nseq=N,pseq=p))
I probably need to reshape my dataframe so that each group is a single row, but I'm stuck on how to proceed.
Here is my desired output. The function returns two values (power and meanN), and each should get its own column.
group power meanN
dogs .94 20.28
cats .95 27.36
You can do:
my_data %>%
group_by(group) %>%
do(data.frame(power.sequential(d=.5,nseq=.$N,pseq=.$p)[c(1, 2)])) %>%
data.frame()
That gives:
group power mean_N
1 cats 0.96 27.24
2 dogs 0.94 21.12
The task can be simplified by using data.table. You can call the function directly in the 'j' argument, and both values will appear as separate columns.
library(data.table)
setDT(my_data)
set.seed(1)
my_data[,power.sequential(0.5, N, p), by=group]
# group power mean_N
# 1: dogs 0.90 24.48
# 2: cats 0.94 27.72
Note: set.seed(1) has been used to keep the result consistent.
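If neither dplyr nor data.table is available, a base R sketch of the same split-apply-combine idea follows. The function is a condensed copy of power.sequential() from the question. The nsim argument and the names per_group and res are assumptions added for illustration. The numbers depend on the seed and the R version, so no output is shown.

```r
my_data <- data.frame(N = c(12, 12, 24, 24, 12, 12),
                      p = rep(c(.125, .125, .025), 2),
                      group = rep(c("dogs", "cats"), each = 3))

# Condensed copy of power.sequential() from the question; nsim is an added,
# hypothetical argument (the original hard-codes 100 simulations).
power.sequential <- function(d, nseq, pseq, nsim = 100) {
  decvec <- integer(nsim)
  nvec <- numeric(nsim)
  for (i in seq_len(nsim)) {
    j <- 1
    x <- NULL
    while (decvec[i] == 0 && nvec[i] < sum(nseq)) {
      x <- c(x, rnorm(nseq[j], mean = d))
      nvec[i] <- nvec[i] + nseq[j]
      if (t.test(x)$p.value < pseq[j]) decvec[i] <- 1L
      j <- j + 1
    }
  }
  list(power = mean(decvec == 1), mean_N = mean(nvec))
}

set.seed(1)
# Split by group, run the function on each piece, bind the one-row results.
per_group <- lapply(split(my_data, my_data$group), function(g) {
  as.data.frame(power.sequential(d = 0.5, nseq = g$N, pseq = g$p))
})
res <- cbind(group = names(per_group), do.call(rbind, per_group))
rownames(res) <- NULL
res
```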