I'm trying to get bootstrapped descriptive statistics for the columns of a data frame using boot() in R. I can't figure out how to write the "statistic" function required of boot so that it goes through all the columns and returns 4 stats each.
I'm basically trying to modify code from a book chapter introducing bootstrapping, but the example there uses dlply to group the numeric values in one column by the group names in another. I don't need that grouping step; I just need to do the same thing to every column. I reproduce the textbook code below:
library(boot)
library(plyr)
library(moments)
DescStat <- function(data, i) {
  temp <- data[i, ]
  desc <- dlply(temp, "GroupName", summarize,
                mean = mean(ValueColumn), sd = sd(ValueColumn),
                skew = skewness(ValueColumn), kurt = kurtosis(ValueColumn))
  l.desc <- unlist(desc)
  return(l.desc)
}
DasBoot <- boot(dataframe, DescStat, R = 1000)
print(DasBoot)
That returns a table of each one of those stats, plus the bias and the SE, each on its own line. So Group1 mean and bias and SE, then Group1 SD and bias and SE on the next line, and on down the list, 4 lines for each group.
I would like to do the exact same thing, but for each column in the data frame instead of for groups of cases.
I have kind of been floundering. It seems like this should not be that difficult to do, but all the examples and tutorials online are either for a single column (which I can do just fine by specifying it in the statistic function) or for trickier manipulations such as the above.
Any help would be greatly appreciated.
I was able to figure it out using either apply() or purrr::map(). Here's the apply() version of the statistic function, which doesn't require any extra packages (i.e. the tidyverse) to be installed:
ApStat <- function(data, i) {
  temp <- data[i, ]
  desc <- apply(temp, 2, FUN = function(x) {
    list(mean = mean(x, trim = 0.2),  # note: trim = 0.2 gives a 20% trimmed mean
         sd = sd(x),
         skew = skewness(x), kurt = kurtosis(x))
  })
  l.desc <- unlist(desc)
  return(l.desc)
}
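The purrr::map() version is essentially the same idea (a sketch; MapStat is just an illustrative name here, and it assumes purrr is installed):
MapStat <- function(data, i) {
  temp <- data[i, ]
  # map() over a data frame iterates over its columns
  desc <- purrr::map(temp, function(x) {
    list(mean = mean(x, trim = 0.2), sd = sd(x),
         skew = skewness(x), kurt = kurtosis(x))
  })
  unlist(desc)
}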
Thanks to the guys/gals at r/rstats!
I am using the table1 package and the code below to create a table with region-specific descriptive statistics measures for three variables (i.e. ITE_tech, tech and IMR_tech) that lie in the closed unit interval.
library(table1)
table1::label(results19$ITE_tech) <- "TE"
table1::label(results19$tech) <- "TE_k"
table1::label(results19$IMR_tech) <- "MTR"
table1::table1(~ITE_tech + tech + IMR_tech | REGION, data = results19)
How can I add to the resulting table another row showing the number of times each of the aforementioned variables takes the value 1? I tried to do this with the code below, but it applies the sum function only to the first variable and drops the default descriptive statistics measures (i.e. mean, sd, min, max) from the final table.
render.continuous.custom <- function(x, ...) {
  attr(x, "label") <- NULL
  c(
    "",
    "Eff. Obs." = sum(results19$tech == 1)
  )
}
table1::table1(~ITE_tech + tech + IMR_tech | REGION, data = results19, render.continuous = render.continuous.custom)
Thank you in advance for your help.
Here is how you can define the render function:
render.continuous.custom <- function(x, ...) {
  y <- render.default(x, ...)      # start from the default statistics
  c(y, "Eff. Obs." = sum(x == 1))  # append the count of values equal to 1
}
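It can then be passed to table1() exactly as in the question's last call:
table1::table1(~ ITE_tech + tech + IMR_tech | REGION, data = results19,
               render.continuous = render.continuous.custom)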
(Note: a reproducible example would be helpful)
I have a dataset that looks something like this:
library(data.table)
dtset <- data.table(x = 11:30, y = rnorm(20))
I would like to calculate the rolling regression coefficient and R-squared over the last 10 rows, using zoo::rollapply():
library(zoo)  # for rollapply
dtset[, coefficient := rollapply(1:20, width = 10, FUN = function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data = c(subdtset$x, rep(1, nrow(subdtset))),
                       nrow = nrow(subdtset), ncol = 2), subdtset$y)
  return(coef(reg)[1])
}, align = "right", fill = NA)]
dtset[, rsquare := rollapply(1:20, width = 10, FUN = function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data = c(subdtset$x, rep(1, nrow(subdtset))),
                       nrow = nrow(subdtset), ncol = 2), subdtset$y)
  return(1 - sum((subdtset$y - reg$fitted.values)^2) /
             sum((subdtset$y - mean(subdtset$y, na.rm = TRUE))^2))
}, align = "right", fill = NA)]
The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations so it is taking a very long time. I am hoping there is a way to speed things up:
Is there a better way to capture the last 10 rows in rollapply than passing the row numbers in as the variable a and then subsetting with subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work: I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to run the regression once, then take the coefficient and calculate the R-squared from it. It's pretty inefficient to do the same calculations twice.
Thanks for the help!
Use by.column = FALSE to pass both columns to the function. In the function, calculate the slope and R-squared directly to avoid the overhead of lm.fit. Note that rollapply can return a vector, and that rollapplyr with an r on the end is right-aligned. This also works if dtset consists of a single x column followed by multiple y columns, as in the example below with the built-in anscombe data frame.
library(data.table)
library(zoo)
stats <- function(X, x = X[, 1], y = X[, -1]) {
  # slope of the regression of y on x, and the squared correlation (R^2)
  c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}
rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)
a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
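Since stats() returns a named vector, one rollapplyr() pass produces both columns at once; to write them back into the data.table, something like this should work (a sketch, assuming dtset holds just the x and y columns from the question):
res <- rollapplyr(as.data.frame(dtset), 10, stats, by.column = FALSE, fill = NA)
dtset[, c("coefficient", "rsquare") := list(res[, "slope"], res[, "rsq"])]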
Check
We check the formulas using the built-in BOD data frame.
fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202
stats(BOD)
## slope rsq
## 1.7214286 0.6449202
I have simulated data created like this:
library(MASS)  # for mvrnorm
average_vector = c(0, 0, 25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9),nrow=3,ncol=3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n =20000, mu = average_vector, Sigma=sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people), so I used this code:
datsub <- replicate(1000, sample(1:nrow(data0), 50))
After that step I ran into an issue. I want to run a regression model on each set of 50 selected people (1,000 times) and record/store the point estimates of "hard" from model 4, which is given like this:
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
Then I want to plot the variation of those estimates around the line at 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000  # number of repeated samples
ns <- 50   # sample size
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for (i in seq_len(k)) {  # loop over the k samples, not the n rows
  params[[i]] <- coef(lm(formula = final ~ hard + smartness + age,
                         data = data0[samples[, i], ]))
}
# merge the per-sample coefficients into a matrix (one row per sample)
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.
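To visualize the spread around the true value, a histogram of the stored estimates works as well (a sketch using the params object above):
hist(params[, "hard"], breaks = 30,
     main = "Sampling variation of the 'hard' estimate", xlab = "Estimate")
abline(v = 0.5, col = "red", lwd = 2)  # true value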
This is a question both about using the boot() function with grouped variables, but also about passing multiple columns of data into boot. Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.
My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.
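For concreteness, stats::weighted.mean() just computes sum(w * x) / sum(w):
weighted.mean(c(1, 2, 3), w = c(1, 1, 2))
## [1] 2.25  (i.e. (1*1 + 1*2 + 2*3) / (1 + 1 + 2))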
I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.
Below I simulate a dataset with three groups (A, B, C) that each have different ranges of counts. I also attempt to come up with a function, "my.function", that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass the 'counts' and 'weights' columns of data into each bootstrapped sample? Is there some other way to index the data?
Inside the summarise() call, I reference the original data with "." (possibly another mistake?).
The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.
Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?
library(tidyverse)
library(boot)
set.seed(20)
sample.data = data.frame(letter = rep(c('A','B','C'), each = 50) %>% as.factor(),
                         counts = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
                         weights = sample(10, 150, replace = TRUE))
##Define function to bootstrap
##I'm using stats::weighted.mean() which needs to take in two arguments
##############
my.function = function(data,index){
d = data[index,] #create bootstrap sample of all columns of original data?
return(weighted.mean(d$counts, d$weights)) #calculate weighted mean using 'counts' and 'weights' columns
}
##############
## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits
## I pass data to boot using "." thinking that this would only pass each grouped subset of data
##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset.
sample.data %>%
group_by(letter) %>%
summarise(avg = mean(counts),
wtd.avg = weighted.mean(counts, weights),
CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])
And below I've calculated a rough estimate of the 95% confidence interval around the global mean, to show that this is what was going on with boot() in my summarise() call above:
#Here is a rough 95% confidence interval estimate as +/- 1.96* Standard Error
mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))
The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot is called only once per group, whereas the summarise() call in the question runs it twice per group (once for each CI limit), so its two limits come from different resamples.
library(boot)
sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x) {
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})
do.call(rbind, y)
# wtd.avg CI.LL CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872
Final clean-up:
rm(sp)
A dplyr solution could be the following. It also calls map_dfr from package purrr.
library(boot)
library(dplyr)
sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(function(x) {
    wtd.avg <- weighted.mean(x$counts, x$weights)
    basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
    CI.LL <- basic[4]
    CI.UL <- basic[5]
    data.frame(wtd.avg, CI.LL, CI.UL)
  })
# wtd.avg CI.LL CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
I am a non-computing/math student who is really new to R, and I need some help. I have provided a dummy dataset and example to illustrate my problems.
Note: AUC = area under curve; ICP = intracranial pressure; cumAUC = cumulative AUC;
To put things into context, any ICP value > 20 is not clinically ideal. A method of quantifying this physiological insult is a measure known as the ICP-times-Time burden, which can be represented by the AUC of the ICP-Time curve above the y-cutoff of 20.
Hence I would like to calculate the cumulative AUC of the ICP-time curve above a cut-off value of ICP = 20. However, my code is not giving me the desired output: ideally, the cumAUC should remain constant while ICP < 20, not keep increasing. Here's a dummy dataset and the code that I have written:
require(MASS) #Using the area function in MASS
require(dplyr)
require(ggplot2)
df <- data.frame(time=seq(1,20,by=1),
ICP=c(7,9,15,14,16,20,25,23,26,27,18,15,10,9,7,13,22,24,26,20))
ggplot(data=df, mapping=aes(x=time,y=ICP)) + geom_line() + geom_hline(yintercept = 20)
func_test <- approxfun(df$time, df$ICP, method = "linear", rule = 2)
area_single <- function(x) { area(func_test, 0, x) }  # AUC from 0 to x
area_multiple <- Vectorize(area_single)
area_cutoff <- function(x, level = 20) {
  # incremental AUC of each segment minus the rectangle below the cutoff
  tmp <- area_multiple(x) - lag(area_multiple(x)) - level * (x - lag(x))
  tmp_test <- tmp > 0
  tmp_test <- ifelse(is.na(tmp), FALSE, tmp_test)
  out <- ifelse(tmp_test, tmp, 0)  # keep only positive excess area
  return(out)
}
df_auc <- df %>%
mutate(cumAUC = cumsum(area_cutoff(time)))
Could you all kindly point out any possible errors in my code? Any alternative suggestions would also be great! :) Thank you so much for your help; much appreciated!
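One alternative worth sketching: clip the curve at the cutoff before integrating, so that segments below 20 contribute zero area by construction and cumAUC stays flat whenever ICP < 20. This is only approximate on segments that cross the cutoff, since the clipped values are interpolated linearly (func_above and df_auc2 are illustrative names):
# clip ICP at the cutoff, then integrate the clipped curve cumulatively
func_above <- approxfun(df$time, pmax(df$ICP - 20, 0), method = "linear", rule = 2)
df_auc2 <- df %>%
  mutate(cumAUC = sapply(time, function(t) area(func_above, min(time), t)))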