I would like to run a bootstrap of a weighted mean in a for loop (I don’t think I can use ‘apply’ because it concerns a weighted mean). I only need to store the resulting standard errors in a dataframe. Another post (bootstrap weighted mean in R) provided the code for calculating the weighted mean in a bootstrap, and it works perfectly:
library(boot)
mtcarsdata = mtcars #dataframe for data
mtcarsweights = rev(mtcars) #dataframe for weights
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
results_qsec <- sd(boot(data= mtcarsdata[, 6, drop = FALSE],
statistic = samplewmean,
R=10000,
j = mtcarsweights[, 6 , drop = FALSE])[[2]], na.rm=T)
results_qsec
To then run it in a loop, I tried:
outputboot = matrix(NA, nrow=11, ncol=1)
for (k in 1:11){
outputboot[1,k] = sd(boot(data= mtcarsdata[, k, drop = FALSE],
statistic = samplewmean,
R=10000,
j = mtcarsweights[, k, drop = FALSE])[[2]], na.rm=T)
}
outputboot
But this doesn't work. The first output isn’t even correct. I suspect the code can’t work with two iterators: one for looping over the columns and the other for the sampling with replacement.
I hope someone can offer some help.
This will calculate the standard deviation of all bootstraps for each column of the table mtcarsdata weighted by mtcarsweights.
Since we can calculate the result in one step, we can use apply and friends (here: purrr::map_dbl).
library(boot)
library(purrr)
set.seed(1337)
mtcarsdata <- mtcars # dataframe for data
mtcarsweights <- rev(mtcars) # dataframe for weights
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
mtcarsdata %>%
ncol() %>%
seq() %>%
map_dbl(~ {
# .x is the number of the current column
sd(boot(
data = mtcarsdata[, .x, drop = FALSE],
statistic = samplewmean,
R = 10000,
j = mtcarsweights[, .x, drop = FALSE]
)[[2]], na.rm = T)
})
#> [1] 0.90394218 0.31495232 23.93790468 6.34068205 0.09460257 0.19103196
#> [7] 0.33131814 0.07487754 0.07745781 0.13477355 0.27240347
Created on 2021-12-10 by the reprex package (v2.0.1)
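If you prefer to keep the for loop, note that outputboot has 11 rows and 1 column, so outputboot[1, k] is out of bounds as soon as k reaches 2; indexing by row instead should fix that. A minimal sketch, with R lowered to 1000 only to keep the run short (the row and column names on the matrix are my own addition):

```r
library(boot)
set.seed(1337)

mtcarsdata <- mtcars          # dataframe for data
mtcarsweights <- rev(mtcars)  # dataframe for weights

samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  weighted.mean(d, w)
}

# one row per column of mtcarsdata, a single column for the standard error
outputboot <- matrix(NA_real_, nrow = ncol(mtcarsdata), ncol = 1,
                     dimnames = list(names(mtcarsdata), "se"))

for (k in seq_len(ncol(mtcarsdata))) {
  b <- boot(data = mtcarsdata[, k, drop = FALSE],
            statistic = samplewmean,
            R = 1000,  # assumption: reduced from 10000 for speed
            j = mtcarsweights[, k, drop = FALSE])
  outputboot[k, 1] <- sd(b$t, na.rm = TRUE)  # fill row k, not column k
}
outputboot
```

as.data.frame(outputboot) then gives the standard errors as a dataframe.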
How can you iterate in a for loop over specific column names in R? This is the dataset I am using; below are the names of the columns I want to iterate over, along with their column numbers.
When I try to iterate, the code fails to run. I need this to create a multiple-cluster data visualization.
if (!require('Stat2Data')) install.packages('Stat2Data')
library(Stat2Data)
data("Hawks")
#summary(Hawks)
for (i in 10:13(Hawks)){
print(Hawks$ColumnName)
}
for (i in Hawks(c("Wing","Weight","Culmen","Hallux"))){
print(Hawks$ColumnName)
}
EDIT
After what Martin told me, this error occurs:
Error in `[.data.frame`(Hawks, , i) : undefined columns selected
This is the code I have:
if(!require('DescTools')) {
install.packages('DescTools')
library('DescTools')
}
Hawks$Wing[is.na(Hawks$Wing)] <- mean(Hawks$Wing, na.rm = TRUE)
Hawks$Weight[is.na(Hawks$Weight)] <- mean(Hawks$Weight, na.rm = TRUE)
Hawks$Culmen[is.na(Hawks$Culmen)] <- mean(Hawks$Culmen, na.rm = TRUE)
Hawks$Hallux[is.na(Hawks$Hallux)] <- mean(Hawks$Hallux, na.rm = TRUE)
# Wing parameter
n <- nrow(Hawks) # Number of rows
for (col_names in 10:13){
x <- matrix(Hawks[, i],0.95*n)
#x <- rbind(x1,x2)
plot (x)
fit2 <- kmeans(x, 2)
y_cluster2 <- fit2$cluster
fit3 <- kmeans(x, 3)
y_cluster3 <- fit3$cluster
fit4 <- kmeans(x, 4)
y_cluster4 <- fit4$cluster
}
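One way to loop over specific columns is to iterate over a character vector of names and index with [[ ]], making sure the loop variable is the one actually used inside the loop. A minimal sketch, using the built-in iris data as a stand-in for Hawks (so the column names below are not the ones from the question, and the fits list is a hypothetical container):

```r
set.seed(42)

# assumption: iris stands in for Hawks; swap in
# c("Wing", "Weight", "Culmen", "Hallux") for the real data
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

fits <- list()
for (col in cols) {
  # one-column matrix holding the current variable
  x <- matrix(iris[[col]], ncol = 1)
  # k-means with 2, 3 and 4 clusters, stored under the column name
  fits[[col]] <- lapply(2:4, function(k) kmeans(x, centers = k))
}

# cluster assignments for the 3-cluster fit of the first column
head(fits[["Sepal.Length"]][[2]]$cluster)
```

Storing the fits in a named list keeps the results from being overwritten on each pass, which is what happens to fit2, fit3 and fit4 in the loop above.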
I have a question regarding bootstrapping of a weighted mean.
Depending on how my data is structured, I sometimes want to bootstrap across columns and sometimes across rows.
In another post (bootstrap weighted mean in R), the following code was provided to bootstrap the weighted mean across columns:
library(boot)
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
results_qsec <- boot(data= mtcars[, 7, drop = FALSE],
statistic = samplewmean,
R=10000,
j = mtcars[, 6 , drop = FALSE])
This works perfectly (check: weighted.mean(mtcars[,7], mtcars[,6])).
However, I now also want to bootstrap across rows, which I thought the following code would do:
samplewmean2 <- function(d, i, j) {
d <- d[, i]
w <- j[, i]
return(weighted.mean(d, w))
}
results_qsec2 <- boot(data= mtcars[7, , drop = FALSE],
statistic = samplewmean2,
R=10000,
j = mtcars[6, , drop = FALSE])
Unfortunately this is not working, and I don't know what I should change.
Many thanks in advance.
Update
I think the easiest way is to get the row values into a vector and perform the bootstrap.
You could define your bootstrap-function like this:
samplewmean <- function(d,x, j) {
return(weighted.mean(d[x], j[x]))
}
And then apply the bootstrap like this:
results_qsec2 <- boot(data= as.vector(t(mtcars[, 7, drop = FALSE])),
statistic = samplewmean,
R=100,
j = as.vector(t(mtcars[, 6, drop = FALSE])))
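For the row case from the question, the same idea can be sketched by unlisting one row for the values and another for the weights. As far as I can tell, boot() takes the number of observations from NROW(data), so a one-row data frame is treated as a single observation, which would explain why the original attempt does not resample anything. The names row_vals and row_wts are mine, and R = 500 is an arbitrary choice:

```r
library(boot)
set.seed(1)

samplewmean <- function(d, x, j) {
  weighted.mean(d[x], j[x])
}

# turn row 7 (values) and row 6 (weights) into plain numeric vectors
row_vals <- unlist(mtcars[7, ])
row_wts  <- unlist(mtcars[6, ])

results_row <- boot(data = row_vals,
                    statistic = samplewmean,
                    R = 500,
                    j = row_wts)
results_row$t0  # weighted mean on the original row
```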
If this is not what you want you can consider performing the bootstrap without the usage of any package. Then a good starting point would be the creation of a for-loop (or lapply, ...) using the resampling I suggested first:
elements2use <- sample(1:length(d), length(d), replace=T)
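A package-free version built around that resampling line could look like the sketch below. The choice of mtcars columns 7 and 6 follows the question; R = 1000 and the name boot_means are assumptions of mine:

```r
set.seed(1)

d <- mtcars[, 7]  # values
w <- mtcars[, 6]  # weights

R <- 1000
boot_means <- numeric(R)
for (r in seq_len(R)) {
  # draw row positions with replacement, then reuse them for values and weights
  elements2use <- sample(seq_along(d), length(d), replace = TRUE)
  boot_means[r] <- weighted.mean(d[elements2use], w[elements2use])
}

sd(boot_means)  # bootstrap standard error of the weighted mean
```

Using the same index vector for d and w is what keeps each value paired with its own weight.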
I have been trying to make the output of a wfe model tidy so that I can easily incorporate it into ggplot, etc. This is a problem I've had with other packages and statistical models that are not included in broom.
So let's say I create a dataset like this (taken from wfe's help file):
library (wfe)
## generate panel data with number of units = N, number of time = Time
N <- 10 # number of distinct units
Time <- 15 # number of distinct time
## treatment effect
beta <- 1
## generate treatment variable
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
## make sure at least one observation is treated for each unit
while ((sum(apply(treat, 2, mean) == 0) > 0) | (sum(apply(treat, 2, mean) == 1) > 0) |
(sum(apply(treat, 1, mean) == 0) > 0) | (sum(apply(treat, 1, mean) == 1) > 0)) {
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
}
treat.vec <- c(treat)
## unit fixed effects
alphai <- rnorm(N, mean = apply(treat, 2, mean))
## generate two random covariates
x1 <- matrix(rnorm(N*Time, 0.5,1), ncol=N)
x2 <- matrix(rbeta(N*Time, 5,1), ncol=N)
x1.vec <- c(x1)
x2.vec <- c(x2)
## generate outcome variable
y <- matrix(NA, ncol = N, nrow = Time)
for (i in 1:N) {
y[, i] <- alphai[i] + treat[, i] + x1[,i] + x2[,i] + rnorm(Time)
}
y.vec <- c(y)
## generate unit and time index
unit.index <- rep(1:N, each = Time)
time.index <- rep(1:Time, N)
Data.obs <- as.data.frame(cbind(y.vec, treat.vec, unit.index, time.index, x1.vec, x2.vec))
colnames(Data.obs) <- c("y", "tr", "unit", "time", "x1", "x2")
Now I run a model from the function wfe (again, code from the package's help file):
mod.did <- wfe(y~ tr+x1+x2, data = Data.obs, treat = "tr",
unit.index = "unit", time.index = "time", method = "unit",
qoi = "ate", estimator ="did", hetero.se=TRUE, auto.se=TRUE,
White = TRUE, White.alpha = 0.05, verbose = TRUE)
## summarize the results
summary(mod.did)
My question is how to turn this output into a tidy object I could plot.
If I call tidy(mod.did) I get the following error:
Error: No tidy method for objects of class wfedid
Which I understand, but I am unsure as to how to solve. I tried mapping the individual parameters (coefficient, se, etc.) into a new list object but that did not work, so I hope that someone here knows of a more systematic way of doing this.
In case it helps, here's a dput of the output: https://pastebin.com/HTkKEUUQ
Thanks!
Here's a start at a tidy method:
library(dplyr); library(tibble)
tidy.wfedid <- function(x, conf.int=FALSE, conf.level=0.95, ...) {
cc <- (coef(summary(x))
%>% as.data.frame()
%>% setNames(c("estimate","std.error","statistic","p.value"))
%>% tibble::rownames_to_column("term")
%>% as_tibble()
)
return(cc)
}
Note that (1) I haven't implemented the confidence interval stuff (you could do this by using mutate to add columns (conf.low, conf.high) = estimate ± std.error*qnorm((1+conf.level)/2)); (2) this gives the standard "tidy" method, which gives a coefficient table. If you want predictions and confidence intervals on predictions you will need to write an augment method ...
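The confidence-interval step from note (1) could be sketched as follows. This is only an illustration in base R: an lm fit on mtcars stands in for the wfe model (its coefficient table has the same four columns), and the normal-approximation interval is the one described above, not something taken from the wfe package:

```r
# assumption: a plain lm fit as a stand-in for the wfedid object
fit <- lm(mpg ~ wt + hp, data = mtcars)

cc <- as.data.frame(coef(summary(fit)))
names(cc) <- c("estimate", "std.error", "statistic", "p.value")
cc <- cbind(term = rownames(cc), cc)
rownames(cc) <- NULL

conf.level <- 0.95
z <- qnorm((1 + conf.level) / 2)  # normal quantile for a two-sided interval

cc$conf.low  <- cc$estimate - z * cc$std.error
cc$conf.high <- cc$estimate + z * cc$std.error
cc
```

Inside tidy.wfedid the last two assignments would run only when conf.int is TRUE.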
I am trying to combine two approaches:
Bootstrapping multiple columns in data.table in a scalable fashion
with
Bootstrap weighted mean in R
Here is some random data:
## Generate sample data
# Function to randomly generate weights
set.seed(7)
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}
# Generate variables
nps <- round(runif(3500, min=-1, max=1), 0) # nps value which takes 1, 0 or -1
group <- sample(letters[1:11], 3500, TRUE) # groups
weight <- rtnorm(n=3500, mean=1, sd=1, a=0.04, b=16) # weights between 0.04 and 16
# Build data frame
df = data.frame(group, nps, weight)
# The following packages / libraries are required:
require("data.table")
require("boot")
This is the code from the first post above for bootstrapping the weighted mean:
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
results_qsec <- boot(data= df[, 2, drop = FALSE],
statistic = samplewmean,
R=10000,
j = df[, 3 , drop = FALSE])
This works totally fine.
Below is the code from the second post above for bootstrapping the mean by group within a data.table:
dt = data.table(df)
stat <- function(x, i) {x[i, (m=mean(nps))]}
dt[, list(list(boot(.SD, stat, R = 100))), by = group]$V1
This, too, works fine.
I have trouble combining both approaches:
Running …
dt[, list(list(boot(.SD, samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up the error message:
Error in weighted.mean.default(d, w) :
'x' and 'w' must have the same length
Running …
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up a different error:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
I still have problems getting my head around the arguments in data.table and how to combine functions running data.table.
I would appreciate any help.
This comes down to how data.table behaves inside a function. Within samplewmean, d is still a data.table even after subsetting with i, whereas weighted.mean expects numeric vectors for both the values and the weights. If you unlist before calling weighted.mean, this error goes away:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
Code to unlist before passing into weighted.mean:
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(unlist(d), unlist(w)))
}
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
A more data.table-like (data.table version >= v1.10.2) syntax is probably as follows:
# boot() passes the resampled row indices as the second positional argument;
# since valCol and wgtCol are matched by name, the indices end up in `original`
samplewmean <- function(d, valCol, wgtCol, original) {
weighted.mean(unlist(d[original, ..valCol]), unlist(d[original, ..wgtCol]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol="nps", wgtCol="weight"))), by=group]$V1
Or another possible syntax is: (see data.table faq 1.6)
samplewmean <- function(d, valCol, wgtCol, original) {
weighted.mean(unlist(d[original, eval(substitute(valCol))]), unlist(d[original, eval(substitute(wgtCol))]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol=nps, wgtCol=weight))), by=group]$V1
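For comparison, the per-group bootstrap can also be sketched without data.table, by splitting the data frame and running boot() on each piece. This is a base-R alternative rather than the data.table approach above; the toy data, the three groups, R = 200, and the names wmean_stat, by_group and se are all assumptions for illustration:

```r
library(boot)
set.seed(7)

# small stand-in for the data in the question
df <- data.frame(group  = sample(letters[1:3], 300, TRUE),
                 nps    = sample(c(-1, 0, 1), 300, TRUE),
                 weight = runif(300, min = 0.04, max = 2))

# statistic: weighted mean of nps on the resampled rows i
wmean_stat <- function(d, i) {
  weighted.mean(d$nps[i], d$weight[i])
}

# one boot object per group; each group is resampled on its own rows
by_group <- lapply(split(df, df$group), function(g) {
  boot(g, statistic = wmean_stat, R = 200)
})

# bootstrap standard error per group
se <- vapply(by_group, function(b) sd(b$t), numeric(1))
se
```

Passing each group's own rows as data keeps values and weights aligned and restricts the resampling to that group.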
While using regsubsets from the leaps package on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, the reordering changes the position of the coefficients and coef() returns wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there are no linear dependencies; the best model by BIC is mpg~wt+cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable, so perfect multicollinearity is possible, but with the variables in this order (dummy in the last column) the problem does not occur. In the last example, after changing the order of the variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
summary.regsubsets and regsubsets order the variables differently. The coef() method for regsubsets combines the two in one function, so the results get mixed up if you use force.in or a formula with a fixed variable order. Changing a few lines in the coef() method might help. Try the code below and see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}
Another solution that works for me is to randomize the order of the columns (independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reordering behavior in the regsubsets algorithm.
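A further option that sidesteps coef() entirely is to take the variable names from summary()$which and refit that model with lm(). A minimal sketch, in which sel is hard-coded as a stand-in for the names you would extract from the selected row (for example with names(which(best.sum$which[w, ]))[-1] to drop the intercept):

```r
# assumption: these are the variables flagged TRUE in summary()$which
sel <- c("cyl", "wt")

# build mpg ~ cyl + wt from the names and refit with ordinary least squares
fit <- lm(reformulate(sel, response = "mpg"), data = mtcars)
coef(fit)
```

Because lm() fits exactly the variables named, the coefficients cannot be attached to the wrong terms.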