Bootstrap of weighted mean in dataframe across rows - r

I have a question regarding bootstrapping of a weighted mean.
Depending on how my data is structured, I sometimes want to bootstrap across columns and sometimes across rows.
In another post (bootstrap weighted mean in R), the following code was provided to bootstrap the weighted mean across columns:
library(boot)
samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(d, w))
}
results_qsec <- boot(data = mtcars[, 7, drop = FALSE],
                     statistic = samplewmean,
                     R = 10000,
                     j = mtcars[, 6, drop = FALSE])
This works perfectly (check: weighted.mean(mtcars[,7], mtcars[,6])).
However, I now also want to bootstrap across rows, which I thought the following code would do:
samplewmean2 <- function(d, i, j) {
  d <- d[, i]
  w <- j[, i]
  return(weighted.mean(d, w))
}
results_qsec2 <- boot(data = mtcars[7, , drop = FALSE],
                      statistic = samplewmean2,
                      R = 10000,
                      j = mtcars[6, , drop = FALSE])
Unfortunately this does not work, and I don't know what I should change.
Many thanks in advance.

Update
I think the easiest way is to get the row values into a vector and perform the bootstrap on that, since boot always resamples along the first dimension (the rows of a data frame, or the elements of a vector).
You could define your bootstrap-function like this:
samplewmean <- function(d, x, j) {
  return(weighted.mean(d[x], j[x]))
}
And then apply the bootstrap like this:
results_qsec2 <- boot(data = as.vector(t(mtcars[, 7, drop = FALSE])),
                      statistic = samplewmean,
                      R = 100,
                      j = as.vector(t(mtcars[, 6, drop = FALSE])))
If this is not what you want, you could perform the bootstrap without any package. A good starting point would be a for-loop (or lapply, ...) using the resampling I suggested first, as sketched below:
elements2use <- sample(1:length(d), length(d), replace=T)
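For illustration, a minimal manual version along these lines (the names nboot and boot_wmeans are mine, not from the original post) could look like this:
# manual bootstrap of the weighted mean for row 7 (values) and row 6 (weights)
d <- unlist(mtcars[7, ])
w <- unlist(mtcars[6, ])
nboot <- 10000
boot_wmeans <- numeric(nboot)
for (b in 1:nboot) {
  elements2use <- sample(1:length(d), length(d), replace = TRUE)
  boot_wmeans[b] <- weighted.mean(d[elements2use], w[elements2use])
}
sd(boot_wmeans)  # bootstrap standard error of the weighted mean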

Related

How to bootstrap weighted mean in a loop in r

I would like to run a bootstrap of a weighted mean in a for loop (I don't think I can use apply because it concerns a weighted mean). I only need to store the resulting standard errors in a dataframe. Another post (bootstrap weighted mean in R) provided the code for calculating the weighted mean in a bootstrap, and it works perfectly:
library(boot)
mtcarsdata = mtcars #dataframe for data
mtcarsweights = rev(mtcars) #dataframe for weights
samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(d, w))
}
results_qsec <- sd(boot(data = mtcarsdata[, 6, drop = FALSE],
                        statistic = samplewmean,
                        R = 10000,
                        j = mtcarsweights[, 6, drop = FALSE])[[2]], na.rm = T)
results_qsec
To then run it in a loop, I tried:
outputboot = matrix(NA, nrow=11, ncol=1)
for (k in 1:11){
  outputboot[1,k] = sd(boot(data = mtcarsdata[, k, drop = FALSE],
                            statistic = samplewmean,
                            R = 10000,
                            j = mtcarsweights[, k, drop = FALSE])[[2]], na.rm=T)
}
outputboot
But this doesn't work. The first output isn’t even correct. I suspect the code can’t work with two iterators: one for looping over the columns and the other for the sampling with replacement.
I hope someone can offer some help.
This will calculate the standard deviation of all bootstraps for each column of the table mtcarsdata weighted by mtcarsweights.
Since we can calculate the result in one step, we can use apply and friends (here: purrr::map_dbl).
library(boot)
library(purrr)
set.seed(1337)
mtcarsdata <- mtcars # dataframe for data
mtcarsweights <- rev(mtcars) # dataframe for weights
samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(d, w))
}
mtcarsdata %>%
  ncol() %>%
  seq() %>%
  map_dbl(~ {
    # .x is the number of the current column
    sd(boot(
      data = mtcarsdata[, .x, drop = FALSE],
      statistic = samplewmean,
      R = 10000,
      j = mtcarsweights[, .x, drop = FALSE]
    )[[2]], na.rm = T)
  })
#> [1] 0.90394218 0.31495232 23.93790468 6.34068205 0.09460257 0.19103196
#> [7] 0.33131814 0.07487754 0.07745781 0.13477355 0.27240347
Created on 2021-12-10 by the reprex package (v2.0.1)
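As an aside, the original loop also fails for a simpler reason: outputboot is an 11 x 1 matrix, but the loop assigns to outputboot[1, k], which runs out of bounds as soon as k > 1. A minimal sketch of a fixed base-R loop (same statistic as above):
outputboot <- matrix(NA, nrow = 11, ncol = 1)
for (k in 1:11) {
  # note the corrected index: row k, column 1
  outputboot[k, 1] <- sd(boot(data = mtcarsdata[, k, drop = FALSE],
                              statistic = samplewmean,
                              R = 10000,
                              j = mtcarsweights[, k, drop = FALSE])[[2]], na.rm = TRUE)
}
outputboot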

nested loop in r to correlate columns of df1 to columns of df2

I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets, and what I am trying to do is correlate the columns of the first dataset with the columns of the second dataset, in order to see whether there is a positive or a negative correlation.
library(Hmisc)
(rcorr(otu.table.filter$sp1, new6$spA, type="spearman"))$P
(rcorr(otu.table.filter$sp1, new6$spA, type="spearman"))$r
The first gives me the p value of the relation between sp1 and spA, and the second the r value.
I initially created a loop that allowed me to check all species of the first dataframe against a single column of the second dataframe. Needless to say, if I were to make this work that way, I would have to repeat the process a few hundred times.
My simple loop for one column of df1 (new6) against all columns of df2 (otu.table.filter):
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
  pvalues[[i]] <- (rcorr(otu.table.filter[, i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
  rvalues[[i]] <- (rcorr(otu.table.filter[, i], new6$Total, type="spearman"))$r
}
p <- NULL
for(i in 1:length(pvalues)){
  tmp <- print(pvalues[[i]][2])
  p <- rbind(p, tmp)
}
r <- NULL
for(i in 1:length(rvalues)){
  tmp <- print(rvalues[[i]][2])
  r <- rbind(r, tmp)
}
fdr <- as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman <- cbind(r, p, fdr)
Using the above as a starting point, I tried to create a nested loop that would each time examine one column of df1 against all columns of df2, then proceed to the second column of df1 against all columns of df2, and so on. But here I am a bit lost and could not find a solution in R.
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I don't know how to achieve that. I tried:
pvalues = list()
rvalues = list()
for (j in 1:7){
  for(i in 1:ncol(otu.table.filter)) {
    pvalues[[i]][[j]] <- (rcorr(otu.table.filter[, i], new7[,j], type="spearman"))$P
  }
  for(i in 1:ncol(otu.table.filter)) {
    rvalues[[i]][[j]] <- (rcorr(otu.table.filter[, i], new7[,j], type="spearman"))$r
  }
}
But I cannot make it work because I am not sure how to direct the output into the lists. I would also appreciate help with the next part: extracting the p and r value for each comparison and applying the FDR correction (similar to what I did with my simple loop).
Here is a subset of my two dataframes.
Here is a small demo. Let's assume two matrices x and y with sample size n. Correlation and approximate p-values can then be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values via the t-approximation: t = r * sqrt((n - 2) / (1 - r^2))
pval <- function(r, n) 2 * (1 - pt(abs(r) / sqrt((1 - r^2) / (n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol = ncol(y))
for (i in 1:ncol(x)) {
  for (j in 1:ncol(y)) {
    p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
  }
}
p
The values differ somewhat, because the first uses the t-approximation while the second uses the exact AS 89 algorithm of cor.test.
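The asker also wanted to apply the FDR correction afterwards. A minimal sketch continuing from the p matrix above: adjust all p-values jointly, then reshape back into a matrix.
## FDR-adjust all p-values at once, then restore the matrix layout
fdr <- matrix(p.adjust(as.vector(p), method = "fdr"),
              nrow = nrow(p), ncol = ncol(p), dimnames = dimnames(p))
fdr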

R bootstrap weighted mean by group with data table

I am trying to combine two approaches:
Bootstrapping multiple columns in data.table in a scalable fashion
with
Bootstrap weighted mean in R
Here is some random data:
## Generate sample data
# Function to randomly generate weights
set.seed(7)
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
  qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}
# Generate variables
nps <- round(runif(3500, min=-1, max=1), 0) # nps value which takes 1, 0 or -1
group <- sample(letters[1:11], 3500, TRUE) # groups
weight <- rtnorm(n=3500, mean=1, sd=1, a=0.04, b=16) # weights between 0.04 and 16
# Build data frame
df = data.frame(group, nps, weight)
# The following packages / libraries are required:
require("data.table")
require("boot")
This is the code from the first post above, bootstrapping the weighted mean:
samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(d, w))
}
results_qsec <- boot(data = df[, 2, drop = FALSE],
                     statistic = samplewmean,
                     R = 10000,
                     j = df[, 3, drop = FALSE])
This works totally fine.
Below is the code from the second post above, bootstrapping the mean by group within a data.table:
dt = data.table(df)
stat <- function(x, i) {x[i, (m=mean(nps))]}
dt[, list(list(boot(.SD, stat, R = 100))), by = group]$V1
This, too, works fine.
I have trouble combining both approaches:
Running …
dt[, list(list(boot(.SD, samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up the error message:
Error in weighted.mean.default(d, w) :
'x' and 'w' must have the same length
Running …
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up a different error:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
I still have trouble getting my head around how arguments are passed in data.table and how to combine functions when working with data.table.
I would appreciate any help.
It is related to how data.table behaves within the scope of a function. d is still a data.table within samplewmean even after subsetting with i, whereas weighted.mean expects numeric vectors of values and weights. If you unlist before calling weighted.mean, you can fix this error:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
Code to unlist before passing into weighted.mean:
samplewmean <- function(d, i, j) {
  d <- d[i, ]
  w <- j[i, ]
  return(weighted.mean(unlist(d), unlist(w)))
}
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
A more data.table-like (data.table version >= v1.10.2) syntax is probably as follows:
# boot passes the resampling indices positionally to the statistic; because
# valCol and wgtCol are matched by name, those indices land in the remaining
# parameter, here named original (they are not used in this variant)
samplewmean <- function(d, valCol, wgtCol, original) {
  weighted.mean(unlist(d[, ..valCol]), unlist(d[, ..wgtCol]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol="nps", wgtCol="weight"))), by=group]$V1
Or another possible syntax is: (see data.table faq 1.6)
samplewmean <- function(d, valCol, wgtCol, original) {
  weighted.mean(unlist(d[, eval(substitute(valCol))]), unlist(d[, eval(substitute(wgtCol))]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol=nps, wgtCol=weight))), by=group]$V1
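Note that both variants above ignore the resampling indices, so each replicate recomputes the same weighted mean (presumably why R = 1 is used). A minimal sketch of a statistic that applies the indices, so R can be raised to a useful number (assuming data.table >= 1.10.2 for the ..col syntax):
samplewmean <- function(d, i, valCol, wgtCol) {
  d <- d[i]  # apply boot's resampling indices to the rows of this group
  weighted.mean(unlist(d[, ..valCol]), unlist(d[, ..wgtCol]))
}
dt[, list(list(boot(.SD, statistic = samplewmean, R = 5000,
                    valCol = "nps", wgtCol = "weight"))), by = group]$V1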

Regsubsets results differ with coef() for model with linear dependencies

While using regsubsets from package leaps on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, reordering changes the position of coefficients and coef() returns wrong values.
I use mtcars just to simulate the problem I had with other data. In the first example there are no linear dependencies, the best model by BIC is mpg ~ wt + cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable so that perfect multicollinearity becomes possible, but with the variables in this order (dummy in the last column) the problem does not appear. In the last example, after changing the order of variables in the dataset, the problem finally appears, and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
The order of variables used by summary.regsubsets and by regsubsets differs. The generic coef() for regsubsets combines the two, and the results get mixed up if you use force.in or a formula with a fixed variable order. Changing some lines in the coef() function might help. Try the code below and see if it works!
coef.regsubsets <- function(object, id, vcov = FALSE, ...)
{
  s <- summary(object)
  invars <- s$which[id, , drop = FALSE]
  betas <- vector("list", length(id))
  for (i in 1:length(id)) {
    # added
    var.name <- names(which(invars[i, ]))
    thismodel <- which(object$xnames %in% var.name)
    names(thismodel) <- var.name
    # deleted
    #thismodel <- which(invars[i, ])
    qr <- .Fortran("REORDR", np = as.integer(object$np),
                   nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
                   d = as.double(object$d), rbar = as.double(object$rbar),
                   thetab = as.double(object$thetab), rss = as.double(object$rss),
                   tol = as.double(object$tol), list = as.integer(thismodel),
                   n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
    beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
                     d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
                     tol = as.double(qr$tol), beta = numeric(length(thismodel)),
                     nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
    names(beta) <- object$xnames[qr$vorder[1:qr$n]]
    reorder <- order(qr$vorder[1:qr$n])
    beta <- beta[reorder]
    if (vcov) {
      p <- length(thismodel)
      R <- diag(qr$np)
      R[row(R) > col(R)] <- qr$rbar
      R <- t(R)
      R <- sqrt(qr$d) * R
      R <- R[1:p, 1:p, drop = FALSE]
      R <- chol2inv(R)
      dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
                          object$xnames[qr$vorder[1:p]])
      V <- R * s$rss[id[i]] / (object$nn - p)
      V <- V[reorder, reorder]
      attr(beta, "vcov") <- V
    }
    betas[[i]] <- beta
  }
  if (length(id) == 1)
    beta
  else betas
}
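With this redefined coef.regsubsets in scope (it shadows the method from leaps for top-level calls), re-running the checks from Example 3 should give coefficients consistent with summary()$which:
w <- which.min(best.sum3$bic)
best.sum3$which[w, ]
coef(bestSubset.cars3, w)  # should now name the same variables as the row above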
Another solution that works for me is to randomize the order of the columns (the independent variables) in your dataset before running regsubsets, as sketched below. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reordering behavior in the regsubsets algorithm.
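A minimal sketch of that idea on the Example 3 data (the name dta_shuffled and the seed are illustrative; the response mpg is kept as the first column):
# shuffle the predictor columns, keeping the response (mpg) first
set.seed(42)
predictors <- c("manual", "am", "cyl", "wt", "hp")
dta_shuffled <- dta2[, c("mpg", sample(predictors))]
bestSubset.shuffled <- regsubsets(mpg ~ ., data = dta_shuffled)
coef(bestSubset.shuffled, which.min(summary(bestSubset.shuffled)$bic))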

Replacing a rolling average for loop with apply in R

I want to test the correlations between moving averages of varying lengths and a dependent variable. I've written a for loop that gets the job done but obviously for loops are not the ideal solution. I was wondering if someone could give me some pointers on how to replace the functionality of this for loop with apply as a more elegant solution? I've provided code and test data.
library(zoo)
# a function that calculates the correlation between moving averages for
# different lengths of window
# the input arguments are "independent": the variable over which to apply the
# moving function, "dependent": the output column, "startLength": the shortest
# window length, "endLength": the longest window length,
# "functionType": the function to apply (mean, sd, etc.)
MovingAverageCorrelation <- function(independent, dependent, startLength, endLength, functionType) {
  # declare a matrix for the different rolling functions and a correlation vector
  avgMat <- matrix(nrow = length(dependent), ncol = (endLength - startLength + 1))
  corVector <- rep(NA, ncol(avgMat))
  # run the rollapply function over the data and calculate the corresponding correlations
  for (i in startLength:endLength) {
    k <- i - startLength + 1  # map window length i to the matrix/vector index
    avgMat[, k] <- rollapply(independent, width = i, FUN = functionType,
                             na.rm = T, fill = NA, align = "right")
    corVector[k] <- cor(avgMat[, k], dependent, use = "complete.obs")
  }
  return(corVector)
}
# set test data
set.seed(100)
indVector <- runif(1000)
depVector <- runif(1000)
# run the function over the data
cor <- MovingAverageCorrelation(indVector, depVector, 1, 100, "mean")
Thanks!
Try sapply:
sapply(1:100, function(i) cor(rollapplyr(indVector, i, mean, na.rm = TRUE, fill = NA),
                              depVector, use = "complete.obs"))
If there are no NAs in your inputs this would work and is substantially faster:
sapply(1:100, function(i) cor(rollmeanr(indVector, i, fill = NA), depVector, use = "comp"))
