Before performing some statistical analysis I would like to add weights to my sample as a function of a variable (the population size for each areal unit), so that the higher the population size within each unit, the greater the weight it gets, and vice versa. Do you have any suggestions on how to do this in R? Thanks in advance
You can do this with weighted.mean(), providing the weights as the second argument.
Here is a quick example, using population as weights.
dat <- data.frame(
country = c("UK", "US", "France", "Zimbabwe"),
pop = c(6.7e4, 3.31e8, 6.8e4, 1.5e4),
love_of_british_royal_family = c(5, 9, 2, 1)
)
mean(dat$love_of_british_royal_family) # 4.25
weighted.mean(
dat$love_of_british_royal_family,
w = dat$pop
) # 8.997391
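If you also want to store the weights as a column of the data frame (for example to reuse them in later analyses), a minimal sketch is to normalise the population so the weights sum to 1; the column name w is just an illustration, and weighted.mean() gives the same result whether or not the weights are normalised:
# population-proportional weights that sum to 1 (illustrative column name)
dat$w <- dat$pop / sum(dat$pop)
weighted.mean(dat$love_of_british_royal_family, w = dat$w) # same result as above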
SamR's weighted.mean requires a weight for each element of your vector. If you have a population vector with many elements and want to weight by categories of population size, you could use the base R cut function. Here is a toy example:
population <- sample(200:200000, 100)
df <- data.frame(population)
breaks <- c(200, 10000, 50000, 100000, 200000)
labels <- c(0.1, 0.2, 0.3, 0.4)
# include.lowest = TRUE so a population of exactly 200 falls into the first bin
cuts <- cut(df$population, breaks = breaks, labels = labels, include.lowest = TRUE)
df$weights <- as.numeric(as.character(cuts))
head(df)
population weights
1 25087 0.2
2 92652 0.3
3 99051 0.3
4 136376 0.4
5 184573 0.4
6 147675 0.4
Note that cuts is a factor, so the as.character(cuts) conversion is required to recover the intended fractional weights rather than the underlying integer codes.
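To see why the intermediate as.character() step matters, here is a small standalone illustration (not part of the example above): calling as.numeric() directly on a factor returns its internal integer codes, not its labels.
f <- factor(c(0.1, 0.3, 0.4))
as.numeric(f)                # 1 2 3 -- the integer codes, not the weights
as.numeric(as.character(f))  # 0.1 0.3 0.4 -- the intended values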
I’m trying to replicate the Matlab ewstats function defined here:
https://it.mathworks.com/help/finance/ewstats.html
The results given by Matlab are the following ones:
> ExpReturn = 1×2
0.1995 0.1002
> ExpCovariance = 2×2
0.0032 -0.0017
-0.0017 0.0010
I’m trying to replicate the example with the RiskPortfolios R package:
https://cran.r-project.org/web/packages/RiskPortfolios/RiskPortfolios.pdf
The R code I’m using is this one:
library(RiskPortfolios)
rets <- as.matrix(cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13)))
w <- 0.98
rets
w
meanEstimation(rets, control = list(type = 'ewma', lambda = w))
covEstimation(rets, control = list(type = 'ewma', lambda = w))
The mean estimate is the same as the one in the example, but the covariance matrix is different:
> rets
[,1] [,2]
[1,] 0.24 0.08
[2,] 0.15 0.13
[3,] 0.27 0.06
[4,] 0.14 0.13
> w
[1] 0.98
>
> meanEstimation(rets, control = list(type = 'ewma', lambda = w))
[1] 0.1995434 0.1002031
>
> covEstimation(rets, control = list(type = 'ewma', lambda = w))
[,1] [,2]
[1,] 0.007045044 -0.003857217
[2,] -0.003857217 0.002123827
Am I missing something?
Thanks
They give the same answer if type = "lw" is used:
round(covEstimation(rets, control = list(type = 'lw')), 4)
## 0.0032 -0.0017
## -0.0017 0.0010
They are using different algorithms. From the RiskPortfolios manual:
ewma ... See RiskMetrics (1996)
From the Matlab help page:
There is no relationship between ewstats function and the RiskMetrics® approach for determining the expected return and covariance from a return time series.
Unfortunately Matlab does not tell us which algorithm is used.
For those who eventually need an equivalent ewstats function in R, here is the code I wrote:
ewstats <- function(RetSeries, DecayFactor=NULL, WindowLength=NULL){
#EWSTATS Expected return and covariance from return time series.
# Optional exponential weighting emphasizes more recent data.
#
# [ExpReturn, ExpCovariance, NumEffObs] = ewstats(RetSeries, ...
# DecayFactor, WindowLength)
#
# Inputs:
# RetSeries : NUMOBS by NASSETS matrix of equally spaced incremental
# return observations. The first row is the oldest observation, and the
# last row is the most recent.
#
# DecayFactor : Controls how much less each observation is weighted than its
# successor. The k'th observation back in time has weight DecayFactor^k.
# DecayFactor must lie in the range: 0 < DecayFactor <= 1.
# The default is DecayFactor = 1, which is the equally weighted linear
# moving average Model (BIS).
#
# WindowLength: The number of recent observations used in
# the computation. The default is all NUMOBS observations.
#
# Outputs:
# ExpReturn : 1 by NASSETS estimated expected returns.
#
# ExpCovariance : NASSETS by NASSETS estimated covariance matrix.
#
# NumEffObs: The number of effective observations is given by the formula:
# NumEffObs = (1-DecayFactor^WindowLength)/(1-DecayFactor). Smaller
# DecayFactors or WindowLengths emphasize recent data more strongly, but
# use less of the available data set.
#
# The standard deviations of the asset return processes are given by:
# STDVec = sqrt(diag(ECov)). The correlation matrix is :
# CorrMat = VarMat./( STDVec*STDVec' )
#
# See also MEAN, COV, COV2CORR.
NumObs <- dim(RetSeries)[1]
NumSeries <- dim(RetSeries)[2]
# size the series and the window
if (is.null(WindowLength)) {
WindowLength <- NumObs
}
if (is.null(DecayFactor)) {
DecayFactor = 1
}
if (DecayFactor <= 0 || DecayFactor > 1) {
stop('Must have 0 < DecayFactor <= 1.')
}
if (WindowLength > NumObs){
stop(sprintf('Window Length %d must be <= number of observations %d',
WindowLength, NumObs))
}
# ------------------------------------------------------------------------
# size the data to the window
RetSeries <- RetSeries[(NumObs - WindowLength + 1):NumObs, , drop = FALSE]
# Calculate decay coefficients
DecayPowers <- seq(WindowLength-1, 0, by = -1)
VarWts <- sqrt(DecayFactor)^DecayPowers
RetWts <- (DecayFactor)^DecayPowers
NEff = sum(RetWts) # number of equivalent values in computation
# Compute the exponentially weighted mean return
WtSeries <- matrix(rep(RetWts, times = NumSeries),
nrow = length(RetWts), ncol = NumSeries) * RetSeries
ERet <- colSums(WtSeries)/NEff;
# Subtract the weighted mean from the original Series
CenteredSeries <- RetSeries - matrix(rep(ERet, each = WindowLength),
nrow = WindowLength, ncol = length(ERet))
# Compute the weighted variance
WtSeries <- matrix(rep(VarWts, times = NumSeries),
nrow = length(VarWts), ncol = NumSeries) * CenteredSeries
ECov <- t(WtSeries) %*% WtSeries / NEff
list(ExpReturn = ERet, ExpCovariance = ECov, NumEffObs = NEff)
}
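A quick check against the example in the question (the expected values are the ones Matlab reports above):
rets <- cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13))
res <- ewstats(rets, DecayFactor = 0.98)
round(res$ExpReturn, 4)      # should give 0.1995 0.1002
round(res$ExpCovariance, 4)  # should match the 2x2 ExpCovariance matrix above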
I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon or common species. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
df.sampled <- df[sample(1:6, 30, T),]
df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument of sample(), with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare (one entry per row of df, in the same order as the rows):
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has the intended uneven distribution, with common species drawn most often and rare species least often.
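To check that the class frequencies come out roughly 6:3:1, you can rerun the question's simulation with the prob argument (a small sketch; wrapping rarity in factor() keeps all three levels even in samples where a class happens not to appear):
df.matrix <- matrix(NA, ncol = 3, nrow = 1000,
                    dimnames = list(NULL, c('common', 'uncommon', 'rare')))
for(i in 1:1000){
  df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
  rarity <- factor(df.sampled$rarity, levels = c('common', 'uncommon', 'rare'))
  df.matrix[i,] <- table(rarity)
}
apply(df.matrix, 2, mean)  # roughly 18, 9 and 3 out of 30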
I have a data frame that looks like this, but obviously with many more rows etc:
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
cond=c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
comm=c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y','X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
measure=c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1, 0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))
So we have 2 factors (each with 2 levels, thus 4 combinations) and one continuous measure. We also have a repeated-measures design, in that we have multiple measures within each cell that correspond to the same id.
I've attempted to first solve the groupby issue, then the bootstrap issue, then combine the two, but am pretty much stuck...
Stats, grouped by the 2 factors
I can get multiple summary stats for each of the 4 cells by:
summary_stats <- aggregate(df$measure,
by = list(df$cond, df$comm),
function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(summary_stats)
resulting in
Group.1 Group.2 x.mean x.median x.sd
1 A X 0.85000000 0.85000000 0.12909944
2 B X 0.65000000 0.65000000 0.05773503
3 A Y 1.70000000 1.70000000 0.58878406
4 B Y 1.25000000 1.20000000 0.17320508
This is great as we are getting multiple stats for each of the 4 cells.
But what I'd really like is the 95% bootstrap CIs, for each stat, for each of the 4 cells. I don't mind if I have to run a final solution once per statistic (e.g. mean, median, etc.), but bonus points for doing it all in one go.
Bootstrap for repeated measures
Can't quite make this work, but what I want is 95% bootstrap CIs, done in a way which is appropriate for this repeated-measures design. Unless I'm mistaken, I want to select bootstrap samples on the basis of id (not on the basis of rows of the dataframe), then calculate a summary measure (e.g. mean) for each of the 4 cells.
library(boot)
myfunc <- function(data, indices) {
# select bootstrap sample to index into `id`
d <- data[data$id==indicies,]
return(c(mean=mean(d), median=median(d), sd = sd(d)))
}
bresults <- boot(data = CO2$uptake, statistic = myfunc, R = 1000)
Q1: I'm getting errors in selecting the bootstrap sample by id, i.e. the line d <- data[ data$id==indicies, ]
Combining bootstrap and the groupby 2 factors
Q2: I have no intuition of how to gel the two approaches together to achieve the final desired result. My only idea is to put the aggregate call in myfunc, to repeatedly calculate cell stats under each bootstrap replicate, but I'm out of my comfort zone with R here.
With your two questions, you have two issues:
How to bootstrap (resample) your data in such a way that you resample based on id, rather than rows
How to perform separate bootstraps for the four groups in your 2x2 design
One easy way to do this would be by using the following packages (all part of the tidyverse):
dplyr for manipulating your data (in particular, summarising the data you have for each id) and also for the neat %>% forward pipe operator, which supplies the result of an expression as the first argument to the next expression so you can chain commands
broom for turning the output of boot into a tidy dataframe (one row per statistic), which, combined with dplyr's do(), lets you run the bootstrap for each group in your dataframe
boot (which you already use) for the bootstrapping
Load the packages:
library(dplyr)
library(broom)
library(boot)
First of all, to make sure that when we resample we either include a subject entirely or not at all, I would save the various values each subject has as a list:
df <- df %>%
group_by(id, cond, comm) %>%
summarise(measure=list(measure)) %>%
ungroup()
Now the dataframe has fewer rows (4 per ID), and the variable measure is not numeric anymore (instead, it's a list). This means we can just use the indices that boot provides (solving issue 1), but also that we'll have to "unlist" it when we actually want to do calculations with it, so your function now becomes:
myfunc <- function(data, indices) {
data <- data[indices,]
return(c(mean=mean(unlist(data$measure)),
median=median(unlist(data$measure)),
sd = sd(unlist(data$measure))))
}
Now that we can simply use boot to resample each row, we can think about how to do it neatly per group. This is where dplyr's do() and the broom package come in: do() runs an operation for each group in your dataframe, and tidy() stores the boot results in a tidy dataframe, with one row for each statistic and columns for the values that your function produces. So we simply group the dataframe again, and then call do(tidy(...)), with a . instead of the name of our variable. This hopefully solves issue 2 for you!
bootresults <- df %>%
group_by(cond, comm) %>%
do(tidy(boot(data = ., statistic = myfunc, R = 1000)))
This produces:
# Groups: cond, comm [4]
cond comm term statistic bias std.error
<fctr> <fctr> <chr> <dbl> <dbl> <dbl>
1 A X mean 0.85000000 0.000000000 5.280581e-17
2 A X median 0.85000000 0.000000000 5.652979e-17
3 A X sd 0.12909944 -0.004704999 4.042676e-02
4 A Y mean 1.70000000 0.000000000 1.067735e-16
5 A Y median 1.70000000 0.000000000 1.072347e-16
6 A Y sd 0.58878406 -0.005074338 7.888294e-02
7 B X mean 0.65000000 0.000000000 0.000000e+00
8 B X median 0.65000000 0.000000000 0.000000e+00
9 B X sd 0.05773503 0.000000000 0.000000e+00
10 B Y mean 1.25000000 0.001000000 7.283065e-02
11 B Y median 1.20000000 0.027500000 7.729634e-02
12 B Y sd 0.17320508 -0.030022214 5.067446e-02
Hopefully this is what you'd like to see!
If you want to then use the values from this dataframe a bit more, you can use other dplyr functions to select which rows in this table you look at. For example, to look at the bootstrapped standard error of the standard deviation of your measure for condition A / X, you can do the following:
bootresults %>% filter(cond=='A', comm=='X', term=='sd') %>% pull(std.error)
I hope that helps!
For a bootstrap with a cluster variable, here's a solution that uses no additional packages, not even boot.
Part 1: Bootstrap
This function draws a random sample from a set of clustered observations.
.clusterSample <- function(x, id){
# draw cluster ids with replacement ...
boot.id <- sample(unique(id), replace = TRUE)
# ... and stack all rows belonging to each drawn id
out <- lapply(boot.id, function(i) x[id %in% i, ])
return( do.call("rbind", out) )
}
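As a quick illustration of what .clusterSample does with the df from the question: each draw keeps or drops whole ids, and an id that is drawn twice contributes all of its rows twice.
set.seed(1)  # only to make the illustration reproducible
boot.df <- .clusterSample(df, df$id)
table(boot.df$id)  # counts are multiples of 8, since each id has 8 rows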
Part 2: Bootstrap estimates and CIs
The next function draws multiple samples and applies the same aggregate statement to each of them. The bootstrap estimates and CIs are then obtained by mean and quantile.
clusterBoot <- function(data, formula, cluster, R=1000, alpha=.05, FUN){
# cluster variable
cls <- model.matrix(cluster,data)[,2]
template <- aggregate(formula, .clusterSample(data,cls), FUN)
var <- which( names(template)==all.vars(formula)[1] )
grp <- template[,-var,drop=F]
val <- template[,var]
x <- vapply( 1:R, FUN=function(r) aggregate(formula, .clusterSample(data,cls), FUN)[,var],
FUN.VALUE=val )
if(is.vector(x)) dim(x) <- c(1,1,length(x))
if(is.matrix(x)) dim(x) <- c(nrow(x),1,ncol(x))
# bootstrap estimates
est <- apply( x, 1:2, mean )
lo <- apply( x, 1:2, function(i) quantile(i,alpha/2) )
up <- apply( x, 1:2, function(i) quantile(i,1-alpha/2) )
colnames(lo) <- paste0(colnames(lo), ".lo")
colnames(up) <- paste0(colnames(up), ".up")
return( cbind(grp,est,lo,up) )
}
Note the use of vapply. I use it because I prefer working with arrays over lists. Note also that I used the formula interface to aggregate, which I also like better.
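For readers unfamiliar with vapply: it works like sapply but takes a template (FUN.VALUE) describing the shape of each result, so the output is guaranteed to be a matrix or array of that shape. A tiny standalone illustration:
# each call returns a named numeric vector of length 2, so the result is a 2 x 3 matrix
vapply(1:3, function(i) c(value = i, square = i^2), FUN.VALUE = c(value = 0, square = 0))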
Part 3: Examples
It can be used with basically any statistic, even without grouping variables. Some examples:
myStats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=myStats)
# cond comm mean median sd mean.lo median.lo sd.lo mean.up median.up sd.up
# 1 A X 0.85 0.850 0.11651125 0.85 0.85 0.05773503 0.85 0.85 0.17320508
# 2 B X 0.65 0.650 0.05773503 0.65 0.65 0.05773503 0.65 0.65 0.05773503
# 3 A Y 1.70 1.700 0.59461417 1.70 1.70 0.46188022 1.70 1.70 0.69282032
# 4 B Y 1.24 1.215 0.13856406 1.15 1.15 0.05773503 1.35 1.35 0.17320508
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=mean)
# cond comm est .lo .up
# 1 A X 0.85 0.85 0.85
# 2 B X 0.65 0.65 0.65
# 3 A Y 1.70 1.70 1.70
# 4 B Y 1.25 1.15 1.35
clusterBoot(data=df, formula=measure~1, cluster=~id, R=10, FUN=mean)
# est .lo .up
# 1 1.1125 1.0875 1.1375