Before performing some statistical analysis I would like to add weights to my sample as a function of a variable (the population size for each areal unit), so that the higher the population size within each unit, the greater the weight it gets, and vice versa. Do you have any suggestions on how to do this in R? Thanks in advance
You can do this with weighted.mean(), providing the weights as the second argument.
Here is a quick example, using population as weights.
dat <- data.frame(
country = c("UK", "US", "France", "Zimbabwe"),
pop = c(6.7e4, 3.31e8, 6.8e4, 1.5e4),
love_of_british_royal_family = c(5, 9, 2, 1)
)
mean(dat$love_of_british_royal_family) # 4.25
weighted.mean(
dat$love_of_british_royal_family,
w = dat$pop
) # 8.997391
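If you also want to store the weights as a column of the data frame (for example to reuse them in later analyses), a minimal sketch is to normalise the population so the weights sum to 1; the column name w is just an illustration, and weighted.mean() gives the same result whether or not the weights are normalised:
# population-proportional weights that sum to 1 (illustrative column name)
dat$w <- dat$pop / sum(dat$pop)
weighted.mean(dat$love_of_british_royal_family, w = dat$w) # same result as above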
SamR's weighted.mean requires a weight for each element of your vector. If you have a population vector with many elements and want to weight by categories of population size, you could use the base R cut function. Here is a toy example:
population <- sample(200:200000, 100)
df <- data.frame(population)
breaks <- c(200, 10000, 50000, 100000, 200000)
labels <- c(0.1, 0.2, 0.3, 0.4)
# include.lowest = TRUE so a population of exactly 200 falls into the first bin
cuts <- cut(df$population, breaks = breaks, labels = labels, include.lowest = TRUE)
df$weights <- as.numeric(as.character(cuts))
head(df)
population weights
1 25087 0.2
2 92652 0.3
3 99051 0.3
4 136376 0.4
5 184573 0.4
6 147675 0.4
Note that cuts is a factor, so the as.character(cuts) conversion is required to recover the intended fractional weights rather than the underlying integer codes.
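To see why the intermediate as.character() step matters, here is a small standalone illustration (not part of the example above): calling as.numeric() directly on a factor returns its internal integer codes, not its labels.
f <- factor(c(0.1, 0.3, 0.4))
as.numeric(f)                # 1 2 3 -- the integer codes, not the weights
as.numeric(as.character(f))  # 0.1 0.3 0.4 -- the intended values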
I’m trying to replicate the Matlab ewstats function defined here:
https://it.mathworks.com/help/finance/ewstats.html
The results given by Matlab are the following ones:
> ExpReturn = 1×2
0.1995 0.1002
> ExpCovariance = 2×2
0.0032 -0.0017
-0.0017 0.0010
I’m trying to replicate the example with the RiskPortfolios R package:
https://cran.r-project.org/web/packages/RiskPortfolios/RiskPortfolios.pdf
The R code I’m using is this one:
library(RiskPortfolios)
rets <- as.matrix(cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13)))
w <- 0.98
rets
w
meanEstimation(rets, control = list(type = 'ewma', lambda = w))
covEstimation(rets, control = list(type = 'ewma', lambda = w))
The mean estimate is the same as the one in the example, but the covariance matrix is different:
> rets
[,1] [,2]
[1,] 0.24 0.08
[2,] 0.15 0.13
[3,] 0.27 0.06
[4,] 0.14 0.13
> w
[1] 0.98
>
> meanEstimation(rets, control = list(type = 'ewma', lambda = w))
[1] 0.1995434 0.1002031
>
> covEstimation(rets, control = list(type = 'ewma', lambda = w))
[,1] [,2]
[1,] 0.007045044 -0.003857217
[2,] -0.003857217 0.002123827
Am I missing something?
Thanks
They give the same answer if type = "lw" is used:
round(covEstimation(rets, control = list(type = 'lw')), 4)
## 0.0032 -0.0017
## -0.0017 0.0010
They are using different algorithms. From the RiskPortfolios manual:
ewma ... See RiskMetrics (1996)
From the Matlab help page:
There is no relationship between ewstats function and the RiskMetrics® approach for determining the expected return and covariance from a return time series.
Unfortunately Matlab does not tell us which algorithm is used.
For those who eventually need an equivalent ewstats function in R, here is the code I wrote:
ewstats <- function(RetSeries, DecayFactor=NULL, WindowLength=NULL){
#EWSTATS Expected return and covariance from return time series.
# Optional exponential weighting emphasizes more recent data.
#
# [ExpReturn, ExpCovariance, NumEffObs] = ewstats(RetSeries, ...
# DecayFactor, WindowLength)
#
# Inputs:
# RetSeries : NUMOBS by NASSETS matrix of equally spaced incremental
# return observations. The first row is the oldest observation, and the
# last row is the most recent.
#
# DecayFactor : Controls how much less each observation is weighted than its
# successor. The k'th observation back in time has weight DecayFactor^k.
# DecayFactor must lie in the range: 0 < DecayFactor <= 1.
# The default is DecayFactor = 1, which is the equally weighted linear
# moving average Model (BIS).
#
# WindowLength: The number of recent observations used in
# the computation. The default is all NUMOBS observations.
#
# Outputs:
# ExpReturn : 1 by NASSETS estimated expected returns.
#
# ExpCovariance : NASSETS by NASSETS estimated covariance matrix.
#
# NumEffObs: The number of effective observations is given by the formula:
# NumEffObs = (1-DecayFactor^WindowLength)/(1-DecayFactor). Smaller
# DecayFactors or WindowLengths emphasize recent data more strongly, but
# use less of the available data set.
#
# The standard deviations of the asset return processes are given by:
# STDVec = sqrt(diag(ECov)). The correlation matrix is :
# CorrMat = VarMat./( STDVec*STDVec' )
#
# See also MEAN, COV, COV2CORR.
NumObs <- dim(RetSeries)[1]
NumSeries <- dim(RetSeries)[2]
# size the series and the window
if (is.null(WindowLength)) {
WindowLength <- NumObs
}
if (is.null(DecayFactor)) {
DecayFactor = 1
}
if (DecayFactor <= 0 || DecayFactor > 1) {
stop('Must have 0 < DecayFactor <= 1.')
}
if (WindowLength > NumObs){
stop(sprintf('Window Length %d must be <= number of observations %d',
WindowLength, NumObs))
}
# ------------------------------------------------------------------------
# size the data to the window
RetSeries <- RetSeries[(NumObs - WindowLength + 1):NumObs, , drop = FALSE]
# Calculate decay coefficients
DecayPowers <- seq(WindowLength-1, 0, by = -1)
VarWts <- sqrt(DecayFactor)^DecayPowers
RetWts <- (DecayFactor)^DecayPowers
NEff = sum(RetWts) # number of equivalent values in computation
# Compute the exponentially weighted mean return
WtSeries <- matrix(rep(RetWts, times = NumSeries),
nrow = length(RetWts), ncol = NumSeries) * RetSeries
ERet <- colSums(WtSeries)/NEff;
# Subtract the weighted mean from the original Series
CenteredSeries <- RetSeries - matrix(rep(ERet, each = WindowLength),
nrow = WindowLength, ncol = length(ERet))
# Compute the weighted variance
WtSeries <- matrix(rep(VarWts, times = NumSeries),
nrow = length(VarWts), ncol = NumSeries) * CenteredSeries
ECov <- t(WtSeries) %*% WtSeries / NEff
list(ExpReturn = ERet, ExpCovariance = ECov, NumEffObs = NEff)
}
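A quick check against the example in the question (the expected values are the ones Matlab reports above):
rets <- cbind(c(0.24, 0.15, 0.27, 0.14), c(0.08, 0.13, 0.06, 0.13))
res <- ewstats(rets, DecayFactor = 0.98)
round(res$ExpReturn, 4)      # should give 0.1995 0.1002
round(res$ExpCovariance, 4)  # should match the 2x2 ExpCovariance matrix above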
I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon or common species. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
df.sampled <- df[sample(1:6, 30, T),]
df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument of sample(), with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare (one entry per row of df, in the same order as the rows):
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has the intended uneven distribution, with common species drawn most often and rare species least often.
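To check that the class frequencies come out roughly 6:3:1, you can rerun the question's simulation with the prob argument (a small sketch; wrapping rarity in factor() keeps all three levels even in samples where a class happens not to appear):
df.matrix <- matrix(NA, ncol = 3, nrow = 1000,
                    dimnames = list(NULL, c('common', 'uncommon', 'rare')))
for(i in 1:1000){
  df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
  rarity <- factor(df.sampled$rarity, levels = c('common', 'uncommon', 'rare'))
  df.matrix[i,] <- table(rarity)
}
apply(df.matrix, 2, mean)  # roughly 18, 9 and 3 out of 30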
I have a data frame that looks like this, but obviously with many more rows etc:
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
cond=c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
comm=c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y','X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
measure=c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1, 0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))
So we have 2 factors (each with 2 levels, thus 4 combinations) and one continuous measure. We also have a repeated-measures design, in that we have multiple measures within each cell that correspond to the same id.
I've attempted to first solve the groupby issue, then the bootstrap issue, then combine the two, but am pretty much stuck...
Stats, grouped by the 2 factors
I can get multiple summary stats for each of the 4 cells by:
summary_stats <- aggregate(df$measure,
by = list(df$cond, df$comm),
function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(summary_stats)
resulting in
Group.1 Group.2 x.mean x.median x.sd
1 A X 0.85000000 0.85000000 0.12909944
2 B X 0.65000000 0.65000000 0.05773503
3 A Y 1.70000000 1.70000000 0.58878406
4 B Y 1.25000000 1.20000000 0.17320508
This is great as we are getting multiple stats for each of the 4 cells.
But what I'd really like is the 95% bootstrap CIs, for each stat, for each of the 4 cells. I don't mind if I have to run a final solution once per statistic (e.g. mean, median, etc.), but bonus points for doing it all in one go.
Bootstrap for repeated measures
Can't quite make this work, but what I want is 95% bootstrap CIs, done in a way which is appropriate for this repeated-measures design. Unless I'm mistaken, I want to select bootstrap samples on the basis of id (not on the basis of rows of the dataframe), then calculate a summary measure (e.g. mean) for each of the 4 cells.
library(boot)
myfunc <- function(data, indices) {
# select bootstrap sample to index into `id`
d <- data[data$id==indicies,]
return(c(mean=mean(d), median=median(d), sd = sd(d)))
}
bresults <- boot(data = CO2$uptake, statistic = myfunc, R = 1000)
Q1: I'm getting errors in selecting the bootstrap sample by id, i.e. the line d <- data[ data$id==indicies, ]
Combining bootstrap and the groupby 2 factors
Q2: I have no intuition of how to gel the two approaches together to achieve the final desired result. My only idea is to put the aggregate call in myfunc, to repeatedly calculate cell stats under each bootstrap replicate, but I'm out of my comfort zone with R here.
With your two questions, you have two issues:
How to bootstrap (resample) your data in such a way that you resample based on id, rather than rows
How to perform separate bootstraps for the four groups in your 2x2 design
One easy way to do this would be by using the following packages (all part of the tidyverse):
dplyr for manipulating your data (in particular, summarising the data you have for each id) and also for the neat %>% forward pipe operator, which supplies the result of an expression as the first argument to the next expression so you can chain commands
broom for turning the output of boot into a tidy dataframe (one row per statistic), which, combined with dplyr's do(), lets you run the bootstrap for each group in your dataframe
boot (which you already use) for the bootstrapping
Load the packages:
library(dplyr)
library(broom)
library(boot)
First of all, to make sure that when we resample we either include a subject entirely or not at all, I would save the various values each subject has as a list:
df <- df %>%
group_by(id, cond, comm) %>%
summarise(measure=list(measure)) %>%
ungroup()
Now the dataframe has fewer rows (4 per ID), and the variable measure is not numeric anymore (instead, it's a list). This means we can just use the indices that boot provides (solving issue 1), but also that we'll have to "unlist" it when we actually want to do calculations with it, so your function now becomes:
myfunc <- function(data, indices) {
data <- data[indices,]
return(c(mean=mean(unlist(data$measure)),
median=median(unlist(data$measure)),
sd = sd(unlist(data$measure))))
}
Now that we can simply use boot to resample each row, we can think about how to do it neatly per group. This is where dplyr's do() and the broom package come in: do() runs an operation for each group in your dataframe, and tidy() stores the boot results in a tidy dataframe, with one row for each statistic and columns for the values that your function produces. So we simply group the dataframe again, and then call do(tidy(...)), with a . instead of the name of our variable. This hopefully solves issue 2 for you!
bootresults <- df %>%
group_by(cond, comm) %>%
do(tidy(boot(data = ., statistic = myfunc, R = 1000)))
This produces:
# Groups: cond, comm [4]
cond comm term statistic bias std.error
<fctr> <fctr> <chr> <dbl> <dbl> <dbl>
1 A X mean 0.85000000 0.000000000 5.280581e-17
2 A X median 0.85000000 0.000000000 5.652979e-17
3 A X sd 0.12909944 -0.004704999 4.042676e-02
4 A Y mean 1.70000000 0.000000000 1.067735e-16
5 A Y median 1.70000000 0.000000000 1.072347e-16
6 A Y sd 0.58878406 -0.005074338 7.888294e-02
7 B X mean 0.65000000 0.000000000 0.000000e+00
8 B X median 0.65000000 0.000000000 0.000000e+00
9 B X sd 0.05773503 0.000000000 0.000000e+00
10 B Y mean 1.25000000 0.001000000 7.283065e-02
11 B Y median 1.20000000 0.027500000 7.729634e-02
12 B Y sd 0.17320508 -0.030022214 5.067446e-02
Hopefully this is what you'd like to see!
If you want to then use the values from this dataframe a bit more, you can use other dplyr functions to select which rows in this table you look at. For example, to look at the bootstrapped standard error of the standard deviation of your measure for condition A / X, you can do the following:
bootresults %>% filter(cond=='A', comm=='X', term=='sd') %>% pull(std.error)
I hope that helps!
For a bootstrap with a cluster variable, here's a solution that uses no additional packages, not even boot.
Part 1: Bootstrap
This function draws a random sample from a set of clustered observations.
.clusterSample <- function(x, id){
# draw cluster ids with replacement ...
boot.id <- sample(unique(id), replace = TRUE)
# ... and stack all rows belonging to each drawn id
out <- lapply(boot.id, function(i) x[id %in% i, ])
return( do.call("rbind", out) )
}
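As a quick illustration of what .clusterSample does with the df from the question: each draw keeps or drops whole ids, and an id that is drawn twice contributes all of its rows twice.
set.seed(1)  # only to make the illustration reproducible
boot.df <- .clusterSample(df, df$id)
table(boot.df$id)  # counts are multiples of 8, since each id has 8 rows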
Part 2: Bootstrap estimates and CIs
The next function draws multiple samples and applies the same aggregate statement to each of them. The bootstrap estimates and CIs are then obtained by mean and quantile.
clusterBoot <- function(data, formula, cluster, R=1000, alpha=.05, FUN){
# cluster variable
cls <- model.matrix(cluster,data)[,2]
template <- aggregate(formula, .clusterSample(data,cls), FUN)
var <- which( names(template)==all.vars(formula)[1] )
grp <- template[,-var,drop=F]
val <- template[,var]
x <- vapply( 1:R, FUN=function(r) aggregate(formula, .clusterSample(data,cls), FUN)[,var],
FUN.VALUE=val )
if(is.vector(x)) dim(x) <- c(1,1,length(x))
if(is.matrix(x)) dim(x) <- c(nrow(x),1,ncol(x))
# bootstrap estimates
est <- apply( x, 1:2, mean )
lo <- apply( x, 1:2, function(i) quantile(i,alpha/2) )
up <- apply( x, 1:2, function(i) quantile(i,1-alpha/2) )
colnames(lo) <- paste0(colnames(lo), ".lo")
colnames(up) <- paste0(colnames(up), ".up")
return( cbind(grp,est,lo,up) )
}
Note the use of vapply. I use it because I prefer working with arrays over lists. Note also that I used the formula interface to aggregate, which I also like better.
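For readers unfamiliar with vapply: it works like sapply but takes a template (FUN.VALUE) describing the shape of each result, so the output is guaranteed to be a matrix or array of that shape. A tiny standalone illustration:
# each call returns a named numeric vector of length 2, so the result is a 2 x 3 matrix
vapply(1:3, function(i) c(value = i, square = i^2), FUN.VALUE = c(value = 0, square = 0))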
Part 3: Examples
It can be used with basically any statistic, even without grouping variables. Some examples:
myStats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=myStats)
# cond comm mean median sd mean.lo median.lo sd.lo mean.up median.up sd.up
# 1 A X 0.85 0.850 0.11651125 0.85 0.85 0.05773503 0.85 0.85 0.17320508
# 2 B X 0.65 0.650 0.05773503 0.65 0.65 0.05773503 0.65 0.65 0.05773503
# 3 A Y 1.70 1.700 0.59461417 1.70 1.70 0.46188022 1.70 1.70 0.69282032
# 4 B Y 1.24 1.215 0.13856406 1.15 1.15 0.05773503 1.35 1.35 0.17320508
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=mean)
# cond comm est .lo .up
# 1 A X 0.85 0.85 0.85
# 2 B X 0.65 0.65 0.65
# 3 A Y 1.70 1.70 1.70
# 4 B Y 1.25 1.15 1.35
clusterBoot(data=df, formula=measure~1, cluster=~id, R=10, FUN=mean)
# est .lo .up
# 1 1.1125 1.0875 1.1375