Sample from an unknown probability distribution in R

I have a vector of length ~100k, with values between 0 and 1 representing habitat suitability at geographic locations. While some of the values are very small, many are around 0.9, so the vector sums to much more than one.
I would like to generate 1000 random samples of locations, each sample containing 6 locations drawn without replacement, with the probability that a location is chosen weighted by the value of the vector at that location.
Dummy data below. Any ideas?
mylocs = letters[1:10]
myprobs = c(0.1,NA,0.01,0.2,0.6,NA,0.001,0.03,0.9,NA)
mydata = data.frame(mylocs,myprobs)

I'm a bit confused by your question, so here are two possible answers.
If you want to sample 1000 groups of six values, where groups can share values, then:
locs = letters[1:15]
probs = c(0.1,NA,0.01,0.2,0.6,NA,0.001,0.03,0.9,NA, 0.1, 0.1, 0.1, 0.1, 0.1)
mydata = data.frame(locs,probs)
d = na.omit(mydata)
replicate(1000, sample(d$locs, size=6, prob=d$probs, replace=F))
If groups shouldn't share values, then just do:
## Change the "2" to 1000 in the real data set
s = sample(d$locs, size=6*2, prob=d$probs, replace=F)
matrix(s, ncol=6)

Related

How to generate spatially correlated random fields of very high dimension with R

This is an extended question based on what I found here (Method #1: http://santiago.begueria.es/2010/10/generating-spatially-correlated-random-fields-with-r/) and here (Method #2: https://gist.github.com/brentp/1306786). Both sites cover the topic very well (thanks!) for relatively small dimensions (e.g., 1000x1). I am trying to generate spatially clustered binary data of much larger dimension, >= 100000x1, for example c(1,1,1,1,0,1,0,0,0,0, …, 0,0,0,0,0,0,0,0,0,0,0,0), 1000 times per case study. Here is slightly modified code from those sites.
# Method #1
library(gstat)
dim1 <- 1000
dim2 <- 1
xy <- expand.grid(seq_len(dim1), seq_len(dim2))
colnames(xy) <- c("x", "y")
# Spatial correlation model
geo.model <- gstat(formula = z ~ x + y, locations = ~ x + y, dummy = TRUE,
                   beta = 0,
                   model = vgm(psill = 1, "Exp", range = dim1),  # range parameter
                   nmax = 30)
sim.mat <- predict(geo.model, newdata = xy, nsim = 1)
sim.mat[,3] <- ifelse(sim.mat[,3] > quantile(sim.mat[,3], .1), 0, 1)
plot(sim.mat[, 3])
# Method #2
# generate autocorrelated data.
nLags = 1000 # number of lags (size of region)
# fake, uncorrelated observations
X = rnorm(nLags)
# fake sigma... correlation decreases with distance
sigma = diag(nLags)
corr = .999
sigma <- corr ^ abs(row(sigma)-col(sigma))
#sigma
# Y is autocorrelated...
Y <- t(X %*% chol(sigma))
y <- ifelse(Y >= quantile(Y, probs=.9), 1, 0)[, 1]
plot(y)
Both methods work very well for generating binary data when dim1 is less than 10000. However, when I tried several hundred thousand (e.g., >= 100,000), it either takes a very long time or runs into a memory issue.
For example, when I used “nLags = 50000” in Method #2, I got an error message (“Error: cannot allocate vector of size 9.3 Gb”) after the code “sigma <- corr ^ abs(row(sigma)-col(sigma))”.
I would like to find an efficient (time- and memory-saving) way to generate such spatially clustered binary data 1000 times (especially with dim1 >= 100000) for each case study (about 200 cases).
I have thought about supplying multiple probabilities to the "sample" function, or using a probability distribution directly, but I am not sure how to do that; it is beyond my scope.
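One memory-saving direction (a sketch of mine, not from either linked post): the matrix sigma[i, j] = corr^|i - j| built in Method #2 is exactly the covariance matrix of a stationary AR(1) process with coefficient corr, so an equivalent Y can be simulated sequentially with arima.sim in O(n) time and memory, without ever forming sigma.
# Sketch: AR(1) simulation reproduces cov(Y[i], Y[j]) = corr^|i - j|
# without allocating the nLags x nLags matrix sigma
nLags <- 100000
corr <- 0.999
set.seed(1)
Y <- as.numeric(arima.sim(model = list(ar = corr), n = nLags,
                          sd = sqrt(1 - corr^2)))  # unit marginal variance
y <- ifelse(Y >= quantile(Y, probs = .9), 1, 0)    # top 10% -> 1, as in Method #2
plot(y)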

generating a discrete random probability distribution, by perturbing an existing one

If I wanted to efficiently generate a random discrete probability distribution of N probabilities which sum up to 1, I could go with Hadley's comment here:
prop.table(runif(N))
If I repeat this many times, the average probability for each of the N elements should be ~1/N.
What if I want the average probability for each of the N elements not to be 1/N but a specified number a priori?
E.g. N = 4 elements, I have the apriori distribution:
apriori <- c(0.2, 0.3, 0.1, 0.4)
And I would like random distributions based on this a priori, e.g.:
c(0.21, 0.29, 0.12, 0.38)
c(0.19, 0.29, 0.08, 0.44)
c(0.19, 0.33, 0.1, 0.38)
Etc.
Where we go by either of these rules:
1) On average, each element's probability would be (approximately) its probability in the a priori distribution.
2) There's a "perturbation" parameter, say perturbation = 0.05, which means either: (a) each probability i is allowed to lie in the range apriori[i] +- perturbation, or (b) each probability i is allowed to lie in the range apriori[i] +- perturbation * apriori[i] (i.e., plus/minus 5% of that a priori probability, not an absolute 5%).
I have no idea how to do this while keeping rule 1.
Regarding rule 2, my initial (inefficient) thought would be to perturb each of the first N - 1 elements by a random allowed amount, set the last element to 1 minus the sum of the first N - 1, and wrap this in a while loop until the last element is also legitimate.
I haven't even implemented it yet because that's very inefficient (say I want 100K of such distributions...). Ideas?
As proposed by Prof. Bolker, you ought to look at the Dirichlet distribution. Let's denote the a priori mean values by capital letters C_i and the sampled values by lowercase letters c_i. From its distribution properties, it automatically provides two features:
sum_i c_i = 1
each c_i is within the [0...1] range
so right away you can use the samples as probabilities.
Given the C_i, and looking at the definition of the distribution (check the link), the only free parameter left is
a_0 = sum_i a_i
and each a_i = C_i * a_0.
This choice of a_i will (again, automatically) give the proper mean value E[c_i] = C_i.
The bigger a_0 is, the more narrowly c_i concentrates around C_i. Roughly speaking, the variance is Var[c_i] ~ C_i / a_0, so for a 5% spread you might try a_0 = 50.
Some R code
library(MCMCpack)
apriori <- c(0.2, 0.3, 0.1, 0.4) # your C_i
a0 <- 50
a <- a0*apriori
set.seed(12345)
# sample your c_i and use it, for example, to throw uneven dice
ci <- rdirichlet(1, a)
dice <- rmultinom(1, 1, ci)
# another dice throw
ci <- rdirichlet(1, a)
dice <- rmultinom(1, 1, ci)
...
I have a solution, but the draws will end up being approximately normal. I think you could do something similar to draw from a uniform distribution. I don't have much experience with this, but I would lean towards a rejection-style policy: draw lots of candidates quickly, then reject the ones that don't fit your criteria.
rm(list = ls())
library(parallel)
library(data.table)
library(tictoc)
# set up the distribution information
P <- 4
values <- 1:P
dist_scores <- data.table(param = values,
                          prob = c(0.2, 0.3, 0.1, 0.4), key = "param")
perturbation <- 0.05
method <- "a"
switch(method,
       "a" = {dist_scores[, min := prob - perturbation]
              dist_scores[, max := prob + perturbation]},
       "b" = {dist_scores[, min := prob * (1 - perturbation)]
              dist_scores[, max := prob * (1 + perturbation)]}
)
# turn this into a set of data that can be sampled
N <- 10000
v <- unlist(sapply(values, FUN = function(x){
  rep(x, round(dist_scores$prob[x] * N, 0))
}))
table(v) / N
# set number of samples, and number of draws for each iteration
sams <- 10000
reps <- 200
tic()
# loop through and draw reps from the sample. The rejection policy removes
# draws that don't meet the conditions
new_iters <- mclapply(1:sams, FUN = function(x){
  y <- data.table(param = sample(v, reps, replace = TRUE))
  out <- y[, .(val = .N / reps), keyby = param]
  out <- dist_scores[out, ]
  if(out[, all(val >= min & val <= max)]){
    return(out[, c("param", "val"), with = FALSE])
  } else {
    return(NULL)
  }
})
reject_rate <- sum(sapply(new_iters, is.null)) / sams
# number of accepted samples
sams - reject_rate * sams
toc()
out <- rbindlist(new_iters)
par(mfrow = c(2, 2))
for(i in values){
  hist(out[param == i, val])
}
What about using a normal distribution for each of your probabilities?
perturbation <- 0.05
plouf <- sapply(apriori, function(x){ max(rnorm(1, mean = x, sd = perturbation * x), 0) })
plouf <- plouf/sum(plouf)
> plouf
[1] 0.2020629 0.3057111 0.0994482 0.3927778

How does createDataPartition function from caret package split data?

From the documentation:
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y
when y is a factor in an attempt to balance the class distributions within
the splits.
For numeric y, the sample is split into groups sections based on percentiles
and sampling is done within these subgroups.
For createDataPartition, the number of percentiles is set via the groups
argument.
I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.
It means that if you have a data set ds with 10000 rows
set.seed(42)
ds <- data.frame(values = runif(10000))
with 2 "classes" with unequal distribution (9000 vs 1000)
ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
# 1 2
# 9000 1000
you can create a sample that tries to maintain the ratio / "balance" of the factor classes:
library(caret)
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
# 1 2
# 900 100
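The quoted documentation also covers numeric y, which is split into percentile-based groups. Here is a small sketch of my own (not from the answer above) showing the groups argument on a skewed numeric outcome:
library(caret)
set.seed(42)
y <- rexp(10000)   # skewed numeric outcome
idx <- createDataPartition(y, p = 0.1, list = FALSE, groups = 5)
# y is cut into 5 percentile-based groups and ~10% is sampled within each,
# so the quantiles of the partition track those of the full data
quantile(y)
quantile(y[idx])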

automation of subset process

It is probably easy, but I can't figure it out.
I have a data frame with over 70 variables. I make predictions using all those variables. For sensitivity analysis I would like to subset the data frame automatically to see how the prediction performs on each specific subset.
I have done this manually but with over 100 different subset options it is very tedious.
Here is the data/code and my desired solution:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
a = c(1.7, 3.3, 5.1)
df = data.frame(n, s, b, a)
df
To calculate the accuracy of prediction a:
df$calc <- df$a - df$n
df$difference <- sqrt(df$calc * df$calc)  # i.e., abs(df$calc), the absolute error
With these values I can now calculate the Mean and SD
Mean <- mean(df$difference)
SD <- sd(df$difference)
Let's say I would like to get an overview of the prediction accuracy for all cases where b = TRUE. (Or other subsets of the data)
Ideally I would like a data frame to look like this:
subset = c("b=TRUE", "b=FALSE", "s=aa")
amount = c(2, 1, 1) # number of rows matching each subset
Mean = c(0.22, 0.3, 0.1)
SD = c(0.1, 0.2, 0.5)
OV = data.frame(subset, amount, Mean, SD)
OV
Considering that I have more than 100 different subsets that I would like to create, I need a fast solution that generates an overview like the OV data frame. I tried a loop, but I have trouble defining a vector for subsetting the data.
Thanks!
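One possible sketch (not a tested general solution): keep the subset conditions as strings, evaluate each against the data frame, and bind the per-subset summaries together. The condition strings below are just the three examples from the question; note that sd() returns NA when a subset has a single row.
conditions <- c("b == TRUE", "b == FALSE", "s == 'aa'")
OV <- do.call(rbind, lapply(conditions, function(cond) {
  rows <- df[eval(parse(text = cond), envir = df), ]  # rows matching the condition
  data.frame(subset = cond,
             amount = nrow(rows),
             Mean   = mean(rows$difference),
             SD     = sd(rows$difference))
}))
OV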

Taking a disproportionate sample from a dataset in R

If I have a large dataset in R, how can I take a random sample of the data that takes the distribution of the original data into consideration, particularly when the data are skewed, only 1% belong to a minority class, and I want to take a biased sample of the data?
The sample(x, n, replace = FALSE, prob = NULL) function takes a sample of size n from a vector x. The sample can be drawn with or without replacement, and the probability of selecting each element can either be the same for every element or be supplied as a vector by the user.
If you want to take a sample of 50 cases with the same probability for each element, all you have to do is
n <- 50
smpl <- df[sample(nrow(df), n), ]
However, if you want to give the elements different probabilities of being selected, say, elements whose sex is M have probability 0.25 while those whose sex is F have probability 0.75, you should do
n <- 50
prb <- ifelse(df$sex == "M", 0.25, 0.75)
smpl <- df[sample(nrow(df), n, prob = prb), ]
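To connect this to the 1%-minority scenario in the question, here is a sketch (the class column and the 99:1 weights are my own illustration, not part of the answer above): weighting minority rows 99:1 makes the two classes contribute roughly equal total weight, so the 50-row sample is deliberately biased towards the rare class.
set.seed(1)
dat <- data.frame(class = sample(c("minor", "major"), 10000,
                                 replace = TRUE, prob = c(0.01, 0.99)))
prb <- ifelse(dat$class == "minor", 99, 1)        # weights need not sum to 1
smpl <- dat[sample(nrow(dat), 50, prob = prb), ]  # roughly balanced sample
table(smpl$class)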
