Generating random variables with a specific correlation threshold value

I am generating random variables with a specified range and dimension. I have written the following code for this.
generateRandom <- function(size, scale) {
  result <- round(runif(size, 1, scale), 1)
  return(result)
}
flag <- TRUE
x <- generateRandom(300, 6)
y <- generateRandom(300, 6)
while (flag) {
  corrXY <- cor(x, y)
  if (corrXY >= 0.2) {
    flag <- FALSE
  } else {
    x <- generateRandom(300, 6)
    y <- generateRandom(300, 6)
  }
}
I want 6 variables of size 300, each on a scale of 1 to 6 except for one variable which would be on a scale of 1 to 7, with the following correlation structure among them.
1      0.45   -0.35  0.46   0.25   0.3
       1      0.25   0.29   0.5    -0.3
              1      -0.3   0.1    0.4
                     1      0.4    0.6
                            1      -0.4
                                   1
But when I try to increase the threshold value my program gets very slow. Moreover, I want more than 7 variables of size 300, and between each pair of those variables I want some specific correlation threshold. How would I do it efficiently?

This answer is directly inspired by here and there.
We would like to generate 300 samples of a 6-variate uniform distribution with correlation structure equal to
Rhos <- matrix(0, 6, 6)
Rhos[lower.tri(Rhos)] <- c(0.450, -0.35, 0.46, 0.25, 0.3,
                           0.25, 0.29, 0.5, -0.3, -0.3,
                           0.1, 0.4, 0.4, 0.6, -0.4)
Rhos <- Rhos + t(Rhos)
diag(Rhos) <- 1
From this correlation structure we first derive the correlation structure of the Gaussian copula:
Copucov <- 2 * sin(Rhos * pi/6)
This matrix is not positive definite, so we use the nearest positive definite matrix instead:
library(Matrix)
Copucov <- cov2cor(nearPD(Copucov)$mat)
This correlation structure can be used as one of the inputs of MASS::mvrnorm:
G <- MASS::mvrnorm(n = 300, mu = rep(0, 6), Sigma = Copucov, empirical = TRUE)
We then transform G into a multivariate uniform sample whose values range from 1 to 6, except for the last variable which ranges from 1 to 7:
U <- matrix(NA, 300, 6)
U[, 1:5] <- 5 * pnorm(G[, 1:5]) + 1
U[, 6] <- 6 * pnorm(G[, 6]) + 1
After rounding (and taking the nearest positive definite matrix to the copula's covariance matrix, etc.), the correlation structure does not change much:
Ur <- round(U, 1)
cor(Ur)
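The same steps generalise to more than 7 variables. As a sketch (not part of the original answer; gen_corr_unif and its arguments are names introduced here), the whole pipeline can be wrapped in a function that takes any target correlation matrix and a vector of upper bounds:
gen_corr_unif <- function(n, Rhos, upper, lower = 1) {
  k <- ncol(Rhos)
  # Gaussian copula correlation, repaired to the nearest positive definite matrix
  Copucor <- cov2cor(as.matrix(Matrix::nearPD(2 * sin(Rhos * pi/6))$mat))
  G <- MASS::mvrnorm(n = n, mu = rep(0, k), Sigma = Copucor, empirical = TRUE)
  # map column j to a uniform on [lower, upper[j]] and round to one decimal, as above
  U <- sweep(pnorm(G), 2, upper - lower, "*") + lower
  round(U, 1)
}
# e.g. eight variables, the last one on a 1-7 scale, given an 8x8 target matrix Rhos8:
# U <- gen_corr_unif(300, Rhos8, upper = c(rep(6, 7), 7))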

Related

Using R to sample 3 proportion variables so that the three samples add to 1

I have a data set that is split into 3 profiles
Profile 1 = 0.478 (95% confidence interval: 0.4, 0.56)
Profile 2 = 0.415 (95% confidence interval: 0.34, 0.49)
Profile 3 = 0.107 (95% confidence interval: 0.06, 0.15)
Profile 1 + Profile 2 + Profile 3 = 1
I want to create a stochastic model that selects a value for each profile from each proportion's confidence interval. I want to ensure that these add up to one. I have been using
pro1_prop <- rpert(1, 0.4, 0.478, 0.56)
pro2_prop <- rpert(1, 0.34, 0.415, 0.49)
pro3_prop <- 1 - (pro1_prop + pro2_prop)
But this does not seem robust enough. Also, on some iterations (pro1_prop + pro2_prop) > 1, which results in a negative value for pro3_prop. Is there a better way of doing this? Thank you!
It is straightforward to sample from the posterior distributions of the proportions using Bayesian methods. I'll assume a multinomial model, where each observation is one of the three profiles.
Say the counts data for the three profiles are 76, 66, and 17.
Using a Dirichlet prior distribution, Dir(1/2, 1/2, 1/2), the posterior is also Dirichlet-distributed: Dir(76.5, 66.5, 17.5), which can be sampled using normalized random gamma variates.
x <- c(76, 66, 17) # observations
# take 1M samples of the proportions from the posterior distribution
theta <- matrix(rgamma(3e6, rep(x + 1/2, each = 1e6)), ncol = 3)
theta <- theta/rowSums(theta)
head(theta)
#> [,1] [,2] [,3]
#> [1,] 0.5372362 0.3666786 0.09608526
#> [2,] 0.4008362 0.4365053 0.16265852
#> [3,] 0.5073144 0.3686412 0.12404435
#> [4,] 0.4752601 0.4367119 0.08802793
#> [5,] 0.4428575 0.4520680 0.10507456
#> [6,] 0.4494075 0.4178494 0.13274311
# compare the Bayesian credible intervals with the frequentist confidence intervals
cbind(
  t(mapply(function(i) quantile(theta[, i], c(0.025, 0.975)), seq_along(x))),
  t(mapply(function(y) setNames(prop.test(y, sum(x))$conf.int, c("2.5%", "97.5%")), x))
)
#> 2.5% 97.5% 2.5% 97.5%
#> [1,] 0.39994839 0.5537903 0.39873573 0.5583192
#> [2,] 0.33939396 0.4910900 0.33840295 0.4959541
#> [3,] 0.06581214 0.1614677 0.06535702 0.1682029
If samples within the individual 95% CIs are needed, simply reject samples that fall outside the desired interval.
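For example, restricting to the question's 95% CIs could look like this (a sketch added here; keep and theta_trunc are names introduced for illustration):
keep <- theta[, 1] >= 0.40 & theta[, 1] <= 0.56 &
        theta[, 2] >= 0.34 & theta[, 2] <= 0.49 &
        theta[, 3] >= 0.06 & theta[, 3] <= 0.15
theta_trunc <- theta[keep, ]
mean(keep)  # fraction of posterior samples retained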
TL;DR: Sample all three values (for example from a pert distribution, as you did) and norm those values afterwards so they add up to one.
Sampling all three values independently from each other and then dividing by their sum so that the normed values add up to one seems to be the easiest option as it is quite hard to sample from the set of legal values directly.
Legal values:
The downside of my approach is that the normed values are not necessarily legal (i.e. in the range of the confidence intervals) any more. However, for these values using a pert distribution, this only happens about 0.5% of the time.
Code:
library(plotly)
library(freedom)
library(data.table)
# define lower (L) and upper (U) bounds and expected values (E)
prof1L <- 0.4
prof1E <- 0.478
prof1U <- 0.56
prof2L <- 0.34
prof2E <- 0.415
prof2U <- 0.49
prof3L <- 0.06
prof3E <- 0.107
prof3U <- 0.15
dt <- as.data.table(expand.grid(
  Profile1 = seq(prof1L, prof1U, by = 0.002),
  Profile2 = seq(prof2L, prof2U, by = 0.002),
  Profile3 = seq(prof3L, prof3U, by = 0.002)
))
# color based on how far the points are away from the center
dt[, color := abs(Profile1 - prof1E) + abs(Profile2 - prof2E) + abs(Profile3 - prof3E)]
# only keep those points that (almost) add up to one
dt <- dt[abs(Profile1 + Profile2 + Profile3 - 1) < 0.01]
# plot the legal values
fig <- plot_ly(dt, x = ~Profile1, y = ~Profile2, z = ~Profile3, color = ~color, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers()
fig
# try to simulate the legal values:
# first sample without considering the condition that the profiles need to add up to 1
nSample <- 100000
dtSample <- data.table(
  Profile1Sample = rpert(nSample, prof1L, prof1U, prof1E),
  Profile2Sample = rpert(nSample, prof2L, prof2U, prof2E),
  Profile3Sample = rpert(nSample, prof3L, prof3U, prof3E)
)
# we want to norm the samples by dividing by their sum
dtSample[, SampleSums := Profile1Sample + Profile2Sample + Profile3Sample]
dtSample[, Profile1SampleNormed := Profile1Sample / SampleSums]
dtSample[, Profile2SampleNormed := Profile2Sample / SampleSums]
dtSample[, Profile3SampleNormed := Profile3Sample / SampleSums]
# now get rid of the cases where the normed values are no longer legal
# (e.g. Profile 1 = 0.56, Profile 2 = 0.49, Profile 3 = 0.06 => dividing by their
# sum of 1.11 pushes Profile 3 below its lower bound of 0.06)
dtSample <- dtSample[
prof1L <= Profile1SampleNormed & Profile1SampleNormed <= prof1U &
prof2L <= Profile2SampleNormed & Profile2SampleNormed <= prof2U &
prof3L <= Profile3SampleNormed & Profile3SampleNormed <= prof3U
]
# see if the sampled values follow the desired distribution
hist(dtSample$Profile1SampleNormed)
hist(dtSample$Profile2SampleNormed)
hist(dtSample$Profile3SampleNormed)
Histogram of normed sampled values for Profile 1:
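The rejection rate of this filter can also be checked directly (a small check added here, not in the original answer; nSample is still 100000):
1 - nrow(dtSample) / nSample  # fraction of samples dropped as illegal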
Ok, some thoughts on the matter.
Let's think about the Dirichlet distribution, as one that provides random variables summing to 1.
We're talking about Dir(a1, a2, a3) and have to find suitable ai.
From the expression E[Xi] = ai/(a1 + a2 + a3), we can solve for the ratios from the equations
a1/(a1 + a2 + a3) = 0.478
a2/(a1 + a2 + a3) = 0.415
a3/(a1 + a2 + a3) = 0.107
Note that we have only solved for RATIOS. In other words, if we multiply every ai in E[Xi] = ai/(a1 + a2 + a3) by the same value, the means stay the same. So we have the freedom to choose a multiplier m, and what changes is the variance/std. dev.: a larger multiplier means a smaller variance, with sampled values tighter around the means.
But we cannot choose one m to satisfy all three 95% CI conditions at once: that is three equations for the variances with only one degree of freedom, so it is not possible in general.
One could play with the numbers and the code.
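As a rough sketch of that idea (not in the original answer; the multiplier m = 100 is an arbitrary choice to be tuned):
p <- c(0.478, 0.415, 0.107)   # target means, i.e. the ratios a_i / sum(a_i)
m <- 100                      # larger m -> smaller variance, tighter samples around p
a <- m * p
n <- 10000
samp <- matrix(rgamma(3 * n, shape = rep(a, each = n)), ncol = 3)
samp <- samp / rowSums(samp)  # Dirichlet(a) samples; each row sums to 1
colMeans(samp)                # should be close to p
apply(samp, 2, quantile, c(0.025, 0.975))  # compare the spread with the 95% CIs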

Divide rows into groups given the similarity between them

Given this example data frame:
DF <- data.frame(x = c(1, 0.85, 0.9, 0, 0, 0.9, 0.95),
                 y = c(0, 0, 0.1, 0.9, 1, 0.9, 0.97),
                 z = c(0, 0, 0, 0.9, 0.9, 0.0, 0.9))
I am trying to assign each row to a group containing rows adjacent to one another, based on their similarity. I would like to use a cutoff of 0.35, meaning that consecutive rows of values c(1, 0.85, 0.7) can be assigned to one group, but c(0, 1, 0) cannot.
Regarding the columns, column-to-column differences are not important, i.e. c(1, 1, 1) and c(0, 0, 0) could still be assigned to one group. HOWEVER, if the rows in one column meet the criteria (e.g. c(1, 1, 1)) but the rows in another column do not (e.g. c(1, 0, 1)), the row is invalid.
Here is the desired output for the example I gave above:
[1] 1 1 1 2 2 NA NA
I am currently applying the abs(diff()) function to determine the difference between the values, and then for each row I take the largest value (adding 1 at the beginning to account for the first row):
diff <- apply(DF, MARGIN = 2, function (x) abs(diff(x)))
max_diff <- c(1, apply(diff, MARGIN = 1, function (x) max(x, na.rm = T)))
max_diff
[1] 1.00 0.15 0.10 0.90 0.10 0.90 0.90
I am stuck at this point, not quite sure what is the best way to proceed with the group assignment. I was initially trying to convert max_diff into a logical vector (max diff < 0.35), and then running a for loop grouping all the TRUEs together. This has a couple of problems:
My dataset has millions of rows so the for loop takes ages,
I "ignore" the first component of the group - e.g. I would not consider the first row as a member of the first group, because the max_diff value of 1 gives FALSE. I don't want to ignore anything.
I will be very grateful for any advice on how to proceed in an efficient way.
PS. The way of determining the difference between sites is not crucial - here it is just a difference of 0.35 but this is very flexible. All I am after is an adjustable method of finding similar rows.
You could do a cluster analysis and play around with different cutoffs h.
cl <- hclust(dist(DF))
DF$group <- cutree(cl, h=.5)
DF
# x y z group
# 1 1.00 0.00 0.0 1
# 2 0.85 0.00 0.0 1
# 3 0.90 0.10 0.0 1
# 4 0.00 0.90 0.9 2
# 5 0.00 1.00 0.9 2
# 6 0.90 0.90 0.0 3
# 7 0.95 0.97 0.9 4
A dendrogram helps to determine h.
plot(cl)
abline(h=.5, col=2)
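A vectorized alternative that works directly on the max_diff vector from the question (a sketch, not part of the hclust answer; it reproduces the desired output, including the NA rule for groups of size 1):
grp <- cumsum(max_diff >= 0.35)                 # start a new group at every large jump
grp[grp %in% which(tabulate(grp) == 1)] <- NA   # singleton groups become NA
grp
# [1] 1 1 1 2 2 NA NA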

Optimisation of matrix in R

I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here but neither have quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
                0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1, 1]/2, x[2, 1]/2, x[4, 1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing and I haven't been able to get anything to work!
Ideally I will have one set of propensities that generalises well for lots of different regions at the end of this process. Any advice on how I should go about achieving it? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
                            0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
             pop = c(2600, 16200, 13400),
             target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
                     0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1, 1]/2, z[2, 1]/2, z[4, 1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose a random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from that mtx column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb the second propensity so that the mtx column still adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
                    neighbour = neighbour,
                    nS = 50000,
                    printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is sufficient, except that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you would need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take a matrix, select two non-zero elements from one column, and increase one and decrease the other. Thus, the column sum would remain unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .
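A minimal sketch of such a neighbourhood function (not Enrico's code; neighbour_sym and step are names introduced here, and it assumes each column has at least two non-zero entries, as in the example):
neighbour_sym <- function(x, data, step = 0.05) {
  # 'data' is unused here but kept so the signature matches what LSopt passes on
  i <- sample(ncol(x), 1)                    # pick a column
  j <- sample(which(x[, i] != 0), 2)         # pick two non-zero entries in it
  d <- runif(1, 0, min(step, x[j[2], i]))    # move at most what the donor entry holds
  x[j[1], i] <- x[j[1], i] + d
  x[j[2], i] <- x[j[2], i] - d
  x                                          # column sums (and zeros) are unchanged
}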

Sampling using conditional probability table

I am trying to simulate a certain discrete variable depicting the "true state of the world" (say, "red", "green" or "blue") and an indicator that describes it somewhat imperfectly.
r_names <- c("Red", "Green", "Blue")
Let's say I have some prior belief about the distribution of the "reality" variable, which I will use to sample it.
r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)
Now, let's say I have conditional probability table that stipulates the value of indicator given each of the "realities"
ri_matrix <- matrix(c(0.7, 0.3, 0,
                      0.2, 0.6, 0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)
dimnames(ri_matrix) <- list(paste("real", r_names, sep = "_"),
                            paste("ind", r_names, sep = "_"))
ri_matrix
>#            ind_Red ind_Green ind_Blue
># real_Red      0.70      0.30      0.0
># real_Green    0.20      0.60      0.2
># real_Blue     0.05      0.15      0.8
Since base::sample() is not vectorized over the prob argument, I have to write:
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x)
    sample(seq_len(ncol(rim)), 1, prob = rim[x, ], replace = TRUE)))
}
Now I can sample my "indicator" variable using the conditional probability matrix
set.seed(200)
indicator <- sample_cond(reality, ri_matrix)
Just to make sure the distributions turned out as expected:
prop.table(table(reality, indicator), margin = 1)
#> indicator
#> reality 1 2 3
#> 1 0.70043610 0.29956390 0.00000000
#> 2 0.19976124 0.59331476 0.20692400
#> 3 0.04365278 0.14400401 0.81234320
Is there a better (i.e. more idiomatic and/or efficient) way to sample a discrete variable conditioned on another discrete random variable?
UPDATE:
As suggested by @Mr.Flick, this is at least 50x faster because it reuses probability vectors instead of repeatedly subsetting the conditional probability matrix.
sample_cond_group <- function(r, rim){
  il <- mapply(function(x, y) {
                 sample(seq(ncol(rim)), length(x), prob = y, replace = TRUE)
               },
               x = split(r, r),
               y = split(rim, seq(nrow(rim))))
  unsplit(il, r)
}
You can be a bit more efficient by drawing all the random samples per group with a split/combine type strategy. That might look something like this
simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace = TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}
Then you can call it with
simFun(10000, r_probs, ri_matrix)
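A quick sanity check of the output (added here, not in the original answer; simFun returns a data frame with columns ind and draw):
set.seed(1)
sim <- simFun(10000, r_probs, ri_matrix)
prop.table(table(sim$ind, sim$draw), margin = 1)  # rows should approximate ri_matrix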

Easily input a correlation matrix in R

I have an R script I'm running now that currently uses 3 correlated variables. I'd like to add a 4th, and am wondering if there's a simple way to input matrix data, particularly correlation matrices: some Matlab-like technique to enter a 3x3 or 4x4 correlation matrix in R without the linear-to-matrix reshape I've been using.
In Matlab, you can use the semicolon as an end-row delimiter, so it's easy to keep track of where the cross correlations are.
In R, I first create
corr <- c(1, 0.1, 0.5,
          0.1, 1, 0.9,
          0.5, 0.9, 1)
cormat <- matrix(corr, ncol = 3)
Versus
cormat = [1 0.1 0.5;
          0.1 1 0.9;
          0.5 0.9 1]
It just feels clunkier, which makes me suspect there's a smarter way I haven't looked up yet. Thoughts?
Welcome to the site! :) You should be able to do it in one step:
MyMatrix = matrix(
  c(1, 0.1, 0.5,
    0.1, 1, 0.9,
    0.5, 0.9, 1),
  nrow = 3,
  ncol = 3)
Here is another way:
CorrMat <- matrix(scan(),3,3,byrow=TRUE)
1 0.1 0.5
0.1 1 0.9
0.5 0.9 1
The trailing blank line is important: it terminates scan().
If you want to input a symmetric matrix, you can use the xpnd() function in the MCMCpack library.
xpnd() takes a vector which corresponds to the upper-triangle of the matrix (thus you only have to enter each value once). For instance, if you want to input:
$\left(\begin{array}{c c c}
1 & 0.1 & 0.5 \\
0.1 & 1 & 0.9 \\
0.5 & 0.9 & 1
\end{array}\right)$
You would use
library(MCMCpack)
xpnd(c(1, 0.1, 0.5, 1, 0.9, 1), 3)
where 3 refers to the number of rows in the matrix.
Help page for xpnd.
rbind(c(1, 0.1, 0.5),
      c(0.1, 1, 0.9),
      c(0.5, 0.9, 1))
The existing solutions may only work as written for a 3x3 matrix. I tried this one:
a <- diag(3)                                       # identity, to remove the doubled diagonal later
m <- diag(3)
m[lower.tri(m, diag = FALSE)] <- c(0.1, 0.5, 0.9)  # fill the lower triangle
m <- m + t(m) - a                                  # mirror it to the upper triangle
As you are working with correlation matrices, you are probably not interested in entering the diagonal, and both the upper and lower parts. You can manipulate/extract those three parts separately using diag(), upper.tri() and lower.tri().
> M <- diag(3) # create 3x3 matrix, diagonal defaults to 1's
> M[lower.tri(M, diag=F)] <- c(0.1, 0.5, 0.9) # read in lower part
> M # lower matrix containing all information
[,1] [,2] [,3]
[1,] 1.0 0.0 0
[2,] 0.1 1.0 0
[3,] 0.5 0.9 1
If you want the full matrix:
> M[upper.tri(M, diag=F)] <- M[lower.tri(M)] # fill upper part
> M # full matrix
[,1] [,2] [,3]
[1,] 1.0 0.1 0.5
[2,] 0.1 1.0 0.9
[3,] 0.5 0.9 1.0
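One caution (not in the original answer): for matrices larger than 3x3, the element orders of upper.tri() and lower.tri() are not transposes of each other, so the mirroring step is safer written as
M[upper.tri(M)] <- t(M)[upper.tri(M)]  # works for any size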
