Maximum pseudo-likelihood estimator for soft-core point process - r

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood, following the instructions given in this paper by Baddeley and Turner.
Here is the R code I came up with:
library(deldir)
library(tidyverse)
library(fields)

# MPLE
# irregularity parameter kappa
k <- 0.4

## Generate a 50 x 50 grid of dummy points; "RA" and "DE" are the x and y coordinates
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")

## Combine with the data and flag data points vs dummy points; X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
  mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))

## Quadrature weights from the Dirichlet (Voronoi) cell areas
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area

## Response
y <- bind.x$Ind / w

## Sufficient statistic for each quadrature point:
## minus the sum of d^(-2/k) over distances d to all other points
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0   # zero out the self-distances on the diagonal
t1 <- rowSums(t1)
t <- -t1

# fit the model by quasi-Poisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted coefficient on t is negative, which is not a valid value for a soft-core process. Moreover, my point pattern was actually simulated from a soft-core process, so a negative fitted parameter makes no sense. I have tried my best to find a bug in the code but cannot spot one. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical problems. The statistic is large simply because my observation window spans a very small range and the average distance between a pair of points is around 0.006, so any statistic based on inverse distances will necessarily be huge; my intuition is that this alone should not flip the sign of the fitted parameter.
Can anybody help me check whether my code is correct? Thanks very much!
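As a sanity check, one could also compare against spatstat's built-in pseudo-likelihood fit. A rough sketch, assuming X can be converted to a ppp object on the same window (the bounds ramin/ramax/demin/demax are the ones used above) and using spatstat's Softcore interaction with the same kappa:
library(spatstat)
## convert the data to a ppp object on the observation window
Xpp <- ppp(X$RA, X$DE, window = owin(c(ramin, ramax), c(demin, demax)))
## fit a stationary soft-core model with fixed kappa = 0.4
fit_spatstat <- ppm(Xpp ~ 1, Softcore(kappa = 0.4))
## summary() reports the fitted interaction parameter for comparison with the manual MPLE fit
summary(fit_spatstat)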

Related

generating random x and y coordinates with a minimum distance

Is there a way in R to generate random coordinates with a minimum distance between them?
E.g. what I'd like to avoid
x <- c(0,3.9,4.1,8)
y <- c(1,4.1,3.9,7)
plot(x~y)
This is a classical problem from stochastic geometry. Completely random points in space where the number of points falling in disjoint regions are independent of each other corresponds to a homogeneous Poisson point process (in this case in R^2, but could be in almost any space).
An important feature is that the total number of points has to be random before you can have independence of the counts of points in disjoint regions.
For the Poisson process points can be arbitrarily close together. If you define a process by sampling the Poisson process until you don't have any points that are too close together you have the so-called Gibbs Hardcore process. This has been studied a lot in the literature and there are different ways to simulate it. The R package spatstat has functions to do this. rHardcore is a perfect sampler, but if you want a high intensity of points and a big hard core distance it may not terminate in finite time... The distribution can be obtained as the limit of a Markov chain and rmh.default lets you run a Markov chain with a given Gibbs model as its invariant distribution. This finishes in finite time but only gives a realisation of an approximate distribution.
In rmh.default you can also simulate conditional on a fixed number of points. Note that when you sample in a finite box there is of course an upper limit to how many points you can fit with a given hard core radius, and the closer you are to this limit the more problematic it becomes to sample correctly from the distribution.
Example:
library(spatstat)
beta <- 100; R = 0.1
win <- square(1) # Unit square for simulation
X1 <- rHardcore(beta, R, W = win) # Exact sampling -- beware it may run forever for some par.!
plot(X1, main = paste("Exact sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X1) # Observed min. nearest neighbour dist.
#> [1] 0.102402
Approximate simulation
model <- rmhmodel(cif="hardcore", par = list(beta=beta, hc=R), w = win)
X2 <- rmh(model)
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X2, main = paste("Approx. sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X2) # Observed min. nearest neighbour dist.
#> [1] 0.1005433
Approximate simulation conditional on number of points
X3 <- rmh(model, control = rmhcontrol(p=1), start = list(n.start = 42))
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X3, main = paste("Approx. sim. given n =", 42))
minnndist(X3) # Observed min. nearest neighbour dist.
#> [1] 0.1018068
OK, how about this? You just generate random number pairs without restriction and then remove the ones which are too close to each other. This could be a great start for that:
minimumDistancePairs <- function(x, y, minDistance){
  i <- 1
  repeat{
    distance <- sqrt((x-x[i])^2 + (y-y[i])^2) < minDistance # Pythagorean theorem
    distance[i] <- FALSE  # distance to oneself is always zero
    if(any(distance)) {   # if too close to any other point
      x <- x[-i]          # remove element from x
      y <- y[-i]          # and remove element from y
    } else {              # otherwise...
      i = i + 1           # repeat the procedure with the next element
    }
    if (i > length(x)) break
  }
  data.frame(x, y)
}
minimumDistancePairs(
  c(0, 3.9, 4.1, 8),
  c(1, 4.1, 3.9, 7),
  1
)
will lead to
x y
1 0.0 1.0
2 4.1 3.9
3 8.0 7.0
Be aware, though, that these points are not random numbers anymore (however you solve the problem).
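If you want the result to remain a genuinely random pattern with a hard minimum distance, spatstat's simple sequential inhibition sampler is one option. A quick sketch (window and counts chosen arbitrarily to match the toy example):
library(spatstat)
set.seed(1)
## 4 points, at least 1 unit apart, in an 8 x 8 window
P <- rSSI(r = 1, n = 4, win = owin(c(0, 8), c(0, 8)))
cbind(x = P$x, y = P$y)
min(dist(cbind(P$x, P$y)))  # check: no pair closer than 1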
You can use rejection sampling: https://en.wikipedia.org/wiki/Rejection_sampling
The principle is simple: you resample until your data satisfy the condition.
> set.seed(1)
>
> x <- rnorm(2)
> y <- rnorm(2)
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 6.565578
> while((x[1]-x[2])^2+(y[1]-y[2])^2 > 1) {
+ x <- rnorm(2)
+ y <- rnorm(2)
+ }
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 0.9733252
>
The following is a naive hit-and-miss approach which for some choices of parameters (which were left unspecified in the question) works well. If performance becomes an issue, you could experiment with the package gpuR which has a GPU-accelerated distance matrix calculation.
rand.separated <- function(n, x0, x1, y0, y1, d, trials = 1000){
  for(i in 1:trials){
    nums <- cbind(runif(n, x0, x1), runif(n, y0, y1))
    if(min(dist(nums)) >= d) return(nums)
  }
  return(NA) # no luck
}
This repeatedly draws samples of size n in [x0,x1] x [y0,y1] and throws a sample away if it doesn't satisfy the minimum-distance condition. As a safety, trials guards against an infinite loop. If solutions are hard to find or n is large, you may need to increase trials (or lower n).
For example:
> set.seed(2018)
> nums <- rand.separated(25,0,10,0,10,0.2)
> plot(nums)
runs almost instantly and produces a plot of 25 points, no two of which are closer than 0.2.
I'm not sure what you are asking.
If you want random coordinates between the corresponding x and y values, here:
c(
  runif(1, max = y[1], min = x[1]),
  runif(1, max = y[2], min = x[2]),
  runif(1, min = y[3], max = x[3]),
  runif(1, min = y[4], max = x[4])
)

Fit distribution to given frequency values in R

I have frequency values changing over time (x-axis units), as shown in the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To avoid being misunderstood, I would like to add a little more explanation. By saying that I have frequency values changing over time (x-axis units), I mean that my data say I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
One way towards my goal (an incorrect one, I think) would be to create the full set of these realizations:
# Loop to expand the frequencies into individual realizations
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr (from MASS) on set.values:
library(MASS)
f2 <- fitdistr(set.values, 'weibull')
f2
Why do I think this is the wrong way, and why am I looking for a better solution in R?
In the distribution-fitting approach above, set.values is assumed to be a complete set of realisations from T.
In my original question, however, I only know the points from the first part of the density curve; I do not know its tail, and I want to estimate the tail (and the whole density function).
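To make the truncation issue concrete: a fit that respects it would have to maximize a likelihood truncated at the last observed value. A rough sketch of what I mean (assuming the set.values expansion above and a cutoff at 22, the largest observed value):
# negative log-likelihood of a Weibull right-truncated at `upper`
negll_trunc <- function(par, x, upper) {
  shape <- par[1]; scale <- par[2]
  if (shape <= 0 || scale <= 0) return(Inf)
  -sum(dweibull(x, shape, scale, log = TRUE)) +
    length(x) * pweibull(upper, shape, scale, log.p = TRUE)
}
fit_tr <- optim(c(1, 5), negll_trunc, x = set.values, upper = 22)
fit_tr$par  # shape and scale estimates accounting for the missing tail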
Here is a better attempt; like before, it uses optim to find the best values constrained to a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization in addition to the Weibull shape parameter, so there are 3 parameters to optimize over.
Unfortunately, when using all the points it pretty much always ends up on the edge of the constraining box, which suggests to me that a Weibull is perhaps not a good fit for all of the data; the first two points are just too large. You can see the attempted fit to all the data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit, shown in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.
library(optimx)
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
  res <- param[2] * dweibull(t.fit * param[3], shape = param[1])
  return(res)
}
minwx <- function(param){
  v <- s.fit - wx(param)
  sqrt(sum(v * v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0,minwx,gr=NULL,lower=c(0.1,100,0.01),upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit ML equation for the shape k
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 10^-7)$min
# Calculate lambda, given k: lambda = (weighted mean of x^k)^(1/k)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)
# Plot
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape = k, scale = l))
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
library(MASS)
f <- fitdistr(sample, 'weibull')
f
If you are not sure whether the data are Weibull-distributed, I would recommend a Kolmogorov-Smirnov test (ks.test). This tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few candidate distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution.
Update: The histograms of either the Weibull or the exponential look like a reasonable match to your data; I think the exponential distribution gives the better fit. A Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)
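For the Pareto option, a rough sketch along the same lines (this assumes the actuar package for the Pareto density; the starting values are guesses and may need tuning):
library(MASS)
library(actuar)
fp <- fitdistr(sample, dpareto, start = list(shape = 1, scale = 500),
               lower = c(0.01, 0.01))
fp
zp <- rpareto(10000, shape = fp$estimate[1], scale = fp$estimate[2])
hist(zp)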

R -- Simulate sigmoidally correlated covariates

I am attempting to simulate paired weight and age values for a population of children. These data should be sigmoidally correlated such that at low ages weight changes slowly, then by approximately 30 weeks postmenstrual age there is an acceleration in weight gain, which begins to level off past about 50 weeks.
I have been able to use the code below to get a linear correlation between weight and age to work fairly well. The part I am having trouble with is adapting this code to get a more sigmoidal shape to the data. Any suggestions would be greatly appreciated.
# Load required packages
library(MASS)
library(ggplot2)
# Set the number of simulated data points
n <- 100
# Set the mean and standard deviations for
# the two variables
mean_age <- 50
sd_age <- 20
mean_wt <- 10
sd_wt <- 4
# Set the desired level of correlation
# between the two variables
cor_agewt <- 0.9
# Build the covariance matrix
covmat <- matrix(c(sd_age^2, cor_agewt * sd_age * sd_wt,
cor_agewt * sd_age * sd_wt, sd_wt^2),
nrow = 2, ncol = 2, byrow = TRUE)
# Simulate the correlated results
res <- mvrnorm(n, c(mean_age, mean_wt), covmat)
# Reorganize the simulate data into a data frame
df <- data.frame(age = res[,1],
wt = res[,2])
# Plot the results and fit a loess spline
# to the data
ggplot(df, aes(x = age, y = wt)) +
geom_point() +
stat_smooth(method = 'loess')
Current output:
Ideal output (albeit over a smaller range of ages and weights):
One approach is to specify the functional form between weight and age rather than just a single correlation. After specifying the functional form weight ~ f(age) + e, you just draw (age, e) and then calculate the weight. A simple example follows:
set.seed(1234)
mean_age <- 50; sd_age <- 20
mean_wt <- 3.5; sd_wt <- 2.2
n<-400
age.seq<-rnorm(n,mean_age,sd_age)
age.seq<-age.seq[order(age.seq)]
#functional form: (here a "logistic" with a location and scale)
f<-function(x,loc,sca) 1/(1+exp(-(x-loc)/sca))
wt<-f(age.seq,65,20) #wt
m<-mean_wt/mean(wt) #simple adjustment of the mean
sdfit<-sqrt( sd_wt^2-var(m*wt) )
sim_wt<-m*wt+rnorm(n,0,sdfit) #simulated wt
plot(age.seq,sim_wt)
lines(age.seq,m*wt)
with mean & sd:
>sd(age.seq); sd(sim_wt); mean(sim_wt); mean(age.seq) #check
[1] 20.29432
[1] 2.20271
[1] 3.437339
[1] 50.1549
EDIT (partly in response to a comment):
A restriction on the sample space, e.g. requiring weights to be nonnegative, makes the problem much harder. But if you drop the mean + sd restriction on the weights, it is easy to extend the example to a flexible specification of the functional form. The following is a simple example using a truncated normal distribution:
set.seed(1234)
mean_age<-30
sd_age<-10
n<-500
#ex. of control of functional-form
loc<-40 #location
scale<-10 #scaling
sd_wt <- 0.8 #in the truncated normal
ey_min<-c(0,0.2) #in the truncated normal
ey_max<-c(55,6) #in the truncated normal
age.seq<-rnorm(n,mean_age,sd_age)
#age.seq<-0:55
n<-length(age.seq)
age.seq<-age.seq[order(age.seq)]
#functional form: (here a "logistic" with a location and scale)
f<-function(x,loc,sca) 1/(1+exp(-(x-loc)/sca))
wt<-f(age.seq,loc,scale) #wt
#correct lower:
corr_lower<-ey_min[2]-f(ey_min[1],loc,scale) #add. correction lower
wt<-wt+corr_lower
#correct upper
mult<-(ey_max[2]-ey_min[2])/(f(ey_max[1],loc,scale)+corr_lower) #mult. correction
wt<-ey_min[2]+wt*mult*(age.seq/ey_max[1])
plot(age.seq,wt,type="l",ylim=c(0,8)) #plot mean used as par in the truncated normal
sim_wt<-truncnorm::rtruncnorm(n,0,,mean=wt,sd=sd_wt)
points(age.seq,sim_wt)
abline(h=0.2,col=2);abline(v=0,col=2)
abline(h=6,col=2);abline(v=55,col=2)
which gives (red lines illustrating the controls):
Of course you could also try to control the variance with respect to age; a simplified version:
plot(age.seq,wt,type="l",ylim=c(0,8)) #plot mean used as par in the truncated normal
sim_wt<-truncnorm::rtruncnorm(n,0,,mean=wt,sd=sd_wt*seq(0.3,1.3,len=n))
points(age.seq,sim_wt)
The point here is that you need more structure to simulate specific data like this (not going into e.g. bootstrap methods); there is no built-in R function to the rescue. Of course it gets harder to sample from the distribution as you introduce more restrictions. You can always consult Cross Validated for different approaches, choice of distribution, etc.

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
library(fitdistrplus)
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit an "interval-censored" distribution, i.e. to the probability of emergence between successive degree-day observations. This version assumes that the first interval covers everything before the first degree-day observation; you could change it so that the last interval covers everything after the last observation instead.
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
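For the record, a rough moment-matching sketch (treating the emergence counts as frequency weights on the degree-day values; the root-finding interval is a guess) gives plausible starting values:
## weighted mean and variance of degree-days, weighted by emergence counts
mu  <- weighted.mean(dd$degree_days, dd$emergence)
v   <- sum(dd$emergence * (dd$degree_days - mu)^2) / sum(dd$emergence)
cv2 <- v / mu^2
## Weibull: CV^2 = gamma(1 + 2/k) / gamma(1 + 1/k)^2 - 1; solve for the shape k
kfun   <- function(k) gamma(1 + 2/k) / gamma(1 + 1/k)^2 - 1 - cv2
shape0 <- uniroot(kfun, c(0.5, 50))$root
scale0 <- mu / gamma(1 + 1/shape0)
c(scale = scale0, shape = shape0)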
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
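For what such a check might look like (a rough sketch only; many expected counts are small, so the chi-squared approximation is dubious, as noted above):
## expected emergences per interval under the fitted 3-parameter model
probs <- with(as.list(coef(m2)),
              pmax(diff(pweibull3(c(-Inf, dd$degree_days),
                                  shape = shape, scale = scale, thres = thres)),
                   1e-6))
expec <- sum(dd$emergence) * probs
chisq <- sum((dd$emergence - expec)^2 / expec)
## df: number of intervals minus 1, minus 3 estimated parameters
pchisq(chisq, df = length(expec) - 1 - 3, lower.tail = FALSE)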
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal sampling distribution of the parameter estimates); (3) the delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
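A rough sketch of option (2), a parametric bootstrap of the fitted two-parameter curve (parameter draws from a multivariate normal with mean coef(m1) and covariance vcov(m1); the grid and number of draws are arbitrary):
library(MASS)
set.seed(101)
pars  <- mvrnorm(1000, mu = coef(m1), Sigma = vcov(m1))
xgrid <- seq(1000, 2900, length = 101)
## cumulative-emergence curve for each parameter draw
curves <- apply(pars, 1, function(p)
  sum(dd$emergence) * pweibull(xgrid, shape = p["shape"], scale = p["scale"]))
ci <- apply(curves, 1, quantile, probs = c(0.025, 0.975))
with(dd, plot(cum_em ~ degree_days, cex = 3))
matlines(xgrid, t(ci), lty = 2, col = "grey40")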
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Generating Random Variables with given correlations between pairs of them:

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each normal) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n, probs, corrs) {
  tmp <- normalCopula(corrs, dim = 4, dispstr = "un")
  getSigma(tmp) ## check the implied correlation matrix
  x <- rCopula(n, tmp)  # use the requested sample size
  x2 <- x
  x2[, 3:4] <- qbinom(x[, 3:4], size = 1, prob = rep(probs, each = nrow(x)))
  x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs, targetcorrs, probs, n = 1000) {
  cc <- try(cor(sim(n, probs, corrs)), silent = TRUE)
  if (is(cc, "try-error")) return(NA)
  sum((cc[lower.tri(cc)] - targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5),
method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are: (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the noise in the objective-function surface is getting in the way; to do better we would have to average over more replicates (i.e. increase n).
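Along those lines, a rough sketch of averaging the stochastic objective over several replicates before optimizing (n and n_rep are arbitrary choices; the surface is still noisy, just less so):
objfun_avg <- function(corrs, targetcorrs, probs, n = 5000, n_rep = 5) {
  vals <- replicate(n_rep, objfun(corrs, targetcorrs, probs, n = n))
  mean(vals, na.rm = TRUE)
}
opt3 <- optim(fn = objfun_avg, par = corrs,
              targetcorrs = corrs, probs = c(0.5, 0.5))
opt3$value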
