In the past I have used igraph to generate small world networks with a specified rewiring probability p, which is especially easy because it's an argument in the sample_smallworld function. For instance:
myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = 0.25)
plot(myNetwork, layout = layout_in_circle)
I'd now like to generate small world networks with a specified clustering coefficient. I'm new to igraph and this seems like a functionality that it would have, but after some searching I've only found ways to calculate the coefficient from pre-existing networks, rather than a way to use it as a parameter for generating the network itself.
What's the best way to generate networks with a specified clustering coefficient?
If you are okay with a few conditions, it is possible to get a somewhat hacky approximation using sna::rguman(). The conditions are: (1) you use global transitivity (though you might be able to modify the approach); (2) you use undirected graphs; (3) you use large graphs when targeting small transitivity values, or larger transitivity values for small graphs. You also have to give up the sample_smallworld() algorithm. If that's OK, this might get you where you want to go:
library(sna)
library(igraph)
sample_cluster <- function(nv = 150, clustering_coef = 0.5, thres = 0.05) {
  # Draw an undirected random graph in which each dyad is tied with
  # probability clustering_coef, then resample until the observed global
  # transitivity falls within +/- thres of the target.
  g <- sna::rguman(1, nv, mut = clustering_coef, asym = 0, null = 1 - clustering_coef) %>%
    graph_from_adjacency_matrix(mode = "undirected")
  while (!(transitivity(g) >= clustering_coef - thres & transitivity(g) <= clustering_coef + thres)) {
    g <- sna::rguman(1, nv, mut = clustering_coef, asym = 0, null = 1 - clustering_coef) %>%
      graph_from_adjacency_matrix(mode = "undirected")
  }
  return(g)
}
sample_cluster(15, clustering_coef = 0.2, thres = 0.001) %>% transitivity()
#> [1] 0.2
sample_cluster(200, clustering_coef = 0.01, thres = 0.001) %>% transitivity()
#> [1] 0.009009009
sample_cluster(200, clustering_coef = 0.2, thres = 0.001) %>% transitivity()
#> [1] 0.2007628
sample_cluster(20, clustering_coef = 0.7, thres = 0.001) %>% transitivity()
#> [1] 0.7007168
Created on 2020-03-31 by the reprex package (v0.3.0)
Not fancy, not sophisticated, but might do the trick!
As far as I understand, you cannot specify the clustering coefficient (or transitivity, as it is called in the igraph package) because it follows from the parameters you specify.
Why? sample_smallworld generates a graph according to the Watts-Strogatz Model, as explained in the documentation. Check out the maths of the model on Wikipedia. The Watts-Strogatz model has three parameters:
Number of nodes (size parameter in the sample_smallworld method, N in the Wikipedia article);
Average number of links per node (nei parameter, K on Wikipedia);
Probability of rewiring each link in the graph to connect a random pair of nodes (p parameter, beta on Wikipedia).
Have a look at Wikipedia (section "Clustering coefficient") to understand how the clustering coefficient can be computed from these parameters.
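For reference, that section gives the Barrat-Weigt approximation C(beta) ≈ 3(K - 2) / (4(K - 1)) * (1 - beta)^3 for the expected clustering coefficient. A minimal sketch of it in R, assuming (my reading of the igraph docs) that nei is the neighbourhood radius, so the lattice degree is K = 2 * nei:
# Barrat-Weigt approximation of the Watts-Strogatz clustering coefficient.
# Assumption: nei is the neighbourhood radius, so K = 2 * nei.
ws_clustering <- function(nei, p) {
  K <- 2 * nei
  3 * (K - 2) / (4 * (K - 1)) * (1 - p)^3
}
ws_clustering(nei = 2, p = 0.25)
#> [1] 0.2109375
# Note this is an asymptotic result; tiny graphs such as the size = 10
# examples below can deviate from it substantially.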
More empirically, if you play around with the parameters in your model, you can see how they affect the clustering coefficient with the transitivity command.
# clustering coefficient in your data:
myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = 0.25)
transitivity(myNetwork)
[1] 0.3870968
# Varying average links/node:
for(i in 1:5) {
  set.seed(1) # use this to get the same results
  myNetwork <- sample_smallworld(dim = 1, size = 10, nei = i, p = 0.25)
  print(transitivity(myNetwork, type = "global"))
}
[1] 0
[1] 0.2380952
[1] 0.6
[1] 0.8723404
[1] 1
# Varying rewiring probability:
for(i in c(0.05, 0.1, 0.2, 0.5, 1)) {
  set.seed(1)
  myNetwork <- sample_smallworld(dim = 1, size = 10, nei = 2, p = i)
  print(transitivity(myNetwork, type = "global"))
}
[1] 0.483871
[1] 0.4615385
[1] 0.328125
[1] 0.3802817
[1] 0.4347826
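Coming back to the original question, a quick-and-dirty way to target a specific clustering coefficient while keeping sample_smallworld is to scan over p for a fixed size and nei and keep the value whose average transitivity is closest to the target. This is only a sketch of the idea (the target has to be reachable for the chosen size and nei), not an exact method:
# Sketch: choose the rewiring probability whose mean global transitivity
# (over a few replicates) is closest to a target clustering coefficient.
target <- 0.4
ps <- seq(0, 1, by = 0.05)
avg_trans <- sapply(ps, function(p) {
  mean(replicate(50, transitivity(
    sample_smallworld(dim = 1, size = 100, nei = 2, p = p),
    type = "global"
  )))
})
best_p <- ps[which.min(abs(avg_trans - target))]
myNetwork <- sample_smallworld(dim = 1, size = 100, nei = 2, p = best_p)
transitivity(myNetwork, type = "global")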
I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob, type = 'l')
sim_data <-
  expand_grid(fish_id = seq_len(n_fish_per_length),
              length = length) %>%
  mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
  mutate(is_carp = rbinom(n = n(), size = 1, prob = prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method = "glm", formula = y ~ x + I(x^2),
              method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
                          data = sim_data,
                          family = bernoulli(link = "logit"),
                          warmup = 2500,
                          iter = 5000,
                          chains = 4,
                          inits = "0",
                          cores = 4,
                          seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to use stratified k-fold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error, likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing or misinterpreting something here? I'm assuming there is a very simple solution that I am overlooking, but I'd appreciate any thoughts.
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits holds, for each fold, the model fitting results and the indices of the observations held out as the test data set (omitted).
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for(i in 1:10){
  print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
          nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
It is then easy to compare these proportions with the proportion of observations where is_carp = 1 in the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
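If you want to enforce the stratification explicitly rather than rely on the default splits, one option (a sketch on my part, not something verified above) is to pass the fold-index vector from loo::kfold_split_stratified() directly to kfold(); as far as I can tell from the brms documentation, the folds argument also accepts a vector with one fold index per observation:
# Sketch: make the stratification explicit by supplying the fold indices
# from kfold_split_stratified() through the `folds` argument.
folds_strat <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
kfold_strat <- kfold(Bayes_Model_Binary, folds = folds_strat,
                     chains = 1, save_fits = TRUE)
kfold_strat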
I would like to understand how the R kknn package calculates weights, distances, and class probabilities for binary classification problems. In the R code below, there are three observations in the training sample and one observation in the holdout sample. The two predictor variables are height and weight. With Euclidean distance, the distances for each observation in the training sample are then:
sqrt((6-8)^2 + (4-5)^2) = 2.24
sqrt((6-3)^2 + (4-7)^2) = 4.24
sqrt((6-7)^2 + (4-3)^2) = 1.41.
With k=3 and with equal weights, I get a probability for the holdout as:
(1/3 * 1) + (1/3 * 0) + (1/3 * 1) = 0.67.
With k=2 and with equal weights, I get a probability for the holdout as:
(1/2 * 1) + (1/2 * 1) = 1.00.
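(Just to double-check, the same numbers fall out of a few lines of base R that mirror the hand calculation:)
# Reproduce the manual Euclidean distances and equal-weight probabilities
d  <- sqrt((6 - c(8, 3, 7))^2 + (4 - c(5, 7, 3))^2)  # 2.24 4.24 1.41
cl <- c(1, 0, 1)                                     # training classes
ord <- order(d)                                      # nearest first
mean(cl[ord][1:3])                                   # k = 3 -> 0.67
mean(cl[ord][1:2])                                   # k = 2 -> 1.00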
I would like to understand how the R kknn package makes these same calculations with the "triangular," "gaussian," and "inverse" weights (and more generally).
library(kknn)
training <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))
holdouts <- data.frame(class = 1, height = 6, weight = 4)
triangular_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "triangular", k = 3)
triangular_kernel[["fitted.values"]]
triangular_kernel[["W"]]
triangular_kernel[["D"]]
gaussian_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "gaussian", k = 3)
gaussian_kernel[["fitted.values"]]
gaussian_kernel[["W"]]
gaussian_kernel[["D"]]
inverse_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "inv", k = 3)
inverse_kernel[["fitted.values"]]
inverse_kernel[["W"]]
inverse_kernel[["D"]]
Typing kknn::kknn (without the parentheses) in the console prints the source code of the kknn function. With it, one can go through the function line by line to see what it does.
Distance
kknn calls compiled C code, dmEuclid. To obtain its source code, we follow this guide, writing the following code in R:
untar(download.packages(pkgs = "kknn", destdir = ".", type = "source")[,2])
and then open the src directory of kknn_1.3.1.tar in your working directory (getwd()) to find and open dm.C with any text editor. Scroll about halfway down to find dmEuclid. To test the exact outputs of dmEuclid, you could install the build tools, open a C++ file in RStudio via the dropdown menu, and run the code with different inputs.
Following the function outputs, in your case the dmtmp$dm results in
3.779645e-01 1.133893e+00 1.000000e+150 3.685210e-156
Per your specification k, the first 3 values are chosen as distance D.
This is manually set to maxdist = 1e-06 by the package author, since the maximum distance in your case is smaller than that.
Weights
The kknn function uses the following section to allocate a weight scheme, per your defined kernel.
W <- D/maxdist
W <- pmin(W, 1 - (1e-06))
W <- pmax(W, 1e-06)
At this point your W values are larger than 1, so the pmin() call coerces W to approximately 1.
if (kernel == "inv"
W <- 1/W
if (kernel == "triangular")
W <- 1 - W
if (kernel == "gaussian") {
alpha = 1/(2 * (k + 1))
qua = abs(qnorm(alpha))
W = W * qua
W = dnorm(W, sd = 1)
}
The explanation for these kernels can be found in the paper linked by gowerc.
W is then converted to a matrix, W <- matrix(W, p, k), with 1 row (p = 1) and 3 columns (k = 3).
Fitted value
In your case p = 1, k = 3, and cl = c(1, 0, 1).
C <- matrix(dmtmp$cl, nrow = p, ncol = k + 1)
C <- C[, 1:k] + 1                          # indices of the k nearest neighbours (shifted from 0-based C indexing)
CL <- matrix(cl[C], nrow = p, ncol = k)    # classes of those neighbours
W <- matrix(W, p, k)                       # their kernel weights
fit <- rowSums(W * CL)/pmax(rowSums(W), 1e-06)   # weighted mean of neighbour classes
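To make that last step concrete, here is a toy illustration (mine, not kknn's code) of the triangular-kernel weighted average, using made-up scaled distances; in the real function D and maxdist come from dmEuclid as described above:
# Toy illustration of the triangular-kernel weighted average (not kknn code).
# Assume W holds the k = 3 scaled, clamped distances D/maxdist and cl the
# classes of the corresponding neighbours.
W  <- c(0.3, 0.5, 0.9)
cl <- c(1, 1, 0)
W_tri <- 1 - W                                   # triangular kernel
sum(W_tri * cl) / max(sum(W_tri), 1e-06)         # weighted mean of classes
#> [1] 0.9230769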
I'd like to know the range of each parameter in the rugarch specification models.
For example, for the error distribution "nig" and the model "apARCH": I'd like to know the range of the parameters "skew" and "shape" related to the "nig" distribution, and of the parameters "gamma" and "delta" for the "apARCH" model.
This is my code example:
library(rugarch)  # needed for ugarchspec() and ugarchpath()
varianceModel = list(model = "apARCH", garchOrder = c(1, 1))
meanModel = list(armaOrder = c(1, 1))
distributionModel = "nig"
fixedPars = list(mu = 0, ar1 = 0.1, ma1 = 0.9, omega = 0.001, alpha1 = 0.1,
                 beta1 = 0.8, gamma1 = 0.01, delta = 2, shape = 1.5, skew = 0.2)
spec <- ugarchspec(variance.model = varianceModel,
                   mean.model = meanModel,
                   distribution.model = distributionModel,
                   fixed.pars = fixedPars)
path.sgarch <- ugarchpath(spec, n.sim = 1000, n.start = 1, m.sim = 20)
Now, for each of these parameters, how can I get the possible range or the "standard" values?
There doesn't seem to be a list of ranges of possible values of such parameters in the documentation of rugarch, while this introduction provides only some partial information.
Those ranges of possible values, however, are (at least should be) standard in the sense that they provide well-defined distributions and stationary models. Hence, you should be able to find all such ranges in some other sources.
However, regarding the distributions, there actually is a hidden source in rugarch that you can use: the source code of the rugarch:::.DistributionBounds function. For instance, it contains
if (distribution == "nig") {
skew = 0.2
skew.LB = -0.99
skew.UB = 0.99
shape = 0.4
shape.LB = 0.01
shape.UB = 25
}
meaning that the lower and upper bounds for skew are -0.99 and 0.99, respectively. To extract those numbers faster, you may use
rugarch:::.DistributionBounds("nig")[c("skew.LB", "skew.UB")]
# $skew.LB
# [1] -0.99
#
# $skew.UB
# [1] 0.99
Regarding the variance models, "simple" ranges such as -1 < gamma < 1 for APARCH are typically not what you want, because they only ensure that the model is well defined, not that it is stationary. For instance, for GARCH(1,1) to be stationary we need alpha + beta < 1; hence, we actually have higher-dimensional constraints rather than just intervals. As I said, you may find those online.
However, ugarchpath also checks those conditions by computing persistence(spec). Now, as you can see in
getMethod("persistence", signature(object = "uGARCHspec", pars = "missing",
distribution = "missing", model = "missing",
submodel="missing"))
there is a different way to compute this persistence for each specification. For instance, for APARCH we look at
rugarch:::.persistaparch1
# function (pars, idx, distribution = "norm")
# {
# alpha = pars[idx["alpha", 1]:idx["alpha", 2]]
# beta = pars[idx["beta", 1]:idx["beta", 2]]
# gamma = pars[idx["gamma", 1]:idx["gamma", 2]]
# delta = pars[idx["delta", 1]:idx["delta", 2]]
# skew = pars[idx["skew", 1]:idx["skew", 2]]
# shape = pars[idx["shape", 1]:idx["shape", 2]]
# ghlambda = pars[idx["ghlambda", 1]:idx["ghlambda", 2]]
# ps = sum(beta) + sum(apply(cbind(gamma, alpha), 1, FUN = function(x) x[2] *
# aparchKappa(x[1], delta, ghlambda, shape, skew, distribution)))
# return(ps)
# }
and the condition is that ps < 1. Notice that
rugarch:::.persistsgarch1
# function (pars, idx, distribution = "norm")
# {
# ps = sum(pars[idx["alpha", 1]:idx["alpha", 2]]) + sum(pars[idx["beta",
# 1]:idx["beta", 2]])
# return(ps)
# }
gives exactly alpha + beta in the case of GARCH(1,1), and ugarchpath then checks the aforementioned stationarity condition. Hence, the most straightforward thing you can do is to check whether persistence(spec) < 1 before simulating. For instance, in your example,
persistence(spec)
# [1] 0.8997927
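If you want to automate that check, a minimal guard before simulating (using only the objects already defined above) could look like this:
# Only simulate when the fixed parameters give a stationary specification
if (persistence(spec) < 1) {
  path.sgarch <- ugarchpath(spec, n.sim = 1000, n.start = 1, m.sim = 20)
} else {
  stop("Non-stationary specification: persistence(spec) >= 1")
}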
I want to generate three correlated outcomes for 20 studies. Each study has 3 groups (control, treat1, and treat2). For the control group, my generating values are: mean=0, sd=1; for both treatment groups, my generating values are: mean=0.40, sd=1. Two things that I want to accomplish (which I’m having trouble doing):
1) Condition 1: I want to generate correlated outcomes so that there are different correlations between each of the pairs of outcomes. The correlation should be sampled from the vector of correlations, rho=c(0.6, 0.7, 0.8); and
2) Condition 2: I want to generate correlated outcomes so that a subset of the studies (half) will be sampled from a vector of correlations, rho1=c(0.6, 0.7, 0.8), and the other subset (the remaining half) will be sampled from a vector of correlations, rho2=c(0.3, 0.4, 0.5).
I’m using the “mvtnorm” package to generate the outcomes for each of the groups. Here’s my code (please pardon my very basic knowledge of simulation and R):
library(mvtnorm)
set.seed(0307)
mean_c = c(0, 0, 0)
mean_t1 = c(0.4, 0.4, 0.4)
mean_t2 = c(0.4, 0.4, 0.4)
k <- 20 # no. of studies
n <- 50 # sample size
rho <- # the value is sampled from a vector of correlations
for (i in 1:k) {
  Yc  <- rmvnorm(n = n, mean = mean_c,  sigma = rho)
  Yt1 <- rmvnorm(n = n, mean = mean_t1, sigma = rho)
  Yt2 <- rmvnorm(n = n, mean = mean_t2, sigma = rho)
}
I appreciate any inputs from our programming experts here. Thanks!
I am not sure I have fully understood your question, but in case it helps, here is an example using the rmvnorm function with your "data". I modified some numbers to make all the dependencies clear.
library(mvtnorm)
set.seed(1234)
k = 10000
means = c(0, 0.4, 0.4)
sigmas = c(2, 1, 1)
rhoXY = 0.6
rhoXZ = 0.7
rhoYZ = 0.8
varMatrix <- matrix(c(
  sigmas[1]*sigmas[1],       rhoXY*sigmas[1]*sigmas[2], rhoXZ*sigmas[1]*sigmas[3],
  rhoXY*sigmas[1]*sigmas[2], sigmas[2]*sigmas[2],       rhoYZ*sigmas[2]*sigmas[3],
  rhoXZ*sigmas[1]*sigmas[3], rhoYZ*sigmas[2]*sigmas[3], sigmas[3]*sigmas[3]
),
ncol = 3, byrow = TRUE)
# Generate data
Yc <- rmvnorm(n = k,
              mean = means,
              sigma = varMatrix, method = "chol")
# Check data satisfies what it should
colMeans(Yc)
var(Yc)
cor(Yc[,1], Yc[,2])
cor(Yc[,1], Yc[,3])
cor(Yc[,2], Yc[,3])
Check output
> colMeans(Yc)
[1] 0.007118385 0.406214538 0.401605464
> var(Yc)
[,1] [,2] [,3]
[1,] 4.024896 1.2026685 1.4204561
[2,] 1.202668 0.9998153 0.8046641
[3,] 1.420456 0.8046641 1.0052659
> cor(Yc[,1], Yc[,2])
[1] 0.599527
> cor(Yc[,1], Yc[,3])
[1] 0.7061712
> cor(Yc[,2], Yc[,3])
[1] 0.802628
Thanks for the email, it was nice to be asked! I don't totally understand the rmvnorm function (or your request!), but it looks like Roc answered your question. Nevertheless, it is simple to run the simulation 20 times, using different rho values in the two halves. My code is perhaps not the most elegant; it might be possible to generate all of this data with a single call to rmvnorm rather than 20 as in my code, but this seems to work just fine. You can access the results for your 20 studies as I have done below with the square brackets.
library(mvtnorm)
set.seed(1234)
k = 10000
means = c(0, 0.4, 0.4)
sigmas = c(1, 1, 1)
rho.type1 <- c(0.3, 0.4, 0.5)
rho.type2 <- c(0.6, 0.7, 0.8)
study.number <- 20
Yc <- matrix(0, ncol = 3, nrow = k * study.number)
for(i in 1:study.number)
{
  # First half of the studies (1-10) uses rho.type1, second half rho.type2
  rho <- if (i <= study.number / 2) rho.type1 else rho.type2
  varMatrix <- matrix(c(
    sigmas[1]*sigmas[1],        rho[1]*sigmas[1]*sigmas[2], rho[2]*sigmas[1]*sigmas[3],
    rho[1]*sigmas[1]*sigmas[2], sigmas[2]*sigmas[2],        rho[3]*sigmas[2]*sigmas[3],
    rho[2]*sigmas[1]*sigmas[3], rho[3]*sigmas[2]*sigmas[3], sigmas[3]*sigmas[3]
  ),
  ncol = 3, byrow = TRUE)
  # Generate data for study i and store it in the corresponding rows of Yc
  Yc[(1 + (i-1)*k):(i*k), ] <- rmvnorm(n = k,
                                       mean = means,
                                       sigma = varMatrix, method = "chol")
}
Yc <- data.frame(Yc, study = rep(1:20, each = k))
# Check output
cor(Yc[Yc$study==1, 1], Yc[Yc$study==1, 2]) # To check the first study
for(i in 1:20) print(cor(Yc[Yc$study==i,1], Yc[Yc$study==i,2])) # To check the lot
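For Condition 1 specifically (different correlations between each pair of outcomes, drawn from rho = c(0.6, 0.7, 0.8)), one possible reading is to shuffle those three values over the XY, XZ and YZ pairs for each study. A sketch of that idea, assuming unit variances:
# Sketch for Condition 1: assign the correlations 0.6, 0.7, 0.8 at random
# to the three outcome pairs (any permutation of these values gives a
# positive-definite correlation matrix, so rmvnorm will accept it).
rho_pool <- c(0.6, 0.7, 0.8)
r <- sample(rho_pool)                        # r[1] = XY, r[2] = XZ, r[3] = YZ
corMatrix <- matrix(c(1,    r[1], r[2],
                      r[1], 1,    r[3],
                      r[2], r[3], 1),
                    ncol = 3, byrow = TRUE)
Y <- rmvnorm(n = 50, mean = c(0, 0, 0), sigma = corMatrix)
round(cor(Y), 2)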
I've been reading about a few methods to fit a circle to data (like this). I would like to see how the methods work on real data and thought of using R for this. I tried searching rseek for packages that can help with this but came up with nothing useful.
So, are there packages that help to easily compute the best fit circle for a given data set (similar to how lm() will fit a linear model to a data set)? Otherwise, how might one perform such a task in R?
Here's a fairly naive implementation of a function that minimises SS(a,b,r) from that paper:
fitSS <- function(xy,
                  a0 = mean(xy[,1]),
                  b0 = mean(xy[,2]),
                  r0 = mean(sqrt((xy[,1]-a0)^2 + (xy[,2]-b0)^2)),
                  ...){
  SS <- function(abr){
    sum((abr[3] - sqrt((xy[,1]-abr[1])^2 + (xy[,2]-abr[2])^2))^2)
  }
  optim(c(a0, b0, r0), SS, ...)
}
I've written a couple of supporting functions to generate random data on circles and to plot circles. Hence:
> xy = sim_circles(10)
> f = fitSS(xy)
The f$par value is a vector of xcenter, ycenter, radius.
> plot(xy,asp=1,xlim=c(-2,2),ylim=c(-2,2))
> lines(circlexy(f$par))
Note it doesn't use the gradients, nor does it check the convergence code. You can supply initial values, or it will make a guess.
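If you did want to use gradients, here is a sketch (my addition, not part of the original function) of the analytic gradient of SS(a, b, r), passed through fitSS()'s ... straight to optim() together with a gradient-based method:
# Analytic gradient of SS(a, b, r) = sum((r - d_i)^2), where d_i is the
# distance of point i to the centre (a, b); passed to optim() via fitSS()'s `...`.
SSgrad <- function(abr, xy) {
  dx  <- xy[, 1] - abr[1]
  dy  <- xy[, 2] - abr[2]
  d   <- sqrt(dx^2 + dy^2)
  res <- abr[3] - d                 # r - d_i
  c(sum(2 * res * dx / d),          # dSS/da
    sum(2 * res * dy / d),          # dSS/db
    sum(2 * res))                   # dSS/dr
}
f_bfgs <- fitSS(xy, gr = function(abr) SSgrad(abr, xy), method = "BFGS")
f_bfgs$par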
Code for plotting and generating circles follows:
circlexy <- function(xyr, n = 180){
  theta = seq(0, 2*pi, len = n)
  cbind(xyr[1] + xyr[3]*cos(theta),
        xyr[2] + xyr[3]*sin(theta)
  )
}
sim_circles <- function(n, x = 0, y = 0, r = 1, sd = 0.05){
  theta = runif(n, 0, 2*pi)
  r = r + rnorm(n, mean = 0, sd = sd)
  cbind(x + r*cos(theta),
        y + r*sin(theta)
  )
}
Well, looky here: an R-bloggers column has some code to fit ellipses and circles. His code, which I won't repost here, is based on previous work done by Radim Halíř and Jan Flusser in Matlab. His code includes (commented out) the original Matlab lines for comparison.
I've peeked at a number of papers on this topic, and can only say that I'm not qualified to determine which algorithms are the most robust. For those interested, take a look at these papers:
http://www.emis.de/journals/BBMS/Bulletin/sup962/gander.pdf
http://ralph.cs.cf.ac.uk/papers/Geometry/fit.pdf
http://autotrace.sourceforge.net/WSCG98.pdf
Followup edit: I ran Spacedman's code against the linked R-code for fitting ellipses, using the same "noisy" set of 1e5 points on a circle as input. The results are:
library(microbenchmark)
# create.test.ellipse() and fit.ellipse() come from the linked R-bloggers code
testcircle <- create.test.ellipse(Rx = 200, Ry = 200, Rot = .56, Noise = 5.5, leng = 100000)
dim(testcircle)
[1] 100000 2
microbenchmark(fitSS(testcircle), fit.ellipse(testcircle))
Unit: milliseconds
                    expr       min        lq    median        uq       max neval
       fitSS(testcircle) 649.98245 704.05751 731.61282 787.84212 2053.7096   100
 fit.ellipse(testcircle)  25.74518  33.87718  38.87143  95.23499  256.2475   100
For reference, the output of the two fitting functions were:
From fitSS, the list
ssfit
$par
[1] 249.9530 149.9927 200.0512
$value
[1] 185.8195
$counts
function gradient
134 NA
$convergence
[1] 0
$message
NULL
From fit.ellipse, we get
ellfit
$coef
a b c d e
-7.121109e-01 -1.095501e-02 -7.019815e-01 3.563866e+02 2.136497e+02
f
-3.195427e+04
$center
x y
249.0769 150.2326
$major
[1] 201.7601
$minor
[1] 199.6424
$angle
[1] 0.412268
You can see that the elliptic equation's coefficients are near-zero for terms which "deviate" from a circle; plotting the two results yields almost indistinguishable curves.
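For what it's worth, that comparison plot can be sketched as follows, assuming circlexy() from the other answer, that $major/$minor in the fit.ellipse output are semi-axes, and that $angle is in radians (my reading of the output above, not something verified against the blog code):
# Overlay the fitted ellipse (red) and fitted circle (blue); with this
# near-circular data the two curves are almost indistinguishable.
theta <- seq(0, 2*pi, length.out = 360)
ellxy <- cbind(
  ellfit$center["x"] + ellfit$major*cos(theta)*cos(ellfit$angle) -
    ellfit$minor*sin(theta)*sin(ellfit$angle),
  ellfit$center["y"] + ellfit$major*cos(theta)*sin(ellfit$angle) +
    ellfit$minor*sin(theta)*cos(ellfit$angle)
)
plot(ellxy, type = "l", col = "red", asp = 1)
lines(circlexy(ssfit$par), col = "blue")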
To fit an ellipse, there is the fitEllipse function in the PlaneGeometry package. It uses the fitConic package.
library(PlaneGeometry)
# the "true" ellipse:
ell <- Ellipse$new(center = c(1, 1), rmajor = 3, rminor = 2, alpha = 25)
# We add some noise to 30 points on this ellipse:
set.seed(666L)
points <- ell$randomPoints(30, "on") + matrix(rnorm(30*2, sd = 0.2), ncol = 2)
# Now we fit an ellipse to these points:
ellFitted <- fitEllipse(points)
# let's draw all this stuff, true ellipse in blue, fitted ellipse in green:
box <- ell$boundingbox()
plot(NULL, asp = 1, xlim = box$x, ylim = box$y, xlab = NA, ylab = NA)
draw(ell, border = "blue", lwd = 2)
points(points, pch = 19)
draw(ellFitted, border = "green", lwd = 2)