Setting layers for a Dynamic Bayesian Network with bnstruct in R

I am currently creating a DBN using the bnstruct package in R. I have 9 variables in each of 6 time steps, some biotic and some abiotic, and I want to prevent the biotic variables from being parents of the abiotic variables. For a Bayesian network, this is pretty easy to implement, for instance with layering = c(1,1,2,2,2) in learn.dynamic.network().
But a problem arises for the dynamic part: I would like to keep preventing biotic variables from being parents of abiotic ones in every time step, while also preventing edges from appearing between any variables going from t+1 back to t.
If I use the following in layering:
1 for abiotic variables at t1
2 for biotic variables at t1
3 for abiotic variables at t2
4 for biotic variables at t2...
then I allow biotic variables at t-1 to explain the abiotic variables at t (and I don't want that).
So I tried:
## 9 variables for 6 time steps
test1 <- BNDataset(data = timedData,
                   discreteness = rep('d', 54),
                   variables = colnames(timedData),
                   node.sizes = rep(c(3,3,3,2,2,3,3,3,3), 6)
                   # num.time.steps = 6
                   )
## the first 5 variables are abiotic, the last 4 are biotic
dbn <- learn.dynamic.network(test1,
                             num.time.steps = 6,
                             layering = rep(c(1,1,1,1,1,2,2,2,2), 6))
So now, I don't have any edges from biotic to abiotic (that's nice), but I have edges from variable_t(n+1) to variable_t(n).
I know that in bnlearn you can create a "blacklist" of edges that you don't want to see, but I don't see an equivalent argument in bnstruct. Any ideas?

With the mmhc algorithm that is used by default, you can use the layer.struct parameter to specify which pairs of layers are allowed to have edges between them. layer.struct takes a binary matrix where cell (i, j) is 1 if edges are allowed to go from variables in layer i to variables in layer j, and 0 otherwise.
The best way to use this is to combine it with the manually-specified layering of your first solution.
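For illustration only, here is a rough sketch (not from the original answer) of how a layer.struct matrix of this kind could be built programmatically instead of typed by hand, assuming the layering used below, where odd layers hold the abiotic variables and even layers the biotic variables of each time step:
## Sketch: layer.struct[i, j] == 1 allows edges from variables in layer i to variables in layer j
k <- 6                              # number of time steps
n.layers <- 2 * k                   # odd layers = abiotic, even layers = biotic
layer.struct <- matrix(0L, n.layers, n.layers)
for (i in seq_len(n.layers)) {
  for (j in i:n.layers) {           # only j >= i: no edges backwards in time
    if (i %% 2 == 1 || j %% 2 == 0) # abiotic source, or biotic target
      layer.struct[i, j] <- 1L
  }
}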

Perfect, the combination of both arguments layering = and layer.struct = does what I wanted.
I post what I used here just to provide an example:
## DBN study
dbn <- learn.dynamic.network(test1,
                             num.time.steps = 6,
                             layering = c(1,1,1,1,1,2,2,2,2,    # two layers per time step:
                                          3,3,3,3,3,4,4,4,4,    # odd = abiotic, even = biotic
                                          5,5,5,5,5,6,6,6,6,
                                          7,7,7,7,7,8,8,8,8,
                                          9,9,9,9,9,10,10,10,10,
                                          11,11,11,11,11,12,12,12,12),
                             ## allowed layer-to-layer edges, specified by hand;
                             ## note: matrix() fills column-wise by default, so each line typed below becomes one column of layer.struct
                             layer.struct = matrix(c(1,0,0,0,0,0,0,0,0,0,0,0,
                                                     1,1,0,0,0,0,0,0,0,0,0,0,
                                                     1,0,1,0,0,0,0,0,0,0,0,0,
                                                     1,1,1,1,0,0,0,0,0,0,0,0,
                                                     1,0,1,0,1,0,0,0,0,0,0,0,
                                                     1,1,1,1,1,1,0,0,0,0,0,0,
                                                     1,0,1,0,1,0,1,0,0,0,0,0,
                                                     1,1,1,1,1,1,1,1,0,0,0,0,
                                                     1,0,1,0,1,0,1,0,1,0,0,0,
                                                     1,1,1,1,1,1,1,1,1,1,0,0,
                                                     1,0,1,0,1,0,1,0,1,0,1,0,
                                                     1,1,1,1,1,1,1,1,1,1,1,1),
                                                   nrow = 12, ncol = 12))
Thanks for the quick answer and the package btw

Related

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (the id_col), you simply can't ask for the folds to be balanced with respect to the outcome. (I don't fully understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
A potentially greater problem is that you appear to have crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds, and some birds were detected (or missed) by multiple observers, then it may be very difficult, or even impossible, to define folds that are actually independent.
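As a quick, hedged check (not part of the original answer; it assumes the full detect data contains the bird and obsname columns used in the model), you could cross-tabulate the two grouping factors to see how strongly they are crossed:
xtab <- with(detect, table(bird, obsname))
sum(rowSums(xtab > 0) > 1)  # birds recorded by more than one observer
sum(colSums(xtab > 0) > 1)  # observers who recorded more than one bird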
You may be able to use the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them into fewer groups (e.g. folds) while attempting to balance multiple categorical columns, numeric columns, and factor columns (by number of unique levels, though the same levels might appear in multiple groups).
It does not have the constraints that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context: e.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds, though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)

# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in the others (normally the count is only of levels unique within each group, but here we know each bird appears in only one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is reasonably well balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances(), as sketched below.
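A hedged sketch of that idea, reusing the argument names from the call above (num_new_group_cols and auto_tune are taken from the description here, so double-check the documentation linked below for the exact interface):
# Create 10 candidate groupings and compare their balance
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,  # assumed to create columns named like .folds_1 ... .folds_10; check names(data)
  auto_tune = TRUE
)
groupdata2::summarize_balances(
  data = data,
  group_cols = grep("^\\.folds", names(data), value = TRUE),
  cat_cols = "outcome"
)$Groups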
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups() hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have two species, A and B, and want to find the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer, which finds the best value with the dose.p function from the MASS package. I have several similar sets of values and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
D$Species_b<-ifelse(D$Species=="A",0,1)
my.glm<-glm(Species_b~Value, data = D, family = binomial)
dose.p(my.glm,p=0.5)
gives me 3.633957 as threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is obviously better.
Why does the function work for other values, but not for this one? Am I missing an obvious mistake? Or is there maybe a different/better approach to my problem? I have several values I need to do this for, so I really do not want to just test values until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your classification change as you adjust the threshold, and lets you select the threshold at which overall accuracy is best. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
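If overall accuracy is what you ultimately care about, a simple alternative is to scan the observed values as candidate thresholds and keep the one with the most correct assignments (a rough sketch, assuming, as in the table above, that species B tends to have the lower values):
cand <- sort(unique(D$Value))
acc  <- sapply(cand, function(th) mean((D$Value < th) == (D$Species == "B")))
cand[which.max(acc)]  # threshold with the highest overall accuracy
max(acc) * nrow(D)    # number of correct assignments at that threshold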

Global multi-optimization function specification in R

I would like to use nsga2 from the mco package to solve an optimization problem with 3 objectives. In short, I am looking for optimal land uses to solve an environmental problem.
Here is my experiment:
- 100 land uses are possible in total (all.options in the code below), each land use being characterized by three performance values (main.goal1, main.goal2 and main.goal3).
- I have 50 fields, whose characteristics (soil in fields.Kq) restrict which of the 100 land uses are possible (i.e., not every land use is possible for every field) => options.soil1 and options.soil2
My objective is to assign a land use to each of my 50 fields so as to jointly minimize main.goal1, main.goal2 and main.goal3. From what I have read, genetic algorithms are very powerful for this type of problem.
So here are my virtual data.
set.seed(0)
all.options<-data.frame(num.option=1:100,main.goal1 = abs(rnorm(100)),
main.goal2 = abs(rnorm(100)),
main.goal3 = abs(rnorm(100))) # all possible combinations of the 3 goals
options.soil1<-subset(all.options, main.goal1>0.5) # possible combinations for soil1
options.soil2<-subset(all.options, main.goal3<0.5) # possible combinations for soil2
fields.Kq<-data.frame(num.field=1:50,soil=round(runif(50,0,1),0))
I guess that my objective function should look like this:
my.function <- function(x) {
  x[1] <- sum(A[, 1]) # main.goal1 for the selected options for each field in fields.Kq
  x[2] <- sum(A[, 2]) # main.goal2 for the selected options for each field in fields.Kq
  x[3] <- sum(A[, 3]) # main.goal3 for the selected options for each field in fields.Kq
} # where A should be a matrix with 50 rows, one per field, holding the "chosen" land use option
nsga2(my.function)
Unfortunately I could not go further, as I am new to optimization in R. How do I build the matrix A, with the chosen land use for each field?
And using nsga2, how do I recover these land uses, together with the optimized (minimized) values of main.goal1, main.goal2 and main.goal3?
Thanks in advance for all the help you can provide; I am really looking forward to advice/links/books... that will help me advance on my optimization problem.
Best regards,
LH
Here is how I solved the problem:
library("mco")
set.seed(0)
all.options <- data.frame(num.option = 1:100,
                          main.goal1 = abs(rnorm(100)),
                          main.goal2 = abs(rnorm(100)),
                          main.goal3 = abs(rnorm(100)),
                          soil = c(rep("soilType1", 50), rep("soilType2", 50))) # all possible combinations of the 3 goals
fields.Kq <- data.frame(num.field = 1:50, soil = rep(c("soilType1", "soilType2"), 25))

main.goal1 = function(x) {       # x - a vector of selected option numbers
  return(sum(all.options[x, 1])) # compute main.goal1
}
main.goal2 = function(x) {
  return(sum(all.options[x, 2])) # compute main.goal2
}
main.goal3 = function(x) {
  return(sum(all.options[x, 3])) # compute main.goal3
}
eval = function(x) c(main.goal1(x), main.goal2(x), main.goal3(x)) # objective function

D  <- length(fields.Kq[, 1])     # number of fields
D2 <- length(fields.Kq[, 1]) / 2 # number of fields per soil type (simplified)
D.soil1 <- max(which(all.options$soil == "soilType1")) # upper bound of the soil1 options
D.soil2 <- min(which(all.options$soil == "soilType2")) # lower bound of the soil2 options

G = nsga2(fn = eval, idim = D, odim = 3,
          lower.bounds = c(rep(1, D2), rep(D.soil2, D2)),   # min option number per field
          upper.bounds = c(rep(D.soil1, D2), rep(100, D2)), # max option number per field
          popsize = 20, generations = 1:1000, cprob = 0.7, cdist = 5,
          mprob = 0.2, mdist = 10)
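To address the second part of the question (recovering the chosen land uses and the minimized objective values), here is a hedged sketch of how the result might be inspected. If I read the mco documentation correctly, passing a vector to generations makes nsga2() return one population per recorded generation, so we take the last one:
G.final <- G[[length(G)]]                             # population after the last generation
keep <- G.final$pareto.optimal                        # non-dominated solutions
land.use <- round(G.final$par[keep, , drop = FALSE])  # one column per decision variable: chosen option number
objectives <- G.final$value[keep, , drop = FALSE]     # main.goal1..3 for each kept solution
head(land.use, 3)
head(objectives, 3)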
I defined it thanks to examples found in the very helpful and informative book "Modern Optimization with R" by Paulo Cortez.
LH

Consistent K-Means Clustering Results

I have read k-means: Same clusters for every execution.
But it doesn't solve the problem I am having. I am sampling data that varies in size (it keeps increasing). I need to cluster the data using k-means, but the problem is that the clusters differ from one sample to the next. The important thing to note is that my t+1 sample always incorporates all of the components of the t-th sample, so it slowly gets bigger and bigger. What I need is a way to keep the clusters the same. Is there a way around this other than using set.seed()? I am open to any solution.
The best way I can think of to accomplish this would be to cluster the data with k-means initially, and then simply assign all additional data to the closest cluster (setting the random seed will not help you get the new clusters to nest within the original ones). As detailed in the answer to this question, the flexclust package makes this pretty easy:
# Split into "init" (used for initial clustering) and "later" (assigned later)
set.seed(100)
spl <- sample(nrow(iris), 0.5*nrow(iris))
init <- iris[spl,-5]
later <- iris[-spl,-5]
# Build the initial k-means clusters with "init"
library(flexclust)
(km <- kcca(init, k=3, kccaFamily("kmeans")))
# kcca object of family ‘kmeans’
#
# call:
# kcca(x = init, k = 3, family = kccaFamily("kmeans"))
#
# cluster sizes:
#
# 1 2 3
# 31 25 19
# Assign each element of "later" to the closest cluster
head(predict(km, newdata=later))
# 2 5 7 9 14 18
# 2 2 2 2 2 2
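Applied to the growing-sample setting of the question, the workflow could look something like this (a sketch only, using iris as a stand-in; clusters() returns the assignments of the training data and predict() handles the newly added rows):
# Cluster the t-th sample once, then only assign the rows added at t+1
sample_t  <- iris[1:75, -5]    # stand-in for the sample at time t
sample_t1 <- iris[1:100, -5]   # the t+1 sample contains all rows of sample_t
km <- kcca(sample_t, k = 3, kccaFamily("kmeans"))
added <- sample_t1[-(1:75), ]  # rows that are new at t+1
assignments_t1 <- c(clusters(km), predict(km, newdata = added))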

D-efficient balanced design with R

I was wondering if there is any package for generating a D-efficient balanced design with R.
I tried the AlgDesign package, but I did not manage to get a balanced design.
I am not sure if this is because the full factorial of the attributes and levels I am considering is small (32 combinations), or because balanced designs are outside the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and one has two levels:
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4,2, 4)
n <- 6 #number of choice sets
desOpt <- 4 #num option per choice set
set.seed(123456)
#generate full factorial
dat<-gen.factorial(desLevels,length(desLevels),varNames=desVarNames, center=TRUE)
destT <- optFederov(~., dat, nTrials = (n*(desOpt)), criterion="D")
destT
First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable "dat" above is a balanced design too, since each possible combination appears exactly once.)
The way optFederov works is by randomly selecting and replacing trials using Federov's exchange algorithm. As such, every time a trial is exchanged for another candidate trial, an initially balanced design becomes unbalanced: if a trial "balances" a design, replacing it with any other trial will unbalance the design. If the algorithm were forced to keep the design balanced at each step, no replacement would be possible and it would get stuck.
Not only is a balanced design incompatible with the way Federov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign suggests only 19:
data = gen.factorial(c(2,3,5,7), factors = "all")
trials = optFederov(data = data, center = FALSE, criterion = "D")
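A quick, hedged way to see this imbalance in practice is to tabulate the factor levels of the suggested design (trials$design holds the selected rows):
lapply(trials$design, table)  # level counts per factor in the selected trials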
In short, the notions of a "balanced" design and a "D-efficient" design are largely antagonistic.
