D-efficient balanced design with R

I was wondering if there is any package for generating a D-efficient balanced design with R.
I tried the AlgDesign package, but I did not manage to get a balanced design.
I am not sure if this is because the full factorial of the attributes and levels I am considering is small (32 combinations) or because balanced designs are out of the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and one has two levels:
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4,2, 4)
n <- 6 #number of choice sets
desOpt <- 4 #num option per choice set
set.seed(123456)
#generate full factorial
dat <- gen.factorial(desLevels, length(desLevels), varNames = desVarNames, center = TRUE)
destT <- optFederov(~., dat, nTrials = n * desOpt, criterion = "D")
destT

First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable "dat" above is a balanced design, too, since each possible combination appears only once).
The way optFederov works is by randomly selecting and replacing trials using Federov's exchange algorithm. As such, every time a trial is exchanged for another candidate trial, an initially balanced design will become unbalanced: if a trial "balances" a design, replacing it with any other trial will unbalance it. If the algorithm were forced to keep the design balanced at each step, no replacement would be possible and it would get stuck.
Not only is a balanced design incompatible with the way Federov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign only suggests 19.
data = gen.factorial(c(2,3,5,7), factors = "all")
trials = optFederov(data = data, center = FALSE, criterion = "D")
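A quick way to see this in practice (a hedged sketch, not part of the original answer): tabulate how often each factor level appears in the trials that optFederov() selects.
# optFederov() returns a list; its $design component holds the selected trials
lapply(trials$design, table)   # the level counts are generally unequal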
In short, the notion of a "balanced D-efficient" design is largely a contradiction in terms.

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
library(dplyr)  # needed for %>% and arrange()

data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
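If you go that route, cvms lets you request specific metrics via its metrics argument. A rough sketch (the binomial_metrics() argument names here are assumptions, so check ?cvms::binomial_metrics):
# Hedged sketch: explicitly request imbalance-robust metrics
cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  metrics = cvms::binomial_metrics(balanced_accuracy = TRUE, auc = TRUE)
)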
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
You may be able to use the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them into fewer groups (e.g. folds), with attempted balancing of multiple categorical columns, numeric columns, and the number of unique levels in factor columns (though the same levels might appear in multiple groups).
It does not have the constraint that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
#> # A tibble: 3 × 6
#>   .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
#>   <fct>      <fct>     <int>    <int>      <dbl>      <dbl>
#> 1 .folds     1            14        3          7          7
#> 2 .folds     2            10        2          6          4
#> 3 .folds     3            10        2          4          6
summarize_balances() shows us that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in the others (the count is normally of unique levels within each group, but here we know each bird appears in only one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (counted here as # outc_0 and # outc_1) is reasonably well balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
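A rough sketch of that workflow (not from the original answer; it assumes the argument names used in the call above and that num_new_group_cols > 1 appends "_1", "_2", ... to col_name):
# Create 10 candidate fold columns in one go
data_multi <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10
)
# Summarize the balance of all candidate fold columns and pick the best one
groupdata2::summarize_balances(
  data = data_multi,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)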
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups() hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

Binary outcome, different trial #s across low N?

I have a sample of 4 individuals, all of whom have a varying number of trials (I work with a special population, so what I get is what I get!).
The outcome is a binary yes/no
I want to know:
did the total sample select yes more often than chance?
did each individual select yes more often than chance?
Here is dummy data in R.
SbjEL <- data.frame(Sbj = c('EL'),
                    TrialNum = c(1:12),
                    Choice = c(0,0,1,1,1,1,1,1,1,1,1, NA))
SbjKZ <- data.frame(Sbj = c('KZ'),
                    TrialNum = c(1:12),
                    Choice = c(0,1,1,1,1,1,1,1,1,1,1, 1))
SbjMA <- data.frame(Sbj = c('MA'),
                    TrialNum = c(1:12),
                    Choice = c(0,0,1,1,1,1,1,1,1,1,1, 1))
SbjTC <- data.frame(Sbj = c('TC'),  # 'TC', not a second 'EL'
                    TrialNum = c(1:12),
                    Choice = c(1,1,1,1,1,1,1,1, NA,NA,NA, NA))
For a different experiment with the same sample, I had more trials, so I did a one-sample t-test for the group and used the binomial distribution to see how many Yes trials would be more than expected by chance.
# Did group select YES more than chance? --> 43 yes/48
Response_v <- c(21,22)
t.test(Response_v, mu = 12, alternative = "two.sided")
# How many YES selections would be more often than chance?
# 24 trials were completed --> 21 yes / 24
binom.test(21, 24, 1/2)
My issue is that this approach starts to fall apart when I get down to 8-12 trials.
Any ideas? I am lost
A t-test is not appropriate here for either Q1 or Q2. With large samples you can use some approximations, but your counts are very small. So, you’re on the right track with the binomial test, but not the t-test.
For your Q1: you first ought to decide how the subjects are assumed to relate to each other. Are you pretty confident that each is providing an estimate of the same Bernoulli probability, p? Or instead, a priori, do you want to allow the possibility that subjects have different p's? There are further questions to answer, overlapping with those you need to consider for Q2.
For your Q2: The exact method of choice depends on a number of things: For example, do you want to incorporate prior information (e.g. using historical data as a reference)? If not, there are purely frequentist methods to use off the shelf. Next, do you expect the yes/no’s to be independent, or are they more like a ‘signal’ in which the order matters? Third, is it possible that there is a mixture of Bernoullis for any of the subjects? And so on. These questions can be considered through software such as that found at www.datatrie.com/advisor
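To make the binomial-test suggestion concrete, here is a minimal sketch (not from the original answer) that assumes independent trials and a 50% chance level, using the dummy data above:
# Stack the subjects and drop incomplete (NA) trials
all_trials <- rbind(SbjEL, SbjKZ, SbjMA, SbjTC)
all_trials <- all_trials[!is.na(all_trials$Choice), ]

# Q2: one exact binomial test per individual
lapply(split(all_trials, all_trials$Sbj), function(d) {
  binom.test(sum(d$Choice), nrow(d), p = 0.5)
})

# Q1: pooled test across all completed trials (only sensible if every
# subject is assumed to share the same underlying probability of "yes")
binom.test(sum(all_trials$Choice), nrow(all_trials), p = 0.5)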

Setting layers for a Dynamic Bayesian Network with bnstruct in R

I am currently creating a DBN using the bnstruct package in R. I have 9 variables at each of 6 time steps, some biotic and some abiotic, and I want to prevent the biotic variables from being parents of the abiotic variables. For a Bayesian network this is pretty easy to implement using, for instance, layering = c(1,1,2,2,2) in learn.dynamic.network().
But a problem arises for the dynamic part: I would like to keep biotic variables from being parents of abiotic ones at every time step, while also preventing edges from appearing from any variable at t+1 to any variable at t.
If I use, in layering =:
1 for abiotic variables at t1
2 for biotic variables at t1
3 for abiotic variables at t2
4 for biotic variables at t2...
I allow biotic variables from t-1 to explain the abiotic variables at t (and I don't want that).
So I tried:
library(bnstruct)

## 9 variables for 6 time steps
test1 <- BNDataset(data = timedData,
                   discreteness = rep('d', 54),
                   variables = colnames(timedData),
                   node.sizes = rep(c(3,3,3,2,2,3,3,3,3), 6)
                   # num.time.steps = 6
)

## the first 5 variables are abiotic, the last 4 are biotic
dbn <- learn.dynamic.network(test1,
                             num.time.steps = 6,
                             layering = rep(c(1,1,1,1,1,2,2,2,2), 6))
So now, I don't have any edges from biotic to abiotic (that's nice), but I have edges from variable_t(n+1) to variable_t(n).
I know that in bnlearn you can create a "blacklist" of edges that you don't want to see but I don't see any equivalent arguments in bnstruct. Any idea?
With the mmhc algorithm that is used by default, you can use the layer.struct parameter to specify which pairs of layers are allowed to have edges between them. layer.struct takes a binary matrix where cell i,j is 1 if there can be edges going from variables in layer i to variables in layer j, and 0 otherwise.
The best way to use this is to combine it with the manually-specified layering of your first solution.
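To illustrate (a minimal assumed example, not from the original answer): with just two layers, 1 = abiotic and 2 = biotic, a layer.struct matrix that forbids only biotic-to-abiotic edges would be:
# byrow = TRUE so that each written row is the "from" layer
layer_struct_small <- matrix(c(1, 1,   # from layer 1 (abiotic): to abiotic, to biotic
                               0, 1),  # from layer 2 (biotic): not to abiotic, to biotic
                             nrow = 2, byrow = TRUE)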
Perfect, the combination of both arguments layering = and layer.struct = does what I wanted.
I post what I used here just to provide an example:
## DBN study
dbn <- learn.dynamic.network(test1,
                             num.time.steps = 6,
                             # two layers per time step: abiotic, then biotic
                             layering = rep(c(1,1,1,1,1,2,2,2,2,
                                              3,3,3,3,3,4,4,4,4,
                                              5,5,5,5,5,6,6,6,6,
                                              7,7,7,7,7,8,8,8,8,
                                              9,9,9,9,9,10,10,10,10,
                                              11,11,11,11,11,12,12,12,12)),
                             ## allow certain layers to connect to others by hand
                             layer.struct = matrix(c(1,0,0,0,0,0,0,0,0,0,0,0,
                                                     1,1,0,0,0,0,0,0,0,0,0,0,
                                                     1,0,1,0,0,0,0,0,0,0,0,0,
                                                     1,1,1,1,0,0,0,0,0,0,0,0,
                                                     1,0,1,0,1,0,0,0,0,0,0,0,
                                                     1,1,1,1,1,1,0,0,0,0,0,0,
                                                     1,0,1,0,1,0,1,0,0,0,0,0,
                                                     1,1,1,1,1,1,1,1,0,0,0,0,
                                                     1,0,1,0,1,0,1,0,1,0,0,0,
                                                     1,1,1,1,1,1,1,1,1,1,0,0,
                                                     1,0,1,0,1,0,1,0,1,0,1,0,
                                                     1,1,1,1,1,1,1,1,1,1,1,1),
                                                   nrow = 12, ncol = 12))
Thanks for the quick answer and the package btw
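For larger networks it may be less error-prone to build the layer.struct matrix programmatically. A hedged sketch (not from the original posts), assuming the same rule the hand-typed matrix encodes: edges may only go to the same or a later time step, and never from a biotic layer to an abiotic one.
n_steps <- 6
n_layers <- 2 * n_steps
layer_time   <- rep(seq_len(n_steps), each = 2)  # time step of each layer
layer_biotic <- rep(c(FALSE, TRUE), n_steps)     # odd layers abiotic, even layers biotic

layer_struct <- matrix(0L, n_layers, n_layers)
for (i in seq_len(n_layers)) {
  for (j in seq_len(n_layers)) {
    same_or_later     <- layer_time[j] >= layer_time[i]
    biotic_to_abiotic <- layer_biotic[i] && !layer_biotic[j]
    if (same_or_later && !biotic_to_abiotic) layer_struct[i, j] <- 1L
  }
}
# then pass layer.struct = layer_struct to learn.dynamic.network()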

SMOTE in r reducing sample size significantly

I have a data set with around 130,000 records. The records are divided into two classes of the target variable, 0 and 1, and class 1 makes up only 0.09% of the total.
I'm running my analysis in R 3.5.1 on Windows 10. I used the SMOTE algorithm to work with this imbalanced data set.
I used the following code to handle the imbalanced data set:
library(DMwR)
data_code$target=as.factor(data_code$target) #Converted to factor as
# SMOTE works with factor data type
smoted_data <- SMOTE(target~., data_code, perc.over=100)
But after executing the code, I see that the count for 0 is 212 and the count for 1 is also 212, which is a significant reduction of my sample size. Can you suggest how to handle this imbalanced data set with SMOTE without shrinking my data?
You need to play a bit with the two parameters available in the function: perc.over and perc.under.
As per the doc from SMOTE:
The parameters perc.over and perc.under control the amount of
over-sampling of the minority class and under-sampling of the majority
classes, respectively.
So:
perc.over will typically be a number above 100. With this type of
values, for each case in the original data set belonging to the
minority class, perc.over/100 new examples of that class will be
created
I can't see your data but, if your minority class has 100 cases and perc.over = 100, the algorithm will generate perc.over/100 = 1 new case for each of them, i.e. 100 new synthetic cases.
The parameter perc.under controls the proportion of cases of the
majority class that will be randomly selected for the final "balanced"
data set. This proportion is calculated with respect to the number of
newly generated minority class cases.
So, for example, a value of perc.under=100 will select from the majority class in the original data the same number of observations as have been generated for the minority class.
In our example, 100 new minority cases were generated, so perc.under=100 would add 100 majority-class cases, giving a final data set with 200 minority cases (original plus synthetic) and 100 majority cases.
I suggest using values above 100 for perc.over, and an even higher value for perc.under (the defaults are 100 and 200).
Keep in mind that you're adding new observations to your minority class that are not real, so I'd try to keep these under control.
Numeric example:
set.seed(123)
data <- data.frame(var1 = sample(50),
                   var2 = sample(50),
                   out = as.factor(rbinom(50, 1, prob = 0.1)))
table(data$out)
# 0 1
# 43 7 # 50 rows total (original data)
smote_data <- DMwR::SMOTE(out ~ var1, data, perc.over = 200, perc.under = 400)
table(smote_data$out)
# 0 1
# 56 21 # 77 rows total (smote data)
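The counts in that output follow directly from the two parameters:
# synthetic minority cases: 7 * perc.over/100   = 7 * 2  = 14
# total minority ("1"):     7 + 14              = 21
# majority kept ("0"):      14 * perc.under/100 = 14 * 4 = 56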
An alternative to the DMwR package is the smotefamily package which does not reduce the sample size.
Instead, it creates additional (synthesized) data from the minority class and adds it to the original data, so the $data element of the output is ready for training. To tune the amount of synthesized data, you can modify the parameter dup_size. However, the default dup_size = 0 already sizes the output to achieve balanced classes, so you don't need to tune it.
This is explained well in a blog post by Richard Richard.
Example code (with features in first two columns):
library(magrittr)  # for %>%
library(caret)     # for train() and confusionMatrix()
smote1 <- smotefamily::SMOTE(features, target, K = 4, dup_size = 0)
formula1 <- "class ~ ." %>% as.formula
model.smote <- caret::train(formula1, method = "rpart", data = smote1$data)
predictions.smote <- predict(model.smote, smote1$data[,1:2]) %>% print
cv2 <- confusionMatrix(smote1$data$class %>% as.factor, predictions.smote)
I find smotefamily::SMOTE more convenient because you don't have to tune the two parameters perc.over and perc.under until you get an acceptable sample size, and DMwR::SMOTE often generates NA values.
I know I'm a little too late to answer your question, but I hope this answer helps others! The package you're using, DMwR, uses a combination of SMOTE and under-sampling of the majority class.
I'd suggest using smotefamily::SMOTE instead, as it only over-samples the minority class, so you don't lose any majority-class observations.
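To see the difference on the same toy data (a hedged sketch, not from either answer, reusing the data object from the DMwR numeric example above):
library(smotefamily)

# smotefamily::SMOTE takes a numeric feature frame and a target vector
smote2 <- SMOTE(X = data[, c("var1", "var2")],
                target = as.character(data$out),
                K = 5, dup_size = 0)   # dup_size = 0: balance automatically

table(smote2$data$class)
# the 43 majority rows are all kept; only synthetic minority rows are added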

R - warning for dissimilarity calculation, clustering with numeric matrix

Reproducible data:
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
Convert dataframe to matrix
Matrix_from_Data <- data.matrix(Data)
Check the structure
str(Matrix_from_Data)
num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "X" "Y" "Z"
The question:
I have a dataframe of binary, symmetric variables (larger than the example above), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.
I convert the dataframe into a matrix before attempting to run the daisy function from the 'cluster' package, to get the dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but am running into a warning (not an error):
library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean", type = list(symm =c(1:ncol(Matrix_from_Data))))
Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
with mixed variables, metric "gower" is used automatically
...which seems weird to me, since "Matrix_from_Data" is all numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others impact the clustering.
What am I missing?
Great question.
First, that message is a warning and not an error. I'm not personally familiar with daisy, but the likely cause is that passing type = list(symm = ...) tells daisy to treat those columns as symmetric binary (i.e. not plain numeric) variables, so it falls back to the Gower coefficient and warns that your requested metric was not used.
Regardless of why that warning appears, one simple way to compare the clustering done by several different distance measures in hierarchical clustering is to plot the dendrograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics programmed into dist. You can use ?dist to read up on what the "binary" distance means here.
# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
X = sample(c(0,1), 10, replace = TRUE),
Y = sample(c(0,1), 10, replace = TRUE),
Z = sample(c(0,1), 10, replace = TRUE)
)
# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")
# Plot the dendrograms side-by-side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))
I generally read dendrograms from the bottom up, since points lower on the vertical axis are more similar (i.e. less distant) to one another than points higher on the vertical axis.
We can pick up a few things from these plots:
4/6, 5/10, and 7/8 are grouped together using both metrics. We should hope this is true if the rows are identical :)
3 is most strongly associated with 7/8 for both distance metrics, although the degree of association is a bit stronger in the binary distance as opposed to the Euclidean distance.
1, 2, and 9 have some notably different relationships between the two distance metrics (e.g. 1 is most strongly associated with 2 in Euclidean distance but with 9 in binary distance). It is in situations like this where the choice of distance metric can have a significant impact on the resulting clusters. At this point it pays to go back to your data and understand why there are differences between the distance metrics for these three points.
Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), but you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
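For example, a hedged sketch reusing the binary-distance matrix from above to compare two linkage methods:
# Same comparison idea, but across linkage methods for one distance matrix
par(mfrow = c(1, 2))
plot(hclust(mat_bin, method = "complete"), main = "Binary, complete linkage")
plot(hclust(mat_bin, method = "single"),   main = "Binary, single linkage")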
Hope that helps!
