Stacking multiple design terms using the survey package in R

I am trying to understand how to combine several designs using the survey package in R.
For example, to construct survey weights, I need to post-stratify, calibrate, trim weights, and re-calibrate.
I have tried to stack the design terms in the following sequence:
library(survey)
n <- nrow(my_data)
d <- rep(N/n, n)   # N is the population size (defined elsewhere)
f <- rep(n/N, n)
# specifying SRS survey design
srs.design <- svydesign(ids = ~0, strata = NULL,
                        data = my_data,
                        weights = ~d, fpc = ~f)
# specifying post-stratification survey design
ps.design <- postStratify(design = srs.design,
                          strata = ~postsurvey_strata,
                          population = N.ps)   # N.ps is the post-strata population distribution
# specifying raking survey design
rake.design <- calibrate(design = ps.design,
                         formula = ~ as.factor(gender_age) +
                                     as.factor(education) +
                                     as.factor(race) +
                                     as.factor(income),
                         calfun = "raking",
                         population = pop.P_sam)   # pop.P_sam is the demographic distribution in the population
# specifying trimming of the weights
trim.design <- trimWeights(design = rake.design, lower = 0.2, upper = 6)
# specifying re-calibration (re-raking)
rerake.design <- calibrate(design = trim.design,
                           formula = ~ as.factor(gender_age) +
                                       as.factor(education) +
                                       as.factor(race) +
                                       as.factor(income),
                           calfun = "raking",
                           population = pop.P_sam)
(The main reason I keep post-stratification as a separate step is that there are 13 post-strata. Some of them are quite small, so I am concerned that folding the post-stratification variable into the raking would lead to convergence problems.)
I would really like to know whether this is the right approach and whether there are more succinct ways. Thanks!
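For what it's worth, one possibly more compact route is to let calibrate() bound the weight adjustments directly through its bounds= argument. Note that bounds= constrains the calibration adjustment ratios rather than the final weights, so it is not an exact substitute for trimWeights() plus re-calibration; treat the following as a sketch to check against your version of the survey package, reusing the objects defined above:
# hedged sketch: raking with bounded adjustment ratios (not a drop-in
# replacement for the explicit trim + re-rake steps above)
rake.bounded <- calibrate(design = ps.design,
                          formula = ~ as.factor(gender_age) + as.factor(education) +
                                      as.factor(race) + as.factor(income),
                          calfun = "raking",
                          bounds = c(0.2, 6),   # bounds on adjustment ratios, not on weights
                          population = pop.P_sam)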

How to combine all datasets into a data frame after multiple imputation (mice)

I read this article (https://journal.r-project.org/archive/2021/RJ-2021-073/RJ-2021-073.pdf) about multiple imputation and propensity score matching - here is the code from this article:
# code from "MatchThem:: Matching and Weighting after Multiple Imputation", Pishgar et al, The R Journal Vol. XX/YY, AAAA 20ZZ:
library(MatchThem)
data('osteoarthritis')
summary(osteoarthritis)
library(mice)
imputed.datasets <- mice(osteoarthritis, m = 5)
matched.datasets <- matchthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
                              datasets = imputed.datasets,
                              approach = 'within',
                              method = 'nearest',
                              caliper = 0.05,
                              ratio = 2)
weighted.datasets <- weightthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
                                datasets = imputed.datasets,
                                approach = 'across',
                                method = 'ps',
                                estimand = 'ATM')
library(cobalt)
bal.tab(matched.datasets, stats = c('m', 'ks'), imp.fun = 'max')
bal.tab(weighted.datasets, stats = c('m', 'ks'), imp.fun = 'max')
library(survey)
matched.models <- with(matched.datasets,
                       svyglm(KOA ~ OSP, family = quasibinomial()),
                       cluster = TRUE)
weighted.models <- with(weighted.datasets,
                        svyglm(KOA ~ OSP, family = quasibinomial()))
matched.results <- pool(matched.models)
summary(matched.results, conf.int = TRUE)
As far as I understand, the author first uses multiple imputation with mice (m = 5) and then performs matching with MatchThem. In the end, matchthem() returns a "mimids" object called matched.datasets, which contains the 5 imputed datasets.
There is the complete() function, which can extract one of the datasets, e.g.
newdataset <- complete(matched.datasets, 2) # extracts the second dataset.
So newdataset is a data frame without NAs (because the missing values were imputed) and can be used for further tests.
Now I would like to use a single data frame (as after calling complete()), but one that is some kind of "mean" of all the datasets, because how else could I decide which of the 5 datasets to use for my further analyses? Is there a way of doing something like this:
meanofdatasets <- complete(matched.datasets, meanofall5datasets) # extracts a dataset which contains something like the mean values of all datasets
In my data, for which I want to use this method, I would like to use an imputed and matched dataset of my original ~500 rows to do further tests, e.g. Cox regression, Kaplan-Meier plots or competing-risk analyses, as well as simple descriptive statistics with plots about the matched population. But on which of the 5 datasets should I run these tests? For those tests I need a real data frame, don't I?
Thank you for any help!
Here is a valuable source (from the creator of the mice package, Stef van Buuren) on why you should NOT average the multiple datasets, but instead POOL the estimates from each imputed dataset, for instance when doing your Cox regression (see section 5.1, Workflow).
Quick steps for Cox regression:
You can easily do the multiple imputation (with mice) plus matching with matchthem(), which will give you a mimids-class object.
Then run your Cox regression through the with() function on your mimids object.
Finally, pool your estimates through pool(), which will give you a mimira object.
The mimira object is easily handled with the gtsummary package (tbl_regression), which gives you a polished, publication-ready table.
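A minimal sketch of that workflow, reusing matched.datasets from the article's code above and assuming hypothetical survival columns TIME and EVENT (the osteoarthritis data do not contain these; substitute your own), and assuming with() handles svycoxph() the same way it handles svyglm() in the article:
library(survival)   # for Surv()
library(survey)     # for svycoxph()
# Cox model fitted in each matched dataset (mirrors the article's svyglm call)
cox.models <- with(matched.datasets, svycoxph(Surv(TIME, EVENT) ~ OSP))
# pool the per-dataset estimates with Rubin's rules
cox.pooled <- pool(cox.models)
summary(cox.pooled, conf.int = TRUE)
# as suggested above, gtsummary can format the pooled result:
# library(gtsummary)
# tbl_regression(cox.pooled, exponentiate = TRUE)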

Summary statistics for weighted values using the ANESRAKE package in R

I have created weighted data for my survey using the anesrake and weights packages. However, I am not sure how to use the weights afterwards, besides the wpct function in the package. How can I compute, say, descriptive statistics and integrate the weighted data with other functions/packages?
Reproducible data from the anesrake package:
data("anes04")
anes04$caseid <- 1:length(anes04$age)
anes04$agecats <- cut(anes04$age, c(0, 25,35,45,55,65,99))
levels(anes04$agecats) <- c("age1824", "age2534", "age3544",
"age4554", "age5564", "age6599")
marriedtarget <- c(.4, .6)
agetarg <- c(.10, .15, .17, .23, .22, .13)
names(agetarg) <- c("age1824", "age2534", "age3544",
"age4554", "age5564", "age6599")
targets <- list(marriedtarget, agetarg)
names(targets) <- c("married", "agecats")
outsave <- anesrake(targets, anes04, caseid=anes04$caseid,
verbose=TRUE)
caseweights <- data.frame(cases=outsave$caseid, weights=outsave$weightvec)
This will give me a new vector with weights for the data frame. So, my question is: how can I analyze the data now? How can I incorporate these weights into summary statistics?
You could supply the weights as the weights= argument to survey::svydesign. Ideally, you'd do the raking in the survey package so that you could take account of the variance reductions from raking, but it's pretty standard (at least in public-use data) to analyse raked weights as if they were just sampling weights.
Or, if the raking specification you ended up with was simple enough to reproduce in survey::rake or survey::calibrate, you could redo the raking in the survey package.
The reason for using the survey package is the very wide range of other analyses it allows (and even more with svyVGAM).
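A minimal sketch of the first suggestion, reusing outsave and anes04 from the question's code (the variables summarised here are just examples):
library(survey)
# attach the raking weights and declare them as sampling weights
anes04$rakeweight <- outsave$weightvec
des <- svydesign(ids = ~1, weights = ~rakeweight, data = anes04)
# weighted descriptive statistics
svymean(~age, des)                    # weighted mean of a numeric variable
prop.table(svytable(~agecats, des))   # weighted proportions of a factor
svyby(~age, ~married, des, svymean)   # weighted group means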

R One-Class SVM - Get Probabilistic outputs

I am trying to find a way to derive probabilistic outputs when predicting from a one-class SVM in R. I know this is not supported in libsvm, and I also know this question has been asked here on SO a couple of years ago, but packages were not available at that time. I'm hoping things have changed now! The question is also still valid, as no approach implemented in R was given as a solution.
I could not find a package to do this so I tried two approaches myself to get around this:
Get the decision values and transform them through the sigmoid activation function. This is described in this paper. Note the paragraph:
Furthermore, SVMs can also produce class probabilities as output instead of class labels. This can be done by an improved implementation (Lin, Lin, and Weng 2001) of Platt's a posteriori probabilities (Platt 2000), where a sigmoid function P(y = 1 | f) = 1 / (1 + exp(A*f + B)) is fitted to the decision values f of the binary SVM classifiers, A and B being estimated by minimizing the negative log-likelihood function.
Use a logistic regression function on the predicted output and derive the probabilities from it. This approach was first described by Platt, and an approach is outlined here (a minimal sketch is given below).
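For reference, textbook Platt scaling fits the sigmoid to the decision values against the known labels (not to the raw features); a minimal sketch, assuming a numeric vector dec_vals of decision values and a 0/1 vector y_true of true labels (both hypothetical names):
# hedged sketch of Platt scaling: logistic regression of the known labels
# on the SVM decision values; the fitted curve maps scores to probabilities
platt_fit  <- glm(y_true ~ dec_vals, family = binomial())
platt_prob <- predict(platt_fit, type = "response")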
To check whether either of my two solutions is plausible, I tested both approaches on a two-class SVM problem, since e1071 (via libsvm) produces probabilities for two-class problems, and these were taken as the 'truth'. I found that neither of my approaches aligned closely with libsvm.
Here are three graphs showing the resulting probabilities versus the known decision values: click to see the image. (Sorry, I don't have enough reputation to embed it; perhaps someone with a higher reputation can edit it in.)
I think my Platt approach is theoretically more sound but, as can be seen from the graph, it appears the logistic regression was somehow too good, the probabilities associated with either classification being extremely close to 1 for positive and 0 for negative.
My code for the Platt implementation is
platt_scale <- function(oc_svm, X){
  # Get SVM predictions
  y_pred <- predict(oc_svm$best.model, X)
  # y_pred <- as.factor(ifelse(y_pred == T, "pos", "neg"))
  # Train using logistic regression with cross-validation
  require(caret)
  model <- train(x = X,
                 y = y_pred,
                 method = "glm",
                 family = binomial(),
                 trControl = trainControl(method = "cv",
                                          number = 5),
                 control = list(maxit = 50))  # brought in to stop warning messages
  return(predict(model,
                 newdata = X,
                 type = "prob")[, 1])
}
I get the following warning when this runs
glm.fit: fitted probabilities numerically 0 or 1 occurred
So I am clearly doing something wrong! I feel like fixing this function is probably the best approach, but I can't see where I have gone wrong. I am following the approach I mentioned earlier, here.
I get the sigmoid of the decision values as follows
sig_mult <-e1071::sigmoid(decision_values)
The examples were done using the iris dataset; the full code is below.
data(iris)
two_class <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# Make two-class SVM
svm_mult <- e1071::tune(svm,
                        train.x = two_class[, 1:4],
                        train.y = factor(two_class[, 5], levels = c("setosa", "versicolor")),
                        type = "C-classification",
                        kernel = "radial",
                        gamma = 0.05,
                        cost = 1,
                        probability = T,
                        tunecontrol = tune.control(cross = 5))
# Get related decision values
dec_vals_mult <- attr(predict(svm_mult$best.model,
                              two_class[, 1:4],
                              decision.values = T),
                      "decision.values")
# Get related probabilities
prob_mult <- attr(predict(svm_mult$best.model,
                          two_class[, 1:4],
                          probability = T),
                  "probabilities")[, 1]
# Transform decision values using the sigmoid
sig_mult <- e1071::sigmoid(dec_vals_mult)
# Use the Platt implementation function to derive probabilities
platt_imp <- platt_scale(svm_mult, two_class[, 1:4])
require(ggplot2)
data2 <- as.data.frame(cbind(dec_vals_mult, sig_mult))
names(data2) <- c("Decision.Values", "Sigmoid.Decision.Values(Prob)")
sig <- ggplot(data = data2, aes(x = Decision.Values,
                                y = `Sigmoid.Decision.Values(Prob)`,
                                colour = ifelse(Decision.Values < 0, "neg", "pos"))) +
  geom_point() +
  ylim(0, 1) +
  theme(legend.position = "none")
data3 <- as.data.frame(cbind(dec_vals_mult, prob_mult))
names(data3) <- c("Decision.Values", "Probabilities")
actual <- ggplot(data = data3, aes(x = Decision.Values,
                                   y = Probabilities,
                                   colour = ifelse(Decision.Values < 0, "neg", "pos"))) +
  geom_point() +
  ylim(0, 1) +
  theme(legend.position = "none")
data4 <- as.data.frame(cbind(dec_vals_mult, platt_imp))
names(data4) <- c("Decision.Values", "Platt")
plat_imp <- ggplot(data = data4, aes(x = Decision.Values,
                                     y = Platt,
                                     colour = ifelse(Decision.Values < 0, "neg", "pos"))) +
  geom_point() +
  ylim(0, 1)
require(ggpubr)
ggarrange(actual, plat_imp, sig,
          labels = c("Actual", "Platt Implementation", "Sigmoid Transformation"),
          ncol = 3,
          label.x = -.05,
          label.y = 1.001,
          font.label = list(size = 8.5, color = "black", face = "bold", family = NULL),
          common.legend = TRUE, legend = "bottom")

The xgboost package and random forest regression

The xgboost package allows you to build a random forest (in fact, it chooses a random subset of columns for the whole tree rather than for each node, as in the classical version of the algorithm, but this can be tolerated). However, it seems that for regression only one tree from the forest (maybe the last one built) is used.
To ensure that, consider just a standard toy example.
library(xgboost)
library(randomForest)
data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
                     label = agaricus.train$label)
bst = xgb.train(data = dtrain,
                nround = 1,
                subsample = 0.8,
                colsample_bytree = 0.5,
                num_parallel_tree = 100,
                verbose = 2,
                max_depth = 12)
answer1 = predict(bst, dtrain)
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)
forest = randomForest(x = as.matrix(agaricus.train$data),
                      y = agaricus.train$label, ntree = 50)
answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)
Yes, of course, the default version of the xgboost random forest uses not a Gini score function but simply the MSE; that can be changed easily. Also, it is not correct to do validation this way, and so on, but that does not affect the main problem: regardless of which sets of parameters are tried, the results are surprisingly bad compared with the randomForest implementation. This holds for other data sets as well.
Could anybody provide a hint on such strange behaviour? For classification tasks the algorithm does work as expected.
Well, all trees are grown and all are used to make a prediction. You can check that using the ntreelimit parameter of the predict function.
The main problem remains: is the specific form of the random forest algorithm produced by the xgboost package valid?
Cross-validation, parameter tuning and the rest have nothing to do with that -- anyone can add the necessary corrections to the code and see what happens.
You may specify the 'objective' option like this:
mse = function(predict, dtrain)
{
  real = getinfo(dtrain, 'label')
  return(list(grad = 2 * (predict - real),
              hess = rep(2, length(real))))
}
This ensures that the MSE is used when choosing a variable for the split. Even after that, the results are surprisingly bad compared to those of randomForest.
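For completeness, a minimal sketch of how such a custom objective is plugged in, reusing dtrain from the question and the mse function above (xgb.train takes custom objectives through its obj= argument):
# hedged sketch: random-forest-style booster trained with the custom MSE objective
bst_mse = xgb.train(data = dtrain,
                    nrounds = 1,
                    obj = mse,
                    params = list(num_parallel_tree = 100,
                                  subsample = 0.8,
                                  colsample_bytree = 0.5,
                                  max_depth = 12))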
Maybe the problem is of an academic nature and concerns the way the random subset of features for a split is chosen. The classical implementation chooses a subset of features (its size set with mtry in the randomForest package) for EVERY split separately, whereas the xgboost implementation chooses one subset per tree (set with colsample_bytree).
So this fine difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
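As an aside (not part of the original answer): newer xgboost releases expose a colsample_bynode parameter that resamples columns at every split, which is much closer to randomForest's mtry; whether it is available depends on your xgboost version. A minimal sketch:
# hedged sketch: per-split column sampling (check that your xgboost version
# supports colsample_bynode before relying on this)
bst_rf = xgb.train(data = dtrain,
                   nrounds = 1,                             # one boosting round...
                   params = list(num_parallel_tree = 100,   # ...of 100 bagged trees
                                 subsample = 0.8,
                                 colsample_bynode = 0.5,    # sample columns at each split
                                 eta = 1,                   # no shrinkage: pure bagging
                                 max_depth = 12))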
xgboost (random forest style) does use more than one tree to predict. But there are many other differences to explore.
I am new to xgboost myself, but curious, so I wrote the code below to visualize the trees. You can run the code yourself to verify this or to explore other differences.
Your data set of choice is a classification problem, as the labels are either 0 or 1. I like to switch to a simple regression problem to visualize what xgboost does.
True model: y = x1 * x2 + noise
If you train a single tree or multiple trees with the code examples below, you will observe that the learned model structure does contain more than one tree. You cannot tell from the prediction accuracy alone how many trees were trained.
Maybe the predictions differ because the implementations differ. None of the ~5 RF implementations I know of are exactly alike, and this xgboost (RF style) is at best a distant "cousin".
I observe that colsample_bytree is not equal to mtry, because the former uses the same subset of variables/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if a tree uses only x1 or only x2. Thus, in this case, colsample_bytree must be set to 1 to use both variables in all trees. Regular RF could model this problem with mtry = 1, as each node would use either x1 or x2.
I see your randomForest predictions are not out-of-bag cross-validated. If you draw any conclusions from predictions, you must cross-validate, especially for fully grown trees.
NB: You need to fix the function vec.plot, as it does not support xgboost out of the box, because xgboost does not take a data.frame as valid input. The instructions in the code should be clear.
library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2, rnorm(5000)))
Data$y = Data$X1 * Data$X2 + rnorm(5000) * .5
gradientByTarget = fcol(Data, 3)
plot3d(Data, col = gradientByTarget)  # true data structure
fix(vec.plot)  # change these two lines in the function, as xgboost does not support data.frame:
# 16#  yhat.vec = predict(model, as.matrix(Xtest.vec))
# 21#  yhat.obs = predict(model, as.matrix(Xtest.obs))
# 1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 250))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget, grid = 200)
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])), col = gradientByTarget)
# clearly just one tree
# 100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 100, params = list(max.depth = 16, eta = .5, subsample = .6))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])), col = gradientByTarget)  # predictions are not OOB cross-validated!
# 20 deep trees (bagging)
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 250,
                                               num_parallel_tree = 20, colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)  # bagged mix of trees
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])))  # terrible fit!!
# problem: colsample_bytree is NOT mtry, as columns are only sampled once per tree
# (this could be raised as an issue on their GitHub page, that this does not mimic RF)
# 200 deep trees (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[, 1:2]), label = Data$y,
                    nrounds = 1, params = list(max.depth = 500,
                                               num_parallel_tree = 200, colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model, as.matrix(Data[, 1:2]), 1:2, col = gradientByTarget)  # bagged mix of trees
plot(Data$y, predict(xgb.model, as.matrix(Data[, 1:2])))
# voila, the model can fit the data
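As a quick check of the earlier point that prediction accuracy alone does not tell you how many trees were trained, you can inspect the fitted booster directly (xgb.model.dt.tree is part of the xgboost R package; this reuses the last xgb.model from above):
# count the distinct trees actually stored in the booster
tree_table = xgboost::xgb.model.dt.tree(model = xgb.model)
length(unique(tree_table$Tree))   # should equal num_parallel_tree for the last model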

Simulate an artificial state-change sequence from a fitted semi-Markov model in R

I have a sequence of behavioural states (for a single moving animal), each with an associated duration, and am interested in producing a synthetic state sequence that preserves the properties of the original (particularly, the state-change probabilities, and the dwell-time distributions).
However, preliminary investigations of the dwell-time distributions indicate they are not geometrically distributed, so I think I should fit a semi-Markov model.
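For illustration, a check of that kind can be done in base R; here is a minimal sketch, assuming a hypothetical integer vector dwell of observed dwell times (in time steps, starting at 1):
# compare observed dwell-time frequencies with a fitted geometric distribution
p_hat <- 1 / mean(dwell)                                  # MLE on support 1, 2, 3, ...
obs   <- table(factor(dwell, levels = 1:max(dwell))) / length(dwell)
k     <- as.integer(names(obs))
plot(k, as.numeric(obs), type = "h",
     xlab = "dwell time (steps)", ylab = "relative frequency")
points(k, dgeom(k - 1, prob = p_hat), pch = 19, col = "red")  # fitted geometric pmf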
The R package 'SemiMarkov' provides the reproducible example below of how to fit a semi-Markov model to an observed sequence of state changes.
However, I cannot see how to then produce a synthetic sequence of state changes from the fitted object (here, fit1).
library(SemiMarkov)
data(asthma)
## Definition of the model: states, names, possible transitions and waiting time distributions
states_1 <- c("1", "2", "3")
mtrans_1 <- matrix(FALSE, nrow = 3, ncol = 3)
mtrans_1[1, 2:3] <- c("E", "E")
mtrans_1[2, c(1, 3)] <- c("E", "E")
mtrans_1[3, c(1, 2)] <- c("E", "E")
## semi-Markov model
fit1 <- semiMarkov(data = asthma, states = states_1, mtrans = mtrans_1)
print(fit1)
FWIW, it is possible to fit a state-change sequence with a regular Markov model (using markovchainFit in the markovchain package) and then simulate a synthetic state-change sequence (using rmarkovchain), but there seems to be no obvious way of doing the same for a semi-Markov model.
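For reference, a minimal sketch of that regular-Markov route, assuming a hypothetical character vector obs_states holding the observed state sequence:
library(markovchain)
# fit a first-order Markov chain to the observed state sequence
mc_fit <- markovchainFit(data = obs_states)
# simulate a synthetic state sequence of length 1000 from the fitted chain
sim_states <- rmarkovchain(n = 1000, object = mc_fit$estimate)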
