How to use bootstrapping and weighted data? - r

I have two variables that I'd like to analyze with a 2x2 table, which is easy enough.
datatable <- table(data$Q1data1, data$Q1data2)
summary(datatable)
However, I need to weight each variable separately using two frequency weighting variables that I have. So far, I've only found the wtd.chi.sq function in the weights package, which only allows you to weight both variables by the same weighting variable.
In addition, I need to perform this 2x2 chi-square 1000 times using bootstrapping or some resampling method, so that I can eventually peek at the distribution of p-values.
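In the meantime, a rough sketch of the resampling loop with a single weight column (here called w, an assumed column name) would be to build a weight-summed 2x2 table with xtabs() and bootstrap the rows:

set.seed(1)
n_boot <- 1000
p_values <- replicate(n_boot, {
  d   <- data[sample(nrow(data), replace = TRUE), ]   # resample rows with replacement
  tab <- xtabs(w ~ Q1data1 + Q1data2, data = d)       # 2x2 table of summed weights
  suppressWarnings(chisq.test(tab)$p.value)
})
hist(p_values)                                         # distribution of the 1000 p-values

but this still applies the same weight w to both variables, which is exactly the limitation I am trying to get around.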

Related

How to transform data after fitting a distribution with gamlss?

I have a data set where observations come from highly distinct groups. Each group may have a wildly different distribution, so I am trying to find the best distribution using fitdist from fitdistrplus, then use gamlssML from the gamlss package to find the best parameters.
My issue is with transforming the data after this step. For some of the distributions, like the Box-Cox t, I can find the equation for normalizing the data using the BCT coefficients, but for many of these distributions I cannot.
Does gamlss have a function that normalizes the data after fitting? Their documentation only provides the transformations for a small number of distributions: https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf
Thanks a lot
The normalised data values (for any distribution) are exactly equal to the normalised quantile residuals from a gamlss fit,
m1 <- gamlss(...)
which can be accessed by
residuals(m1)
or
m1$residuals
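For example, here is a minimal sketch with simulated Box-Cox t data (the rBCT parameters are arbitrary and only for illustration):

library(gamlss)

set.seed(1)
y  <- rBCT(500, mu = 10, sigma = 0.2, nu = 1, tau = 5)   # simulated BCT data
m1 <- gamlss(y ~ 1, family = BCT)                        # constant-parameter BCT fit
z  <- residuals(m1)                                      # normalised (quantile) residuals
qqnorm(z); qqline(z)                                     # should look roughly standard normal

If the chosen family fits well, these residuals are approximately standard normal, which is exactly the normalisation you are after.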

How to compute DeLong in R with nested cross-validation

I need to compute the confidence interval and p-value for a comparison of AUCs, but I have a nested cross-validation, so I have 10 models instead of one and my final result is their average. As I do not have much data, I do not have an additional test set. But I have been asked to perform the comparison of AUCs using DeLong's test and to obtain confidence intervals for my final result (the average over the nested cross-validation).
I wonder if I can concatenate all the scores of the 10 models from the nested cross-validation as the input to the DeLong test in R.
I have seen in the manual an example that uses the original features (not predicted probabilities); how can that be? Does it perform a logistic regression internally to obtain them?
library(Daim)
M2 <- deLong.test(iris[,1:4], labels=iris[,5], labpos="versicolor")
Could I use the same idea?
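In case it clarifies what I mean, the version of that idea I had in mind uses pROC rather than Daim (truth, scores_A and scores_B are placeholders for my pooled out-of-fold labels and predicted probabilities):

library(pROC)

roc_A <- roc(truth, scores_A)                              # pooled out-of-fold predictions, model A
roc_B <- roc(truth, scores_B)                              # pooled out-of-fold predictions, model B
roc.test(roc_A, roc_B, method = "delong", paired = TRUE)   # DeLong test of the AUC difference
ci.auc(roc_A, method = "delong")                           # DeLong confidence interval for one AUC

I realise that pooling scores across folds ignores the dependence between folds, which is part of what I am unsure about.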

Need help in understanding the below plot generated using randomForestExplainer library in R

I am not able to understand what this plot depicts and what we mean by the relation between rankings. Let me add a few more things to make it a bit more precise. This plot is generated using the plot_importance_rankings function available as part of the randomForestExplainer package.
The code which generates this plot is:
plot_importance_rankings(var_importance_frame)
where var_importance_frame contains the variable importance measures, which we get from
var_importance_frame <- measure_importance(rf_model)
Here rf_model is the trained random forest model. A worked example can be found at this link: RandomForestExplainer - sample example
The randomForestExplainer package implements several measures to assess a given variable's importance in random forest models. On this plot you have:
- mean_min_depth: average minimum depth of a variable across all trees
- accuracy_decrease: accuracy lost by randomly permuting a given variable
- gini_decrease: average gain in purity from splitting on a given variable
- no_of_nodes: number of nodes that split on a given variable across trees
- times_a_root: number of times a given variable is used as the root of a tree
Ideally you would want these importance measures to be somewhat consistent, in that a variable rated as highly important by one measure is also rated highly by the others. This plot shows exactly that, as a sanity check: each dot on the scatter plots represents a variable, and in your case the variable importances are largely consistent and positively correlated.
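For a self-contained illustration on a built-in dataset (not your data; note that measure_importance() expects the forest to have been grown with localImp = TRUE):

library(randomForest)
library(randomForestExplainer)

rf_model <- randomForest(Species ~ ., data = iris, localImp = TRUE, ntree = 500)
var_importance_frame <- measure_importance(rf_model)   # one row per variable, one column per measure
plot_importance_rankings(var_importance_frame)         # pairwise scatter plots of the rankings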

How to describe data after multiple imputation using Amelia (which dataset should I use)?

I did multiple imputation with Amelia using the following code:
binary <- c("Gender", "Diabetes")
exclude.from.IMPUTATION <- c("Serial.ID")
NPvars <- c("age", "HDEF", "BMI")  # skewed (non-parametric) variables
a.out <- Amelia::amelia(x = for.imp.data, m = 10,
                        idvars = exclude.from.IMPUTATION,
                        noms = binary, logs = NPvars)
summary(a.out)
## save imputed datasets ##
Amelia::write.amelia(obj = a.out, file.stem = "impdata", format = "csv")
This gave me 10 different output CSV files (shown in the picture below), and I know that I can use any one of them for descriptive analysis, as shown in prior questions, but:
Why should we do MULTIPLE imputation if we then use only a SINGLE one of these files?
Some authors report using Rubin's Rules to summarize across imputations, as shown here; please advise on how to do that.
You do not use just one of these datasets; as you correctly stated, the whole process of multiple imputation would then be useless.
As jay.sf said, the different datasets express the uncertainty of the imputation. The missing data is ultimately lost; we can only estimate what the real data could look like. With multiple imputation we generate multiple estimates of what the real data could look like. Overall, this can be used to say something like: the missing data most likely lies between ... and ... .
When you generate descriptive statistics, you generate them for each of the imputed datasets separately. Looking for example at the mean, you could then provide as additional information the lowest and the highest mean over these imputed datasets. You can also report the mean of these means and the standard deviation of the mean over the imputed datasets. This way your readers will know how much uncertainty comes with the imputation.
You can also use your imputed datasets to describe the uncertainty in the output of linear models. You do this by using Rubin's Rules (RR) to pool parameter estimates, such as mean differences, regression coefficients and standard errors, and to derive confidence intervals and p-values (see also https://bookdown.org/mwheymans/bookmi/rubins-rules.html).
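As a concrete sketch of Rubin's Rules for a simple descriptive quantity, Amelia itself provides mi.meld() for pooling estimates and standard errors across imputations; here the pooled mean of BMI, chosen only as an example:

m  <- a.out$m                                       # number of imputations (10)
qs <- ses <- matrix(NA, nrow = m, ncol = 1)
for (i in seq_len(m)) {
  d        <- a.out$imputations[[i]]                # i-th completed data set
  qs[i, ]  <- mean(d$BMI)                           # estimate in this data set
  ses[i, ] <- sd(d$BMI) / sqrt(nrow(d))             # its standard error
}
pooled <- Amelia::mi.meld(q = qs, se = ses)
pooled$q.mi                                         # pooled estimate
pooled$se.mi                                        # pooled SE (within- plus between-imputation variance)

For regression models you would fit the model on each of the 10 datasets and pool the coefficients and standard errors in the same way.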

Does the R survey package have a function like prop.test for comparing two population proportions?

I am working with a database that involves weighted statistics and complex survey design, namely the National/Nationwide Inpatient Sample, so I am using R's 'survey' package for tasks like summary statistics, regression, etc.
One test I was interested in running compares the proportion of a certain group in 2 different populations (i.e., is the difference between the proportion of A in Population B and the proportion of A in Population C statistically significant). I have already used svyciprop to plot the confidence intervals for these proportions, and have seen that the two proportions are significantly different. However, I was wondering if there is a function like prop.test, whether or not it is in the 'survey' package, that can run this test for data from a complex survey (e.g. takes a survey design object as an argument) and actually quantify the p-value / t-statistic. Does svyttest have this functionality, or is there another function I could potentially use?
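In case it clarifies what I am after, this is roughly what I imagine, treating each proportion as the mean of a 0/1 indicator (des is my svydesign object; outcome_A and pop_group are placeholder variable names), though I am not sure this is the intended use:

library(survey)

svyttest(I(outcome_A == 1) ~ pop_group, design = des)   # design-based test of the difference in proportions
svychisq(~ outcome_A + pop_group, design = des)         # design-adjusted chi-squared test as an alternative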
Thank you!
