how is obtained the observed cell counts in svychisq in R? - r

I'm using the function survey::svychisq() to test for independence in a two-way contingency table for complex samples.
With svytable() I get the observed counts considering the weights defined in design, and I would assume that the observed values saved in svychisq() objects would be the same, but they are not:
svytable(~row.var+col.var, design)
# 330.6634 867.6478 177.1630
# 687.4503 962.5404 228.2926
and
svychisq(~row.var+col.var, design)$observed
# 404.6712 1061.8411 216.8149
# 841.3126 1177.9722 279.3881
provide different results and I couldn't really understand why.
Could someone explain to me how the latter observed values are calculated?
Thanks!

The $observed component is a weighted table with weights scaled to sum to the sample size.

Related

How to estimate less conservative standard errors when using post-stratified weights without full information in the survey package?

I'm encountering (very) huge standard errors in my analysis of proportions with post-stratified data when using the survey package.
I'm working with a data set including (normalized) weights calculated via raking by another party. I don't know exactly how the strata have been defined (e.g. "ageXgender" has been used, but it's unclear which categorization has been used). Let's assume a simple random sample with a considerable amount of non-response.
Is there any way to estimate reduced standard errors due to post-stratification without the exact information about the procedure in survey? I could recallibrate the weights with rake() if I can exactly define the strata but I don't have enough information for this.
I have tried to infer the strata by grouping all equal weights together and thought that I would at least get an upper bound of the reduction in standard errors this way but using them did only lead to marginally reduced standard errors and sometimes even increased standard errors:
# An example with the api datasets, pretending that pw are post-stratification weights of unknown origin
library(survey)
data(api)
apistrat$pw <-apistrat$pw/mean(apistrat$pw) #normalized weights
# Include some more extreme weights to simulate my data
mins <- which(apistrat$pw == min(apistrat$pw))
maxs <- which(apistrat$pw == max(apistrat$pw))
apistrat[mins[1:5], "pw"] <- 0.1
apistrat[maxs[1:5], "pw"] <- 10
apistrat[mins[6:10], "pw"] <- 0.2
apistrat[maxs[6:10], "pw"] <- 5
dclus1<-svydesign(id=~1, weights=~pw, data=apistrat)
# "Estimate" stratas from the weights
apistrat$ps_est <- as.factor(apistrat$pw)
dclus_ps_est <-svydesign(id=~1, strata=~ps_est, weights=~pw, data=apistrat)
svymean(~api00, dclus1)
svymean(~api00, dclus_ps_est)
#this actually increases the se instead of reducing it
My real weights are also much more complex with 700 unique values in 1000 cases.
Is it possible to somehow approximate the reduction of standard errors due to post-stratification without knowing the real variables and categories and -especially- population values for rake? Could I use rake with some assumptions about the variables and categories used in the strata definitions but without the population totals in some way?
If your data are already raked, then you know the population totals exactly: raking makes the estimated population totals equal the true population totals for the raking variables. So, if you know the raking variables you can estimate the population totals then rake. The raking won't change the weights (because ex hypothesi these were already raked) but it will change the standard error estimates
(The next version of the survey package will have an option in svydesign to do exactly this.)

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is a right place to ask a question like this, but Im not sure where to ask this.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand what is the difference between them as they do give completely different ICC values on the same variables. For example, on the same variable, x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that it is either calculated by hand or using ANOVA - what are they?
By: https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.

Compare variances between two populations with different means

I would like to compare two populations which have different means. I want to find a way to compare their variances, to have an idea of which of the two populations have values that disperse further from the mean.
The issues is that I think I should need a variance standardized/normalized on the mean value of each distribution.
Suggestions?
The next step would be to get a function in R that it is able to do that.
You don't need to standardise/normalise because variance is calculated as distance from the mean so is already normalised around the sample mean.
To demonstrate this run the following code
x<-runif(10000,min=100,max=101)
y<-runif(10000,min=1,max=2)
mean(x)
mean(y)
var(x)
var(y)
You'll see while the mean is different the variance of the two samples is identical (allowing for some difference due to pseudo-random number generation and sample size)

Is it possible to specify a range for numbers randomly generated by mvrnorm( ) in R?

I am trying to generate a random set of numbers that exactly mirror a data set that I have (to test it). The dataset consists of 5 variables that are all correlated with different means and standard deviations as well as ranges (they are likert scales added together to form 1 variable). I have been able to get mvrnorm from the MASS package to create a dataset that replicated the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and std. dev. through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose score I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample matches those of the original dataset.
A simple way to achieve this is with resampling
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
An alternative is to use a parametric model for the margins and/or the dependence structure (copula). But as staded by #dickoa, this will require serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicity) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.

randomForest's importance only contains MeanDecreaseGini

I have two scripts which both generate random forests in R, which as far as I can work out have the same inputs, although my problem suggests this isn't the case. One of them returns an importance table containing
row.names importance.blue importance.red importance.MeanDecreaseAccuracy importance.MeanDecreaseGini
the other importance table just contains
row.names MeanDecreaseGini
Whats the difference between these two forests, and more importantly what's causing the difference given what I thought were identical inputs?
(The scripts are too large to paste here, but both are trying to predict a factor on the basis of a bunch of continuous variables)
The help page of randomForest tells us, that importance (when used for classification) is a matrix with nclass + 2 columns. The first nclass columns are the class-specific measures computed as mean descrease in accuracy. The nclass + 1st column is the mean descrease in accuracy over all classes. The last column is the mean decrease in Gini index.
If importance=FALSE, the last measure is still returned as a vector.
So, it seems to me, that you called randomForest once with importance=TRUE and once with importance=FALSE.

Resources