What R packages are available for binary data that is both correlated and clustered?

I'm working on a project now that's rather unlike anything I've done before. I have two tests with binary results that will be administered to the same sample, which is drawn from a clustered population (i.e., some subjects will be from the same family). I'd like to compare proportions of positive test results, but the clustering makes McNemar's test inappropriate so I've been reading up on alternative approaches. The two main routes seem to be 1) the clustering-adjusted McNemar alternatives by Rao and Scott (1992), Eliasziw and Donner (1991), and Obuchowski (1998), and 2) GEE.
Do you know of any implementations of the Rao-Obuchowski lineage in R (or, I suppose, SAS)? GEE is easy to find, but have you had a positive or negative experience with any particular packages? Is there another route to analyzing these data that I'm completely missing?

You could always just use a clustered bootstrap: resample across families, which you believe are independent. That is, keep families together when you resample. Compute p2 - p1 for each bootstrap sample. After 1000 iterations or so, compute the upper and lower 2.5% quantiles; this gives you a bootstrapped 95% confidence interval. Alternatively, compute the fraction of samples above zero, or whatever your hypothesis is. The procedure should have pretty good properties unless the number of families is small.
It's probably easiest to do this by hand in R rather than relying on any package.
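For instance, a minimal sketch of that clustered bootstrap, assuming a hypothetical data frame df with a family identifier and the two binary test results:

# Hypothetical data frame `df` with columns `family`, `test1`, `test2` (0/1).
set.seed(42)
families <- unique(df$family)
boot_diff <- replicate(1000, {
  # resample families with replacement, keeping each family's rows together
  sampled <- sample(families, length(families), replace = TRUE)
  resampled <- df[unlist(lapply(sampled, function(f) which(df$family == f))), ]
  mean(resampled$test2) - mean(resampled$test1)  # p2 - p1 for this resample
})
quantile(boot_diff, c(0.025, 0.975))  # bootstrapped 95% confidence interval
mean(boot_diff > 0)                   # fraction of resamples above zero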

Check out the survey package: it is designed to take into account correlations induced by clustered sampling.
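For example, a sketch (again assuming a hypothetical df with a family cluster identifier and binary results test1 and test2):

library(survey)
# Clustering by family; no weights supplied, so equal weights are assumed.
des <- svydesign(ids = ~family, data = df)
( m <- svymean(~test1 + test2, des) )    # proportions with cluster-robust SEs
svycontrast(m, c(test1 = -1, test2 = 1)) # difference p2 - p1 with its SE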

Have you already checked the CorrBin package in R? It is designed for the analysis of correlated binary data; see the paper "Using the CorrBin package for nonparametric analysis of correlated binary data" by Szabo, which covers the Rao-Scott test, stochastic ordering, and three versions of a GEE-based test.

The clust.bin.pair package for clustered binary matched-pair data was recently published to CRAN.
It contains implementations of Eliasziw and Donner (1991) and Obuchowski (1998), as well as two more recent tests in the same family, Durkalski (2003) and Yang (2010).
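A hedged sketch of its use with toy numbers (the argument names below are from my reading of the package docs, so check ?clust.bin.pair for the exact interface): the tests take per-cluster counts of the four paired-outcome cells.

library(clust.bin.pair)
# Toy per-cluster counts of the 2x2 paired outcomes: both tests positive (ak),
# only test 1 positive (bk), only test 2 positive (ck), both negative (dk).
ak <- c(2, 1, 0); bk <- c(1, 0, 2); ck <- c(0, 1, 1); dk <- c(3, 2, 1)
clust.bin.pair(ak, bk, ck, dk, method = "obuchowski")
clust.bin.pair(ak, bk, ck, dk, method = "durkalski")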

Related

meta-analysis multiple outcome variables

As you might be able to tell from the sample of my dataset, it contains a lot of dependency, with each study providing multiple outcomes for the construct I am looking at. I was planning on using the metacor library because I only have information about the sample sizes, not the variances. However, all methods I came across that deal with dependency, such as the robumeta package, use variance (I know some people average the effect sizes for a study, but I read that tends to produce larger error rates). Do you know if there is an equivalent package that uses only sample size, or is it mathematically impossible to determine the weights without it?
Please note that I am a student, not an expert.
You could use the escalc function of the metafor package to calculate the variances for each effect size. In the case of correlations, it only needs the raw correlation coefficients and the corresponding sample sizes.
See the section "Outcome Measures for Variable Association" in the escalc documentation: https://www.rdocumentation.org/packages/metafor/versions/2.1-0/topics/escalc
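For example, a sketch with made-up correlations and sample sizes; for measure "ZCOR" the sampling variance is 1/(ni - 3), so the sample sizes are all you need:

library(metafor)
dat <- data.frame(ri = c(0.30, 0.45, 0.12), ni = c(50, 80, 120))  # made-up data
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)
dat  # yi = Fisher's z effect sizes, vi = sampling variances (1/(ni - 3))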

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also, is there any way to use R to calculate the Generalized Unequal Probability Estimator formulas?
A note on the formulas: pi_i is the inclusion probability and pi_ij is the joint inclusion probability.
There is an R package for exactly this, pps, with documentation on CRAN.
There is also another package called survey, which has a bit of documentation as well.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly speaking you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1 and ds2. Show us what code you've tried if it's causing problems. I recommend you use either dplyr or data.table; a sketch follows.
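A minimal sketch of that approach, assuming hypothetical column names (ds1$state with the parameter of interest; ds2$state and ds2$pop):

library(dplyr)
# Join the parameter of interest (ds1) to the population sizes (ds2).
ds <- inner_join(ds1, ds2, by = "state")
# Weighted sampling without replacement, probability proportional to population.
n <- 10
ds[sample(nrow(ds), size = n, prob = ds$pop), ]

Note that sample() without replacement draws units sequentially, so the realized inclusion probabilities are only approximately proportional to size; for an exact fixed-size PPS design, the sampling package (inclusionprobabilities() followed by UPsystematic()) is a common route.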
Your second question should be asked separately, and is off-topic on SO, or at least won't get a great response there - statistical questions are best asked on the sister site Cross Validated.

calibration and lift chart with the caret R package

I am comparing various predictive models on a binary classification task using the caret R package, with respect to their predictive performance (liftChart) and prediction accuracy (calibration plot). I ran into the following issues:
1. Sometimes the lift function is very slow when the number of observations is large or there are several competing classifiers.
2. I wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced model (average probability is 5%), and the calibration plot function assumes evenly spaced cuts.
The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now, but those shouldn't be a big deal to change (you could always contribute solutions too).
Max
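For reference, basic usage of the two functions looks roughly like this (a sketch with hypothetical column names; note that later caret versions did add a cuts argument to calibration()):

library(caret)
# `results`: observed classes plus each model's predicted probability
# of the event class (hypothetical column names).
lift_obj <- lift(obs ~ glm_prob + rf_prob, data = results)
xyplot(lift_obj, auto.key = list(columns = 2))
cal_obj <- calibration(obs ~ glm_prob + rf_prob, data = results)
xyplot(cal_obj, auto.key = list(columns = 2))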

two-sided censored model in R (similar to Zelig's Tobit)?

Is there a model for dependent variables that are censored on both sides? And if so, is there an implementation in R? I am only aware of tobit models (e.g. in the Zelig package), but they're obviously only censored on the left side... I wonder if it even makes sense to truncate on both sides...
There's a difference between truncation and censoring, and you need to be aware of which is the case before you start modeling. In a nutshell: censoring means events can be detected, but the measurements are not known completely (i.e., in your case you know neither the exact beginning nor the exact end of the time interval subjects were at risk for the event you're considering). Truncation means events can be observed only if another condition is fulfilled: a popular example is survival in a retirement home that only accepts people over 65 to take up residence - entry into the study population is then truncated at age 65.
If you have data that are some left- and some right-censored, or simultaneously right- and left-censored, the technical term you are looking for is interval censored. ?Surv in the survival package will show you how to define interval-censored observations for modelling time-to-event in that case.
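For example, a sketch with hypothetical interval bounds (type = "interval2" treats an NA bound as left- or right-censored):

library(survival)
# Hypothetical data frame `df` with `lower` and `upper` interval bounds;
# NA in `upper` means right-censored, NA in `lower` means left-censored.
fit <- survreg(Surv(lower, upper, type = "interval2") ~ 1,
               data = df, dist = "gaussian")
summary(fit)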
In a very real sense most of the observational studies on "free-range human" populations are doubly censored... i.e. we do not observe the individuals over all of their lifespans. Here is a citation to a PhD thesis that seems to lay out the statistical terminology well. Furthermore, several of the packages in R will function properly when set up for interval censoring or left-censoring, including survival, NADA, sand (from their DOE website), and several others, which you can search for at Baron's R site search with strategies covering both functions and r-help entries.
Edit: Adding comments to address the clarification that this is about truncation rather than censoring.
If you are looking to fit truncated distributions, then look at the gamlss package, or create a suitable density for a doubly-truncated distribution and use fitdistr in the MASS package.
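A sketch of the MASS route under those assumptions: write down the density of a normal truncated to [a, b] and pass it to fitdistr(), which forwards fixed extra arguments (here a and b) to the density.

library(MASS)
# Density of a normal distribution truncated to [a, b].
dtnorm <- function(x, mean, sd, a, b) {
  ifelse(x < a | x > b, 0,
         dnorm(x, mean, sd) / (pnorm(b, mean, sd) - pnorm(a, mean, sd)))
}
set.seed(1)
raw <- rnorm(2000, mean = 1, sd = 2)
x <- raw[raw >= -1 & raw <= 4]  # doubly-truncated sample: only values in [-1, 4]
fitdistr(x, dtnorm, start = list(mean = 0, sd = 1),
         a = -1, b = 4, lower = c(-Inf, 1e-6))  # bounds keep sd positive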

Histogram matching - image processing - C/C++

I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist2[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is multi-modal;
Hist2's distribution is uni-modal, with a single prominent peak.
My questions are:
Is there any way that I could determine the type of distribution programmatically?
How can I quantify whether these two histograms are similar or dissimilar?
Thanks, Raj
I posted a C function in your other question (automatically compare two series - Dissimilarity test) that will compute the divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data, but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a Gaussian and use something like the R-squared value of the fit to determine whether the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try computing their cross-correlation and using the peak positive value as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimate of their "similarity".
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.
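To make the comparison concrete, here is a short sketch in R (the language used elsewhere on this page) of a few standard dissimilarity measures applied to the two histograms above:

h1 <- c(1, 4, 3, 5, 2, 5, 4, 6, 3, 2)
h2 <- c(1, 4, 3, 15, 12, 15, 4, 6, 3, 2)
# Normalize the counts to probability distributions before comparing.
p <- h1 / sum(h1)
q <- h2 / sum(h2)
sum(pmin(p, q))              # histogram intersection: 1 = identical, 0 = disjoint
sum((p - q)^2 / (p + q)) / 2 # chi-squared distance: 0 = identical
cor(h1, h2)                  # Pearson correlation of the raw bin counts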
