Why exact matching with MatchIt R package finds matched pairs that have 2 different levels of categorical variable? - r

I'm actually working on tuna tag-recapture data. I want to balance my sampling between two groups of individuals, the ones that where tagged in the reference area (Treated group) and the ones that where tagged outside this area (Control group). To do this, I used the MatchIt package.
I have 3 covariates: length (by 5 cm bins), month of tagging (January to December) and structure on which the tuna was tagged.
So there is the model: treatment ~ length + month + structure
This last variable, is a categorical variable with 5 levels coded as A to E. The level A is almost only represented in the Treated group (6000 individuals with structure = A, vs on 300 individuals with structure = A in control group).
I first used the nearest neighbour method, but the improvement in balance was not satisfying. So I ran exact and Coarsened Exact Matching methods.
I though that Exact methods should match pairs with the same values for each covariates. But in the output matched data, there are still more than 3000 individuals with structure = A in the treated group.
Do you guys have one explanation ? I red a lot but I didn't find answers.
Thanks

Exact and coarsened exact matching do not perform 1:1 matching. They find all members in the control group that exactly match each member in the treated group. Subclasses are formed based on each combination of the predictor values, and any subclass that has both treated and control units is retained, and others dropped. There is no pairing that takes place. Your results indicate that you have many control units that have identical (or near-identical in the case of CEM) values of the covariates as some treated units.

Related

Pavlidis Template Matching (PTM, DataVisEasy) R function with 3 levels

I need to perform a correlation analysis on a data frame constructed as follows:
Rows: Features --> gene variants related with different levels of severity of the disease we are studying, in a format of a Boolean matrix
Columns: Observations --> List of patients;
The discriminant of my analysis is, thus, the severity marked as follows:
A: less severe than expected
B: equal to what expected
C: more severe than expected
Suppose I have a lot more features than observations and I want to use the PTM function with a three-level annotation (i.e. A,B,C) as a match template. The function requires you to set the annotation.level.set.high parameter, but it's not clear to me how it works. For example, if I set annotation.level.set.high='A', does that mean I'm making a comparison between A vs B&C? So I can only do a comparison between two groups/classes even if I have multiple levels? Because my goal is to compare all levels with each other (i.e. A vs B vs C), but it is not clear to me how to achieve this comparison, if it is possible.
Thanks

ANOVA and post-test for expression data in R

I have a dataset of expression values from 3 groups (proteome expression data as log2).
Now I would like to perform first an ANOVA to check for proteins differentially expressed in any of the 3 groups and next perform a post-test (eg Tukeys) to figure out which proteins are affected in which specific group comparison.
Unfortunately, no resources I found discussed this rather simple case but only much more complicated cases with multiple group combinations (and eg two treatments).
With the HybridMTest package, the ANOVA went fine and I have now the FDR for differentially expressed proteins between the 3 groups (6 samples per group).
But now I'm stuck on how to calculate the posttest for every protein (= rows, n = 3878) between all of the 3 groups.
I could not find an appropriate package or function, maybe one of you could offer a hint on how to solve this? I would like to obtain as a result a df with the protein_id, the comparison group, the FDR and the logFoldChange.
Many thanks!
(and sorry for providing the data only in a very descriptive way. I dont know how to create example expression data but will look into it)
Sebastian
Data description:
expression_df:
rownames(expression_df) = protein_id, colnames(expression_df) = sample_id
pheno_df:
rownames(pheno_df) = sample_id, pheno_df$cluster = cluster group (n=3)
anova_results:
rownames(protein_id), anova_results$FDR = FDR controlled ANOVA result

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen using sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last).
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.

adehabitat compana() doesn't work or returns lambda=NaN

I'm trying to do the compositional analysis of habitat use with the compana() function in the adehabitatHS package (I use adehabitat because I can't install adehabitatHS).
Compana() needs two matrices: one of habitat use and one of avaiable habitat.
When I try to run the function it doesn't work (it never stops), so I have to abort the RStudio session.
I read that one problem could be the 0-values in some habitat types for some animals in the 'avaiable' matrix, whereas other animals have positive values for the same habitat. As done by other people, I replaced 0-values with small values (0,001), ran compana and it worked BUT the lambda values returned me NaN.
The problem is similar to the one found here
adehabitatHS compana test returns lambda = NaN?
They said they resolved using as 'used' habitat matrix the counts (integers) and not the proportions.
I tried also this approach, but never changed (it freezes when there are 0-values in the available matrix, or returns NaN value for Lambda if I replace 0- values wit small values).
I checked all matrices and they are ok, so I'm getting crazy.
I have 6 animals and 21 habitat types.
Can you resolve this BIG problem?
PARTIALLY SOLVED: Asking to some researchers, they told me that the number of habitats shouldn't be higher than the number of animals.
In fact I merged some habitats in order to have six animals per six habitats and now the function works when I replace 0-values in the 'avaiable' matrix with small values (e.d. 0.001).
Unfortunately this is not what I wanted, because I needed to find values (rankings, Log-ratios, etc..) for each habitat type (originally they were 21).

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted=svrepdesign(data=els, repweights = ~els$F3F1PNLWT,
combined.weights = TRUE).
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequency. I would like to sort people based on four variables (none of which are binary, but all of which are numeric) This is what I would like to get:
.
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes – indexed 1-9)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination, This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down.
This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down
.
You can see in that picture that I only get one number for F1FCOMP per row – the number of cases who fit the specified combination – a specific subpopulation. I want to know more about that subpopulation. That is, F1COMP has nine different outcomes (indexed 1-9), and I want to see how many of each subpopulation fits into each of the 9 outcomes of F1COMP.

Resources