I have done 1:5 propensity score matching in R using the MatchIt package (ratio = 5), but how can I know which of the 5 matched controls is the best match for the "1" and which is the worst? Also, the exported output contains a variable called "distance"; what does it mean? Can I use it to measure the quality of the matching?
distance is the propensity score (or whatever value is used to create the distance between two units). See my answer here for an explanation. It will be empty if you use Mahalanobis distance matching.
To find who is matched to whom, look in the $match.matrix component of the output object. Each row represents one treated unit, whose rowname or index is given as the rowname of this matrix. For a given row, the values in that row represent the control units that the treated unit was matched to. If one entry is NA, that means no match was given. Often you'll see something like four non-NA values and one NA value; this means that that treated unit was only matched to four control units.
If you used nearest neighbor matching, the columns will be in order of closeness to the treated unit in terms of distance. So, those indices in the first column will be closer to the treated units than the indices in the second column, and so on. If another kind of matching was used, this will not be the case.
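As an illustration, here is a minimal sketch using the lalonde data shipped with MatchIt (ratio = 2 rather than 5 only because lalonde has too few controls for 1:5 matching without replacement; the idea is identical):

```r
# Minimal sketch: who is matched to whom after nearest neighbor matching.
library(MatchIt)
data("lalonde")

m.out <- matchit(treat ~ age + educ + race + re74, data = lalonde,
                 method = "nearest", ratio = 2)

# Rows = treated units; columns = their matched controls,
# ordered from closest to farthest on the distance measure.
head(m.out$match.matrix)

# Controls matched to the first treated unit, best match first:
m.out$match.matrix[1, ]
```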
There are two aspects to the "fitness" of the matching: covariate balance and remaining (effective) sample size. To assess both, use the cobalt package, and run bal.tab() on your output object. You want small values for the mean differences and large values for the (effective) sample size. If you are concerned with how close individuals are within matched strata, you can manually compute the distances between individuals within matched strata. Just know that being close on the propensity score doesn't mean two units are actually similar to each other.
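For the balance side, a sketch of the cobalt workflow (assuming `m.out` is your matchit() output object; adjust to your own names):

```r
# Sketch: assessing covariate balance and (effective) sample size with cobalt.
library(cobalt)

# Standardized mean differences plus matched/effective sample sizes:
bal.tab(m.out, un = TRUE)

# A visual summary; 0.1 is a commonly used balance threshold:
love.plot(m.out, stats = "mean.diffs", abs = TRUE, thresholds = c(m = .1))
```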
Related
Is there a way to do 1:1 paired matching of cases and controls in R on multiple variables? I've tried the MatchIt package, but even specifying variables as "exact" only results in frequency matching (the final dataset will have exactly equal frequencies of those variables individually, but not in combination). I am hoping to match two datasets with exact pairings of sex and race, as well as age matched +/- 3 years. Ideally the matching algorithm would prioritize matches that maximize the total number of matches between the datasets, and otherwise would match randomly within those parameters. Any cases or controls that don't have an exact match would be excluded from the final matched dataset.
Thanks so much for any ideas you have.
I have an RNA-seq dataset and I am using DESeq2 to find differentially expressed genes between two groups. However, I also want to remove genes with low counts by using a baseMean threshold. I used pre-filtering to remove any genes that have no counts or only one count across the samples; however, I also want to remove those that have low counts compared to the rest of the genes. Is there a common threshold used for the baseMean, or a way to work out what this threshold should be?
Thank you
Cross-posted: https://support.bioconductor.org/p/9143776/#9143785
Posting my answer from there, which got an "agree" from the DESeq2 author:
I would not use baseMean for any filtering, because it is (at least to me) hard to deconvolute. You do not know why the baseMean is low: either there is no difference between groups and the gene is simply lowly expressed (and/or short), or it is moderately expressed in one group but off in the other. The baseMean could be the same in both scenarios. If you filter, do it on the counts. For example, you could require that all (or a fraction of) samples in at least one group have 10 or more counts. That ensures you remove genes with many low counts or zeros spread across the groups rather than nested by group; the latter would be a good DE candidate and should not be removed. Alternatively, do it automatically, e.g. using the edgeR function filterByExpr.
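A sketch of both suggestions in code (the `dds` object and its `condition` column are assumptions; adapt the sample cutoff of 3 to the size of your smallest group):

```r
# Sketch: count-based pre-filtering instead of a baseMean cutoff.
# Assumes `dds` is an existing DESeqDataSet with a `condition` column
# in its colData, and that the smallest group has 3 samples.
library(DESeq2)

# Keep genes with at least 10 counts in at least 3 samples:
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

# Or let edgeR derive a data-driven threshold:
library(edgeR)
keep2 <- filterByExpr(counts(dds), group = dds$condition)
dds2 <- dds[keep2, ]
```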
I would like to do 1:N case-control matching in R.
For example, sex should be matched exactly.
On the other hand, age is matched with range +-5.
(e.g. if case's age=45, then I want to consider that the range of controls' age is 40 ~ 50.)
As far as I know, the MatchIt and Matching packages are for propensity score matching, not for case-control matching.
Moreover, e1071 package does not support the function of range matching.
Please let me know how to do this.
Many thanks in advance.
P.S.
The example data can be used for matching with age and sex as below.
library(survival)
data(pbc)
data <- na.omit(pbc)
case:1, control:0 in "status" variable
(As this data is originally for competing-risks analysis, you should not consider "2" in the "status" variable.)
This is called matching with a caliper. The caliper is on age, and its value in this case is 5. MatchIt allows calipers, but only on the distance measure (i.e., the propensity score). Two other packages for matching, Matching and optmatch, both allow highly customizable matching, including the requirements of exact matching (i.e., what you want for sex) and caliper matching (i.e., what you want for age). Matching allows for nearest neighbor matching and genetic matching, while optmatch allows for optimal matching.
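A sketch with the Matching package on the pbc example from the question. Note that Match()'s caliper argument is specified in standard-deviation units, so a 5-year window on age is written as 5 / sd(age); treat this as a starting point and check the matched indices yourself:

```r
# Sketch: exact matching on sex plus a +/- 5-year caliper on age,
# using the Matching package and the pbc data from the question.
library(Matching)
library(survival)

data(pbc)
d <- na.omit(pbc)
d <- d[d$status != 2, ]        # drop the competing-risk outcome
Tr <- d$status == 1            # case = 1, control = 0

X <- cbind(age = d$age, sex = as.numeric(d$sex))
m <- Match(Tr = Tr, X = X, M = 1,            # set M = N for 1:N matching
           exact = c(FALSE, TRUE),           # sex must match exactly
           caliper = c(5 / sd(d$age), 0))    # age within +/- 5 years; the sex
                                             # entry is moot given exact matching

# Matched case/control row indices:
head(data.frame(case = m$index.treated, control = m$index.control))
```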
Suppose I have a dataset like this:
that I need to examine for possible duplicates. Here, the 2nd and 3rd rows are suspected duplicates. I'm aware of string distance methods as well as approximate matches for numeric variables. But have the two approaches been combined? Ultimately, I'm looking for an approach that I can implement in R.
I don't think there is a straightforward approach to this problem. You could treat each column separately: datetime as timestamp proximity, string as string proximity (Levenshtein distance), and freq as numeric distance. You can then rank each pair of rows on each column in increasing order of distance. Pairs that rank well on all three metrics (i.e., the smallest differences) are the strongest candidates for being duplicates. You can then choose the threshold below which you consider a pair duplicated.
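A base-R sketch of this rank-and-combine idea; the toy column names `datetime`, `name`, and `freq` are assumptions standing in for the dataset in the question (base R's adist() computes Levenshtein distance):

```r
# Toy data: rows 2 and 3 are suspected duplicates.
dat <- data.frame(
  datetime = as.POSIXct(c("2021-03-01 10:00:00",
                          "2021-03-01 10:05:00",
                          "2021-03-01 10:06:00")),
  name = c("alpha corp", "beta industries", "beta industres"),
  freq = c(10, 42, 41)
)

# All row pairs, and a per-column distance for each pair:
pairs    <- t(combn(nrow(dat), 2))
time_d   <- abs(as.numeric(difftime(dat$datetime[pairs[, 1]],
                                    dat$datetime[pairs[, 2]], units = "secs")))
string_d <- diag(adist(dat$name[pairs[, 1]], dat$name[pairs[, 2]]))  # Levenshtein
freq_d   <- abs(dat$freq[pairs[, 1]] - dat$freq[pairs[, 2]])

# Rank each metric and sum: low totals = strongest duplicate candidates.
score <- rank(time_d) + rank(string_d) + rank(freq_d)
candidates <- cbind(as.data.frame(pairs), score)[order(score), ]
candidates   # the (2, 3) pair ends up with the lowest score
```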
I have log2ratio values at each chromosome position (137221 coordinates) for 15 samples. I want to calculate the Z-score of the log2ratio for each chromosome position (row). I also want to exclude the first three columns because they contain IDs. There are also some NAs among the values.
Thanking you in advance
It's not completely clear what you want. If you want a single Z-score for each row (i.e., its mean divided by its standard error), computed over all but the first three columns, then
f <- function(x) {
  mean(x, na.rm = TRUE) / (sd(x, na.rm = TRUE) / sqrt(length(na.omit(x))))
}
apply(as.matrix(df[, -(1:3)]), 1, f)
will do it. That gives you a vector with one value per row.
If you want the full matrix of row-wise Z-scores (each row standardized to mean 0 and SD 1) then I think
t(scale(t(as.matrix(df[, -(1:3)]))))
should work. If neither of those works, you need to post a reproducible example, or at least tell us precisely what the error messages are.
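A small self-contained check of the row-wise Z-score idea (note the standard error is sd/sqrt(n), and df[, -(1:3)] drops columns, not rows), with three leading ID columns and an NA thrown in:

```r
# Toy example: 5 rows, 3 ID columns, 4 sample columns, one NA.
set.seed(1)
df <- data.frame(id1 = 1:5, id2 = letters[1:5], id3 = LETTERS[1:5],
                 s1 = rnorm(5), s2 = rnorm(5),
                 s3 = rnorm(5), s4 = c(rnorm(4), NA))

f <- function(x) {
  mean(x, na.rm = TRUE) / (sd(x, na.rm = TRUE) / sqrt(length(na.omit(x))))
}

z_per_row <- apply(as.matrix(df[, -(1:3)]), 1, f)   # one Z-score per row
z_matrix  <- t(scale(t(as.matrix(df[, -(1:3)]))))   # each row standardized

length(z_per_row)                  # one value per row
rowMeans(z_matrix, na.rm = TRUE)   # all approximately 0
```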