Approximate de-duplication - r

Suppose I have a dataset like this:
that I need to examine for possible duplicates. Here, the 2nd and 3rd rows are suspected duplicates. I'm aware of string distance methods as well as approximate matches for numeric variables. But have the two approaches been combined? Ultimately, I'm looking for an approach that I can implement in R.

I don't think there is a straightforward approach to this problem. You could treat each column separately: datetime by timestamp proximity, string by string proximity (e.g. Levenshtein distance) and freq by numeric distance. You can then rank every pair of rows on each column in increasing order of distance. Pairs that rank near the top on all three metrics (smallest differences) are the better candidates to be duplicates. You can then choose the threshold below which you consider a pair to be a duplicate.
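A minimal sketch of that idea in base R, under some assumptions: the example table from the question is not available, so the data frame below is made up, and summing the three ranks is just one simple way to combine them (weights or per-column thresholds would work too). String distance uses adist() from utils, which computes a generalized Levenshtein distance.

```r
# Hypothetical data: column names follow the answer, values are invented.
df <- data.frame(
  datetime = as.POSIXct(c("2015-01-01 10:00:00", "2015-01-02 12:30:00",
                          "2015-01-02 12:31:00", "2015-01-05 08:00:00"),
                        tz = "UTC"),
  string   = c("apple pie", "banana bread", "bananna bread", "cherry tart"),
  freq     = c(10, 25, 26, 40),
  stringsAsFactors = FALSE
)

# All unordered pairs of rows, one pair per row of `pairs`
pairs <- t(combn(nrow(df), 2))
i <- pairs[, 1]
j <- pairs[, 2]

# One distance per column for every pair
d_time <- abs(as.numeric(difftime(df$datetime[i], df$datetime[j], units = "secs")))
d_str  <- adist(df$string)[pairs]   # edit distance, looked up per pair
d_freq <- abs(df$freq[i] - df$freq[j])

# Rank each distance (1 = most similar) and add the ranks up
score <- rank(d_time) + rank(d_str) + rank(d_freq)

candidates <- data.frame(row_a = i, row_b = j, d_time, d_str, d_freq, score)
candidates <- candidates[order(candidates$score), ]
```

The pairs at the top of candidates are the most likely duplicates; where to cut off is the threshold decision mentioned above. Note that comparing all pairs grows quadratically with the number of rows, so a large dataset would probably need some blocking first (for example, only comparing rows from the same day).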

Related

1:1 exact matching of cases and controls in R on multiple variables, NOT propensity score matching

Is there a way to do 1:1 paired matching of cases and controls in R on multiple variables? I've tried the MatchIt package, but even specifying variables as "exact" only results in frequency matching (the final dataset will have exactly equal frequencies of those variables individually, but not in combination). I am hoping to match two datasets with exact pairings of sex and race, as well as with age matched +/- 3 years. Ideally the matching algorithm would prioritize matches that maximize the total number of matches between the datasets, and otherwise would match randomly within those parameters. Any cases or controls that don't have an exact match would be excluded from the final matched dataset.
Thanks so much for any ideas you have.

Differential expression analysis- basemean threshold

I have an RNA-seq dataset and I am using DESeq2 to find differentially expressed genes between two groups. However, I also want to remove genes with low counts by using a baseMean threshold. I used pre-filtering to remove any genes that have no counts or only one count across the samples, but I also want to remove those that have low counts compared to the rest of the genes. Is there a common threshold used for the baseMean, or a way to work out what this threshold should be?
Thank you
Cross-posted: https://support.bioconductor.org/p/9143776/#9143785
Posting my answer from there which got an "agree" from the DESeq2 author over there:
I would not use the baseMean for any filtering as it is (at least to me) hard to deconvolute. You do not know why the baseMean is low: either there is no difference between groups and the gene is just lowly expressed (and/or short), or it is moderately expressed in one group but off in the other. The baseMean could be the same in these two scenarios. If you filter, I would do it on the counts. So you could say that all or a fraction of the samples of at least one group must have 10 or more counts. That ensures that you remove genes that have many low counts or zeros across the groups, but not genes where the low counts are nested within one group. The latter would be good DE candidates, so they should not be removed. Or you can automate this, e.g. using the edgeR function filterByExpr.
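As a rough illustration of the count-based rule, here is a base R sketch on a made-up count matrix. The names counts, group, min_count and min_frac are hypothetical; the threshold of 10 comes from the answer, and requiring all samples of a group (min_frac = 1) is only one possible choice.

```r
# Hypothetical raw count matrix: 5 genes x 6 samples (values are invented)
counts <- matrix(
  c( 0,  1,  0,  2,  0,  1,   # low everywhere: should be dropped
    50, 60, 55, 48, 52, 61,   # expressed in both groups: keep
    30, 25, 40,  0,  0,  1,   # on in group A, off in group B: keep
     5, 12,  3, 11,  4,  2,   # sporadic low counts: should be dropped
     0,  0,  0, 15, 20, 18),  # off in group A, on in group B: keep
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("s", 1:6))
)
group <- factor(rep(c("A", "B"), each = 3))

min_count <- 10  # minimum count per sample
min_frac  <- 1   # fraction of a group's samples that must reach it (1 = all)

# Per gene and group: fraction of samples with at least min_count reads
frac_by_group <- sapply(levels(group), function(g) {
  rowMeans(counts[, group == g, drop = FALSE] >= min_count)
})

# Keep a gene if at least one group reaches the required fraction
keep <- apply(frac_by_group >= min_frac, 1, any)
filtered <- counts[keep, , drop = FALSE]
```

Genes 3 and 5 have a low overall mean but survive because one group is clearly expressed, which is exactly the case a baseMean cutoff could not distinguish from gene 4.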

R: 1:n propensity score match using MatchIt

I have done 1:5 propensity score matching in R using the MatchIt package (ratio=5), but how can I know which one of the "5" matches the "1" best and which the worst? And in the exported outcome, I see a variable called "distance". What does it mean? Can I use it to measure the fitness of the matching?
distance is the propensity score (or whatever value is used to create the distance between two units). See my answer here for an explanation. It will be empty if you use Mahalanobis distance matching.
To find who is matched to whom, look in the $match.matrix component of the output object. Each row represents one treated unit, whose rowname or index is given as the rowname of this matrix. For a given row, the values in that row represent the control units that the treated unit was matched to. If one entry is NA, that means no match was given. Often you'll see something like four non-NA values and one NA value; this means that that treated unit was only matched to four control units.
If you used nearest neighbor matching, the columns will be in order of closeness to the treated unit in terms of distance. So, those indices in the first column will be closer to the treated units than the indices in the second column, and so on. If another kind of matching was used, this will not be the case.
There are two aspects to the "fitness" of the matching: covariate balance and remaining (effective) sample size. To assess both, use the cobalt package, and run bal.tab() on your output object. You want small values for the mean differences and large values for the (effective) sample size. If you are concerned with how close individuals are within matched strata, you can manually compute the distances between individuals within matched strata. Just know that being close on the propensity score doesn't mean two units are actually similar to each other.
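To make the "manually compute the distances" step concrete, here is a small base R sketch. It does not call MatchIt; ps and mm are hand-made stand-ins for the distance and match.matrix components described above, with invented unit names and propensity scores, and a 1:3 ratio to keep it short.

```r
# Hypothetical stand-ins for the `distance` and `match.matrix` components
ps <- c(t1 = 0.62, t2 = 0.35,
        c1 = 0.60, c2 = 0.66, c3 = 0.50, c4 = 0.36, c5 = 0.30)
mm <- matrix(c("c1", "c2", "c3",
               "c4", "c5", NA),   # NA: t2 only received two matches
             nrow = 2, byrow = TRUE,
             dimnames = list(c("t1", "t2"), NULL))

# Absolute propensity score difference between each treated unit (row)
# and each of its matched controls (columns)
ps_diff <- abs(matrix(ps[mm], nrow = nrow(mm)) - ps[rownames(mm)])
dimnames(ps_diff) <- list(rownames(mm), paste0("match", seq_len(ncol(mm))))

# Column position of the closest and the farthest match per treated unit;
# which.min() and which.max() skip NA entries
best  <- apply(ps_diff, 1, which.min)
worst <- apply(ps_diff, 1, which.max)
```

With a real result the same indexing should work on the object's own components, assuming the units have row names. As noted above, a small difference here only says the units are close on the propensity score, not that they are similar on the covariates.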

Cluster analysis on two columns that contain name of person in R

I am a beginner in R. I have to do a cluster analysis on data that contains two columns with names of persons. I converted it to a data frame, but the columns are of character type. To use the dist() function the data frame must be numeric. An example of my data:
       Interviewed.Type   interviewed.Relation.Type
1.     An1                Xuan
2.     An2                The
3.     An3                Ngoc
4.     Bui                Thi
5.     ANT                feed
7.     Bach               Thi
8.     Gian1              Thi
9.     Lan5               Thi
.
.
.
1100.  Xung               Van
I will be grateful for your help.
You can convert a character vector to a factor using factor. A factor is basically a vector of integers together with an attribute giving the text associated with each integer; these texts are called levels in R. One can use as.numeric or unclass to get at the raw numbers. These can then be fed into algorithms which require numbers, such as dist.
Note that the order in which numbers are associated with texts is pretty much arbitrary (in fact alphabetical), so the difference between numbers has no meaning in most applications. Therefore calling dist on this result is technically possible, but not necessarily meaningful. For this reason, the author of this answer is not satisfied with it, even if the original poster seems to be happy about it. :-)
Also note that if there are several vectors, converting each separately will mean that the same number will represent different textual values and vice versa, unless both vectors are composed of exactly the same set of distinct values. Additional care has to be taken if you want the same levels for both factors. One way would be to concatenate both vectors, turn that into a factor, and then split the result into two factor vectors.
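The concatenate-then-split approach could look roughly like this in base R. The vector names first and second are hypothetical, and the values are a handful of names from the question's example.

```r
# A few values from the two columns in the question
first  <- c("An1", "An2", "Bui", "Bach")
second <- c("Xuan", "The", "Thi", "Thi")

# One factor over both vectors, so both halves share the same levels
both <- factor(c(first, second))

# Split it back; subsetting a factor keeps the full set of levels
f1 <- both[seq_along(first)]
f2 <- both[length(first) + seq_along(second)]

# Integer codes: a given name maps to the same number in either column
codes <- data.frame(first = as.integer(f1), second = as.integer(f2))
```

The codes are now consistent across columns, but the caveat above still applies: the numbers only reflect the sort order of the names, so distances computed from them are not necessarily meaningful.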

R - Assign observation into the classes (sturges rule)

I have a list of 70 observations (amounts) that I would like to assign to classes (intervals) and perform some basic calculations (relative frequency, cumulative frequency, etc).
First question: is there a function for Sturges' rule (i.e. one that returns the number and width of the classes)?
Second question: is there a function in R similar to Excel's FREQUENCY function (which, given the class boundaries, counts the observations per class)?
Thanks!
The Sturges rule is the default split used by the hist function and the function that does it is:
?nclass.Sturges
There are various grouping functions in R. I suspect one of cut, table or xtabs may be what you want. (I didn't understand what was meant by "based on classes borders counts the observations per class".) cut gives a vector of the same length, whereas the other two tally the counts, returning a contingency table.
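A short sketch of how these pieces might fit together for 70 observations. The amounts are randomly generated placeholders, and using equal-width classes over the observed range is an assumption; hist itself passes the Sturges count through pretty(), so its actual breaks can differ.

```r
set.seed(1)
amounts <- round(runif(70, min = 0, max = 1000), 2)  # hypothetical amounts

# Number of classes by Sturges' rule: ceiling(log2(n) + 1)
k <- nclass.Sturges(amounts)

# Equal-width class boundaries spanning the observed range
class_width <- diff(range(amounts)) / k
breaks <- seq(min(amounts), max(amounts), length.out = k + 1)

# Assign each observation to a class, then tally per class
classes <- cut(amounts, breaks = breaks, include.lowest = TRUE)
freq <- table(classes)

freq_table <- data.frame(
  class    = names(freq),
  n        = as.vector(freq),
  rel_freq = as.vector(freq) / length(amounts),
  cum_freq = cumsum(as.vector(freq))
)
```

Here cut does the class assignment and table does the counting, which together play a role similar to Excel's FREQUENCY; include.lowest = TRUE keeps the smallest observation from falling outside the first interval.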
