Differential expression analysis - baseMean threshold - R

I have an RNA-seq dataset and I am using DESeq2 to find differentially expressed genes between the two groups. However, I also want to remove genes with low counts by using a baseMean threshold. I used pre-filtering to remove any genes that have no counts or only one count across the samples; however, I also want to remove those that have low counts compared to the rest of the genes. Is there a common threshold used for the baseMean, or a way to work out what this threshold should be?
Thank you

Cross-posted: https://support.bioconductor.org/p/9143776/#9143785
Posting my answer from there, which got an "agree" from the DESeq2 author:
I would not use the baseMean for any filtering, as it is (at least to me) hard to interpret. You do not know why the baseMean is low: either there is no difference between groups and the gene is just lowly expressed (and/or short), or it is moderately expressed in one group but off in the other. The baseMean could be the same in both scenarios. If you filter, I would do it on the counts. For example, you could require that all (or a fraction of) samples in at least one group have 10 or more counts. That ensures you remove genes with many low counts or zeros across both groups rather than nested by group; the latter would be a good DE candidate and should not be removed. Or you can automate this, e.g. with the edgeR function filterByExpr.
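To make that concrete, a minimal sketch of both options, assuming a DESeqDataSet called dds whose grouping variable is stored in a column named condition (both names are placeholders):
# Manual filter: require at least 10 counts in at least as many samples
# as there are in the smallest group
smallest_group <- min(table(dds$condition))
keep <- rowSums(counts(dds) >= 10) >= smallest_group
dds <- dds[keep, ]
# Or automate the count cutoffs with edgeR's filterByExpr()
library(edgeR)
keep <- filterByExpr(counts(dds), group = dds$condition)
dds <- dds[keep, ]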

Related

1:1 exact matching of cases and controls in R on multiple variables, NOT propensity score matching

Is there a way to do 1:1 paired matching of cases and controls in R on multiple variables? I've tried the MatchIt package, but even specifying variables as "exact" only results in frequency matching (the final dataset will have exactly equal frequencies of those variables individually, but not in combination). I am hoping to match two datasets with exact pairings of sex and race, as well as age matched +/- 3 years. Ideally the matching algorithm would prioritize matches that maximize the total number of matches between the datasets, and otherwise would match randomly within those parameters. Any cases or controls that don't have an exact match would be excluded from the final matched dataset.
Thanks so much for any ideas you have.
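As a hedged starting point (not a definitive solution): recent versions of MatchIt (4.x) allow exact matching on some variables combined with a raw-unit caliper on others, without using a propensity score. A minimal sketch, assuming a data frame dat with columns case (0/1), sex, race and age (all placeholder names); note that greedy nearest neighbor matching does not guarantee the maximum possible number of matched pairs:
library(MatchIt)
# 1:1 nearest neighbor matching on age only (no propensity score),
# exact within sex and race, and an age caliper of +/- 3 years in raw units
m.out <- matchit(case ~ age,
                 data = dat,
                 method = "nearest",
                 distance = "mahalanobis",
                 exact = ~ sex + race,
                 caliper = c(age = 3),
                 std.caliper = FALSE,
                 ratio = 1)
# Cases and controls without an eligible match are dropped automatically
matched <- match.data(m.out)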

R: 1:n propensity score match using MatchIt

I have done 1:5 propensity score matching in R using the MatchIt package (ratio = 5), but how can I know which of the "5" matches the "1" best and which worst? And in the exported outcome I see a variable called "distance"; what does it mean? Can I use it to measure the fitness of the matching?
distance is the propensity score (or whatever value is used to create the distance between two units). See my answer here for an explanation. It will be empty if you use Mahalanobis distance matching.
To find who is matched to whom, look in the $match.matrix component of the output object. Each row represents one treated unit, whose rowname or index is given as the rowname of this matrix. For a given row, the values in that row represent the control units that the treated unit was matched to. If one entry is NA, that means no match was given. Often you'll see something like four non-NA values and one NA value; this means that that treated unit was only matched to four control units.
If you used nearest neighbor matching, the columns will be in order of closeness to the treated unit in terms of distance. So, those indices in the first column will be closer to the treated units than the indices in the second column, and so on. If another kind of matching was used, this will not be the case.
There are two aspects to the "fitness" of the matching: covariate balance and remaining (effective) sample size. To assess both, use the cobalt package, and run bal.tab() on your output object. You want small values for the mean differences and large values for the (effective) sample size. If you are concerned with how close individuals are within matched strata, you can manually compute the distances between individuals within matched strata. Just know that being close on the propensity score doesn't mean two units are actually similar to each other.
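A short sketch of both checks, assuming the matchit() output is stored in an object called m.out (a placeholder name):
library(cobalt)
# Who is matched to whom: rows are treated units, columns their matched controls,
# ordered by closeness when nearest neighbor matching was used
head(m.out$match.matrix)
# Covariate balance (mean differences) and (effective) sample sizes,
# before and after matching
bal.tab(m.out, un = TRUE)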

Complex dataframe selecting and sorting by quintile

I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in the initial analysis and numerical summary that there are significant issues with outlier observations within some of the specific grouping criteria. I used "group by" and looked at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the group-based quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for the root cause.
How do I manipulate the dataframe so that the individual observations are compared to the group-based quintile results and the split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse to create a new variable that flags which records are outliers and which are not. This way you keep all the data, but you can use this new variable to filter them out whenever you want.
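A minimal sketch of that idea with dplyr, assuming orig_df has a grouping column called group and a numeric column called value (placeholder names), and treating anything outside the within-group 20th-80th percentiles (the inner quintile boundaries) as an outlier:
library(dplyr)
flagged <- orig_df %>%
  group_by(group) %>%
  mutate(outlier = ifelse(value < quantile(value, 0.20) |
                            value > quantile(value, 0.80),
                          TRUE, FALSE)) %>%
  ungroup()
ideal_df   <- filter(flagged, !outlier)  # kept for decision trees / clustering
outlier_df <- filter(flagged, outlier)   # set aside for root-cause analysis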

nzv filter for continuous features in caret

I am a beginner to practical machine learning using R, specifically caret.
I am currently applying a random forest algorithm to a microbiome dataset. The values are transformed to relative abundances, so if my features are columns, the sum of all columns for row 1 == 1.
It is common to have cells with a lot of 0 values.
Typically I used the default nzv preprocessing feature in caret.
By default it flags features with:
a. One unique value across the entire dataset
b. Few unique values relative to the number of samples in the dataset (< 10%)
c. A large ratio of the frequency of the most common value to the frequency of the second most common value (cutoff used is > 19)
So is this function not actually calculating variance, but rather determining the frequency of occurrence of feature values and filtering based on that frequency? If so, is it only safe to use for discrete/categorical variables?
I have ~12k features in my dataset, many of which might be singletons or zero in a lot of the samples.
My question: is nzv suitable for such a continuous, zero-inflated dataset?
What pre-processing options would you recommend?
When I use the default nzv I am dropping a tonne of features (from ~12k to ~2,700) in the final table.
I do want a less noisy dataset, but at the same time I do not want to lose good features.
This is my first question and I am willing to revise, edit, and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!
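For what it's worth, a small sketch of how one might inspect what nzv is doing before dropping anything, assuming the feature table is a data frame called abund (a placeholder name). caret's nearZeroVar() exposes the two cutoffs described above as freqCut (default 95/5 = 19) and uniqueCut (default 10), so they can be relaxed if the defaults are too aggressive:
library(caret)
# Inspect the per-feature metrics instead of filtering straight away
metrics <- nearZeroVar(abund, saveMetrics = TRUE)
head(metrics)  # columns: freqRatio, percentUnique, zeroVar, nzv
# Example of relaxing the cutoffs (values chosen only for illustration)
nzv_cols <- nearZeroVar(abund, freqCut = 99/1, uniqueCut = 5)
filtered <- if (length(nzv_cols) > 0) abund[, -nzv_cols] else abund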

R cluster analysis Ward auto deleting outliers

How can I code in R to duplicate a cluster analysis done in SAS which involved method=Ward and the TRIM=10 option to automatically delete 10% of the cases as outliers? (This dataset has 45 variables, each with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.
Illustrating with the example from the link above, you could do something like:
duration <- faithful$eruptions
# keep only the observations between the 5th and 95th percentiles
duration[duration >= quantile(duration, 0.05) & duration <= quantile(duration, 0.95)]
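Extending that idea to a whole data frame is one possibility (a sketch, assuming df contains only the 45 numeric clustering variables; the name is a placeholder): drop any case that falls in the top or bottom 5% on any variable, keep those cases aside, then run Ward clustering on the rest. Note that with 45 variables this can trim considerably more than 10% of cases:
# TRUE for cases that sit between the 5th and 95th percentiles on every variable
keep <- Reduce(`&`, lapply(df, function(x)
  x >= quantile(x, 0.05) & x <= quantile(x, 0.95)))
trimmed  <- df[keep, ]
outliers <- df[!keep, ]
# Ward's method on standardized Euclidean distances
hc <- hclust(dist(scale(trimmed)), method = "ward.D2")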
