R cluster analysis Ward auto deleting outliers - r

How can I code in R to duplicate cluster analyses done in SAS which involved
method=Ward and the TRIM=10 option to automatically delete 10% of the cases as outliers? (This dataset has 45 variables, each variable with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!

You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and the bottom 5% of cases of every variable (i.e. on a variable by variable basis), you could do this with quantile function.
Illustrating using the example from the link above, you could do something like:
duration = faithful$eruptions
duration[duration <= quantile(duration,0.95) & duration > quantile(duration,0.05)]

Related

Differential expression analysis- basemean threshold

I have an rna seq dataset and I am using Deseq2 to find differentially expressed genes between the two groups. However, I also want to remove genes in low counts by using a base mean threshold. I used pre-filtering to remove any genes that have no counts or only one count across the samples, however, I also want to remove those that have low counts compared to the rest of the genes. Is there a common threshold used for the basemean or a way to work out what this threshold should be?
Thank you
Cross-posted: https://support.bioconductor.org/p/9143776/#9143785
Posting my answer from there which got an "agree" from the DESeq2 author over there:
I would not use the baseMean for any filtering as it is (at least to me) hard to deconvolute. You do not know why the baseMean is low, either because there is no difference between groups and the gene is just lowly-expressed (and/or short), or it is moderately expressed in one but off in the other. The baseMean could be the same in these two scenarios. If you filter I would do it on the counts. So you could say that all or a fraction of samples of at least one group must have 10 or more counts. That will ensure that you remove genes that have many low counts or zeros across the groups rather than nested by group, the latter would be a good DE candidate so it should not be removed. Or you do that automated, e.g. using the edgeR function filterByExpr.

Complex dataframe selecting and sorting by quintile

I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in initial analysis and numerical summary that there are significant issues with outlier observations within some of the specific grouping criteria. I used "group by" and looking at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the (group-by based quintile) to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for the root cause.
How do I manipulate the dataframe such that the individual observations are compared to the group-based quintile results and the parse is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis share above, you can use ifelse to create a new variable and identify which records are outliers and the ones that are not. This way you can keep the data there, but you can use this new variable to sort them out whenever you want

Removing data frames from a list that contains a certain value under a variable in R

Currently have a list of 27 correlation matrices with 7 variables, doing social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value (i.e. other than "NA", since there are 7 variables, I am keeping anything that DOES NOT contain 6"NA"s, and correlation with itself, 1 -> this is the tricky part because 1 is a value, but it's meaningless to me in a correlation matrix).
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. But I have been trying for hours but to no avail, as this is my first real coding experience.
Thanks a lot.
since you didn't provide sample data, I am first going to convert your matrix into a dataframe and then I am just going to pretend that you want us to see if your dataframe df has a variable var with at least one non-NA or 1. value
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe
table(df$var) will show you the distribution of values in your dataframe's variable. from here you can make your judgement call on whether to keep the variable or not.

nzv filter for continuous features in caret

I am a beginner to practical machine learning using R, specifically caret.
I am currently applying a random forest algorithm for a microbiome dataset. The values are relative abundance transformed so if my features are columns, sum of all columns for Row 1 == 1
It is common to have cells with a lot of 0 values.
Typically I used the default nzv preprocessing feature in caret.
Default:
a. One unique value across the entire dataset
b. few unique values relative to the number of samples in the dataset (<10 %)
c. large ratio of the frequency of the most common value to the frequency of the second most common value (cutoff used is > 19)
So is this function not actually calculating variance, but determining a frequency of occurence of features and filter based on the frequency? If so is it only safe to use it for discrete/categorical variables?
I have a number of features in my dataset ~12k, many of which might be singletons or have a zero value for a lot of features.
My question: Is nzv suitable for such a continuous, zero inflated dataset?
What pre-processing options would you recommend?
When I use default nzv I am dropping a tonne of features (from 12k to ~2,700 k) in the final table
I do want a less noisy dataset but at the same time do not want to loose good features
This is my first question and I am willing to re-revise, edit and resubmit if required.
Any solutions will be appreciated.
Thanks a tonne!

Imputation using related data (R)

I have a panel dataset with population data. I am working mostly with two vectors - population and households. The household vector(there are 3 countries) has a substantial amount of missing values, the population vector is full. I use a model with population as the independent variable to get the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to imput the missing data.
Thank you.
EDIT:
This is a printscreen of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of datatype = "original" data are missing and I need to input it somehow. I have created several panel data models (Pooled, within, between) and without further considerations tried to extract the missing data with each of them; however I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use but how to get the missing values(so making the dataset more balanced) of the model.

Resources