Meaning of ICC in regression - R

I'm stuck on this question and cannot find a logical explanation.
I'm given the following regression output.
The question is this: a one-way analysis of variance model was mistakenly fitted to explain the variable "level of violence" using "grade" as a random factor, even though grade in this study is in fact a fixed (constant) factor. The partial results in the output are based on a balanced experiment.
Does it make sense in this case to calculate the ICC? Is it even possible to calculate it manually from the output data alone?
I know that the ICC describes the correlation between observations within the same group. So I thought it might capture the relationship within the classes as opposed to between the different classes. But how can the ICC be obtained by manual calculation from the data in the output?
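For reference, under a balanced one-way random-effects design the ICC can be computed from the ANOVA table's mean squares alone. A minimal R sketch, with made-up numbers standing in for the output (MSB, MSW, and the group size n would come from the actual table, which is not reproduced here):

# Made-up values standing in for the ANOVA output (not the real output)
MSB <- 12.5   # between-group mean square (grade)
MSW <- 2.5    # within-group (residual) mean square
n   <- 10     # observations per grade (balanced design)

# Variance components implied by the one-way random-effects model
sigma2_between <- (MSB - MSW) / n
sigma2_within  <- MSW

# ICC: share of total variance attributable to the grouping factor,
# equivalently (MSB - MSW) / (MSB + (n - 1) * MSW)
sigma2_between / (sigma2_between + sigma2_within)  # here 10/35, about 0.286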


Finding how a variable affects the output of a time-series random-forest regression model

I created a random-forest regression model for time-series data in R that has three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance, but I am not trying to find the variables with the biggest effect. Instead, I am trying to see how increasing (or decreasing) the value of an input variable X_1 would change the prediction output.
Does it even make sense to do this? Is it even possible with a random-forest model? Rereading my question a few times made me dubious, but any insight/recommendation would be greatly appreciated.
I would guess that what this question is actually about is called exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots between the variables to get a feeling for the relationships. Depending on the variables, [linear] regression could tell you how an increase in variable x1 would affect variable x2.
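A minimal R sketch of that workflow, assuming a data frame df whose columns are the three predictors x1, x2, x3 and the outcome y (all names are placeholders):

# Pairwise [linear] correlations between all variables
cor(df)
# Scatter-plot matrix to eyeball the pairwise relationships
pairs(df)
# A simple linear regression: the x1 coefficient estimates how a
# one-unit increase in x1 shifts the predicted y, other things equal
summary(lm(y ~ x1 + x2 + x3, data = df))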

Extracting a normally-distributed subset from a dataset in R

Working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset where at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and there are no other reasons against using a particular method (such as logistic regression) on your data, you might want to study the nature of the "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. Check whether the residuals are normally distributed: normality of the model residuals is the actual assumption in linear regression, not normality of each variable (you might, of course, have e.g. bimodally distributed data if there are differences between groups). Check for observations that could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the model without them (a minimal sketch of these diagnostics follows this list).
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods, was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
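A minimal sketch of the diagnostics from the first point, assuming a data frame df with a binary outcome y and the remaining columns as predictors (names are placeholders):

# Fit the desired model, e.g. a logistic regression
fit <- glm(y ~ ., data = df, family = binomial)

# The four standard diagnostic plots: residuals vs fitted
# (Tukey-Anscombe), Q-Q plot, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)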
Before modelling anything or removing any data, I would always plot the data by treatment/outcome groups and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
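For example, a minimal ggplot2 sketch, assuming a data frame df with numeric columns x1 and x2, a grouping column group, and an outcome y (all placeholders):

library(ggplot2)
# Encode four variables at once: position (x1, x2), colour (group), size (y)
ggplot(df, aes(x = x1, y = x2, colour = group, size = y)) +
  geom_point(alpha = 0.6)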
A good start to get an overview of all the data, their distributions, and pairwise correlations (if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
# Read the tab-delimited data (no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)

library(psych)
# Scatter-plot matrices: histograms on the diagonal, pairwise correlations above
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable and the presence of correlations between each pair of variables. You can tune the arguments of that function to use different correlation methods and different displays. Happy exploratory data analysis :)

R vegan: adjusted p values for permanova (adonis2)

I am running an analysis of variance on a large distance matrix using adonis2, as described here: https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/adonis
That method is frequently used in microbiome analysis to calculate beta diversity. That's also what I would like to do, i.e. to find out whether my community composition differs in response to a (continuous) environmental variable.
PERMANOVA returns one p-value, and there is no "official" post-hoc test yet. That's where my question comes in:
I've come across publications saying they adjusted their PERMANOVA results using the FDR/BH method. I cannot wrap my head around this. I'm confident I understand how the FDR correction is calculated; I just don't see how that would be done for PERMANOVA or, even more, how I would code it.
Can anyone help me out here?
It would be clearer if you provided an example of such a publication. You are right that PERMANOVA returns one p-value for each variable. However, if the model includes many variables, you have one p-value per variable and you need to correct for FDR.
For example, in this publication looking at variation in the gut microbiome, they wrote:
"To calculate the variation explained by each of our collected host factors, we performed an Adonis test implemented in QIIME. Each host factor was calculated according to its explanation rate, and P values were generated based on 1,000 permutations. All P values were then adjusted using the Benjamini–Hochberg method."
You can also see an example of this in Table S2 of that publication.
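In code, a minimal sketch of that approach with vegan, assuming a community matrix comm and a data frame env of host/environmental variables var1, var2, var3 (all placeholders):

library(vegan)
# One PERMANOVA with several terms; each term gets its own p-value
res <- adonis2(comm ~ var1 + var2 + var3, data = env,
               permutations = 1000, by = "margin")

# Collect the per-term p-values (dropping the NA rows for Residual/Total)
# and adjust them with Benjamini-Hochberg
p.raw <- res$`Pr(>F)`
p.adjust(p.raw[!is.na(p.raw)], method = "BH")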

How to account for two covariates in differential gene expression of single cell RNA seq data?

I have human data from donors of different ages and genders. After integration with Seurat, how would I best control for these confounding factors during differential gene expression analysis? I see the latent.vars option in the FindMarkers function. Can I give latent.vars = c("Age", "gender") to account for both together, or can I only use one at a time?
Is there an alternative package that does this test better?
Thanks in advance!
You can use that argument, but what it means is that you are shifting to a GLM-based model rather than the default Wilcoxon test. You can also see it in the help page (?FindMarkers):
latent.vars: Variables to test, used only when ‘test.use’ is one of
'LR', 'negbinom', 'poisson', or 'MAST'
You can see how the GLM is called in the source code, under GLMDETest. Basically, both covariates are included in the GLM to account for their effects on the dependent variable. What is also important is how you treat the covariate age in this case: as categorical or continuous? This choice could affect your results.
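A minimal sketch of such a call, assuming an integrated Seurat object seu whose metadata contains "Age" and "gender" columns, and a hypothetical cluster of interest "cluster1":

library(Seurat)
# latent.vars only takes effect with test.use = 'LR', 'negbinom',
# 'poisson', or 'MAST'; both covariates enter the model together
markers <- FindMarkers(
  seu,
  ident.1     = "cluster1",
  test.use    = "LR",
  latent.vars = c("Age", "gender")
)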

How do you do an ANOVA with means and standard deviation as values?

I want to check whether there is a difference between three treatment groups with the help of a one-way ANOVA.
The values I have for each treatment group are means with standard deviations. I do know the individual values from which the means are calculated, but they are repeated measures of the same sample, and I want to use the means of three independent samples to check for differences between the groups.
My dataset is pretty simple; however, I can't seem to find a way to let R (the statistical program I use) know that the value for each sample within a group represents a mean with a standard deviation. I know that an ANOVA takes the average of all the samples within a group and then compares the means between the groups to look for differences, but surely when your values for each sample are already means this will have an effect, right?
Intuitively I feel that this affects the outcome of my analysis, or is my intuition miles off...?
Thanks in advance!!
[screenshot of the data in Excel]
My suggestion is to add the raw data back in and do a two-way ANOVA, with Treatment as one predictor and Sample as the other. Then you can just use the anova function from R. This tutorial may help you with using the anova function for a two-way ANOVA.
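A minimal sketch, assuming a long-format data frame df with one row per repeated measurement and columns value, Treatment, and Sample (all names are placeholders):

# Treat both predictors as factors
df$Treatment <- factor(df$Treatment)
df$Sample    <- factor(df$Sample)

# Two-way ANOVA on the raw measurements: Treatment effect while
# accounting for sample-to-sample variation
fit <- lm(value ~ Treatment + Sample, data = df)
anova(fit)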
