Calculating the between-group and within-group variances in a dataset - R

Here is my dataset:
data <- data.frame(group = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,5,5),
                   weight = c(11,14,15,67,85,46,37,86,76,48,89,56,45,24,32,12,12,9,9,11))
I would like to calculate the intraclass correlation coefficient (ICC) and the within- and between-group variance. I think I have got the hang of the ICC, but I am really unsure how to go about calculating the within- and between-group variance. Any help would be really appreciated. Thank you!
# ICC (multilevel.icc() appears to come from the misty package)
multilevel.icc(data$weight, cluster = data$group)
[1] 0.3125195
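One common way to get at the within- and between-group variation is the one-way ANOVA decomposition. The sketch below uses only base R and the data frame from the question; the variable names (ss_between, ms_within and so on) are my own. Note that these mean squares are not the same thing as the variance components of a mixed model, which is what an ICC function typically works with, so treat this as one possible reading of "variance" rather than the definitive one.

```r
data <- data.frame(
  group  = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,5,5),
  weight = c(11,14,15,67,85,46,37,86,76,48,89,56,45,24,32,12,12,9,9,11)
)

grand_mean <- mean(data$weight)
group_mean <- ave(data$weight, data$group)  # each row's own group mean

k <- length(unique(data$group))  # number of groups
n <- nrow(data)                  # number of observations

# Sums of squares: group means around the grand mean, observations around their group mean
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((data$weight - group_mean)^2)

# Mean squares (sums of squares divided by their degrees of freedom)
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)

# Cross-check against the ANOVA table produced by aov()
tab <- summary(aov(weight ~ factor(group), data = data))[[1]]
tab[["Mean Sq"]]
```

The two sums of squares add up to the total sum of squares, which is a quick sanity check on the decomposition.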

Related

How to account for clustering when calculating a Spearman coefficient in R?

I have a group of people who had their drug concentrations measured by using blood and hair over time (i.e., everyone had three values measured by blood samples and another three values measured by hair samples). I wanted to calculate the Spearman coefficient between the two measurements, but I don't know how to account for the repeated measures within individuals. Is there a way to do that in R?
id<-rep(c(1:100),times=3) ##id variable
df1<-data.frame(id)
df1$var1 <- sample(500:1000, length(df1$id)) ##measurement1
df1$var2 <- sample(500:1000, length(df1$id)) ##measurement2
cor.test(x=df1$var1, y=df1$var2, method = 'spearman') ## this doesn't account for clustering within individuals
Thanks!
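Maybe the R package 'rmcorr' provides the functionality you are looking for. The package computes a repeated measures correlation. Note that it estimates a common within-individual linear association rather than a rank-based (Spearman) one: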
install.packages("rmcorr")
rmcorr::rmcorr(participant = id, measure1 = var1, measure2 = var2, dataset = df1)

Different output for group mean when comparing t-test and mean() function

I guess this is a trivial concern but I can't figure it out... When performing an independent t-test I always get slightly different results for the mean of group 1 when comparing it to the calculated mean with the simple mean(x, ...) function. Is there a simple explanation for this?
Here's my code:
dummy=ifelse(df$age>=median(df$age,na.rm = TRUE),1,0)
t.test(Meaningfulness~dummy, var.equal=TRUE, na.rm=TRUE)
mean(Meaningfulness[df$age<=median(df$age,na.rm = TRUE)], na.rm = TRUE)
The mean for group 1 as in the output of the t-test is: 4.948307
When calculated with the mean function it is: 4.979567
Interestingly the mean for the second group doesn't differ between the t-test function and the mean function...
Also which mean should I report then? I assumed the mean from the t-test output as significance levels trace back to that one. On the other hand, when calculating effect size (Cohen's d) I use the mean from the mean function. So which number do you recommend to report in the tables and text?
Thanks in advance! :)
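Judging only from the code shown, one likely explanation is that the two calculations do not use the same groups: the dummy puts age >= median in group 1, so group 0 is age < median, whereas the mean() call selects age <= median and therefore also includes everyone whose age equals the median. The sketch below reproduces this with simulated data (the data and the variable names are assumptions, not the asker's data set).

```r
set.seed(1)
df <- data.frame(age = rep(20:60, each = 5))   # median age is 40, with five ties
df$Meaningfulness <- rnorm(nrow(df), mean = 5)
med <- median(df$age, na.rm = TRUE)

df$dummy <- ifelse(df$age >= med, 1, 0)        # same split as in the question
tt <- t.test(Meaningfulness ~ dummy, data = df, var.equal = TRUE)

mean_ttest <- unname(tt$estimate[1])                 # mean of group 0 (age < median)
mean_lt    <- mean(df$Meaningfulness[df$age <  med]) # strictly below the median
mean_le    <- mean(df$Meaningfulness[df$age <= med]) # also includes age == median

c(t_test = mean_ttest, strictly_below = mean_lt, below_or_equal = mean_le)
```

If this is what is happening, the mean to report is the one based on the same grouping as the test, i.e. the one from the t-test output (or from mean() with the condition changed to <).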

Regressing out or Removing age as confounding factor from experimental result

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased ones. I want to check whether age (exact age values) is affecting the CT values. If so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources on confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't work out how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data=data.frame(data_val))
modcancer = model.matrix(~Age, data=data.frame(data_val))
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata, modcancer, modcombat) # full model (with Age) vs. null model, as defined above
qValuesComBat = p.adjust(pValuesComBat,method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking p-val of correlation of residuals and confounding factors. I don't get why to test correlation of residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
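For the regression route mentioned above, a minimal sketch is to fit a linear model per gene and keep the residuals, adding the gene's mean back so the values stay on the CT scale. Everything here is an assumption for illustration: the data are simulated, genes are in rows (as in data_val[1,]), and adjust_for_age is a hypothetical helper. It also shows why a correlation test of the residuals against Age is uninformative: least-squares residuals are uncorrelated with the predictor by construction.

```r
set.seed(42)
n_samples <- 30
Age <- round(runif(n_samples, 20, 70))

# Simulated CT matrix: genes in rows, samples in columns
data_val <- rbind(
  gene1 = 25 + 0.10 * Age + rnorm(n_samples),  # depends on age
  gene2 = 30 + rnorm(n_samples)                # does not depend on age
)

# Hypothetical helper: remove the linear age trend, keep the gene's overall mean
adjust_for_age <- function(y, age) {
  fit <- lm(y ~ age)
  residuals(fit) + mean(y)
}

# apply() returns samples x genes, so transpose back to genes x samples
data_adj <- t(apply(data_val, 1, adjust_for_age, age = Age))

# Correlation of each adjusted gene with Age (zero up to rounding error)
apply(data_adj, 1, cor, y = Age)
```

Because age and disease status are confounded in this design, regressing out age alone may also remove part of the disease signal; including disease status in the model is one way to guard against that.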
There are several ways to approach this:
Basic statistical approach
A very basic way to account for the effect of Age and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean group your data by age, compute the mean of each group, and then standardise the data within each group using that mean.
For standardising you can use one of three methods:
1) z-score normalisation: change each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: instead of the mean and standard deviation, use the minimum and maximum of the group, i.e. (x - min(x)) / (max(x) - min(x)).
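The group-wise standardisation described above can be sketched in base R with ave(). The data, the age bins and the variable names are assumptions for illustration; since age is continuous it has to be binned first, and the min-max line uses the usual (x - min) / (max - min) definition.

```r
set.seed(1)
age <- 21:60                          # simulated ages, one sample per age
ct  <- 25 + 0.1 * age + rnorm(40)     # simulated CT values for one gene

# Bin the continuous age into groups (the cut points are arbitrary)
age_group <- cut(age, breaks = c(20, 35, 50, 60))

# 1) z-score within each age group
z <- ave(ct, age_group, FUN = function(x) (x - mean(x)) / sd(x))

# 2) mean normalisation (centring) within each age group
centred <- ave(ct, age_group, FUN = function(x) x - mean(x))

# 3) min-max normalisation within each age group
minmax <- ave(ct, age_group, FUN = function(x) (x - min(x)) / (max(x) - min(x)))

# Group means of the z-scores are zero up to rounding error
tapply(z, age_group, mean)
```

Note that binning only removes differences between the age groups; any age trend inside a bin remains.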
On to more complex statistics:
You can gauge how much each feature/column contributes to the overall variation in your dataset using an algorithm such as PCA (principal component analysis) (https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used for dimensionality reduction, it can also show how the variance in the whole dataset is distributed and how much each feature contributes to the main components.
Below is a simple example explaining it:
I have plotted the variable contributions and a biplot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
data <- decathlon2[, 1:10] # taking only the first 10 variables/columns for simplicity
colnames(data)             # must come after `data` is defined
res.pca <- prcomp(data, scale. = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)
hep.PC.cor <- prcomp(data, scale. = TRUE)
biplot(hep.PC.cor)
output
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
Along similar lines, you can use PCA on your data to see how much the age parameter contributes to the variation in your data.
I hope this helps. If I find more such methods I will share them.

ANOVA (AOV function) in R: Misleading p_value reported on equal values

I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p-values for a number of subsets of a larger data set. I bumped into a subset where all my numeric values are equal to 36. Because it is part of a loop, ANOVA is still executed and reports a seemingly infinitely small p_value of 1.2855e-134. Correct me if I am wrong, but the smaller the p_value, the stronger the evidence that the difference between the factor levels is significant?
For simplicity this is the subset:
[Image: SUBSET_FOR_ANOVA]
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
#
anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]]
p_value <- p_value[1]
#
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
y = 36)
summary(aov(y~gr, data = df))
Gives:
             Df    Sum Sq   Mean Sq F value Pr(>F)
gr            1 1.260e-27 1.262e-27       1  0.319
Residuals   198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value of around 0.3. The F statistic comes out as 1 here, since the estimated between- and within-group mean squares are equal.
Are these results misleading? To some extent, yes. The estimated SS and MS should be exactly 0, but aov calculates them as extremely small numbers because of floating-point rounding. Some other statistical tests in R and in some packages check for zero variance and produce an error, but aov apparently does not.
However, more importantly, I would say your data violate the assumptions of ANOVA, and therefore no result can be trusted as a basis for conclusions. The expectation in R when it comes to statistical tests is usually that it is up to the user to apply the tests in the correct circumstances.
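Since the ANOVA runs inside a loop, one way to avoid such results is to check for a constant response before calling aov. This is only a sketch under my own assumptions: safe_aov_p is a hypothetical helper, and the column names are passed as strings.

```r
# Hypothetical helper: return NA instead of a p-value when the response is constant
safe_aov_p <- function(df, response, group) {
  y <- df[[response]]
  # With a constant response every sum of squares is zero, so F = 0/0 is undefined
  if (length(unique(y[!is.na(y)])) < 2) {
    return(NA_real_)
  }
  fit <- aov(reformulate(group, response = response), data = df)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

constant <- data.frame(gr = rep(letters[1:2], 100), y = 36)
set.seed(1)
varying  <- data.frame(gr = rep(letters[1:2], 100), y = rnorm(200))

p_constant <- safe_aov_p(constant, "y", "gr")  # NA: nothing to test
p_varying  <- safe_aov_p(varying, "y", "gr")   # an ordinary p-value
```

The check only catches a response that is constant over the whole subset; other degenerate cases, such as a factor with a single level, would need their own guard.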

R Trouble getting correlation coefficient

I'm having difficulties in my quest to get a correlation coefficient for my data set.
I started by using ggpairs and then the cor function.
It might sound like a lack of knowledge, but I didn't realize that I can't calculate the correlation for columns whose type is not numeric.
For example, I would like to know the correlation between AGE and CITY. What alternatives do I have in situations like this? Or what data transformations should I do?
Thank you.
As thelatemail put it, sometimes graphs speak more than a stat...
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = sample(cities,size = 200, replace = TRUE), age = rnorm(n = 200, mean = 40, sd = 20))
dat$city <- as.factor(dat$city)
plot(age ~ city, data = dat)
Then for proper analysis you have several options... anova, or regression with cities as an explanatory variable (factor)... Although your question might have better responses on Cross Validated!
Btw: pls just ignore negative ages, this has been done quickly.
I think you first need to answer the question of what it is you are trying to do. The correlation coefficient (Pearson's r) is a specific statistic that can be calculated on two numerical values (where a dichotomous variable can be considered numeric). It has some special characteristics, including that it is bounded by -1 and 1 and that it does not have a concept of dependent or independent variable. Also it does not represent the proportion of variance explained; you need to square it to get the usual measure of that. What it does do is give you an estimate of the size and direction of the association between two variables.
These characteristics make it inappropriate to use r when you have a variable such as city as one of the two variables. If you want to know the proportion of variance in age explained by city, you can run a regression of age on a set of dummy variables for city and look at the overall R squared for the model. However unlike r, you won't have a simple direction (just direction for each city) and it won't necessarily be the same as if you built a model predicting city based on age.
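The regression on city dummies described above can be sketched as follows. The data are simulated in the same way as in the earlier answer, so the R squared here will be close to zero; lm() creates the dummy variables automatically when city is a factor.

```r
set.seed(123)
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(
  city = factor(sample(cities, size = 200, replace = TRUE)),
  age  = rnorm(n = 200, mean = 40, sd = 20)
)

# Regress age on city; the factor is expanded into dummy variables internally
fit <- lm(age ~ city, data = dat)

# Proportion of variance in age explained by city
r_squared <- summary(fit)$r.squared

# Overall F test for any difference in mean age between cities
anova(fit)
```

For a single categorical predictor this R squared equals the between-group sum of squares divided by the total sum of squares (often called eta squared).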
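Regarding qualitative data, you can use Spearman's correlation, but only when the categories have a meaningful order (ordinal data). It is not meaningful for an unordered variable such as CITY.
You can find more information about this correlation here
It can be used in R with this command:
cor(x, use=, method= )
So, for a simple example with an ordered factor (the method name must be lowercase):
cor(AGE, as.numeric(CITY), method = "spearman") # only sensible if CITY is an ordered factor
I hope that helps you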
