How to account for clustering when calculating the Spearman coefficient in R?

I have a group of people who had their drug concentrations measured using blood and hair samples over time (i.e., everyone had three values measured from blood samples and another three values measured from hair samples). I wanted to calculate the Spearman coefficient between the two types of measurement, but I don't know how to account for the repeated measures within individuals. Is there a way to do that in R?
id <- rep(1:100, times = 3) ## id variable: each person appears three times
df1 <- data.frame(id)
df1$var1 <- sample(500:1000, length(df1$id)) ## measurement 1
df1$var2 <- sample(500:1000, length(df1$id)) ## measurement 2
cor.test(x = df1$var1, y = df1$var2, method = 'spearman') ## this doesn't account for clustering within individuals
Thanks!

Maybe the R package 'rmcorr' provides the functionality you are looking for. The package computes a repeated-measures correlation:
install.packages("rmcorr")
rmcorr::rmcorr(participant = id, measure1 = var1, measure2 = var2, dataset = df1)
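Note that rmcorr is essentially a repeated-measures analogue of the Pearson correlation (it is based on a shared-slope ANCOVA). If you specifically want a Spearman-flavoured, rank-based version, one rough workaround (my assumption, not something the package documents, and not an exact clustered Spearman test) is to rank-transform both measurements first:
df1$var1_rank <- rank(df1$var1) # hypothetical rank-transform workaround
df1$var2_rank <- rank(df1$var2)
rmcorr::rmcorr(participant = id, measure1 = var1_rank, measure2 = var2_rank, dataset = df1)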

Related

Calculating the between-group and within-group variances in a dataset

Here is my dataset:
data <- data.frame(group  = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,5,5),
                   weight = c(11,14,15,67,85,46,37,86,76,48,89,56,45,24,32,12,12,9,9,11))
I would like to calculate the intraclass correlation coefficient (ICC) and the within- and between-group variances. I think I got the hang of the ICC, but I am really unsure how to go about calculating the within- and between-group variances. Any help would be really appreciated. Thank you!
#ICC (multilevel.icc() is from the 'misty' package)
library(misty)
multilevel.icc(data$weight, cluster = data$group)
[1] 0.3125195
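For the within- and between-group variance components themselves, one common route (a sketch assuming the lme4 package is acceptable) is a random-intercept model and its variance components:
library(lme4)

m  <- lmer(weight ~ 1 + (1 | group), data = data)
vc <- as.data.frame(VarCorr(m))

between_var <- vc$vcov[vc$grp == "group"]    # between-group (intercept) variance
within_var  <- vc$vcov[vc$grp == "Residual"] # within-group (residual) variance
between_var / (between_var + within_var)     # ICC; should be close to multilevel.icc()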

Model predicted values around mean using training data

I tried to ask these questions through imputations, but I want to see if this can be done with predictive modelling instead. I am trying to use information from 2003-2004 NHANES to predict future NHANES cycles. For some context, in 2003-2004 NHANES measured blood contaminants in individual people's blood. In this cycle, they also measured things such as triglycerides, cholesterol etc. that influence the concentration of these blood contaminants.
The first step in my workflow is to impute the missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol etc. This is an easy step and very straightforward. This will be my training dataset.
For future NHANES years (for example 2005-2006), they took individual blood samples, combined (i.e., pooled) them, and then measured blood contaminants. I need to figure out what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol etc., and the pooled value is considered the mean. Could I use the pool mean and the 2003-2004 data to unpool, or predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (2003-2004) and the other parameters (triglycerides), which we can use in the regression to estimate the blood contaminants in those 8 individuals. This would be my test dataset, where I have the same contaminants as in the training dataset, with a column for the number of individuals in each pool and the mean value. Alternatively, I can create rows of empty values for the contaminants and add the mean values separately.
I can easily run MICE, but I need to make sure that the distribution of the imputed data matches 2003-2004 and that the average of the 8 imputed individuals in each pool is equal to the measured pool value. So the 8 values for each pool need to average to the measured pool value while the distribution has to match 2003-2004.
Does that make sense? Happy to provide more context if needed. There is outline code below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- as.data.frame(df %>% summarise(across(everything(), ~ sum(is.na(.x)))))
#helpful visual representation of the NDs (non-detects)
aggr_plot <- aggr(df[, 7:42], col = c('navyblue', 'red'),
                  numbers = TRUE,
                  sortVars = TRUE,
                  labels = names(df[, 7:42]),
                  cex.axis = 0.7,
                  gap = 3,
                  ylab = c("Histogram of Missing Data", "Pattern"))
#Mice time, m is the number of imputed datasets (you can think of this as # of cycles)
#You can check out the available imputation methods (mice.impute.*) in the console
methods(mice)
#Pick Method based on what you think is the best method. Read up.
#Now apply the right method
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#A helpful plot is the density plot. The density of the imputed data for each imputed dataset is shown
#in magenta, while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
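One possible way to enforce the pool-mean constraint after imputation (a post-hoc calibration I am assuming here, not a built-in mice feature) is to shift each pool's imputed values so that they average exactly to the measured pool value while keeping the imputed spread:
# Hypothetical post-processing: 'pooled_df' is assumed to have one row per individual,
# with a 'pool_id' column, a 'pool_mean' column (the measured pooled concentration),
# and 'imputed_value' holding the mice-imputed individual values.
library(dplyr)

calibrated <- pooled_df %>%
  group_by(pool_id) %>%
  mutate(imputed_calibrated = imputed_value - mean(imputed_value) + pool_mean) %>%
  ungroup()
# Within each pool, imputed_calibrated now averages exactly to pool_mean.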

Grouped sign tests on large data set in R

I have the sex ratios (M / (M + F)) of the offspring of ~35,000 mother birds from >400 species, organized in this manner:
n <- 99 ## example sample size (must be divisible by 3)
dat <- data.frame(ID = sample(100:200, n),
                  Species = rep(LETTERS[1:3], n/3),
                  SR = sample(0:100, n, replace = TRUE))
ID is the anonymous ID of the mother bird, Species is the species name of the mother bird, and SR is the sex ratio of the mother bird's offspring. In this sample, SR is between 0 and 100 because I do not know how to create sample datasets of ratios.
I want to group the data by species and calculate medians, IQRs, and sign tests. Using my own messy code I can calculate species' medians and IQRs, but I am at a loss as to how to calculate sign tests on these data. I want to use the sign tests to see whether the species' medians differ significantly from 50/50.
Does anyone know code which would allow me to
(1) calculate medians, IQRs, and sign tests on this data
(2) create a summary table with species names, medians, IQRs, sign test p-values and n's.
Thanks in advance - I appreciate any help as I am pretty new to R and really at a loss.
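A minimal sketch of one way to do (1) and (2), assuming a dplyr workflow is acceptable and testing each species' SR values against 50 with an exact sign test:
library(dplyr)

# Sign test against a hypothesised median of 50: drop ties with 50, then run an
# exact binomial test on how many values fall above vs. below 50.
sign_test_p <- function(x, mu = 50) {
  x <- x[x != mu]
  binom.test(sum(x > mu), length(x))$p.value
}

summary_tab <- dat %>%
  group_by(Species) %>%
  summarise(n      = n(),
            median = median(SR),
            IQR    = IQR(SR),
            p_sign = sign_test_p(SR))
summary_tab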

Regressing out or removing age as a confounding factor from experimental results

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased ones. I want to check whether age (the exact age values) is impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources on confounding variable adjustment, but they all deal with categorical confounding factors (like batch effects). I can't figure out how to do it for age.
I have done the following:
modcombat = model.matrix(~ 1, data = data.frame(data_val))
modcancer = model.matrix(~ Age, data = data.frame(data_val))
combat_edata = ComBat(dat = t(data_val), batch = Age, mod = modcombat,
                      par.prior = TRUE, prior.plots = FALSE)
pValuesComBat = f.pvalue(combat_edata, modcancer, modcombat)
qValuesComBat = p.adjust(pValuesComBat, method = "BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how do I correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking the p-value of the correlation between the residuals and the confounding factor. I don't understand why one would test the correlation of the residuals with age.
And how do I apply a correction to the CT values using regression?
Please let me know if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
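For the regression route, a minimal sketch of the residual-based adjustment (assuming, as in the lm() call above, that data_val is a genes x samples matrix and Age is a numeric vector with one entry per sample):
# For each gene, remove the linear age effect and add back the gene's mean,
# so the adjusted values stay on the original CT scale.
adjusted_val <- t(apply(data_val, 1, function(gene_ct) {
  fit <- lm(gene_ct ~ Age)
  residuals(fit) + mean(gene_ct)
}))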
There are many methods to solve this.
Basic statistical approach
A very basic method to account for the effect of the Age parameter and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean group your data by age, take the mean of each group, and then standardise your data within these groups using that mean (see the sketch after this list).
For standardising you can use one of these methods:
1) z-score normalisation: change each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max-style normalisation: a modification of z-score normalisation in which, in place of the standard deviation, you use the min or max of the group, i.e. (x - mean(x)) / min(x) or (x - mean(x)) / max(x).
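A rough sketch of the group-wise centring/scaling idea (column and object names here are placeholders, not from the question; in practice the exact ages would typically be binned so that each group has more than one sample):
library(dplyr)

# 'df' is assumed to have one row per sample, an 'AgeGroup' column (binned age)
# and the gene CT values in columns whose names start with "gene_".
df_adjusted <- df %>%
  group_by(AgeGroup) %>%
  mutate(across(starts_with("gene_"),
                ~ (.x - mean(.x)) / sd(.x))) %>% # z-score within each age group
  ungroup()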
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using algorithms such as PCA (principal component analysis, https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used as a dimensionality-reduction algorithm, it can also be used to examine the variance in the whole dataset and to get the importance of features.
Below is a simple example explaining it:
I have plotted the importance using a biplot and a variable plot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
colnames(data)
data<-decathlon2[,1:10] # taking only 10 variables/columns for easyness
res.pca <- prcomp(data, scale = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
Output of colnames(data):
 [1] "X100m"       "Long.jump"   "Shot.put"    "High.jump"   "X400m"       "X110m.hurdle"
 [7] "Discus"      "Pole.vault"  "Javeline"    "X1500m"
Along similar lines, you can use PCA on your data to get the importance of the age parameter.
I hope this helps; if I find more such methods, I will share them.

R correlation score calculation

My dataset contains information such as FeedbackDate, SubCategory (issue), and location (6 months of data). The temporal correlation was calculated by cross-tabulating the subcategory of issues against their feedback dates and then calculating the Pearson correlation score for every pair of cross-tabulated issues. See the code below.
#weekly correlation
require(ISOweek)
datacfs_date$FeedbackWeek <- ISOweek(datacfs_date$FeedbackDate)
raw_timecor_matrix <- table(datacfs_date$SubCategory, datacfs_date$FeedbackWeek)
raw_timecor_matrix <- t(raw_timecor_matrix)
timecor_matrix <- cor(raw_timecor_matrix)
#Invert correlation to get distance matrix
inverse_tcc <- 1-timecor_matrix
Now the question is: how do I calculate this on a biweekly and a monthly basis, instead of the weekly correlation, over the six months of data?
Just make your labels, e.g.
library(lubridate) # for year(), month(), week()
datacfs_date$FeedbackMonth <- paste0(year(datacfs_date$FeedbackDate), "-M",
                                     month(datacfs_date$FeedbackDate))
datacfs_date$FeedbackBiWeek <- paste0(year(datacfs_date$FeedbackDate), "-W",
                                      (ceiling(week(datacfs_date$FeedbackDate)/2)*2) - 1, ":",
                                      ceiling(week(datacfs_date$FeedbackDate)/2)*2)
and correlate on those
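For example, the monthly version would then mirror the weekly code from the question (the object names below are just hypothetical monthly counterparts):
raw_monthcor_matrix <- table(datacfs_date$SubCategory, datacfs_date$FeedbackMonth)
raw_monthcor_matrix <- t(raw_monthcor_matrix)
monthcor_matrix <- cor(raw_monthcor_matrix)
inverse_mcc <- 1 - monthcor_matrix # distance matrix, as before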
