I have fitted mixture distributions to multi-modal biological measurement data in order to group individuals accordingly (picture a multi-modal histogram of length measurements; assuming each mode represents a different age cohort I can infer numbers at age from the easily measured length data).
The mixture distribution provides posterior probabilities for each individual's membership in each mode, so once binned by length class one line of data might look like:
l.class  freq  age1  age2  age3  age5
9        41    0.20  0.25  0.30  0.25
Where l.class is the length bin, freq is the number of individuals, and age1, age2, age3 and age5 are the probabilities of association with a given mixture mode / age group. As these are probabilities rather than realised proportions, I wanted to iterate over each entry a number of times in order to get an estimate of numbers at age for each length bin.
I have tried using sample() to achieve this in R, but I cannot work out how to classify each individual into one of several potential groups according to these probabilities.
x <- sample(names(data1)[3:ncol(data1)], size = data1$freq[i], replace = TRUE, prob = as.numeric(data1[i, 3:ncol(data1)]))
Here is the approach I ended up using: I ran the sampling in a loop so as to sample by the probabilities a number of times (e.g. 1000), then took the mean number of samples in each age class as my estimate.
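A minimal sketch of that resampling loop, assuming a data frame `data1` shaped like the example row above (the column names `age1`–`age5` are taken from the sample line and are otherwise assumptions):

```r
set.seed(1)

# Example data in the shape described above (assumed column names)
data1 <- data.frame(
  l.class = 9,
  freq    = 41,
  age1 = 0.2, age2 = 0.25, age3 = 0.3, age5 = 0.25
)

n.iter <- 1000
ages   <- names(data1)[3:ncol(data1)]

# For each length bin, draw `freq` individuals `n.iter` times according to
# the age-membership probabilities, then average the counts per age class
est <- t(sapply(seq_len(nrow(data1)), function(i) {
  counts <- replicate(n.iter, {
    x <- sample(ages, size = data1$freq[i], replace = TRUE,
                prob = as.numeric(data1[i, ages]))
    table(factor(x, levels = ages))
  })
  rowMeans(counts)
}))
est  # mean numbers at age per length bin
```

Equivalently, `rmultinom(n.iter, size = data1$freq[i], prob = ...)` draws the per-class counts directly, without building individual-level samples.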
I previously tried to ask about this in terms of imputation, but I want to see whether it can be done with predictive modelling instead. I am trying to use information from the 2003-2004 NHANES cycle to predict future NHANES cycles. For context: in 2003-2004, NHANES measured blood contaminants in individual people's blood. In that cycle they also measured covariates such as triglycerides and cholesterol that influence the concentration of these blood contaminants.
The first step in my workflow is to impute missing blood contaminant concentrations in 2003-2004 using the measured values of triglycerides, cholesterol, etc. This step is straightforward, and the result will be my training dataset.
For later NHANES cycles (for example 2005-2006), individual blood samples were combined (pooled, in other words) and the blood contaminants were then measured on the pools. I need to estimate what the individual concentrations were in these cycles. I have individual measurements for triglycerides, cholesterol, etc., and the pooled value can be treated as the mean. Could I use the pooled mean together with the 2003-2004 data to unpool, i.e. predict, the individual values? For example, if a pool contains 8 individuals, we know the mean, the distribution (from 2003-2004) and the other covariates (triglycerides, etc.), which we can use in a regression to estimate the blood contaminants for those 8 individuals. This would be my test dataset: the same contaminants as in the training dataset, plus a column for the number of individuals in each pool and the pooled mean value. Alternatively, I could create rows of empty contaminant values and add the mean values separately.
I can easily run MICE, but I need to make sure that (a) the distribution of the imputed data matches 2003-2004 and (b) the average of the 8 imputed individuals in each pool equals the measured pool value. So the 8 values for each pool need to average to the measured pool value, while the overall distribution has to match 2003-2004.
Does that make sense? Happy to provide more context if need be. An outline of the code is below.
library(mice)
library(tidyverse)
library(VIM)
#Papers detailing these functions can be found in MICE Cran package
df <- read.csv('2003_2004_template.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
#Checking out the NA's that we are working with
non_detect_summary <- as.data.frame(df %>% summarize_all(~ sum(is.na(.)))) #funs() is deprecated in dplyr
#helpful representation of ND
aggr_plot <- aggr(df[, 7:42], col=c('navyblue', 'red'),
numbers=TRUE,
sortVars=TRUE,
labels=names(df[, 7:42]),
cex.axis=.7,
gap=3,
ylab=c("Histogram of Missing Data", "Pattern"))
#MICE time; m is the number of imputed datasets (you can think of this as the number of cycles)
#Available imputation methods are documented under ?mice (the method argument);
#a dry run with maxit = 0 shows the default method chosen for each column
init <- mice(df, maxit = 0)
init$method
#Pick Method based on what you think is the best method. Read up.
#Now apply the right method
imputed_data <- mice(df, m = 30)
summary(imputed_data)
#if you want to see imputed values
imputed_data$imp
#finish the dataset
finished_imputed_data <- complete(imputed_data)
#Check for any missing values
sapply(finished_imputed_data, function(x) sum(is.na(x))) #All features should have a value of zero
#A helpful plot is the density plot: the density of each imputed dataset is shown
#in magenta, while the density of the observed data is shown in blue.
#Again, under our previous assumptions we expect the distributions to be similar.
densityplot(x = imputed_data, data = ~ LBX028LA+LBX153LA+LBX189LA)
#Print off finished dataset
write_csv(finished_imputed_data, "finished_imputed_data.csv")
#This is where I need to use the finished_imputed_data to impute the values in the future years.
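One hedged way to enforce the pool-mean constraint after imputation (my own suggestion, not part of the MICE workflow above): impute individual-level values as usual, then shift each pool's imputed values by a constant so that their mean equals the measured pooled value. The column names `pool_id`, `pool_mean` and `contaminant` here are assumptions for illustration.

```r
# Mean-matching adjustment: after imputing individual values, shift each
# pool's values so they average exactly to the measured pooled concentration.
# Assumed columns: pool_id (pool identifier), pool_mean (measured pooled
# value, repeated on each row of the pool), contaminant (imputed value).
adjust_to_pool_mean <- function(d, value_col, pool_col = "pool_id",
                                mean_col = "pool_mean") {
  group_mean <- ave(d[[value_col]], d[[pool_col]], FUN = mean)
  d[[value_col]] <- d[[value_col]] - group_mean + d[[mean_col]]
  d
}

# Toy example: two pools of 3 individuals with known pooled means
toy <- data.frame(
  pool_id     = c(1, 1, 1, 2, 2, 2),
  pool_mean   = c(10, 10, 10, 20, 20, 20),
  contaminant = c(8, 11, 14, 17, 19, 27)
)
adj <- adjust_to_pool_mean(toy, "contaminant")
tapply(adj$contaminant, adj$pool_id, mean)  # 10 and 20 by construction
```

Note that an additive shift preserves within-pool spread but can distort the marginal distribution relative to 2003-2004; a multiplicative rescaling (value * pool_mean / within-pool mean) is an alternative for strictly positive concentrations.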
I have a group of people whose drug concentrations were measured in both blood and hair over time (i.e., everyone has three values measured from blood samples and another three measured from hair samples). I want to calculate the Spearman coefficient between the two measurement types, but I don't know how to account for the repeated measures within individuals. Is there a way to do that in R?
id <- rep(1:100, times = 3) ## id variable
df1 <- data.frame(id)
df1$var1 <- sample(500:1000, length(df1$id)) ## measurement 1
df1$var2 <- sample(500:1000, length(df1$id)) ## measurement 2
cor.test(x = df1$var1, y = df1$var2, method = "spearman") ## this doesn't account for clustering within individuals
Thanks!
Maybe the R package 'rmcorr' provides the functionality you are looking for; it computes the repeated measures correlation:
install.packages("rmcorr")
rmcorr::rmcorr(participant = id, measure1 = var1, measure2 = var2, dataset = df1)
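One caveat: `rmcorr` fits an ANCOVA-style model on the raw values, so it is a Pearson-type estimate. If you specifically want something Spearman-flavoured, a hedged workaround is to rank-transform both measurements first and then run `rmcorr` on the ranks (the simulated data below just mirrors your example, with `var2` made to depend on `var1` so there is something to detect):

```r
## Sketch: rank-transform before rmcorr for a Spearman-like repeated
## measures correlation (install.packages("rmcorr") if needed)
set.seed(42)
id  <- rep(1:100, times = 3)
df1 <- data.frame(id)
df1$var1 <- sample(500:1000, length(df1$id))
df1$var2 <- df1$var1 + sample(-50:50, length(df1$id), replace = TRUE)

# Rank-transform across all observations, then correlate the ranks
df1$r1 <- rank(df1$var1)
df1$r2 <- rank(df1$var2)

fit <- rmcorr::rmcorr(participant = id, measure1 = r1,
                      measure2 = r2, dataset = df1)
fit$r  # repeated measures correlation of the ranks
```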
I ran some regressions/paired t-tests on my sample. Now, for my final hypothesis, I want to test whether a single group of 50 observations is significantly below 0 (not below the mean), i.e. x < 0, and if it is significant, by how much, for instance as an average of all the negative values.
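For reference, a one-sided, one-sample test of whether the group is below zero can be sketched like this (the data `x` here are simulated placeholders for the 50 observations; `wilcox.test` with the same arguments is an alternative if normality is doubtful):

```r
set.seed(7)
x <- rnorm(50, mean = -0.3, sd = 1)  # placeholder for the 50 observations

# One-sided one-sample t-test: H0 mean(x) >= 0 vs H1 mean(x) < 0
tt <- t.test(x, mu = 0, alternative = "less")
tt$p.value

# Effect size: mean of all observations, or of the negative values only
mean(x)
mean(x[x < 0])
```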
I understand I can use lmer, but I would like to undertake a repeated measures ANOVA in order to carry out both a within-group and a between-group analysis.
So I am trying to compare the difference in metabolite levels between three groups (control, disease 1 and disease 2) over time (measurements collected at two time points), and also to make a within-group comparison of time point 1 with time point 2.
Important to note: these are subjects sending in samples, not timed trial visits where samples would have been taken on the same day or thereabouts. For instance, time point 1 for one subject could be 1995 and time point 1 for another subject 1996; the gap between time point 1 and time point 2 is also not consistent, averaging around 5 years with a maximum of 15 and a minimum of 0.5 years.
I have 43, 45, and 42 subjects respectively in each group. My response variable would be, say, metabolite 1, and the predictor would be Group. I also have covariates I would like accounted for, such as age, BMI, and gender, and I need to account for family ID (which I have as a random effect in my lmer model). My Time column has 0 marking time point 1 and 1 marking time point 2. I understand I must separate the within- and between-subjects terms; however, I am unsure how to do this. From my understanding so far:
If I am using the anova_test, my formula that needs to be specified for between subjects would be;
Metabolite1 ~ Group*Time
Whilst for within subjects (seeing whether there is any difference within each group at TP1 vs TP2), I am unsure how I would specify this (the below is not correct).
Metabolite1 ~ Time + Error(ID/Time)
The question is, how do I combine this altogether to specify the between and within subject comparisons I would like and accounting for the covariates such as gender, age and BMI? I am assuming if I specify covariates it will become an ANCOVA not an ANOVA?
Here is some example code I found that has both a between- and a within-subject comparison design (termed a mixed ANOVA).
aov1 <- aov(Recall ~ (Task*Valence*Gender*Dosage) + Error(Subject/(Task*Valence)) + (Gender*Dosage), ex5)
Here the author specifies the within-subject comparison inside the Error term. This is also explained at https://rpkgs.datanovia.com/rstatix/reference/anova_test.html
However, here is mine, which I realise is currently very wrong (it is missing a correct within-subject comparison).
repmes <- anova_test(data = mets, Metabolite1 ~ Group*Time + Error(ID/Time),
                     covariate = c("Age", "BMI", "Gender", "FamilyID"))
Ultimately, I would like to determine from this, with appropriate post hoc tests (if p < 0.05), whether there are any significant differences in Metabolite 1 expression between groups across the two time points (i.e. over time), and whether there are any significant within-subject differences comparing TP1 with TP2. Can anybody help?
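A hedged sketch of the mixed (between + within) design with covariates, following the pattern of the aov() example you quoted: the between-subjects factor (Group) and the covariates sit outside the Error term, while the within-subjects factor (Time) appears inside it. The data here are simulated stand-ins for `mets`, with column names mirroring the question; adding covariates does indeed make this an ANCOVA-type model. rstatix::anova_test accepts the same Error()-style formula.

```r
set.seed(1)
# Simulated stand-in for the real data: 2 time points per subject
n <- 60
mets <- data.frame(
  ID    = factor(rep(1:n, each = 2)),
  Group = rep(sample(c("control", "disease1", "disease2"), n,
                     replace = TRUE), each = 2),
  Time  = factor(rep(c(0, 1), times = n)),
  Age   = rep(runif(n, 30, 70), each = 2),
  BMI   = rep(runif(n, 18, 35), each = 2)
)
mets$Metabolite1 <- rnorm(2 * n) + 0.5 * (mets$Time == 1)

# Mixed ANOVA with covariates: Group is between-subjects, Time is
# within-subjects (inside the Error term), Age and BMI are covariates
fit <- aov(Metabolite1 ~ Age + BMI + Group*Time + Error(ID/Time),
           data = mets)
summary(fit)
```

The summary prints one stratum for the between-subjects terms (Error: ID) and one for the within-subjects terms (Error: ID:Time), which is where the Time and Group:Time tests appear.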
This is not really a coding question but more of a statistical question.
I'm doing a proportions test on multiple proportions for many subjects.
For example, subject 1 will have multiple proportions (multiple "successes per total trials"), and so will subject 2, and for each subject we are testing whether all of their proportions are the same. The proportions range widely in sample size, from about 30 successes out of 60 trials to 300 successes out of 1000. Furthermore, the number of proportions varies between subjects: subject 1 could have 50 proportions, whereas subject 2 could have only 2. The idea is to test, for each subject, whether all of their proportions are the same, and to reject if they differ.
However, I'm realizing that subjects with many more proportions tend to get more significant p-values than subjects with only 2 proportions when using prop.test. I was wondering if there is a way to approach this problem differently: some correction I could apply, or a way to take the number of proportions into account.
Any suggestions would be helpful.
One way to approach comparing proportions for a single subject is null hypothesis testing with the Z-statistic, comparing one proportion with the pooled remainder. The Z-statistic inherently normalizes for different sample sizes. As an example, assuming one subject has 50 proportions, you would run 50 tests, and in the method below a 5% family-wise error is allowed per subject. You can set this up as follows:
Research Question:
For a single subject with 50 proportions, is the first proportion the same as the other proportions?
Hypothesis
Null hypothesis: p_1 = p_2 = ... = p_50
Alternative hypothesis: p_i != (1/49) * sum of p_j over j != i
Calculate the Statistic
Use the Z-test to compare each proportion to the pooled proportion of the other 49 (repeated for all 50 proportions)
N_i is the number of trials behind proportion i
Compute the appropriate test statistic and rejection criterion
A 5% family-wise error is allowed for each subject
Reject when the p-value is below 5% / 50 (the Bonferroni correction)
You would repeat this method for each proportion for this subject (i.e. perform null hypothesis testing 50 times for this subject).
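A sketch of one such test in base R, comparing proportion i against the pooled remainder with a Bonferroni-adjusted threshold (the counts below are made up for illustration):

```r
# Two-proportion z-test of proportion i against the pooled other proportions,
# with a Bonferroni-corrected threshold of 0.05 / k per subject.
# successes, trials: vectors of per-proportion counts for one subject.
test_one_vs_rest <- function(successes, trials, i, alpha = 0.05) {
  k  <- length(successes)
  x1 <- successes[i];       n1 <- trials[i]
  x2 <- sum(successes[-i]); n2 <- sum(trials[-i])
  p_pool <- (x1 + x2) / (n1 + n2)
  z <- (x1 / n1 - x2 / n2) /
       sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  p_value <- 2 * pnorm(-abs(z))  # two-sided
  list(z = z, p.value = p_value, reject = p_value < alpha / k)
}

# Toy subject with 3 proportions
succ  <- c(30, 300, 45)
trial <- c(60, 1000, 90)
res <- test_one_vs_rest(succ, trial, i = 2)
res
```

For the joint question "are all k proportions equal", `prop.test(successes, trials)` already gives the k-sample chi-squared test; the per-proportion z-tests above are follow-ups to locate which proportion differs.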