Grouped sign tests on large data set in R

I have the sex ratios (M / (M + F)) of the offspring of ~35,000 mother birds from >400 species, organized in this manner:
n <- 99  # sample size for this toy example; must be a multiple of 3 and at most 101
dat <- data.frame(ID=sample(100:200, n),
                  Species=rep(LETTERS[1:3], n/3),
                  SR=sample(0:100, n, replace=TRUE))
ID is the anonymous ID of the mother bird, Species is the species name of the mother bird, and SR is the sex ratio of the mother bird's offspring. In this sample, SR is between 0 and 100 because I do not know how to create sample datasets of ratios.
I want to group the data by species and calculate medians, IQRs, and sign tests. Using my own messy code I can calculate species' medians and IQRs, but I am at a loss as to how to calculate sign tests on this data. I want to use these sign tests to see if the species' medians differ significantly from 50/50.
Does anyone know code which would allow me to
(1) calculate medians, IQRs, and sign tests on this data
(2) create a summary table with species names, medians, IQRs, sign test p-values and n's.
Thanks in advance - I appreciate any help as I am pretty new to R and really at a loss.
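One possible approach, sketched here in base R only: a sign test against a hypothesised median can be run as an exact binomial test on the number of values above versus below that median, with ties dropped. The helper name sign_test_p, the table name summary_tab and the toy data are assumptions made for illustration; the median of 50 matches the 0 to 100 scale of the sample data, so real ratios would use 0.5 instead.

```r
set.seed(1)
n <- 99
dat <- data.frame(ID = sample(100:200, n),
                  Species = rep(LETTERS[1:3], n / 3),
                  SR = sample(0:100, n, replace = TRUE))

# Sign test p-value: exact binomial test on counts above vs. below mu
# (values equal to mu are ties and are dropped)
sign_test_p <- function(x, mu = 50) {
  above <- sum(x > mu)
  below <- sum(x < mu)
  if (above + below == 0) return(NA_real_)
  binom.test(above, above + below, p = 0.5)$p.value
}

# One summary row per species, bound together into a table
summary_tab <- do.call(rbind, lapply(split(dat, dat$Species), function(d) {
  data.frame(Species = d$Species[1],
             n = nrow(d),
             median = median(d$SR),
             IQR = IQR(d$SR),
             sign_p = sign_test_p(d$SR))
}))
print(summary_tab)
```

With more than 400 species this runs one test per species, so the p-values would normally need a multiple-testing adjustment, for example with p.adjust().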

Related

How to account for clustering when calculating a Spearman coefficient in R?

I have a group of people who had their drug concentrations measured by using blood and hair over time (i.e., everyone had three values measured by blood samples and another three values measured by hair samples). I wanted to calculate the Spearman coefficient between the two measurements, but I don't know how to account for the repeated measures within individuals. Is there a way to do that in R?
id<-rep(c(1:100),times=3) ##id variable
df1<-data.frame(id)
df1$var1 <- sample(500:1000, length(df1$id)) ##measurement1
df1$var2 <- sample(500:1000, length(df1$id)) ##measurement2
cor.test(x=df1$var1, y=df1$var2, method = 'spearman') ## this doesn't account for clustering within individuals
Thanks!

The R package 'rmcorr' may provide the functionality you are looking for. It computes a repeated measures correlation, which is a common within-individual linear association rather than a rank-based Spearman coefficient:
install.packages("rmcorr")
rmcorr::rmcorr(participant = id, measure1 = var1, measure2 = var2, dataset = df1)

GLMM: Needing overall advice on selecting model terms for glmm modelling in R

I would like to build a model to understand how habitat type affects the abundance of bats found, but I am struggling to decide which terms to include. I want to fit a GLMM with lme4. I chose a Poisson GLMM because the response is a count (you can't have half a bat) and its distribution is right-skewed, with lots of single bats.
My dataset is very big and is comprised of abundance counts recorded by an individual on a bat survey (bat survey number is not included as it's public data). My data set includes abundance, year, month, day, environmental variables (temp, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the set shown below. P.S Occurrence is an anonymous recording made by an observer at a set location, at a location a number of bats will be recorded - it's not relevant as it's from a greater dataset.
occurrence  abundance  latitude  longitude  year  month  day  (environmental variables)
3456        45         53.56     3.45       2000  5      3    34.6

surrounding_hab  recorded_hab
A                B
Recorded habitat and surrounding habitat range in letters (A-I) corresponding to a habitat type. Also, the table is split as it wouldn't fit in the box.
These models shown below are the models I think are a good choice.
library(lme4)  # provides glmer()
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions in regards to the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables, is it wise to include the environmental variables as they are numeric? My current thinking is scaling the environmental variables allowed the model to converge so including them is okay?
Secondly, I am confused about the mix of spatial and temporal variables, primarily if I should include temporal variables as the predictor is a temporal variable. I'd like to include year as a random variable as bat populations from one year directly affect bat populations the next year, and also latitude and longitude, does this seem wise?
I am also unsure if latitude and longitude should be random? The confusion arises because latitude and longitude do have some effect on the land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I have tried this, it produces a massive output with a huge correlation matrix, so I'm thinking I should run two models (year ~ recorded_hab) and (year ~ surrounding_hab) and then discuss them separately, hence the two models.
Sorry this question is so broad! Any help or thinking is appreciated - including data restructuring or model term choice. I'm also new to stack overflow so please do advise on question lay out/rules etc if there are glaringly obvious mistakes.

Regressing out or Removing age as confounding factor from experimental result

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't get how to do it for age.
I have done the following:
library(sva)  # provides ComBat() and f.pvalue()
modcombat = model.matrix(~1, data=data.frame(data_val))    # null model (intercept only)
modcancer = model.matrix(~Age, data=data.frame(data_val))  # full model including Age
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata, modcancer, modcombat)  # full model first, then null model
qValuesComBat = p.adjust(pValuesComBat, method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking p-val of correlation of residuals and confounding factors. I don't get why to test correlation of residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
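For reference, a common way to regress out a continuous covariate such as age is to fit a linear model per gene and keep the residuals, adding the gene's mean back so the values stay on the CT scale. The following is only a minimal sketch on simulated data, assuming genes in rows and samples in columns; the toy matrix ct and the helper adjust_for_age are hypothetical names, not part of the original code. Note that ordinary least squares residuals are uncorrelated with the predictor by construction, so a cor.test of the residuals against Age will always come out near zero.

```r
set.seed(1)
n_samples <- 40
Age <- round(runif(n_samples, 20, 80))

# Toy CT matrix: 3 genes (rows) x 40 samples (columns); only gene1 depends on age
ct <- rbind(gene1 = 25 + 0.05 * Age + rnorm(n_samples, sd = 0.5),
            gene2 = 28 + rnorm(n_samples, sd = 0.5),
            gene3 = 22 + rnorm(n_samples, sd = 0.5))

# Remove the linear age effect from one gene and restore its overall mean
adjust_for_age <- function(y, age) {
  fit <- lm(y ~ age)
  residuals(fit) + mean(y)
}

# apply() returns samples x genes, so transpose back to genes x samples
ct_adj <- t(apply(ct, 1, adjust_for_age, age = Age))
round(cor(ct_adj["gene1", ], Age), 10)
```

Because age and disease status are confounded in this design, removing the age effect this way can also remove part of the disease signal, so this step deserves some caution.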

There are several ways to approach this:
Basic statistical approach
A very basic way to account for the effect of the Age parameter and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean group your data by age, compute the mean of each group, and then standardise the data within each group using that mean.
For standardising you can use one of three methods:
1) z-score normalisation: change each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: rescale each data point using the minimum and maximum of the group, i.e. (x - min(x)) / (max(x) - min(x)).
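The group-wise centring and z-scoring described above can be sketched in base R with ave(), which returns a per-group statistic aligned with the original rows. This is only an illustration on simulated values; the data frame toy, the age bins and the column names are hypothetical, and it assumes the exact ages have first been binned into groups.

```r
set.seed(2)
toy <- data.frame(age_group = rep(c("20-39", "40-59", "60+"), each = 10),
                  ct = c(rnorm(10, 24), rnorm(10, 26), rnorm(10, 28)))

# Mean normalisation: subtract the group mean from every observation
toy$ct_centered <- toy$ct - ave(toy$ct, toy$age_group)

# z-score normalisation: also divide by the group standard deviation
toy$ct_z <- toy$ct_centered / ave(toy$ct, toy$age_group, FUN = sd)

# Group means of the centred values are (numerically) zero
tapply(toy$ct_centered, toy$age_group, mean)
```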
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using an algorithm such as PCA (principal component analysis) (https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used for dimensionality reduction, it can also be used to examine the variance in the whole data set and the importance of individual features.
Below is a simple example explaining it:
I have plotted the importance using a variable plot and a biplot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
data <- decathlon2[, 1:10]  # taking only 10 variables/columns for simplicity
colnames(data)              # must come after 'data' has been created
res.pca <- prcomp(data, scale. = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
             col.var = "contrib",  # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE          # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale. = TRUE)
biplot(hep.PC.cor)
Output:
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
Along similar lines, you can run PCA on your own data to gauge the importance of the age parameter.
I hope this helps. If I find more such methods, I will share them.

Survey weights and bootstrap weights to get counts and CIs

I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescr is a binary variable indicating whether or not the subject is on prescription meds, BSW1 - BSW500 are the bootstrap weights, and SURV_WGT is the survey weight per subject. There are roughly 20000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't simply add up the number of people in each category to get the final count N for each question/category (i.e., I cannot just add all the males 17 to 24 who are using prescription meds). Instead, I have to take into account the survey weights and bootstrap weights when constructing my final count N and confidence intervals.
I was then told in STATA this is a one line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20000 individuals is a sample of a population of roughly 35 million.

You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
library(survey)
mydesign <- svrepdesign(data = mydata,
                        weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+",
                        type = "bootstrap",
                        combined.weights = TRUE)
You should read the documentation, but briefly: data will be your data frame, weights takes your single survey weight vector, usually as a formula, repweights is great in that it accepts a regex string that identifies all the replicate weight columns in your data by column name, type tells the design what your replicate weights are (how they were derived), combined.weights is logical for whether the replicate weights contain sampling weights - I assume this is true but it may not be.
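To give a rough idea of what the design object does with these weights, here is a base R sketch on simulated data. The weighted count is the sum of the survey weights of the people in the category; the same sum is recomputed with each set of replicate weights, and the spread of those replicate estimates around the main estimate gives the standard error. Everything here is a toy assumption (the data frame toy, the fake replicate weights, only 50 replicates), and the exact scaling factor in the variance depends on how the bootstrap weights were produced, so the survey package remains the right tool for real results.

```r
set.seed(42)
n_resp <- 200  # toy sample size (hypothetical)
n_rep  <- 50   # toy number of replicate weights (the question has 500)

toy <- data.frame(OnPrescr = rbinom(n_resp, 1, 0.3),
                  SURV_WGT = runif(n_resp, 500, 3000))

# Fake replicate weights: main weight times a random resampling count
rep_wts <- sapply(seq_len(n_rep), function(r) {
  toy$SURV_WGT * rmultinom(1, n_resp, rep(1 / n_resp, n_resp))[, 1]
})

# Weighted count: each respondent stands for SURV_WGT people in the population
est <- sum(toy$SURV_WGT * toy$OnPrescr)

# Same count under every replicate weight column, then spread around est
rep_est <- colSums(rep_wts * toy$OnPrescr)
se <- sqrt(mean((rep_est - est)^2))
ci <- est + c(-1, 1) * qnorm(0.975) * se

c(estimate = est, lower = ci[1], upper = ci[2])
```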
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr,         # variable to pass to function
                  by = ~SEX,          # grouping
                  design = mydesign,  # design object
                  vartype = "ci",     # report variation as confidence interval
                  FUN = svymean       # specify function from survey package, mean here
)
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that is coded for each age group and use that new variable in your svyby call.
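As a small illustration of that age-group coding, base R's cut() can turn a numeric age into a factor. The break points other than the 17 to 24 band from the question are made up for the example, and AGEGRP is a hypothetical column name.

```r
ages <- c(17, 22, 24, 25, 40, 64, 65, 80)

# right = FALSE makes the intervals [17,25), [25,45), [45,65), [65,Inf)
age_grp <- cut(ages,
               breaks = c(17, 25, 45, 65, Inf),
               right = FALSE,
               labels = c("17-24", "25-44", "45-64", "65+"))
table(age_grp)

# On the real data this would look something like (not run here):
# mydata$AGEGRP <- cut(mydata$AGE, breaks = c(17, 25, 45, 65, Inf), right = FALSE)
# ...then build the design from mydata and call
# svyby(~OnPrescr, by = ~AGEGRP + SEX, design = mydesign,
#       FUN = svytotal, vartype = "ci")
```

The new column needs to exist in the data before the design object is created, otherwise the design will not know about it.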

How do I code a Mixed effects model for abalone growth in Aquaculture nutrition with nested individuals

I am a biologist working in aquaculture nutrition research and until recently I haven't paid much attention to the power of statistics. The usual method of analysis had been to run ANOVA on final weights of animals given various treatments and boom, you have a result. I have tried to improve my results by designing an experiment that could track individuals' growth over time, but I am having a really hard time understanding which model to use for the data I have.
For a simplified explanation of my experiment: I have 900 abalone/snails which were sourced from a single cohort (spawned/born at the same time). I individually marked each abalone (id) and recorded its length and weight at Time 0. The animals were then randomly assigned to 1 of 6 treatment diets (n=30 abalone per treatment tank), each replicated n=5 times (n=150 abalone per treatment in total). Each replicate is laid out like a randomized block design in which each treatment is replicated only once within each block, and each is assigned to an independent tank with n=30 abalone/tank. Abalone were fed a known amount of feed for 90 days before being weighed and measured again (Time 1). They are now back in their tanks for another 90 days before the experiment concludes.
From my understanding:
fixed effects - Time, Treatment
nested random effects - replicate, id
My raw data entered is in Long format with each row being a unique animal and columns for Time (0 or 1), Replicate (1-5), Treatment (1-6), Sex (M or F) Animal ID (1-900), Length (mm), Weight (g), Condition Factor (Weight/Length^2.99*5655)
I have used columns from my raw data and converted them to factors and vectors before using the new variables to create a data frame.
id<-as.factor(data.long[,5])
time<-as.factor(data.long[,1])
replicate<-as.factor(data.long[,2])
treatment<-as.factor(data.long[,3])  # treatment codes 1-6 are categories, not a numeric dose
weight<-as.vector(data.long[,7])
length<-as.vector(data.long[,6])
cf<-as.vector(data.long[,10])
My data frame is currently in the following structure:
df1<-data.frame(time,replicate,treatment,id,weight,length,cf)
I am struggling to understand how to nest my individual abalone within replicates. I can convert the weight data to change from initial, but I think the package nlme already accounts for this change when coded correctly. I could also create another measure of Specific Growth Rate for each animal at Time 1, but this would not allow the Time factor to be used.
lme(weight ~ time*treatment, random=~1 | id, method="ML", data=df1)
I would like to structure a mixed effects model so that my code takes into account the individual animal variability to detect statistical differences in their weight at Time 1 between treatments.
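For what it's worth, nlme expresses nesting in the random argument with a slash, e.g. random = ~ 1 | tank/id for individuals nested within tanks. The sketch below is only an illustration on simulated data with a scaled-down layout (4 animals per tank instead of 30); the tank variable, the simulated effect sizes and the data frame toy are all assumptions, not the real experiment. Whether replicate should enter as an outer grouping level (replicate/tank/id) or as a fixed block term is a design decision this sketch does not settle.

```r
library(nlme)  # a recommended package bundled with standard R installations

set.seed(7)
# Toy layout: 5 replicates x 6 treatments = 30 tanks, 4 animals per tank
design <- expand.grid(animal = 1:4,
                      treatment = factor(1:6),
                      replicate = factor(1:5))
design$tank <- interaction(design$replicate, design$treatment, drop = TRUE)
design$id <- factor(seq_len(nrow(design)))

# Long format: one row per animal per time point
toy <- rbind(transform(design, time = 0), transform(design, time = 1))
toy$time <- factor(toy$time)

# Simulated weights with tank-level and animal-level random variation
tank_eff <- rnorm(nlevels(toy$tank), sd = 1)
id_eff <- rnorm(nlevels(toy$id), sd = 2)
toy$weight <- 20 + 5 * (toy$time == "1") +
  tank_eff[as.integer(toy$tank)] +
  id_eff[as.integer(toy$id)] +
  rnorm(nrow(toy), sd = 1)

# Random intercepts for tanks and for animals nested within tanks
fit <- lme(weight ~ time * treatment,
           random = ~ 1 | tank/id,
           data = toy, method = "ML")
summary(fit)$tTable
```

The time:treatment interaction terms are the ones that address whether growth between Time 0 and Time 1 differs among diets.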
