PRC analysis with paired observations in vegan - r

This message is a copy from a message that I wrote in R-Forge. I would like to compute Principal response curve analysis on my data. I have several pairs of plots where deer browse the vegetation on Anticosti island, Québec. There are repeated observations of each plot during the course of 4 years. At each site, there is a plot inside the enclosure (without deer, called "exclosure") and the other plot is outside the enclosure (with deer, called "control"). I would like to take into account the pairing of observations in and out of each enclosure in the PRC analysis. I would like to add an other condition term to the PRC (like in partial RDA) to consider the paired observations or extract value from a partial RDA computed with the PRC formula and plot it like it is done in a PRC.
More over, I would like to test with permutations tests the signification of the difference between the two treatments. My hypothesis is to find if vegetation composition is different in the exclosure than in the control throughout the years. So, I would like to know if there is a difference between the two treatments and if there is, after how many years.
Somebody knows how to do this?
So here the code of my prc (without taking paired observations into account):
levels (treat)
[1] "controle" "exclosure"
levels (years)
[1] "0" "3" "5" "8"
prc.out <- prc(data.prc.spe.hell, treat, years)
species <- colSums(data.prc.spe.hell)
plot(prc.out, select = species > 5)
ctrl <- how(plots = Plots(strata = site,type = "free"),
within = Within(type = "series"), nperm = 99)
anova(prc.out, permutations = ctrl, first=TRUE)
Here is the result.
Thank you very much for your help!

I may have an answer for the first part of your question:"I would like to add an other condition term to the PRC (like in partial RDA) to consider the paired observations".
I am currently working on a similar case and this is what I came up with: Since Principal Responses Curves (PRC) are a special case of RDA, and that the objective is to do a kind of "partial PRC", I read the R documentation of the function rda() and this is what I found: "If matrix Z is supplied, its effects are removed from the community matrix, and the residual matrix is submitted to the next stage."
So if I understand well, when you do a partial RDA with X, Y, Z (X=community matrix, Y=Constraining matrix, Z=Conditioning matrix), the first thing done by the function is to remove the effect of Z by using the residuals matrix of the RDA of X ~ Z.
If that is true, it is easy to do this step alone, and then to use the residual matrix in your PRC:
library(vegan)
rda.out = rda(X ~ Z) # equivalent of "rda.out = rda(X ~ Condition(Z))"
rda.res = residuals(rda.out)
prc.out = prc(rda.res, treatment, time)
If you coded a dummy variable for your pairing effect, I think it should be as.factor() and NOT as.numeric().
I am not a stats expert, but it looks right to me. Even though that look simple, I would appreciate if someone could validate my answer.
Cheers

Related

Regressing out or Removing age as confounding factor from experimental result

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effect). I can't get how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data=data.frame(data_val))
modcancer = model.matrix(~Age, data=data.frame(data_val))
combat_edata = ComBat(dat=t(data_val), batch=Age, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
pValuesComBat = f.pvalue(combat_edata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how to correctly modify those gene values so as to remove the age effect?
I tried linear regression as well (upon checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking p-val of correlation of residuals and confounding factors. I don't get why to test correlation of residuals with age.
And how to apply a correction to CT values using regression?
Please guide if what I have done is correct.
In case it's incorrect, kindly tell me how to obtain data_val with no age effect.
There are many methods to solve this:-
Basic statistical approach
A very basic method to incorporate the effect of Age parameter in the data and make the final dataset age agnostic is:
Do centring and scaling of your data based on Age. By this I mean group your data by age and then take out the mean of each group and then standardise your data based on these groups using this mean.
For standardising you can use two methods:
1) z-score normalisation : In this you can change each data point to as (x-mean(x))/standard-dev(x)); by using group-mean and group-standard deviation.
2) mean normalization: In this you simply subtract groupmean from every observation.
3) min-max normalisation: This is a modification to z-score normalisation, in this in place of standard deviation you can use min or max of the group, ie (x-mean(x))/min(x)) or (x-mean(x))/max(x)).
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using some algorithms like PCA(principle component analysis) (https://en.wikipedia.org/wiki/Principal_component_analysis), though it is generally used as a dimensionality reduction algorithm, still it can be used to get the variance in the whole data set and also get the importance of features.
Below is a simple example explaining it:
I have plotted the importance using the biplot and graph, using the decathlon dataset from factoextra package:
library("factoextra")
data(decathlon2)
colnames(data)
data<-decathlon2[,1:10] # taking only 10 variables/columns for easyness
res.pca <- prcomp(data, scale = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
output
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
On these similar lines you can use PCA on your data to get the importance of the age parameter in your data.
I hope this helps, if I find more such methods I will share.

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1] cond1 repA cond1 repB cond2 repA cond2 repB ...
[2] 4.158660e+06 4454400.703 ...
[3] 1.458585e+06 4454400.703 ...
[4] NA 887776.392 ...
...
[5024] 9571785.382 9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1,13,2)){
vec <- matrix(0, nrow = nrow(df), ncol = 2)
vec[,1] <- df[,i]
vec[,2] <- df[,i+1]
vec <- na.exclude(vec)
plot(log10(vec[,1]),log10(vec[,2]), xlab = 'rep A', ylab = 'rep B' ,col="#00000033")
abline(fit<-lm(log10(vec[,2])~log10(vec[,1])), col='red')
legend("topleft",bty="n",legend=paste("R2 is",rdata[1,((i+1)/2)] <- format(summary(fit)$adj.r.squared,digits=4)))
}
However, the lm seems to be shifted so that it does not fit the trend I see in the experimental data:
It consistently occurs for every condition. I unsuccesfully tried to find an explanation by looking up the scource code and browsing different forums and posts (this or here).
Would have like to simply comment/ask a few questions, but can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least square procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, however I'm not skilled enough to discuss that). Your question seem linked to this one.
What you can use is a total least square procedure: it takes into account the error in X and Y. In the example below, I've used a "normal" TLS assuming there is the same error in X and Y (thus error.ratio=1). If it is not, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check on the documentation to ensure that).
library(mcr)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
Please note that, in Deming/TLS regressions, your x- and y-errors are supposed to follow normal distribution, as explained here. If it's not the case, go for a Passing-Bablok regressions (and the R code is here).
Also note that the R2 isn't defined for Deming nor Passing Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not exactly provide the same information. Since you're studying a linear correlation between two factors, see Pearson's product moment correlation coefficient, and use e.g. the rcorrfunction.

Basic - T-Test -> Grouping Factor Must have Exactly 2 Levels

I am relatively new to R. For my assignment I have to start by conducting a T-Test by looking at the effect of a politician's (Conservative or Labour) wealth on their real gross wealth and real net wealth. I have to attempt to estimate the effect of serving in office wealth using a simple t-test.
The dataset is called takehome.dta
Labour and Tory are binary where 1 indicates that they serve for that party and 0 otherwise.
The variables for wealth are lnrealgross and lnrealnet.
I have imported and attached the dataset, but when I attempt to conduct a simple t-test. I get the following message "grouping factor must have exactly 2 levels." Not quite sure where I appear to be going wrong. Any assistance would be appreciated!
are you doing this:
t.test(y~x)
when you mean to do this
t.test(y,x)
In general use the ~ then you have data like
y <- 1:10
x <- rep(letters[1:2], each = 5)
and the , when you have data like
y <- 1:5
x <- 6:10
I assume you're doing something like:
y <- 1:10
x <- rep(1,10)
t.test(y~x) #instead of t.test(y,x)
because the error suggests you have no variation in the grouping factor x
The differences between ~ and , is the type of statistical test you are running. ~ gives you the mean differences. This is for dependent samples (e.g. before and after). , gives you the difference in means. This is for independent samples (e.g. treatment and control). These two tests are not interchangeable.
I was having a similar problem and did not realize given the size of my dataset that one of my y's had no values for one of my levels. I had taken a series of gene readings for two groups and one gene had readings only for group 2 and not group 1. I hadn't even noticed but for some reason this presented with the same error as what I would get if I had too many levels. The solution is to remove that y or in my case gene from my analysis and then the error is solved.

R Trouble getting correlation coefficient

I'm getting difficulties on my quest to get a correlation coefficient for my data set.
I started by using ggpairsand then cor function.
It might sound a lack of knowledge, but I didn’t realize that I can’t calculate the correlation for columns which type is not numeric.
For example, I would like to now the correlation between some AGE and CITY. What alternative do I have to situations like this? Or what data transformations I should do?
Thank you.
As thelatemail put it, sometimes graphs speak more than a stat...
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = sample(cities,size = 200, replace = TRUE), age = rnorm(n = 200, mean = 40, sd = 20))
dat$city <- as.factor(dat$city)
plot(age ~ city, data = dat)
Then for proper analysis you have several options... anova, or regression with cities as an explanatory variable (factor)... Although your question might have better responses on Cross Validated!
Btw: pls just ignore negative ages, this has been done quickly.
I think you first need to answer the question of what it is you are trying to do. The correlation coefficient (Pearson's r) is a specific statistic that can be calculated on two numerical values (where a dichotomous variable can be considered numeric). It has some special characteristics, including that it is bounded by -1 and 1 and that it does not have a concept of dependent or independent variable. Also it does not represent the proportion of variance explained; you need to square it to get the usual measure of that. What it does do is give you an estimate of the size and direction of the association between two variables.
These characteristics make it inappropriate to use r when you have a variable such as city as one of the two variables. If you want to know the proportion of variance in age explained by city, you can run a regression of age on a set of dummy variables for city and look at the overall R squared for the model. However unlike r, you won't have a simple direction (just direction for each city) and it won't necessarily be the same as if you built a model predicting city based on age.
Regarding the qualitative data such as City, you can use the Spearman's correlation.
You can find more information about this correlation here
It can be simply used in R with the help of this command :
cor(x, use=, method= )
So , if you want to use it in a simple example :
cor(AGE, CITY, method = "Spearman")
I hope that helps you

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
mating=sample(0:1, 200, replace=T),
pop=sample(letters[1:4], 200, replace=T),
behv = scale(rpois(200,10)),
condition = scale(rnorm(200,5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now to make predictions, you don't have to use a subset from the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up. So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so i'm only looking at the pop, and since I have to supply condition as well, i fix that at the mean of the original data. (I could have just put in 0 since the data is centered and scaled.) Now I specify a range of behv values i'm interested in. Here i just took the range of the original data and split it into 100 regions. This will give me enough points to plot. So again i use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.

Resources