Using permanova in r to analyse the effect of 3 independent variables on reef systems - r

I am trying to understand how to run PERMANOVA using Adonis2 in R to analyse some data that I have collected. I have been looking online, but as it often happens, explanations are a bit convoluted, so I am asking for your help, if you can help me. I have got some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). Snapshot of my dataset structure I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong or that I am missing here? Also, I think I understood that I should check for homogeneity of variance, but I have not understood, again, if I should check for it on each variable independently, or if I should include them all in the same bit of code (which does not seem to work). Here are the bit of code that I am using to run the PERMANOVA (1), and the one that I am trying to use to assess homogeneity of variance - which does not work (2).
(1) adonis2(species ~ Age+Material+Depth,data=data.var,by="margin)
'Species' is the subset of the dataset including all the species'count, while 'data.var'is the subset including the 3 independent variables. Also what is the difference in using '+' or '' in the code? When I use '' it gives me - 'Error in qr.X(object$CCA$QR) :
need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check<-betadisper(species.distance,data.var, type=c("centroid"), bias.adjust= FALSE)
'species.distance' is a matrix calculated through 'vegdist' using Bray-Curtis method. I used 'data.var'to check variance on all the 3 independent variables, but it does not work, while it works if I check them independently (3). Why is that?
(3) variance.check<-betadisper(species.distance, data$Depth, type=c("centroid"), bias.adjust= FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).

Related

custom code for compact letter display from pairwise table output

I would like to create a custom code that creates a compact letter display from a pairwise test I have performed.
I have done this with pairwise t-tests with success (packages for this exist), and I am also familiar with the package library(multcomp) when I run linear models and the function cld() to get the compact letter displays, but they will not work for my specific case here.
I work with kaplan meier survival data often, and after I run the pairwise_survdiff() function to see if any statistical differences exist between groups (found in the packages library(survival) and library(survminer), I am easily able to extract a table to display all pairwise comparisons and their corresponding p-values. I have included an example for you here today. (see df below)
When their are many comparisons to do by hand, this becomes a mess to found out which groups are different / similar, and it's prone to human error when many levels exist, and up to now, I've always done it by hand. I would like to change this.
Could someone help me with a code that helps do this automatically?
Here is a mock dataframe df with 10 treatments (named treatment-1....treatment-10), and the rows are filled with p-values. Let's assume anything below p<0.05 as significant. However, it would be very cool to have a code that would allow a more conservative approach, and say set the desired cut off for statistical significance (say anything below p<0.01 as significant for example).
Thanks for your help, and again, here is a play datatframe
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = T)
While reflecting on this, I believe I found an answer on my own, with the library(mulcompView) and library(rcompanion)
Nonetheless, I think it's important, since I have seen / heard this question multiple times. Here is how I solved my problem
library(rcompanion)
library(multcompView)
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header=T)
PT1 = fullPTable(df)
multcompLetters(PT1,
compare="<",
threshold=0.05,
Letters=letters,
reversed = FALSE)
This gives me the desired output with the compact letter displays between groups. Additionally, one could edit the statistical threshold to be either more/less conservative by changing the threshold=
Very happy with the result. This has bothered me for a while. I hope it is useful to other members

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. I'm getting crazy because it is something that I can not solve only according to my background. I'm a biologist.
Briefly I run LASSO using the R library "penalized". In particular I used the opt1D function with around 500 simulations on a data.frame (numerical) of around 30 columns that are my biomarkers (gene expression). I want to test and 3000 rows that are people of which around 50 are tumours and all the others are normals.
Unfortunately by using L1 regularization, all and really all coefficients of 500 simulations are 0. If I check L2 matrix of coefficients they are close to 0. Now my point is that I cannot think that all my biomarkers are not able to distinguish between Normals and Tumors.
I don't know if what I have done is all I can to check for the discriminatory potential of my molecules. Is there something else I can do to understand why are they all 0 and also is there something else I can do to verify that really they are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note I would first run PCA/PCoA and see whether or not your genes separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also if you have relatively little experience with R I would suggest using a linear modeling package such as Limma since it has excellent documentation and many examples that are easy to follow.

Customized Fisher exact test in R

Beginner's question ahead!
(after spending much time, could not find straightforward solution..)
After trying all relevant posts I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (Whatever data, doesn't really matter to me - mine is Rubin's children TV workshop from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf) It has to simulate completely randomized experiment.
My assumption is that RCT in this context means that all units have the same probability to be assigned to treatment(1/N). (of course, correct me if I'm wrong. thanks).
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging in R's fisher.test I see that I can specify X,Y and many other params, but I'm unsure reg the following:
What's the meaning of Y? (i.e. a factor object; ignored if x is a matrix. is not informative as per the statistical meaning).
How to specify my null hypothesis? (i.e. if I don't want to use 0.) I see that there is a class "htest" with null.value but how can I use it in the function?
Reg number of simulations, my plan is to run everything through a loop - sounds expensive - any ideas how to better write it?
Thanks for helping - this is not an easy task I believe, hopefully will be useful for many people.
Cheers,
NB - Following explanations were found unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either 1. supply to the argument x a contingency table of counts (omitting anything for y), or you can supply observations on individuals as two vectors that indicate the row and column categories (it doesn't matter which is which); each vector containing factors for x and y. [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert]
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation-categories become exchangeable is independence, but you can choose to make it one or two tailed (via the alternative argument)
I'm not sure I clearly understand the simulation aspect but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations. I use it roughly daily, sometimes many times.

Clustered robust standard errors on country-year pairs

I want to replicate a Stata do.file (panel model) in R, but unfortunately I'm ending up with the wrong standard error estimates. The data is proprietary, so I can't post it here. The Stata code used looks like:
xtreg Y X, vce(cluster countrycodeid) fe nonest dfadj
With fe for fixed effects, nonest indicating that the panels are not nested within the clusters, and dfadj for the fact that some sort of DF-adjustment takes place - not possible to find out which sort as of now.
My R-Code looks like this and makes me end up with the right coefficient values:
model <- plm(Y~X+as.factor(year),data=panel,model="within",index=c("codeid","year"))
Now comes the difficult part, which I haven't found a solution for so far, even after trying out numerous sorts of standard error robust estimation methods, for example making extensive use of lmtest and various degrees of freedom transformation methods. The standard errors are supposed to follow a country-year pair pattern (captured by the variable countrycodeid in the Stata code, which takes the form codeid-year, as there appears to be missing data for some variables which are not available on a monthly basis.
Does anyone know if there are special tricks to keep in mind when working with unbalanced panels and the plm() package, which sort of DF-adjustment can be used, and if there is a possibility to group data in the coeftest() function on a country-year basis?
This is not a complete answer.
Stata uses a finite sample correction described in this post. I think that may get your standard errors a tad closer.
Moreover, you can learn more about the nonest/dfadj by issuing the help whatsnew9. Stata used to adjust the VCE for the within transformation when the cluster() option was specified. The cluster-robust VCE no longer adjusts unless the dfadj is specified. You may need to use the version control to replicate old estimates.

R code: Extracting highly correlated variables and Running multivariate regression model with selected variables

I have a huge data which has about 2,000 variables and about 10,000 observations.
Initially, I wanted to run a regression model for each one with 1999 independent variables and then do stepwise model selection.
Therefore, I would have 2,000 models.
However, unfortunately R presented errors because of lack of memory..
So, alternatively, I have tried to remove some independent variables which are low correlation value- maybe lower than .5-
With variables which are highly correlated with each dependent variable, I would like to run regression model..
I tried to do follow codes, even melt function doesn't work because of memory issue.. oh god..
test<-data.frame(X1=rnorm(50,mean=50,sd=10),
X2=rnorm(50,mean=5,sd=1.5),
X3=rnorm(50,mean=200,sd=25))
test$X1[10]<-5
test$X2[10]<-5
test$X3[10]<-530
corr<-cor(test)
diag(corr)<-NA
corr[upper.tri(corr)]<-NA
melt(corr)
#it doesn't work with my own data..because of lack of memory.
Please help me.. and thank you so much in advance..!
In such a situation if might be worth trying sparsity inducing techniques such as the Lasso. Here a sparse subset of variables is selected by constraining the sum of absolute values of the regression coefficients.
This will give you a reduced subset of variables which are the most relevant (and due to the nature of the Lasso algorithm also the most correlated, which was what you were looking for)
In R you can use the LARS package and information about the Lasso can be found here:
http://www-stat.stanford.edu/~tibs/lasso.html
Also a very good resource is: http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Resources