custom code for compact letter display from pairwise table output - r

I would like to create a custom code that creates a compact letter display from a pairwise test I have performed.
I have done this with pairwise t-tests with success (packages for this exist), and I am also familiar with the package library(multcomp) when I run linear models and the function cld() to get the compact letter displays, but they will not work for my specific case here.
I work with kaplan meier survival data often, and after I run the pairwise_survdiff() function to see if any statistical differences exist between groups (found in the packages library(survival) and library(survminer), I am easily able to extract a table to display all pairwise comparisons and their corresponding p-values. I have included an example for you here today. (see df below)
When their are many comparisons to do by hand, this becomes a mess to found out which groups are different / similar, and it's prone to human error when many levels exist, and up to now, I've always done it by hand. I would like to change this.
Could someone help me with a code that helps do this automatically?
Here is a mock dataframe df with 10 treatments (named treatment-1....treatment-10), and the rows are filled with p-values. Let's assume anything below p<0.05 as significant. However, it would be very cool to have a code that would allow a more conservative approach, and say set the desired cut off for statistical significance (say anything below p<0.01 as significant for example).
Thanks for your help, and again, here is a play datatframe
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = T)

While reflecting on this, I believe I found an answer on my own, with the library(mulcompView) and library(rcompanion)
Nonetheless, I think it's important, since I have seen / heard this question multiple times. Here is how I solved my problem
library(rcompanion)
library(multcompView)
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header=T)
PT1 = fullPTable(df)
multcompLetters(PT1,
compare="<",
threshold=0.05,
Letters=letters,
reversed = FALSE)
This gives me the desired output with the compact letter displays between groups. Additionally, one could edit the statistical threshold to be either more/less conservative by changing the threshold=
Very happy with the result. This has bothered me for a while. I hope it is useful to other members

Related

Using permanova in r to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run PERMANOVA using Adonis2 in R to analyse some data that I have collected. I have been looking online, but as it often happens, explanations are a bit convoluted, so I am asking for your help, if you can help me. I have got some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). Snapshot of my dataset structure I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong or that I am missing here? Also, I think I understood that I should check for homogeneity of variance, but I have not understood, again, if I should check for it on each variable independently, or if I should include them all in the same bit of code (which does not seem to work). Here are the bit of code that I am using to run the PERMANOVA (1), and the one that I am trying to use to assess homogeneity of variance - which does not work (2).
(1) adonis2(species ~ Age+Material+Depth,data=data.var,by="margin)
'Species' is the subset of the dataset including all the species'count, while 'data.var'is the subset including the 3 independent variables. Also what is the difference in using '+' or '' in the code? When I use '' it gives me - 'Error in qr.X(object$CCA$QR) :
need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check<-betadisper(species.distance,data.var, type=c("centroid"), bias.adjust= FALSE)
'species.distance' is a matrix calculated through 'vegdist' using Bray-Curtis method. I used 'data.var'to check variance on all the 3 independent variables, but it does not work, while it works if I check them independently (3). Why is that?
(3) variance.check<-betadisper(species.distance, data$Depth, type=c("centroid"), bias.adjust= FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the 'indicspecies' package - multipatt function and am unable to extract summary values of the package. Unfortunately I can't print all the summary and am left with impartial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300.000 different species, 3 groups, 6 comparable combinations).
This is what happens with summary being saved (pre-code incl.):
x <- multipatt(data, ...)
sumx <-summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems not resolved (and is not exactly about the same function, rather the base function indval).
I was wondering if anyone has experience with the indicspecies package and knows a way to either extract the info from the summary.
It is possible to extract significance and other information from the other saved data from the model, but it might be nice to just get a quick complete overview from the data.
ps. I tried
options(max.print=1000000)
but this didn't solve it for me.
I use to capture the summary output for a multipatt object, but don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question you can capture the summary output using capture.output
ex.
dat.multipatt.summary<-capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
I don't have any experience with this package and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also instead of accessing all the contents of summary(x) together, you can use # to access slots of it (similar to $ in dataframe).
If you need further assistance, it'd be better t provide atleast a small subset or some other sample data so that the community can work with it.

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is a package for the same in R - pps and the documentation is here.
Also, there is another package called survey with a bit of documentation here.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state, strictly you don't even need to normalize them by 1/sum(sizes) although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1, ds2. Show us what code you've tried if it's causing problems. Recommend you use either dplyr or data.table.
Your second question should be asked as a separate question, and is offtopic on SO, or at least won't get a great response - best to ask statistical questions at sister site CrossValidated

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. I'm getting crazy because it is something that I can not solve only according to my background. I'm a biologist.
Briefly I run LASSO using the R library "penalized". In particular I used the opt1D function with around 500 simulations on a data.frame (numerical) of around 30 columns that are my biomarkers (gene expression). I want to test and 3000 rows that are people of which around 50 are tumours and all the others are normals.
Unfortunately by using L1 regularization, all and really all coefficients of 500 simulations are 0. If I check L2 matrix of coefficients they are close to 0. Now my point is that I cannot think that all my biomarkers are not able to distinguish between Normals and Tumors.
I don't know if what I have done is all I can to check for the discriminatory potential of my molecules. Is there something else I can do to understand why are they all 0 and also is there something else I can do to verify that really they are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
As a side note I would first run PCA/PCoA and see whether or not your genes separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also if you have relatively little experience with R I would suggest using a linear modeling package such as Limma since it has excellent documentation and many examples that are easy to follow.

Applying univariate coxph function to multiple covariates (columns) at once

First, I gathered from this link Applying a function to multiple columns that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R so I apologize in advance if this is a really "newb" question. My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (followup time/time to recurrence) as well as recurrence risk factors (t-stage, tumor size,age at dx, etc.). Some risk factors are categorical and some are continuous. I have been running my univariate analysis by hand, one at a time like this example univariateageatdx<-coxph(survobj~agedx), and then collecting the data. This gets very tedious for multiple factors and doing it for a few different recurrence types. I figured there must be a way to code such that I could basically have one line of code that had the coxph equation and then applied it to all of my variables of interest and spit out a result that had the univariate analysis results for each factor. I tried using cbind to bind variables (i.e x<-cbind("agedx","tumor size") then running cox coxph(recurrencesurvobj~x) but this of course just did the multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6-9], collapse='+')))
I then ran it has coxph(f)
Gave me the results of a multivariate cox analysis.
Thanks!
**edit: I just fixed the error, I needed to use the column numbers I suppose not the names. Changes are reflected in the code above. However, it still runs the variables selected as a multivariate analysis and not as the true univariate analysis...
If you want to go the formula-route (which in your case with multiple outcomes and multiple variables might be the most practical way to go about it) you need to create a formula per model you want to fit. I've split the steps here a bit (making formulas, making models and extracting data), they can off course be combined this allows you to inspect all your models.
#example using transplant data from survival package
#make new event-variable: death or no death
#to have dichot outcome
transplant$death <- transplant$event=="death"
#making formulas
univ_formulas <- sapply(c("age","sex","abo"),function(x)as.formula(paste('Surv(futime,death)~',x))
)
#making a list of models
univ_models <- lapply(univ_formulas, function(x){coxph(x,data=transplant)})
#extract data (here I've gone for HR and confint)
univ_results <- lapply(univ_models,function(x){return(exp(cbind(coef(x),confint(x))))})

Resources