SPIA::spia function output - r

I asked this question on Bioconductor since it was specific to the SPIA package, but haven't received a reply so I'm posting it here to reach a wider audience.
I downloaded the 134 Malus domestica ('mdm') pathways using the KEGGREST package and successfully built the SPIA data using the makeSPIAdata() function. I'm using this library to evaluate pathways for two different apples, HoneyCrisp and Cripps Pink. Then I called the spia() function, and after it completed it only showed 10 pathways that were 'Done.'
length(dir(mydir)) #Directory containing 134 Malus kgml/xml pathway files.
[1] 134
In spia(), my input vectors 'de' and 'all' contained 243 unique probesets, and I took the mean logFC for each probeset. This dataset was a subset of the limma topTable result with adj.P.Value < 0.001. Below is the run for HoneyCrisp apples (HC).
res<-spia(de=DE_malus_HC, all=entrez_only, organism="mdm", nB = 2000,
pathids = NULL, data.dir="./", combine = 'fisher', plots = TRUE)
Done pathway 1 : RNA transport..
Done pathway 2 : RNA degradation..
Done pathway 3 : MAPK signaling pathway - plant..
Done pathway 4 : Plant hormone signal transduct..
Done pathway 5 : Sulfur relay system..
Done pathway 6 : SNARE interactions in vesicula..
Done pathway 7 : Autophagy..
Done pathway 8 : Protein processing in endoplas..
Done pathway 9 : Plant-pathogen interaction..
Done pathway 10 : Circadian rhythm - plant..
res[ , -12] #Showing RES for HoneyCrisp Apples
  Name                                        ID    pSize NDE pNDE tA          pPERT pG        pGFdr pGFWER Status
1 MAPK signaling pathway - plant              04016 4     4   1    24.29737166 0.160 0.4532130 1     1      Activated
2 Plant hormone signal transduction           04075 14    14  1    11.93279398 0.292 0.6514524 1     1      Activated
3 Circadian rhythm - plant                    04712 3     3   1    9.17881852  0.440 0.8012314 1     1      Activated
4 Plant-pathogen interaction                  04626 2     2   1    0.02234003  0.987 0.9999151 1     1      Activated
5 Protein processing in endoplasmic reticulum 04141 1     1   1    0.00000000  NA    1.0000000 1     1      Inhibited
I then used the 'mdm' library to evaluate a DE set for Cripps Pink (CP) apples...
res<-spia(de=DE_malus_CP, all=entrez_only_CP, organism="mdm", nB = 2000,
pathids = NULL, data.dir="./", plots = TRUE)
Done pathway 1 : RNA transport..
Done pathway 2 : RNA degradation..
Done pathway 3 : MAPK signaling pathway - plant..
Done pathway 4 : Plant hormone signal transduct..
Done pathway 5 : Sulfur relay system..
Done pathway 6 : SNARE interactions in vesicula..
Done pathway 7 : Autophagy..
Done pathway 8 : Protein processing in endoplas..
Done pathway 9 : Plant-pathogen interaction..
Done pathway 10 : Circadian rhythm - plant..
res[, -12]
  Name                                        ID    pSize NDE pNDE tA         pPERT pG       pGFdr pGFWER Status
1 Plant hormone signal transduction           04075 4     4   1    -0.4086641 0.812 0.981103 1     1      Inhibited
2 Protein processing in endoplasmic reticulum 04141 1     1   1    0.0000000  NA    1.000000 1     1      Inhibited
While these results are exciting, I have some questions to make sure this output is correct:
1) What is the printed 'Done' pathway list, and why does it print only 10 pathways rather than all 134 pathways?
2) In both the HC and CP results, why is 'Protein processing in endoplasmic reticulum' in the list even though its pPERT is NA? And why do the other 'Done' pathways (e.g. RNA transport, RNA degradation, Autophagy...) not show up in the res output? Is the Protein processing in endoplasmic reticulum pathway considered significant if tA = 0, pG = 1 and pPERT = NA?
3) I didn't expect pSize and NDE to be equal to each other, which makes every pNDE equal to 1. Why are pSize and NDE the same value?
4) The fold change values for the 243 HC DE probes range from +3.25 to -4.38. Since I'm starting with probesets with an adj.P.Value < 0.001, I lowered nB to 100, but the results are identical to those with nB = 2000. The same is true for the 127 CP DE probes dataset. Why is this?
Thanks for reading this as I tried to give as much info as possible while being succinct. I'm trying to understand this SPIA package in a bit more detail than explained in the vignette and ref manual.
Hope to hear from you soon.
Thanks,
Franklin

The response to this post can be found on Bioconductor:
https://support.bioconductor.org/p/90756/

Related

Inter-rater reliability of multiple raters responding to (some subset of) multiple questions (in R)

I have data from 5 raters who provided ratings of transcripts by answering up to a dozen questions about each transcript. Each question used a different rating system (e.g., yes vs. no, 1-7, or this vs. that vs. indeterminate).
A toy example of the data can be made with this code.
library(data.table)
data.table(Rater = c("A","B","C","D","E"),
Content = c("I","I","I","I","I","II","II","II","II","II"),
Question1 = c("Yes","No","Yes","No","NA"),
Question2 = c("1","3","5","7","NA"),
Question3 = c("This","That","Indeterminate","This","Indeterminate"))
Which produces what is below:
Rater Content Question1 Question2 Question3
1: A I Yes 1 This
2: B I No 3 That
3: C I Yes 5 Indeterminate
4: D I No 7 This
5: E I NA NA Indeterminate
6: A II Yes 1 This
7: B II No 3 That
8: C II Yes 5 Indeterminate
9: D II No 7 This
10: E II NA NA Indeterminate
I need to compute the interrater reliability for the raters.
The kappa2 function of the irr package would need data to be in long format (if I understand correctly)—something like:
Rater A B ... E
Question1_Content_I Yes No ... NA
Question2_Content_I 1 3 ... NA
Question3_Content_I This That ... Ind.
Question1_Content_II Yes No ... Ind.
...
How can I (re)format the data to compute IRR scores (with kappa2 or another function)? (Would melt do the trick?)
What functions would compute IRR scores for each kind of question/rating? (And, if applicable, what data (re)formatting would they require?)
Must there be separate IRR scores for each question/rating or is there a way to compute an overall IRR (across the questions)?
What needs to be done to accommodate the fact that some raters didn't respond to every question?
Thank you for your advice!
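The reshaping step asked about above can be sketched in base R. This is only an illustration under assumptions: the toy data frame mirrors the question's example for Question1, missing answers are stored as real NA rather than the string "NA", and the pairwise agreement at the end is a simple stand-in, not a chance-corrected IRR statistic. If the irr package is installed, the resulting transcripts x raters matrix should be usable with kappam.fleiss(), and its transpose with kripp.alpha(), which is usually the option suggested when some ratings are missing.

```r
# Toy ratings mirroring the question (only Question1 shown); real NA, not "NA"
ratings <- data.frame(
  Rater     = rep(c("A", "B", "C", "D", "E"), times = 2),
  Content   = rep(c("I", "II"), each = 5),
  Question1 = rep(c("Yes", "No", "Yes", "No", NA), times = 2),
  stringsAsFactors = FALSE
)

# One row per transcript (Content), one column per rater
wide <- reshape(ratings, idvar = "Content", timevar = "Rater", direction = "wide")
q1 <- as.matrix(wide[, -1])
rownames(q1) <- wide$Content

# Share of rater pairs that agree on each transcript, skipping missing ratings
pair_agreement <- apply(q1, 1, function(r) {
  r <- r[!is.na(r)]
  pairs <- combn(r, 2)
  mean(pairs[1, ] == pairs[2, ])
})
print(q1)
print(pair_agreement)
```

Repeating the same steps per question keeps each rating scale separate, which matters because nominal (yes/no) and ordinal (1-7) questions usually call for different agreement statistics.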

GLMM of proportions adjusting for group size

I'm trying to investigate whether muzzle contact (mc) in primates tends to be directed more towards the mother than towards other group members (Adults or Juveniles). I have data over 5 years in 4 different groups. This is an example for 3 different initiators (those initiating the mc):
age1data
initiator receiver count total_init prop_mc subgroupsize group
Aaa       Mother   1     3          0.333   1            1
Aaa       Adult    2     3          0.666   40           1
Aaa       Juvenile 0     3          0       20           1
Hee       Mother   0     2          0       1            1
Hee       Adult    0     2          0       40           1
Hee       Juvenile 2     2          1       20           1
Awa       Mother   2     10         0.2     1            2
Awa       Adult    3     10         0.3     7            2
Awa       Juvenile 5     10         0.5     13           2
count: number of mc directed to individuals belonging to each receiver subgroup
total_init: total number of mc initiated by this individual
subgroupsize: number of individuals within the group that belong to the receiver subgroup (for example, each individual has 1 mother, but group 1 has 40 adults (other than the mother) and 20 juveniles)
This is the model I tried:
glmm_ages <- glmer(((count_init/total_init)/subgroupsize)~receiver + (1|group) + (1|initiator),
data = age1data,
family = binomial)
This gives me this error message:
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, :
Downdated VtV is not positive definite
In addition: Warning message:
In eval(family$initialize, rho) : non-integer #successes in a binomial glm!
The model works when I do a simple GLM without group and initiator as random variables but I really think I need to include them.
From what I understand, the error message means that some categories are all 1 or all 0, which is the case when an individual is only recorded muzzle contacting its mother once, for example (the dependent variable becomes 1/1/1 = 1).
I'm trying to understand what I should do from this thread I found http://bbolker.github.io/mixedmodels-misc/ecostats_chap.html#digression-complete-separation
In this section, I'm not sure how to find the number I should be putting instead of "10"?
newdat <- subset(age1data,
abs(resid(glmm_ages,"pearson"))<10)
I'm also not sure what the following means, or how I can work out the corresponding variance and standard deviation for my own dataset:
impose zero-mean Normal priors on the fixed effects (a 4 × 4 diagonal matrix with diagonal elements equal to 9, for variances of 9 or standard deviations of 3)
Can anyone help me figure out if I'm doing the right thing and this is the solution for me?
I apologize for the length of this post, I wanted to make sure everything was there, hope it's clear!
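One point from the warning above can be illustrated separately from the random effects: "non-integer #successes" appears because the response is a ratio, while a binomial family expects integer counts of successes and failures. The sketch below is a minimal base R illustration under assumptions: it uses the nine example rows from the question, a plain glm() without random effects, and it does not attempt the adjustment for subgroup size. As far as I know, glmer() in lme4 accepts the same cbind(successes, failures) response, so the random terms could be added back in that form.

```r
# Example rows from the question (column names as in the table above)
age1data <- data.frame(
  initiator  = rep(c("Aaa", "Hee", "Awa"), each = 3),
  receiver   = factor(rep(c("Mother", "Adult", "Juvenile"), times = 3),
                      levels = c("Mother", "Adult", "Juvenile")),
  count      = c(1, 2, 0, 0, 0, 2, 2, 3, 5),
  total_init = c(3, 3, 3, 2, 2, 2, 10, 10, 10)
)

# Binomial response given as integer successes and failures, not as a ratio
fit <- glm(cbind(count, total_init - count) ~ receiver,
           family = binomial, data = age1data)

# Estimated probability that a contact is directed at each receiver class
newdat <- data.frame(receiver = factor(levels(age1data$receiver),
                                       levels = levels(age1data$receiver)))
p_hat <- predict(fit, newdata = newdat, type = "response")
print(p_hat)
```

With only a receiver factor in the model, the fitted probabilities are just the pooled proportions per receiver class, which makes the output easy to check by hand.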

How do I fix error code "studies with wrong number of comparisons" in netmeta package R

I want to perform a network meta analysis with studies containing multiple and single treatment arms.
Here is my data structure:
'data.frame': 85 obs. of 5 variables:
$ TE : num -0.758 -0.695 -0.288 -0.511 -0.288 ...
$ seTE : num 0.6191 0.2667 0.0341 0.0758 0.1178 ...
$ treat1 : Factor w/ 31 levels "Boil_promo","Chem",..: 6 26 26 18 9 9 3 9 16 26 ...
$ treat2 : Factor w/ 3 levels "Act","Pa","Pb": 2 2 2 2 2 3 3 2 2 2 ...
$ studlab: Factor w/ 63 levels "Altmann2018",..: 1 2 3 4 5 6 7 8 8 9 ...
I have 4 studies with 2 treatment arms, 5 with 3 treatment arms, and 2 with 4 treatment arms.
There are three types of control: Passive, Active and Placebo controls
There are 31 treatment types in total.
After running the code below:
m.netmeta <- netmeta(TE = TE,
seTE = seTE,
treat1 = treat1,
treat2 = treat2,
studlab = paste(net.meta$studlab),
data = net.meta,
sm = "RR",
comb.fixed = TRUE,
comb.random = FALSE,
reference = "Pa",
details.chkmultiarm = TRUE,
sep.trts = " vs ")
I got this error:
Error: The following studies have a wrong number of comparisons: 'Brown2008', 'Crump2005', 'Ecrumen2015', 'Francis2016', 'Luby2004', 'Opryszko2010', 'Reller2003', 'Sinharoy2017'
Please provide data for all treatment comparisons (two-arm: 1; three-arm: 3; four-arm: 6, ...).
Does anyone know how I can deal with this?
netmeta requires the data to have each pairwise comparison listed as a separate row for multi-arm studies. The error message you are getting suggests netmeta detects this has not occurred for some of the studies (namely, Brown2008, Crump2005, Ecrumen2015, Francis2016, Luby2004, Opryszko2010, Reller2003, Sinharoy2017).
Your post indicates many of your studies are multi-arm studies, and the information you have provided suggests you may not have set up your data in the format expected by netmeta. To give a couple examples:
Your description suggests you may be restricting what you put in treat2 (namely, only including active, passive, and placebo controls). What goes in treat2 is entirely dependent on the study composition. The hypothetical example below unpacks this more.
Your description of the study composition ("4 studies with 2 treatment arms, 5 with 3 treatment arms, and 2 with 4 treatment arms") does not seem to match up with the expected number of rows. For instance, why are there 63 levels for studlab? It is not clear to me how this pairs up with the expected 31 rows (corresponding to the number of pairwise comparisons): 4 studies with 2 treatment arms = 4 rows, 5 studies with 3 arms = 15 rows, and 2 studies with 4 treatment arms = 12 rows (4 + 15 + 12 = 31).
As an example, if Brown2008 has the arms MedicationX, MedicationY, and Placebo, you would need separate rows for MedicationX vs MedicationY, MedicationX vs Placebo, and MedicationY vs Placebo, not just MedicationX vs Placebo and MedicationY vs Placebo. I would recommend starting your debugging there.
The other thing that could possibly lead to that error is if you have separate publications that happen to have the same AuthorYear label. If that's the case, you can try appending the 'culprit' studies with unique identifiers.
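The row-count rule described above (a study with k arms needs k(k-1)/2 comparison rows) can be checked before calling netmeta. The following is a minimal base R sketch on made-up data: the study labels S1 to S3 and treatments X, Y and Pa are hypothetical, and only the column names studlab, treat1 and treat2 are taken from the question.

```r
# Hypothetical toy data: one row per pairwise comparison
net <- data.frame(
  studlab = c("S1", "S2", "S2", "S2", "S3", "S3"),
  treat1  = c("X",  "X",  "X",  "Y",  "X",  "Y"),
  treat2  = c("Pa", "Y",  "Pa", "Pa", "Pa", "Pa"),
  stringsAsFactors = FALSE
)

# For each study: number of distinct arms, rows present, rows expected
check <- do.call(rbind, lapply(split(net, net$studlab), function(s) {
  k <- length(unique(c(s$treat1, s$treat2)))
  data.frame(studlab = s$studlab[1], arms = k,
             rows = nrow(s), expected = choose(k, 2),
             stringsAsFactors = FALSE)
}))

# Studies whose row count does not match k(k-1)/2
bad <- check[check$rows != check$expected, ]
print(bad)
```

Here S3 has three arms but only two rows (the X vs Y comparison is missing), so it is the only study flagged. With factor columns, as in the str() output above, wrap treat1 and treat2 in as.character() before combining them.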

R: How to measure difference with both categorical and numeric features

I'm very new to data wrangling. And now I have this problem at hand:
I used tables of biochemical measurements (all numerical) from patients to perform a cluster analysis, which sorted them into 5 clusters.
I also have their clinical data/features, and I want to know whether any of these clinical features (a mix of numerical and categorical) differ significantly from one cluster to another. How can I go about this? What test should I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
.....
As you can see, ClusterAssigned comes from my cluster analysis. FISH, IGHV and IgG are categorical, there are sometimes na values, and sometimes one person has multiple entries such as "13q,t12".
As a simplification, I could take only the cluster 1 and cluster 4 patients, omit all rows with na values, and ask whether there is a difference in their age, sex, FISH, IGHV... Even then, what method can I use to perform such a test in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields.
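The dummy-variable conversion mentioned above could look like the following in base R. This is a small sketch on invented values: the four rows are hypothetical, and only the column names sex and IGHV come from the question's table. Note that model.matrix() drops rows containing NA by default, so missing values need to be handled beforehand.

```r
# Hypothetical clinical features; column names follow the question's table
clin <- data.frame(
  sex  = factor(c("m", "m", "f", "f")),
  IGHV = factor(c("mutated", "mutated", "unmutated", "mutated"))
)

# One 0/1 indicator column per non-reference factor level (intercept removed)
dummies <- model.matrix(~ sex + IGHV, data = clin)[, -1, drop = FALSE]
print(dummies)
```

Each factor with k levels yields k - 1 indicator columns, with the first level (here "f" and "mutated") acting as the reference.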

Phylogenetic tree

I am trying to build a phylogenetic tree based on pairwise data of genes. Below is a subset of my data (test.txt). The tree does not have to be constructed from DNA sequences; the gene names can simply be treated as words.
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B
Below is my code in R
library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
My figure is attached here
I have a question about how they are clustered. The pairs
17 AR ADRA1B
18 AR ADRA1A
and
2 ADRA1B ADK
3 ADRA1A ADK
should be clustered closely because they have one gene in common, so 17 and 2 should be together, and so should 18 and 3.
Should I use another method if Euclidean distance is the wrong choice here?
Should I convert my data to a matrix of rows and columns, where gene1 is the x-axis and gene2 is the y-axis, with each cell filled by 1 or 0? (Basically, 1 if they are paired and 0 if not.)
Updated Code :
table=table(tab$gene1, tab$gene2)
d <- dist(table,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
However, with this I get only the genes from the gene1 column and not those from gene2. The figure below is exactly what I want, but it should include the genes from the gene2 column as well.
There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty, no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability of their common occurrence, and something like a principal component analysis (PCA) may be preferred, but I am far from being an expert on (hierarchical) clustering.
Before you can use the dist function, you have to bring your data into an appropriate format:
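# convert test data into suitable format
# as.character() guards against gene1/gene2 having been read in as factors
gene1 <- as.character(tab$gene1)
gene2 <- as.character(tab$gene2)
gene.names <- sort(unique(c(gene1, gene2)))
gene.matrix <- matrix(0L, nrow=nrow(tab), ncol=length(gene.names), dimnames=list(NULL, gene.names))
# mark both genes of each row as present (matrix indexing by row/column pairs)
gene.matrix[cbind(seq_len(nrow(tab)), match(gene1, gene.names))] <- 1L
gene.matrix[cbind(seq_len(nrow(tab)), match(gene2, gene.names))] <- 1L
gene.matrix <- cbind(ID=tab$ID, gene.matrix)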
The obtained gene.matrix has the shape:
ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,] 1 0 1 0 0 0 1 0
[2,] 2 0 1 0 0 1 0 0
[3,] 3 0 1 0 1 0 0 0
[4,] 4 0 0 0 0 0 0 0
...
So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):
d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
It may be a good idea to read up on the differences between distance measures such as Euclidean and Manhattan. For instance, the Euclidean distance between the individuals with ID=1 and ID=2 is:
euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)
whereas the manhattan distance is
manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
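The two formulas can be evaluated directly with dist(). The sketch below is a small check under one assumption: it uses only the first six gene columns of the rows with ID=1 and ID=2, since every other gene column is 0 for both rows and does not change either distance.

```r
# Rows ID=1 (ADRA1D, ADK) and ID=2 (ADRA1B, ADK), first six gene columns only
x <- rbind(
  id1 = c(ACHE = 0, ADK = 1, ADORA1 = 0, ADRA1A = 0, ADRA1B = 0, ADRA1D = 1),
  id2 = c(ACHE = 0, ADK = 1, ADORA1 = 0, ADRA1A = 0, ADRA1B = 1, ADRA1D = 0)
)

# Two columns differ by 1, so Euclidean = sqrt(1 + 1) and Manhattan = 1 + 1
euclidean_dist <- as.numeric(dist(x, method = "euclidean"))
manhattan_dist <- as.numeric(dist(x, method = "manhattan"))
print(c(euclidean = euclidean_dist, manhattan = manhattan_dist))
```

Two rows that share one gene therefore sit at Euclidean distance sqrt(2), while rows sharing no gene sit at distance 2, which is what lets the clustering group pairs with a common gene.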
