Error obtained when running the GOplot package in R

I have uploaded an output from DAVID, "BP_direct", removed unnecessary columns, and uploaded a set of DEGs, "DEGs".
> names(BP.direct) # data frame with 142 obs.
[1] "Category" "Term" "Genes" "adj_pval"
> names(DEGs.6 ) # data frame with 394 obs.
[1] "ID" "logFC" "P.Value" "adj.P.Val"
I generated the plotting object, and it looks fine to me:
> head(circ) # data frame with 995 obs. of 7 variables
category term count genes
1 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000159618
2 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000197471
3 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000166928
4 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000134539
5 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000116824
6 BP_DIRECT cell surface receptor signaling pathway 28 ENSG00000198821
logFC adj_pval zscore
1 0.3638934 4.87e-09 3.023716
2 0.1851308 4.87e-09 3.023716
3 0.2148182 4.87e-09 3.023716
4 0.2261708 4.87e-09 3.023716
5 0.2341476 4.87e-09 3.023716
6 0.3118525 4.87e-09 3.023716
>
However, when I try to create the plots I get an error about arguments of unequal length.
GOBar(circ, display = 'BP_direct')
ℹ Error occurred in the 1st layer.
Caused by error in `tapply()`:
! arguments must have same length
18. ├─base::factor(id, levels = stats::reorder(id, adj_pval))
19. ├─stats::reorder(id, adj_pval)
20. ├─stats:::reorder.default(id, adj_pval)
21. │ └─base::tapply(X = X, INDEX = x, FUN = FUN, ...)
22. │ └─base::stop("arguments must have same length")
And I run into the same trouble with the GOChord function:
GOChord(chord, space = 0.02, gene.order = 'logFC', gene.space = 0.25, gene.size = 5)
Error in `$<-.data.frame`(`*tmp*`, "x.start", value = c(0, 0.0266007731512397, :
replacement has 239 rows, data has 230
Can anyone tell me what I might have done wrong, and how to fix it?
All the best, Julie


How do I fix error code "studies with wrong number of comparisons" in netmeta package R

I want to perform a network meta analysis with studies containing multiple and single treatment arms.
Here is my data structure:
'data.frame': 85 obs. of 5 variables:
$ TE : num -0.758 -0.695 -0.288 -0.511 -0.288 ...
$ seTE : num 0.6191 0.2667 0.0341 0.0758 0.1178 ...
$ treat1 : Factor w/ 31 levels "Boil_promo","Chem",..: 6 26 26 18 9 9 3 9 16 26 ...
$ treat2 : Factor w/ 3 levels "Act","Pa","Pb": 2 2 2 2 2 3 3 2 2 2 ...
$ studlab: Factor w/ 63 levels "Altmann2018",..: 1 2 3 4 5 6 7 8 8 9 ...
I have 4 studies with 2 treatment arms, 5 with 3 treatment arms, and 2 with 4 treatment arms.
There are three types of control: Passive, Active and Placebo controls
There are 31 treatment types in total.
After running the code below:
m.netmeta <- netmeta(TE = TE,
seTE = seTE,
treat1 = treat1,
treat2 = treat2,
studlab = paste(net.meta$studlab),
data = net.meta,
sm = "RR",
comb.fixed = TRUE,
comb.random = FALSE,
reference = "Pa",
details.chkmultiarm = TRUE,
sep.trts = " vs ")
I got this error:
Error: The following studies have a wrong number of comparisons: 'Brown2008', 'Crump2005', 'Ecrumen2015', 'Francis2016', 'Luby2004', 'Opryszko2010', 'Reller2003', 'Sinharoy2017'
Please provide data for all treatment comparisons (two-arm: 1; three-arm: 3; four-arm: 6, ...).
Does anyone know how I can deal with this?
netmeta requires the data to have each pairwise comparison listed as a separate row for multi-arm studies. The error message you are getting suggests netmeta detects this has not occurred for some of the studies (namely, Brown2008, Crump2005, Ecrumen2015, Francis2016, Luby2004, Opryszko2010, Reller2003, Sinharoy2017).
Your post indicates many of your studies are multi-arm studies, and the information you have provided suggests you may not have set up your data in the format expected by netmeta. To give a couple of examples:
Your description suggests you may be restricting what you put in treat2 (namely, only including active, passive, and placebo controls). What goes in treat2 is entirely dependent on the study composition. The hypothetical example below unpacks this more.
Your description of the study composition ("4 studies with 2 treatment arms, 5 with 3 treatment arms, and 2 with 4 treatment arms") does not seem to match up with the expected number of rows. For instance, why are there 63 levels for studlab? It is not clear to me how this pairs up with the expected 31 rows (corresponding to the number of pairwise comparisons): 4 studies with 2 treatment arms = 4 rows, 5 studies with 3 arms = 15 rows, and 2 studies with 4 treatment arms = 12 rows (4 + 15 + 12 = 31).
As an example, if Brown2008 has the arms MedicationX, MedicationY, and Placebo, you would need separate rows for MedicationX vs MedicationY, MedicationX vs Placebo, and MedicationY vs Placebo, not just MedicationX vs Placebo and MedicationY vs Placebo. I would recommend starting your debugging there.
The other thing that could possibly lead to that error is if you have separate publications that happen to have the same AuthorYear label. If that's the case, you can try appending the 'culprit' studies with unique identifiers.
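The row-count rule above (a study with k arms needs choose(k, 2) rows) can be checked before calling netmeta. The following is only a rough sketch in base R: the toy data frame and the helper name check_comparisons are made up for illustration, and the column names studlab, treat1 and treat2 are assumed to match the ones in the question.

```r
# Hypothetical toy data: Brown2008 is a three-arm study that lists only
# two of its three pairwise comparisons; Altmann2018 is a two-arm study.
net.meta <- data.frame(
  studlab = c("Brown2008", "Brown2008", "Altmann2018"),
  treat1  = c("MedicationX", "MedicationY", "Chem"),
  treat2  = c("Placebo", "Placebo", "Pa"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: compare the rows supplied per study with the
# number of pairwise comparisons implied by its number of distinct arms.
check_comparisons <- function(dat) {
  per_study <- lapply(split(dat, as.character(dat$studlab)), function(d) {
    arms <- unique(c(as.character(d$treat1), as.character(d$treat2)))
    data.frame(
      studlab  = as.character(d$studlab)[1],
      n_arms   = length(arms),
      n_rows   = nrow(d),
      expected = choose(length(arms), 2),  # two-arm: 1, three-arm: 3, four-arm: 6
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, per_study)
  rownames(out) <- NULL
  out$ok <- out$n_rows == out$expected
  out
}

res <- check_comparisons(net.meta)
print(res[!res$ok, ])  # studies whose row count does not match their arm count
```

Studies flagged here either lack a comparison (such as MedicationX vs MedicationY in the toy data) or share a studlab with a different publication. The check only counts rows; it does not verify that the listed comparisons are the right ones.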

SPIA::spia function output

I asked this question on Bioconductor since it was specific to the SPIA package, but haven't received a reply so I'm posting it here to reach a wider audience.
I downloaded the 134 Malus domestica ('mdm') pathways using the KEGGREST package and successfully made the SPIA data using the makeSPIAdata() function. I'm using this library to evaluate pathways for two different apples, HoneyCrisp and Cripps Pink. Then I called the spia() function, and after it completed it only showed 10 pathways that were 'Done.'
length(dir(mydir)) #Directory containing 134 Malus kgml/xml pathway files.
[1] 134
In spia(), my input vectors 'de' and 'all' contained 243 unique probesets, and I took the mean logFC for each probeset. This dataset was a subset of the limma topTable result with adj.P.Value < 0.001. Below is the run for HoneyCrisp apples (HC).
res<-spia(de=DE_malus_HC, all=entrez_only, organism="mdm", nB = 2000,
pathids = NULL, data.dir="./", combine = 'fisher', plots = TRUE)
Done pathway 1 : RNA transport..
Done pathway 2 : RNA degradation..
Done pathway 3 : MAPK signaling pathway - plant..
Done pathway 4 : Plant hormone signal transduct..
Done pathway 5 : Sulfur relay system..
Done pathway 6 : SNARE interactions in vesicula..
Done pathway 7 : Autophagy..
Done pathway 8 : Protein processing in endoplas..
Done pathway 9 : Plant-pathogen interaction..
Done pathway 10 : Circadian rhythm - plant..
> res[ , -12] #Showing RES for HoneyCrisp Apples
                                         Name    ID pSize NDE pNDE          tA pPERT        pG pGFdr pGFWER    Status
1              MAPK signaling pathway - plant 04016     4   4    1 24.29737166 0.160 0.4532130     1      1 Activated
2           Plant hormone signal transduction 04075    14  14    1 11.93279398 0.292 0.6514524     1      1 Activated
3                    Circadian rhythm - plant 04712     3   3    1  9.17881852 0.440 0.8012314     1      1 Activated
4                  Plant-pathogen interaction 04626     2   2    1  0.02234003 0.987 0.9999151     1      1 Activated
5 Protein processing in endoplasmic reticulum 04141     1   1    1  0.00000000    NA 1.0000000     1      1 Inhibited
I then used the 'mdm' library to evaluate a DE set for Cripps Pink (CP) apples...
res<-spia(de=DE_malus_CP, all=entrez_only_CP, organism="mdm", nB = 2000,
pathids = NULL, data.dir="./", plots = TRUE)
Done pathway 1 : RNA transport..
Done pathway 2 : RNA degradation..
Done pathway 3 : MAPK signaling pathway - plant..
Done pathway 4 : Plant hormone signal transduct..
Done pathway 5 : Sulfur relay system..
Done pathway 6 : SNARE interactions in vesicula..
Done pathway 7 : Autophagy..
Done pathway 8 : Protein processing in endoplas..
Done pathway 9 : Plant-pathogen interaction..
Done pathway 10 : Circadian rhythm - plant..
> res[, -12]
                                         Name    ID pSize NDE pNDE         tA pPERT       pG pGFdr pGFWER    Status
1           Plant hormone signal transduction 04075     4   4    1 -0.4086641 0.812 0.981103     1      1 Inhibited
2 Protein processing in endoplasmic reticulum 04141     1   1    1  0.0000000    NA 1.000000     1      1 Inhibited
While these results are exciting, I have some questions to make sure this output is correct:
1) What is the printed 'Done' pathway list, and why does it print only 10 pathways rather than all 134 pathways?
2) Why is Protein processing in endoplasmic reticulum listed in the HC and CP results even though pPERT = NA? And why do the other 'Done' pathways (e.g. RNA transport, RNA degradation, Autophagy...) not show up in the res output? Is the Protein processing in endoplasmic reticulum pathway considered significant if tA = 0, pG = 1 and pPERT = NA?
3) I didn't expect pSize and NDE to be equal to each other, which makes all the pNDEs equal to 1. Why do pSize and NDE have the same value?
4) The fold change values for the 243 HC DE probes dataset range from +3.25 to -4.38. Since I'm starting with gene probesets with an adj.P.Value < 0.001, I lowered the nB value to, say, 100, but the results are identical to those with nB = 2000. This is also true for the 127 CP DE probes dataset. Why is this?
Thanks for reading this as I tried to give as much info as possible while being succinct. I'm trying to understand this SPIA package in a bit more detail than explained in the vignette and ref manual.
Hope to hear from you soon.
Thanks,
Franklin
The response to this post can be found on Bioconductor:
https://support.bioconductor.org/p/90756/

Perform multiple survival analysis with loop in R

I have recently been working on survival analysis with R. I have two data frames: geneDf for gene expression and survDf for the follow-up. Samples of both follow:
#Data frame: geneDf
geneID=c("EGFR","Her2","E2F1","PTEN")
patient1=c(12,23,56,23)
patient2=c(23,34,11,6)
patient3=c(56,44,32,45)
patient4=c(23,64,45,23)
geneDf=data.frame(patient1,patient2,patient3,patient4,geneID)
> geneDf
patient1 patient2 patient3 patient4 geneID
1 12 23 56 23 EGFR
2 23 34 44 64 Her2
3 56 11 32 45 E2F1
4 23 6 45 23 PTEN
#Data frame:survDf
ID=c("patient1","patient2","patient3","patient4")
time=c(23,7,34,56)
status=c(1,0,1,1)
survDf=data.frame(ID,time,status)
#
> survDf
ID time status
1 patient1 23 1
2 patient2 7 0
3 patient3 34 1
4 patient4 56 1
I extract the expression data of a specific gene from geneDf, use the median of its expression as the cut-off value to perform survival analysis with the "survival" package, and get the p-value from survdiff. In the following code I use the "EGFR" gene as an example.
#extract expression of a certain gene
targetGene<-subset(geneDf,grepl("EGFR",geneDf$geneID))
targetGene$geneID<-NULL
#Transpose the table and adjust its format
targetGene<-t(targetGene[,1:ncol(targetGene)])
targetGene<-data.frame(as.factor(rownames(targetGene)),targetGene)
colnames(targetGene)<-c("ID","Expression")
rownames(targetGene)<-NULL
targetGene$Expression1<-targetGene$Expression
targetGene$Expression1[ targetGene$Expression<median( targetGene$Expression)]<-1
targetGene$Expression1[ targetGene$Expression>=median( targetGene$Expression)]<-2
#Survival analysis
library(survival)
##Add survival object
survDf$SurvObj<-with(survDf, Surv(time,status==1))
## Kaplan-Meier estimator for stage
km<-survfit(SurvObj~targetGene$Expression1, data=survDf, conf.type = "log-log")
sdf<-survdiff(Surv(time, status) ~targetGene$Expression1, data=survDf)
#gain p value
p.val <-1-pchisq(sdf$chisq, length(sdf$n) - 1)
> p.val
[1] 0.1572992
I can do this for different genes one by one. But there are more than 10,000 genes to analyze. I want to get all of their p-values and put them in a new data frame. Do I need to use a loop or apply?
This is an ugly script, but it works.
In Data10, the first column must hold the time, the second the status, and the remaining columns any treatments (grouping variables) that you want to test, with patients as row names.
library(plyr)
library(survival)
loopsurff <- function(Data10) {
  # one column of indices per variable to test: time, status, variable
  combos <- rbind(rep(1, ncol(Data10) - 2),
                  rep(2, ncol(Data10) - 2),
                  3:ncol(Data10))
  vv <- adply(combos, 2, function(x) {
    fit <- survdiff(Surv(Data10[, x[1]], Data10[, x[2]]) ~ Data10[, x[3]])
    # degrees of freedom = number of groups - 1
    p <- 1 - pchisq(fit$chisq, length(fit$n) - 1)
    data.frame(var1 = colnames(Data10)[x[3]],
               p.value = as.numeric(sprintf("%.3f", p)))
  })
  vv
}
You will get a data frame with the column names of yourdata[,3:ncol(yourdata)] and the p-value for each one.
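As an alternative that stays close to the layout in the question (genes in rows, patients in columns), the median split and log-rank test can be wrapped in one function and run over every gene with apply. This is a minimal sketch, not a tuned solution: median_split_p is a made-up helper name, and the data are the four-patient example from the question.

```r
library(survival)

# Example data from the question
geneDf <- data.frame(
  patient1 = c(12, 23, 56, 23), patient2 = c(23, 34, 11, 6),
  patient3 = c(56, 44, 32, 45), patient4 = c(23, 64, 45, 23),
  geneID   = c("EGFR", "Her2", "E2F1", "PTEN")
)
survDf <- data.frame(
  ID     = c("patient1", "patient2", "patient3", "patient4"),
  time   = c(23, 7, 34, 56),
  status = c(1, 0, 1, 1)
)

# Expression matrix: one row per gene, columns ordered like the rows of survDf
expr <- as.matrix(geneDf[, as.character(survDf$ID)])
rownames(expr) <- as.character(geneDf$geneID)

# Hypothetical helper: split one gene at its median, return the log-rank p-value
median_split_p <- function(x, time, status) {
  group <- ifelse(x < median(x), 1, 2)
  if (length(unique(group)) < 2) return(NA_real_)  # no split possible
  sdf <- survdiff(Surv(time, status) ~ group)
  1 - pchisq(sdf$chisq, length(sdf$n) - 1)
}

p.values <- apply(expr, 1, median_split_p,
                  time = survDf$time, status = survDf$status)
result <- data.frame(geneID = rownames(expr), p.value = unname(p.values))
print(result)
```

For EGFR this should give the same p-value as the one-gene code in the question. With 10,000 genes the same apply call runs unchanged, though a multiple-testing correction such as p.adjust would probably be wanted afterwards.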

Phylogenetic tree

I am trying to build a phylogenetic tree based on pairwise data of genes. Below is a subset of my data (test.txt). The tree does not have to be constructed on the basis of any DNA sequences; the gene names can simply be treated as words.
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B
Below is my code in R
library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
My figure is attached here
I have a question about how they are clustered. I would expect the pairs
17 AR ADRA1B
18 AR ADRA1A
and
2 ADRA1B ADK
3 ADRA1A ADK
to be clustered closely because they have one gene in common, so 17 and 2 should be together, and likewise 18 and 3.
Should I use another method, if I am wrong to use this one (Euclidean distance)?
Should I convert my data to a matrix of rows and columns, where gene1 is the x-axis and gene2 is the y-axis, with each cell filled by 1 or 0? (Basically, 1 if they are paired and 0 if not.)
Updated Code :
table=table(tab$gene1, tab$gene2)
d <- dist(table,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
However, with this I get only the genes from the gene1 column and not from the gene2 column. The figure below is exactly what I want, but it should include the genes from the gene2 column as well.
There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty, no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability of their common occurrence, and something like a principal component analysis (PCA) may be preferred, but I am far from being an expert on (hierarchical) clustering.
Before you can use the dist function, you have to bring your data into an appropriate format:
# convert test data into suitable format
# (as.character guards against gene1/gene2 being read in as factors)
gene.names <- sort(unique(c(as.character(tab[,"gene1"]),as.character(tab[,"gene2"]))))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
# set the two gene columns of each row to 1
for (x in seq_len(nrow(tab))) gene.matrix[x,as.character(unlist(tab[x,c("gene1","gene2")]))] <- 1L
The obtained gene.matrix has the shape:
ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,] 1 0 1 0 0 0 1 0
[2,] 2 0 1 0 0 1 0 0
[3,] 3 0 1 0 1 0 0 0
[4,] 4 0 0 0 0 0 0 0
...
So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):
d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))
It may be a good idea to read up on the differences between the distance measures euclidean, manhattan etc. For instance, the euclidean distance between the individuals with ID=1 and ID=2 is:
euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)
whereas the manhattan distance is
manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
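The two formulas can be checked numerically on the first two rows shown above. The sketch below retypes only the six gene columns in which rows 1 and 2 could differ (ACHE to ADRA1D); every other column is 0 in both rows and does not change either distance.

```r
# Rows ID=1 (ADRA1D, ADK) and ID=2 (ADRA1B, ADK) of gene.matrix,
# restricted to six gene columns for illustration
m <- rbind("1" = c(0, 1, 0, 0, 0, 1),
           "2" = c(0, 1, 0, 0, 1, 0))
colnames(m) <- c("ACHE", "ADK", "ADORA1", "ADRA1A", "ADRA1B", "ADRA1D")

e   <- as.numeric(dist(m, method = "euclidean"))  # sqrt(0 + 0 + 0 + 0 + 1 + 1)
man <- as.numeric(dist(m, method = "manhattan"))  # 0 + 0 + 0 + 0 + 1 + 1

print(c(euclidean = e, manhattan = man))
```

The rows differ in exactly two columns, so the manhattan distance counts the mismatches and the euclidean distance is its square root. For 0/1 data like this, both measures rank pairs of rows the same way.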

Apply LR models to another dataframe

I searched SO, but I could not find code that applies to my question. It is similar to this question: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe in order to get x, the independent variable, into a 3rd column. For example, I would like to apply sumtest1 to Samples 6:29, sumtest2 to Samples 35:50, sumtest3 to Samples 56:79, etc., in intervals of 24 and 16 samples. The sample numbers repeat after 200, so sumtest9 will be for Samples 6:29 again.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code worked when I tried it out.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeat to a total length of the pieces is 200
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number (numeric, so it has the same type as df$mod)
coefs$mod <- as.numeric(rownames(coefs))
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs, by = "mod")
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)
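The inversion conc = (area - intercept) / beta_conc can also be checked against the Calc values in the question without plyr, by looking up each row's coefficients with match. This is a small sketch on four rows from the question. The mod column is typed in by hand here, on the assumption that Samples 6:29 use model 1 and Samples 35:50 use model 2.

```r
# First two coefficient rows from the question
coefs <- data.frame(
  mod       = 1:2,
  intercept = c(-108589.2726, -49653.18701),
  beta_conc = c(846.0713372, 811.3982918)
)

# Four rows from the question; mod is assigned by hand (assumption, see above)
samples <- data.frame(
  Sample = c(6, 7, 35, 36),
  Area   = c(236211, 724919, 246707, 538092),
  mod    = c(1, 1, 2, 2)
)

# Look up each row's coefficients, then solve area = intercept + beta * conc
idx <- match(samples$mod, coefs$mod)
samples$Calc <- (samples$Area - coefs$intercept[idx]) / coefs$beta_conc[idx]

print(samples)
```

The resulting Calc values should agree closely with the expected output in the question (407.53, 985.15, 365.25 and 724.36). Small differences in later decimals are possible because the printed coefficients are rounded.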
