goseq package in R "missing value where TRUE/FALSE needed" error - r

I am attempting to run a GO Analysis in R (I have never done this analysis, so I am trying different packages), and I am struggling to find the problem with my code in the goseq package.
I start with this code which produces a list of the differentially expressed gene names:
de.genes <- rownames(res)[ which(res$padj < fdr.threshold & !is.na(res$padj)) ]
Then I try to run this code (based on page 7 of the vignette, https://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf):
pwf <- nullp(de.genes, "hg38","geneSymbol")
but I get the following error:
Can't find hg38/geneSymbol length data in genLenDataBase...
Found the annotation package, TxDb.Hsapiens.UCSC.hg38.knownGene
Trying to get the gene lengths from it.
Error in if (matched_frac == 0) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In grep(txdbPattern, installedPackages):argument 'pattern' has length > 1 and only the first element will be used
I found this forum: https://support.bioconductor.org/p/38580/ that says I need an "indicator variable" but I do not know what this is.
Any help with this error would be greatly appreciated, or if you know of any other GO packages that are easy to learn. Thanks!

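You can check the supported databases. hg38 is listed, but it has no gene length data in geneLenDataBase:
library(org.Hs.eg.db)
library(goseq)
# Table of supported genomes and gene ID formats
supported <- supportedOrganisms()
supported[grep("hg38|hg19",supported$Genome),]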
   Genome         Id  Id Description Lengths in geneLeneDataBase
4    hg19  knownGene  Entrez Gene ID                        TRUE
36   hg19    ensGene Ensembl gene ID                        TRUE
81   hg19 geneSymbol     Gene Symbol                        TRUE
98   hg38                                                  FALSE
   GO Annotation Available
4                     TRUE
36                    TRUE
81                    TRUE
98                    TRUE
You can get a rough idea of what the result looks like by using hg19. Some genes will be missing or unmatched, but it should be OK. The input to nullp needs to be a named binary vector (1 for differentially expressed genes, 0 otherwise, with gene identifiers as names), for example:
set.seed(111)
allgenes = keys(org.Hs.eg.db,keytype="SYMBOL")
de.genes = rbinom(100,1,0.3)
names(de.genes) = sample(allgenes,100)
It looks like this:
      GALNT5        TPRKB         CD48       OR52R1 LOC105372708 LOC112163649
           0            1            0            0            0            0
LOC105369203 LOC110121115 LOC105377654 LOC105371502 LOC101929964        HPC14
           0            0            0            0            0            0
    IGHD4-17 LOC101927993        HINT1         BCC3      RPL18P3 LOC108281192
           0            0            0            0            0            1
   RNU6-793P          JUN
           0            0
This will be ok:
res = nullp(de.genes,"hg19","geneSymbol")
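For the data in the question, the named 0/1 vector could be built from the results table instead of being simulated. This is only a minimal sketch: it assumes res is a DESeq2-style table with a padj column and gene symbols as row names. The toy res, the threshold value and the name gene.vector are made up for illustration.

```r
# Toy stand-in for the results table from the question (assumption:
# gene symbols as row names, adjusted p-values in `padj`).
res <- data.frame(
  padj = c(0.01, 0.20, NA, 0.03),
  row.names = c("JUN", "HINT1", "CD48", "TPRKB")
)
fdr.threshold <- 0.05

# 1 = differentially expressed, 0 = not; genes with NA padj count as 0.
gene.vector <- as.integer(!is.na(res$padj) & res$padj < fdr.threshold)
names(gene.vector) <- rownames(res)

print(gene.vector)
```

A vector of this shape, covering all tested genes rather than only the significant ones, is what would be passed to nullp in place of the character vector of gene names.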

Related

Duplicate Row Name Error for FAMD Visualization

I'm trying to perform this function in R: fviz_famd_ind() and keep getting an error. It works on the wine dataset provided in the package, but not on my cleaned data set from Telco.Customer.Churn from IBM.
I've created the object of the FAMD function using the cleaned data set called dfcfamd1. I've verified there are no duplicate row or column names in the sets using any(duplicated(rownames())) for both Telco.Customer.Churn and dfcfamd1 which both return FALSE.
fviz_famd_ind(dfcfamd1)
> Error in `.rowNamesDF<-`(x, value = value) :
> duplicate 'row.names' are not allowed
> In addition: Warning message:
> non-unique values when setting 'row.names': ‘No’, ‘Yes’
Sample Data below
head(Telco.Customer.Churn)
customerID gender SeniorCitizen Partner Dependents tenure
1 7590-VHVEG Female 0 Yes No 1
2 5575-GNVDE Male 0 No No 34
3 3668-QPYBK Male 0 No No 2
PhoneService MultipleLines InternetService OnlineSecurity
1 No No DSL No
2 Yes No DSL Yes
3 Yes Yes Fiber optic No
OnlineBackup DeviceProtection TechSupport StreamingTV
1 Yes No No No
2 No No No No
3 No Yes No Yes
StreamingMovies Contract PaperlessBilling PaymentMethod
1 No Month-to-month Yes Electronic check
2 No One year No Mailed check
3 No Month-to-month Yes Mailed check
MonthlyCharges TotalCharges Churn
1 29.85 29.85 No
2 56.95 1889.50 No
3 53.85 108.15 Yes
The output should give me a graphical output which it does for the package data, but not for my data.
Attempting to set names to unique, I get a vector error.
rownames(dfcfamd1) = make.names(names, unique=TRUE)
> Error in as.character(names) :
> cannot coerce type 'builtin' to vector of type 'character'
The issue is that names is a function, so this call passes the function itself to make.names:
rownames(dfcfamd1) = make.names(names, unique=TRUE)
Instead it should be:
row.names(dfcfamd1) = make.names(row.names(dfcfamd1), unique=TRUE)
Try:
fviz_pca_ind(dfcfamd1)
I ran into the same problem. It can be worked around by calling fviz_pca_ind instead of fviz_famd_ind, as the two functions use data with similar structures.
It seems that fviz_famd_ind cannot handle the same values across multiple categorical columns.
One way to solve this is to rename the values to be unique across columns:
library(dplyr)  # provides recode_factor(), used below
# Define factors
cols <- c("Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling", "Churn")
dfcfamd1[cols] <- lapply(dfcfamd1[cols], factor)
rm(cols)
# Rename the factors
# Do this for every column until only unique values remain.
dfcfamd1$Partner<- recode_factor(dfcfamd1$Partner,"Yes" = "yesPartner", "No" = "noPartner")
#[...]
dfcfamd1$Churn<- recode_factor(dfcfamd1$Churn,"Yes" = "yesChurn", "No" = "noChurn")
# Run the function on dfcfamd1
fviz_famd_ind(dfcfamd1)
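Recoding every column by hand gets tedious with this many Yes/No variables. One possible shortcut is to prefix each factor level with its column name, which makes the levels unique across columns in one pass. This is a base R sketch on a made-up three-row data frame, not the Telco data, and whether it is enough for fviz_famd_ind is an assumption based on the diagnosis above.

```r
# Toy data with the same "Yes"/"No" values repeated across columns.
df <- data.frame(
  Partner = c("Yes", "No", "Yes"),
  Churn   = c("No", "No", "Yes"),
  tenure  = c(1, 34, 2),
  stringsAsFactors = TRUE
)

# Find the categorical columns and prefix their levels with the column name.
is_cat <- vapply(df, is.factor, logical(1))
df[is_cat] <- Map(function(col, nm) {
  levels(col) <- paste(nm, levels(col), sep = "_")
  col
}, df[is_cat], names(df)[is_cat])

print(lapply(df[is_cat], levels))
```

Numeric columns such as tenure are left untouched, since only factor columns are selected.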

Convert several levels of factor into 2 in R

I have 197 levels relating to location. I want to simplify this by creating a new variable "INSIDE" which stores 1 when the location is a building/home/etc. and 0 when the location is outside. I have tried grepl(), but it gives a warning:
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")),1,0)
Warning message:
In grepl(crime_3yr$Premise.Description, pattern = c("BUILDING", :
argument 'pattern' has length > 1 and only the first element will be used
I have tried using lapply(), but it did not work either.
I want the output to be like this:
BUILDING 1
SHOP 1
Street 0
grepl takes a single regular expression rather than a vector of options. Try this:
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
If you want to keep the code similar to what you listed, you need to look into regular expressions, because the pattern argument of grepl must be a single regular expression:
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
Try this code:
Your data.frame:
data<-data.frame(Premise.Description= c("BUILDING 1","MY ROOM","AUTO","BALCONY","OTHER"))
The solution:
toMatch<-c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")
data$Inside<-grepl(paste(toMatch,collapse="|"), data$Premise.Description)
data
  Premise.Description Inside
1          BUILDING 1   TRUE
2             MY ROOM   TRUE
3                AUTO   TRUE
4             BALCONY   TRUE
5               OTHER  FALSE
You might be better off using data.table:
library(data.table)
setDT(data)
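data[
  # Collapse the keywords into one regex; grepl only uses the first element of a vector pattern
  grepl(paste(c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE"), collapse = "|"), Premise.Description),
  Inside := TRUE
]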
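One caveat with a plain "A|B|C" pattern is that it matches substrings, so a keyword like BAR would also match a description such as BARBER SHOP. Wrapping the alternation in word boundaries avoids that, and as.integer gives the 1/0 coding asked for in the question. This is a minimal sketch with a shortened keyword list and made-up descriptions.

```r
# Shortened keyword list and example descriptions (illustrative only).
to_match <- c("BUILDING", "ROOM", "BAR", "DAY CARE")
premise  <- c("BUILDING 1", "MY ROOM", "BARBER SHOP", "STREET", "DAY CARE CENTER")

# \\b anchors each keyword at word boundaries; perl = TRUE selects PCRE.
pattern <- paste0("\\b(", paste(to_match, collapse = "|"), ")\\b")
inside  <- as.integer(grepl(pattern, premise, perl = TRUE))

print(data.frame(premise, inside))
```

Whether whole-word matching is the right rule depends on the real descriptions: it would, for example, stop STORE from matching a plural such as STORES.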

How to subset 'n' number of rows past a certain value?

I'm trying to subset a data.frame based on a 1 or 0 value in the data.frame.
Here is some sample data:
> Test
Close High Low Dn.BB MaVg Up.BB Per.BB Dn.Brk
2007-02-27 6286.1 6434.7 6270.5 6305.813 6389.679 6473.544 -0.11752900 1
2007-02-28 6171.5 6286.1 6166.2 6237.635 6377.186 6516.737 -0.23695539 1
2007-03-01 6116.0 6230.7 6038.9 6164.470 6358.129 6551.787 -0.12514308 1
2007-03-02 6116.2 6164.4 6085.6 6110.807 6341.179 6571.550 0.01170495 0
2007-03-05 6058.7 6116.2 5989.6 6047.421 6318.100 6588.779 0.02083561 0
2007-03-06 6138.5 6138.5 6058.7 6018.953 6297.907 6576.861 0.21427696 0
2007-03-07 6156.5 6167.6 6106.1 6001.139 6278.136 6555.133 0.28043853 0
2007-03-08 6227.7 6233.1 6156.5 5997.989 6264.436 6530.882 0.43106389 0
2007-03-09 6245.2 6255.8 6190.3 6003.152 6250.207 6497.262 0.48986661 0
2007-03-12 6233.3 6276.3 6219.3 6007.297 6237.421 6467.546 0.49104464 0
2007-03-13 6161.2 6240.7 6161.2 6000.401 6223.429 6446.457 0.36049188 0
Here, I would like to have something that iterates along the data.frame and then splits out the subsets based on Dn.Brk > 0. I can only think of a loop method here and am not too familiar with subsetting, so I was wondering if anyone could point me in the right direction or suggest functions or packages that could achieve this?
A little more detail below:
Sub <- rep(0,nrow(Test))
for (i in nrow(Test)){
if (Test[i,8] > 0){Sub = Test(i:i+10,1)}
}
So, the above would, at every point where Test[i,8] > 0, select, Test$Close from i:i+10.
Ideally, I'd like every sample to be stored in a separate row/column in a new df. Is that possible?
You can use sapply here:
sapply(which(Test[, 8] > 0), function(z) Test$Close[z:(z+10)])
A few things to note in the loop you provided though:
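The loop never iterates: for (i in nrow(Test)) runs once, with i equal to nrow(Test). Use seq_len(nrow(Test)) to visit every row.
Sub is overwritten on each iteration, so only the last match would survive.
i:i+10 is parsed as (i:i)+10, because : binds more tightly than +. Write i:(i+10) instead.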
If you still want to do it with a for loop, here is one way:
#### results list #####
results <- list()
# Visit every row index of Test
for (i in seq_len(nrow(Test))){
  if (Test[i,8] > 0)
  {
    # Close values from row i to row i+10 (NA past the end of the data)
    results[[i]] = Test$Close[i:(i+10)]
  }
  else {results[[i]] = "no value"}
}
This could also be further parallelisable if your dataset is huge with a package called foreach. A good intro here: http://www.vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/. You could also change "no value" to next if you want a list with only three named elements
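To get each sample into its own column of a new data frame, as the question asks, the sapply result can be labelled and converted. This is a minimal sketch on a made-up six-row Test with a shorter look-ahead window. The names n_ahead, starts and windows_df are illustrative.

```r
# Toy data: Dn.Brk flags the rows where a window should start.
Test <- data.frame(
  Close  = c(10, 11, 12, 13, 14, 15),
  Dn.Brk = c(1, 0, 1, 0, 0, 0)
)
n_ahead <- 2  # the question uses 10

# One column per flagged row, holding Close from that row onwards.
starts  <- which(Test$Dn.Brk > 0)
windows <- sapply(starts, function(i) Test$Close[i:(i + n_ahead)])
colnames(windows) <- paste0("start_", starts)
windows_df <- as.data.frame(windows)

print(windows_df)
```

Windows that run past the last row are padded with NA, because indexing a vector beyond its length returns NA.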

the condition has length > 1 and only the first element will be used while trying to use if case

I have a csv file as:
score text
1 0 RT #RealJackEdwards: (2 of) a solution. 7 st yrs in playoffs, a Cup, a Final, a Prez Trophy. Yup, Boychuk trade a disaster; Bottom 6 fwds r…
I need to write all the tweets with negative score to a different file. I am trying to use if statement as:
if(stat$score < 0 )
write.csv(stat$text, file=paste('negtweetscore.csv'), row.names=TRUE)
But after running this code I am getting the following warning message:
In if (stat$score < 0) write.csv(stat$text, file = paste("negtweetscore.csv"), :
the condition has length > 1 and only the first element will be used
You have to subset your data.frame properly:
write.csv(stat$text[stat$score<0], file=paste('negtweetscore.csv'), row.names=TRUE)
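The reason is that if expects a single TRUE or FALSE, while stat$score < 0 yields one value per row. From R 4.2.0 onwards a condition of length greater than one is an error rather than a warning. A minimal sketch of the subsetting approach on made-up data follows. The toy stat and the temporary output file are assumptions for illustration.

```r
# Toy stand-in for the tweets table.
stat <- data.frame(
  score = c(0, -2, 1, -1),
  text  = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Logical indexing keeps only the rows with a negative score.
neg <- stat[stat$score < 0, ]

# A single TRUE/FALSE condition is fine for if().
out_file <- tempfile(fileext = ".csv")
if (nrow(neg) > 0) {
  write.csv(neg, file = out_file, row.names = FALSE)
}
```

Writing the whole subset keeps the score next to each tweet. Use neg$text instead if only the text is wanted.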

Cannot coerce class ....to a data.frame error

R subject
I get a "cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame" error when trying to save the summary to a file. I have tried to fix that with as.data.frame, with no success.
Code:
library(plyr)
library(pastecs)
data <- read.table("C:\\Users\\Ron\\Desktop\\dataset.txt", header=F, col.name="A")
data.tp=turnpoints(data$A)
print(data.tp)
Turning points for: data$A
nbr observations : 5990
nbr ex-aequos : 51
nbr turning points: 413 (first point is a pit)
E(p) = 3992 Var(p) = 1064.567 (theoretical)
data.sum=summary(data.tp)
print(data.sum)
  point type        proba     info
1    11  pit 7.232437e-15 46.97444
2    21 peak 7.594058e-14 43.58212
3    30  pit 3.479857e-27 87.89303
4    51 peak 5.200612e-29 93.95723
5    62  pit 7.594058e-14 43.58212
6    70 peak 6.213321e-14 43.87163
7    81  pit 6.276081e-16 50.50099
8    91 peak 5.534016e-23 73.93602
.....................................
write.table(data.sum, file = "C:\\Users\\Ron\\Desktop\\datasetTurnP.txt")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class "c("summary.turnpoints", "turnpoints")" to a data.frame
In addition: Warning messages:
1: package ‘plyr’ was built under R version 3.0.1
2: package ‘pastecs’ was built under R version 3.0.1
How can I save these summary results to a text file?
Thank you.
Look at the Value section of:
?pastecs::summary.turnpoints
It should be clear that this will not be a set of lists all of which have the same length. Hence the error message. So rather than asking for the impossible, ... tell us what you wanted to save.
It's actually not impossible, just not possible with write.table, since it's not a data frame. The dump function writes an ASCII structure(...) representation of that summary object. Note that dump expects the object's name as a character string:
dump("data.sum", file="dump_data_sum.asc")
The file can then be source()-ed to recreate the object.
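The dump and source round trip can be checked on any classed list. The sketch below uses a made-up object with a hypothetical class (not a real pastecs result) and also shows capture.output as one way to save the printed summary as plain text. The file names are temporary files chosen for illustration.

```r
# Made-up classed list standing in for the summary object.
data.sum <- structure(
  list(point = c(11, 21), type = c("pit", "peak")),
  class = c("summary.toy", "toy")
)

# dump() takes the object's NAME and writes R code that recreates it.
dump_file <- tempfile(fileext = ".R")
dump("data.sum", file = dump_file)

# source() the file into a separate environment to restore the object.
restored <- new.env()
source(dump_file, local = restored)
print(identical(restored$data.sum, data.sum))

# Alternative: save exactly what print() shows, as text.
txt_file <- tempfile(fileext = ".txt")
writeLines(capture.output(print(data.sum)), txt_file)
```

The dumped file is meant to be read back by R, while the capture.output file is meant to be read by people.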
