R Script - TopGO Package

I am an R beginner. I am using the {TopGO} R Package for performing enrichment analysis for gene ontology.
In the last part of the script (#create topGO object), I get this error:
Error in .local(.Object, ...) : allGenes must be a factor with 2 levels
Could you please help me with this? I really appreciate your help. Thanks
Below is the R script I used. I would also like to upload my starting dataset, but I don't know how to do it.
library(tidyverse)
library(data.table)
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("topGO")
library(topGO)
# make table from meta data holding grouping information for each sample
# in this case landrace, cultivar ...
getwd()
setwd("C:/Users/iaia-/Desktop/")
background <- fread('anno.variable.out.txt') %>%
as_tibble() %>%
#the ID is the actual GO term
mutate(id = as.numeric(id)) %>%
na.omit() %>%
#according to pannzer2 ARGOT is the best scoring algorithm
filter(str_detect(type ,'ARGOT'),
#choose a PPV value
#according to the pannzer2 manual, there is no 'best' option
# Philipp advised to use 0.4, 0.6, 0.8
# maybe I write a loop to check differences later
PPV >=0.6 ) %>%
#create a ontology column to have the 3 ontology options, which topGO supports
# 'BP' - biological process, 'MF' - molecular function, 'CC' - cellular component
mutate(ontology=sapply(strsplit(type,'_'),'[',1)) %>%
#select the 3 columns we need
dplyr::select(id,ontology, qpid) %>%
#add GO: to the GO ids
dplyr::mutate(id=paste0('GO:',id))
foreground <- fread('LIST.txt',header=F)
colnames(foreground) <- c("rowname")
#rename type to group to prevent confusion
#dplyr::rename(group=type)
#for all groups together
all_results <- tibble()
for (o in unique(background$ontology)){
#filter background for a certain ontology
ont_background <- filter(background, ontology==o)} #BIOLOGICAL PROCESS
annAT <- split(ont_background$qpid,ont_background$id)
#filter foreground for a group
fg_genes <- foreground %>% pull(rowname)
ont_background <- ont_background %>%
mutate(present=as.factor(ifelse(qpid %in% fg_genes,1,0))) %>%
dplyr::select(-id) %>% distinct() %>%
pull(present, name = qpid)
#create topGO object
GOdata <-new("topGOdata", ontology = o, allGenes = ont_background, nodeSize = 5,annot=annFUN.GO2genes,GO2genes=annAT)
weight01.fisher <- runTest(GOdata, statistic = "fisher")
results <- GenTable(GOdata, classicFisher=weight01.fisher,topNodes=ifelse(length(GOdata@graph@nodes) < 30,length(GOdata@graph@nodes),30)) %>%
dplyr::rename(pvalue=6) %>%
mutate(ontology=o,
pvalue=as.numeric(pvalue))
all_results <- bind_rows(all_results,results)
}
#all_results %>% pull(GO.ID) %>% writeClipboard()
resultspvalue<- all_results %>% dplyr::select(GO.ID,pvalue)
write_tsv(resultspvalue, "GOTermsrep_pvalue.txt")

It's tough to say for sure without knowing what ont_background looked like before you modified it, but based on related threads I think the problem lies in the final ont_background object that you pass to new("topGOdata", ...) in your "#create topGO object" chunk.
The allGenes argument (ont_background) needs to be a named vector whose names() are the gene IDs, in the same format you use in the other arguments: either a numeric vector of p-values or, as the error message indicates, a factor with exactly two levels.
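One way to check this before building the topGO object is to inspect the factor's levels. The sketch below uses made-up gene IDs (universe and fg_genes are hypothetical stand-ins for the qpid column and the foreground list). It illustrates that as.factor(ifelse(...)) ends up with a single level whenever none of the background IDs match the foreground, which is one way this error could arise.

```r
# Hypothetical IDs, for illustration only
universe <- c("gene1", "gene2", "gene3", "gene4")  # background gene IDs
fg_genes <- c("gene2", "gene4")                    # foreground gene IDs

# How many background IDs are found in the foreground?
n_matched <- sum(universe %in% fg_genes)

# Named factor in the shape the error message asks for: levels 0 and 1
all_genes <- factor(as.integer(universe %in% fg_genes), levels = c(0, 1))
names(all_genes) <- universe

# With no match at all, as.factor() alone leaves just one level
no_match <- as.factor(ifelse(universe %in% c("other_id"), 1, 0))

print(n_matched)
print(table(all_genes))
print(nlevels(no_match))
```

If n_matched turns out to be 0 on the real data, the IDs in the two files are probably written differently (for example extra whitespace or a suffix), and that would be worth checking first.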

Related

Issue computing AUC with pROC package

I'm trying to use a function that calls on the pROC package in R to calculate the area under the curve for a number of different outcomes.
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
To do this, I am intending to refer to outcome names in a vector (much like below).
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
However, I am having problems defining variables to input into this function. When I do this, I generate the error: "Error in roc.default(response, predictor, auc = TRUE, ...): 'response' must have two levels". However, I can't work out why, as I reckon I only have two levels...
I would be so happy if anyone could help me!
Here is a reproducible code from the iris dataset in R.
library(pROC)
library(datasets)
library(dplyr)
# Use iris dataset to generate binary variables needed for function
df <- iris %>% dplyr::mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4)==4),
outcome_2 = as.numeric(ntile(Petal.Length, 4)==4))%>%
dplyr::rename(predictor_1 = Petal.Width)
# Inspect binary outcome variables
df %>% group_by(outcome_1) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
df %>% group_by(outcome_2) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
# Define variables to go into function
outcome_var <- df %>% dplyr::select(outcome[[1]])
predictor_var <- df %>% dplyr::select(predictor_1)
# Use function - first line works but not last line!
proc_auc(df$outcome_1, df$predictor_1)
proc_auc(outcome_var, predictor_var)

outcome_var and predictor_var are data frames with one column each, which means they cannot be passed directly to the auc function; it expects vectors.
Extract the columns by name and it will work.
proc_auc(outcome_var$outcome_1, predictor_var$predictor_1)
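Since the outcome names are stored as strings in a vector, the column can also be extracted by name with double brackets. The sketch below uses only base R and a toy binary outcome (the Sepal.Length > 6 cut-off is an arbitrary assumption) to show the difference between the one-column data frame and the plain vector.

```r
df <- iris
df$outcome_1 <- as.numeric(df$Sepal.Length > 6)  # toy binary outcome
outcome <- c("outcome_1")

as_df  <- df[outcome[[1]]]     # single brackets: one-column data frame
as_vec <- df[[outcome[[1]]]]   # double brackets: plain numeric vector

print(class(as_df))
print(class(as_vec))
```

With the question's data, a call such as proc_auc(df[[outcome[[1]]]], df$predictor_1) would then receive vectors rather than data frames.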

You'll have to familiarize yourself with dplyr's non-standard evaluation, which makes it pretty hard to program with. In particular, you need to realize that passing a variable name is an indirection, and that there is a special syntax for it.
If you want to stay with the pipes / non-standard evaluation, you can use the roc_ function, which follows an older naming convention for functions that take column names as character strings instead of unquoted column names.
proc_auc2 <- function(data, outcome_var, predictor_var) {
pROC::auc(pROC::roc_(data, outcome_var, predictor_var))
}
At this point you can pass the actual column names to this new function:
proc_auc2(df, outcome[[1]], "predictor_1")
# or equivalently:
df %>% proc_auc2(outcome[[1]], "predictor_1")
That being said, for most use cases you probably want to follow @druskacik's answer and use standard R evaluation.

How to extract values across all years?

I am following the code given in: https://github.com/CornellLabofOrnithology/ebird-best-practices/blob/master/03_covariates.Rmd
On arriving at this code:
# extract landcover values within neighborhoods, only needed most recent year
lc_extract_ext <- landcover[[paste0("y", max_lc_year)]] %>%
exact_extract(r_cells, progress = FALSE)
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
tibble(id = r_cells$id, data = .)
lc_extract_pred <- unnest(lc_extract_cnt, data)
I wished to extract landcover values within neighborhoods across all years.
Instead of using this bit of code max_lc_year, I used this:
# I have thought of using
all_lc_year <- names(landcover) %>%
str_extract("[0-9]{4}") %>%
as.integer()
To extract all years, however, it returns this error at this piece of code:
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
tibble(id = r_cells$id, data = .)
Error: Problem with mutate() input landcover.
x object 'value' not found
i Input landcover is value.
I'm thinking I would need to loop, either with map or a for loop, over the years and bind together the resulting data frames.
EDIT:
Noticed the previous steps did not work. Here they are, working and corrected!
I have uploaded the landcover .tif files onto my Github here:
https://github.com/lime-n/Landcover.git
You can then implement this code to stack them together in R:
library(sf)
library(raster)
library(exactextractr)
library(viridis)
library(tidyverse)
# resolve namespace conflicts
select <- dplyr::select
map <- purrr::map
projection <- raster::projection
landcover <- list.files("insert_folder_name_here", "^modis_mcd12q1_umd",
full.names = TRUE) %>%
stack()
# label layers with year
landcover <- names(landcover) %>%
str_extract("(?<=modis_mcd12q1_umd_)[0-9]{4}") %>%
paste0("y", .) %>%
setNames(landcover, .)
landcover
neighborhood_radius <- 5 * ceiling(max(res(landcover))) / 2
I have uploaded the r data at my github here to produce the r_cells data: https://github.com/lime-n/Landcover/blob/master/prediction-surface.tif
and the following code to get the r_cells:
r <- raster("data/prediction-surface.tif")
r_centers <- rasterToPoints(r, spatial = TRUE) %>%
st_as_sf() %>%
transmute(id = row_number())
r_cells <- st_buffer(r_centers, dist = neighborhood_radius)
I understand the process to download and implement may take a few minutes. However, this code has been bugging me for weeks! any help is appreciated.

I found that running the code separately for each year, using all_lc_year[1], all_lc_year[2], etc., and then combining all the rows with rbind(lc_extract_pred, lc_extract_pred_two, lc_extract_pred_three #...etc) works.
EDIT:
I found that in this code:
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>%
tibble(id = r_cells$id, data = .)
A similar problem has been reported elsewhere.
I was trying to assign many different columns to a single column, landcover, which is why the code never worked. The fix was to replace landcover with the column names that count() needs, and to start from
lc_extract_ext <- landcover %>%
instead of only the most recent year, so that all columns are included.
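The per-year approach can also be written as a loop that binds the results together. The sketch below is base R only; extract_year() is a hypothetical stand-in for the exact_extract() and count() steps on a single layer, and the layer names and values are made up.

```r
# Hypothetical layer names, following the "y<year>" labelling used above
years <- c("y2017", "y2018")

# Stand-in for the per-year exact_extract() + count() step (made-up values)
extract_year <- function(yr) {
  data.frame(id = 1:2, landcover = c(1L, 5L), n = c(10L, 3L))
}

# Run the step once per year, tag each result with its year, then bind rows
per_year <- lapply(years, function(yr) cbind(year = yr, extract_year(yr)))
lc_extract_pred_all <- do.call(rbind, per_year)

print(lc_extract_pred_all)
```

Keeping the year as its own column avoids having to track separately named objects such as lc_extract_pred_two.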

Comparing multiple variables in more than two groups with t.test

I tried to do a t-test comparing values between time1/2/3.. and threshold.
here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each two values by:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error, even when I limited the condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I wish to loop a t-test over each time column against the threshold column, per condition, and then use broom::tidy to get the results in tidy format. My approaches apparently aren't working; any advice on improving my code is much appreciated.

An alternative route would be to define a function with the required options for t.test() up front, then build one data frame per pair of variables (i.e. each combination of 'time*' and 'threshold'), nest these into a list column, and use map() with the relevant functions from 'broom' to simplify the output.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
unnest(glances, .drop=TRUE)
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.

We could remove the threshold from the select and then reintroduce it by creating a data.frame which would go into the formula object of t.test
library(tidyverse)
time.df1 %>%
select_if(is.numeric) %>%
select(-threshold) %>%
map_df(~ data.frame(time = .x, time.df1['threshold']) %>%
broom::tidy(t.test(. ~ threshold)))
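Note that the formula interface of t.test() needs a grouping factor with exactly two levels, which threshold is not. A variant that sidesteps this is to pass each time column and threshold as two separate samples. The sketch below is base R only and assumes an unpaired Welch two-sample test is what is wanted, which the question does not confirm; ttest_res is a name introduced here for illustration.

```r
time.df1 <- data.frame(
  condition = c("A", "B", "C", "A", "C", "B"),
  time1 = c(1, 3, 2, 6, 2, 3),
  time2 = c(1, 1, 2, 8, 2, 9),
  time3 = c(-2, 12, 4, 1, 0, 6),
  time4 = c(-8, 3, 2, 1, 9, 6),
  threshold = c(-2, 3, 8, 1, 9, -3)
)

time_cols <- grep("^time", names(time.df1), value = TRUE)

# One two-sample (Welch) t-test per time column against threshold
ttest_rows <- lapply(time_cols, function(col) {
  tt <- t.test(time.df1[[col]], time.df1$threshold)
  data.frame(time = col,
             statistic = unname(tt$statistic),
             p.value = tt$p.value)
})
ttest_res <- do.call(rbind, ttest_rows)

print(ttest_res)
```

Running the same loop within each condition would leave only two observations per test here, so the per-condition comparison probably needs more data or a model-based approach.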

How to create a dataframe of node attributes when all nodes don't have an attribute (doing networks with surveys)?

I've been working on this for a while to no avail. I'm using statnet to create some networks in R from survey data. The way the networks were measured in the survey allowed respondents to list network contacts who were not themselves surveyed. As it turned out, most of the listed contacts were surveyed and just a few were not. I'm trying to map colors to nodes based on other survey responses.
This is a replication of my issue. I want to label the nodes that have available attributes with their attribute and label those without as 'unknown' or NA or ''.
install.packages('statnet')
library(statnet)
mydata <- data.frame(
src=c('bob','sue','tom','john','sheena'),
trg=c('tom','billy','billy','bob','chris'),
vary_1=c(1,2,2,3,1)
)
net_1 <- network(mydata[1:2])
##### My attempt using dplyr to create labels ####
# it doesn't work
labs <- mydata %>%
mutate(flag = .[,1] %in% .[,2]) %>%
gather(key,value,-flag,-vary_1) %>%
mutate(i=ifelse(.$key=='trg',.$vary_1==NA,.$vary_1)) %>%
select(value) %>%
unique() %>%
.[,1] #### I think this approach is something close
set.seed(123)
gplot(net_1,vertex.cex = degree(net_1),
label=labs) #labels using the labs created above
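One possible way to build such a node attribute table is sketched below in base R. It assumes that vary_1 describes the respondent in src, which the question implies but does not state; node_attrs is a name introduced here for illustration.

```r
mydata <- data.frame(
  src = c("bob", "sue", "tom", "john", "sheena"),
  trg = c("tom", "billy", "billy", "bob", "chris"),
  vary_1 = c(1, 2, 2, 3, 1)
)

# Every node that appears as a respondent or as a named contact
node_names <- unique(c(as.character(mydata$src), as.character(mydata$trg)))

# Look up each node among the respondents; unsurveyed nodes get NA
node_attrs <- data.frame(
  name = node_names,
  vary_1 = mydata$vary_1[match(node_names, as.character(mydata$src))]
)

# Text label: the attribute where known, "unknown" otherwise
node_attrs$label <- ifelse(is.na(node_attrs$vary_1),
                           "unknown",
                           as.character(node_attrs$vary_1))

print(node_attrs)
```

The vertex order inside the network object may differ from node_names, so the labels would need to be matched to the network's own vertex names before plotting.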

Using dplyr within a function, Grouping Error with function arguments

Below I have a working example of what I would like the function to do, and then script for the function, noting where the Error occurs.
The error message is:
Error: index out of bounds
Which I know usually means R can’t find the variable that’s being called.
Interestingly, in my function example below, if I only group by my subgroup_name (which is passed to the function and becomes a column in the newly created dataframe) the function will successfully regroup that variable, but I also want to group by a newly created column (from the melt) called variable.
Similar code used to work for me using regroup(), but that has been deprecated. I am trying to use group_by_() but to no avail.
I have read many other posts and answers and experimented several hours today but still not successful.
# Initialize example dataset
database <- ggplot2::diamonds
database$diamond <- row.names(diamonds) # needed for melting
subgroup_name <- "cut" # can replace with "color" or "clarity"
subgroup_column <- 2 # can replace with 3 for color, 4 for clarity
# This works, although it would be preferable not to need separate variables for subgroup_name and subgroup_column number
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by(cut, variable) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
# This does not work, I am expecting the same output as above
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, variable) %>% # problem appears to be with finding "variable"
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)

From the NSE vignette:
If you also want to output variables to vary, you need to pass a list
of quoted objects to the .dots argument:
Here, variable should be quoted:
subgroup_analysis <- function(database,...){
df <- database %>%
select(diamond, subgroup_column, x,y,z) %>%
melt(id.vars=c("diamond", subgroup_name)) %>%
group_by_(subgroup_name, quote(variable)) %>%
summarise(value = round(mean(value, na.rm = TRUE),2))
print(df)
}
subgroup_analysis(database, subgroup_column, subgroup_name)
As mentioned by @RichardScriven, if you plan to assign the result to a new variable, you may want to remove the print call at the end and just write df, or not assign df at all inside the function.
Otherwise the result prints even when you do x <- subgroup_analysis(...)
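In current dplyr releases the underscored verbs such as group_by_() are deprecated. A sketch of the same idea using across() and all_of() follows; it assumes dplyr 1.0 or later, and long_df and subgroup_summary() are illustrative names with made-up values standing in for the melted diamonds table.

```r
library(dplyr)

# Toy long-format data standing in for the melted table (made-up values)
long_df <- data.frame(
  cut      = c("Fair", "Fair", "Good", "Good"),
  variable = c("x", "y", "x", "y"),
  value    = c(1.0, 2.0, 3.0, 5.0)
)

# subgroup_name is a column name passed as a string; "variable" is fixed
subgroup_summary <- function(data, subgroup_name) {
  data %>%
    group_by(across(all_of(c(subgroup_name, "variable")))) %>%
    summarise(value = round(mean(value, na.rm = TRUE), 2), .groups = "drop")
}

res <- subgroup_summary(long_df, "cut")
print(res)
```

Because the function returns the summary instead of printing it, assigning the result with x <- subgroup_summary(...) stays silent.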
