Subset clusterProfiler compareClusterResult object in R
I am using the clusterProfiler package in R to do gene set enrichment analyses. I have the basic code working, but I would like to subset the results object compareClusterResult to only include a specific subset of pathways (i.e. keeping only non-disease pathways). I have created a list of non-disease pathways using the gage package, but cannot figure out how to subset the compareClusterResult object based on that list.
Here is a small subset of the data I'm analyzing:
library(clusterProfiler)
dput(de_list)
list(fb = c("K08193", "K09851", "K07874", "K14847", "K14793",
"K06670", "K19009", "K13783", "K17963", "K15076", "K08492", "K15262",
"K00901", "K00078", "K15133", "K21407", "K13566", "K14454", "K23565",
"K09341", "K22414", "K00069", "K00069", "K07192", "K10276", "K11348",
"K10389", "K06054", "K06590", "K06678", "K03671", "K17302", "K08155",
"K23387", "K02951", "K12481", "K11434", "K18461", "K23439", "K13208",
"K16803", "K20793", "K06269", "K16749", "K12737", "K14264", "K00857",
"K21863", "K04459", "K01183", "K12856", "K23616", "K23195", "K09188",
"K20193", "K21249", "K05765", "K04703", "K12259", "K24014", "K10141",
"K11099", "K02263", "K01784", "K11884", "K24195", "K14810", "K15113",
"K15283", "K14999", "K14776", "K11433", "K00228", "K03253", "K01410",
"K05768", "K13288", "K07432", "K13718", "K11587", "K02912", "K15235",
"K04351", "K23893", "K20730", "K10310", "K00558", "K15837", "K01205",
"K11660", "K12021", "K23214", "K20791", "K07189", "K01507", "K16682",
"K18163", "K13142", "K23901", "K17501"), mg = c("K19788", "K07874",
"K00128", "K14793", "K06670", "K19009", "K13783", "K17963", "K19476",
"K00078", "K13915", "K21407", "K14719", "K13524", "K22414", "K00069",
"K00069", "K02178", "K12172", "K12866", "K13123", "K24254", "K17302",
"K08155", "K02951", "K12481", "K11434", "K13208", "K17602", "K10571",
"K13758", "K16749", "K00857", "K21863", "K06839", "K03241", "K04459",
"K18200", "K01183", "K23616", "K10442", "K17563", "K05765", "K12259",
"K10141", "K19326", "K10049", "K01784", "K00604", "K24195", "K15113",
"K15283", "K19527", "K14999", "K01410", "K11587", "K02912", "K13109",
"K15235", "K09595", "K23893", "K10310", "K11981", "K08858", "K00558",
"K01205", "K11583", "K11660", "K05291", "K12021", "K18660", "K10393",
"K23214", "K20791", "K06072", "K18163", "K17501", "K09848", "K23336",
"K03064", "K02366", "K02377", "K14971", "K20290", "K13240", "K20185",
"K01109", "K13125", "K16678", "K07964", "K05397", "K15175", "K08705",
"K08561", "K02519", "K17824", "K13122", "K15338", "K12821", "K08752"
))
xx <- compareCluster(de_list, fun="enrichKEGG",
organism="ko", pvalueCutoff=0.05)
And the list of pathway IDs that I'd like to keep:
library(gage)
kg.ko = kegg.gsets("ko") # ("ko" is KEGG ortholog pathway)
kegg.gs = kg.ko$kg.sets[kg.ko$sigmet.idx] # keep only metabolic and signaling pathways
kegg.gs_names <-names(kegg.gs)
kegg.gs_names <- as.data.frame(gsub( " .*$", "", kegg.gs_names ))
names(kegg.gs_names) <- "ID"
So, I'd like to use kegg.gs_names to subset xx while maintaining the structure of the clusterProfiler object for downstream plotting. The corresponding entry in xx is xx@compareClusterResult$ID.
Here is the vignette (http://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html). I'm trying to produce the plot in 15.7 without the disease pathways included.
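One possible way to do the subsetting asked about above is to filter the data frame stored in the compareClusterResult slot and write it back into the same object, so the class and the other slots stay untouched. The sketch below is only an illustration under assumptions: MockCompareClusterResult, xx_mock, keep_ids and subset_by_id are hypothetical names, and the mock class stands in for the real clusterProfiler object so the code runs without the package installed.

```r
library(methods)

# Hypothetical stand-in for a compareClusterResult object: a minimal S4 class
# with a data frame slot of the same name (assumption, not the real class).
setClass("MockCompareClusterResult",
         representation(compareClusterResult = "data.frame"))

xx_mock <- new("MockCompareClusterResult",
  compareClusterResult = data.frame(
    Cluster = c("fb", "fb", "mg"),
    ID = c("ko00010", "ko05200", "ko00010"),  # made-up example rows
    stringsAsFactors = FALSE))

# Plays the role of kegg.gs_names from the question (one column named ID)
keep_ids <- data.frame(ID = c("ko00010", "ko00020"), stringsAsFactors = FALSE)

# Keep only rows whose ID is in `ids`; the object itself is returned,
# so its class and remaining slots are preserved.
subset_by_id <- function(obj, ids) {
  res <- obj@compareClusterResult
  obj@compareClusterResult <- res[res$ID %in% ids, , drop = FALSE]
  obj
}

xx_kept <- subset_by_id(xx_mock, keep_ids$ID)
```

With the real object this would presumably be called as subset_by_id(xx, kegg.gs_names$ID) before plotting. It assumes the IDs in xx@compareClusterResult$ID and in kegg.gs_names$ID use the same format, which is worth checking with head() on both before filtering.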
Related
Creating a for loop in R from a list
I'm trying to create a for loop in R to iterate through a list of genetic variants, labeled with rsIDs, and filter the results by patient ID.
ace2_snps <- c("rs4646121", "rs4646127", "rs1996225", "rs2158082", "rs4830974", "rs148271868", "rs113539251", "rs4646135", "rs4646179", "rs2301693", "rs16980031", "rs12689012", "rs4646141", "rs142049267", "rs16979971", "rs12007623", "rs4646182", "rs147214574", "rs6632677", "rs139469582", "rs149000434", "rs148805807", "rs112032651", "rs144314464", "rs147077778", "rs182259051", "rs112621533", "rs35803318", "rs35304868", "rs113848176", "rs145345877", "rs12009805", "rs233570", "rs73635824", "rs73635823", "rs4646142", "rs4646157", "rs2074192", "rs79878075", "rs144239059", "rs67635467", "rs183583165", "rs137910448", "rs116419580", "rs2097723", "rs4646170")
for (snps in ace2_snps) {
  genotype_snps <- as.data.frame(bgen_ACE2$data[snps,,])
  idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
  id <- as.character(idfromcsv[[1]])
  filtered_snps <- genotype_snps[id,]
}
I need to run genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,]) for each rsID, and then I'd like to name the variable filtered_snps after its rsID, with the rsID in place of "snps" in the variable name for each variant. I'm not very familiar with R syntax. Can anyone give me some tips?
For one variant, the process would go like this:
genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,])
idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
id <- as.character(idfromcsv[[1]])
filtered <- genotype_rs146217251[id,]
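A common alternative to creating one variable per rsID is to collect the per-variant results in a named list, keyed by rsID. The sketch below illustrates that idea only; geno is a small mock array standing in for bgen_ACE2$data (assumed here to be indexed as variant x sample x genotype column), and the sample IDs and values are made up.

```r
# Mock stand-in for bgen_ACE2$data (hypothetical values):
# dimensions are variants x samples x genotype columns.
snp_ids <- c("rs4646121", "rs4646127")
sample_ids <- c("p1", "p2", "p3")
geno <- array(seq_len(2 * 3 * 3), dim = c(2, 3, 3),
              dimnames = list(snp_ids, sample_ids, c("g0", "g1", "g2")))

# Patient IDs; in the question these come from read.csv(), which only
# needs to be called once, outside the loop.
id <- c("p1", "p3")

# One list element per variant, named by its rsID
filtered <- list()
for (snp in snp_ids) {
  genotype_snp <- as.data.frame(geno[snp, , ])
  filtered[[snp]] <- genotype_snp[id, ]
}
```

Each result is then available as filtered[["rs4646121"]] (or filtered$rs4646121), which avoids building variable names dynamically.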
R GWASTools createDataFile() error: "Error ... %in% names(...) is not TRUE"
I'm trying to create an intensity GDS file from existing Illumina files using the createDataFile() function of GWASTools. I tried this:
col.nums <- as.integer(c(1,11,12,13,14))
names(col.nums) <- c("snp", "BAlleleFreq", "LogRRatio", "a1", "a2")
variables <- c("genotype","BAlleleFreq","LogRRatio")
intens <- createDataFile(path="/pathexample/", "/pathexample/IntensityGDS", file.type="gds", variables=variables, snp.annotation=snpAnnot, scan.annotation=scanAnnot, sep.type=",", skip.num=12, col.total=14, col.nums=col.nums, scan.name.in.file=-1, allele.coding="nucleotide", precision="single", compress="LZMA_RA:1M", compress.geno="", compress.annot="LZMA_RA", array.name=NULL, genome.build=NULL, diagnostics.filename="createDataFile.diagnostics.RData", verbose=TRUE)
The error I'm getting is:
Error: all(c("snpID", "chromosome", "position", "snpName") %in% names(snp.annotation)) is not TRUE
However, I know those column names are in both the snp.annotation SnpAnnotationDataFrame (aka snpAnnot) and the underlying data frame I used to create that SnpAnnotationDataFrame. E.g.:
varLabels(snpAnnot)
yields
"snpName" "chromosome" "position" "rsID_real" "snpID"
Thanks!!
Apparently the problem was that createDataFile() takes regular R data frames in the snp.annotation and scan.annotation arguments, not objects of class SnpAnnotationDataFrame. That is, there is no need to run SnpAnnotationDataFrame() on your data frame; just pass in the data frame itself.
Pathview R: Mapping known transcripts to a KEGG pathway diagram representing FoldChange
I'm struggling with:
library(pathview)
I have a data frame ("T3") with the following column names and possible identifiers to map fold changes to a significantly enriched KEGG pathway:
KEGGid SYMBOL Human_ENSEMBL Human_ENTREZID Mouse_ensembl_gene_id Mouse_ENTREZID
It has taken a long time to learn how to get all of these possible IDs, but unfortunately, when I try to map them to the relevant KEGG nodes by assigning identifiers as rownames, I do not get a result. Error message:
Warning: None of the genes or compounds mapped to the pathway!
Argument gene.idtype or cpd.idtype may be wrong.
Error in select(db.obj, keys = in.ids, keytype = in.type, columns = c(in.type, : unused arguments (keys = in.ids, keytype = in.type, columns = c(in.type, out.type))
Error in $<-.data.frame(*tmp*, "labels", value = c("", "", "", "", : replacement has 82 rows, data has 89
This is frustrating because T3 contains all of the transcripts which are annotated to PI3K signaling, so they should map. None of the identifiers I have been using seem to work, yet I know that these transcripts map. For example, using "AKT3", which is in the list, we can highlight this node online [https://www.genome.jp/kegg-bin/show_pathway?hsa04151+10000], where the +10000 at the end of the address specifies the AKT node to be highlighted in red.
Command lines, for example:
SYMBOL <- c("AKT3", "AKT3")
Human_ENSEMBL<- c("ENSG00000117020","ENSG00000275199")
Human_ENTREZID <-c("10000", "10000")
Mouse_ensembl_gene_id <- c("ENSMUSG00000019699", "ENSMUSG00000019699")
Mouse_entrezgene <- c(23797, 23797)
log2FoldChange <-c(-0.676668324, -0.676668324)
T3 <- c(SYMBOL, Human_ENSEMBL, Human_ENTREZID, Mouse_ensembl_gene_id, Mouse_entrezgene, log2FoldChange)
row.names(T3) <- T3$SYMBOL ##For example here using SYMBOL but I have tried a lot of the other identifiers
pv.out <- pathview(gene.data = T3, pathway.id = "hsa04151", out.suffix = "Control vs Treatment" )
Thanks for taking the time to help
Mark
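One thing that may be worth checking in the example above: c(SYMBOL, Human_ENSEMBL, ...) produces a single unnamed character vector, not a data frame, so T3$SYMBOL and row.names(T3) cannot work on it. As far as the pathview documentation describes it, gene.data is expected to be a numeric vector named by gene ID (or a matrix-like object with gene IDs as row names), with Entrez IDs as the default gene.idtype. The sketch below only builds such a vector from the values in the question; T3u and gene_fc are hypothetical names, the pathview() call is left commented out, and this is not guaranteed to resolve the select() error shown above.

```r
# Build a proper data frame from the example values in the question
T3 <- data.frame(
  SYMBOL = c("AKT3", "AKT3"),
  Human_ENTREZID = c("10000", "10000"),
  log2FoldChange = c(-0.676668324, -0.676668324),
  stringsAsFactors = FALSE)

# Keep one row per Entrez ID so the names of the vector are unique
T3u <- T3[!duplicated(T3$Human_ENTREZID), ]

# Numeric vector of fold changes, named by human Entrez gene ID
gene_fc <- setNames(T3u$log2FoldChange, T3u$Human_ENTREZID)

# Assumed usage (requires the pathview package and network access):
# pv.out <- pathview(gene.data = gene_fc, pathway.id = "hsa04151",
#                    species = "hsa", out.suffix = "Control_vs_Treatment")
```

Dropping duplicated IDs is just one choice; averaging the fold changes per gene would be another reasonable way to get one value per ID.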
Nested loop not working to gather data from NOAA
I'm using the R package rnoaa (along with its other required packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
  for(bid in lid){
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid.year.data <- data.frame(bid.year$data)
    write.csv(bid.year.data, file='cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not use the loop variables in the file name; it just names the file bid_range.csv.
This error message shows that there are no data for the given station id in 1990. Because you were using a for loop, it stops as soon as it hits an error. Here I introduce the use of the tidyverse to download the NOAA buoy data. Many of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "grid" containing all combinations of id and year
The expand function from tidyr can create all combinations of the given values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
  select(id) %>%
  expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data, and make the function suitable for map2
We will use map2 to loop through all combinations of id and year, which it passes as its .x and .y arguments. We therefore reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets an error, it stores the error message and moves on to the next combination rather than breaking.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
  buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign names to the elements of the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data have been downloaded into wlevel_data. Each element in wlevel_data has two parts. $result holds the data if the download was successful and NULL otherwise. $error holds NULL if the download was successful and the error message otherwise.
Step 4: Access the data
transpose can turn a list "inside out", so wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error messages
wlevel_error <- wlevel_data2$error
# Get the results
wlevel_result <- wlevel_data2$result
# Remove NULL elements in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
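The question also asked why every file was named bid_range.csv: a quoted string is taken literally, so the file name has to be built from the values, for example with paste0. The sketch below shows one way to write each successful download to its own CSV. It assumes each element of wlevel_result2 carries its data frame in $data (as the question's own code suggests); results is a small mock of that list with made-up station IDs and values, and out_dir is a hypothetical output folder.

```r
# Mock of wlevel_result2: a list named "<id>_<year>", each element holding
# a data frame in $data (hypothetical IDs and values).
results <- list(
  "8410140_1990" = list(data = data.frame(time = 1:2, wlevel = c(0.1, 0.2))),
  "8410140_1991" = list(data = data.frame(time = 1:2, wlevel = c(0.3, 0.4))))

# Hypothetical output folder; a temporary directory keeps the sketch tidy
out_dir <- file.path(tempdir(), "wlevel")
dir.create(out_dir, showWarnings = FALSE)

# Build each file name from the element name instead of a fixed string
for (nm in names(results)) {
  out_file <- file.path(out_dir, paste0(nm, ".csv"))
  write.csv(results[[nm]]$data, file = out_file, row.names = FALSE)
}

written <- list.files(out_dir)
```

Because the list names were already set to id_year in Step 3, they can be reused directly as file names.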
Get the most expressed genes from one .CEL file in R
In R, the limma package can give you a list of differentially expressed genes. How can I simply get all the probesets whose signal intensity is above a threshold? Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything is fine: you have many .CEL files and it all works.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all but one .CEL file and run the script from scratch, so that the celData object has only 1 sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package. But if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one expected by the library.
If you run pa.calls() with gcrma.ExpressionSet as the parameter, everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing:
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette, vignette("ExpressionSetIntroduction"), also available on the Biobase landing page. In particular, the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So:
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
              row col
213477_x_at 22779  24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
Use justGCRMA() rather than ReadAffy() as a faster and more memory-efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.
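Once the expression matrix has been extracted with exprs(), ranking or thresholding probesets is plain matrix work. The sketch below illustrates this on a small mock matrix rather than real data; expr, the probe names, the sample names and the threshold of 8 are all made up for the example.

```r
# Mock expression matrix standing in for exprs(eset):
# rows are probesets, columns are samples (hypothetical values).
expr <- matrix(
  c(5.1, 12.3, 8.7, 3.2, 11.0,
    4.9, 12.9, 9.1, 2.8, 10.5),
  nrow = 5,
  dimnames = list(paste0("probe_", 1:5), c("sampleA.CEL", "sampleB.CEL")))

# Most expressed probesets in a single sample
one_sample <- expr[, "sampleA.CEL"]
top2 <- names(sort(one_sample, decreasing = TRUE))[1:2]

# Probesets above an intensity threshold in that sample
above <- names(one_sample)[one_sample > 8]

# Most expressed probesets across a group of samples, ranked by mean
group_mean <- rowMeans(expr)
top2_group <- names(sort(group_mean, decreasing = TRUE))[1:2]
```

For a real group, the columns would first be restricted to the samples of interest, for example expr[, control_columns], where control_columns is whatever vector of sample names defines the group.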