Issue running purrr::possibly for error recovery inside ~ tibble - r

I was working through an example from Hadley Wickham's lecture on purrr, "The Joy of Functional Programming (for Data Science)" (available on YouTube), and wanted to add error recovery using possibly() to handle any zip files that fail to be read.
A bit of his code:
paths <- dir_ls("NSFopen_Alldata/", glob = "*.zip")
files<- map_dfr(paths, ~ tibble(path=.x, files= unzip(.x, list = TRUE)$Name))
Adding error recovery with possibly():
unzip_safe <- possibly(.f= unzip, otherwise = NA)
files<- map_dfr(paths, ~ tibble(path=.x, files= unzip_safe(.x, list = TRUE)$Name))
I get the following error: $ operator is invalid for atomic vectors.
Is this because possibly is in a tibble?

Files that fail return NA, and you are then trying to extract $Name from that NA, which raises the error. See:
NA$Name
Error in NA$Name : $ operator is invalid for atomic vectors
Extract $Name from the successful files inside possibly() itself. Try:
library(purrr)
unzip_safe <- possibly(.f= ~unzip(., list = TRUE)$Name, otherwise = NA)
files <- map_dfr(paths, ~ tibble(path=.x, files = unzip_safe(.x)))
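The effect of moving the $Name extraction inside possibly() can be checked without any real archives. The sketch below is illustrative only: list_archive() is a hypothetical stand-in for unzip(list = TRUE) that fails for one path, the file names are made up, and NA_character_ is used as the fallback so the column stays character. It uses map() plus rbind() instead of map_dfr() to keep purrr as the only dependency.

```r
library(purrr)

# Hypothetical stand-in for unzip(path, list = TRUE): fails for "bad" archives
list_archive <- function(path) {
  if (grepl("bad", path)) stop("cannot open ", path)
  data.frame(Name = c("a.csv", "b.csv"))
}

# The extraction of $Name happens inside the wrapped function,
# so a failure yields the fallback value instead of NA$Name
safe_names <- possibly(function(path) list_archive(path)$Name,
                       otherwise = NA_character_)

paths <- c("good.zip", "bad.zip")
rows <- map(paths, function(p) data.frame(path = p, files = safe_names(p)))
files <- do.call(rbind, rows)
print(files)
```

The failing archive ends up as a single row with a missing file name, which makes it easy to filter out or report later.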

Related

Choose command order in a function based on an error [R]

I have three files in a folder with the following names:
./multiqc_data$ ls
file1.json
file2.json
file3.json
When I open the files with the TidyMultiqc package, existing NaN values in the files might lead to the following error:
files <- dir(path,pattern = "*.json") #locate files
files %>%
map(~ load_multiqc(file.path(path, .))) #parse them
## the error
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
"mapped_failed_pct": NaN, "paired in
(right here) ------^
I want to create a function to handle this error.
Every time this error pops up, I want to apply this sed command to all files in the folder:
system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
Any ideas how I can achieve this?
You could use this wrapper:
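safe_load_multiqc <- function(path, file) {
  tryCatch(load_multiqc(file.path(path, file)), error = function(e) {
    # on failure, replace NaN with null in the JSON files of the folder
    system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
    # retry, using the same full path as the first attempt
    load_multiqc(file.path(path, file))
  })
}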
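The same retry-after-repair pattern can be tried out in isolation. The following is only a sketch under stated assumptions: parse_strict() is a hypothetical stand-in for load_multiqc() that rejects NaN, and the repair is done with gsub() in R rather than gsed so that the example runs on any platform.

```r
# Hypothetical strict parser: fails on NaN, like the JSON parser in the question
parse_strict <- function(file) {
  txt <- readLines(file, warn = FALSE)
  if (any(grepl("NaN", txt, fixed = TRUE))) {
    stop("lexical error: invalid char in json text.")
  }
  txt
}

# Portable replacement for the sed call: rewrite NaN as null in place
repair_nan <- function(file) {
  txt <- readLines(file, warn = FALSE)
  writeLines(gsub("NaN", "null", txt, fixed = TRUE), file)
}

# Try once; on error, repair the file and try exactly once more
load_with_retry <- function(file) {
  tryCatch(parse_strict(file), error = function(e) {
    repair_nan(file)
    parse_strict(file)
  })
}

demo_file <- tempfile(fileext = ".json")
writeLines('{"mapped_failed_pct": NaN}', demo_file)
result <- load_with_retry(demo_file)
print(result)
```

If the second attempt fails as well, the error propagates to the caller, which is usually what you want after a repair that did not help.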
A good way to handle errors in work pipelines like this is to use restarts, via withCallingHandlers and withRestarts.
You establish the condition handlers and the recovery protocols (restarts), then you choose which protocols to use and in which order. Calling handlers allow much finer control over error conditions than the common try-catch.
In the example, I wrote two recovery protocols: removeNaN (works at folder level) and skipFile (works at file level). If the first fails, the second is executed (simply skipping the file). Of course, this is only an example.
I think in your case you can simply run sed every time; nevertheless, I hope this answer meets your need for a canonical way.
Inspiration and further reading: Beyond Exception Handling: Conditions and Restarts
path <- "../your_path"
# function that does the error_prone task
do_task <- function(path){
files <- dir(path,pattern = "*.json") #locate files
files %>%
map(~ withRestarts( # set an alternative restart
load_multiqc(file.path(path, .)), # parsing
skipFile = function() { # if fails, skip only this file
message(paste("skipping ", file.path(path, .)))
return(NULL)
}))
}
# error handler that invokes "removeNaN"
removeNaNHandler <- function(e) tryInvokeRestart("removeNaN")
# error handler that invokes "skipFile"
skipFileHandler <- function(e) tryInvokeRestart("skipFile")
# run the task with handlers in case of error
withCallingHandlers(
condition = removeNaNHandler, # call handler (on generic error)
# condition = skipFileHandler, # if previous fails skips file
{
# run with recovery protocols (can define more than one)
withRestarts({
do_task(path)},
removeNaN = function() # protocol "removeNaN"
{
system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
do_task(path) # try again
}
)
}
)
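To see the mechanics of restarts without TidyMultiqc or any files, here is a minimal base R sketch. The names process_item, skipItem and skip_handler are made up for illustration. It uses invokeRestart(), which is also available in R versions older than the one that introduced tryInvokeRestart().

```r
# Low-level code offers a recovery protocol ("skipItem") but does not decide
# whether to use it
process_item <- function(x) {
  withRestarts(
    {
      if (x < 0) stop("negative input: ", x)
      sqrt(x)
    },
    skipItem = function() NA_real_  # value returned when the restart is invoked
  )
}

# High-level code chooses the protocol: on any error, skip the item
skip_handler <- function(e) invokeRestart("skipItem")

results <- withCallingHandlers(
  vapply(c(4, -1, 9), process_item, numeric(1)),
  error = skip_handler
)
print(results)
```

Because the handler runs while the failing call is still on the stack, the loop continues with the next item instead of being unwound as it would be with tryCatch().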
Based on this open GitHub issue, a potential solution provided by Peter Diakumis is to use RJSONIO::fromJSON() in place of jsonlite::read_json(). You could adapt this solution to your use case by, for example, creating your own load_multiqc() function:
library(RJSONIO)
load_multiqc_bugfix <- function(paths,
plots = NULL,
find_metadata = function(...) {
list()
},
plot_parsers = list(),
sections = "general") {
assertthat::assert_that(all(sections %in% c(
"general", "plot", "raw"
)), msg = "Only 'general', 'plot' and 'raw' (and combinations of those) are valid items for the sections parameter")
# Vectorised over paths
paths %>%
purrr::map_dfr(function(path) {
parsed <- RJSONIO::fromJSON(path)
# The main data is plots/general/raw
main_data <- sections %>%
purrr::map(~ switch(.,
general = parse_general(parsed),
raw = parse_raw(parsed),
plot = parse_plots(parsed, plots = plots, plot_parsers = plot_parsers)
)) %>%
purrr::reduce(~ purrr::list_merge(.x, !!!.y), .init = list()) %>%
purrr::imap(~ purrr::list_merge(.x, metadata.sample_id = .y))
# Metadata is defined by a user function
metadata <- parse_metadata(parsed = parsed, samples = names(main_data), find_metadata = find_metadata)
purrr::list_merge(metadata, !!!main_data) %>%
dplyr::bind_rows()
}) %>%
# Only arrange the columns if we have at least 1 column
`if`(
# Move the columns into the order: metadata, general, plot, raw
ncol(.) > 0,
(.) %>%
dplyr::relocate(dplyr::starts_with("raw")) %>%
dplyr::relocate(dplyr::starts_with("plot")) %>%
dplyr::relocate(dplyr::starts_with("general")) %>%
dplyr::relocate(dplyr::starts_with("metadata")) %>%
# Always put the sample ID at the start
dplyr::relocate(metadata.sample_id),
.
)
}

Error in is.single.string(object) : argument "object" is missing, with no default

I want to parse the AAChange.refGene column and then use biomaRt R package to extract information. My code is raising Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was caused by a conflict between packages (classic R), specifically biomaRt and seqinr. Both export a getSequence() function, and since seqinr is used to handle the FASTA format, the two are probably loaded together often.
My solution consisted in calling the function like this:
biomaRt::getSequence()
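The masking effect can be reproduced without installing either package. In this sketch a locally defined sd() stands in for the masking function (playing the role of seqinr::getSequence), while stats::sd plays the role of biomaRt::getSequence; the stand-in and its argument name are assumptions made for illustration.

```r
# A masking definition whose first argument is called "object", not "x"
sd <- function(object, ...) nchar(object)

# The named argument x lands in "...", so "object" is never supplied
masked_msg <- tryCatch(sd(x = c(1, 2, 3)),
                       error = function(e) conditionMessage(e))
print(masked_msg)

# Qualifying the call with the namespace bypasses the mask
qualified <- stats::sd(x = c(1, 2, 3))
print(qualified)

# find() lists every place on the search path that defines the name
print(find("sd"))

rm(sd)  # remove the mask again
```

Running find("getSequence") in the original session is a quick way to confirm which packages define the function and which one wins.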

How to deal with "Warning: object 'xxx' is created by more than one data call"

When checking an R package, I got the warning
Warning: object 'xxx' is created by more than one data call
What causes this, and how can I fix it?
This warning occurs when multiple RData files in the data directory of the package store a variable with the same name.
To reproduce, we create a package and save the cars dataset twice, to different files:
library(devtools)
create("test")
dir.create("test/data")
save(cars, file = "test/data/cars1.RData")
save(cars, file = "test/data/cars2.RData")
check("test")
The output from check includes these lines:
Found the following significant warnings:
Warning: object 'cars' is created by more than one data call
If you receive this warning, you can find repeated variable names using:
rdata_files <- dir("test/data", full.names = TRUE, pattern = "\\.RData$")
var_names <- lapply(
rdata_files,
function(rdata_file)
{
e <- new.env()
load(rdata_file, envir = e)
ls(e)
}
)
Reduce(intersect, var_names)
## [1] "cars"
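Note that Reduce(intersect, ...) only reports names present in every file. With three or more files, a name duplicated in just two of them would be missed. A possible variant that reports any repeated name, and the files holding it, is sketched below; the directory and the objects x1, x2 and x3 are made up for the demonstration.

```r
# Build a throwaway data directory with one name ("x2") saved twice
data_dir <- tempfile("dup_demo_")
dir.create(data_dir)
x1 <- 1; x2 <- 2; x3 <- 3
save(x1, x2, file = file.path(data_dir, "one.RData"))
save(x2, file = file.path(data_dir, "two.RData"))
save(x3, file = file.path(data_dir, "three.RData"))

# Collect the variable names stored in each file
rdata_files <- dir(data_dir, full.names = TRUE, pattern = "\\.RData$")
var_names <- lapply(rdata_files, function(rdata_file) {
  e <- new.env()
  load(rdata_file, envir = e)
  ls(e)
})
names(var_names) <- basename(rdata_files)

# Names that occur in more than one file
all_names <- unlist(var_names, use.names = FALSE)
dup_names <- unique(all_names[duplicated(all_names)])

# For each duplicated name, the files that contain it
dup_files <- lapply(setNames(dup_names, dup_names), function(v) {
  names(var_names)[vapply(var_names, function(x) v %in% x, logical(1))]
})
print(dup_files)
```

Here Reduce(intersect, var_names) would return an empty result, because no single name appears in all three files.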

Get the most expressed genes from one .CEL file in R

In R, the limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets whose signal intensity is above a given threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything works. You have many .CEL files and all of them are processed.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "\\.gz$")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and execute the script from scratch, so that the celData object contains a single sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even if the platform of the GDS is that expected by the library.
If you call pa.calls() with gcrma.ExpressionSet as the parameter, then everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
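Since exprs() returns an ordinary numeric matrix (probesets in rows, samples in columns), the thresholding asked about in the question is plain matrix indexing. The sketch below uses a small made-up matrix in place of exprs(eset); the probeset IDs, sample names, values and the threshold of 8 are all invented for illustration.

```r
# Stand-in for exprs(eset): 4 probesets x 2 samples, values are made up
expr_mat <- matrix(
  c(2.1, 9.5, 4.0, 12.3,
    3.3, 8.8, 4.2, 11.9),
  nrow = 4,
  dimnames = list(c("p1_at", "p2_at", "p3_at", "p4_at"),
                  c("S1.CEL", "S2.CEL"))
)

# Probesets above a threshold in one sample, highest first
threshold <- 8
one_sample <- expr_mat[, "S1.CEL"]
top_single <- sort(one_sample[one_sample > threshold], decreasing = TRUE)
print(top_single)

# Most expressed probesets across a group of samples, ranked by mean signal
group_means <- rowMeans(expr_mat[, c("S1.CEL", "S2.CEL"), drop = FALSE])
top_group <- head(sort(group_means, decreasing = TRUE), 2)
print(top_group)
```

With a real ExpressionSet, replacing expr_mat by exprs(eset) and choosing the columns for the control or sample group should give the corresponding lists.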
Use justGCRMA() rather than ReadAffy as a faster and more memory efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the RD files as \title{} or the top line in documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all the Rd files, which is a very big overhead for such a simple task? Another thing: I tried parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what pulls in the location of the help files, and that the default for the associated find.package() function is NULL. It turns out that there is no help page for that function, nor is it exported, so I tested the usual suspects for which package it was in (base, tools, utils) and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1", utils:::index.search("rnorm",
find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in; you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
# Find "file" name given topic name/alias
if (is.null(file)) {
topics <- pkg_topics_index(package)
topic_page <- subset(topics, alias == topic, select = file)$file
if(length(topic_page) < 1)
topic_page <- subset(topics, file == topic, select = file)$file
stopifnot(length(topic_page) >= 1)
file <- topic_page[1]
}
rdb_path <- file.path(system.file("help", package = package), package)
tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
help_path <- system.file("help", package = package)
file_path <- file.path(help_path, "AnIndex")
if (length(readLines(file_path, n = 1)) < 1) {
return(NULL)
}
topics <- read.table(file_path, sep = "\t",
stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)
names(topics) <- c("alias", "file")
topics[complete.cases(topics), ]
}
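Putting the two answers together, a compact get.title-style helper might look like the sketch below. It is an assumption-laden illustration, not a supported API: utils:::index.search and tools:::fetchRdDB are unexported internals that can change between R versions, get_title is a made-up name, and only attached packages with installed help are searched.

```r
get_title <- function(topic) {
  # Locate the help page among the attached packages (internal function)
  help_path <- utils:::index.search(topic, find.package(), firstOnly = TRUE)
  if (length(help_path) == 0) return(NA_character_)

  help_dir <- dirname(help_path)           # .../library/<pkg>/help
  package <- basename(dirname(help_dir))   # <pkg>

  # Fetch the parsed Rd object from the package's help database (internal)
  rd <- tools:::fetchRdDB(file.path(help_dir, package), basename(help_path))

  # Pick the \title element by its Rd_tag attribute
  tags <- vapply(rd, function(x) {
    tag <- attr(x, "Rd_tag")
    if (is.null(tag)) "" else tag
  }, character(1))
  title <- rd[[which(tags == "\\title")[1]]]

  # Flatten the text pieces and normalise whitespace
  trimws(gsub("\\s+", " ", paste(unlist(title), collapse = "")))
}

title <- get_title("rnorm")
print(title)
```

Only the single Rd object for the requested topic is fetched, so this avoids loading a whole package's documentation with Rd_db().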
