Why can't I vectorize source_url in knitr?

I am trying to vectorize this call to source_url, in order to load some functions from GitHub:
library(devtools)
# Find ggnet functions.
fun = c("ggnet.R", "functions.R")
fun = paste0("https://raw.github.com/briatte/ggnet/master/", fun)
# Load ggnet functions.
source_url(fun[1], prompt = FALSE)
source_url(fun[2], prompt = FALSE)
The last two lines should work inside an lapply call, but for some reason this fails from knitr: to get the code to work when I process an Rmd document to HTML, I have to call source_url twice, as above.
The same error shows up with source_url from devtools and with the one from downloader: somewhere in my code, an object of type 'closure' is not subsettable.
I suspect that this has something to do with SHA; any explanation would be most welcome.
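For reference, this is roughly the vectorized call I am aiming for, i.e. the version that fails for me when knitting:
# replace the two explicit calls with a single lapply over the URL vector
invisible(lapply(fun, source_url, prompt = FALSE))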

It has nothing to do with knitr, devtools, or vectorization. It is just an error in your(?) code, and it is fairly easy to track down using traceback():
> library(devtools)
> # Find ggnet functions.
> fun = c("ggnet.R", "functions.R")
> fun = paste0("https://raw.github.com/briatte/ggnet/master/", fun)
> # Load ggnet functions.
> source_url(fun[1], prompt = FALSE)
SHA-1 hash of file is 2c731cbdf4a670170fb5298f7870c93677e95c7b
> source_url(fun[2], prompt = FALSE)
SHA-1 hash of file is d7d466413f9ddddc1d71982dada34e291454efcb
Error in df$Source : object of type 'closure' is not subsettable
> traceback()
7: which(df$Source == x) at file34af6f0b0be5#14
6: who.is.followed.by(df, "JacquesBompard") at file34af6f0b0be5#19
5: eval(expr, envir, enclos)
4: eval(ei, envir)
3: withVisible(eval(ei, envir))
2: source(temp_file, ...)
1: source_url(fun[2], prompt = FALSE)
You used df in your code, and df is a function in the stats package (the density of the F distribution). I know you probably meant a data frame, but you never created one with that name in the code.
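You can reproduce the error in a fresh session, without knitr at all (assuming no data frame named df has been created):
> class(df)
[1] "function"
> df$Source
Error in df$Source : object of type 'closure' is not subsettable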

Related

How to skip some lines in R

I have many URLs whose text I import into R.
I use this code:
setNames(lapply(1:1000, function(x) gettxt(get(paste0("url", x)))), paste0("url", 1:1000, "_txt")) %>%
list2env(envir = globalenv())
However, some URLs cannot be imported and throw this error:
Error in file(con, "r") : cannot open the connection In addition:
Warning message: In file(con, "r") : InternetOpenUrl failed: 'A
connection with the server could not be established'
As a result, my code stops and doesn't import any text from any URL.
How can I recognize the bad URLs and skip them, in order to import the correct ones?
One possible approach, besides the tryCatch route mentioned by @tester, is the purrr package:
library(purrr)
# declare the download function
my_gettxt <- function(x) {
  gettxt(get(paste0("url", x)))
}
# make the function fail-safe by defining the "otherwise" value (could be an
# empty df with column definitions, etc.) that is returned if the function fails
my_gettxt <- purrr::possibly(my_gettxt, otherwise = NA)
# use map from purrr instead of an apply function
my_data <- purrr::map(1:1000, ~ my_gettxt(.x))
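For comparison, a minimal tryCatch version of the same idea (a sketch; it assumes gettxt() and the url1 ... url1000 objects from the question exist) would be:
# wrap each download in tryCatch so a failing URL yields NA
# instead of aborting the whole loop
safe_gettxt <- function(x) {
  tryCatch(gettxt(get(paste0("url", x))), error = function(e) NA)
}
my_texts <- setNames(lapply(1:1000, safe_gettxt), paste0("url", 1:1000, "_txt"))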

How to reuse sparklyr context with mclapply?

I have R code that does some distributed data preprocessing in sparklyr, then collects the data into a local R data frame, and finally saves the result as CSV. Everything works as expected, and now I plan to reuse the Spark context across the processing of multiple input files.
My code looks similar to this reproducible example:
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
# Generate random input
matrix(rbinom(1000, 1, .5), ncol = 1) %>% write.csv('/tmp/input/df0.csv')
matrix(rbinom(1000, 1, .5), ncol = 1) %>% write.csv('/tmp/input/df1.csv')
# Multi-job input
input = list(
  list(name = "df0", path = "/tmp/input/df0.csv"),
  list(name = "df1", path = "/tmp/input/df1.csv")
)
global_parallelism = 2
results_dir = "/tmp/results2"
# Function executed on each file
f <- function(job) {
  spark_df <- spark_read_csv(sc, "df_tbl", job$path)
  local_df <- spark_df %>%
    group_by(V1) %>%
    summarise(n = n()) %>%
    sdf_collect()
  output_path <- paste(results_dir, "/", job$name, ".csv", sep = "")
  local_df %>% write.csv(output_path)
  return(output_path)
}
If I execute the function over the job inputs sequentially with lapply, everything works as expected:
> lapply(input, f)
[[1]]
[1] "/tmp/results2/df0.csv"
[[2]]
[1] "/tmp/results2/df1.csv"
However, if I try to run it in parallel to maximize usage of the Spark context (so that while local R is still working on the collected result for df0, Spark can already be processing df1):
> library(parallel)
> library(MASS)
> mclapply(input, f, mc.cores = global_parallelism)
*** caught segfault ***
address 0x560b2c134003, cause 'memory not mapped'
[[1]]
[1] "Error in as.vector(x, \"list\") : \n cannot coerce type 'environment' to vector of type 'list'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in as.vector(x, "list"): cannot coerce type 'environment' to vector of type 'list'>
[[2]]
NULL
Warning messages:
1: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 2 did not deliver a result, all values of the job will be affected
2: In mclapply(input, f, mc.cores = global_parallelism) :
scheduled core 1 encountered error in user code, all values of the job will be affected
When I do something similar in Python with ThreadPoolExecutor, the Spark context is shared across threads, and the same holds for Scala and Java.
Is it possible to reuse the sparklyr context in parallel execution in R?
Yeah, unfortunately, the sc object, which is of class spark_connection, cannot be exported to another R process (even if forked processing is used). You can see this for yourself with the future.apply package, part of the future ecosystem:
library(future.apply)
plan(multicore)
## Look for non-exportable objects and give an error if found
options(future.globals.onReference = "error")
y <- future_lapply(input, f)
That will throw:
Error: Detected a non-exportable reference (‘externalptr’) in one of the
globals (‘sc’ of class ‘spark_connection’) used in the future expression
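If you just need the jobs to run in parallel and can live without a shared context, one rough sketch (reusing the objects from the question, and assuming a local master where one Spark instance per forked worker is acceptable) is to open and close a connection inside each worker:
# hypothetical per-worker variant of f(): no shared sc, so nothing
# non-exportable crosses the process boundary
f_isolated <- function(job) {
  sc_local <- spark_connect(master = "local")
  on.exit(spark_disconnect(sc_local), add = TRUE)
  spark_df <- spark_read_csv(sc_local, paste0(job$name, "_tbl"), job$path)
  local_df <- spark_df %>%
    group_by(V1) %>%
    summarise(n = n()) %>%
    sdf_collect()
  output_path <- paste0(results_dir, "/", job$name, ".csv")
  write.csv(local_df, output_path)
  output_path
}
mclapply(input, f_isolated, mc.cores = global_parallelism)
This trades the original goal of one shared context for isolation, and each worker pays the cost of its own Spark startup.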

Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

I am trying to do some Bioconductor exercises on RStudio Cloud. The first two code chunks (#1, #2) run fine, but the last one (#3) gives the error message below. Can anyone help?
#1 Unlist the zikaVirus set, take the first 21 letters as dna_seq, and print it
dna_seq <- subseq(unlist(zikaVirus), end = 21)
dna_seq
21-letter "DNAString" instance
seq: AGTTGTTGATCTGTGTGAGTC
#2 Transcribe dna_seq into an RNAString object and print it
rna_seq <- RNAString(dna_seq)
rna_seq
21-letter "RNAString" instance
seq: AGUUGUUGAUCUGUGUGAGUC
#3 Translate rna_seq into an AAString object and print it
aa_seq <- translate(rna_seq)
Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments
aa_seq
Error: object 'aa_seq' not found
Thank you. I managed to solve the problem: there was a clash over the translate() function because it is defined in both the seqinr and Biostrings packages (I had loaded both). I had to unload seqinr, because the exercises I was doing were based on the Biostrings package.
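In code, the fix looks roughly like this (a sketch; it assumes both packages are currently attached):
# unload seqinr so translate() dispatches to the Biostrings method again
detach("package:seqinr", unload = TRUE)
aa_seq <- translate(rna_seq)
# or, without detaching, call the Biostrings function by its full name
aa_seq <- Biostrings::translate(rna_seq)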

R: annotate() gives error in R

I am new to R. I need to use a POS tagger in my code, and I am using openNLP with R. I tried the following sample code (in a file Sample.R):
library("NLP")
library("openNLP")
s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
"nonexecutive director Nov. 29.\n",
"Mr. Vinken is chairman of Elsevier N.V., ",
"the Dutch publishing group."),
collapse = "")
s <- as.String(s)
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)
s[a1]
When I run this code from the R console (using source("Sample.R")), I get the following error:
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class "c("Simple_POS_Tag_Annotator", "Annotator")" to a data.frame
Following is the output of traceback() command :
14: stop(gettextf("cannot coerce class \"%s\" to a data.frame", deparse(class(x))),
domain = NA)
13: as.data.frame.default(x[[i]], optional = TRUE)
12: as.data.frame(x[[i]], optional = TRUE)
11: data.frame(x = function (s, a = Annotation())
{
s <- as.String(s)
y <- f(s)
n <- length(y)
id <- .seq_id(next_id(a$id), n)
type <- rep.int("sentence", n)
if (is.Annotation(y)) {
y$id <- id
y$type <- type
}
else if (is.Span(y)) {
y <- as.Annotation(y, id = id, type = type)
}
else stop("Invalid result from underlying sentence tokenizer.")
if (length(i <- which(a$type == "paragraph"))) {
a <- a[i]
a$features <- lapply(annotations_in_spans(y, a), function(e) list(constituents = e$id))
y <- c(y, a)
}
y
}, check.names = FALSE, stringsAsFactors = FALSE)
10: eval(expr, envir, enclos)
9: eval(as.call(c(expression(data.frame), x, check.names = !optional,
stringsAsFactors = stringsAsFactors)))
8: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
7: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
6: data.frame(position)
5: annotate(s, sent_token_annotator) at sample.R#11
4: eval(expr, envir, enclos)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("sample.R")
What could possibly be wrong? I am using R x64 3.1.1 on Windows 7. Any help will be much appreciated. Thanks in advance.
I had the same problem and fixed it by removing/detaching the ggplot2 package. There is a function called annotate() in ggplot2, so the same name exists in both packages. I suggest you make sure R is looking at the correct function: in my case it was picking up the annotate() function from ggplot2 and not the one from the NLP package.
I don't have an exact answer, but I hit the same error using NLP, openNLP, tm, and qdap. I worked backward: restarting R, loading (library) one package and running the code, then loading another package and running it again, until I hit the "cannot coerce to a data.frame" error. In my case, qdap interferes with the openNLP annotate() call, which is actually an NLP wrapper.
openNLP version 0.2-3 imports NLP (≥ 0.1-2), openNLPdata (≥ 1.5.3-1), and rJava (≥ 0.6-3). Because you loaded NLP explicitly, it may be a case of two instances of NLP loaded in memory interfering with each other. Try loading only openNLP and running your code.
Multiple packages export functions with the same name. If you specifically tell R which package's function to use, it will probably resolve the issue. For example, instead of annotate(...), try NLP::annotate(...).
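Applied to the code from the question, that suggestion looks roughly like this (the NLP:: prefix bypasses whatever other annotate() happens to mask it; s is the String defined above):
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- NLP::annotate(s, sent_token_annotator)
s[a1]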

R / devtools / roxygen2 : difficulty creating package

I'm trying to turn this function found here into an R package. I'm following the directions found here.
Here are the steps I take:
1) Load required library
library(devtools)
2) Go to a new location
setwd('C:\\myRpkgs\\')
3) Create skeleton
create('conveniencePkg')
4) Copy the function into a file and save it as 'C:\\myRpkgs\\conveniencePkg\\R\\lsos.R'
5) Run document function
setwd("./conveniencePkg")
document()
6) Install package
setwd("..")
install("conveniencePkg")
7) Load library
library(conveniencePkg)
8) Try to use the lsos function
> lsos()
Error in is.na(obj.dim)[, 1] : subscript out of bounds
Running traceback() after the error shows:
> traceback()
2: .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE,
n = n)
1: conveniencePkg::lsos()
The function runs fine if I put it in an R file and just source() it. Does anything seem incorrect in the above steps?
