I want to create a large lookup table of key-value pairs, and am attempting it like this:
# actual use case is length ~5 million
key <- do.call(paste0, Map(stringi::stri_rand_strings, n=2e5, length = 16))
val <- sample.int(750, size = 2e5, replace = T)
make_dict <- function(keys, values){
  require(rlang)
  e <- new.env(size = length(keys))
  l <- list2(!!!setNames(values, keys))
  list2env(l, envir = e, hash = T) # problem in here...?
}
d <- make_dict(key, val)
Problem
When make_dict is run it throws Error: protect(): protection stack overflow. In RStudio this happens specifically when the input is a vector of length greater than 49991, which seems very similar to this stackoverflow post.
However, when I run accessor functions to grab some of the values, it seems that make_dict ran fine after all, as I can't find any oddities in its result:
`%||%` <- function(x,y) if(is.null(x)) y else x
grab <- function(...){
  vector("integer", length(..2)) |>
    (\(.){. = Vectorize(\(e, x) e[[x]] %||% NA_integer_, list("x"), T, F)(..1, ..2); .})()
}
out <- vector("integer", length(key))
out <- grab(d, sample(key)) # using sample to scramble the keys
anyNA(out) | !lobstr::obj_size(out) == lobstr::obj_size(val)
[1] FALSE
Running the same code in RGui does not throw the error.
Oddities
The d environment object does not appear in the Environment pane in RStudio when the input size is > 5e4.
The R console swiftly returns to > (signaling the function has finished), but is unresponsive until the error is thrown.
The error is thrown whether options(expressions = 5e5) is set manually or the default value of 5000 is retained.
How long it takes for the error to be thrown is proportional to the size of the input vector.
tryCatch(make_dict(key, val), error = function(e) e) doesn't catch an error.
The error also occurs if the code is run from a package (a packaged version is available through remotes::install_github("D-Se/minimal")).
Question
What's going on here? How does one troubleshoot such an error?
options(error = traceback), as advised here, didn't give any results. Inserting a browser() after the list2env call in make_dict shows the error being thrown long after the browser has opened. A traceback() gives the function .rs.describeObject, which is used to generate the summary in the Environment pane, and can be found here.
traceback()
# .rs.describeObject
(function (env, objName, computeSize = TRUE)
{
    obj <- get(objName, env)
    hasNullPtr <- .Call("rs_hasExternalPointer", obj, TRUE, PACKAGE = "(embedding)")
    if (hasNullPtr) {
        val <- "<Object with null pointer>"
        desc <- "An R object containing a null external pointer"
        size <- 0
        len <- 0
    }
    else {
        val <- "(unknown)"
        desc <- ""
        size <- if (computeSize)
            object.size(obj)
        else 0
        len <- length(obj)
    }
    class <- .rs.getSingleClass(obj)
    contents <- list()
    contents_deferred <- FALSE
    if (is.language(obj) || is.symbol(obj)) {
        val <- deparse(obj)
    }
    else if (!hasNullPtr) {
        if (size > 524288) {
            len_desc <- if (len > 1)
                paste(len, " elements, ", sep = "")
            else ""
            if (is.data.frame(obj)) {
                val <- "NO_VALUE"
                desc <- .rs.valueDescription(obj)
            }
            else {
                val <- paste("Large ", class, " (", len_desc,
                    format(size, units = "auto", standard = "SI"),
                    ")", sep = "")
            }
            contents_deferred <- TRUE
        }
        else {
            val <- .rs.valueAsString(obj)
            desc <- .rs.valueDescription(obj)
            if (class == "data.table" || class == "ore.frame" ||
                class == "cast_df" || class == "xts" || class ==
                "DataFrame" || is.list(obj) || is.data.frame(obj) ||
                isS4(obj)) {
                if (computeSize) {
                    contents <- .rs.valueContents(obj)
                }
                else {
                    val <- "NO_VALUE"
                    contents_deferred <- TRUE
                }
            }
        }
    }
    list(name = .rs.scalar(objName), type = .rs.scalar(class),
        clazz = c(class(obj), typeof(obj)), is_data = .rs.scalar(is.data.frame(obj)),
        value = .rs.scalar(val), description = .rs.scalar(desc),
        size = .rs.scalar(size), length = .rs.scalar(len), contents = contents,
        contents_deferred = .rs.scalar(contents_deferred))
})(<environment>, "d", TRUE)
This github issue, pointed out by @technocrat, talks about a known bug in earlier versions of RStudio around disabling null external pointer checks; it has since been solved by adding an additional preference check in .rs.describeObject():
.rs.readUiPref("check_null_external_pointers")
To check whether code is run from within RStudio, and whether that version is lower than a certain version number (here I use the current official release), a check can be included in the function, or in the .onAttach() of a package:
if(!is.na(Sys.getenv("RSTUDIO", unset = NA)) && .rs.api.versionInfo()$version < "2021.9.1.372"){
  # warning or action
}
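A roughly equivalent check can also be written with the exported rstudioapi package instead of the internal .rs.api.* helpers; a minimal sketch (assuming rstudioapi is installed; the warning text is illustrative):
# Sketch: warn when running inside an affected RStudio build.
if (requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable() &&
    rstudioapi::versionInfo()$version < "2021.9.1.372") {
  warning("This RStudio version may crash when inspecting very large environments.")
}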
Related
I am new to R parallel processing. I'm trying to move this from a very slow but working for-loop to multithreading, and I can't figure out this error:
Error in checkForRemoteErrors(val) : one node produced an error: promise already under evaluation: recursive default argument reference or earlier problems?
My XML wrangling is poor, but it serves the purpose. The error comes from
process_entry <- function(ent, component_entry, flattenXMLx = flattenXMLx,
                          EntryComponent = EntryComponent, componentv = componentv) {
  library(data.table)
  flat <- NULL
  path <- character(0)
  df <- data.frame(elem. = character(0), templateID = character(0), elemid. = integer(0),
                   attr. = character(0), value. = character(0), stringsAsFactors = FALSE)
  component <- list()
  component[["Document"]][["block"]][["Body"]][["component"]][["section"]] <- component_entry[ent]
  entryc <- component[["Document"]][["block"]][["Body"]][["component"]][["section"]][["entry"]][["organ"]]
  y <- list()
  for (a in 1:length(entryc)) {
    if (any(names(entryc[a]) != "component")) {
      y <- c(y, a)
    }
  }
  if (length(y) > 0) {
    entry_component <- component_entry[[ent]]$organizer
  } else {
    entry_component <- NULL
  }
  if (any(length(entry_component) > 0)) {
    entcom <- EntryComponent(entry_component)
  } else {
    entcom <- NULL
  }
  componente <- flattenXMLx(xml2::xml_root(xml2::as_xml_document(component)), df, path)
  if (any(!is.na(entcom))) {
    componente <<- dplyr::bind_rows(componente, entcom)
  }
  componentv <<- dplyr::bind_rows(componentv, componente)
  return(componentv)
}
The function is called like this:
componentv <- do.call(cbind, parLapply(cl, 1:length(component_entry), process_entry, component_entry=component_entry) )
The issue seems to come from componentv <<- dplyr::bind_rows(componentv, componente).
Edit: added the full code.
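For context, <<- inside a parLapply worker modifies that worker's own global environment, not the master session, so nothing accumulates in componentv on the master; the usual pattern is to have each worker return its piece and combine the results afterwards. A minimal sketch with toy data (process_entry_safe and this component_entry are illustrative stand-ins):
library(parallel)

component_entry <- list(a = 1:3, b = 1:5)   # toy stand-in for the real data

# Hypothetical worker: returns a small data frame, touches no global state.
process_entry_safe <- function(ent, component_entry) {
  data.frame(entry = ent, n = length(component_entry[[ent]]))
}

cl <- makeCluster(2)
res <- parLapply(cl, seq_along(component_entry), process_entry_safe,
                 component_entry = component_entry)
stopCluster(cl)

componentv <- do.call(rbind, res)           # combine on the master, not via <<-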
I made an S4 method for 'plot' that appears to be working, except it outputs some stray NULL to the console and I can't figure out where it's coming from. Here's the top level code:
print(plot(x = flux, y = 1, fastplot = TRUE, quietly = TRUE))
And the class:
flux <- setClass(
  # Set the class name
  "flux",
  slots = c(
    raw.data = "list",
    source.files = "character",
    data = "matrix",
    time = "POSIXct",
    datatype = "character",
    metadata = "data.frame"
  )
)
And the method:
setMethod("plot",
  signature(x = "flux"),
  function (x, y, ...) {
    CheckFluxObject(x)
    params <- LoadDefaults(flux = x)
    # Interpret 'plot' arguments
    par.restore <- par(no.readonly = TRUE)
    on.exit(expr = par(par.restore), add = TRUE)
    arguments <- list(...)
    if (!("fastplot" %in% names(arguments))) {
      fastplot <- FALSE
    } else {
      fastplot <- arguments$fastplot
      arguments$fastplot <- NULL
    }
    if (!("quietly" %in% names(arguments))) {
      quietly <- FALSE
    } else {
      quietly <- arguments$quietly
      arguments$quietly <- NULL
    }
    par(ask = !(fastplot))
    if (!("ylab" %in% arguments)) {
      ylab <- params["units"]
    } else {
      ylab <- arguments$ylab
      arguments$ylab <- NULL
    }
    # Pull relevant 'flux' class object data
    data <- slot(x, "data")
    if (missing("y")) {
      y <- 1:ncol(data)
    } else {
      stopifnot(
        is.integer(y),
        all(y %in% 1:ncol(data))
      )
    }
    # Bulk function execution
    if (quietly == FALSE) {
      message("Plotting data traces:")
    }
    plot.obj <- plot.new()
    print("NULL is in the 'for' loop...")
    for (i in y) {
      main <- colnames(data)[i]
      plot.obj <- plot(slot(x, "time"), data[, i], main = main,
                       xlab = "Time", ylab = ylab, unlist(arguments))
      print(plot.obj)
    }
    print("but is it also here??")
    # Clean-up and exit
    if (quietly == FALSE) {
      message("Done plotting.")
    }
    if (length(y) == 1) {
      invisible(plot.obj)
    }
    print("or here??")
    invisible(NULL)
  }
)
The output for that is:
[1] "NULL is in the 'for' loop..."
NULL
[1] "but is it also here??"
[1] "or here??"
NULL
If I throw in another print("what about here??") after the invisible(NULL),
then it does this:
[1] "NULL is in the 'for' loop..."
NULL
[1] "but is it also here??"
[1] "or here??"
[1] "what about here??"
[1] "what about here??"
Is there some behavior of the function return or print commands that I'm not anticipating? The CheckFluxObject function just checks to make sure all the slots are filled.
I'll leave this here till a better answer pops up, if ever:
Apparently base plot() returns NULL (invisibly), so printing its return value is what shows NULL on the console. If you're trying to generate a plot within a function, it seems the best way to hand it back is invisible(plot.object) or invisible(plot(x, y, ...)), NOT print.
I'm still not sure where the 2nd NULL is coming from...
Edit: Found the second one! Just like the print(plot.obj) inside the method, the print() in the top-level code was displaying a NULL. Dropping all of the print commands killed all of the ghosts.
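For anyone chasing the same ghosts, a minimal illustration of where the NULLs come from (base plot() returns NULL invisibly, so printing its return value shows NULL):
p <- plot(1:10)    # draws the scatterplot
p                  # prints NULL: plot() returned NULL invisibly
print(plot(1:10))  # draws the plot, then prints NULL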
I'm trying to run this function, but when I source the code it says:
Error in paste("http://uk.advfn.com/p.php?pid=financials&symbol=", Symbol, :
object 'Symbol' not found
fund.data <- function
(
Symbol, # ticker
n=10, # number of periods
mode=c('quarterly','annual'), # periodicity
max.attempts=5 # maximum number of attempts to download before exiting
)
dirname(sys.frame(1)$ofile)
{
all.data = c()
option.value = -1
start_date = c('istart_date,start_date')
names(start_date) = c('quarterly,annual')
repeat {
# download Quarterly Financial Report data
if(option.value >= 0) {
url = paste('http://uk.advfn.com/p.php?pid=financials&symbol=', Symbol, '&btn=', mode[1], '_reports&', start_date[mode[1]], '=', option.value, sep = '')
} else {
url = paste('http://uk.advfn.com/p.php?pid=financials&symbol=', Symbol, '&btn=', mode[1], '_reports', sep = '')
}
cat('Downloading', url, '\n')
#txt = join(readLines(url))
for(iattempt in 1:max.attempts) {
flag = T
tryCatch({
txt = join(readLines(url))
}, interrupt = function(ex) {
flag <<- F
Sys.sleep(0.1)
}, error = function(ex) {
flag <<- F
Sys.sleep(0.1)
}, finally = {
if(flag) break
})
}
if( length(grep('INDICATORS', txt, ignore.case = T)) == 0 ) {
cat('No Data Found for', Symbol, '\n')
return(all.data)
}
# get title
pos = regexpr(pattern = '<title>(.*?)</title>', txt, ignore.case = TRUE, perl = TRUE)
if(length(pos) == 1)
title = substr(txt, attr(pos, 'capture.start'), attr(pos, 'capture.start') + attr(pos, 'capture.length') - 1)
# extract table from this page
data = extract.table.from.webpage(txt, 'INDICATORS', has.header = T)
colnames(data) = data[1,]
rownames(data) = data[,1]
data = data[,-1,drop=F]
# only add not already present data
add.index = which( is.na(match( colnames(data), colnames(all.data) )) )
all.data = cbind(data[,add.index,drop=F], all.data)
# check if it is time to stop
if(ncol(all.data) >= n) break
if(option.value == 0) break
# extract option value to go to the next page
temp = gsub(pattern = '<option', replacement = '<tr>', txt, perl = TRUE)
temp = gsub(pattern = '</option>', replacement = '</tr>', temp, perl = TRUE)
temp = extract.table.from.webpage(temp, 'All amounts', has.header = T)
temp = apply(temp,1,join)
index.selected = grep('selected', temp)
option.value = 0
if( length(index.selected) )
option.value = as.double( gsub('.*value=\'([0-9]*).*', '\\1', temp[index.selected]) )
if(option.value > 0) {
# can only get 5 time periods at a time
option.value = option.value - 5
option.value = max(0, option.value)
} else {
break
}
}
# remove empty columns
all.data = all.data[, colSums(nchar(trim(all.data))) > 0, drop=F]
all.data = rbind(all.data, title)
rownames(all.data)[nrow(all.data)] = 'HTMLTITLEtext'
if( ncol(all.data) > n ) {
return(all.data[,(ncol(all.data)-n+1):ncol(all.data), drop=F])
} else {
return(all.data)
}
}
The way you've written your code, your dirname() call comprises the entirety of the body of your function. The braced block that follows is executed immediately and is not part of the function.
After running all your code (and getting the error you quoted), this is fund.data():
fund.data;
## function
## (
## Symbol, # ticker
## n=10, # number of periods
## mode=c('quarterly','annual'), # periodicity
## max.attempts=5 # maximum number of attempts to download before exiting
## )
## dirname(sys.frame(1)$ofile)
As you can see, the braced block was not taken as part of the function definition. It was executed by itself immediately after fund.data() was defined. A function definition takes only the immediately following expression as the body, although that expression may comprise a braced block, which allows any number of statements to be subsumed within it. And as #RichardScriven pointed out in his comment, there is no actual call to your function anywhere in your code.
So the reason you're getting the error "object 'Symbol' not found" is that the function parameter Symbol does not exist inside your braced block: the block is not part of the body of your function and was executed by itself.
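A stripped-down reproduction of the same pattern may make this clearer (f and the braced block are purely illustrative, assuming no x exists in the workspace):
f <- function(x)
  x + 1        # this single expression is the entire body of f

{
  x * 2        # NOT part of f: this block runs immediately and errors with
}              # "object 'x' not found", because x only exists inside f

f(5)           # 6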
To solve your problem, you need to surround the entire function body with a braced block:
fund.data <- function
(
Symbol, # ticker
n=10, # number of periods
mode=c('quarterly','annual'), # periodicity
max.attempts=5 # maximum number of attempts to download before exiting
) {
dirname(sys.frame(1)$ofile)
all.data = c()
option.value = -1
...
}
It's not clear what the purpose of the dirname() call is, though, since its return value is not used.
I'm currently writing a program (full disclosure, it's "homework"). The program is designed to run through a series of files based on a range given, collate them into one large table sans NAs and find the mean of the pollutant provided (which is a column in the table).
I wrote the program previously, but wanted to play around with compartmentalising the functions a bit more, so I rewrote it.
Strangely, some ranges return exactly the same numeric result as the original program, while others return (relatively) radically different results.
For instance:
pollutantmean("specdata", "sulfate", 1:10)
Old Program: 4.064128
New Program: 4.064128
pollutantmean("specdata", "nitrate", 23)
Old Program: 1.280833
New Program: 1.280833
pollutantmean("specdata", "nitrate", 70:72)
Old Program: 1.706047
New Program: 1.732979
In that final example, the old program is producing the expected result, while the new program is producing a result not within the acceptable margin of error at all.
I'm simply at a loss. I've been trying to rewrite my new code so as to minimise differences with the old code without simply reproducing the old program; the current code is below (with the original program). But nothing is working, and I continue to receive the exact same (bad) result despite quite a few changes being made.
New Program:
concatTables <- function(directory, id, hasHeader = TRUE, keepNAs = FALSE) {
  totalTable <- NULL
  currentTable <- NULL
  for (file in id) {
    filename <- paste(sep = "",
      directory, "/", formatC(file, width = 3, format = "d", flag = "0"), ".csv"
    )
    currentTable <- read.csv(file = filename, header = hasHeader)
    if (!is.null(totalTable)) {
      totalTable <- rbind(totalTable, currentTable)
    } else {
      totalTable <- currentTable
    }
  }
  if (!keepNAs) {
    totalTable <- completeRows(totalTable)
  }
  totalTable
}

completeRows <- function(table) {
  table <- table[complete.cases(table), ]
  table
}

pollutantmean <- function(directory = paste(getwd(), "/specdata", sep = ""), pollutant,
                          id = 1:332, hasHeader = TRUE, keepNAs = FALSE) {
  table <- NULL
  table <- concatTables(directory, id, hasHeader, keepNAs)
  tableMean <- mean(table[[pollutant]])
  tableMean
}
Old Program
(Which produces better results)
dataFileName <- NULL
pollutantmean <- function(directory = "specdata", pollutant, id = 1:332, idWidth = 3, fullLoop = TRUE) {
  dataFrame <- NULL
  dataFrameTotal <- NULL
  for (i in id) {
    dataFileName <- paste(directory, "/", formatC(i, width = idWidth, flag = 0), ".csv", sep = "")
    if (!is.null(dataFileName)) {
      dataFileConnection <- file(dataFileName)
      dataFrame <- read.csv(dataFileConnection, header = TRUE)
      dataFrameTotal <- rbind(dataFrame, dataFrameTotal)
      ## close(dataFileConnection)
      if (fullLoop == FALSE) {
        break
      }
    } else {
      print("DATAFILENAME IS NULL!")
    }
  }
  print(mean(dataFrameTotal[[pollutant]], na.rm = TRUE))
}
The difference is that complete.cases() returns FALSE for each row where any of the columns is NA (so subsetting with it drops those whole rows), while the na.rm argument inside mean() only removes values where the selected column (vector) itself is NA.
Example:
x <- airquality[1:10, -1]
x[3, 3] <- NA
mean(x[complete.cases(x), "Temp"]) == mean(x[["Temp"]], na.rm = TRUE)
[1] FALSE
Note that complete.cases() returns FALSE for rows 5 and 6, where the Solar.R column is NA, so you lose 2 observations that are not NA in the Temp column.
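So, to make the new program reproduce the old results, keep the NA rows when reading and only drop NAs in the selected pollutant column; a sketch of the adjusted pollutantmean(), reusing concatTables() from the question:
pollutantmean <- function(directory, pollutant, id = 1:332, hasHeader = TRUE) {
  # keepNAs = TRUE so complete.cases() never removes rows with NAs in other columns
  table <- concatTables(directory, id, hasHeader, keepNAs = TRUE)
  mean(table[[pollutant]], na.rm = TRUE)  # ignore NAs only in the chosen column
}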
I am trying to understand the reducer.R code taken from the following website.
http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html
This code is used for Hadoop Streaming with R.
I have given the code below:
#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
  val <- unlist(strsplit(line, "\t"))
  list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  split <- splitLine(line)
  word <- split$word
  count <- split$count
  if (exists(word, envir = env, inherits = FALSE)) {
    oldcount <- get(word, envir = env)
    assign(word, oldcount + count, envir = env)
  } else {
    assign(word, count, envir = env)
  }
}
close(con)

for (w in ls(env, all = TRUE))
  cat(w, "\t", get(w, envir = env), "\n", sep = "")
Could someone explain the significance of the following new.env() command and the subsequent use of env in the code:
env <- new.env(hash = TRUE)
Why is this required? What happens if this is not included in the code?
Update 06/05/2014
I tried writing another version of this code without defining a new environment; the code is as follows:
#! /usr/bin/env Rscript
current_word <- ""
current_count <- 0
word <- ""
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line1 <- gsub("(^ +)|( +$)", "", line)
  word <- unlist(strsplit(line1, "[[:space:]]+"))[[1]]
  count <- as.numeric(unlist(strsplit(line1, "[[:space:]]+"))[[2]])
  if (current_word == word) {
    current_count <- current_count + count
  } else {
    if (current_word != "") {
      cat(current_word, '\t', current_count, '\n')
    }
    current_count <- count
    current_word <- word
  }
}
if (current_word == word) {
  cat(current_word, '\t', current_count, '\n')
}
close(con)
This code gives the same output as the one with a new environment defined.
Question: Does using a new environment provide any advantages from a Hadoop standpoint? Is there a reason for using it in this specific case?
Thank you.
Your question is related to environments in R. Here is example code for making a new environment in R:
> my.env <- new.env()
> my.env
<environment: 0x114a9d940>
> ls(my.env)
character(0)
> assign("a", 999, envir=my.env)
> my.env$foo = "This is the variable foo."
> ls(my.env)
[1] "a" "foo"
I think this article can help you: http://www.r-bloggers.com/environments-in-r/
Or run
?environment
for more help.
In the code that you give, the author makes a new environment:
env <- new.env(hash = TRUE)
and when he wants to assign a value, he specifies that environment:
assign(word, oldcount + count, envir = env)
As for the question "What happens if this is not included in the code?", I think you can find the answer in the link that I already provided.
The advantages of using a new environment in R are already answered in this link: the reason is that in this case you will be working with a large dataset. When you pass your dataset to a function, R will make a copy of it, and the returned data then overwrites the old dataset. But if you pass an environment, R works on that environment directly without copying the large dataset.
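A minimal sketch of that reference-like behaviour (bump() and the words here are illustrative): a function can update counts stored in an environment in place, which is exactly what the reducer relies on.
e <- new.env(hash = TRUE)

# Increment the count for `word` inside `env`, in place, without returning a copy.
bump <- function(env, word) {
  if (exists(word, envir = env, inherits = FALSE)) {
    assign(word, get(word, envir = env) + 1L, envir = env)
  } else {
    assign(word, 1L, envir = env)
  }
}

bump(e, "hadoop"); bump(e, "hadoop"); bump(e, "r")
mget(ls(e), envir = e)   # $hadoop is 2, $r is 1: e itself was modified, nothing copied back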