I am trying to understand the reducer.R code taken from the following website.
http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html
This code is used for Hadoop Streaming with R.
I have given the code below:
#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
  val <- unlist(strsplit(line, "\t"))
  list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  split <- splitLine(line)
  word <- split$word
  count <- split$count
  if (exists(word, envir = env, inherits = FALSE)) {
    oldcount <- get(word, envir = env)
    assign(word, oldcount + count, envir = env)
  }
  else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
  cat(w, "\t", get(w, envir = env), "\n", sep = "")
Could someone explain the significance of the following new.env command and the subsequent use of env in the code:
env <- new.env(hash = TRUE)
Why is this required? What happens if this is not included in the code?
Update 06/05/2014
I tried writing another version of this code without defining a new environment; the code is as follows:
#! /usr/bin/env Rscript
current_word <- ""
current_count <- 0
word <- ""
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line1 <- gsub("(^ +)|( +$)", "", line)
  word <- unlist(strsplit(line1, "[[:space:]]+"))[[1]]
  count <- as.numeric(unlist(strsplit(line1, "[[:space:]]+"))[[2]])
  if (current_word == word) {
    current_count <- current_count + count
  } else {
    if (current_word != "") {
      cat(current_word, '\t', current_count, '\n')
    }
    current_count <- count
    current_word <- word
  }
}
if (current_word == word) {
  cat(current_word, '\t', current_count, '\n')
}
close(con)
This code gives the same output as the one with a new environment defined.
Question: Does using new environment provide any advantages from a Hadoop standpoint? Is there a reason for using it in this specific case?
Thank you.
Your question is related to environments in R. Here is some example code for making a new environment in R:
> my.env <- new.env()
> my.env
<environment: 0x114a9d940>
> ls(my.env)
character(0)
> assign("a", 999, envir=my.env)
> my.env$foo = "This is the variable foo."
> ls(my.env)
[1] "a" "foo"
I think this article can help you: http://www.r-bloggers.com/environments-in-r/
Or run
?environment
for more help.
In the code you posted, the author makes a new environment:
env <- new.env(hash = TRUE)
and, when assigning a value, specifies that environment explicitly:
assign(word, oldcount + count, envir = env)
As for the question "What happens if this is not included in the code?", I think you can find the answer in the link I already provided.
The advantages of using a new environment in R are also covered in that link.
The reason in this case is that you will be working with a large dataset. When you pass your dataset to a function, R makes a copy of it, and the returned data then overwrites the old dataset. But if you pass an environment, R operates on that environment directly, without copying the large dataset.
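For illustration, here is a small sketch (the object names are made up) of the difference between passing a list, which is copied, and passing an environment, which is modified in place:

counter_env <- new.env(hash = TRUE)
assign("apple", 1L, envir = counter_env)

# updates the caller's environment in place (no copy of the data)
bump_env <- function(e, word) assign(word, get(word, envir = e) + 1L, envir = e)
# works on a copy; the caller's list is unchanged unless reassigned
bump_list <- function(l, word) { l[[word]] <- l[[word]] + 1L; l }

bump_env(counter_env, "apple")
get("apple", envir = counter_env)   # 2 -- modified in place

counter_list <- list(apple = 1L)
bump_list(counter_list, "apple")
counter_list$apple                  # still 1 -- the function changed a copy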
I want to create a large lookup table of key-value pairs, and am attempting it like this:
# actual use case is length ~5 million
key <- do.call(paste0, Map(stringi::stri_rand_strings, n=2e5, length = 16))
val <- sample.int(750, size = 2e5, replace = T)
make_dict <- function(keys, values){
  require(rlang)
  e <- new.env(size = length(keys))
  l <- list2(!!!setNames(values, keys))
  list2env(l, envir = e, hash = T) # problem in here...?
}
d <- make_dict(key, val)
Problem
When make_dict is run, it throws Error: protect(): protection stack overflow. Specifically, this happens in RStudio when the input is a vector of length greater than 49991, which seems very similar to this Stack Overflow post.
However, when I run accessor functions to grab some of the values, it seems that make_dict ran fine after all, as I can't find any oddities in its result:
`%||%` <- function(x,y) if(is.null(x)) y else x
grab <- function(...){
  vector("integer", length(..2)) |>
    (\(.){. = Vectorize(\(e, x) e[[x]] %||% NA_integer_, list("x"), T, F)(..1, ..2); .})()
}
out <- vector("integer", length(key))
out <- grab(d, sample(key)) # using sample to scramble the keys
anyNA(out) | !lobstr::obj_size(out) == lobstr::obj_size(val)
[1] FALSE
Running the same code in RGui does not throw the error.
Oddities
The d environment object does not appear in the Environment pane in RStudio for size > 5e4.
The R console returns swiftly back to > (signaling the function has finished), but is unresponsive until the error is thrown.
The error is thrown whether I manually set options(expressions = 5e5) or keep the default value of 5000.
How long it takes for the error to be thrown is proportional to the size of the input vector.
tryCatch(make_dict(key, val), error = function(e) e) doesn't catch an error.
The error also occurs if the code is run from a package (a packaged version is available through remotes::install_github("D-Se/minimal")).
Question
What's going on here? How to troubleshoot such an error?
options(error = traceback) as advised here didn't give any results. Inserting a browser() after list2env in the make_dict function throws an error long after the browser has opened. A traceback() gives the function .rs.describeObject, which is used to generate the summary in the Environment pane, and can be found here.
traceback()
# .rs.describeObject
(function (env, objName, computeSize = TRUE)
{
obj <- get(objName, env)
hasNullPtr <- .Call("rs_hasExternalPointer", obj, TRUE, PACKAGE = "(embedding)")
if (hasNullPtr) {
val <- "<Object with null pointer>"
desc <- "An R object containing a null external pointer"
size <- 0
len <- 0
}
else {
val <- "(unknown)"
desc <- ""
size <- if (computeSize)
object.size(obj)
else 0
len <- length(obj)
}
class <- .rs.getSingleClass(obj)
contents <- list()
contents_deferred <- FALSE
if (is.language(obj) || is.symbol(obj)) {
val <- deparse(obj)
}
else if (!hasNullPtr) {
if (size > 524288) {
len_desc <- if (len > 1)
paste(len, " elements, ", sep = "")
else ""
if (is.data.frame(obj)) {
val <- "NO_VALUE"
desc <- .rs.valueDescription(obj)
}
else {
val <- paste("Large ", class, " (", len_desc,
format(size, units = "auto", standard = "SI"),
")", sep = "")
}
contents_deferred <- TRUE
}
else {
val <- .rs.valueAsString(obj)
desc <- .rs.valueDescription(obj)
if (class == "data.table" || class == "ore.frame" ||
class == "cast_df" || class == "xts" || class ==
"DataFrame" || is.list(obj) || is.data.frame(obj) ||
isS4(obj)) {
if (computeSize) {
contents <- .rs.valueContents(obj)
}
else {
val <- "NO_VALUE"
contents_deferred <- TRUE
}
}
}
}
list(name = .rs.scalar(objName), type = .rs.scalar(class),
clazz = c(class(obj), typeof(obj)), is_data = .rs.scalar(is.data.frame(obj)),
value = .rs.scalar(val), description = .rs.scalar(desc),
size = .rs.scalar(size), length = .rs.scalar(len), contents = contents,
contents_deferred = .rs.scalar(contents_deferred))
})(<environment>, "d", TRUE)
This GitHub issue, pointed out by @technocrat, describes a known bug in earlier versions of RStudio related to null external pointer checks; it has since been addressed by adding an additional preference check in .rs.describeObject() of
.rs.readUiPref("check_null_external_pointers")
To check whether the code is being run from within RStudio, and whether that version is lower than a certain version number (here I use the current official release), a check can be included in the function, or in the .onAttach of a package, as sketched after the snippet:
if (!is.na(Sys.getenv("RSTUDIO", unset = NA)) && .rs.api.versionInfo()$version < "2021.9.1.372") {
  # warning or action
}
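As a hedged sketch only (it assumes the check lives in a package, and that .rs.api.versionInfo() is reachable, which is the case only inside an RStudio session), the same test could sit in .onAttach:

.onAttach <- function(libname, pkgname) {
  in_rstudio <- !is.na(Sys.getenv("RSTUDIO", unset = NA))
  # .rs.api.* functions only exist on the search path inside an RStudio session
  if (in_rstudio && exists(".rs.api.versionInfo")) {
    if (get(".rs.api.versionInfo")()$version < "2021.9.1.372") {
      packageStartupMessage("This RStudio version may report a spurious ",
                            "protect() stack overflow for very large environments.")
    }
  }
}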
I have the following function as an example:
myFunc <- function(x){
  while (x < 100) {
    x <- x + 10
    cat(x)
    cat("\n")
  }
}
In the new R version 3.4.1 on Windows, I want to source this function from the file myFunc.R as below:
filepath <- "D:/"
l <- list.files(filepath, pattern = "my", full.names = TRUE)
source(l)
But I am getting the following error:
source(l)
Error in source(l) : could not find symbol "exprs" in environment of the generic function
I hope someone can help. Thanks a lot.
I've been trying to get a parallelized foreach loop running in R. It works fine for approximately ten iterations but then crashes, showing the error:
Error in { : task 7 failed - "missing value where TRUE/FALSE needed"
Calls: %dopar% -> <Anonymous>
Execution halted
I append the results of each loop to a file, which does show the output to be as expected. My script is as follows, using the combn_sub function from this post:
library(data.table)

LBRA <- fread(
  input = "LBRA.012",
  data.table = FALSE)
str_bra <- nrow(LBRA)
br1sums <- colSums(LBRA)
b1non <- which(br1sums == 0)
LBRA_trim <- LBRA[, -b1non]

library(foreach)
library(doMC)
registerDoMC(28)

foreach(X = seq(2, (nrow(LBRA) - 1))) %dopar% {
  com <- combn_sub(
    x = nrow(LBRA),
    m = X,
    nset = 1000)
  out_in <- matrix(
    ncol = 2,
    nrow = 1)
  colnames(out) <- c("SNPs", "k")
  for (A in seq(1, ncol(com))) {
    rowselect <- com[, A]
    sub <- LBRA_trim[rowselect, ]
    subsum <- colSums(sub)
    length <- length(which(subsum != 0)) - 1
    out_in <- rbind(out_in, c(length, X))
  }
  write.table(
    file = "plateau.csv",
    sep = "\t",
    x = out_in,
    append = TRUE)
}
I had a similar problem with my foreach call...
tmpcol <- foreach(j = idxs:idxe, .combine=cbind) %dopar% { imp(j) }
Error in { : task 18 failed - "missing value where TRUE/FALSE needed"
Changing the .errorhandling parameter only ignores the error
tmpcol <- foreach(j = idxs:idxe, .combine=cbind, .errorhandling="pass") %dopar% { imp(j) }
Warning message:
In fun(accum, result.18) :
number of rows of result is not a multiple of vector length (arg 2)
I suggest running the function from your foreach call directly for X = 7. The problem in my case was that my function, imp(j), was throwing an error (for j = 18, it was hitching on an NA calculation), which resulted in the vague output from foreach.
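As a toy illustration only (imp() below is a made-up stand-in, not the real function), calling the task function directly surfaces the actual problem instead of foreach's vague "task N failed" message:

# made-up stand-in: returns NA for the failing index
imp <- function(j) if (j == 18) NA_real_ else sqrt(j)

imp(18)                              # NA -- easy to spot when called directly
if (imp(18) > 1) "keep" else "drop"  # reproduces the error:
# Error in if (imp(18) > 1) "keep" else "drop" :
#   missing value where TRUE/FALSE needed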
As @Roland points out, it's a very bad idea to write to a file within a foreach loop. Even writing in append mode, the individual cores will attempt to write to the file simultaneously and may clobber each other's output. Instead, capture the results of the foreach statement using the .combine="rbind" option and then write to file after the loop:
# makeCluster() pairs with registerDoParallel() from the doParallel package
library(doParallel);
cluster <- makeCluster(28, outfile = "MulticoreLogging.txt");
registerDoParallel(cluster);

foreach_outcome_table <- foreach(X = seq(2, (nrow(LBRA) - 1)), .combine = "rbind") %dopar% {
  cat(paste(Sys.info()[['nodename']], Sys.getpid(), sep = '-'), "now performing loop", X, "\n");
  com <- combn_sub(x = nrow(LBRA), m = X, nset = 1000);
  out_in <- matrix(ncol = 2, nrow = 1);
  colnames(out_in) <- c("SNPs", "k");
  for (A in seq(1, ncol(com))) {
    rowselect <- com[, A];
    sub <- LBRA_trim[rowselect, ];
    subsum <- colSums(sub);
    length <- length(which(subsum != 0)) - 1;
    out_in <- rbind(out_in, c(length, X));
  }
  out_in;
}
write.table(file = "plateau.csv", sep = "\t", x = foreach_outcome_table, append = TRUE);
Further, you could replace the inner for loop with a nested foreach loop which would probably be more efficient.
There could be many reasons for the error, "missing value where TRUE/FALSE needed".
What helped for me was to remove the %dopar% and run the same code on a single item. This revealed more/clearer error messages which, I think, get lost when running in parallel. My error had nothing to do with the %dopar% itself.
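As a hedged sketch of that idea (combn_sub, LBRA and LBRA_trim are the objects from the question, and the question's colnames(out) line is written as colnames(out_in) here so the body runs), execute the loop body sequentially for the failing task so the real error message and traceback are visible:

X <- 7   # the failing task from the error message
out_in <- matrix(ncol = 2, nrow = 1)
colnames(out_in) <- c("SNPs", "k")
com <- combn_sub(x = nrow(LBRA), m = X, nset = 1000)
for (A in seq(1, ncol(com))) {
  sub <- LBRA_trim[com[, A], ]
  out_in <- rbind(out_in, c(length(which(colSums(sub) != 0)) - 1, X))
}
out_in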
I come from a Python background and am trying to get up to speed with R, so please bear with me.
I have an R file - util.R with the following lines
util.add <- function(a,b) a + b
util.sub <- function(a,b) {
a - b
}
I need to write a function that returns the following:
findFunctions('path/util.R')
[1] "util.add" "util.sub"
I think sourcing will be necessary, but you don't need to clutter your global environment. I tested this locally and it seems to work:
find_functions = function(file) {
  search_env = new.env()
  source(file = file, local = search_env)
  objects = ls(envir = search_env)
  functions = objects[sapply(ls(envir = search_env), FUN = function(x) {
    is.function(get(x, envir = search_env))
  })]
  return(functions)
}
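For a quick check (this writes the util.R contents from the question to a temporary file, so the path here is made up):

util_path <- tempfile(fileext = ".R")
writeLines(c(
  "util.add <- function(a,b) a + b",
  "util.sub <- function(a,b) {",
  "  a - b",
  "}"
), util_path)

find_functions(util_path)
# [1] "util.add" "util.sub"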
I'm currently writing a program (full disclosure, it's "homework"). The program is designed to run through a series of files based on a range given, collate them into one large table sans NAs and find the mean of the pollutant provided (which is a column in the table).
I wrote the program previously, but wanted to play around with compartmentalising the functions a bit more, so I rewrote it.
Strangely, some ranges return exactly the same number as the original program, while others return (relatively) radically different results.
For instance:
pollutantmean("specdata", "sulfate", 1:10)
Old Program: 4.064128
New Program: 4.064128
pollutantmean("specdata", "nitrate", 23)
Old Program: 1.280833
New Program: 1.280833
pollutantmean("specdata", "nitrate", 70:72)
Old Program: 1.706047
New Program: 1.732979
In that final example, the old program is producing the expected result, while the new program is producing a result not within the acceptable margin of error at all.
I'm simply at a loss. I've been trying to rewrite my new code so as to minimise differences with the old code without simply reproducing the old program; the current code is below (with the original program). But nothing is working, and I continue to receive the exact same (bad) result despite quite a few changes being made.
New Program:
concatTables <- function(directory, id, hasHeader = TRUE, keepNAs = FALSE) {
  totalTable <- NULL
  currentTable <- NULL
  for (file in id) {
    filename <- paste(sep = "",
      directory, "/", formatC(file, width = 3, format = "d", flag = "0"), ".csv"
    );
    currentTable <- read.csv(file = filename, header = hasHeader);
    if (!is.null(totalTable)) {
      totalTable <- rbind(totalTable, currentTable);
    }
    else {
      totalTable <- currentTable;
    }
  }
  if (!keepNAs) {
    totalTable <- completeRows(totalTable);
  }
  totalTable
}

completeRows <- function(table) {
  table <- table[complete.cases(table), ]
  table
}

pollutantmean <- function(directory = paste(getwd(), "/specdata", sep = ""), pollutant, id = 1:332, hasHeader = TRUE, keepNAs = FALSE) {
  table <- NULL
  table <- concatTables(directory, id, hasHeader, keepNAs);
  tableMean <- mean(table[[pollutant]]);
  tableMean
}
Old Program
(Which produces better results)
dataFileName <- NULL

pollutantmean <- function(directory = "specdata", pollutant, id = 1:332, idWidth = 3, fullLoop = TRUE) {
  dataFrame <- NULL
  dataFrameTotal <- NULL
  for (i in id) {
    dataFileName <- paste(directory, "/", formatC(i, width = idWidth, flag = 0), ".csv", sep = "")
    if (!is.null(dataFileName)) {
      dataFileConnection <- file(dataFileName)
      dataFrame <- read.csv(dataFileConnection, header = TRUE)
      dataFrameTotal <- rbind(dataFrame, dataFrameTotal)
      ## close(dataFileConnection)
      if (fullLoop == FALSE) {
        break
      }
    }
    else print("DATAFILENAME IS NULL!")
  }
  print(mean(dataFrameTotal[[pollutant]], na.rm = TRUE))
}
The difference is that complete.cases() returns FALSE for every row in which any column is NA (so the whole row is dropped), while the na.rm argument inside mean() only removes NAs from the selected column (vector).
Example:
x <- airquality[1:10, -1]
x[3,3] <- NA
> mean(x[complete.cases(x), "Temp"]) == mean(x[["Temp"]], na.rm = T)
[1] FALSE
Note that complete.cases() returns FALSE on rows 5 and 6, where the Solar.R column is NA, so you lose 2 observations that are not NA in the Temp column.
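As a hedged sketch (it reuses the concatTables function from the question above), the new program can be made to match the old one by keeping incomplete rows and removing NAs only in the selected pollutant column:

pollutantmean <- function(directory = "specdata", pollutant, id = 1:332, hasHeader = TRUE) {
  # keep all rows, then drop NAs only in the pollutant column, like na.rm in the old program
  table <- concatTables(directory, id, hasHeader, keepNAs = TRUE)
  mean(table[[pollutant]], na.rm = TRUE)
}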