Troubleshooting R mapper script on Amazon Elastic MapReduce - Results not as expected

I am trying to use Amazon Elastic MapReduce to run a series of simulations of several million cases. This is an Rscript streaming job with no reducer; I am using the Identity Reducer in my EMR call (--reducer org.apache.hadoop.mapred.lib.IdentityReducer).
The script works fine when run locally from the command line on a Linux box, passing a single line of input manually with echo "1,2443,2442,1,5" | ./mapper.R; I get the one line of results I expect. However, when I tested my simulation with about 10,000 cases (lines) from the input file on EMR, I only got output for a dozen or so of the 10k input lines. I've tried several times and cannot figure out why. The Hadoop job completes without any errors. It seems as though input lines are being skipped, or perhaps something is happening in the Identity reducer. The results are correct for the cases that do produce output.
My input file is a csv with the following data format, a series of five integers separated by commas:
1,2443,2442,1,5
2,2743,4712,99,8
3,2443,861,282,3177
etc...
Here is my R script, mapper.R:
#! /usr/bin/env Rscript

# Define functions
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

# Function to read the relevant data from the needed data files
get.data <- function(casename) {
    list <- lapply(casename, function(x) {
        read.csv(file = paste("./inputdata/", x, ".csv", sep = ""),
                 header = TRUE,
                 stringsAsFactors = FALSE)})
    return(data.frame(list))
}

con <- file("stdin")
line <- readLines(con, n = 1, warn = FALSE)
line <- trimWhiteSpace(line)
values <- unlist(strsplit(line, ","))
lv <- length(values)
cases <- as.numeric(values[2:lv])
simid <- paste("sim", values[1], ":", sep = "")
l <- length(cases) # for indexing

## create a vector of the case names
names.vector <- paste("case", cases, sep = ".")

## read in metadata and the necessary data columns using get.data()
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)
d <- cbind(metadata[, 1:3], get.data(names.vector))

## Calculations that use the data frame d and produce a string called 'output'
## in the form "id: value1 value2 value3 ..." to be used at a
## later time for aggregation.
cat(output, "\n")
close(con)
The (generalized) EMR call for this simulation is:
ruby elastic-mapreduce --create --stream --input s3n://bucket/project/input.txt --output s3n://bucket/project/output --mapper s3n://bucket/project/mapper.R --reducer org.apache.hadoop.mapred.lib.IdentityReducer --cache-archive s3n://bucket/project/inputdata.tar.gz#inputdata --name Simulation --num-instances 2
If anyone has any insight into why I might be experiencing these issues, I am open to suggestions, as well as to any changes or optimizations to the R script.
My other option is to turn the script into a function and run a parallelized apply using R multicore packages, but I haven't tried it yet. I'd like to get this working on EMR. I used JD Long's and Pete Skomoroch's R/EMR examples as a basis for creating the script.

Nothing obvious jumps out. However, can you run the job on a simple input file of only 10 lines? Make sure these 10 lines are cases that did not produce output in your big test run. Try this to eliminate the possibility that your inputs are causing the R script to not produce an answer.
Debugging EMR jobs is a skill of its own.
EDIT:
This is a total fishing expedition, but fire up an EMR interactive Pig session using the AWS GUI. "Interactive Pig" sessions stay up and running so you can ssh into them. You could also do this from the command-line tools, but it's a little easier from the GUI since, hopefully, you only need to do this once. Then ssh into the cluster, transfer over your test-case infile, your cache files, and your mapper, and run this:
cat infile.txt | ./yourMapper.R > outfile.txt
This is just to test if your mapper can parse the infile in the EMR environment with no Hadoop bits in the way.
EDIT 2:
I'm leaving the above text there for posterity, but the real issue is that your script never goes back to stdin to pick up more data. Thus you get one run per mapper and then it ends. If you run the above one-liner, you will only get out one result, not a result for each line in infile.txt. If you had run the cat test even on your local machine, the error would have popped out!
So let's look at Pete's word count in R example:
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

## **** could do this with a single readLines or in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    ## **** can be done as cat(paste(words, "\t1\n", sep = ""), sep = "")
    for (w in words)
        cat(w, "\t1\n", sep = "")
}
close(con)
The piece your script is missing is this bit:
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    # do your dance
    # do your dance quick
    # come on everybody tell me what's the word
    # word up
}
You should, naturally, replace the lyrics of Cameo's Word Up! with your actual logic.
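For reference, here is a minimal sketch of how the question's mapper could be restructured around that loop; the simulation itself stays a placeholder, just as it was in the original script:
#! /usr/bin/env Rscript
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

## Static inputs such as metadata.csv can be read once, before the loop.
metadata <- read.csv(file = "./inputdata/metadata.csv", header = TRUE,
                     stringsAsFactors = FALSE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line   <- trimWhiteSpace(line)
    values <- unlist(strsplit(line, ","))
    cases  <- as.numeric(values[-1])
    simid  <- paste("sim", values[1], ":", sep = "")
    ## ... run the simulation for these cases and build the string 'output' ...
    ## cat(output, "\n")
}
close(con)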
Keep in mind that proper debugging music makes the process less painful:
http://www.youtube.com/watch?v=MZjAantupsA

Related

monitor changes to a file in R (similar to tail "follow")

I'm wondering if there's a way to monitor the contents of a file from within R, similar to the behavior of tail -f (details here) in the Linux terminal.
Specifically, I want a function to which you could pass a file path, and it would
print the last n lines of the file to the console
hold the console
continue printing any new lines, as they are added
There are outstanding questions like "what if previously printed lines in the file get modified?" and honestly I'm not sure how tail -f handles that, but I'm interested in streaming a log file to the console, so it's kind of beside the point for my current usage.
I was looking around in the ?readLines and ?file docs and I feel like I'm getting close, but I can't quite figure it out. Plus, I can't imagine I'm the first one to want to do this, so maybe there's an established best practice (or even an existing function). Any help is greatly appreciated.
Thanks!
I made progress on this using the processx package. I created an R script which I named fswatch.R:
library(processx)

monitor <- function(fpath = "test.csv", wait_monitor = 1000 * 60 * 2){
    system(paste0("touch ", fpath))

    print_last <- function(fpath){
        con <- file(fpath, "r", blocking = FALSE)
        lines <- readLines(con)
        print(lines[length(lines)])
        close(con)
    }

    if(file.exists(fpath)){
        print_last(fpath)
    }

    p <- process$new("fswatch", fpath, stdin = "|", stdout = "|", stderr = "|")

    while(
        # TRUE
        p$is_alive() &
        file.exists(fpath)
    ){
        p$poll_io(wait_monitor)
        p$read_output()
        print_last(fpath)
        # call poll_io twice otherwise endless loop :shrug:
        p$poll_io(wait_monitor)
        p$read_output()
    }

    p$kill()
}

monitor()
Then I ran the script as a "job" in RStudio.
Every time I wrote to test.csv the job printed the last line. I stopped monitoring by deleting the log file:
log_path <- "test.csv"
write.table("1", log_path, sep = ",", col.names = FALSE,
            append = TRUE, row.names = FALSE)
write.table("2", log_path, sep = ",", col.names = FALSE,
            append = TRUE, row.names = FALSE)
unlink(log_path)
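For comparison, here is a dependency-free sketch (not part of the original answer) that simply polls the file instead of relying on fswatch. The function name follow and its arguments are made up for illustration, and it rereads the whole file on each poll, so it only suits modest log files:
follow <- function(fpath, n = 10, poll_sec = 1) {
    seen <- 0L
    while (file.exists(fpath)) {
        lines <- readLines(fpath, warn = FALSE)
        if (length(lines) > seen) {
            new <- lines[(seen + 1L):length(lines)]
            if (seen == 0L) new <- tail(new, n)  # first pass: print only the last n lines
            cat(new, sep = "\n")
            seen <- length(lines)
        }
        Sys.sleep(poll_sec)
    }
}
## usage: follow("test.csv")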

How to simultaneously read and write a file line by line?

I would like to remove all lines from a file which start with a certain pattern, and I would like to do this with R. It is good practice not to read the whole file first, then remove all matching lines, and then write the whole file back, as the file can be huge. I am thus wondering if I can have both a read and a write connection (open all the time, one at a time?) to the same file. The following shows the idea (but 'hangs' and thus fails).
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()

## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
while(TRUE) {
    rcon <- file(fnm, "r") # read connection
    line <- readLines(rcon, n = 1) # read one line
    close(rcon)
    if(length(line) == 0) { # end of file
        break
    } else {
        if(!grepl(pat, line)) {
            wcon <- file(fnm, "w")
            writeLines(line, con = wcon)
            close(wcon)
        }
    }
}
Note:
1) See here for an answer if one writes to a new file. One could then delete the old file and rename the new one to the old one, but that does not seem very elegant :-).
2) Update: The following MWE produces
Hello
world
-
world
See:
## Create an example file
fnm <- "foo.txt" # file name
sink(fnm)
cat("Hello\n## ----\nworld\n")
sink()

## Read the file 'fnm' one line at a time and write it back to 'fnm'
## if it does *not* contain the pattern 'pat'
pat <- "## ----" # pattern
con <- file(fnm, "r+") # read and write connection
while(TRUE) {
    line <- readLines(con, n = 1L) # read one line
    if(length(line) == 0) break # end of file
    if(!grepl(pat, line))
        writeLines(line, con = con)
}
close(con)
close(con)
I think you just need open = 'r+'. From ?file:
Modes
"r+", "r+b" -- Open for reading and writing.
I don't have your sample file, so I'll instead just have the following minimal example:
take a file with a-z on 26 lines and replace them one by one with A-Z:
tmp = tempfile()
writeLines(letters, tmp)
f = file(tmp, 'r+')
while (TRUE) {
    l = readLines(f, n = 1L)
    if (!length(l)) break
    writeLines(LETTERS[match(l, letters)], f)
}
close(f)
readLines(tmp) afterwards confirms this worked.
I understand you want to use R, but just in case you're not aware, there are some really simple scripting tools that excel at this type of task. E.g. gawk is designed for pretty much exactly this type of operation and is simple enough to learn that you could write a script for this within minutes, even without any prior knowledge.
Here's a one-liner to do this in gawk (or awk if you are on Unix):
gawk -i inplace '!/^pat/ {print}' foo.txt
Of course, it is trivial to do this from within R using
system(paste0("gawk -i inplace '!/^", pat, "/ {print}' ", fnm))
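If the in-place r+ trick feels fragile, the write-to-a-new-file route mentioned in the question's first note can also be done line by line, holding only one line in memory at a time. A rough sketch (the helper name filter_file is made up here):
filter_file <- function(infile, pat) {
    tmp  <- tempfile()
    rcon <- file(infile, open = "r")
    wcon <- file(tmp, open = "w")
    while (length(line <- readLines(rcon, n = 1L, warn = FALSE)) > 0) {
        if (!grepl(pat, line)) writeLines(line, wcon)
    }
    close(rcon); close(wcon)
    ## file.rename() can fail across filesystems; file.copy() + unlink() is a safe fallback
    file.rename(tmp, infile)
}
## usage: filter_file("foo.txt", "## ----")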

Running R notebook from command line

OK, I'm trying to run this script through a batch file on Windows (Server 2016), but it just starts to push out line shifts and dots to the output screen:
"c:\Program Files\R\R-3.5.1\bin\rscript.exe" C:\projects\r\HentTsmReport.R
The script works like a charm in RStudio: it reads an HTML file (a TSM backup report), transforms the content into a data frame, and then saves one of the HTML tables as a CSV file.
Why do I just get a stream of junk on the screen instead of output to the CSV when running it through Rscript.exe?
My goal is to run this script as a scheduled task each day, to keep a history of the backup status in a table and keep track of failed backups through Tivoli.
This is the script in the R file:
library(XML)
library(RCurl)
#library(tidyverse)
library(rlist)
theurl <- getURL("file://\\\\chill\\um\\backupreport20181029.htm",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
head(tables)
test <- tables[5] # select table number 5
write.csv(test, file = "c:\\temp\\backupreport.csv")
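As a side note on the history-keeping goal, a date-stamped output file name (a small sketch; the folder is only an example) keeps each day's extract instead of overwriting the previous one:
out_path <- file.path("c:/temp", paste0("backupreport_", Sys.Date(), ".csv"))
write.csv(test, file = out_path, row.names = FALSE)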

Writing a loop and initializing various classes of objects

I am processing a large file: I read it in chunks, process each chunk, and save what I extract. Then, after rm(list=ls()) to clear memory (sometimes I have to use .rs.restartR() as well, but that is not of concern in this post), I run the same script after adding 1 to two numbers in my script.
This seemed like an opportunity to try writing a loop, but between trying to initialize all the objects that are used in the loop and the fact that I am not very good at writing loops, it got really confusing.
I posted this here to hear some suggestions; I apologize in advance if my question is too vague. Thanks.
####################### A:11
####################### B:12
# A: I change the multiplier each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836,
                  header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4,
                     concatenator = " ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## First use colSums(), which saves a numeric vector in `final_dfm_1`.
## tib is the desired object I will save under a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract: 'the freq of each token'
# B: Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq12.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below I run the same script, but change 11 to 12 in the fread() call and 12 to 13 in the saveRDS() call.
####################### A:12
####################### B:13
# A: I change the multiplier each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836,
                  header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4,
                     concatenator = " ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## Using colSums() gives a numeric vector `final_dfm_1`.
## tib is the desired object I will save under a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract: 'the freq of each token'
# B: Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq13.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below is a list of all the objects (thanks to this post) in my working environment, which are cleared from the working environment before running the same chunk with A+1 and B+1.
Object        Type         Size        Rows      Columns
dfm_1         dfmSparse    174708600   166836    1731410
bi_tkn_one    tokens       152494696   166836    NA
tib           tbl_df       148109248   1731410   2
final_dfm_1   numeric      148108544   1731410   NA
text_tbl      data.table   22485264    166836    1
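One way to produce such a listing yourself (a sketch; this is not the code from the linked post) is to combine ls() with object.size():
obj_sizes <- function(envir = globalenv()) {
    objs <- ls(envir = envir)
    info <- data.frame(
        Object = objs,
        Type   = vapply(objs, function(x) class(get(x, envir = envir))[1], character(1)),
        Size   = vapply(objs, function(x) as.numeric(object.size(get(x, envir = envir))), numeric(1))
    )
    info[order(-info$Size), ]
}
obj_sizes()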
I spent some time trying to figure out how to write this loop, found a post on SO about how to initialize a data.table with a character column, but there are still other objects that I think I need to initialize. I am unsure of how plausible it is to write such a loop.
I have copied and pasted the same script back-to-back as shown above and run it all at once. It's silly, since I am just adding one in two places.
Feel free to comment on my approach; I would like to learn something from this. Best.
On a side note: I read about adding .rs.restartR() to the loop, and came across posts that suggested using batch files or scheduled tasks in R; I will have to pass on learning those for now.
This was very simple: I didn't have to initialize any objects, which is what I had been trying to do. The only things I had to do were load the required packages upon starting R and run the loop.
ls()
character(0)
From an empty environment, just a simple loop.
library(data.table)
library(quanteda)
library(dplyr)

for (i in 4:19){
    # A: I change the multiplier each time here.
    text_tbl <- fread("tlm_s_words", skip = 166836*i, nrows = 166836,
                      header = FALSE, col.names = "text")
    bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 3,
                         concatenator = " ", verbose = TRUE)
    dfm_1 <- dfm(bi_tkn_one)
    ## Using colSums() gives a numeric vector `final_dfm_1`.
    ## tib is the desired object I will save under a new name each time.
    final_dfm_1 <- colSums(dfm_1)
    print(setNames(length(final_dfm_1), "no. N-grams in this batch"))
    # no. N-grams
    tib <- tbl_df(final_dfm_1) %>% add_rownames()
    # This is what I wanted to extract: 'the freq of each token'
    # B: Here I change the name `tib` is saved under each time.
    iplus = i + 1
    saveRDS(tib, file = paste0("titr", iplus, ".Rda"))
    rm(list=ls())
    Sys.sleep(10)
    gc()
    Sys.sleep(10)
}
Without initializing any data.table or other objects, the result of the above loop was 16 files saved in my working directory.
That makes me wonder: when do we need to initialize the vectors, matrices, and other objects that are used in a loop?
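Roughly, you only need to initialize an object before a loop when results must accumulate across iterations; in the loop above each pass writes its result to disk and discards everything, so there is nothing to carry over. A small sketch of the contrast:
## Each iteration stands alone: nothing to initialize, results go straight to disk.
for (i in 1:3) {
    x <- rnorm(10)
    saveRDS(mean(x), file = paste0("mean_", i, ".rds"))
}

## Results accumulate across iterations: preallocate the container first.
means <- numeric(3)
for (i in 1:3) {
    means[i] <- mean(rnorm(10))
}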

How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read in one go, given my RAM size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec as:
(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
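For concreteness, a small worked example of that calculation (the values in badRowNumbers are just illustrative):
badRowNumbers <- c(3, 4, 7)
skipVec <- (badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
skipVec
## [1] 2 0 2   (skip 2 lines then read row 3, skip 0 then read row 4, skip 2 then read row 7)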
But for the moment I am just handing my function a skipVec vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep =
"") : no lines available in input"
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3 GB of RAM. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF <- lapply(skipVec, FUN=function(pass){
    read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)
The error occurs before the close command. If I yank the read.table call out of the lapply and the function and just run it by itself, I still get the same error.
If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc
Because header = TRUE it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
    line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                       row.names = 1)
    if (pass) NULL else line
})
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
Some clues towards higher speeds:
use scan instead of read.table. Read data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
Instead of looping over skipVec, loop over its rle if it is much shorter; that way you can read or skip whole chunks of lines at a time (see the sketch below).
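A sketch of that second idea, staying close to the answer's read.table-on-a-connection pattern (read_bad_rows is a made-up helper name, and n_lines, the total number of data lines in the file, is assumed to be known):
read_bad_rows <- function(path, badRowNumbers, n_lines) {
    keep <- logical(n_lines)
    keep[badRowNumbers] <- TRUE
    runs <- rle(keep)                     # alternating runs of lines to skip or read
    con  <- file(path, open = "r")
    on.exit(close(con))
    header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
    chunks <- list()
    for (j in seq_along(runs$lengths)) {
        if (runs$values[j]) {
            chunks[[length(chunks) + 1L]] <-
                read.table(con, nrows = runs$lengths[j], header = FALSE,
                           sep = "", row.names = 1)
        } else {
            readLines(con, n = runs$lengths[j])  # discard a whole run of unwanted lines
        }
    }
    setNames(do.call(rbind, chunks), header)
}
## e.g. read_bad_rows("testThis.DF", badRowNumbers = c(1, 3), n_lines = 3)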
